Skip to main content
Dryad

Estimating accurate gene trees in the presence of intra-locus recombination: A simulation study

Cite this dataset

Ziegler, Kevin; Lemmon, Alan (2022). Estimating accurate gene trees in the presence of intra-locus recombination: A simulation study [Dataset]. Dryad. https://doi.org/10.5061/dryad.6q573n626

Abstract

Accurate gene trees are difficult to estimate with traditional methods due to the effects of recombination. New methods that co-estimate gene trees and recombination breakpoints function differently than the traditional maximum likelihood (ML) framework, and therefore have the potential to alleviate inaccuracies caused by recombination. However, the accuracy of gene trees produced by these methods has yet to be evaluated under a broad range of conditions. Using simulations, we studied gene tree accuracy in the presence of intra-locus recombination. Using a previously published model of human population history, we simulate the process of recombination along large sections of a genome to produce DNA sequence alignments. We varied three parameters that influence gene tree accuracy: recombination rate, population size, and substitution rate. We then compare the accuracy of gene trees estimated from different methodologies, including traditional maximum likelihood estimation of single and concatenated regions, as well as more sophisticated co-estimation methods. Unsurprisingly, we found that traditional approaches can only produce accurate gene trees in narrow regions of parameter space; as the number of sites used to estimate a gene tree increases, recombination becomes more and more problematic. Some, but not all, of the co-estimation methods successfully circumvent this tradeoff and have the potential to produce accurate gene trees in broader regions of parameter space. These results indicate that by adopting co-estimation methods, systematists may be able to improve gene tree accuracy.

Methods

A brief overview is given here; the methods section in the paper goes into more detail.

This is a simulation study of genetic data. Phylogenetic trees were simulated for regions of a certain size, recombination rate, and population size using the program Msprime. Next DNA sequence data were simulated using the program Seq-gen (11 different substitution rates). 

Four programs use these simulated alignments to estimate gene trees: FastTree, Tsinfer, Relate, and Rent+. The trees estimated from these programs were then compared back to the corresponding simulated trees using Robinson-Foulds distance (using R packages ape and phangorn). The resulting RF distances are shown in heat maps to visualize how programs perform in different areas of parameter space.

Scripts to rerun simulations are included along with supplemental materials. Only simulated gene trees and their phylip alignments are included because the size of all input, output, and intermediate files totals around 1.5 terabytes. Upon request, we can submit specific data files such as the output tree files from Tsinfer, Rent+, and Relate.

Usage notes

Data:

There are four compressed tar folders in this submission. Each folder contains a dataset.

Datasets are divided into three separate folders Newick files of simulated trees, phylip files of simulated alignments, and tskit object files (output of msprime for simulated trees)

 

The most used dataset is Oct_16_2020_100k_Compressed.tar.gz

It represents simulated regions totaling in length 100,000 base pairs

 

For only ARG methods we tested simulated regions of less than 100,000 namely a 10,000 base pair dataset:

Oct_16_2020_10k_Compressed.tar.gz

and a 1,000 base pair dataset:

Oct_16_2020_1k_Compressed.tar.gz

 

For only the fixed length approach (312,625,1250,2500,5000,10000) using FastTree we used a different dataset of total length 100,000

This dataset was subsampled for these fixed lengths. This dataset was not used to represent the fixed length points in the MDS plots, only in the heat maps in figure 3.

June_2_2020_100k_Compressed.tar.gz

Supplemental:

Supplemental Materials.doc contains supplemental Figures

Script Overview:

Code used for simulation, tree estimation, and RF comparison are provided.

Code was designed to work on another computer, but this has not been tested. As a result minor errors might need fixing.

Requirements:

Languages used are mainly python, bash for pipelines, and R for RF comparison.

Bash shell scripts will only work in a Linux environment with the parallel command installed.

R packages Required (plus dependencies): ape phangorn

Python3 packages: msprime tskit tsinfer

Rent+, Relate, and Seq-gen must be installed separately, and their paths must be specified in the scripts.

 

Script Files:

There are two bash scripts which are pipelines for Co-Estimation methods/Data generation and Maximum Likelihood methods. There is also a general plotting script which needs minor modifications to run on any given file.

 

1.)

Simulated Data Generation and Co-Estimation methods Can be run with:

./ARG_DATA_PROGRAMS_CSV_Pipe.sh

 

The script contains variables which need to be set: paths to programs (Seq-gen, Relate, RentPlus), directories (code, data), and simulation parameters.

The pipeline runs subpipelines dealing with, Data generation, Relate, Tsinfer, RentPlus, and then organizing the comparison to simulated trees.

A CSV file representing the heatmap results will be output in the Results/ directory.

 

The specific scripts and their purpose are listed below:

 

PipeLine_All.sh

-Subpipeline used to generate simulated data

testmsprime.py

-Uses the python package msprime to simulate gene trees for various parameters for a certain number of parameters

SeqGenAncestralNewick.py

-Creates Newick files with random ancestral sequence for all simulated replicates

createCommandLineToRunSeqGen.py

-Creates a bash file with all commands to run seq on each replicate

AddAncestralToPhylip.py

-Adds the ancestral sequence to the simulated output alignments of Seq gen.

RelatePipeline.sh

-Subpipeline used to run the Co-estimation program Relate

CreateRelateCmdLine.py

-Create bash script with commands to run Relate for all simulated replicates

Relate_Run.py

-Runs Relate on all simulated replicates

RelateRF.py

-Collect Relate outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees

RelateConverter_RF2.py

-calculate RF distance between simulated and estimated trees. Outputs results to a file

TsinferPipeLine.sh

-Subpipeline used to run the Co-estimation program Tsinfer

CreateTsinferCmdLine.py

-Create bash script with commands to run Tsinfer for all simulated replicates

TsinferConverter.py

-Runs Tsinfer on all simulated replicates

TsinferRF.py

-Collect Tsinfer outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees

TsinferConverter_RF2.py

-calculate RF distance between simulated and estimated trees. Outputs results to a file

RentPlusPipeLine.sh

-Subpipeline used to run the Co-estimation program Rent+

CreateRentPlusCmdLine.py

-Create bash script with commands to run Rent+ for all simulated replicates

RentPlusConverterKZ.py

-Runs Rent+ on all simulated replicates

RentPlusSolutionPipeLine.sh

-Subpipeline used to calculate RF distance for Rent+

KendallColijnPrep.py

-Create subfiles setting up calculation of simulated and estimated trees for Rent+

RentPlus_RF.r

-calculate RF distance between simulated and estimated trees. Outputs results to a file

CreateResultsRentPlus.py

-collect result of RF distances and output into a structured file for each replicate.

ResultsToCSV.py

-Collects output of RF distances for all replicates and converts the data into a single CSV file

 

2.)

Maximum Likelihood Tree Estimation can be run with:

./FastTreeScript.sh

 

The script contains variables which need to be set: paths to programs (FastTree), directories (code, data), and simulation parameters.

A CSV file representing the heatmap results will be output in the Results/ directory.

 

The specific scripts and their purpose are listed below:

FastTreeWholeEstTrue.py

-splits simulated alignments in three ways: simulated c-gens, estimated c-gens, and the whole alignment (fixed length 100,000). Runs FastTree on the subsampled alignments and calculates the RF distance between estimated and simulated trees using Rescale_RF_File.r

Rescale_RF_File.r

-Takes in two newickfiles and calculates RF distance between them

FastTreeResultsWTE.py

-Collects the RF results file names and put the names in one file

ResultsToCSV_WTE.py

-Converts the RF results files into a CSV file for plotting

FastTreeScriptCmdLine.py

-Creates a bash script that runs all other Fixed Lengths on simulated data (312,625,1250,2500,5000,10000) using FastTreeScript.py

FastTreeScript.py

-Runs FastTree on a given interval Fixed Length intervals

IqtreeProcessFT_InputFile_RF.py

-Calculates RF distances for Fixed Length estimated trees

RF_File.r

-Takes in two newickfiles and calculates RF distance between them

ResultsToCSV2.py

-Converts the RF results files into a CSV file for plotting

 

3.)

A general Plotting script:

PlotHeatMap.py

 

Line 21 contains the lstFiles variable, which contains the CSV files to plot.