Estimating accurate gene trees in the presence of intra-locus recombination: A simulation study
Data files
Nov 18, 2022 version files 66.37 GB
-
AddAncestralToPhylip.py
-
ARG_DATA_PROGRAMS_CSV_Pipe.sh
-
createCommandLineToRunSeqGen.py
-
CreateRelateCmdLine.py
-
CreateRentPlusCmdLine.py
-
CreateResultsRentPlus.py
-
CreateTsinferCmdLine.py
-
FastTreeResultsWTE.py
-
FastTreeScript.py
-
FastTreeScript.sh
-
FastTreeScriptCmdLine.py
-
FastTreeWholeEstTrue.py
-
IqtreeProcessFT_InputFile_RF.py
-
June_2_2020_100k_Compressed.tar.gz
-
KendallColijnPrep.py
-
Oct_16_2020_100k_Compressed.tar.gz
-
Oct_16_2020_10k_Compressed.tar.gz
-
Oct_16_2020_1k_Compressed.tar.gz
-
PipeLine_All.sh
-
PlotHeatMap.py
-
README_DATA.md
-
README.md
-
Relate_Run.py
-
RelateConverter_RF2.py
-
RelatePipeline.sh
-
RelateRF.py
-
RentPlus_RF.r
-
RentPlusConverterKZ.py
-
RentPlusPipeLine.sh
-
RentPlusSolutionPipeLine.sh
-
Rescale_RF_File.r
-
ResultsToCSV_RentPlus.py
-
ResultsToCSV_WTE.py
-
ResultsToCSV.py
-
ResultsToCSV2.py
-
RF_File.r
-
SampleHeatMap.pdf
-
SeqGenAncestralNewick.py
-
Supplemental_Materials.docx
-
test.pdf
-
testmsprime.py
Abstract
Accurate gene trees are difficult to estimate with traditional methods due to the effects of recombination. New methods that co-estimate gene trees and recombination breakpoints function differently than the traditional maximum likelihood (ML) framework, and therefore have the potential to alleviate inaccuracies caused by recombination. However, the accuracy of gene trees produced by these methods has yet to be evaluated under a broad range of conditions. Using simulations, we studied gene tree accuracy in the presence of intra-locus recombination. Using a previously published model of human population history, we simulate the process of recombination along large sections of a genome to produce DNA sequence alignments. We varied three parameters that influence gene tree accuracy: recombination rate, population size, and substitution rate. We then compare the accuracy of gene trees estimated from different methodologies, including traditional maximum likelihood estimation of single and concatenated regions, as well as more sophisticated co-estimation methods. Unsurprisingly, we found that traditional approaches can only produce accurate gene trees in narrow regions of parameter space; as the number of sites used to estimate a gene tree increases, recombination becomes more and more problematic. Some, but not all, of the co-estimation methods successfully circumvent this tradeoff and have the potential to produce accurate gene trees in broader regions of parameter space. These results indicate that by adopting co-estimation methods, systematists may be able to improve gene tree accuracy.
Methods
A brief overview is given here; the methods section in the paper goes into more detail.
This is a simulation study of genetic data. Phylogenetic trees were simulated for regions of a certain size, recombination rate, and population size using the program Msprime. Next DNA sequence data were simulated using the program Seq-gen (11 different substitution rates).
Four programs use these simulated alignments to estimate gene trees: FastTree, Tsinfer, Relate, and Rent+. The trees estimated from these programs were then compared back to the corresponding simulated trees using Robinson-Foulds distance (using R packages ape and phangorn). The resulting RF distances are shown in heat maps to visualize how programs perform in different areas of parameter space.
Scripts to rerun simulations are included along with supplemental materials. Only simulated gene trees and their phylip alignments are included because the size of all input, output, and intermediate files totals around 1.5 terabytes. Upon request, we can submit specific data files such as the output tree files from Tsinfer, Rent+, and Relate.
Usage notes
Data:
There are four compressed tar folders in this submission. Each folder contains a dataset.
Datasets are divided into three separate folders Newick files of simulated trees, phylip files of simulated alignments, and tskit object files (output of msprime for simulated trees)
The most used dataset is Oct_16_2020_100k_Compressed.tar.gz
It represents simulated regions totaling in length 100,000 base pairs
For only ARG methods we tested simulated regions of less than 100,000 namely a 10,000 base pair dataset:
Oct_16_2020_10k_Compressed.tar.gz
and a 1,000 base pair dataset:
Oct_16_2020_1k_Compressed.tar.gz
For only the fixed length approach (312,625,1250,2500,5000,10000) using FastTree we used a different dataset of total length 100,000
This dataset was subsampled for these fixed lengths. This dataset was not used to represent the fixed length points in the MDS plots, only in the heat maps in figure 3.
June_2_2020_100k_Compressed.tar.gz
Supplemental:
Supplemental Materials.doc contains supplemental Figures
Script Overview:
Code used for simulation, tree estimation, and RF comparison are provided.
Code was designed to work on another computer, but this has not been tested. As a result minor errors might need fixing.
Requirements:
Languages used are mainly python, bash for pipelines, and R for RF comparison.
Bash shell scripts will only work in a Linux environment with the parallel command installed.
R packages Required (plus dependencies): ape phangorn
Python3 packages: msprime tskit tsinfer
Rent+, Relate, and Seq-gen must be installed separately, and their paths must be specified in the scripts.
Script Files:
There are two bash scripts which are pipelines for Co-Estimation methods/Data generation and Maximum Likelihood methods. There is also a general plotting script which needs minor modifications to run on any given file.
1.)
Simulated Data Generation and Co-Estimation methods Can be run with:
./ARG_DATA_PROGRAMS_CSV_Pipe.sh
The script contains variables which need to be set: paths to programs (Seq-gen, Relate, RentPlus), directories (code, data), and simulation parameters.
The pipeline runs subpipelines dealing with, Data generation, Relate, Tsinfer, RentPlus, and then organizing the comparison to simulated trees.
A CSV file representing the heatmap results will be output in the Results/ directory.
The specific scripts and their purpose are listed below:
PipeLine_All.sh
-Subpipeline used to generate simulated data
testmsprime.py
-Uses the python package msprime to simulate gene trees for various parameters for a certain number of parameters
SeqGenAncestralNewick.py
-Creates Newick files with random ancestral sequence for all simulated replicates
createCommandLineToRunSeqGen.py
-Creates a bash file with all commands to run seq on each replicate
AddAncestralToPhylip.py
-Adds the ancestral sequence to the simulated output alignments of Seq gen.
RelatePipeline.sh
-Subpipeline used to run the Co-estimation program Relate
CreateRelateCmdLine.py
-Create bash script with commands to run Relate for all simulated replicates
Relate_Run.py
-Runs Relate on all simulated replicates
RelateRF.py
-Collect Relate outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees
RelateConverter_RF2.py
-calculate RF distance between simulated and estimated trees. Outputs results to a file
TsinferPipeLine.sh
-Subpipeline used to run the Co-estimation program Tsinfer
CreateTsinferCmdLine.py
-Create bash script with commands to run Tsinfer for all simulated replicates
TsinferConverter.py
-Runs Tsinfer on all simulated replicates
TsinferRF.py
-Collect Tsinfer outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees
TsinferConverter_RF2.py
-calculate RF distance between simulated and estimated trees. Outputs results to a file
RentPlusPipeLine.sh
-Subpipeline used to run the Co-estimation program Rent+
CreateRentPlusCmdLine.py
-Create bash script with commands to run Rent+ for all simulated replicates
RentPlusConverterKZ.py
-Runs Rent+ on all simulated replicates
RentPlusSolutionPipeLine.sh
-Subpipeline used to calculate RF distance for Rent+
KendallColijnPrep.py
-Create subfiles setting up calculation of simulated and estimated trees for Rent+
RentPlus_RF.r
-calculate RF distance between simulated and estimated trees. Outputs results to a file
CreateResultsRentPlus.py
-collect result of RF distances and output into a structured file for each replicate.
ResultsToCSV.py
-Collects output of RF distances for all replicates and converts the data into a single CSV file
2.)
Maximum Likelihood Tree Estimation can be run with:
./FastTreeScript.sh
The script contains variables which need to be set: paths to programs (FastTree), directories (code, data), and simulation parameters.
A CSV file representing the heatmap results will be output in the Results/ directory.
The specific scripts and their purpose are listed below:
FastTreeWholeEstTrue.py
-splits simulated alignments in three ways: simulated c-gens, estimated c-gens, and the whole alignment (fixed length 100,000). Runs FastTree on the subsampled alignments and calculates the RF distance between estimated and simulated trees using Rescale_RF_File.r
Rescale_RF_File.r
-Takes in two newickfiles and calculates RF distance between them
FastTreeResultsWTE.py
-Collects the RF results file names and put the names in one file
ResultsToCSV_WTE.py
-Converts the RF results files into a CSV file for plotting
FastTreeScriptCmdLine.py
-Creates a bash script that runs all other Fixed Lengths on simulated data (312,625,1250,2500,5000,10000) using FastTreeScript.py
FastTreeScript.py
-Runs FastTree on a given interval Fixed Length intervals
IqtreeProcessFT_InputFile_RF.py
-Calculates RF distances for Fixed Length estimated trees
RF_File.r
-Takes in two newickfiles and calculates RF distance between them
ResultsToCSV2.py
-Converts the RF results files into a CSV file for plotting
3.)
A general Plotting script:
PlotHeatMap.py
Line 21 contains the lstFiles variable, which contains the CSV files to plot.