Data from: Phylogenomic branch length estimation using quartets
Data files
Dec 03, 2025 version files 15.10 GB
- ASTRALIII_SU_results.tar.xz (9.29 GB)
- ASTRALIII_SU.tar.gz (52.02 MB)
- biological-mammalian.tar.gz (27.19 MB)
- MVRoot_results.tar.xz (2.31 GB)
- MVRoot_SU.tar.gz (106.42 MB)
- quartet_simulations.zip (2.25 GB)
- quartets_results.tar.gz (1.06 GB)
- README.md (10.50 KB)
Abstract
Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy.
This repository contains the datasets used in the following paper:
Y. Tabatabaee, C. Zhang, T. Warnow, S. Mirarab, Phylogenomic branch length estimation using quartets, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i185–i193, https://doi.org/10.1093/bioinformatics/btad221
In this study, we analyzed a collection of simulated and biological datasets with incomplete lineage sorting (ILS). We generated a new quartet dataset and regenerated species trees with substitution-unit (SU) branch lengths for previously published datasets from Zhang et al. (2018) and Mai et al. (2017). We also analyzed the mammalian biological dataset from Song et al. (2012).
We provide a description of the relevant files included in each dataset below. Note that some log and intermediate files generated during the analyses are also included with the datasets for completeness, but are not listed here.
Simulated datasets
Quartet simulations
This dataset has six model conditions that differ in the level of ILS (controlled by population size) and in the rate heterogeneity multipliers, each with 200 replicates. The model conditions start from a strict molecular clock with no rate variation (i.e., Homogeneous, no_variation) and become successively more complex. Next, we add rate variation across species tree branches only (-hs option, only_hs), creating a model (Sp) akin to the MSC + Substitution model mentioned in the paper. We then create models that have rate variation only across genes but not species (Loc using -hl, only_hl) and both across species and across genes (Sp, Loc using -hs -hl, hs_hl). Finally, we add rate variation specific to each branch of each gene tree (Sp, Loc, Sp/Loc: -hs -hl -hg, hs_hl_hg), which creates heterotachy. The final condition (hs_hl_hg_highr) has similar rate variations to hs_hl_hg but with a smaller population size and therefore a higher level of ILS. The raw dataset is available in quartet_simulations.zip and includes species trees, true gene trees, and other SimPhy outputs. Results and intermediate data from the experiments in the paper are available in quartets_results.tar.gz.
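The archives in this repository come in three formats (.zip, .tar.gz, and .tar.xz), all of which Python's standard library can unpack. The following is a minimal sketch of one way to do it; the function name `extract_archive` and the destination directory are illustrative choices, not part of the dataset.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive, dest):
    """Unpack a .zip, .tar.gz, or .tar.xz archive into dest and return dest."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if archive.name.endswith(".zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif archive.name.endswith((".tar.gz", ".tar.xz")):
        # Mode "r:*" lets tarfile detect gzip or xz compression on its own.
        with tarfile.open(archive, "r:*") as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"unsupported archive type: {archive.name}")
    return dest


# Example call (the "data" directory is a hypothetical location):
# extract_archive("quartet_simulations.zip", "data/quartet_simulations")
```

Note that the larger archives (for example ASTRALIII_SU_results.tar.xz at 9.29 GB compressed) will need considerably more disk space once unpacked.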
Below is a description of relevant files in each directory.
- s_tree.trees: true species tree in substitution units
- truegenetrees: true gene trees in substitution units
- ultrametric-genetrees.tre: ultrametric gene trees
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- l_trees.trees: locus trees in generation units
- ad.txt: average RF distance between the model species tree and true gene trees
- castles_truegenetrees_s_tree.trees: true species tree furnished with CASTLES SU branch lengths
- patristic_[MODE]_truegenetrees.mat: patristic distance matrix calculated using minimum (min), average (avg), or all (all) pairwise distances
- erable_patristic_all_truegenetrees_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths
- fastme_BalLS_patristic_avg_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with average distances
- fastme_BalLS_patristic_min_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum distances
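Several of the files above hold the same species tree topology with branch lengths from different sources (true, CASTLES, ERaBLE, FastME). A quick sanity check is to compare their total tree lengths. The sketch below pulls branch lengths out of a Newick string with a regular expression. It assumes the tree files are plain Newick without quoted labels that contain colons, which has not been verified for every file here; a dedicated phylogenetics library would be more robust. The function names are illustrative.

```python
import re

# A branch length is a colon followed by an optionally signed decimal or
# scientific-notation number. Negative values are allowed because
# distance-based methods can produce them.
_BRANCH_LENGTH = re.compile(r":\s*([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def branch_lengths(newick):
    """Return all branch lengths in a Newick string, in order of appearance."""
    return [float(x) for x in _BRANCH_LENGTH.findall(newick)]


def total_tree_length(newick):
    """Return the sum of all branch lengths in a Newick string."""
    return sum(branch_lengths(newick))


# Example on a small hand-written tree (not taken from the dataset):
example = "((A:0.1,B:0.2):0.05,C:1e-2);"
lengths = branch_lengths(example)
```

Applying `total_tree_length` to the contents of s_tree.trees and castles_truegenetrees_s_tree.trees from the same replicate gives two numbers that can be compared directly, since both are in substitution units.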
30-taxon MVRoot ILS simulations
This dataset has six model conditions with varying levels of deviation from the molecular clock and inclusion of an outgroup, each with 100 replicates. The model conditions are specified as outgroup.[has-OG].species.[DEV].genes.[DEV], where [has-OG] is 1 when the dataset has an outgroup and 0 otherwise, and [DEV] is the level of deviation from the clock (parameter α of the gamma distribution), set to 5 (low), 1.5 (medium), or 0.15 (high). The original dataset is from Mai et al. (2017) and is available at https://datadryad.org/dataset/doi:10.6076/D1RW2G. Species trees with SU branch lengths are available in MVRoot_SU.tar.gz, and results and intermediate data from the experiments in the paper are available in MVRoot_results.tar.xz.
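When scripting over the model conditions, it can help to decode the directory name into its parts. The sketch below parses the outgroup.[has-OG].species.[DEV].genes.[DEV] pattern described above. It assumes the [DEV] values appear in the names exactly as written in the text (5, 1.5, 0.15); the names `ModelCondition`, `parse_condition`, and `DEVIATION_LEVEL` are illustrative and not part of the dataset.

```python
import re
from typing import NamedTuple


class ModelCondition(NamedTuple):
    has_outgroup: bool
    species_alpha: float  # gamma parameter for rate variation across species
    genes_alpha: float    # gamma parameter for rate variation across genes


# Deviation levels as described in the text: smaller alpha, larger deviation.
DEVIATION_LEVEL = {5.0: "low", 1.5: "medium", 0.15: "high"}

# The [DEV] fields may themselves contain a dot (e.g. 1.5), so splitting the
# name on "." is not enough; the pattern matches each number explicitly.
_PATTERN = re.compile(
    r"^outgroup\.([01])\.species\.(\d+(?:\.\d+)?)\.genes\.(\d+(?:\.\d+)?)$"
)


def parse_condition(name):
    """Decode a model condition name into a ModelCondition tuple."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized model condition: {name!r}")
    return ModelCondition(
        has_outgroup=match.group(1) == "1",
        species_alpha=float(match.group(2)),
        genes_alpha=float(match.group(3)),
    )


# Hypothetical directory name following the documented pattern:
cond = parse_condition("outgroup.0.species.1.5.genes.1.5")
level = DEVIATION_LEVEL[cond.species_alpha]
```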
Below is a description of relevant files in each directory.
- s_tree.trees: true species tree in substitution units
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- truegenetrees: true gene trees in substitution units
- estimatedgenetre.gtr: estimated gene trees
- ad.txt: average RF distance between the model species tree and true gene trees
- gtee_gtr.txt: average RF distance between true and estimated gene trees
- all-genes.phylip and concat_align.fasta: concatenation of all gene alignments
- castles_truegenetrees_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on true gene trees
- castles_estimatedgenetre.gtr_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on estimated gene trees
- erable_patristic_all_truegenetrees_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on true gene trees
- erable_patristic_all_estimatedgenetre.gtr_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on estimated gene trees
- fastme_BalLS_patristic_[MODE]_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on true gene trees
- fastme_BalLS_patristic_[MODE]_estimatedgenetre.gtr_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on estimated gene trees
- patristic_[MODE]_truegenetrees.mat: patristic distance matrix calculated using minimum (min), average (avg), or all (all) pairwise distances for true gene trees
- patristic_[MODE]_estimatedgenetre.gtr.mat: patristic distance matrix calculated using minimum (min), average (avg), or all (all) pairwise distances for estimated gene trees
- RAxML_result.concat_align_s_tree.trees: true species tree furnished with RAxML SU branch lengths
101-taxon ASTRALIII ILS simulations
This dataset has four model conditions with varying sequence lengths (1600bp, 800bp, 400bp, 200bp) corresponding to different levels of gene tree estimation error (23%, 31%, 42%, and 55%, respectively), each with 50 replicates. The original dataset is from Zhang et al. (2018) and is available at https://gitlab.com/esayyari/ASTRALIII/. Species trees with SU branch lengths are available in ASTRALIII_SU.tar.gz. Results and intermediate data from the experiments in the paper are in ASTRALIII_SU_results.tar.xz.
Below is a description of relevant files in each directory.
- s_tree.trees: true species tree in substitution units
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- truegenetrees: true gene trees in substitution units
- fasttree_genetrees_[seq-len]_non: gene trees estimated using FastTree2 from alignments with length [seq-len]bp
- erable_patristic_all_fasttree_genetrees_[seq-len]_non_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on gene trees estimated from alignments with length [seq-len]bp
- patristic_[MODE]_fasttree_genetrees_[seq-len]_non.mat: patristic distance matrix calculated using minimum (min), average (avg), or all (all) pairwise distances for gene trees estimated from alignments with length [seq-len]bp
- concat_for_fasttree_[seq-len].fasta or all-genes_for_fasttree.phylip: concatenation of all gene alignments with length [seq-len]bp
- RAxML_result.concat_for_fasttree_[seq-len]_s_tree.trees: true species tree furnished with RAxML SU branch lengths from concatenation of alignments with length [seq-len]bp
- castles_fasttree_genetrees_[seq-len]_non_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on gene trees estimated from alignments with length [seq-len]bp
- fastme_BalLS_patristic_[MODE]_fasttree_genetrees_[seq-len]_non_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on gene trees estimated from alignments with length [seq-len]bp
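The file names above are templates with [MODE] and [seq-len] placeholders. The sketch below expands such a template into concrete names. It assumes [seq-len] is replaced by the bare number (for example 1600, without the bp suffix), which should be checked against the unpacked archive; the helper name `expand` and the constant names are illustrative. The sequence lengths and gene tree estimation error values are those stated in the dataset description.

```python
from itertools import product

SEQ_LENGTHS = (1600, 800, 400, 200)
# Gene tree estimation error reported for each sequence length.
GTEE = {1600: 0.23, 800: 0.31, 400: 0.42, 200: 0.55}
# Patristic matrices exist for all three modes; FastME trees only for min and avg.
PATRISTIC_MODES = ("min", "avg", "all")
FASTME_MODES = ("min", "avg")


def expand(template, choices):
    """Fill every [KEY] placeholder with each combination of values in choices.

    choices maps a placeholder name (without brackets) to a sequence of values.
    """
    keys = list(choices)
    names = []
    for combo in product(*(choices[key] for key in keys)):
        name = template
        for key, value in zip(keys, combo):
            name = name.replace(f"[{key}]", str(value))
        names.append(name)
    return names


matrices = expand(
    "patristic_[MODE]_fasttree_genetrees_[seq-len]_non.mat",
    {"MODE": PATRISTIC_MODES, "seq-len": SEQ_LENGTHS},
)
```

With three modes and four sequence lengths, `matrices` holds twelve candidate file names per replicate.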
Biological dataset
The preprocessed mammalian dataset (in which genes with mismatching names are removed) is available at https://datadryad.org/dataset/doi:10.5061/dryad.ht76hdrp0 and includes an estimated ASTRAL species tree, gene trees, and alignments. The files generated for our analysis are available in biological-mammalian.tar.gz.
Below is a description of relevant files in each directory.
- genetrees.tre: gene trees after filtering
- concat-align.fasta: concatenated alignment of all filtered genes
- astralv5.7.8.tre: main species tree with CU branch lengths estimated using ASTRAL
- RAxML_log.no_GAL.tre: ASTRAL species tree furnished with RAxML SU branch lengths after removing the outgroup Chicken (GAL)
- castles_rooted.tre: ASTRAL species tree furnished with CASTLES SU branch lengths after removing the outgroup Chicken (GAL)
- fastme_[MODE].tre: ASTRAL species tree furnished with FastME SU branch lengths with minimum or average distances
- erable.tre.length.nwk: ASTRAL species tree furnished with ERaBLE SU branch lengths
- patristic_[MODE].mat: patristic distance matrix calculated using minimum (min), average (avg), or all (all) pairwise distances for gene trees
- mammals-namemap.txt: name mapping for taxa
- mammalian-trees.nexus: FigTree nexus file comparing CASTLES and concatenation branch lengths
- mammalian.pdf: figure comparing CASTLES and concatenation trees
