Data from: Coalescent-based branch length estimation improves dating of species trees
Data files
Apr 03, 2026 version; files total 14.19 GB:
- avian-stiller.tar.gz (2.67 MB)
- large.tar.gz (8.28 GB)
- README.md (11.83 KB)
- S101.tar.gz (985.83 MB)
- S30.tar.gz (4.92 GB)
- suboscines-harvey.tar.gz (1.34 MB)
Abstract
Species trees need to be dated for many downstream applications. Typical molecular dating methods take a phylogenetic tree with branch lengths in substitution units, as well as a set of calibrations, as input and convert the branch lengths of the species tree to the unit of time, while being consistent with the pre-specified calibrations. When dating species trees from multi-locus genome-scale datasets, the branch lengths and sometimes the topology of the species tree are estimated using concatenation. However, concatenation does not address gene tree heterogeneity across the genome. While Bayesian dating methods can address some forms of gene tree heterogeneity, such as incomplete lineage sorting, they are not scalable to large datasets. In this paper, we introduce a new scalable pipeline for dating species trees that addresses gene tree discordance for both topology and branch length estimation. The pipeline uses discordance-aware methods that account for incomplete lineage sorting for estimating the topology and branch lengths, and maximum likelihood-based methods for the dating step. Our simulation study on datasets with gene tree discordance shows that this pipeline produces more accurate and less biased dates than pipelines that use concatenation. Furthermore, it is substantially more scalable and can handle datasets with thousands of species and genes. Our results on two biological datasets demonstrate that this new pipeline improves the inference of node ages and branch lengths for certain nodes, particularly those closer to the tree tips, and improves the downstream diversification analysis.
Coalescent-based dating datasets
This repository contains the datasets and scripts used in the following manuscript:
- Y. Tabatabaee, S. Claramunt, S. Mirarab (2026). Coalescent-based branch length estimation improves dating of species trees. Systematic Biology (in press). Preprint available at https://www.biorxiv.org/content/10.1101/2025.02.25.640207v1.abstract
For the experiments in this study, we generated three sets of simulated datasets with gene tree discordance due to incomplete lineage sorting (ILS) and analyzed two avian biological datasets from Harvey et al. (2020) and Stiller et al. (2024). The simulated datasets have model species trees with substitution-unit, generation-unit, and time-unit branch lengths.
Simulated datasets
30-taxon dataset (S30.tar.gz)
This dataset has six model conditions with varying deviation from the molecular clock and inclusion of an outgroup, each with 100 replicates. The model conditions are specified as outgroup.[has-OG].species.[DEV].genes.[DEV], where [has-OG] is 1 when the dataset has an outgroup and 0 otherwise, and [DEV] is the level of deviation from the clock (parameter α of the gamma distribution), set to 5 (low), 1.5 (medium), or 0.15 (high). The original dataset is from Mai et al. (2017) and is available at https://uym2.github.io/MinVar-Rooting/. Below is a description of files in each directory.
- truegenetrees: true gene trees
- estimatedgenetre.gtr: gene trees estimated under GTR evolution model
- s_tree.trees: true species tree in substitution units
- s_tree_gu.trees: true species tree in generation units
- s_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 years
- s_tree_unit_ultrametric.trees: unit ultrametric true species tree
- s_tree_tu_5_calibrations_n[num-calib]_[root_unfixed].txt: calibration information (given in the format of (node, time) pairs)
- s_tree_tu_5_calib_n[num-calib]_[root_unfixed]_mcmctree.ctl: control file for MCMCTree
- s_tree_tu_5_calib_n[num-calib]_[root_unfixed]_mcmctree.date.nwk[.normalized]: tree dated with MCMCTree with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree.
- RAxML_result.concat_align_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengths
- castlespro_estimatedgenetre.gtr_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengths
- [dating-method]_n[num-calib]_[root_unfixed]_RAxML_result.concat_align_s_tree.trees.rooted.labeled.[normalized]: RAxML SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a .date.nwk extension.
- [dating-method]_n[num-calib]_[root_unfixed]_castlespro_estimatedgenetre.gtr_s_tree.trees.rooted.labeled.[normalized]: CASTLES-Pro SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a .date.nwk extension.
- ad.txt: average RF distance between the model species tree and true gene trees
- gtee_gtr.txt: average RF distance between true and estimated gene trees
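The model condition naming scheme described above (outgroup.[has-OG].species.[DEV].genes.[DEV]) can be decoded programmatically when iterating over the extracted archive. The following is a minimal Python sketch, not part of the original scripts. The names `ModelCondition`, `parse_model_condition`, and `deviation_label` are hypothetical helpers introduced for illustration, and the sketch assumes that the [DEV] values appear in directory names as plain decimal numbers (for example 1.5 or 0.15).

```python
import re
from typing import NamedTuple

# Clock-deviation levels as described in the text (gamma parameter alpha -> label).
DEVIATION_LABELS = {5.0: "low", 1.5: "medium", 0.15: "high"}


class ModelCondition(NamedTuple):
    """Hypothetical container for one S30 model condition."""
    has_outgroup: bool
    species_alpha: float
    genes_alpha: float


# Assumption: [DEV] is written as a plain decimal number in the directory name.
_PATTERN = re.compile(r"^outgroup\.([01])\.species\.([0-9.]+)\.genes\.([0-9.]+)$")


def parse_model_condition(name: str) -> ModelCondition:
    """Split a name like 'outgroup.1.species.1.5.genes.0.15' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a model condition name: {name!r}")
    has_og, species_dev, genes_dev = match.groups()
    return ModelCondition(has_og == "1", float(species_dev), float(genes_dev))


def deviation_label(alpha: float) -> str:
    """Map an alpha value to the low/medium/high label used in the text."""
    return DEVIATION_LABELS.get(alpha, "unknown")


if __name__ == "__main__":
    cond = parse_model_condition("outgroup.1.species.1.5.genes.0.15")
    print(cond, deviation_label(cond.species_alpha), deviation_label(cond.genes_alpha))
```

Names that do not follow the scheme raise a `ValueError`, which makes it easy to skip unrelated files when walking the top-level directory.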
101-taxon dataset (S101.tar.gz)
This dataset has four model conditions with varying sequence lengths (1600bp, 800bp, 400bp, 200bp) corresponding to different levels of gene tree estimation error (23%, 31%, 42%, and 55%, respectively). The original dataset is from Zhang et al. (2018) and is available at https://gitlab.com/esayyari/ASTRALIII/. Below is a description of files in each directory.
- truegenetrees: true gene trees
- fasttree_genetrees_[seq-len]_non.[num-genes]: [num-genes] gene trees estimated from sequence alignments with length [seq-len]bp
- s_tree.trees: true species tree in substitution units
- s_tree_gu.trees: true species tree in generation units
- s_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 years
- s_tree_unit_ultrametric.trees: unit ultrametric true species tree
- s_tree_tu_5_calibrations_n[num-calib]_[root_unfixed].txt: calibration information (given in the format of (node, time) pairs)
- RAxML_result.concat_for_fasttree_[seq-len].[num-genes]_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengths given a concatenation of [seq-len]bp sequences for the first [num-genes] genes
- castlespro_fasttree_genetrees_[seq-len]_non.[num-genes]_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengths given [num-genes] FastTree gene trees estimated from [seq-len]bp sequences
- ad.txt: average RF distance between the model species tree and true gene trees
- [dating-method]_n[num-calib]_[root_unfixed]_castlespro_fasttree_genetrees_[seq-len]_non.[num-genes]_s_tree.trees.rooted.labeled.[normalized]: CASTLES-Pro SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points for the model condition corresponding to [num-genes] genes and [seq-len]bp sequences. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a .date.nwk extension. Files with additional extensions include config files and log files for each method.
- [dating-method]_n[num-calib]_[root_unfixed]_RAxML_result.concat_for_fasttree_[seq-len].[num-genes]_s_tree.trees.rooted.labeled.[normalized]: RAxML SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points for the model condition corresponding to [num-genes] genes and [seq-len]bp sequences. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a .date.nwk extension. Files with additional extensions include config files and log files for each method.
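Because dated trees sit next to config and log files that share the same name prefix, it can help to select them by pattern. The sketch below is an illustration only and is not taken from the original scripts. `find_dated_trees` and `GTEE_BY_SEQ_LEN` are hypothetical names, and the suffix filter assumes that dated trees end in `.labeled`, `.normalized`, or `.date.nwk`, which is one reading of the naming scheme above and should be checked against the extracted files.

```python
from pathlib import Path

# Gene tree estimation error per sequence length (bp), as listed in the text.
GTEE_BY_SEQ_LEN = {1600: 0.23, 800: 0.31, 400: 0.42, 200: 0.55}

# Assumption: these suffixes mark a dated tree; other extensions are config or log files.
TREE_SUFFIXES = (".labeled", ".normalized", ".date.nwk")


def find_dated_trees(directory, method, num_calib, branch_lengths):
    """Return dated-tree files for one dating method and calibration count.

    method:         one of "treepl", "wlogdate", "mdcat", "lsd2"
    num_calib:      number of calibration points ([num-calib] in the file names)
    branch_lengths: "castlespro" or "RAxML_result", the name fragment that
                    identifies which SU tree was dated
    """
    prefix = f"{method}_n{num_calib}_"
    fragment = f"_{branch_lengths}"
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file()
        and path.name.startswith(prefix)
        and fragment in path.name
        and path.name.endswith(TREE_SUFFIXES)
    )


if __name__ == "__main__":
    # "." stands in for one extracted model-condition/replicate directory.
    for tree in find_dated_trees(".", "treepl", 3, "castlespro"):
        print(tree.name)
```

Matching on the prefix up to and including the underscore after the calibration count keeps `n1` from also matching `n10`, and the function does not need to know the literal value of the [root_unfixed] token.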
Large dataset (large.tar.gz)
This dataset has 8 model conditions with 50, 100, 200, 500, 1K, 2K, 5K, and 10K-taxon trees with 20 replicates in each condition. Below is a description of files in each directory large/[num-taxa]/[rep-num]/ where [num-taxa] is the model condition (number of taxa) and [rep-num] is the replicate index.
- truegenetrees: true gene trees
- estimatedgenetre: estimated gene trees
- s_tree.trees: true species tree in substitution units
- s_tree_gu.trees: true species tree in generation units
- s_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 years
- s_tree_unit_ultrametric.trees: unit ultrametric true species tree
- s_tree_tu_5_calibrations_n[num-calib].txt: calibration information
- RAxML_result.concat_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengths
- castlespro_estimatedgenetre_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengths
- treepl_n[num-calib]_RAxML_result.concat_s_tree.trees.rooted.labeled: RAxML SU tree dated with TreePL with [num-calib] calibration points
- treepl_n[num-calib]_RAxML_result.concat_s_tree.trees.rooted.labeled.config: config file for TreePL for the RAxML dated tree
- treepl_n[num-calib]_castlespro_estimatedgenetre_s_tree.trees.rooted.labeled: CASTLES-Pro SU tree dated with TreePL with [num-calib] calibration points
- treepl_n[num-calib]_castlespro_estimatedgenetre_s_tree.trees.rooted.labeled.config: config file for TreePL for the CASTLES-Pro dated tree
- ad.txt: average RF distance between the model species tree and true gene trees
- gtee.txt: average RF distance between true and estimated gene trees
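With 8 model conditions and 20 replicates each, it is convenient to walk the large/[num-taxa]/[rep-num]/ layout and confirm that each replicate directory contains the files listed above. The following Python sketch is a hedged illustration and not part of the original scripts. `iter_replicates`, `missing_files`, and `EXPECTED_FILES` are hypothetical names, and the sketch makes no assumption about how the taxon counts are spelled in directory names (for example 1K versus 1000), since it simply reports whatever subdirectories it finds.

```python
from pathlib import Path

# File names from the listing above that do not depend on [num-calib].
EXPECTED_FILES = (
    "s_tree.trees",
    "s_tree_gu.trees",
    "s_tree_tu_5.trees",
    "s_tree_unit_ultrametric.trees",
    "ad.txt",
    "gtee.txt",
)


def iter_replicates(root):
    """Yield (num_taxa, rep_num, path) for each [num-taxa]/[rep-num]/ directory.

    Both labels are returned as strings exactly as they appear on disk,
    in lexicographic order.
    """
    for condition in sorted(Path(root).iterdir()):
        if not condition.is_dir():
            continue
        for replicate in sorted(condition.iterdir()):
            if replicate.is_dir():
                yield condition.name, replicate.name, replicate


def missing_files(replicate_dir):
    """Return the expected file names that are absent from one replicate directory."""
    return [name for name in EXPECTED_FILES if not (Path(replicate_dir) / name).exists()]


if __name__ == "__main__":
    # Assumes large.tar.gz was extracted into ./large
    for num_taxa, rep_num, path in iter_replicates("large"):
        absent = missing_files(path)
        if absent:
            print(f"{num_taxa}/{rep_num}: missing {', '.join(absent)}")
```

Running a check like this after extraction is a cheap way to detect a truncated download of the 8.28 GB archive before starting any analysis.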
Biological datasets
- Neoavian (avian-stiller.tar.gz): 363-taxon neoavian dataset from Stiller et al. (2024) with 63,430 single-copy genes. The original data is available here. Results from the analysis in this study are available at /biological/avian-stiller. Below is a description of files in this directory.
  - mdcat_median_40Kl_caml_stiller.rooted.tre: ASTRAL tree furnished with ConBL branch lengths dated with MD-Cat
  - mdcat_median_40Kl_astral4_stiller_nolabel.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with MD-Cat
  - wlogdate_median_astral4_stiller.rooted.no_label.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with wLogDate
  - wlogdate_median_astral_63K_concat_bl.rooted.no_label.tre: ASTRAL tree furnished with ConBL branch lengths dated with wLogDate
  - treepl_median_astral_63K_concat_bl.rooted.tre: ASTRAL tree furnished with ConBL branch lengths dated with TreePL
  - treepl_median_castlespro_stiller.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with TreePL
  - genera_castlespro_concat.csv: Age of genera estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology
  - families_castlespro_concat.csv: Age of families estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology
  - orders_castlespro_concat.csv: Age of orders estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology
  - ltt_stiller.csv: Lineage-through-time information for different dated trees
  - fossil_list_median.txt: Fossil calibration information for median (50% quantiles)
- Suboscines (suboscines-harvey.tar.gz): 1683-taxon suboscines dataset from Harvey et al. (2020) with 2,389 single-copy genes. The original data is available at https://github.com/mgharvey/tyranni. Results from the analysis in this study are available at /biological/suboscines-harvey. Below is a description of files in this directory.
  - concat_T400F.examl.rooted.tre: Concatenation tree furnished with CAML branch lengths
  - castlespro_T400F.examl.rooted.tre: Concatenation tree furnished with CASTLES-Pro branch lengths
  - treepl_castlespro_T400F.astral.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with TreePL
  - treepl_examl_T400F.astral.rooted.tre: ASTRAL tree furnished with CAML branch lengths dated with TreePL
  - treepl_castlespro_T400F.examl.rooted.tre: Concatenation tree furnished with CASTLES-Pro branch lengths dated with TreePL
  - treepl_concat_T400F.examl.rooted.tre: Concatenation tree furnished with CAML branch lengths dated with TreePL
  - genera_treepl_castlespro_concat.csv: Age of different genera estimated using TreePL+CASTLES-Pro and TreePL+ConBL on the concatenation topology
  - families_treepl_castlespro_concat.csv: Age of different families estimated using TreePL+CASTLES-Pro and TreePL+ConBL on the concatenation topology
  - suboscines_ltt.csv: Lineage-through-time information for the four different dated trees (ASTRAL or CAML furnished with CASTLES-Pro or ConBL)
