The limits of the constant-rate birth-death prior for phylogenetic tree topology inference
Data files
Aug 06, 2023 version files 135.45 MB
-
combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
10.56 KB
-
combined_IQTREE_RevBayes_tree_indices.RData
9.64 KB
-
combined_parameter_distributions.RData
3.53 MB
-
final_timetrees.RData
1.85 MB
-
finaltreeparams_rho1.RData
129.69 MB
-
merged_three_hundred_data.RData
34.78 KB
-
README.md
16.10 KB
-
tree_parameters_rho1.RData
22.37 KB
-
wilcoxon_distance_values_rho1.RData
959 B
-
zscoresdf_rho1.RData
292.22 KB
Oct 31, 2023 version files 135.46 MB
-
combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
10.56 KB
-
combined_IQTREE_RevBayes_tree_indices.RData
9.64 KB
-
combined_parameter_distributions.RData
3.53 MB
-
final_timetrees.RData
1.85 MB
-
finaltreeparams_rho1.RData
129.69 MB
-
merged_three_hundred_data.RData
34.78 KB
-
README.md
16.60 KB
-
tree_parameters_rho1.RData
22.37 KB
-
wilcoxon_distance_values_nooutgroup.RData
959 B
-
wilcoxon_distance_values_rho1.RData
959 B
-
zscoresdf_rho1.RData
292.22 KB
Abstract
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.
README
The limits of the constant-rate birth-death prior for phylogenetic tree topology inference
This repo contains code to reproduce the simulations, analyses, and figures for "The limits of the constant-rate birth-death prior for phylogenetic tree topology inference".
Environment setup:
R version: 4.2.2
The following libraries are used:
library(adegenet)
library(ape)
library(apTreeshape)
library(Claddis)
library(coda)
library(coefplot)
library(colorspace)
library(cowplot)
library(ctv)
library(devtools)
library(dispRity)
library(dotwhisker)
library(dplyr)
library(gamlss)
library(gamlss.dist)
library(gamlss.add)
library(githubinstall)
library(GGally)
library(ggbiplot)
library(ggExtra)
library(gghalves)
library(gginnards)
library(ggplot2)
library(ggplotify)
library(ggpmisc)
library(ggpp)
library(ggpubr)
library(gridExtra)
library(magrittr)
library(openxlsx)
library(parallel)
library(phangorn)
library(phylobase)
library(phyloTop)
library(phytools)
library(plyr)
library(seqinr)
library(stringr)
library(tibble)
library(tidyverse)
library(treebalance)
library(treeCentrality)
library(TreeSim)
library(viridis)
Other software:
RevBayes (version 1.2.1) (Höhna et al. 2016) and IQ-TREE 2 (version 2.2.0) (Minh et al. 2020) were used for supplementary analyses. Both were run using the shell (.sh) scripts mentioned below.
To reproduce the results:
- Run
Tree_Selection.R
to select the empirical phylogenetic trees to be included from TimeTree. The output filefinal_timetrees.RData
contains the final subset of empirical phylogenetic TimeTree trees used for analysis with anonymized tip labels. - Run
Simulation_And_Analysis.R
to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. RunSupplementary_Fig_S1_Analysis.R
to generate data for Supplementary Figure S1. - Run
Meta_analysis.R
to run the linear regression models to investigate the role of the prior/analysis type for the subset (n=300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3). - Run
Imbalance_Simulation.R
to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: To avoid later confusion, the three various substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code. There is therefore no Rate 1; apologies in advance for any confusion. The shell scripts to run the inferences in each software are as follows: 4a. RevBayes:fasta_to_revbayes_code_rate2.sh
,fasta_to_revbayes_code_rate3.sh
,fasta_to_revbayes_code_rate4.sh
These shell scripts use the following .Rev files:MCMC_Revbayes_code_rate2.Rev
,MCMC_Revbayes_code_rate3.Rev
,MCMC_Revbayes_code_rate4.Rev
And rely on the following supplementary .Rev files:tree_BD.Rev
,sub_JC.Rev
,clock_global.Rev
4b. IQ-TREE 2:fasta_to_iqtree_code.sh
- Run
Final_Figures.R
to visualize the results.
Sharing/Access information
SHARING/ACCESS INFORMATION
- Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
- Was data derived from another source? Yes
A. *TimeTree. The uploaded R script selects relevant Newick trees from the TimeTree dataset; the full Newick dataset is available upon request through TimeTree. The
final_timetrees.RData
file includes the final subset of empirical phylogenetic TimeTree trees used for analysis, which is sufficient to recreate the results. Tip names are anonymized. - Recommended citation for this dataset:
Khurana, Mark Poulsen, Scheidwasser-Clow, Neil, Penn, Matthew J., Bhatt, Samir, Duchêne, David A. (2023), The limits of the constant-rate birth-death prior for phylogenetic tree topology inference, Dryad, Dataset, https://doi.org/10.5061/dryad.2fqz612vg
Description of the main appended data, as well as data needed to recreate figures
- File List:
A) final_timetrees.RData
B) tree_parameters_rho1.RData
C) wilcoxon_distance_values_rho1.RData
D) wilcoxon_distance_values_nooutgroup.RData
E) zscoresdf_rho1.RData
F) merged_three_hundred_data.RData
G) combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
H) combined_parameter_distributions.RData
I) combined_IQTREE_RevBayes_tree_indices.RData
J) finaltreeparams_rho1.RData
- Are there multiple versions of the data set? No A. If yes, name of file(s) that was updated: NA i. Why was the file updated? NA ii. When was the file updated? NA
- Brief Description of the Data and Workflow:
A. The
final_timetrees.RData
file includes the final subset of empirical TimeTree trees used for analysis. Not used for any specific figure, but is the departure point for all analyses (i.e., starting fromSimulation_And_Analysis.R
mentioned above). B. Meta-data for all the included empirical trees can be found in the supplementary files. C.tree_parameters_rho1.RData
contains the inferred birth and death parameter values for the empirical trees. Used to create Figure 1. D.wilcoxon_distance_values_rho1.RData
contains the Wilcoxon values comparing empirical and simulated trees across a range of indices. Used to recreate Figure 2a.wilcoxon_distance_values_nooutgroup.RData
is used ot generate Supplementary Figure S1. E.zscoresdf_rho1.RData
contains the Z score values comparing empirical and simulated trees for the top five indices. Used to recreate Figure 2b and Supplementary Figure S2. F.merged_three_hundred_data.RData
contains the metadata and index values for the random subset of 300 trees used for the meta-analysis. Used for Figure 3. G.combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
contains the differences between the IQ-TREE 2 and RevBayes respectively compared to the empirical trees. Used for Figure 4. H.combined_parameter_distributions.RData
contains the index values for the IQ-TREE 2 trees, the idealized prior values, the RevBayes trees, and the values for the entire set of posterior trees from RevBayes. Used for Figure 5 and Supplementary Figure S5. I.combined_IQTREE_RevBayes_tree_indices.RData
contains the index values for the IQ-TREE 2 and RevBayes trees. Used for Supplementary Figure S3. J.finaltreeparams_rho1.RData
contains the full set of index values for both simulated and original (n=1189) empirical trees. Used for Supplementary Figure S4.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: final_timetrees.RData
- Number of variables: NA, contains a list with the final subset of empirical TimeTree trees used for analysis. All are phylo objects.
- Number of cases/rows: 1189 trees
#########################################################################
DATA-SPECIFIC INFORMATION FOR: tree_parameters_rho1.RData
- Number of variables: 4
- Number of cases/rows: 1189
- Variable List: A. birthparameters: Maximum likelihood estimated birth parameter values (λ) inferred for each tree. B. deathparameters: Maximum likelihood estimated death parameter values (μ) inferred for each tree. C. likelihoodparameters: Maximum likelihood estimate for each tree. D. rho: Sampling fraction used for each tree. All = 1.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: wilcoxon_distance_values_rho1.RData
- Number of variables: 3
- Number of cases/rows: 32
- Variable List: A. name: Name of Tree Topology Index B. distance_value: Wilcoxon distance for the association between empirical (n=1189 trees) and simulated trees (n = 1,189,000) for each topology index. C. p_value: P-value associated with each Wilcoxon test.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: wilcoxon_distance_values_nooutgroup.RData
- Number of variables: 3
- Number of cases/rows: 32
- Variable List: A. name: Name of Tree Topology Index B. distance_value: Wilcoxon distance for the association between empirical (n=706 trees) and simulated trees (n = 706,000) for each topology index. C. p_value: P-value associated with each Wilcoxon test.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: zscoresdf_rho1.RData
- Number of variables: 36
- Number of cases/rows: 1189
- Variable List; Each Variable Represents the Z-score for each Tree Topology Index when comparing empirical and their corresponding simulated trees: A. treenumber: Name of Tree Topology Index B. TotalI: Total I index C. MeanI: Mean I index D. Median I E. NormColless: Normalized Colless F. CollessLikeExpMDM: Colless-Like Index G. I2 H. Sackin I. NormSackin: Normalized Sackin J. TotalCophenetic: Total Cophenetic index K. B1 L. B2 M. AvgLeafDepth: Average leaf depth, where depth is the number of leaves in a pending subtree attached to each node. N. leafdepthvariance: Leaf depth variance, where depth is the number of leaves in a pending subtree attached to each node. O. CPrank: Colijn-Plazotta Rank P. normcherry: Normalized cherry index; number of cherries (adjacent pairs of leaves with a common ancestor node) divided by half the number of tips in a tree. Q. Jindex R. SNI S. Sshape: S-shape statistic T. APP, Area per pair index; average distance between all pairs of leaves in a tree, where distance denotes the number of edges that connect leaves. U. normILnumber: Normalized IL number V. normaverageladder: Normalized average ladder length W. maxdepth: Maximum depth, where depth is the number of edges from a leaf to the root. X. maxwidth: Maximum width Y. maxdiffwidth: Maximum difference in widths Z. meandepth: Mean depth, where depth is the number of edges from a leaf to the root. AA. meantopdist: Mean topological distance AB. maxheight: Maximum height (i.e., maximum depth of any leaf) AC. normpitchforks: Normalized pitchforks AD. rootedquartet: Rooted quartet index AE. furnas AF. stairs AG. stairs2 AH. longestpendant: Longest pendant edge length AI. shortestpendant: Shortest pendant edge length AJ. stemminess
#########################################################################
DATA-SPECIFIC INFORMATION FOR: merged_three_hundred_data.RData
- Number of variables: 17
- Number of cases/rows: 299
- Variable List: A. file_name: File name B. TimeTRee / Pubmed ID C. First Author: First author of the study D. Article Title: Article title for the article where the tree is from E. Year: Article publishing year F. DOI / Link: DOI for the relevant article G. Molecular dating method: Dating method used to infer the tree in the study H. Tree Prior (If Specified) I. prior_type: Prior type used: Non-Bayesian, Bayesian-NonBD, Bayesian-BD J. prior_type_and_BD_subgroup: Prior type and extra specification of the BD prior used. K. Median_I: Median I value for each tree L. Norm_Colless: Normalized Colless index value for each tree M. B_2: B2 value for each tree N. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node. O. stairs_2: stairs2 index value for each tree P. Ntips: Number of tips in each tree Q. root_age: Root age for each tree
#########################################################################
DATA-SPECIFIC INFORMATION FOR: combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
- Number of variables: 9
- Number of cases/rows: 600
- Variable List: A. Diff_List_Number: Ignore B. Median_I: Median I value for each tree C. Norm_Colless: Normalized Colless index value for each tree D. B_2: B2 value for each tree E. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node. F. stairs_2: stairs2 index value for each tree G. Topol_Acc: Topological accuracy of the inferred trees compared to the topology of the source trees for the simulated sequences H. Method: Inference method: IQ-TREE2 or RevBayes I. Rate: Substitution Rate: 0.5, 0.05, 0.005
#########################################################################
DATA-SPECIFIC INFORMATION FOR: combined_parameter_distributions.RData
- Number of variables: 9
- Number of cases/rows: 1301200
- Variable List: A. List_Number: Which Source Tree Each Value Belongs To B. Median_I: Difference in Median I value for inferred vs source tree; includes value for entire posterior too C. Norm_Colless: Difference in Normalized Colless value for inferred vs source tree; includes value for entire posterior too D. Diff_B_2: Difference in B2 value for inferred vs source tree; includes value for entire posterior too E. Diff_leaf_depth_variance: Difference in leaf depth variance value for inferred vs source tree; includes value for entire posterior too F. Diff_stairs_2: Difference in stairs2 value for inferred vs source tree; includes value for entire posterior too G. Method: Inference method: IQ-TREE2 or RevBayes H. Rate: Substitution Rate: 0.5, 0.05, 0.005 I. Group: Same as Method
#########################################################################
DATA-SPECIFIC INFORMATION FOR: combined_IQTREE_RevBayes_tree_indices.RData
- Number of variables: 8
- Number of cases/rows: 600
- Variable List: A. List_Number: Which Source Tree Each Value Belongs To B. Median_I: Median I value for each tree C. Norm_Colless: Normalized Colless index value for each tree D. B_2: B2 value for each tree E. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node. F. stairs_2: stairs2 index value for each tree G. Method: Inference method: IQ-TREE2 or RevBayes H. Rate: Substitution Rate: 0.5, 0.05, 0.005
#########################################################################
DATA-SPECIFIC INFORMATION FOR: finaltreeparams_rho1.RData
- Number of variables: 39
- Number of cases/rows: 1190189
- Variable List: A. treenumber: Which Source Tree Each Value Belongs To B. EmpOrSim: Is the tree an empirical or simulated one C. leaves: Number of leaves in each tree D. TotalI: Total I index E. MeanI: Mean I index F. Median I G. NormColless: Normalized Colless H. CollessLikeExpMDM: Colless-Like Index I. I2 J. Sackin K. NormSackin: Normalized Sackin L. TotalCophenetic: Total Cophenetic index M. B1 N. B2 O. AvgLeafDepth: Average leaf depth, where depth is the number of leaves in a pending subtree attached to each node. P. leafdepthvariance: Leaf depth variance, where depth is the number of leaves in a pending subtree attached to each node. Q. CPrank: Colijn-Plazotta Rank R. normcherry: Normalized cherry index, # cherries (adjacent pairs of leaves with a common ancestor node) divided by half the number of tips in a tree. S. Jindex T. SNI U. Sshape: S-shape statistic V. APP, Area per pair index; average distance between all pairs of leaves in a tree, where distance denotes the number of edges that connect leaves. W. normILnumber: Normalized IL number X. normaverageladder: Normalized average ladder length Y. maxdepth: Maximum depth, where depth is the number of edges from a leaf to the root. Z. maxwidth: Maximum width AA. maxdiffwidth: Maximum difference in widths AB. meandepth: Mean depth, where depth is the number of edges from a leaf to the root. AC. meantopdist: Mean topological distance AD. maxheight: Maximum height (i.e., maximum depth of any leaf) AE. normpitchforks: Normalized pitchforks AF. rootedquartet: Rooted quartet index AG. furnas AH. stairs AI. stairs2 AJ. longestpendant: Longest pendant edge length AK. shortestpendant: Shortest pendant edge length AL. stemminess
Paper Citation
TBD
Methods
Empirical trees used in the study are trees from the literature, collected by TimeTree (timetree.org).
Run ```Tree_Selection.R``` to select the empirical phylogenetic trees to be included from TimeTree. The output file ```final_timetrees.RData``` contains the final subset of empirical phylogenetic TimeTree trees used for analysis with anonymized tip labels.
2. Run ```Simulation_And_Analysis.R``` to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run ```Supplementary_Fig_S1_Analysis.R``` to generate data for Supplementary Figure S1.
3. Run ```Meta_analysis.R``` to run the linear regression models to investigate the role of the prior/analysis type for the subset (n=300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).
4. Run ```Imbalance_Simulation.R``` to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: To avoid later confusion, the three various substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code. There is therefore no Rate 1; apologies in advance for any confusion. The shell scripts to run the inferences in each software are as follows: 4a. RevBayes: ```fasta_to_revbayes_code_rate2.sh```, ```fasta_to_revbayes_code_rate3.sh```, ```fasta_to_revbayes_code_rate4.sh``` These shell scripts use the following .Rev files: ```MCMC_Revbayes_code_rate2.Rev```, ```MCMC_Revbayes_code_rate3.Rev```, ```MCMC_Revbayes_code_rate4.Rev``` And rely on the following supplementary .Rev files: ```tree_BD.Rev```, ```sub_JC.Rev```, ```clock_global.Rev``` 4b. IQ-TREE 2: ```fasta_to_iqtree_code.sh```
5. Run ```Final_Figures.R``` to visualize the results.
Usage notes
R version: 4.2.2
Other software: RevBayes (version 1.2.1) (Höhna et al. 2016) and IQ-TREE 2 (version 2.2.0) (Minh et al. 2020) were used for supplementary analyses.