The limits of the constant-rate birth-death prior for phylogenetic tree topology inference

Khurana, Mark Poulsen 1 ; Scheidwasser-Clow, Neil1; Penn, Matthew2; Bhatt, Samir1; Duchêne, David A.1

Published Aug 06, 2023; Updated Nov 06, 2023 on Dryad. https://doi.org/10.5061/dryad.2fqz612vg

Abstract

Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.

The limits of the constant-rate birth-death prior for phylogenetic tree topology inference

This repo contains code to reproduce the simulations, analyses, and figures for "The limits of the constant-rate birth-death prior for phylogenetic tree topology inference".

Environment setup:

R version: 4.2.2

The following libraries are used:

library(adegenet)
library(ape)
library(apTreeshape)
library(Claddis)
library(coda)
library(coefplot)
library(colorspace)
library(cowplot)
library(ctv)
library(devtools)
library(dispRity)
library(dotwhisker)
library(dplyr)
library(gamlss)
library(gamlss.dist)
library(gamlss.add)
library(githubinstall)
library(GGally)
library(ggbiplot)
library(ggExtra)
library(gghalves)
library(gginnards)
library(ggplot2)
library(ggplotify)
library(ggpmisc)
library(ggpp)
library(ggpubr)
library(gridExtra)
library(magrittr)
library(openxlsx)
library(parallel)
library(phangorn)
library(phylobase)
library(phyloTop)
library(phytools)
library(plyr)
library(seqinr)
library(stringr)
library(tibble)
library(tidyverse)
library(treebalance)
library(treeCentrality)
library(TreeSim)
library(viridis)

Other software:

RevBayes (version 1.2.1) (Höhna et al. 2016) and IQ-TREE 2 (version 2.2.0) (Minh et al. 2020) were used for supplementary analyses. Both were run using the shell (.sh) scripts mentioned below.

To reproduce the results:

Run Tree_Selection.R to select the empirical phylogenetic trees to be included from TimeTree. The output file final_timetrees.RData contains the final subset of empirical phylogenetic TimeTree trees used for analysis with anonymized tip labels.
Run Simulation_And_Analysis.R to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run Supplementary_Fig_S1_Analysis.R to generate data for Supplementary Figure S1.
Run Meta_analysis.R to run the linear regression models to investigate the role of the prior/analysis type for the subset (n=300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).
Run Imbalance_Simulation.R to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: To avoid later confusion, the three various substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code. There is therefore no Rate 1; apologies in advance for any confusion. The shell scripts to run the inferences in each software are as follows:
4a. RevBayes: fasta_to_revbayes_code_rate2.sh, fasta_to_revbayes_code_rate3.sh, fasta_to_revbayes_code_rate4.sh
These shell scripts use the following .Rev files: MCMC_Revbayes_code_rate2.Rev, MCMC_Revbayes_code_rate3.Rev, MCMC_Revbayes_code_rate4.Rev
And rely on the following supplementary .Rev files: tree_BD.Rev, sub_JC.Rev, clock_global.Rev
4b. IQ-TREE 2: fasta_to_iqtree_code.sh
Run Final_Figures.R to visualize the results.

Sharing/Access information

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
Was data derived from another source? Yes
A. *TimeTree. The uploaded R script selects relevant Newick trees from the TimeTree dataset; the full Newick dataset is available upon request through TimeTree. The final_timetrees.RData file includes the final subset of empirical phylogenetic TimeTree trees used for analysis, which is sufficient to recreate the results. Tip names are anonymized.
Recommended citation for this dataset:

Khurana, Mark Poulsen, Scheidwasser-Clow, Neil, Penn, Matthew J., Bhatt, Samir, Duchêne, David A. (2023), The limits of the constant-rate birth-death prior for phylogenetic tree topology inference, Dryad, Dataset, https://doi.org/10.5061/dryad.2fqz612vg

Description of the main appended data, as well as data needed to recreate figures

File List:

A) final_timetrees.RData
B) tree_parameters_rho1.RData
C) wilcoxon_distance_values_rho1.RData
D) wilcoxon_distance_values_nooutgroup.RData
E) zscoresdf_rho1.RData
F) merged_three_hundred_data.RData
G) combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData
H) combined_parameter_distributions.RData
I) combined_IQTREE_RevBayes_tree_indices.RData
J) finaltreeparams_rho1.RData

Are there multiple versions of the data set? No
A. If yes, name of file(s) that was updated: NA
i. Why was the file updated? NA
ii. When was the file updated? NA
Brief Description of the Data and Workflow:
A. The final_timetrees.RData file includes the final subset of empirical TimeTree trees used for analysis. Not used for any specific figure, but is the departure point for all analyses (i.e., starting from Simulation_And_Analysis.R mentioned above).
B. Meta-data for all the included empirical trees can be found in the supplementary files.
C. tree_parameters_rho1.RData contains the inferred birth and death parameter values for the empirical trees. Used to create Figure 1.
D. wilcoxon_distance_values_rho1.RData contains the Wilcoxon values comparing empirical and simulated trees across a range of indices. Used to recreate Figure 2a. wilcoxon_distance_values_nooutgroup.RData is used ot generate Supplementary Figure S1.
E. zscoresdf_rho1.RData contains the Z score values comparing empirical and simulated trees for the top five indices. Used to recreate Figure 2b and Supplementary Figure S2.
F. merged_three_hundred_data.RData contains the metadata and index values for the random subset of 300 trees used for the meta-analysis. Used for Figure 3.
G. combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData contains the differences between the IQ-TREE 2 and RevBayes respectively compared to the empirical trees. Used for Figure 4.
H. combined_parameter_distributions.RData contains the index values for the IQ-TREE 2 trees, the idealized prior values, the RevBayes trees, and the values for the entire set of posterior trees from RevBayes. Used for Figure 5 and Supplementary Figure S5.
I. combined_IQTREE_RevBayes_tree_indices.RData contains the index values for the IQ-TREE 2 and RevBayes trees. Used for Supplementary Figure S3.
J. finaltreeparams_rho1.RData contains the full set of index values for both simulated and original (n=1189) empirical trees. Used for Supplementary Figure S4.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: final_timetrees.RData

Number of variables: NA, contains a list with the final subset of empirical TimeTree trees used for analysis. All are phylo objects.
Number of cases/rows: 1189 trees

#########################################################################

DATA-SPECIFIC INFORMATION FOR: tree_parameters_rho1.RData

Number of variables: 4
Number of cases/rows: 1189
Variable List:
A. birthparameters: Maximum likelihood estimated birth parameter values (λ) inferred for each tree.
B. deathparameters: Maximum likelihood estimated death parameter values (μ) inferred for each tree.
C. likelihoodparameters: Maximum likelihood estimate for each tree.
D. rho: Sampling fraction used for each tree. All = 1.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: wilcoxon_distance_values_rho1.RData

Number of variables: 3
Number of cases/rows: 32
Variable List:
A. name: Name of Tree Topology Index
B. distance_value: Wilcoxon distance for the association between empirical (n=1189 trees) and simulated trees (n = 1,189,000) for each topology index.
C. p_value: P-value associated with each Wilcoxon test.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: wilcoxon_distance_values_nooutgroup.RData

Number of variables: 3
Number of cases/rows: 32
Variable List:
A. name: Name of Tree Topology Index
B. distance_value: Wilcoxon distance for the association between empirical (n=706 trees) and simulated trees (n = 706,000) for each topology index.
C. p_value: P-value associated with each Wilcoxon test.

#########################################################################

DATA-SPECIFIC INFORMATION FOR: zscoresdf_rho1.RData

Number of variables: 36
Number of cases/rows: 1189
Variable List; Each Variable Represents the Z-score for each Tree Topology Index when comparing empirical and their corresponding simulated trees:
A. treenumber: Name of Tree Topology Index
B. TotalI: Total I index
C. MeanI: Mean I index
D. Median I
E. NormColless: Normalized Colless
F. CollessLikeExpMDM: Colless-Like Index
G. I2
H. Sackin
I. NormSackin: Normalized Sackin
J. TotalCophenetic: Total Cophenetic index
K. B1
L. B2
M. AvgLeafDepth: Average leaf depth, where depth is the number of leaves in a pending subtree attached to each node.
N. leafdepthvariance: Leaf depth variance, where depth is the number of leaves in a pending subtree attached to each node.
O. CPrank: Colijn-Plazotta Rank
P. normcherry: Normalized cherry index; number of cherries (adjacent pairs of leaves with a common ancestor node) divided by half the number of tips in a tree.
Q. Jindex
R. SNI
S. Sshape: S-shape statistic
T. APP, Area per pair index; average distance between all pairs of leaves in a tree, where distance denotes the number of edges that connect leaves.
U. normILnumber: Normalized IL number
V. normaverageladder: Normalized average ladder length
W. maxdepth: Maximum depth, where depth is the number of edges from a leaf to the root.
X. maxwidth: Maximum width
Y. maxdiffwidth: Maximum difference in widths
Z. meandepth: Mean depth, where depth is the number of edges from a leaf to the root.
AA. meantopdist: Mean topological distance
AB. maxheight: Maximum height (i.e., maximum depth of any leaf)
AC. normpitchforks: Normalized pitchforks
AD. rootedquartet: Rooted quartet index
AE. furnas
AF. stairs
AG. stairs2
AH. longestpendant: Longest pendant edge length
AI. shortestpendant: Shortest pendant edge length
AJ. stemminess

#########################################################################

DATA-SPECIFIC INFORMATION FOR: merged_three_hundred_data.RData

Number of variables: 17
Number of cases/rows: 299
Variable List:
A. file_name: File name
B. TimeTRee / Pubmed ID
C. First Author: First author of the study
D. Article Title: Article title for the article where the tree is from
E. Year: Article publishing year
F. DOI / Link: DOI for the relevant article
G. Molecular dating method: Dating method used to infer the tree in the study
H. Tree Prior (If Specified)
I. prior_type: Prior type used: Non-Bayesian, Bayesian-NonBD, Bayesian-BD
J. prior_type_and_BD_subgroup: Prior type and extra specification of the BD prior used.
K. Median_I: Median I value for each tree
L. Norm_Colless: Normalized Colless index value for each tree
M. B_2: B2 value for each tree
N. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node.
O. stairs_2: stairs2 index value for each tree
P. Ntips: Number of tips in each tree
Q. root_age: Root age for each tree

#########################################################################

DATA-SPECIFIC INFORMATION FOR: combined_diff_index_values_to_empirical_index_IQTREE_RevBayes.RData

Number of variables: 9
Number of cases/rows: 600
Variable List:
A. Diff_List_Number: Ignore
B. Median_I: Median I value for each tree
C. Norm_Colless: Normalized Colless index value for each tree
D. B_2: B2 value for each tree
E. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node.
F. stairs_2: stairs2 index value for each tree
G. Topol_Acc: Topological accuracy of the inferred trees compared to the topology of the source trees for the simulated sequences
H. Method: Inference method: IQ-TREE2 or RevBayes
I. Rate: Substitution Rate: 0.5, 0.05, 0.005

#########################################################################

DATA-SPECIFIC INFORMATION FOR: combined_parameter_distributions.RData

Number of variables: 9
Number of cases/rows: 1301200
Variable List:
A. List_Number: Which Source Tree Each Value Belongs To
B. Median_I: Difference in Median I value for inferred vs source tree; includes value for entire posterior too
C. Norm_Colless: Difference in Normalized Colless value for inferred vs source tree; includes value for entire posterior too
D. Diff_B_2: Difference in B2 value for inferred vs source tree; includes value for entire posterior too
E. Diff_leaf_depth_variance: Difference in leaf depth variance value for inferred vs source tree; includes value for entire posterior too
F. Diff_stairs_2: Difference in stairs2 value for inferred vs source tree; includes value for entire posterior too
G. Method: Inference method: IQ-TREE2 or RevBayes
H. Rate: Substitution Rate: 0.5, 0.05, 0.005
I. Group: Same as Method

#########################################################################

DATA-SPECIFIC INFORMATION FOR: combined_IQTREE_RevBayes_tree_indices.RData

Number of variables: 8
Number of cases/rows: 600
Variable List:
A. List_Number: Which Source Tree Each Value Belongs To
B. Median_I: Median I value for each tree
C. Norm_Colless: Normalized Colless index value for each tree
D. B_2: B2 value for each tree
E. leaf_depth_variance: Leaf depth variance for each tree, where depth is the number of leaves in a pending subtree attached to each node.
F. stairs_2: stairs2 index value for each tree
G. Method: Inference method: IQ-TREE2 or RevBayes
H. Rate: Substitution Rate: 0.5, 0.05, 0.005

#########################################################################

DATA-SPECIFIC INFORMATION FOR: finaltreeparams_rho1.RData

Number of variables: 39
Number of cases/rows: 1190189
Variable List:
A. treenumber: Which Source Tree Each Value Belongs To
B. EmpOrSim: Is the tree an empirical or simulated one
C. leaves: Number of leaves in each tree
D. TotalI: Total I index
E. MeanI: Mean I index
F. Median I
G. NormColless: Normalized Colless
H. CollessLikeExpMDM: Colless-Like Index
I. I2
J. Sackin
K. NormSackin: Normalized Sackin
L. TotalCophenetic: Total Cophenetic index
M. B1
N. B2
O. AvgLeafDepth: Average leaf depth, where depth is the number of leaves in a pending subtree attached to each node.
P. leafdepthvariance: Leaf depth variance, where depth is the number of leaves in a pending subtree attached to each node.
Q. CPrank: Colijn-Plazotta Rank
R. normcherry: Normalized cherry index, # cherries (adjacent pairs of leaves with a common ancestor node) divided by half the number of tips in a tree.
S. Jindex
T. SNI
U. Sshape: S-shape statistic
V. APP, Area per pair index; average distance between all pairs of leaves in a tree, where distance denotes the number of edges that connect leaves.
W. normILnumber: Normalized IL number
X. normaverageladder: Normalized average ladder length
Y. maxdepth: Maximum depth, where depth is the number of edges from a leaf to the root.
Z. maxwidth: Maximum width
AA. maxdiffwidth: Maximum difference in widths
AB. meandepth: Mean depth, where depth is the number of edges from a leaf to the root.
AC. meantopdist: Mean topological distance
AD. maxheight: Maximum height (i.e., maximum depth of any leaf)
AE. normpitchforks: Normalized pitchforks
AF. rootedquartet: Rooted quartet index
AG. furnas
AH. stairs
AI. stairs2
AJ. longestpendant: Longest pendant edge length
AK. shortestpendant: Shortest pendant edge length
AL. stemminess

Paper Citation

TBD

Empirical trees used in the study are trees from the literature, collected by TimeTree (timetree.org).

Run ```Tree_Selection.R``` to select the empirical phylogenetic trees to be included from TimeTree. The output file ```final_timetrees.RData``` contains the final subset of empirical phylogenetic TimeTree trees used for analysis with anonymized tip labels.

2. Run ```Simulation_And_Analysis.R``` to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run ```Supplementary_Fig_S1_Analysis.R``` to generate data for Supplementary Figure S1.

3. Run ```Meta_analysis.R``` to run the linear regression models to investigate the role of the prior/analysis type for the subset (n=300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).

4. Run ```Imbalance_Simulation.R``` to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: To avoid later confusion, the three various substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code. There is therefore no Rate 1; apologies in advance for any confusion. The shell scripts to run the inferences in each software are as follows: 4a. RevBayes: ```fasta_to_revbayes_code_rate2.sh```, ```fasta_to_revbayes_code_rate3.sh```, ```fasta_to_revbayes_code_rate4.sh``` These shell scripts use the following .Rev files: ```MCMC_Revbayes_code_rate2.Rev```, ```MCMC_Revbayes_code_rate3.Rev```, ```MCMC_Revbayes_code_rate4.Rev``` And rely on the following supplementary .Rev files: ```tree_BD.Rev```, ```sub_JC.Rev```, ```clock_global.Rev``` 4b. IQ-TREE 2: ```fasta_to_iqtree_code.sh```

5. Run ```Final_Figures.R``` to visualize the results.

The limits of the constant-rate birth-death prior for phylogenetic tree topology inference

Data files

Abstract

README

The limits of the constant-rate birth-death prior for phylogenetic tree topology inference

Environment setup:

Other software:

To reproduce the results:

Sharing/Access information

Description of the main appended data, as well as data needed to recreate figures

Paper Citation

Methods

Usage notes

Works referencing this dataset