The shape of phylogenies under phasetype distributed times to speciation and extinction
Data files
Aug 25, 2022 version files 2.84 MB

Angiosperms.tre

Anguimorpha.txt

Gekkota.txt

Iguania.txt

Magnoliidae.txt

Monocotyledoneae.txt

README.docx

README.pdf

Squamates.txt

Superasteridae.txt

Superrosidae.txt
Abstract
Phylogenetic trees describe relationships between extant species, but beyond that their shape and their relative branch lengths can provide information on broader evolutionary processes of speciation and extinction. However, currently, many of the most widely used macroevolutionary models make predictions about the shapes of phylogenetic trees that differ considerably from what is observed in empirical phylogenies. Here, we propose a flexible and biologically plausible macroevolutionary model for phylogenetic trees where times to speciation or extinction events are drawn from a Coxian phasetype (PH) distribution. First, we show that different choices of parameters in our model lead to a range of tree balances as measured by Aldous' $\beta$ statistic. In particular, we demonstrate that it is possible to find parameters that correspond well to empirical tree balance. Next, we provide a natural extension of the $\beta$ statistic to sets of trees. This extension produces less biased estimates of $\beta$ compared to using the median $\beta$ values from individual trees. Furthermore, we derive a likelihood expression for the probability of observing an edgeweighted tree under a model with speciation but no extinction. Finally, we illustrate the application of our model by performing both absolute and relative goodnessoffit tests for two large empirical phylogenies (squamates and angiosperms) that compare models with Coxian PH distributed times to speciation with models that assume exponential or Weibull distributed waiting times. In our numerical analysis, we found that, in most cases, models assuming a Coxian PH distribution provided the best fit.
Usage notes
Tree shapes based on simulated trees using phasetype distributions
Here, we examine tree shapes via Aldous’ β statistic (Aldous 1996) and the γ statistic (Pybus and Harvey 2000) based on simulated trees generated using agedependent speciation rates and a constant exponential extinction rate. Similarly, we also conduct a simulation where trees are generated using a constant exponential speciation rate and agedependent extinction rates. For both scenarios, we suppose that the agedependent rates are decreased with species age using Example 1 in the main manuscript and implemented in “DecreasingPH_SpeciationExtinction_fixedZ.R”. For this simulation, we generated 300 reconstructed simulated trees with 100 extant taxa under symmetric and asymmetric speciation modes using TreeSimGM package (Hagen and Stadler 2018) on R. For each scenario, we provide the values of the β statistic estimated from a set of trees as explained in the manuscript. If you would like to skip to the result of the simulations as presented in Figure 5 in the manuscript, you can load “DecreasingPH_Workspace_fixedZ.RData” and plot the figure. Code and workspace are available inside “tree_shapes_based_on_simulated_trees_using_phasetype_distributions.zip”.
Estimate of the β statistic based on set of trees
“Sim_PH_Yule_Treesets.R” is used to generate a set of 1000 trees without extinction under Yule process and a set of 1000 trees under phasetype distribution for times to speciation event. The generated sets are stored in “Simulated_PH_Treesets.RData” and “Simulated Yule Treesets.RData”. “PH_Treesets_Ind_Beta_CI.R” takes “Simulated_PH_Treesets.RData” as the input and is used to compute individual beta values from each tree in a given set of trees generated using phasetype distribution and treeset beta values from each tree set. The confidence interval for the treeset beta estimates is also included. More details can be found Section 2.3 and Section 3.1 of the main manuscript. The results are stored in “PH_Individual_Beta.RData” and “PH_Treesets_Beta_and_CI.RData”, and correspond to Figure 4ab in the manuscript. Similar procedure is used for sets of trees generated under Yule process where “Yule_Treesets_Ind_Beta_CI.R” takes “Simulated_Yule_Treesets.RData” as the input and their results are stored in “Yule_Individual_Beta.RData” and “Yule_Treesets_Beta_and_CI.RData”. They correspond to Figure 4cd in the main manuscript. These figures are generated using “Plot_CI_PH_Treeset_Ind_Beta.R” and “Plot_CI_Yule_Treeset_Ind_Beta.R”. Code and data are stored at “estimate_of_the_beta_statistic_based_on_set_of_trees.zip”.
Fitting parameters of phasetype distribution using simulated branch lengths
“Fit_PHdist_SimBranchLengths.R” is the main code where we simulate a set of 50 trees, each with 50 extant tips, and times to speciation are drawn from a phasetype distribution with known parameters. We consider three different simulation schemes: 1) simulate a set of 50 trees without extinction, 2) simulated a set of 50 trees with a constant exponential extinction rate of 0.1, 3) simulate a set of 50 trees with a constant exponential extinction rate of 0.4. Generated trees under scheme (1)(3) are stored in “Sim_PH_Tree_NoExt.RData”, “Sim_PH_Tree_Ext_0.1.RData”, “Sim_PH_Tree_Ext_0.4.RData”, respectively. For each simulation, we gather the branch lengths from their reconstructed trees to fit parameter values of a Coxian phasetype distribution shown in Example 1 in the manuscript. We use the builtin optim package on R with “LBFGSB” method (Byrd et al. 1995) to find parameter values that maximise the probability of observing a set of trees described in Equation 14 in the manuscript. The fitted parameter values under scheme (1)(3) are stored in “Fitted_PH_Param_NoExt.RData”, “Fitted_PH_Param_Ext_0.1.RData”, “Fitted_PH_Param_Ext_0.4.RData”, respectively. Using fitted parameter values from each setup, we simulate trees with the same number of tips as in the simulated data and compare their distribution of branch lengths with that of the simulated data. These fitted trees under scheme (1)(3) are stored in “Fitted_PH_Tree_NoExt.RData”, “Fitted_PH_Tree_Ext_0.1.RData”, “Fitted_PH_Tree_Ext_0.4.RData”. The results of this simulation correspond to Figure 6, 7, and 8 in the manuscript. We also apply the twosample KolmogorovSmirnov tests to compare panel (b) and (c) in those figures. The result can be seen in Table 1 of the manuscript. Fitted and simulated trees are stored inside a folder at “fitting_parameters_of_phasetype_distribution_using_simulated_branch_lengths.zip”.
Model selection based on empirical phylogenies
Using the squamate reconstructed tree, stored in “Squamates.txt”, (Zheng and Wiens 2016) and the angiosperm reconstructed tree, stored in “Angiosperms.tre”, (Zanne et al. 2014), we use the Dendroscope 3 (Huson and Scornavacca 2012) to extract the following clades from the squamate tree: Gekkota (stored in “Gekkota.txt”), Iguania (stored in “Iguania.txt”), and Anguimorpha (stored in “Anguimorpha.txt”). “Model_Selection_Squamate_Tree.R” uses branch lengths from the squamate tree and those clades to fit parameter values of nine different distributions, namely the constant rate BirthDeath (crBD) model from Nee et al. (1994b) (stored in “Fitted_crBD” folder), the Coxian PH distribution with decreasing rate described in Example 1 (stored in “Fitted_Coxian_with_Decreasing_Rate” folder), the Coxian PH distribution with increasing rate described in Example 2 (stored in “Fitted_Coxian_with_Increasing_Rate” folder), the general Coxian distribution defined in Definition 2 with four, five, and six nonabsorbing states (stored in “Fitted_General_Coxian” folder), exponential distribution (stored in “Fitted_Exponential_Distribution” folder), and Weibull distribution (stored in “Fitted_Weibull_Distribution” folder). Then, we perform model selection via AIC (Anderson and Burnham 2004) to select the best model. More details can be found in Section 4 of the main manuscript. The result corresponds to Table 2 in the manuscript. We apply the same procedure using the following clades on the angiosperm tree: Monocotyledoneae (stored in “Monocotyledoneae.txt”), Magnoliidae (stored in “Magnoliidae.txt”), Superasteridae (stored in “Superasteridae.txt”), and Superrosidae (stored in “Superrosidae.txt”). The result corresponds to Table 4 in the manuscript. We also perform absolute goodnessoffit by comparing empirical branch length distribution from each clade from both trees with their fitted branch length densities using KolmogorovSmirnov test (Table 3 and Table 5), shown in Figure 9 and 11. The main code also takes the bestfitting parameter values from the general Coxian distribution with four nonabsorbing states (stored in “Angiosperm_Tree” and “Squamate_Tree” folders) to plot the hazard function for each clade in both trees (shown in Fig.10 and 12). Code and data are stored at “model_selection_based_on_empirical_phylogenies.zip”.
Supplementary Codes
“gn(i)_equiv_large_n.R” are used to generate Figure 13. Code is stored at “supplementary_codes.zip”.
References
 Aldous DJ (1996) Probability distributions on cladograms. In: Random discrete structures, Springer, pp 1–18
 Anderson D, Burnham K (2004) Model selection and multimodel inference. Second NY: SpringerVerlag 63(2020):10
 Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 16(5):1190–1208
 Hagen O, Stadler T (2018) TreeSimGM: simulating phylogenetic trees under general Bellman–Harris models with lineagespecific shifts of speciation and extinction in R. Methods Ecol Evol 9(3):754–760
 Huson DH, Scornavacca C (2012) Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst Biol 61(6):1061–1067
 Nee S, May RM, Harvey PH (1994b) The reconstructed evolutionary process. Philosophical Transactions of the Royal Society of London Series B: Biological Sciences 344(1309):305–311
 Pybus OG, Harvey PH (2000) Testing macro–evolutionary models using incomplete molecular phylogenies. Proc Royal Soc B 267(1459):2267–2272
 Zanne AE, Tank DC, Cornwell WK, Eastman JM, Smith SA, FitzJohn RG, McGlinn DJ, O’Meara BC, Moles AT, Reich PB, et al. (2014) Three keys to the radiation of angiosperms into freezing environments. Nature 506(7486):89–92
 Zheng Y, Wiens JJ (2016) Combining phylogenomic and supermatrix approaches, and a timecalibrated phylogeny for squamate reptiles (lizards and snakes) based on 52 genes and 4162 species. Molecular phylogenetics and evolution 94:537–547