Commonly used Bayesian diversification methods lead to biologically meaningful differences in branch-specific rates on empirical phylogenies
Data files
Jan 15, 2025 version files 31.15 GB
- CompDiv.tar.gz (9.13 GB)
- output.tar.gz (22.02 GB)
- README.md (16.88 KB)
Abstract
Identifying along which lineages shifts in diversification rates occur is a central goal of comparative phylogenetics; these shifts may coincide with key evolutionary events such as the development of novel morphological characters, the acquisition of adaptive traits, polyploidization or other structural genomic changes, or dispersal to a new habitat and subsequent increase in environmental niche space. However, while multiple methods now exist to estimate diversification rates and identify shifts using phylogenetic topologies, the appropriate use and accuracy of these methods is hotly debated. Here we test whether five Bayesian methods—Bayesian Analysis of Macroevolutionary Mixtures (BAMM), two implementations of the Lineage-Specific Birth-Death-Shift model (LSBDS and PESTO), the approximate Multi-Type Birth-Death model (MTBD; implemented in BEAST2), and the cladogenetic diversification rate shift model (CLaDS2)—produce comparable results. We apply each of these methods to a set of 65 empirical time-calibrated phylogenies and compare inferences of speciation rate, extinction rate, and net diversification rate. We find that the five methods often infer different speciation, extinction, and net-diversification rates. Consequently, these different estimates may lead to different interpretations of the macroevolutionary dynamics. The different estimates can be attributed to fundamental differences among the compared models. Therefore, the inference of shifts in diversification rates is strongly method-dependent. We advise biologists to apply multiple methods to test the robustness of the conclusions or to carefully select the method based on the validity of the underlying model assumptions to their particular empirical system.
README: Commonly used Bayesian diversification methods lead to biologically meaningful differences in branch-specific rates on empirical phylogenies
Overview
This directory contains the code and analysis used in the paper Martínez-Gómez, Song and Tribble et al. 2023. In short, this study tests whether different Bayesian evolutionary models that aim to estimate branch-specific shifts in diversification rates on a phylogeny infer similar or different shifts when applied to a set of empirically derived phylogenies. We test five methods: BAMM, PESTO, MTBD, LSBDS, and ClaDS2. Please see the corresponding publication in Evolution Letters for more details.
This repository contains two main files: Output.tar.gz and CompDiv.tar.gz. The Output.tar.gz file contains the outputs from running each of the five methods investigated in the study. The CompDiv.tar.gz file contains files associated with the analysis of the outputs of Output.tar.gz.
If you have any trouble with any aspect of this Dryad repository or the publication, please email Jesús Martínez-Gómez (martinezg.jesus at gmail.com).
A few notes
- The uncompressed files are quite large; please have at least 200 GB of free space to comfortably open both.
- Throughout this repository, the name of each phylogeny ([Phylogeny_name]) follows the naming convention of Henao Diaz et al. 2019: the first letter of the first author's last name, followed by the year of publication. See Table S1 for a key that gives the corresponding taxonomic group and publication reference. The two exceptions to this convention are the phylogenies named "Sample_Primates" and "Samples_Whale". These two datasets are often used as example phylogenies in tutorials for these models. The references for both phylogenies are listed in Table S1.
- Due to licensing, the phylogeny files themselves are not included in this Dryad repository. You can find the phylogenies at the following GitHub repository.
- A number of the scripts used in this analysis have been rewritten as functions to help users run comparable analyses on their own. Specifically, these cover the code for Markov chain Monte Carlo (MCMC) processing (e.g., removing burn-in), calculating summary statistics, and plotting rates on trees. A tutorial with further explanation of these functions can be found at github/Jesusthebotanist/CompDiv_processing_and_plotting
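Given the disk-space note above, it may help to check free space before unpacking. The following is a minimal Python sketch, not part of the original analysis code. The helper names `has_enough_space` and `extract_archive` are hypothetical, and the 200 GB threshold is simply the figure suggested above.

```python
import shutil
import tarfile
from pathlib import Path

# Assumption: 200 GB, the free space suggested in the notes above.
REQUIRED_FREE_BYTES = 200 * 10**9


def has_enough_space(target_dir, required=REQUIRED_FREE_BYTES):
    """Return True if the filesystem holding target_dir has `required` bytes free."""
    return shutil.disk_usage(target_dir).free >= required


def extract_archive(archive_path, target_dir, required=REQUIRED_FREE_BYTES):
    """Extract a .tar.gz archive into target_dir after a free-space check."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    if not has_enough_space(target, required):
        raise OSError(f"Less than {required} bytes free in {target}")
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target)
    return target
```

For example, `extract_archive("Output.tar.gz", "unpacked")` would unpack the method outputs into a hypothetical `unpacked` directory, assuming the archive sits in the working directory.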
1. Individual method output
Output.tar.gz: This folder contains the output of each analysis (i.e., ClaDS2, LSBDS, MTBD, BAMM, PESTO). Each folder corresponds to one phylogeny (see the note above on naming). Within each folder there is a set of subdirectories, one per method. To understand the output of each method, we recommend consulting the respective method publication or online resource; we have linked them below for your convenience. While most output files are common file types (e.g., .txt, .csv, .tsv), two file types are of note: .log and .tree files. LSBDS and MTBD both produce .log files. These are tab-delimited files that record the iterations of an MCMC. While they can be opened in a spreadsheet program (e.g., Microsoft Excel, Google Sheets), it is more useful to analyze them in the free program Tracer. Lastly, MSBD produces a [Phylogeny_name]_default.rates.trees file. This file type stores phylogenies in either Newick or Nexus format and can generally be viewed with free programs such as IcyTree or FigTree. However, in this case we do not recommend doing so; please see the description of the file below. The method-specific references are followed by a description of the structure of Output.tar.gz.
- BAMM
- MTBD
- ClaDS2
- LSBDS
- PESTO
- PESTO Program Website
- Has yet to be published
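As noted above, the .log files are tab-delimited MCMC traces. For users who prefer scripting over Tracer, the following is a minimal Python sketch for loading one. It is not taken from the original analysis code: the function name `read_trace_log` is hypothetical, and the sketch assumes a single header row, optionally preceded by comment lines starting with `#`.

```python
import csv


def _to_number(value):
    """Convert a field to float when possible; otherwise keep the raw string."""
    try:
        return float(value)
    except ValueError:
        return value


def read_trace_log(path):
    """Read a tab-delimited MCMC log into {column name: list of values}."""
    with open(path, newline="") as handle:
        # Skip blank lines and '#' comment lines that may precede the header.
        rows = (line for line in handle
                if line.strip() and not line.startswith("#"))
        reader = csv.reader(rows, delimiter="\t")
        header = next(reader)
        # A trailing tab produces an empty column name; ignore such columns.
        columns = {name: [] for name in header if name}
        for row in reader:
            for name, value in zip(header, row):
                if name:
                    columns[name].append(_to_number(value))
    return columns
```

The returned dictionary maps each column name in the header to its sampled values, one entry per logged iteration. The column names themselves depend on the program that wrote the log.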
/[Phylogeny_name]
/[Phylogeny_name]/BAMM_[Phylogeny_name]:
This folder contains the following:
- output - Contains the BAMM output files. See the BAMM website for more information.
- run_info.txt - This file contains the run information for the BAMM run.
- control_[Phylogeny_name].cr.pr.txt - The BAMM input file.
- cost.marg.csv - Branch specific rates on the tree. Generated by the getMarginalBranchRateMatrix() function in BAMMtools.
- div.bamm.csv - Mean tip rates generated using the getTipRates() function in the BAMMtools R package.
/[Phylogeny_name]/MSBD_[Phylogeny_name]:
There are three subdirectories, [Phylogeny_name], [Phylogeny_name]_2, and [Phylogeny_name]_3, which correspond to the three independent MCMC chains that were run. The files within each subdirectory are the same. The MSBD analyses, implemented in the program BEAST2, were run on CIPRES, an API for evolutionary analyses that computes on the University of California San Diego supercomputer cluster. Please see the CIPRES BEAST2 page for more information.
- [Phylogeny_name]_default.log - This file contains the general BEAST2 log.
- [Phylogeny_name]_default.states.log - This file contains a BEAST2 log specific to MSBD, specifically the estimate of the type change rate parameter gamma. See the MSBD tutorial.
- [Phylogeny_name]_default.branches.log - This file contains a BEAST2 log specific to MSBD. Of particular interest, it contains the estimate of the shift for the tip edges. See the MSBD tutorial.
- [Phylogeny_name]_default.rates.trees - This file contains the inferred rates for the tips sampled in the MCMC. See the 'MSBD data Wrangle' code snippet in the compProcessing.Rmd file located in our GitHub repository for more information.
- [Phylogeny_name]_default.states.tree - This file records the lambda and mu rates from each sample of the MCMC. The rates are recorded in an annotated Newick file. See the 'MSBD data Wrangle' code snippet in the compProcessing.Rmd file located in our GitHub repository for more information.
- [Phylogeny_name]_msbd_rates.csv - This file contains the individual lambda and mu rates sampled in the MCMC, extracted from the annotated Newick file stored in [Phylogeny_name]_default.states.tree.
- [Phylogeny_name]_msbd_rates_netDivOnly.csv - This file contains net diversification (net-div = lambda - mu) calculated for each sample of the MCMC, based on the [Phylogeny_name]_msbd_rates.csv file.
- infile.xml, infile_altered.xml, infile_altered.xml.states - These three files are the input files for the BEAST2 analysis. They include the phylogeny, the specification of the MSBD model, and the MCMC parameters.
- STDOUT - This contains the BEAST2 information typically printed to the terminal (standard output).
- STERR - This contains the BEAST2 information typically printed to the terminal (standard error).
- term.txt, start.txt, done.txt, _JOBINFO.TXT, scheduler.conf, _scheduler_stderr.txt - These are default CIPRES files regarding the run. Please refer to CIPRES for more information.
/[Phylogeny_name]/LSBDS_[Phylogeny_name]:
This folder contains a number of output files of the LSBDS model implemented in RevBayes. All file names contain "run[#]", which identifies the independent MCMC chain that was run. Below we explain the file types.
- mcmc_LSBDS_[Phylogeny_name].Rev - This is the RevBayes script specifying the model and MCMC.
- [Phylogeny_name]_LSBDS_rates.log - This contains the diversification rate parameters (i.e., lambda, mu, and number of shifts) tracked as part of this model. This is the key file used in the manuscript.
- [Phylogeny_name]_LSBDS_model.log - This contains general parameters tracked by RevBayes, including the posterior probability and the likelihood of the model.
- The following three file types are RevBayes-specific files needed to restart an MCMC. One of each is generated per run. For more information, see the RevBayes website.
- [Phylogeny_name]LSBDS_checkpoint_run[#].state
- [Phylogeny_name]LSBDS_checkpoint_run[#]_moves.state
- [Phylogeny_name]LSBDS_checkpoint_run[#]_mcmc.state
- Occasionally one of the LSBDS_[Phylogeny_name] folders will contain multiple subdirectories (e.g., "chain_1", "chain_2"). This indicates MCMCs that were restarted. Since a restarted RevBayes MCMC generates a number of new files, we placed the files from restarted runs in these subdirectories. The files created by a restarted run are the same as those listed above.
/[Phylogeny_name]/ClaDS2_[Phylogeny_name]
This folder contains:
- [Phylogeny_name].RData - An R data file that contains all the input and output information of the ClaDS2 run.
/Bamm_Restart
This directory contains BAMM outputs for phylogenies that did not converge the first time. The file organization is the same as in BAMM_[Phylogeny_name]; see above.
/PESTO_results
This directory contains the PESTO output files. Each phylogeny has a file [Phylogeny_name].tsv that records the lambda and mu rate estimates.
2. Analysis folder
The CompDiv.tar.gz file contains the organizational structure for the post-method analysis largely done in R. It is organized as follows:
scripts
- compProcessing.Rmd: This is the main R Markdown file for the analysis
- read_newick_string_ex.R: A helper script used by compProcessing.Rmd
- pesto_analyses: contains scripts used to run PESTO
- MSN_reviewerResponse.R: R script used to generate supplementary Fig. S7
- clads: This contains the scripts used to run ClaDS2 in Julia. The folders mainly consist of scripts for functions used by panda_scriptR.R, the principal script used to run ClaDS2.
- panda_scriptR.R - The main script used to generate the phylogeny-specific scripts that run CLaDS. The remaining scripts are small functions used by panda_scriptR.R.
- clads_treecode - This folder contains a series of phylogeny-specific R scripts to run CLaDS2.
compProcessing_output:
This folder contains the outputs and figures created by compProcessing.Rmd. It holds multiple directories and two files (described at the end of this section).
/convergencesAssessement:
This folder contains files with information on convergence assessment. This information is summarized in Table S1.
- BAMM_convergences.csv - Summarizes convergence for the program BAMM. Some of this information is also found in combined_convergence.csv file.
- combined_convergence.csv - a CSV with information summarized in Table S1.
- HenaoDiaz_legend.csv - a CSV matching the specific [Phylogeny_name] file with the reference in Henao Diaz 2019.
- LSBDS_convergences_combinedOnly.csv - Summarizes convergences for combined MCMC of LSBDS. Some of this information is also found in combined_convergence.csv file.
- LSBDS_convergences.csv - Summarizes convergences for individual MCMC chains of LSBDS. Some of this information is also found in combined_convergence.csv.
- LSBDS_gelmanCI_of_point_estiamte.csv - Contains point estimates and MCMC convergence assessment statistics from LSBDS.
- MSBD_conergences.csv - Summarizes convergence for the program MSBD. Some of this information is also found in combined_convergence.csv file.
/diversification_summary_figure:
This folder contains information regarding Fig. 2A, Table 1, Tables S2, Fig. S3 and Fig. S2
- AllMethods: Contains the files for the supplementary analyses that include all 5 methods; see the methods section of the publication.
- diversification_summary_dataforfigure__LSBDSfast_exceptRevbayes_median.csv - Individual data points used in Fig. S4.
- contrasts__LSBDSfast_exceptRevbayes_log_median.csv - Contrasts with p-values used to plot significance in Fig. S4.
- The remaining files are plots of the residuals for each summary statistic (i.e., average or variance) for each model parameter investigated in the study (i.e., speciation, extinction, diversification).
- ExcludesLSBDS: Contains information in the main text only focusing on 4/5 methods, specifically excluding LSBDS.
- diversification_summary_dataforfigure__LSBDSfast_only_median.csv - Individual data points used in Fig. 2A-C.
- contrasts__LSBDSfast_only_log_median.csv - Significant contrasts with p-values used to plot significance in Fig. 2A-C and Fig. S3.
/individual_phylo_results:
This folder contains summarized posterior distributions.
- PosteriorSummaryStatistics: This is a set of CSVs, each corresponding to a phylogeny, that summarize the posterior mean, posterior median, MAP, 95% HPD interval, quantiles, and other relevant information. There are two files for every phylogeny. The files with names that include "_LSBDSfast" contain PESTO values as well.
- Note: in compProcessing.Rmd this directory may be referred to as "point_estimates".
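To illustrate the kind of summaries listed above, here is a minimal Python sketch that computes a posterior mean, median, and HPD interval from a list of MCMC samples. It is not the code used in the study, and the function names are hypothetical. The HPD interval is taken as the shortest window of sorted samples that covers the requested mass, which is one common way to estimate it.

```python
import math
import statistics


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing at least `mass` of the samples."""
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of sorted samples that the interval must span.
    window = max(1, math.ceil(mass * n))
    # Slide a window of that size over the sorted samples; keep the narrowest.
    best_start = min(range(n - window + 1),
                     key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best_start], ordered[best_start + window - 1]


def summarize_posterior(samples, mass=0.95):
    """Summarize samples by mean, median, and HPD interval bounds."""
    lower, upper = hpd_interval(samples, mass)
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "hpd_lower": lower,
        "hpd_upper": upper,
    }
```

The values in the PosteriorSummaryStatistics CSVs were produced by compProcessing.Rmd, so small differences from this sketch are possible depending on how the HPD interval is estimated there.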
/MSN_summary_figure:
This folder contains output files for the mean square analysis in Fig. 2B. Note: In compProcessing.Rmd this directory may be referred to as KF_summary_figure.
- MSN_gmm_sigcontrastspartial.csv - Significant contrasts with p-values corresponding to Fig. 2D-F.
- MSN_dataforfigurepartial.csv - Individual data points used in Fig. 2D-F.
- The remaining files are residual plots for the three parameters:
- lambda_partial_residuals.pdf - Speciation
- mu_partial_residuals.pdf - Extinction
- div_partial_residuals.pdf - net diversification
/postburnin_Posteriors:
This folder contains four directories, one for each method that uses MCMC (LSBDS, CLaDS2, BAMM, MTBD). Each contains the files for a combined MCMC chain, built from the respective runs with the burn-in removed. For each method except BAMM there are two files. The file with the suffix "comindedPosterior" contains the MCMC estimates of the speciation and extinction parameters. The file with the suffix "comindedPosterior_netDivOnly" contains net diversification, the difference between speciation and extinction, calculated for each generation of the MCMC chain. BAMM only has a file corresponding to "comindedPosterior_netDivOnly"; its summarized MCMC of speciation and extinction is called 'cost.marg.csv' and is found in the respective phylogeny folder of Output.tar.gz.
- BAMM - postburnin posteriors for BAMM
- ClaDS2 - postburnin posteriors for ClaDS2
- LSBDS - postburnin posteriors for LSBDS
- MSBD - postburnin posteriors for MTBD
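The post-burn-in files described above combine independent runs and derive net diversification per generation. The steps can be sketched in Python as follows. This is an illustration only, not the study's code: the function names are hypothetical, and the default burn-in fraction of 0.1 is an assumption, since the fraction actually used is not stated here.

```python
def combine_chains(chains, burnin_fraction=0.1):
    """Drop the first burnin_fraction of each chain and concatenate the rest.

    Assumption: burn-in is removed as a leading fraction of each chain;
    0.1 is a placeholder default, not the value used in the study.
    """
    if not 0 <= burnin_fraction < 1:
        raise ValueError("burnin_fraction must be in [0, 1)")
    combined = []
    for chain in chains:
        start = int(len(chain) * burnin_fraction)
        combined.extend(chain[start:])
    return combined


def net_diversification(speciation, extinction):
    """Return speciation minus extinction for each MCMC generation."""
    if len(speciation) != len(extinction):
        raise ValueError("speciation and extinction must have equal length")
    return [lam - mu for lam, mu in zip(speciation, extinction)]
```

Applying `combine_chains` to the speciation and extinction samples with the same burn-in fraction keeps the two lists aligned, so `net_diversification` can then be computed generation by generation.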
/supplementalFigures:
This folder contains one file for plotting Fig. S1
- FigsS1_data.csv - Contains data to plot Fig. S1.
/uncertainty_overlap:
This folder contains the materials for generating Fig. S6 and Fig. S8. Note: In comp_div_processing.Rmd this directory may be referred to as uncertainty.
- HPDInterval_complete.csv - Data to plot Fig. S8
- HPDInterval_contrast.csv - Significant contrast with p-values used to plot Fig. S8
- HPDInterval_effectSize.csv - Effect sizes corresponding to Fig. S8, cited in the paper.
- OverlapRatio_partial.csv - Data used to plot Fig. S7
/Summary_statistics_LSBDSfast_exceptRevbayes.csv:
This file contains the MCMC summary statistics combined into a single file for all methods except LSBDS.
/Summary_statistics_LSBDSfast_only.csv:
This file contains the MCMC summary statistics combined into a single CSV for all methods.