Fast Bayesian inference of phylogenies from multiple continuous characters
Data files
Jan 31, 2024 version files 25.11 MB
-
README.md
23.41 KB
-
zhang_etal_2023.zip
25.08 MB
Abstract
Time-scaled phylogenetic trees are an ultimate goal of evolutionary biology and a necessary ingredient in comparative studies. The accumulation of genomic data has resolved the tree of life to a great extent, yet timing evolutionary events remains challenging if not impossible without external information such as fossil ages and morphological characters. Methods for incorporating morphology in tree estimation have lagged behind their molecular counterparts, especially in the case of continuous characters. Despite recent advances, such tools are still direly needed as we approach the limits of what molecules can teach us. Here, we implement a suite of state-of-the-art methods for leveraging continuous morphology in phylogenetics, and by conducting extensive simulation studies we thoroughly validate and explore our methods' properties. While retaining model generality and scalability, we make it possible to estimate absolute and relative divergence times from multiple continuous characters while accounting for uncertainty. We compile and analyze one of the most data-type diverse data sets to date, comprised of contemporaneous and ancient molecular sequences, and discrete and continuous characters from living and extinct Carnivora taxa. We conclude by synthesizing lessons about our method's behavior, and suggest future research venues.
README: Supplementary materials for "Fast Bayesian inference of phylogenies from multiple continuous characters"
This file will guide you to reproduce the simulations and analyses in the main manuscript.
1 Simulation study I
This section performs well-calibrated simulation study to validate the implementation of our Brownian motion (BM) model, which produces Figure 1 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/01validation
. Please make sure that the BEAST2 XML template BMPruneLikelihood_CalValSimMorph_template.xml
is inside the folder.
1.1 Simulate Fossilised Birth-death (FBD) trees
Execute the R script by
Rscript 01SimulateFBDTree.R
The parameters of FBD model prior distributions are detailed as follows:
-
div.meanlog
: mean value of the LogNormal distribution for diversification rate -
div.sdlog
: standard deviation of the LogNormal distribution for diversification rate -
turn.alpha
: alpha value of the Beta distribution for turnover rate -
turn.beta
: beta value of the Beta distribution for turnover rate -
sample.alpha
: alpha value of the Beta distribution for sampling proportion -
sample.beta
: beta value of the Beta distribution for sampling proportion
The script will simulate and save the trees to RData/FBDTreeSimulation.RData
, i.e.,
-
trs
: list of 150 simulated FBD trees. Note that we have simulated another 50 trees just in case some trees have sampled ancestor at the root, in which case likelihoods cannot be computed.
1.2 Simulate continuous characters on simulated FBD trees and prepare xml files for BEAST2
Execute the R script by
Rscript 02BMPruneLikelihoodCalValSimMorph.R
The settings of the simulation used in the script include the following variables:
-
K
: number of traits -
n.sim
: number of simulations -
sigmasq.mean
: mean value of the Lognormal distribution for diffusion rate -
sigmasq.sd
: standard deviation of the LogNormal distribution for diffusion rate -
correlation.lower
: lower bond of the Uniform distribution for trait correlations -
correlation.upper
: upper bond of the Uniform distribution for trait correlations -
x0.mean
: mean value of the Normal distribution for ancestral values at the root -
x0.sd
: standard deviation of the Normal distribution for ancestral values at the root
The true values of parameters in simulating will be saved to RData/BMPruneLikelihoodCalValSimMorph.RData
, including
-
trees.2.save
: list of 100 FBD trees to simulate continuous characters on. -
traits.2.save
: list of simulated characters at each tip of each FBD tree -
sigmasq.2.save
: list of diffusion rates of characters on each FBD tree -
correlation.2.save
: list of trait correlations on each FBD tree -
root.2.save
: list of ancestral values at the root of each FBD tree
The R script will also generate 100 xml files and write them to folder xmls/
, i.e., BMPruneLikelihood_calval_1.xml
, ..., BMPruneLikelihood_calval_100.xml
1.3 Compare posterior means against simulated parameter values
Execute the R script by
Rscript 03BMPruneLikelihoodCalValSimMorphPlot.R
The metadata will be saved to RData/BMPruneLikelihoodMorphResults.RData
, including
-
log.varValues.1
,log.varValues.2
,log.varValues.3
,log.varValues.4
: posterior mean and 95% HPD intervals of trait diffusion rates from the BEAST2 log files -
log.covValues.1
,log.covValues.2
,log.covValues.3
,log.covValues.4
,log.covValues.5
,log.covValues.6
: posterior mean and 95% HPD intervals of trait correlations from the BEAST2 log files -
log.rootValues.1
,log.rootValues.2
,log.rootValues.3
,log.rootValues.4
: posterior mean and 95% HPD intervals of ancestral trait values at the root of the tree from the BEAST2 log files -
log.FBDdiversificationRate
: posterior mean and 95% HPD intervals of diversification rate from the BEAST2 log files -
log.FBDturnover
: posterior mean and 95% HPD intervals of turnover rate from the BEAST2 log files -
log.samplingProportion
: posterior mean and 95% HPD intervals of sampling proportion from the BEAST2 log files -
log.TheTree.height
: posterior mean and 95% HPD intervals of tree heights from the BEAST2 log files -
cover.df
: the summarised coverage table
The figures will be exported to folder figures/
(using Latex to compile the .tex
files to generate PDF files), including
-
BMPruneLikelihoodMain.tex
: Figure 1 in the main manuscript, coverage details of 5 parameters to show in the main manuscript, i.e., evolutionary rate of the first character ($r_1$), correlation between the first and second character ($\rho_{12}$), ancestral state of the first character at the root ($y_{0}^1$), root height ($t_{mrca}$) and number of sampled ancestors (S.A. count) -
BMPruneLikelihoodFBDTree.tex
: Supplementary Figure 6, coverage details of the FBD tree model parameters -
BMPruneLikelihoodRootValue.tex
: Supplementary Figure 7, coverage details of the ancestral states at the root of the four characters -
BMPruneLikelihoodTraitRateMatrix.tex
: Supplementary Figure 8, coverage details of 4 evolutionary rates of the four continuous characters and 6 trait correlations between them
Note that we will need to run LogAnalyser and obtain the Effective Sample Size (ESS) of each sampled parameter from the BEAST2 log files, after which the resulting file is saved to figures/BMPruneLikelihoodESS.txt
.
The tables will be exported to folder figures/
, including
-
BMPruneLikelihoodCoverage.txt
: the three columns represent the parameter names and the percentage of simulations in which the posterior mean is covered by the 95% HPD interval (True
indicates covered andFalse
indicates not covered) -
BMPruneLikelihoodEssSummary.txt
: the three columns represent the parameter names, mean and standard deviation of the ESS based on the 100 simulations
2 Simulation study II
This section simulates different number of continuous characters on FBD trees and produces results for Figure 2 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/02simulation
and the following BEAST2 XML templates are inside the folder:
-
BMLikelihoodContFossil_template.xml
-
BMLikelihoodContTraitNr_template.xml
2.1 Figure 2a, 2b and 2c
(2.1.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9a.R
The simulated trees will be saved to RData/SimulatedFBDTrees_Fig9a.RData
, including
-
sim.trees
: a list of 4 simulated FBD trees used in this study -
div.rate
,origin
,sampling.rate
,turn.rate
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates respectively.
(2.1.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 02Fig2a-cSim.R
The true values of the parameters are saved to RData/BMContTraitNrTree1Simulation.RData
, including
-
sim.tree
: the simulated tree used in this analysis, presented in Figure 9a in the Supplementary material -
rate.values.2.save
: shared evolutionary rates of characters on simulated trees
(2.1.3) Compare the estimated and true node times
Execute the R script by
Rscript 03Fig2a-cSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/SimTree1
) that include the estimated divergence times (please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details).
The metadata will be saved to RData/Fig2a-c.RData
, including
-
abv.mean.df
: the data frame of absolute mean of difference between the true node height and the posterior mean height -
fig.3.save
: ggplot object for plotting Figure 2 (a) - (c) in the main manuscript.
2.2 Figure 2d, 2e, 2f and 2g, 2h, 2i
(2.2.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9b.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9b.RData
, including
-
tree.2.save
: a list of 10 simulated FBD trees used in this study -
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(2.2.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 04Fig2d-iSim.R
The true values of the parameters are saved to RData/BMContFossilAgeSimulation.RData
, including
-
sim.tree
: the simulated FBD tree used in this analysis, resented in Figure 9b in the Supplementary material -
tree.prime.2.save
: the trees after pruning fossils fromsim.tree
-
rate.values.2.save
: trait evolutionary rates for simulating continuous character
(2.2.3) Compare the estimated and true node times
Execute the R scripts by
Rscript 05Fig2d-fSimPlot.R
Rscript 06Fig2g-iSimPlot.R
The scripts require .log files from BEAST2 and .txt files (in folder log/Fossil1
, log/Fossil2
, log/Fossil3
, log/Fossil4
) that include the estimated divergence times. Here we only provide the .txt files that summarise the node times and 95% HPD intervals from the MCC trees obtained after BEAST2 analysis. please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details about the .log files and tree files.
The metadata will be saved to RData/Fig2d-f.RData
and RData/Fig2g-i.RData
correspondingly, including
-
abv.mean.df
: the data frame of absolute mean of difference between the true node height and the posterior mean height -
width.mean.df
: the data frame of the width of 95% HPD intervals covering the true node height -
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set -
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates -
fig.2.save
: ggplot object for plotting Figure 2 (d) - (f) / (g) - (i) in the main manuscript.
3 Simulation study III
This section (1) simulates trees with different number of fossils, (2) investigates the adherence of the simulated data to phylogenetic BM model assumptions and (3) compares alternative methods for accounting for character correlation under trees of different sizes. The scripts will produce results for Figure 3 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/02simulation
and the following BEAST2 XML templates are inside the folder:
-
BMLikelihoodContFossil_template.xml
-
BMLikelihoodContTraitNr_template.xml
-
BMLikelihoodContExtant_template.xml
3.1 Figure 3a, 3b and 3c
(3.1.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9c.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9c.RData
, including
-
tree.2.save
: a list of 103 simulated FBD trees used in this study -
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(3.1.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 07Fig3a-cSim.R
The true values of the parameters are saved to RData/BMContFixFossilNrSimulation.RData
, including
-
sim.tree
: the simulated FBD tree used in this analysis, presented in Figure 9c in the Supplementary material -
tree.heights.2.save
: tree height after removing the fossils from thesim.tree
-
rate.values.2.save
: trait evolutionary rates for simulating continuous characters
(3.1.3) Compare the estimated and true node times
Execute the R script by
Rscript 08Fig3a-cSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/Fossil
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3a-c.RData
, including
-
res.df
: the data frame contains (1) absolute values of tree height, (2) absolute values of averaged internal node times and (3) width of 95% HPD intervals of node times between the true and estimated values -
fig.2.save
: ggplot object for plotting Figure 3 (a) - (c) in the main manuscript.
3.2 Figure 3d, 3e and 3f
Note that this simulation will use the simulated trees in RData/SimulatedFBDTrees_Fig9b.RData
.
(3.2.1) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 09Fig3d-fSim.R
The true values of the parameters are saved to RData/BMWhiteNoiseSimulation.RData
, including
-
sim.tree
: the simulated FBD tree used in this analysis -
tree.height
: height of the simulated FBD tree -
rate.values.2.save
: trait evolutionary rates for simulating continuous characters
(3.2.2) Compare the estimated and true node times
Execute the R script by
Rscript 10Fig3d-fSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/WhiteNoise
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3d-f.RData
, including
-
width.mean.df
: the data frame of mean width of 95% HPD intervals covering true node times -
abv.mean.df
: the data frame of absolute values of difference between true and estimated node times -
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set -
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates -
fig.2.save
: ggplot object for plotting Figure 3 (d) - (f) in the main manuscript.
3.3 Figure 3g, 3h and i
(3.3.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9d.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9d.RData
, including
-
tree.2.save
: a list of 112 simulated FBD trees used in this study -
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(3.3.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 11Fig3g-iSim.R
The true values of the parameters are saved to RData/SimulationContExtantProcess.RData
, including
-
sim.tree
: the simulated FBD tree used in this analysis, presented in Figure 9d in the Supplementary material -
derived.tree.2.save
: a list of 4 trees after removing extant species fromsim.tree
-
true.times
: a list of true node times corresponding to the derived tree -
mean.cor.sh
: mean value of the trait correlations obtained from shrinkage estimate -
true.tree.height
: tree height of simulated FBD trees inderived.tree.2.save
(3.3.3) Compare the estimated and true node times
Execute the R script by
Rscript 12Fig3g-iSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/Extant
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3g-i.RData
, including
-
width.mean.df
: the data frame of mean width of 95% HPD intervals covering true node times -
abv.mean.df
: the data frame of absolute values of difference between true and estimated node times -
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set -
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates -
fig.2.save
: ggplot object for plotting Figure 3 (g) - (i) in the main manuscript.
4 Carnivora data set analysis
This section runs Bayesian phylogenetic analysis of Carnivora data set with molecular and morphological (continuous and discrete) characters. The scripts will (1) plot the maximum-clade-credibility summary trees for Figure 4, (2) summarise the marginal likelihood and calculate the Bayes factor for Table 2 and (3) plot multidimensional scaling of Euclidean distances between Carnivora species cranium landmarks for Figure 5.
Make sure your working directory is at zhang_etal_2023/03carnivora
. We also provide the JAR file contraband.jar
for running the BEAST2 analysis.
4.1 Running total-evidence dating analysis
Note that the suffix "Run1, Run2 and Run3" refer to the different initial values for the parameters.
- random: without any constraints
java -jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run1.xml
java -jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run2.xml
java -jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run3.xml
java -jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run1.xml
java -jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run2.xml
java -jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run3.xml
- group: constraint the Smilodon and Hyaenas groups, as well as an outgroup (Echinosorex_gymnura and Erinaceus_europaeus)
java -jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run1.xml
java -jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run2.xml
java -jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run3.xml
java -jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run1.xml
java -jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run2.xml
java -jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run3.xml
- dogcat: constraint the Feliformia and Caniformia groups, as well as an outgroup (Echinosorex_gymnura and Erinaceus_europaeus)
java -jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run1.xml
java -jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run2.xml
java -jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run3.xml
java -jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run1.xml
java -jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run2.xml
java -jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run3.xml
4.2 Estimating marginal likelihood using nested sampling
Note that the suffix "NS1" refers to the initial chains and "NS_XX" refers to the chains with appropriate particle count.
- random
java -jar contraband.jar random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS1.xml
java -jar contraband.jar random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS11.xml
java -jar contraband.jar random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS1.xml
java -jar contraband.jar random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS10.xml
- group
java -jar contraband.jar group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS1.xml
java -jar contraband.jar group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS11.xml
java -jar contraband.jar group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS1.xml
java -jar contraband.jar group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS11.xml
- dogcat
java -jar contraband.jar dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS1.xml
java -jar contraband.jar dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS9.xml
java -jar contraband.jar dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS1.xml
java -jar contraband.jar dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS12.xml
4.3 Summarising trees and marginal likelihood
(4.3.1) After removing burnin, we combined the .log and .trees files from BEAST2 analysis using LogCombiner
. The Maximum clade credibility (MCC) trees are obtained using TreeAnnotator
. The internal node times are summarised based on posterior mean heights.
The resulting MCC trees are located in the following folders:
-
dogcat/BDSS/MCC_MolContDisc_BDSS_27Taxa_dogcat_mean.tree
-
dogcat/FBD/MCC_MolContDisc_FBD_27Taxa_dogcat_mean.tree
-
group/BDSS/MCC_MolContDisc_BDSS_27Taxa_group_mean.tree
-
group/FBD/MCC_MolContDisc_FBD_27Taxa_group_mean.tree
-
random/BDSS/MCC_MolContDisc_BDSS_27Taxa_random_mean.tree
-
random/FBD/MCC_MolContDisc_FBD_27Taxa_random_mean.tree
(4.3.2) The estimated marginal likelihood is summarised using NSLogAnalyser
, after which the generated files are written to the following folders:
-
dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS12.txt
-
dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS9.txt
-
group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS11.txt
-
group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS11.txt
-
random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS10.txt
-
random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS11.txt
(4.3.3) The following script will use the summarised trees and log files to make Figure 4 and Table 2 in the main manuscript.
Rscript MCCTreePlot.R
Note that the script requires to load the Excel file data/Carnivora.xlsx
that contains the detailed information of the data set used in our analysis, including Taxon (species names), data types (molecular, continuous, discrete), Specimen mid point age and Specimen age range, which is referred to the following literature:
-
Sheet1
: combined table ofSheet2
andSheet3
after removing duplicates -
Sheet2
: Alvarez Carretero et al. 2019 -
Sheet3
: Barrett el al. 2021
The output files include:
-
figures/BayesFactor.txt
: tab-separate format text file for Table 2 in the main manuscript -
figures/MCCTreesRandom.tex
: file for Latex to compile and generate Figure 4 in the main manuscript
(4.4) Multi-dimensional scaling (MDS) of cranium landmarks
The following script will (1) read the alignment of the cranium landmarks from Carnivora species, (2) calculate Euclidean distances of the 3-dimensional data and (3) perform MDS and produce Figure 5 in the main manuscript.
Rscript data/TraitValueMDS.R
The script has an option to load the following files, either
-
data/MorphologicalAlignments.RData
: aC
matrix of aligned 3-dimensional cranium landmarks. NOTE that theC
matrix is obtained from the Github repository https://github.com/dosreislab/mcmc3r/blob/master/vignettes/Reproduce_Carnivora_analysis.Rmd. After loading, the script will perform MDS based on the calculated Euclidean distances.
or
-
data/TraitMDS.RData
: a list of MDS results we have performed previously. Please refer to https://github.com/Rong419/MorphologyDocument/blob/master/manuscript/rscript/TraitValueMDS_resubmit2.R for more details.
The R script will finally export the file /figures/figureCarnivoraTraitsMDS.tex
for Latex to compile and generate Figure 5 in the main manuscript.
Methods
1. The data for well-calibrated validation and simulation studies can be simulated by running R scripts inside corresponding folders (i.e., 01validation, 02simulation and 03carnivora).
2. The XML files are used for Carnivoran analyses. The molecular and continuous traits data can be obtained from Sandra Álvarez-Carretero et al., 2019. The discrete trait data can be obtained from Paul Z. Barrett et al., 2021.
3. Data processing details can be found on the Github repository https://github.com/Rong419/MorphologyDocument.git.