Fast Bayesian inference of phylogenies from multiple continuous characters
Cite this dataset
Zhang, Rong; Zhang, Rong; Drummond, Alexei; Mendes, Fabio (2024). Fast Bayesian inference of phylogenies from multiple continuous characters [Dataset]. Dryad. https://doi.org/10.5061/dryad.pnvx0k6vj
Abstract
Timescaled phylogenetic trees are an ultimate goal of evolutionary biology and a necessary ingredient in comparative studies. The accumulation of genomic data has resolved the tree of life to a great extent, yet timing evolutionary events remains challenging if not impossible without external information such as fossil ages and morphological characters. Methods for incorporating morphology in tree estimation have lagged behind their molecular counterparts, especially in the case of continuous characters. Despite recent advances, such tools are still direly needed as we approach the limits of what molecules can teach us. Here, we implement a suite of stateoftheart methods for leveraging continuous morphology in phylogenetics, and by conducting extensive simulation studies we thoroughly validate and explore our methods' properties. While retaining model generality and scalability, we make it possible to estimate absolute and relative divergence times from multiple continuous characters while accounting for uncertainty. We compile and analyze one of the most datatype diverse data sets to date, comprised of contemporaneous and ancient molecular sequences, and discrete and continuous characters from living and extinct Carnivora taxa. We conclude by synthesizing lessons about our method's behavior, and suggest future research venues.
README: Supplementary materials for "Fast Bayesian inference of phylogenies from multiple continuous characters"
This file will guide you to reproduce the simulations and analyses in the main manuscript.
1 Simulation study I
This section performs wellcalibrated simulation study to validate the implementation of our Brownian motion (BM) model, which produces Figure 1 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/01validation
. Please make sure that the BEAST2 XML template BMPruneLikelihood_CalValSimMorph_template.xml
is inside the folder.
1.1 Simulate Fossilised Birthdeath (FBD) trees
Execute the R script by
Rscript 01SimulateFBDTree.R
The parameters of FBD model prior distributions are detailed as follows:

div.meanlog
: mean value of the LogNormal distribution for diversification rate 
div.sdlog
: standard deviation of the LogNormal distribution for diversification rate 
turn.alpha
: alpha value of the Beta distribution for turnover rate 
turn.beta
: beta value of the Beta distribution for turnover rate 
sample.alpha
: alpha value of the Beta distribution for sampling proportion 
sample.beta
: beta value of the Beta distribution for sampling proportion
The script will simulate and save the trees to RData/FBDTreeSimulation.RData
, i.e.,

trs
: list of 150 simulated FBD trees. Note that we have simulated another 50 trees just in case some trees have sampled ancestor at the root, in which case likelihoods cannot be computed.
1.2 Simulate continuous characters on simulated FBD trees and prepare xml files for BEAST2
Execute the R script by
Rscript 02BMPruneLikelihoodCalValSimMorph.R
The settings of the simulation used in the script include the following variables:

K
: number of traits 
n.sim
: number of simulations 
sigmasq.mean
: mean value of the Lognormal distribution for diffusion rate 
sigmasq.sd
: standard deviation of the LogNormal distribution for diffusion rate 
correlation.lower
: lower bond of the Uniform distribution for trait correlations 
correlation.upper
: upper bond of the Uniform distribution for trait correlations 
x0.mean
: mean value of the Normal distribution for ancestral values at the root 
x0.sd
: standard deviation of the Normal distribution for ancestral values at the root
The true values of parameters in simulating will be saved to RData/BMPruneLikelihoodCalValSimMorph.RData
, including

trees.2.save
: list of 100 FBD trees to simulate continuous characters on. 
traits.2.save
: list of simulated characters at each tip of each FBD tree 
sigmasq.2.save
: list of diffusion rates of characters on each FBD tree 
correlation.2.save
: list of trait correlations on each FBD tree 
root.2.save
: list of ancestral values at the root of each FBD tree
The R script will also generate 100 xml files and write them to folder xmls/
, i.e., BMPruneLikelihood_calval_1.xml
, ..., BMPruneLikelihood_calval_100.xml
1.3 Compare posterior means against simulated parameter values
Execute the R script by
Rscript 03BMPruneLikelihoodCalValSimMorphPlot.R
The metadata will be saved to RData/BMPruneLikelihoodMorphResults.RData
, including

log.varValues.1
,log.varValues.2
,log.varValues.3
,log.varValues.4
: posterior mean and 95% HPD intervals of trait diffusion rates from the BEAST2 log files 
log.covValues.1
,log.covValues.2
,log.covValues.3
,log.covValues.4
,log.covValues.5
,log.covValues.6
: posterior mean and 95% HPD intervals of trait correlations from the BEAST2 log files 
log.rootValues.1
,log.rootValues.2
,log.rootValues.3
,log.rootValues.4
: posterior mean and 95% HPD intervals of ancestral trait values at the root of the tree from the BEAST2 log files 
log.FBDdiversificationRate
: posterior mean and 95% HPD intervals of diversification rate from the BEAST2 log files 
log.FBDturnover
: posterior mean and 95% HPD intervals of turnover rate from the BEAST2 log files 
log.samplingProportion
: posterior mean and 95% HPD intervals of sampling proportion from the BEAST2 log files 
log.TheTree.height
: posterior mean and 95% HPD intervals of tree heights from the BEAST2 log files 
cover.df
: the summarised coverage table
The figures will be exported to folder figures/
(using Latex to compile the .tex
files to generate PDF files), including

BMPruneLikelihoodMain.tex
: Figure 1 in the main manuscript, coverage details of 5 parameters to show in the main manuscript, i.e., evolutionary rate of the first character ($r_1$), correlation between the first and second character ($\rho_{12}$), ancestral state of the first character at the root ($y_{0}^1$), root height ($t_{mrca}$) and number of sampled ancestors (S.A. count) 
BMPruneLikelihoodFBDTree.tex
: Supplementary Figure 6, coverage details of the FBD tree model parameters 
BMPruneLikelihoodRootValue.tex
: Supplementary Figure 7, coverage details of the ancestral states at the root of the four characters 
BMPruneLikelihoodTraitRateMatrix.tex
: Supplementary Figure 8, coverage details of 4 evolutionary rates of the four continuous characters and 6 trait correlations between them
Note that we will need to run LogAnalyser and obtain the Effective Sample Size (ESS) of each sampled parameter from the BEAST2 log files, after which the resulting file is saved to figures/BMPruneLikelihoodESS.txt
.
The tables will be exported to folder figures/
, including

BMPruneLikelihoodCoverage.txt
: the three columns represent the parameter names and the percentage of simulations in which the posterior mean is covered by the 95% HPD interval (True
indicates covered andFalse
indicates not covered) 
BMPruneLikelihoodEssSummary.txt
: the three columns represent the parameter names, mean and standard deviation of the ESS based on the 100 simulations
2 Simulation study II
This section simulates different number of continuous characters on FBD trees and produces results for Figure 2 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/02simulation
and the following BEAST2 XML templates are inside the folder:

BMLikelihoodContFossil_template.xml

BMLikelihoodContTraitNr_template.xml
2.1 Figure 2a, 2b and 2c
(2.1.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9a.R
The simulated trees will be saved to RData/SimulatedFBDTrees_Fig9a.RData
, including

sim.trees
: a list of 4 simulated FBD trees used in this study 
div.rate
,origin
,sampling.rate
,turn.rate
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates respectively.
(2.1.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 02Fig2acSim.R
The true values of the parameters are saved to RData/BMContTraitNrTree1Simulation.RData
, including

sim.tree
: the simulated tree used in this analysis, presented in Figure 9a in the Supplementary material 
rate.values.2.save
: shared evolutionary rates of characters on simulated trees
(2.1.3) Compare the estimated and true node times
Execute the R script by
Rscript 03Fig2acSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/SimTree1
) that include the estimated divergence times (please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details).
The metadata will be saved to RData/Fig2ac.RData
, including

abv.mean.df
: the data frame of absolute mean of difference between the true node height and the posterior mean height 
fig.3.save
: ggplot object for plotting Figure 2 (a)  (c) in the main manuscript.
2.2 Figure 2d, 2e, 2f and 2g, 2h, 2i
(2.2.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9b.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9b.RData
, including

tree.2.save
: a list of 10 simulated FBD trees used in this study 
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(2.2.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 04Fig2diSim.R
The true values of the parameters are saved to RData/BMContFossilAgeSimulation.RData
, including

sim.tree
: the simulated FBD tree used in this analysis, resented in Figure 9b in the Supplementary material 
tree.prime.2.save
: the trees after pruning fossils fromsim.tree

rate.values.2.save
: trait evolutionary rates for simulating continuous character
(2.2.3) Compare the estimated and true node times
Execute the R scripts by
Rscript 05Fig2dfSimPlot.R
Rscript 06Fig2giSimPlot.R
The scripts require .log files from BEAST2 and .txt files (in folder log/Fossil1
, log/Fossil2
, log/Fossil3
, log/Fossil4
) that include the estimated divergence times. Here we only provide the .txt files that summarise the node times and 95% HPD intervals from the MCC trees obtained after BEAST2 analysis. please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details about the .log files and tree files.
The metadata will be saved to RData/Fig2df.RData
and RData/Fig2gi.RData
correspondingly, including

abv.mean.df
: the data frame of absolute mean of difference between the true node height and the posterior mean height 
width.mean.df
: the data frame of the width of 95% HPD intervals covering the true node height 
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set 
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates 
fig.2.save
: ggplot object for plotting Figure 2 (d)  (f) / (g)  (i) in the main manuscript.
3 Simulation study III
This section (1) simulates trees with different number of fossils, (2) investigates the adherence of the simulated data to phylogenetic BM model assumptions and (3) compares alternative methods for accounting for character correlation under trees of different sizes. The scripts will produce results for Figure 3 in the main manuscript.
Make sure your working directory is at zhang_etal_2023/02simulation
and the following BEAST2 XML templates are inside the folder:

BMLikelihoodContFossil_template.xml

BMLikelihoodContTraitNr_template.xml

BMLikelihoodContExtant_template.xml
3.1 Figure 3a, 3b and 3c
(3.1.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9c.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9c.RData
, including

tree.2.save
: a list of 103 simulated FBD trees used in this study 
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(3.1.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 07Fig3acSim.R
The true values of the parameters are saved to RData/BMContFixFossilNrSimulation.RData
, including

sim.tree
: the simulated FBD tree used in this analysis, presented in Figure 9c in the Supplementary material 
tree.heights.2.save
: tree height after removing the fossils from thesim.tree

rate.values.2.save
: trait evolutionary rates for simulating continuous characters
(3.1.3) Compare the estimated and true node times
Execute the R script by
Rscript 08Fig3acSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/Fossil
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3ac.RData
, including

res.df
: the data frame contains (1) absolute values of tree height, (2) absolute values of averaged internal node times and (3) width of 95% HPD intervals of node times between the true and estimated values 
fig.2.save
: ggplot object for plotting Figure 3 (a)  (c) in the main manuscript.
3.2 Figure 3d, 3e and 3f
Note that this simulation will use the simulated trees in RData/SimulatedFBDTrees_Fig9b.RData
.
(3.2.1) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 09Fig3dfSim.R
The true values of the parameters are saved to RData/BMWhiteNoiseSimulation.RData
, including

sim.tree
: the simulated FBD tree used in this analysis 
tree.height
: height of the simulated FBD tree 
rate.values.2.save
: trait evolutionary rates for simulating continuous characters
(3.2.2) Compare the estimated and true node times
Execute the R script by
Rscript 10Fig3dfSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/WhiteNoise
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3df.RData
, including

width.mean.df
: the data frame of mean width of 95% HPD intervals covering true node times 
abv.mean.df
: the data frame of absolute values of difference between true and estimated node times 
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set 
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates 
fig.2.save
: ggplot object for plotting Figure 3 (d)  (f) in the main manuscript.
3.3 Figure 3g, 3h and i
(3.3.1) Simulate FBD trees
Execute the R script by
Rscript 01TreeSim_Fig9d.R
The simulated trees are saved to RData/SimulatedFBDTrees_Fig9d.RData
, including

tree.2.save
: a list of 112 simulated FBD trees used in this study 
diversification.rate.2.save
,origin.2.save
,sampling.proportion.2.save
,turn.over.2.save
,tree.height.2.save
: arrays of FBD model parameters, i.e., diversification rate, origin, sampling proportion, turnover rates and tree height respectively.
(3.3.2) Simulate continuous characters on the simulated tree and write xml files for BEAST2 analysis
Execute the R script by
Rscript 11Fig3giSim.R
The true values of the parameters are saved to RData/SimulationContExtantProcess.RData
, including

sim.tree
: the simulated FBD tree used in this analysis, presented in Figure 9d in the Supplementary material 
derived.tree.2.save
: a list of 4 trees after removing extant species fromsim.tree

true.times
: a list of true node times corresponding to the derived tree 
mean.cor.sh
: mean value of the trait correlations obtained from shrinkage estimate 
true.tree.height
: tree height of simulated FBD trees inderived.tree.2.save
(3.3.3) Compare the estimated and true node times
Execute the R script by
Rscript 12Fig3giSimPlot.R
The script requires .log files from BEAST2 and .txt files (in folder log/Extant
) that include the estimated divergence times. Please refer to our Github repository https://github.com/Rong419/MorphologyDocument for more details.
The metadata will be saved to RData/Fig3gi.RData
, including

width.mean.df
: the data frame of mean width of 95% HPD intervals covering true node times 
abv.mean.df
: the data frame of absolute values of difference between true and estimated node times 
sampled.tree.height
: the data frame of tree height (difference and absolute value) for each replicated data set 
tree.height.mean
: the data frame of tree height (difference and absolute value) averaged over each 20 data sets replicates 
fig.2.save
: ggplot object for plotting Figure 3 (g)  (i) in the main manuscript.
4 Carnivora data set analysis
This section runs Bayesian phylogenetic analysis of Carnivora data set with molecular and morphological (continuous and discrete) characters. The scripts will (1) plot the maximumcladecredibility summary trees for Figure 4, (2) summarise the marginal likelihood and calculate the Bayes factor for Table 2 and (3) plot multidimensional scaling of Euclidean distances between Carnivora species cranium landmarks for Figure 5.
Make sure your working directory is at zhang_etal_2023/03carnivora
. We also provide the JAR file contraband.jar
for running the BEAST2 analysis.
4.1 Running totalevidence dating analysis
Note that the suffix "Run1, Run2 and Run3" refer to the different initial values for the parameters.
 random: without any constraints
java jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run1.xml
java jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run2.xml
java jar contraband.jar random/FBD/MolContDisc_FBD_27Taxa_Run3.xml
java jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run1.xml
java jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run2.xml
java jar contraband.jar random/BDSS/MolContDisc_BDSS_27Taxa_Run3.xml
 group: constraint the Smilodon and Hyaenas groups, as well as an outgroup (Echinosorex_gymnura and Erinaceus_europaeus)
java jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run1.xml
java jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run2.xml
java jar contraband.jar group/FBD/MolContDisc_FBD_27Taxa_group_Run3.xml
java jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run1.xml
java jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run2.xml
java jar contraband.jar group/BDSS/MolContDisc_BDSS_27Taxa_group_Run3.xml
 dogcat: constraint the Feliformia and Caniformia groups, as well as an outgroup (Echinosorex_gymnura and Erinaceus_europaeus)
java jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run1.xml
java jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run2.xml
java jar contraband.jar dogcat/FBD/MolContDisc_FBD_27Taxa_dogcat_Run3.xml
java jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run1.xml
java jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run2.xml
java jar contraband.jar dogcat/FBD/MolContDisc_BDSS_27Taxa_dogcat_Run3.xml
4.2 Estimating marginal likelihood using nested sampling
Note that the suffix "NS1" refers to the initial chains and "NS_XX" refers to the chains with appropriate particle count.
 random
java jar contraband.jar random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS1.xml
java jar contraband.jar random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS11.xml
java jar contraband.jar random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS1.xml
java jar contraband.jar random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS10.xml
 group
java jar contraband.jar group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS1.xml
java jar contraband.jar group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS11.xml
java jar contraband.jar group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS1.xml
java jar contraband.jar group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS11.xml
 dogcat
java jar contraband.jar dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS1.xml
java jar contraband.jar dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS9.xml
java jar contraband.jar dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS1.xml
java jar contraband.jar dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS12.xml
4.3 Summarising trees and marginal likelihood
(4.3.1) After removing burnin, we combined the .log and .trees files from BEAST2 analysis using LogCombiner
. The Maximum clade credibility (MCC) trees are obtained using TreeAnnotator
. The internal node times are summarised based on posterior mean heights.
The resulting MCC trees are located in the following folders:

dogcat/BDSS/MCC_MolContDisc_BDSS_27Taxa_dogcat_mean.tree

dogcat/FBD/MCC_MolContDisc_FBD_27Taxa_dogcat_mean.tree

group/BDSS/MCC_MolContDisc_BDSS_27Taxa_group_mean.tree

group/FBD/MCC_MolContDisc_FBD_27Taxa_group_mean.tree

random/BDSS/MCC_MolContDisc_BDSS_27Taxa_random_mean.tree

random/FBD/MCC_MolContDisc_FBD_27Taxa_random_mean.tree
(4.3.2) The estimated marginal likelihood is summarised using NSLogAnalyser
, after which the generated files are written to the following folders:

dogcat/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_dogcat_NS12.txt

dogcat/FBD/nestedsampling/MolContDisc_FBD_27Taxa_dogcat_NS9.txt

group/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_group_NS11.txt

group/FBD/nestedsampling/MolContDisc_FBD_27Taxa_group_NS11.txt

random/BDSS/nestedsampling/MolContDisc_BDSS_27Taxa_NS10.txt

random/FBD/nestedsampling/MolContDisc_FBD_27Taxa_NS11.txt
(4.3.3) The following script will use the summarised trees and log files to make Figure 4 and Table 2 in the main manuscript.
Rscript MCCTreePlot.R
Note that the script requires to load the Excel file data/Carnivora.xlsx
that contains the detailed information of the data set used in our analysis, including Taxon (species names), data types (molecular, continuous, discrete), Specimen mid point age and Specimen age range, which is referred to the following literature:

Sheet1
: combined table ofSheet2
andSheet3
after removing duplicates 
Sheet2
: Alvarez Carretero et al. 2019 
Sheet3
: Barrett el al. 2021
The output files include:

figures/BayesFactor.txt
: tabseparate format text file for Table 2 in the main manuscript 
figures/MCCTreesRandom.tex
: file for Latex to compile and generate Figure 4 in the main manuscript
(4.4) Multidimensional scaling (MDS) of cranium landmarks
The following script will (1) read the alignment of the cranium landmarks from Carnivora species, (2) calculate Euclidean distances of the 3dimensional data and (3) perform MDS and produce Figure 5 in the main manuscript.
Rscript data/TraitValueMDS.R
The script has an option to load the following files, either

data/MorphologicalAlignments.RData
: aC
matrix of aligned 3dimensional cranium landmarks. NOTE that theC
matrix is obtained from the Github repository https://github.com/dosreislab/mcmc3r/blob/master/vignettes/Reproduce_Carnivora_analysis.Rmd. After loading, the script will perform MDS based on the calculated Euclidean distances.
or

data/TraitMDS.RData
: a list of MDS results we have performed previously. Please refer to https://github.com/Rong419/MorphologyDocument/blob/master/manuscript/rscript/TraitValueMDS_resubmit2.R for more details.
The R script will finally export the file /figures/figureCarnivoraTraitsMDS.tex
for Latex to compile and generate Figure 5 in the main manuscript.
Methods
1. The data for wellcalibrated validation and simulation studies can be simulated by running R scripts inside corresponding folders (i.e., 01validation, 02simulation and 03carnivora).
2. The XML files are used for Carnivoran analyses. The molecular and continuous traits data can be obtained from Sandra ÁlvarezCarretero et al., 2019. The discrete trait data can be obtained from Paul Z. Barrett et al., 2021.
3. Data processing details can be found on the Github repository https://github.com/Rong419/MorphologyDocument.git.
Funding
China Scholarship Council, Award: No. 201706990021
Royal Society of New Zealand, Award: 16UOA277
National Science Foundation, Award: DEB2040347