Assessing the adequacy of morphological models using Posterior Predictive Simulations

Mulvey, Laura¹; May, Mike²; Brown, Jeremy³; Höhna, Sebastian⁴; Wright, April⁵; Warnock, Rachel¹

Published Oct 07, 2024 on Dryad. https://doi.org/10.5061/dryad.4f4qrfjkq

Data files

Oct 07, 2024 version files (109.37 MB):
- Mulveyetal.zip (109.36 MB)
- README.md (3.13 KB)

Abstract

Reconstructing the evolutionary history of different groups of organisms provides insight into how life originated and diversified on Earth. Phylogenetic trees are commonly used to estimate this evolutionary history. Within Bayesian phylogenetics, a major step in estimating a tree is choosing an appropriate model of character evolution. While the most commonly used character data are molecular sequences, morphological data remain a vital source of information. The use of morphological characters allows for the incorporation of fossil taxa and, despite advances in molecular sequencing, continues to play a significant role in neontology. Moreover, it is the main data source that allows us to unite extinct and extant taxa directly under the same generating process. We therefore require suitable models of morphological character evolution, the most common being the Mk Lewis model. While it is frequently used in both palaeobiology and neontology, it is not known whether the simple Mk substitution model, or any extensions to it, provide a sufficiently good description of the process of morphological evolution. In this study we investigate the impact of different morphological models on empirical tetrapod data sets. Specifically, we compare unpartitioned Mk models with those where characters are partitioned by the number of observed states, both with and without allowing for rate variation across sites and accounting for ascertainment bias. We show that the choice of substitution model has an impact on both topology and branch lengths, highlighting the importance of model choice. Through simulations, we validate the use of the model adequacy approach, posterior predictive simulations, for choosing an appropriate model. Additionally, we compare the performance of model adequacy with Bayesian model selection.
We demonstrate that model selection approaches based on marginal likelihoods are not appropriate for choosing between models with partition schemes that vary in character state space (i.e., that vary in Q-matrix state size). Using posterior predictive simulations, we found that current variations of the Mk model often perform adequately in capturing the evolutionary dynamics that generated our data. We do not find any preference for a particular model extension across multiple data sets, indicating that there is no 'one size fits all' when it comes to morphological data and that careful consideration should be given to choosing models of discrete character evolution. By using suitable models of character evolution, we can increase our confidence in our phylogenetic estimates, which should in turn allow us to gain more accurate insights into the evolutionary history of both extinct and extant taxa.
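
As an illustration of what "partitioned by the number of observed states" can mean in practice, the following is a minimal Python sketch and not the authors' code. It assumes a character matrix stored as a dictionary of taxon name to a string of single-character state codes, with `?` and `-` treated as missing or inapplicable; the function name `partition_by_observed_states` is hypothetical, and polymorphic codings are not handled.

```python
from collections import defaultdict

# Assumption: these symbols mark missing or inapplicable data.
MISSING = {"?", "-"}


def partition_by_observed_states(matrix):
    """Group character (column) indices by the number of distinct observed states.

    `matrix` maps taxon name -> string of single-character state codes.
    Returns a dict mapping state count -> list of 0-based column indices.
    """
    if not matrix:
        return {}
    rows = list(matrix.values())
    n_chars = len(rows[0])
    if any(len(row) != n_chars for row in rows):
        raise ValueError("all taxa must have the same number of characters")

    partitions = defaultdict(list)
    for i in range(n_chars):
        # Collect the states seen in column i, ignoring missing data.
        observed = {row[i] for row in rows} - MISSING
        partitions[len(observed)].append(i)
    return dict(partitions)


example = {"taxon_A": "0102", "taxon_B": "1?11", "taxon_C": "0010"}
partitions = partition_by_observed_states(example)
```

In this toy matrix the first three characters show two observed states and the last shows three, so each group would receive its own Q-matrix of the matching size under a partitioned Mk model.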

SupplementaryInformation.pdf contains plots referenced in the paper
SupplementaryFile1.pdf contains the difference in tree length inferred using different substitution models for all 114 data sets
SupplementaryFile2.pdf contains the difference in tree space inferred using different substitution models for all 114 data sets
PPS_Morphology contains all data and scripts associated with the analyses:
- Empirical-Inf: This contains everything for the empirical analysis. The data directory contains all the morphological data sets analysed. The scripts directory contains all RevBayes scripts for inference and R scripts for downstream analysis.
- PPS-Simulations: This contains the set up for the simulation study:
  * Simulation: The data directory contains the output from the empirical inference of Egi and Shoshani for
  four different models. Simulated data sets are in the data_model directories.
  Scripts contains all the revbayes scripts used for the inference.
  Data was simulated using the Sim.r file.
  * Analysis: Contains two directories, Egi and Shoshani. Each contains the posterior predictive simulation workflow.
  To start the analysis run the scripts in the jobs_"model" files. All revbayes scripts used are in the scripts folder.
  There are five other scripts used in the analysis. sim-start.sh: contains all the commands for each individual run. CheckConvergence.r: ensures that the initial MCMC has reached convergence before simulating new data sets. MorphoSim.r: simulates data sets in R using the phangorn R package. Anaylsis.r: calculates the test statistics; you need to specify on lines 3 and 11 which simulation setup (which model) you are analysing. Cumulative.r: was used to assess the number of replicate data sets necessary to determine the adequacy of a model. This was only used once, after which we simulated 500 data sets for the rest of the models.
  * Stepping Stone: Contains two directories, Egi and Shoshani. Each contains the scripts for the stepping stone analysis. To start the analysis run the scripts in the jobs_"model" files. All revbayes scripts used are in the scripts folder. There are two other scripts used in the analysis. stepping_stone.sh: contains all the commands for each individual run. ml_ss.sh: gets all the marginal likelihoods and puts them into a CSV file; it requires the model name as an argument.
- PPS Empirical: The scripts required to carry out posterior predictive simulations on the empirical data sets.
  All the revbayes scripts are in the scripts directory. Running the files in the jobs_"datasetname" directories will start the analysis. There are four other scripts used in the analysis. MorphoSim.r: simulates data sets in R using the phangorn R package. Test_Statistics.r: calculates all test statistics. Fig-6.r: calculates the consistency index and retention index test statistics. P-vals.r: calculates the posterior predictive p-values.
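
For readers unfamiliar with posterior predictive p-values, the sketch below shows one common way to compute them from a test statistic. It is written in Python for illustration and is not a translation of P-vals.r; the function name `posterior_predictive_p_values`, its arguments, and the doubling rule for the two-sided value are assumptions.

```python
def posterior_predictive_p_values(empirical, simulated):
    """Compare an empirical test statistic with its simulated distribution.

    `empirical` is the statistic computed on the observed data.
    `simulated` holds the same statistic computed on each posterior
    predictive data set (e.g. 500 replicates).
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("at least one simulated value is required")

    # Fraction of simulated values at or below / at or above the empirical one.
    lower = sum(1 for s in simulated if s <= empirical) / n
    upper = sum(1 for s in simulated if s >= empirical) / n

    # Assumption: two-sided value as twice the smaller tail, capped at 1.
    two_sided = min(1.0, 2.0 * min(lower, upper))
    return {"lower": lower, "upper": upper, "two_sided": two_sided}


result = posterior_predictive_p_values(3.0, [1.0, 2.0, 3.0, 4.0, 5.0])
extreme = posterior_predictive_p_values(10.0, [1.0, 2.0, 3.0, 4.0, 5.0])
```

A p-value near 0 or 1 in either tail suggests that the model fails to reproduce that aspect of the empirical data, whereas intermediate values are consistent with an adequate model for that statistic.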
