Data from: On the Mkv model with among-character rate variation
Data files
Jun 23, 2025 version files 161.23 MB
- README.md (9.27 KB)
- SupplementaryDataForSubmission_AfterRevisions.zip (161.22 MB)
Abstract
Models used in likelihood-based morphological phylogenetics often adapt molecular phylogenetics models to the specificities of morphological data. Such is the case for the widely used Mkv model, which introduces an acquisition bias correction for sampling only characters that are observed to be variable, and for models of among-character rate variation (ACRV), routinely applied by researchers to relax the equal-rates assumption of Mkv. However, the interaction between variable character acquisition bias and ACRV has not previously been explored. We demonstrate that there are two distinct approaches to conditioning the likelihood on variable characters when there is ACRV, and we call them joint and marginal acquisition bias. Far from being just a trivial mathematical detail, we show that how the variable character conditional likelihood is calculated results in different assumptions about how rate variation is distributed in morphological datasets. Simulations demonstrate that tree length and amount of ACRV in the data are systematically biased when conditioning on variable characters differently from how the data was simulated. Moreover, an empirical case study with extant and extinct taxa reveals a potential impact not only on the estimation of branch lengths but also of phylogenetic relationships. We recommend the use of the marginal acquisition bias approach for morphological datasets modeled with ACRV. Finally, we urge developers of phylogenetic software to clarify which acquisition bias correction is implemented for both estimation and simulation, and we discuss the implications of our findings on modeling variable characters for the future of morphological phylogenetics.
https://doi.org/10.5061/dryad.fxpnvx13g
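The two ways of conditioning on variable characters mentioned in the abstract can be written out for a single character. This is a sketch rather than the paper's own notation: the symbols D_i, r_k, K, V and θ are introduced here for illustration, equal category weights are assumed, and the mapping of the names onto the formulas (joint = conditioning once on the rate-averaged probability of variability; marginal = conditioning within each rate category) is our reading of the abstract, not something this README states.

```latex
% D_i    : observed states at character i
% r_k    : rate of category k, with k = 1..K and equal weights 1/K (assumed)
% V      : the event that a character is variable
% \theta : tree topology and branch lengths
\begin{align}
L_i^{\text{joint}}
  &= \frac{\frac{1}{K}\sum_{k=1}^{K} P(D_i \mid r_k, \theta)}
          {\frac{1}{K}\sum_{k=1}^{K} P(V \mid r_k, \theta)} \\
L_i^{\text{marginal}}
  &= \frac{1}{K}\sum_{k=1}^{K}
     \frac{P(D_i \mid r_k, \theta)}{P(V \mid r_k, \theta)}
\end{align}
```

With a single rate category (K = 1) both expressions reduce to the same Mkv likelihood, so the distinction only arises once ACRV is modeled.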
Description of the data and file structure
This compressed file archive contains all data and scripts used in the simulation study and in the empirical example included in the paper, as well as output tree files, scripts, and plots to check for MCMC convergence, and scripts for plotting results.
Files and variables
File: SupplementaryDataForSubmission_AfterRevisions.zip
Description:
data folder: contains data files used for all analyses
- Gekkota_Villa2023_All.nex: Morphological data in Nexus format, used in the empirical example. Data from Villa (2023).
- Gekkota_Villa2023_Extant.nex: Morphological data in Nexus format, used in the empirical example. Data modified from Villa (2023) to prune out extinct taxa.
- Sim1000Chars_TreeN_MODEL.nex: Simulated morphological data in Nexus format, used in the simulation study. N is the number associated with a simulated tree (1 to 250), while MODEL is the name of the morphological substitution model under which the data was simulated (Mk DiscretizedGamma, jMkv DiscretizedGamma, mMkv DiscretizedGamma).
- Sim1000Chars_TreeN_MODEL.rates.txt: Text file with true rates under which each character in the simulated datasets was simulated. N is the number associated with a simulated tree (1 to 250), while MODEL is the name of the morphological substitution model under which the data was simulated (Mk DiscretizedGamma, jMkv DiscretizedGamma, mMkv DiscretizedGamma).
- Sim100Chars_TreeN_MODEL_alpha0.5.nex: Simulated morphological data (100 characters) in Nexus format, used in the simulation study. N is the number associated with a simulated tree (1 to 250), while MODEL is the name of the morphological substitution model under which the data was simulated (Mk DiscretizedGamma, jMkv DiscretizedGamma, mMkv DiscretizedGamma). The alpha parameter for the discretized gamma distribution was set to 0.5.
- Sim100Chars_TreeN_MODEL_alpha0.5.rates.txt: Text file with true rates under which each character in the simulated datasets (100 characters) was simulated. N is the number associated with a simulated tree (1 to 250), while MODEL is the name of the morphological substitution model under which the data was simulated (Mk DiscretizedGamma, jMkv DiscretizedGamma, mMkv DiscretizedGamma). The alpha parameter for the discretized gamma distribution was set to 0.5.
- sim_trees subfolder: All trees (from 1 to 250) used to simulate data, in Newick format.
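The naming scheme of the simulated datasets above lends itself to programmatic indexing. The following is a minimal Python sketch, not part of the archive: the function name `parse_sim_filename` is hypothetical, and the MODEL token is matched loosely because its exact on-disk spelling (for example `jMkv_DiscretizedGamma`, by analogy with the script names) is an assumption.

```python
import re

# Pattern inferred from the file names listed in this README. The MODEL part
# is matched loosely (any characters) since its exact spelling is assumed.
SIM_FILE = re.compile(
    r"^Sim(?P<n_chars>\d+)Chars_Tree(?P<tree>\d+)_(?P<model>.+?)"
    r"(?:_alpha(?P<alpha>\d+(?:\.\d+)?))?"
    r"\.(?P<kind>nex|rates\.txt)$"
)


def parse_sim_filename(name):
    """Split a simulated-data file name into its components.

    Returns None for names that do not follow the simulated-data pattern
    (for example the empirical Gekkota matrices).
    """
    match = SIM_FILE.match(name)
    if match is None:
        return None
    alpha = match.group("alpha")
    return {
        "n_chars": int(match.group("n_chars")),
        "tree": int(match.group("tree")),
        "model": match.group("model"),
        "alpha": float(alpha) if alpha is not None else None,
        "kind": "matrix" if match.group("kind") == "nex" else "rates",
    }


if __name__ == "__main__":
    # Hypothetical file names built from the documented pattern.
    for example in (
        "Sim1000Chars_Tree12_Mk_DiscretizedGamma.nex",
        "Sim100Chars_Tree7_mMkv_DiscretizedGamma_alpha0.5.rates.txt",
    ):
        print(example, "->", parse_sim_filename(example))
```

Pairing each `.nex` matrix with its `.rates.txt` companion then amounts to grouping the parsed records by `n_chars`, `tree`, `model` and `alpha`.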
output folder: contains all MAP (Maximum A Posteriori) trees from the simulation study and the empirical example.
postprocessing folder: contains scripts and plots to check convergence of MCMC runs, and scripts for plotting results after running phylogenetic analyses
- CheckConvergence.R: R script to check convergence of empirical analyses.
- CheckConvergence_Sims.R: R script to check convergence of simulated data analyses.
- CheckConvergence_Sims_ForCluster.R: R script to check convergence of simulated data analyses, adapted to run in a high-performance computing cluster.
- Convenience_Sims_R.sh: Bash script to run CheckConvergence_Sims_ForCluster.R in a high-performance computing cluster.
- ConvergenceTable_Sims.R: R script to build the convergence table.
- Empirical_NoCharactersExcluded_run_mcmc.sh: Bash script to run empirical analyses under models that do not exclude invariant characters (Mk, Mk DiscretizedGamma) in a high-performance computing cluster.
- Empirical_run_mcmc.sh: Bash script to run empirical analyses under models that exclude invariant characters (Mkv, jMkv DiscretizedGamma, mMkv DiscretizedGamma) in a high-performance computing cluster.
- plot_helper.R: R helper function to run the script SplitDifferences.R.
- PlotAlpha_Simulations.R: R script to plot values of the alpha parameter of gamma-distributed ACRV from simulated data analyses.
- PlotEmpiricalAlpha.R: R script to plot values of the alpha parameter of gamma-distributed ACRV from empirical analyses.
- PlotEmpiricalRateRatio.R: R script to plot lowest-to-highest rate ratios from empirical analyses.
- PlotEmpiricalTreeLengths.R: R script to plot tree lengths from empirical analyses.
- PlotRateRatio_Simulations.R: R script to plot lowest-to-highest rate ratios from simulated data analyses.
- PlotTreeLengths_Simulations.R: R script to plot tree lengths from simulated data analyses.
- PlotRateRatio_Simulations_100Characters.R: R script to plot lowest-to-highest rate ratios from simulated data analyses done with 100 simulated characters.
- PlotTreeLengths_Simulations_100Characters.R: R script to plot tree lengths from simulated data analyses done with 100 simulated characters.
- Simulation_NoCharactersExcluded_run_mcmc.sh: Bash script to run simulated data analyses estimated under models that do not exclude invariant characters (Mk, Mk DiscretizedGamma) in a high-performance computing cluster.
- Simulation_run_mcmc.sh: Bash script to run simulated data analyses estimated under models that exclude invariant characters (Mkv, jMkv DiscretizedGamma, mMkv DiscretizedGamma) in a high-performance computing cluster.
- SplitDifferences.R: R script to plot differences in tree split probabilities between different substitution models.
- 100Characters subfolder: Plots of alpha values and tree lengths from simulated data analyses done with 100 simulated characters.
- convergence subfolder: Plots to evaluate convergence of MCMC runs, both for empirical and simulated data analyses.
- empirical_alpha subfolder: Plots of alpha values from empirical analyses.
- empirical_tree_lengths subfolder: Plots of tree lengths from empirical analyses.
- sim_alpha subfolder: Plots of alpha values from simulated data analyses.
- sim_tree_lengths subfolder: Plots of tree lengths from simulated data analyses.
- splits_differences subfolder: Plots comparing tree split probabilities estimated under different substitution models.
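To make the link between the alpha parameter and the lowest-to-highest rate ratio concrete, here is a minimal Python sketch of a discretized gamma with mean 1. It is not taken from the archive's scripts and rests on several assumptions: four equal-probability categories (`k=4`), category rates taken as the mean within each quantile bin, and the rate ratio defined as the lowest category rate divided by the highest. All function names are hypothetical.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(p, shape, rate):
    """Quantile of Gamma(shape, rate) found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, hi * rate) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, mid * rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discretized_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability bins of Gamma(alpha, alpha)."""
    cuts = [0.0]
    cuts += [gamma_quantile(i / k, alpha, alpha) for i in range(1, k)]
    cuts.append(math.inf)
    rates = []
    for lower, upper in zip(cuts[:-1], cuts[1:]):
        # Probability mass of Gamma(alpha + 1, alpha) in the bin gives the
        # bin's share of the overall mean; scaling by k yields its mean rate.
        mass = (reg_lower_gamma(alpha + 1.0, upper * alpha)
                - reg_lower_gamma(alpha + 1.0, lower * alpha))
        rates.append(k * mass)
    return rates


def lowest_to_highest_ratio(rates):
    """Ratio of the slowest to the fastest category rate (assumed definition)."""
    return min(rates) / max(rates)


if __name__ == "__main__":
    rates = discretized_gamma_rates(0.5, k=4)
    print("category rates:", rates)
    print("lowest-to-highest ratio:", lowest_to_highest_ratio(rates))
```

Smaller alpha values spread the category rates further apart and push the ratio toward zero, while large alpha values bring it toward one, which is why the ratio is a convenient summary of the amount of ACRV.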
scripts folder: contains scripts for running phylogenetic analyses
- EmpiricalAnalysis_NoCharacterExcluded.Rev: Main RevBayes script to run empirical analyses under models that do not exclude invariant characters (Mk, Mk DiscretizedGamma).
- Empirical Analysis.Rev: Main RevBayes script to run empirical analyses under models that exclude invariant characters (Mkv, jMkv DiscretizedGamma, mMkv DiscretizedGamma).
- jMkv_DiscretizedGamma_Sims.Rev: RevBayes script for the jMkv + discretized gamma substitution model, adapted for simulated data analyses.
- jMkv_DiscretizedGamma.Rev: RevBayes script for the jMkv + discretized gamma substitution model.
- Mk_Binary.Rev: RevBayes script for the Mk substitution model, adapted for simulated data analyses.
- Mk_DiscretizedGamma_Binary.Rev: RevBayes script for the Mk + discretized gamma substitution model, adapted for simulated data analyses.
- Mk_DiscretizedGamma.Rev: RevBayes script for the Mk + discretized gamma substitution model.
- Mk.Rev: RevBayes script for the Mk substitution model.
- Mkv_Sims.Rev: RevBayes script for the Mkv substitution model, adapted for simulated data analyses.
- Mkv.Rev: RevBayes script for the Mkv substitution model.
- mMkv_DiscretizedGamma_Sims.Rev: RevBayes script for the mMkv + discretized gamma substitution model, adapted for simulated data analyses.
- mMkv_DiscretizedGamma.Rev: RevBayes script for the mMkv + discretized gamma substitution model.
- SimulatedDataAnalysis_NoCharacterExcluded.Rev: Main RevBayes script to run simulated data analyses under models that do not exclude invariant characters (Mk, Mk DiscretizedGamma).
- SimulatedDataAnalysis.Rev: Main RevBayes script to run simulated data analyses under models that exclude invariant characters (Mkv, jMkv DiscretizedGamma, mMkv DiscretizedGamma).
- SimulateUnrootedTrees.Rev: RevBayes script to simulate trees for the simulation study.
- Simulator_ContinuousGamma.R: R script to simulate data under both the jMkv and mMkv + continuous gamma substitution models.
- Simulator_jMkv_DiscretizedGamma.R: R script to simulate data under the jMkv + discretized gamma substitution model.
- Simulator_Mk_DiscretizedGamma.R: R script to simulate data under the Mk + discretized gamma substitution model.
- Simulator_mMkv_DiscretizedGamma.R: R script to simulate data under the mMkv + discretized gamma substitution model.
- Simulator_RatePlot.R: R script to plot effective rates of simulated characters.
- Uniform_Tree.Rev: RevBayes script for the uniform unrooted tree model.
- Uniform_Tree_GammaDirichlet.Rev: RevBayes script for the uniform unrooted tree model with a compound gamma-Dirichlet prior on tree length.
validation folder: contains scripts used for validation of the implementations of the jMkv + discretized gamma and mMkv + discretized gamma models in RevBayes, as well as results and figures of the validations.
Access information
Empirical morphological data were derived from the following sources:
- Villa, A. (2023). A redescription of Palaeogekko risgoviensis (Squamata, Gekkota) from the Middle Miocene of Germany, with new data on its morphology. PeerJ, 11, e14717.
