Data from: How many characters are needed to reconstruct a phylogeny?

Published Sep 26, 2025 on Dryad. https://doi.org/10.5061/dryad.63xsj3vd8

Data files

Sep 26, 2025 version files (257.88 MB total)

README.md (5.64 KB)
SupplementaryDataForSubmission.zip (257.88 MB)

Abstract

Despite increased recent attention towards Bayesian phylogenetics and its applications in understanding macroevolutionary processes, it remains unclear how many discrete characters are needed to accurately estimate tree topologies in a Bayesian framework. This could be particularly relevant for morphological datasets used in phylogenetics, as they usually consist of a few dozen to a few hundred characters, orders of magnitude fewer than most molecular datasets. I designed a simulation study in the software RevBayes to explore how the number of sampled discrete characters affects the accuracy and precision of Bayesian phylogenetic estimates, under various setups differing in number of taxa, average number of state changes per character (i.e., tree length), and number of states per character. Results indicate that between 100 and 500 variable characters are necessary to reach sufficient accuracy and precision of phylogenetic estimates for trees with as few as 20 tips. All other parameters being equal, multistate characters produce slightly more accurate estimates than binary characters, and more labile characters produce more accurate estimates for trees above 50 tips. The results of this study highlight the continuing need for global research efforts geared towards the characterization and digitization of interspecific morphological diversity in both extant and extinct taxa.

Dataset DOI: 10.5061/dryad.63xsj3vd8

Description of the data and file structure

This compressed file archive contains all data and scripts used in the simulation study, as well as scripts to check for MCMC convergence, calculate metrics of tree accuracy and precision, and to plot results.

Files and variables

File: SupplementaryDataForSubmission.zip

Description:

data folder: Contains data files used for the simulation study.

sim_trees subfolder: Contains all trees used to simulate data, in Newick format. It is organized in additional subfolders depending on the number of taxa (tips) of the tree (5, 10, 20, 50, 100, or 200), and on the expected tree length (1, 3, or 10).
Ntaxa subfolders: Contain all simulated datasets in Nexus format, divided by number of taxa (N = 5, 10, 20, 50, 100, or 200), and with additional subfolders depending on the expected tree length (1, 3, or 10) and on the number of states per character (2, 3, or 4). Simulated data filenames are in the format XCharacters_TreeY_Mkv.nex, where X is the number of characters (20, 50, 100, 500, 1000, or 5000) and Y is the simulation replicate (from 1 to 50).
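The filename convention above can be parsed programmatically, for example to check that one folder holds every combination of character count and replicate. The following Python sketch is an illustration only and is not part of the archive; the function and class names are hypothetical, and only the filename pattern, the character counts, and the replicate range are taken from the description above.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern for simulated-data filenames as described in this README:
# XCharacters_TreeY_Mkv.nex (X = number of characters, Y = replicate).
FILENAME_RE = re.compile(r"^(\d+)Characters_Tree(\d+)_Mkv\.nex$")


class SimFile(NamedTuple):
    """Hypothetical container for the two values encoded in a filename."""
    n_characters: int
    replicate: int


def parse_sim_filename(name: str) -> Optional[SimFile]:
    """Return (n_characters, replicate), or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return SimFile(int(match.group(1)), int(match.group(2)))


def expected_filenames(char_counts=(20, 50, 100, 500, 1000, 5000),
                       n_replicates: int = 50) -> List[str]:
    """List the filenames expected in one taxa/tree-length/state-number folder."""
    return [f"{x}Characters_Tree{y}_Mkv.nex"
            for x in char_counts
            for y in range(1, n_replicates + 1)]
```

With six character counts and 50 replicates, each innermost folder would be expected to hold 300 Nexus files. The names of the intermediate subfolders are not stated here, so the sketch deliberately works on bare filenames.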

output folder: Contains calculated metrics used to assess the accuracy and precision of phylogenetic estimates obtained from all simulated data, as well as MCMC convergence assessments.

convergence subfolder: Contains assessment of convergence of continuous parameters, done using the R package convenience.
ExpectedAverageBranchLengths.csv: Comma-separated-values table with average branch lengths calculated for each combination of tree length and number of taxa.
meanMAPCladePP subfolder: Contains mean posterior probabilities of clades in the MAP tree, for all phylogenetic analyses on simulated data.
RF_MAP subfolder: Contains normalized Robinson-Foulds (RF) distances between MAP tree and true tree, for all phylogenetic analyses on simulated data.
SummaryResults_Metrics.csv: Comma-separated-values table with mean and standard deviation of all three metrics calculated in the study, for different combinations of number of taxa, number of characters, number of states per character, and total tree length.
TrueCladePP subfolder: Contains posterior probabilities of clades in the true tree (true clades), for all phylogenetic analyses on simulated data.
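To make two of the metrics above concrete, here is a minimal Python sketch of a normalized Robinson-Foulds distance and of the share of true clades with posterior probability above 0.95. It is an illustration under stated assumptions, not the code in the archive: trees are represented directly as sets of nontrivial bipartitions, both trees are assumed to be unrooted and fully resolved so that the maximum distance is 2(n - 3) for n tips, and all function names are hypothetical. The archive's own scripts may normalize differently.

```python
from typing import Dict, FrozenSet, Iterable, Set

Split = FrozenSet[str]


def canonical_split(side: Iterable[str], all_tips: FrozenSet[str]) -> Split:
    """Represent a bipartition by the side that excludes the alphabetically first tip."""
    side = frozenset(side)
    anchor = min(all_tips)
    return all_tips - side if anchor in side else side


def normalized_rf(splits_a: Set[Split], splits_b: Set[Split], n_tips: int) -> float:
    """Count splits found in only one tree, divided by 2 * (n_tips - 3).

    Assumption: both trees are unrooted and fully resolved.
    """
    if n_tips < 4:
        raise ValueError("need at least 4 tips for nontrivial splits")
    return len(splits_a ^ splits_b) / (2 * (n_tips - 3))


def fraction_strongly_supported(true_clade_pp: Dict[Split, float],
                                threshold: float = 0.95) -> float:
    """Share of true clades whose posterior probability exceeds the threshold."""
    if not true_clade_pp:
        raise ValueError("no clades supplied")
    values = list(true_clade_pp.values())
    return sum(pp > threshold for pp in values) / len(values)
```

For a five-tip example, a true tree with splits AB|CDE and DE|ABC and an estimated tree with splits AC|BDE and DE|ABC differ in two splits, giving a normalized distance of 2 / 4 = 0.5.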

plots folder: Contains plots made in R of metrics used to assess accuracy and precision of phylogenetic estimates.

scripts folder: Contains scripts for running phylogenetic analyses, checking convergence of MCMC runs, and plotting results.

CheckConvergence_Sims.R: R script to check MCMC convergence of continuous parameters in the Bayesian phylogenetic analysis. The ForCluster version is modified to work with a bash script on a high-performance computing cluster.
Convenience_Sims_R.sh: Bash script to run CheckConvergence_Sims_ForCluster.R on a high-performance computing cluster.
ConvergenceTable_Sims.R: R script to summarize the results of the convergence check.
ExtractMeanMAPCladePP.R: R script to calculate mean posterior probabilities of clades in the MAP tree.
ExtractThresholdTrueClade95PP.R: R script to calculate the percentage of true clades that are strongly supported (that is, that have posterior probability > 0.95).
ExtractTrueCladePP.Rev: RevBayes script to calculate the posterior probabilities of clades in the true tree (true clades). The ForCluster version is modified to work with a bash script on a high-performance computing cluster.
ExtractTrueCladePP.sh: Bash script to run ExtractTrueCladePP_ForCluster.Rev on a high-performance computing cluster.
PlotMDS_CID.R: R script to plot 2D mapping of tree space based on Clustering Information Distances between trees.
PlotMeanMAPCladePP.R: R script to plot line plots and box plots of mean posterior probabilities of clades in the MAP tree.
PlotRF_MAP_ByBranchLength.R: R script to plot line plots and box plots of normalized Robinson-Foulds (RF) distances between MAP tree and true tree, for combinations of tree length and number of taxa resulting in similar average branch lengths.
PlotRF_MAP.R: R script to plot line plots and box plots of normalized Robinson-Foulds (RF) distances between MAP tree and true tree.
PlotRFVariationBetweenStateNumbers.R: R script to plot line plots of the percentage variation in RF distances when going from binary characters to multi-state characters.
PlotThresholdTrueClade95PP_ByBranchLength.R: R script to plot line plots and box plots of the percentage of true clades that are strongly supported (that is, that have posterior probability > 0.95), for combinations of tree length and number of taxa resulting in similar average branch lengths.
PlotThresholdTrueClade95PP.R: R script to plot line plots and box plots of the percentage of true clades that are strongly supported (that is, that have posterior probability > 0.95).
RFdistance_MAP.R: R script to calculate normalized Robinson-Foulds (RF) distances between MAP tree and true tree.
SimulatedDataAnalysis_Mkv.Rev: RevBayes script to run phylogenetic analyses on simulated datasets.
SimulateUnrootedTrees.Rev: RevBayes script to simulate unrooted trees.
Simulation_run_mcmc.sh: Bash script to run SimulatedDataAnalysis_Mkv.Rev on a high-performance computing cluster.
Simulator_Mkv.Rev: RevBayes script to simulate character data under the Mkv model.
Simulator_Mkv.R: R script analogous to Simulator_Mkv.Rev, simulating character data under the Mkv model.
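As a rough illustration of what simulating under the Mkv model involves, the Python sketch below evolves a k-state character along a tree under the Mk model (equal rates between all states, branch lengths in expected changes per character) and keeps only characters that vary among tips. It is not the code in Simulator_Mkv.Rev or Simulator_Mkv.R. The tree representation, the function names, and the use of rejection sampling to condition on variability are all assumptions made for this example.

```python
import math
import random


def mk_same_state_prob(k: int, t: float) -> float:
    """Probability that a k-state Mk character is in the same state after branch length t."""
    return 1.0 / k + (k - 1.0) / k * math.exp(-k * t / (k - 1.0))


def evolve(state: int, k: int, t: float, rng: random.Random) -> int:
    """Draw the state at the end of a branch of length t."""
    if rng.random() < mk_same_state_prob(k, t):
        return state
    # Under Mk, every other state is equally likely as the end state.
    return rng.choice([s for s in range(k) if s != state])


def simulate_character(tree, k, rng):
    """Simulate one character on a hypothetical tree format.

    `tree` maps each internal node to a list of (child, branch_length);
    the starting node must be named "root"; nodes absent from `tree` are tips.
    """
    tips = {}
    stack = [("root", rng.randrange(k))]  # uniform state at the root
    while stack:
        node, state = stack.pop()
        children = tree.get(node, [])
        if not children:
            tips[node] = state
        for child, length in children:
            stack.append((child, evolve(state, k, length, rng)))
    return tips


def simulate_variable_characters(tree, k, n_chars, seed=0):
    """Rejection sampling (an assumption here): discard characters constant across tips."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_chars:
        tips = simulate_character(tree, k, rng)
        if len(set(tips.values())) > 1:
            kept.append(tips)
    return kept
```

Because the Mk model is time-reversible, rooting an unrooted tree at an arbitrary node for simulation does not change the distribution of tip states. Note that the rejection loop never terminates if every branch length is zero, since no character can then vary.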