Data and code from: Automatic discovery of optimal discrete character models
Data files
Apr 08, 2026 version files 979.67 MB
-
corhmm-dredge.zip
979.66 MB
-
README.md
11.96 KB
Abstract
Modeling discrete character evolution in a Markovian framework has become the standard in phylogenetic comparative methods. The increasing size and complexity of these models reflects a trend of analyses to include more taxa and more discrete characters. However, as the complexity of the models increase, so do the number of potential model structures and the number of estimable parameters, making it nearly impossible to consider all modeling options for a given dataset. To overcome this issue, I apply a combination of regularization and parameter sharing optimization to models of discrete character evolution. This allows for the automatic searching and optimization across different model structures without user specification. I test this framework under several simulation scenarios, including hidden rates and multiple discrete characters. The results indicate that regularized models significantly outperform traditional approaches, yielding a far lower variance and a nearly tenfold reduction in the overall error of parameter estimates in the most extreme scenarios. I illustrate the power of automatic model selection by revisiting the ancestral state estimation of concealed ovulation and mating systems in Old-World monkeys. Using the dredge algorithm, I discover a previously unexamined model structure which has both better statistical performance and a differing ancestral state reconstruction when compared to default model sets. In general, these results highlight the dangers of an over-reliance on default model sets. The combination of automatic model selection and regularization help overcome problems of over-parameterization, and these results demonstrate that when inferences are drawn from a larger model space, they are both more statistically robust and biologically realistic.
All files below are in corhmm-dredge.zip, which includes all of the code for simulating and analyzing the dredge method.
Automatically search discrete character evolution models and regularize parameter values using a combination of regularization and simulated annealing.
Overview
This project implements automatic model selection for discrete character evolution in phylogenetic comparative methods. The "dredge" algorithm combines regularization techniques (L1, L2) with simulated annealing to optimize across different model structures without requiring user specification of model complexity.
Repository Structure
corhmm-dredge-data/
├── code/ # R scripts and shell scripts
├── data/ # Simulated datasets
├── empirical_results/ # Fitted model objects for empirical analyses
├── param_results/ # Parameter estimation results by regularization type
├── parameter_tables/ # True parameter tables used for simulation
├── structure_results/ # Model structure test results
│ ├── dep_model/ # Dependence/correlation model tests
│ ├── hmm_model/ # Hidden Markov model tests
│ └── ord_model/ # Ordered model tests
├── summ_results/ # Summarized structure test results
├── tables/ # Summary statistics and output tables (CSV/XLSX)
└── trees/ # Phylogenetic trees used for simulation
Code
Simulation scripts
Each simulation script generates data under a specific character configuration, fits models, and saves results. Trees of size 100, 250, and 500 tips are used across all scenarios.
code/simulate-01.R— Single binary character:nChar=1, nStates=2code/simulate-02.R— Two binary characters:nChar=2, nStates=2code/simulate-03.R— Three binary characters:nChar=3, nStates=2
Each script saves simulated data to data/full_data-0N.RDS, true parameter tables to parameter_tables/par_table-0N.csv, and fitted model objects to param_results/.
code/simulate-trees.R— Generates the phylogenetic trees saved intrees/.code/run-simulations.sh— Shell script to run all simulation scripts in sequence.
Analysis scripts
code/analysis-param-est.R— Compares estimated parameters to true values; computes bias, variance, and RMSE on the raw scale.code/analysis-param-est-log.R— Same as above but on the log scale.code/analysis-param-est-remove-outlier.R— Parameter estimation analysis with outlier removal (raw scale).code/analysis-param-est-log-remove-outlier.R— Parameter estimation analysis with outlier removal (log scale).code/analysis-model-struct.R— Analyzes model structure test results (dependence, HMM, ordered models).code/compare-models.R— Compares model fits across regularization types.code/compare-tests.R— Compares statistical test performance across simulation settings.code/empirical.R— Runs the empirical analyses using the primates dataset bundled with corHMM.
Plotting scripts
code/plot-parameter-results.R— Generates parameter estimation figures.code/plot_SA_info.R— Plots simulated annealing diagnostics.code/asr-plot.R— Plots ancestral state reconstructions.code/dent-surfaces.R— Plots regularization surfaces.
Note on output directories: The plotting scripts write PDF output files to a plots/ directory (e.g., plots/par-plot.pdf, plots/sa_info.pdf, plots/asr_phy.pdf, plots/alternative_recons.pdf, plots/JS-div-asr.pdf). This directory is not included in the repository submission but must be created manually before running the plotting scripts (mkdir plots). The plots/ directory is generated by the user when reproducing the figures.
Model structure test scripts
These scripts simulate data under a specific model structure, fit a suite of candidate models using corHMMDredge, and save per-replicate results to structure_results/. Each script runs 100 independent replicates and saves intermediate output as intermediate_results_sim_N.rds.
code/model-strucure-tests-corr.R— Simulates two correlated binary characters (4-state model) with asymmetric transition rates designed to exhibit dependent/correlated evolution. Tests the ability ofcorHMMDredgeto recover dependent model structure.code/model-strucure-tests-hmm.R— Simulates a single binary character evolving under hidden rate variation (two-rate-category HMM). Tests the ability ofcorHMMDredgeto detect hidden Markov model structure.code/model-strucure-tests-ord.R— Simulates an ordered three-state character (outx → fac → self) with only sequential transitions permitted. Tests the ability ofcorHMMDredgeto recover ordered model structure.
Utilities
code/utils.R— Shared helper functions used across simulation and analysis scripts.
Data
data/
full_data-01.RDS— Simulated dataset for simulation 01 (1 binary character).full_data-02.RDS— Simulated dataset for simulation 02 (2 binary characters).full_data-03.RDS— Simulated dataset for simulation 03 (3 binary characters).
Each RDS file contains a list of simulation replicates. Each replicate is a list with:
phy: the phylogenetic treepar: the true rate matrix used for simulationdat: the simulated tip state datacor_dat: the data formatted for corHMM input
parameter_tables/
par_table-01.csv— True parameter values for simulation 01 replicates.par_table-02.csv— True parameter values for simulation 02 replicates.par_table-03.csv— True parameter values for simulation 03 replicates.
Each table contains one row per simulation replicate and one column per transition rate parameter, drawn from a log-normal distribution (mean=0, sd=0.25 on the log scale).
trees/
tree_100.tre— Set of trees with 100 tips.tree_250.tre— Set of trees with 250 tips.tree_500.tre— Set of trees with 500 tips.
Results
param_results/
Fitted corHMM model objects for the parameter estimation simulations, saved as RDS files. Named res0N_TYPE.RDS where N is the simulation number (01–03) and TYPE is the regularization method:
l0— No regularization (baseline); e.g.,res01_l0.RDS,res02_l0.RDS,res03_l0.RDSl1— L1 regularization; e.g.,res01_l1.RDS,res02_l1.RDS,res03_l1.RDSl2— L2 regularization; e.g.,res01_l2.RDS,res02_l2.RDS,res03_l2.RDSer— Equal rates constraint; e.g.,res01_er.RDS,res02_er.RDS,res03_er.RDS
Each RDS file contains a list of fitted corHMM model objects, one per simulation replicate. Each model object is the direct output of corHMM() or corHMMDredge() and includes the estimated rate matrix, log-likelihood, AIC, and other standard corHMM output fields.
structure_results/
Fitted model objects for model structure tests, organized by model type:
dep_model/— Tests for correlated evolution between characters.hmm_model/— Tests for hidden rate variation (HMM structure).ord_model/— Tests for ordered character evolution.
Each subdirectory contains intermediate_results_sim_N.rds files (one per simulation replicate, N = 1–100).
summ_results/
Summarized structure test results:
results_corr.RDS— Summary of dependence/correlation model tests.results_hmm.RDS— Summary of HMM model tests.results_ord.RDS— Summary of ordered model tests.
empirical_results/
All empirical analysis outputs are stored in the single folder empirical_results/ (there is no separate "empirical_results 2" folder). The 2 suffix on empirical2.RDS denotes a secondary model run, not a separate directory.
empirical.RDS— Main corHMMDredgeSA model fits for the empirical analysis (primary simulated annealing run on the primates dataset).empirical2.RDS— Secondary empirical model fits from a follow-up run initialized from the best model of the primary run.empirical_kfold.RDS— K-fold cross-validation results for selecting the optimal regularization penalty (lambda).empirical_profile.RDS— Profile likelihood results for confidence intervals on the estimated rate parameters.
Tables
All summary tables are in tables/. The log- prefix indicates results computed on the log scale; out-rm- indicates outliers were removed prior to summarization.
Parameter estimation tables
raw-parameter-comparison.csv— Raw parameter estimates vs. true values for all simulation replicates (unlogged scale).log-parameter-comparison.csv— Same as above on the log scale.
Column definitions (parameter comparison files)
ntips: Number of tips in the phylogeny (100, 250, 500).type: Regularization method (er,l0,l1,l2).par: Transition rate parameter name (e.g.,q[1][2]).value: Estimated parameter value.true: True (simulated) parameter value.diff: Difference between estimated and true value.mse: Squared difference between estimated and true value.model: Complexity of the generating model (1Char,2Char,3Char).
Summary tables
stats-summary-by-type.csv/log-stats-summary-by-type.csv— Bias, variance, and RMSE by regularization method and tree size.out-rm-stats-summary-by-type.csv/out-rm-log-stats-summary-by-type.csv— Same with outliers removed.propor-variance-by-type.csv/log-propor-variance-by-type.csv— Variance of L1, L2, and ER estimates as a proportion of the L0 (unregularized) baseline. Values > 1 indicate higher variance than baseline.par-summary-by-type.csv/log-par-summary-by-type.csv— Mean, variance, and median of estimates grouped bytypeandntips.par-summary-by-type-par.csv/log-par-summary-by-type-par.csv— Same as above, additionally grouped by specific parameter (par).par-summary-by-type-model.csv/log-par-summary-by-type-model.csv— Same as above, additionally grouped by model complexity (1Char,2Char,3Char).final-par-table.xlsx— Formatted parameter summary table for publication.
Model structure test tables
structure_test_dep_median.csv— Median test statistics for the dependence model tests.structure_test_hmm_median.csv— Median test statistics for the HMM model tests.structure_test_ord_median.csv— Median test statistics for the ordered model tests.test_summary_dep.csv— Full test summary for dependence model.test_summary_hmm.csv— Full test summary for HMM model.test_summary_ord.csv— Full test summary for ordered model.test_summary_master.xlsx— Combined formatted structure test table for publication.asr_test_median_master.xlsx— Median ancestral state reconstruction test statistics.
Empirical tables
empirical_model_table.csv— Model comparison table across all candidate models explored during the empirical corHMMDredgeSA runs on the primates dataset. Each row is one fitted model. Columns:np: Number of free parameters in the model.nRateCat: Number of hidden rate categories (1 = standard model, 2 = hidden Markov model).lnLik: Log-likelihood of the fitted model.AIC: Akaike Information Criterion (2 × np − 2 × lnLik).dAIC: Difference in AIC from the best-fitting model (lowest AIC = 0).AICwt: AIC weight (relative likelihood of the model given the set of candidate models).
Simulated under discrete characters.
