Resolving competing evolutionary histories in joint ancestral state reconstruction

Published Mar 26, 2026 on Dryad. https://doi.org/10.5061/dryad.g1jwstr55

Data files

Mar 26, 2026 version files 413.62 MB

joint-unc.zip

413.60 MB
README.md

23.58 KB

Abstract

This repository contains simulation data, empirical data, and analysis code for a study evaluating joint uncertainty in discrete character ancestral state reconstruction (ASR). We compare joint and marginal reconstruction approaches across a range of simulated conditions and apply these methods to an empirical dataset of carbapenem-resistant Klebsiella pneumoniae (CRKP) isolates from a Long-Term Acute Care Hospital (LTACH).

All files below are in joint-unc.zip which includes all of the code for simulating and comparing our new joint uncertainty methods.

General Information

Corresponding author information

Name: James Boyko
ORCID: 0000-0003-0952-169X
Affiliation: University of Michigan

Related publication

[PLACEHOLDER: Full citation once published]

Notes on file naming conventions

Simulation condition files follow the pattern:

[method]_n[ntaxa]_k[nstates]_q[rate]_[model]

where:

[method] is the reconstruction approach used to compute confidence intervals:
- simmap — stochastic mapping
- pupko — Pupko joint reconstruction
- marginal — marginal reconstruction
[ntaxa] is the number of taxa in the simulated tree (always 101 in this study)
[nstates] (k) is the number of discrete character states (2 or 3)
[rate] (q) is the per-branch transition rate used to simulate data (0.25, 0.5, or 1)
[model] is the substitution model:
- er — equal rates (all transition rates constrained to be equal)
- ard — all rates different (each transition rate estimated separately)

Description of the data and file structure

joint-unc/
├── data/
│   ├── CRKP_LTACH_df_041624.csv
│   └── pr.tre
├── R/
│   ├── analysis.R
│   ├── diff.R
│   ├── simulation.R
│   ├── utils.R
│   └── visualization.R
├── scripts/
│   ├── install_packages.R
│   ├── 01_run_simulations.R
│   ├── 02_run_recon_unc.R
│   ├── 03_compare_jnt-mar.R
│   ├── 04_analyze_unc_results.R
│   ├── 05_run_empirical.R
│   ├── add_marginal_to_results.R
│   └── viz_aid_plots.R
├── res/
│   ├── n101_k[nstates]_q[rate]_er_results.RDS  (6 files, top-level)
│   ├── unc/
│   │   └── n101_k[nstates]_q[rate]_[model].RDS  (12 files)
│   ├── mar/
│   │   └── n101_k[nstates]_q[rate]_[model]_intermediates.RDS  (12 files)
│   ├── mod/
│   │   └── n101_k[nstates]_q[rate]_[model]_results.RDS  (12 files)
│   ├── p_ci_rds/
│   │   └── p_ci_[method]_n101_k[nstates]_q[rate]_[model]_results_sim[NNNN].rds  (1797 files)
│   ├── old/                          (superseded results, not used in final analyses)
│   └── older/                        (superseded results, not used in final analyses)
├── tables/
│   ├── agg_accuracy.csv
│   ├── full_sim_analysis_results.csv
│   └── prop_table.csv
├── doc/
│   └── vignette/
│       ├── joint_unc_vignette.rmd
│       └── joint_unc_vignette.html
└── empirical_analysis/
    ├── data/empirical/
    │   ├── dataset/
    │   │   ├── df.csv
    │   │   └── df.RDS
    │   ├── joint_reconstruction/
    │   │   └── *.RDS  (6 files)
    │   ├── phyloAMR/
    │   │   └── *.RDS  (13 files)
    │   ├── resistance_genotypes/
    │   │   ├── resistance_genotypes.csv
    │   │   └── resistance_genotypes.RDS
    │   └── tree/
    │       └── tree.treefile
    ├── scripts/empirical/
    │   ├── dataset_curation/
    │   ├── figures/
    │   ├── joint_reconstruction/
    │   ├── phyloAMR_analyses/
    │   └── tables/
    ├── tables/empirical/
    │   ├── table_median_stats_across_axes.csv
    │   └── table_phenotype_genotype_pairwise_synchronous_transitions_summary.csv
    ├── environments/
    │   ├── phyloAMR.yaml
    │   └── phyloAMR_installation_README
    └── lib/
        └── consistent_themes.R

`data/`

Input data for the empirical analysis.

data/CRKP_LTACH_df_041624.csv

Antibiotic susceptibility testing (AST) data for CRKP isolates collected from an LTACH. Each row is one bacterial isolate. For each of nine antibiotics, six columns are provided:

Column suffix	Description
`[ABX]`	Raw MIC value as a string (e.g., `">16"`, `"≤0.5"`)
`[ABX]_cat`	Categorical susceptibility: `Susceptible`, `Intermediate`, or `Resistant`
`[ABX]_dich`	Dichotomous susceptibility: `Susceptible` or `Non-Susceptible`
`[ABX]_dich_num`	Dichotomous susceptibility as integer: `0` = Susceptible, `1` = Non-Susceptible
`[ABX]_log_2`	Log2-transformed MIC numeric value
`[ABX]_num`	Numeric MIC value

Antibiotic abbreviations:

Abbreviation	Antibiotic
AMK	Amikacin
CST	Colistin
CZA	Ceftazidime-avibactam
IMI	Imipenem
IR	Imipenem-relebactam
MERO	Meropenem
MVB	Meropenem-vaborbactam
PLZ	Plazomicin
TGC	Tigecycline

Additional columns:

Column	Description
(row index)	Row identifier
`isolate_no`	Unique isolate identifier (e.g., `PCMP_H1`)
`Patient_ID`	De-identified patient identifier (integer)
`clades`	Phylogenetic clade assignment: `clade I`, `clade IIA`, or `clade IIB`

data/pr.tre

Phylogenetic tree of the CRKP isolates in Newick format. Tip labels correspond to isolate_no values in CRKP_LTACH_df_041624.csv. Branch lengths are in substitutions per site; they are multiplied by 1000 in the analysis scripts.

`R/`

Reusable R source functions sourced by the analysis scripts.

File	Description
`simulation.R`	Functions for simulating discrete characters on trees and storing results
`analysis.R`	Functions for running corHMM reconstructions and extracting results
`utils.R`	Utility functions for loading results, computing metrics, and file I/O
`visualization.R`	Plotting functions used across scripts
`diff.R`	Functions for comparing joint and marginal distributions

`scripts/`

Main analysis scripts intended to be run in the following order:

Script	Description
`install_packages.R`	Installs required CRAN packages (`geiger`, `corHMM`, `Rtsne`, `dbscan`)
`01_run_simulations.R`	Simulates discrete characters on random trees under six rate conditions, saving raw simulation results to `res/`
`02_run_recon_unc.R`	Computes joint confidence intervals for each simulation × method combination using `corHMM:::compute_joint_ci()`; saves per-simulation RDS files to `res/p_ci_rds/`
`03_compare_jnt-mar.R`	Compares joint and marginal reconstruction accuracy across simulations
`04_analyze_unc_results.R`	Aggregates results across simulations and produces summary tables in `tables/`
`05_run_empirical.R`	Runs the full empirical analysis on the CRKP dataset: corHMM ASR, joint CI sampling, tSNE + HDBSCAN clustering, and figure generation
`add_marginal_to_results.R`	Adds marginal reconstruction results to the existing simulation result objects
`viz_aid_plots.R`	Generates visualization aid plots

The two .Rmd files in scripts/ are working documents used during manuscript preparation.

`res/`

Stored simulation and uncertainty results.

res/n101_k[nstates]_q[rate]_er_results.RDS (6 files)

Each file stores the raw simulation output for one rate condition under the equal-rates (er) model. The file name encodes the simulation condition (see naming convention above). Each RDS file contains an R list with one element per simulation replicate (100 replicates per condition). Each replicate element contains:

$res — a corHMM model fit object (the fitted ancestral state reconstruction)
Simulated tree and character data used to fit the model

res/unc/n101_k[nstates]_q[rate]_[model].RDS (12 files)

Per-condition uncertainty summary objects, covering both er and ard models across all six rate conditions. Each RDS file is an R list containing aggregated uncertainty metrics across the 100 simulation replicates for that condition.

res/p_ci_rds/p_ci_[method]_n101_k[nstates]_q[rate]_[model]_results_sim[NNNN].rds (1797 files)

One file per simulation replicate × method × condition combination. Each RDS file is the output of corHMM:::compute_joint_ci() for a single replicate, containing the joint probability distribution over ancestral state assignments. The three methods are simmap, pupko, and marginal. Replicates are numbered 0001–0100 per condition.

res/mar/n101_k[nstates]_q[rate]_[model]_intermediates.RDS (12 files)

Intermediate marginal reconstruction objects generated during pipeline development, covering all 12 conditions (both er and ard models). Each RDS file contains the marginal probability matrices at internal nodes for the 100 simulation replicates of that condition, stored prior to final aggregation.

res/mod/n101_k[nstates]_q[rate]_[model]_results.RDS (12 files)

Modified/updated simulation result objects covering all 12 conditions. These parallel the top-level _results.RDS files but include additional marginal reconstruction fields added by scripts/add_marginal_to_results.R. Each is an R list of 100 replicate elements, each containing the fitted corHMM object and associated marginal results.

res/old/ and res/older/

Superseded result files from earlier versions of the analysis pipeline. These are retained for provenance but are not used in any final analyses or figures. Users reproducing the analyses should ignore these directories.

`tables/`

CSV summary tables derived from the simulation results.

tables/agg_accuracy.csv

Aggregated reconstruction accuracy across simulations, summarized by condition. Columns:

Column	Description
`rate`	Transition rate used in simulation (`0.25`, `0.5`, or `1`)
`nstate`	Number of discrete states (`2` or `3`)
`model`	Substitution model (`ard` or `er`)
`joint_correct`	Proportion of simulations where joint reconstruction recovered the true state at all nodes
`marginal_correct`	Proportion of simulations where marginal reconstruction recovered the true state at all nodes
`ci_exact_match_jnt`	Proportion of simulations where the joint confidence interval exactly matched the true state assignment
`ci_exact_match_mar`	Proportion of simulations where the marginal confidence interval exactly matched the true state assignment

tables/full_sim_analysis_results.csv

Full per-simulation results (one row per replicate). Columns:

Column	Description
(row index)	Row identifier
`ntaxa`	Number of taxa in the simulated tree
`nstate`	Number of discrete states
`rate`	Transition rate used in simulation
`model`	Substitution model (`ard` or `er`)
`marginal_correct`	Logical; whether marginal reconstruction was correct at all nodes
`joint_correct`	Logical; whether joint reconstruction was correct at all nodes
`true_in_ci`	Logical; whether the true state assignment fell within the joint confidence interval
`true_in_mar_ci`	Logical; whether the true state assignment fell within the marginal confidence interval
`true_in_lik_dist`	Logical; whether the true state fell within the joint likelihood distribution
`true_in_mar_lik_dist`	Logical; whether the true state fell within the marginal likelihood distribution
`ci_exact_match_jnt`	Logical; whether the joint CI exactly matched the true assignment
`ci_exact_match_mar`	Logical; whether the marginal CI exactly matched the true assignment
`joint_dist`	Distance between joint reconstruction and the true assignment
`marginal_dist`	Distance between marginal reconstruction and the true assignment

tables/prop_table.csv

Summary table of credible interval sizes and coverage proportions across conditions. Each row is one condition (encoded in the row name, e.g., n101_k2_q0.25_er). Columns:

Column	Description
`n_simmap`	Mean number of unique state assignments in the SIMMAP-based credible set
`n_pupko`	Mean number of unique state assignments in the Pupko joint-based credible set
`n_marginal`	Mean number of unique state assignments in the marginal-based credible set
`p_simmap`	Coverage probability of the SIMMAP credible set (proportion of simulations containing the true assignment)
`p_pupko`	Coverage probability of the Pupko credible set
`p_marginal`	Coverage probability of the marginal credible set

`doc/`

doc/vignette/joint_unc_vignette.rmd / joint_unc_vignette.html

An R Markdown vignette demonstrating how to generate and interpret alternative joint ancestral state reconstructions using the methods described in the associated publication (Boyko et al. 2026). It walks through installation, data preparation, running compute_joint_ci(), tSNE dimensionality reduction, HDBSCAN clustering, and visualization of alternative reconstruction clusters. The .html is the pre-rendered output.

`empirical_analysis/`

Self-contained sub-project for the empirical CRKP analysis.

`empirical_analysis/data/empirical/dataset/`

df.csv / df.RDS

The primary dataset used in the empirical analysis. Each row is one CRKP isolate. Columns:

Column	Description
`isolate_no`	Unique isolate identifier
`Patient_ID`	De-identified patient identifier (integer)
`Cx_date`	Culture collection date (ISO 8601 format)
`blbli_dich_num`	Dichotomous susceptibility to imipenem-relebactam (BL/BLI): `0` = Susceptible, `1` = Non-Susceptible

The RDS file contains the same data as an R data frame.

`empirical_analysis/data/empirical/resistance_genotypes/`

resistance_genotypes.csv / resistance_genotypes.RDS

Resistance genotype data for each CRKP isolate. All genotype columns are binary: 0 = absent, 1 = present. Columns:

Column	Description
`isolate_no`	Unique isolate identifier
`RamA`	Presence of ramA efflux pump activator putative function-altering variant (PFAV)
`RamR`	Presence of ramR efflux pump regulator PFAV
`PBP_any`	Presence of any penicillin-binding protein (PBP) mutation
`PBP2`	Presence of PBP2 mutation specifically
`PBP4`	Presence of PBP4 mutation specifically
`OmpK36_c25t`	Presence of cytosine-to-thymine transition at position 25 in ompK36
`OmpK36_truncation_kleborate`	Presence of truncation in ompK36 as detected by Kleborate
`OmpK36_putative_function_altering`	Presence of any putative function-altering mutation in ompK36
`OmpK36_L3_mutations`	Presence of loop 3 insertion in ompK36
`OmpK36GD`	Presence of GD (Gly-Asp) insertion in ompK36 loop 3
`OmpK36TD`	Presence of TD (Thr-Asp) insertion in ompK36 loop 3
`OmpK36_promoter_IS`	Presence of insertion sequence at the ompK36 promoter
`OmpK36_disruptive`	Presence of putative disruptive mutation in ompK36
`AA552`	Presence of AA552 blaKPC-containing plasmid marker
`AA018`	Presence of AA018 plasmid marker

`empirical_analysis/data/empirical/tree/`

tree.treefile

Maximum-likelihood phylogenetic tree of the CRKP isolates in Newick format, used in the empirical analysis. Tip labels correspond to isolate_no values in the dataset files.

`empirical_analysis/data/empirical/joint_reconstruction/`

RDS files storing intermediate and final results from the joint uncertainty ASR analysis of the CRKP dataset.

File	Description
`phenotype_genotype_joint_uncertainty_asrs.RDS`	List of sampled joint ancestral state reconstructions from `compute_joint_ci()`, covering both the phenotype (IR susceptibility) and genotype axes
`phenotype_genotype_focal_asrs.RDS`	The single maximum-likelihood joint reconstruction for each axis
`joint_uncertainty_asr_clustering_optimal.RDS`	Optimal HDBSCAN clustering result applied to the joint reconstruction space
`joint_uncertainty_asr_dbres_iterations.RDS`	HDBSCAN clustering results across a range of `minPts` parameter values
`joint_uncertainty_asr_phylo_dist_matrix.RDS`	Phylogenetically-informed pairwise distance matrix between sampled reconstructions
`alternative_reconstruction_tsne.RDS`	tSNE dimensionality reduction result of the sampled joint reconstruction space

`empirical_analysis/data/empirical/phyloAMR/`

RDS files storing results from the phyloAMR analyses of phenotype-genotype transition associations.

File	Description
`joint_states_parent_child_dfs.RDS`	Data frames of parent-child node pairs with joint ancestral state assignments across all sampled reconstructions
`joint_states_reconstruction_transition_statistics.RDS`	Transition count statistics (gains, losses, continuations) summarized across all sampled reconstructions
`pairwise_synchronous_transitions_comparisons.RDS`	Pairwise synchronous transition comparison results across all phenotype reconstructions
`pairwise_synchronous_transitions_genotype_as_comparitor.RDS`	Synchronous transition comparisons using each genotype axis as the comparitor
`pairwise_synchronous_transitions_phenotype_as_comparitor.RDS`	Synchronous transition comparisons using the phenotype axis as the comparitor
`phenotype_genotype_joint_uncertainty_asr_clustering.RDS`	HDBSCAN clustering of joint ASR space for the full phenotype-genotype analysis
`phenotype_genotype_joint_uncertainty_asr_clustering_statistics.RDS`	Summary statistics of the clustering result
`phenotype_genotype_transition_association_statistics.RDS`	Permutation-based association test statistics between phenotype and genotype transitions
`representative_blbli_joint_uncertainty_asr_clustering.RDS`	HDBSCAN clustering of the IR-susceptibility (BL/BLI) representative reconstruction space
`representative_blbli_joint_uncertainty_asr_clustering_df.RDS`	Data frame summarizing the representative BL/BLI clustering result
`representative_blbli_joint_uncertainty_asr_clustering_statistics.RDS`	Summary statistics of the representative BL/BLI clustering
`representative_blbli_joint_uncertainty_parent_child_dfs.RDS`	Parent-child data frames for representative BL/BLI reconstructions
`representative_phenotype_genotype_downstream_transition_statistics.RDS`	Downstream transition statistics for representative phenotype-genotype reconstruction pairs
`representative_phenotype_genotype_synchronous_transition_statistics.RDS`	Synchronous transition statistics for representative phenotype-genotype reconstruction pairs
`representative_phenotype_genotype_transition_association_statistics.RDS`	Permutation-based association test statistics for representative reconstruction pairs

`empirical_analysis/tables/empirical/`

table_median_stats_across_axes.csv

Summary of reconstruction statistics across all phenotype reconstruction axes (i.e., alternative joint reconstruction clusters). Each row is one reconstruction axis. Values are reported as median (IQR). Columns include: number of reconstructions per axis, phylogenetic events, singleton counts, cluster counts and sizes, transition counts, gain events, loss events, and continuation counts.

table_phenotype_genotype_pairwise_synchronous_transitions_summary.csv

Summary of synchronous transition counts between the phenotype (IR susceptibility) axis and each resistance genotype feature, reported by reconstruction axis. Values are reported as median (IQR). Each column corresponds to one genotype feature (e.g., presence of a specific ompK36 variant or plasmid marker).

`empirical_analysis/scripts/empirical/`

R Markdown (.Rmd) scripts for all empirical analyses, organized into subdirectories by analysis type. Each .Rmd has a corresponding rendered .html output.

Subdirectory	Description
`dataset_curation/`	Scripts to curate and merge the phylogenetic tree and AST dataset
`joint_reconstruction/`	Scripts to compute and cluster joint ASR samples
`phyloAMR_analyses/`	Scripts to run phyloAMR association tests between phenotype and genotype transitions
`figures/`	Scripts to generate all empirical figures
`tables/`	Scripts to generate summary tables

The .sbat files in joint_reconstruction/ and phyloAMR_analyses/ are SLURM batch submission scripts used to run computationally intensive steps on an HPC cluster (tested on the University of Michigan Great Lakes cluster). The most intensive steps (get_phenotype_genotype_ancestral_states.Rmd and phyloAMR_representative_phenotype_genotype_association_testing.R) were run with 15 CPUs and 75 GB RAM; adjust --cpus-per-task and --mem in the .sbat files as needed for your cluster.

`empirical_analysis/environments/`

phyloAMR.yaml

Conda environment specification for reproducing the empirical analysis software environment.

phyloAMR_installation_README

Instructions for installing the phyloAMR R package and its dependencies.

`empirical_analysis/lib/consistent_themes.R`

Shared ggplot2 theme settings used across empirical figure scripts.

Sharing/Access information

Data was derived from the following sources:

CRKP isolate collection and AST data from the LTACH cohort described in the associated publication.

Code/Software

All analyses were performed in R. The development version of corHMM (available on GitHub) is required, as the compute_joint_ci() function used throughout is not yet on CRAN.

Required R packages:

Package	Version tested	Source
corHMM	2.10.1 (GitHub)	`devtools::install_github("thej022214/corHMM")`
phyloAMR	0.1.0 (GitHub)	`devtools::install_github("kylegontjes/phyloAMR")`
ape	5.8.1	CRAN
geiger	2.0.11	CRAN
Rtsne	0.17	CRAN
dbscan	1.2.4	CRAN
proxy	0.4.29	CRAN
RColorBrewer	1.1.3	CRAN
phytools	2.5.2	CRAN
phangorn	2.12.1	CRAN
tidyverse	2.0.0	CRAN
pbmcapply	1.5.1	CRAN
ggtree	4.0.4	Bioconductor
parallel	(base R)	—

The empirical analysis additionally requires a conda environment. Set it up using empirical_analysis/environments/phyloAMR.yaml:

conda env create -f empirical_analysis/environments/phyloAMR.yaml
conda activate joint_uncertainty_corHMM_phyloAMR

Then install the GitHub packages from within R (see empirical_analysis/environments/phyloAMR_installation_README for full instructions).

Install CRAN dependencies for the simulation analysis by running scripts/install_packages.R.

Recommended execution order for simulation analyses:

scripts/install_packages.R — install dependencies
scripts/01_run_simulations.R — simulate data and run reconstructions
scripts/02_run_recon_unc.R — compute joint confidence intervals (parallelized; uses 10 cores by default)
scripts/03_compare_jnt-mar.R — compare joint and marginal accuracy
scripts/04_analyze_unc_results.R — aggregate results and produce summary tables
scripts/05_run_empirical.R — run the empirical analysis

Scripts 02 and 05 are computationally intensive. Script 02 uses parallel::mclapply() with mc.cores = 10 by default; adjust as needed. The HPC batch scripts in empirical_analysis/scripts/empirical/joint_reconstruction/ and phyloAMR_analyses/ were used for the most intensive empirical steps.

R version: 4.5.2

Resolving competing evolutionary histories in joint ancestral state reconstruction

Data files

Abstract

README: DATA FOR: Resolving competing evolutionary histories in joint ancestral state reconstruction

General Information

Notes on file naming conventions

Description of the data and file structure

data/

R/

scripts/

res/

tables/

doc/

empirical_analysis/

empirical_analysis/data/empirical/dataset/

empirical_analysis/data/empirical/resistance_genotypes/

empirical_analysis/data/empirical/tree/

empirical_analysis/data/empirical/joint_reconstruction/

empirical_analysis/data/empirical/phyloAMR/

empirical_analysis/tables/empirical/

empirical_analysis/scripts/empirical/

empirical_analysis/environments/

empirical_analysis/lib/consistent_themes.R