Resolving competing evolutionary histories in joint ancestral state reconstruction
Data files
Mar 26, 2026 version files 413.62 MB
-
joint-unc.zip
413.60 MB
-
README.md
23.58 KB
Abstract
This repository contains simulation data, empirical data, and analysis code for a study evaluating joint uncertainty in discrete character ancestral state reconstruction (ASR). We compare joint and marginal reconstruction approaches across a range of simulated conditions and apply these methods to an empirical dataset of carbapenem-resistant Klebsiella pneumoniae (CRKP) isolates from a Long-Term Acute Care Hospital (LTACH).
All files below are in joint-unc.zip which includes all of the code for simulating and comparing our new joint uncertainty methods.
General Information
Corresponding author information
Name: James Boyko
ORCID: 0000-0003-0952-169X
Affiliation: University of Michigan
Related publication
[PLACEHOLDER: Full citation once published]
Notes on file naming conventions
Simulation condition files follow the pattern:
[method]_n[ntaxa]_k[nstates]_q[rate]_[model]
where:
[method]is the reconstruction approach used to compute confidence intervals:simmap— stochastic mappingpupko— Pupko joint reconstructionmarginal— marginal reconstruction
[ntaxa]is the number of taxa in the simulated tree (always101in this study)[nstates](k) is the number of discrete character states (2or3)[rate](q) is the per-branch transition rate used to simulate data (0.25,0.5, or1)[model]is the substitution model:er— equal rates (all transition rates constrained to be equal)ard— all rates different (each transition rate estimated separately)
Description of the data and file structure
joint-unc/
├── data/
│ ├── CRKP_LTACH_df_041624.csv
│ └── pr.tre
├── R/
│ ├── analysis.R
│ ├── diff.R
│ ├── simulation.R
│ ├── utils.R
│ └── visualization.R
├── scripts/
│ ├── install_packages.R
│ ├── 01_run_simulations.R
│ ├── 02_run_recon_unc.R
│ ├── 03_compare_jnt-mar.R
│ ├── 04_analyze_unc_results.R
│ ├── 05_run_empirical.R
│ ├── add_marginal_to_results.R
│ └── viz_aid_plots.R
├── res/
│ ├── n101_k[nstates]_q[rate]_er_results.RDS (6 files, top-level)
│ ├── unc/
│ │ └── n101_k[nstates]_q[rate]_[model].RDS (12 files)
│ ├── mar/
│ │ └── n101_k[nstates]_q[rate]_[model]_intermediates.RDS (12 files)
│ ├── mod/
│ │ └── n101_k[nstates]_q[rate]_[model]_results.RDS (12 files)
│ ├── p_ci_rds/
│ │ └── p_ci_[method]_n101_k[nstates]_q[rate]_[model]_results_sim[NNNN].rds (1797 files)
│ ├── old/ (superseded results, not used in final analyses)
│ └── older/ (superseded results, not used in final analyses)
├── tables/
│ ├── agg_accuracy.csv
│ ├── full_sim_analysis_results.csv
│ └── prop_table.csv
├── doc/
│ └── vignette/
│ ├── joint_unc_vignette.rmd
│ └── joint_unc_vignette.html
└── empirical_analysis/
├── data/empirical/
│ ├── dataset/
│ │ ├── df.csv
│ │ └── df.RDS
│ ├── joint_reconstruction/
│ │ └── *.RDS (6 files)
│ ├── phyloAMR/
│ │ └── *.RDS (13 files)
│ ├── resistance_genotypes/
│ │ ├── resistance_genotypes.csv
│ │ └── resistance_genotypes.RDS
│ └── tree/
│ └── tree.treefile
├── scripts/empirical/
│ ├── dataset_curation/
│ ├── figures/
│ ├── joint_reconstruction/
│ ├── phyloAMR_analyses/
│ └── tables/
├── tables/empirical/
│ ├── table_median_stats_across_axes.csv
│ └── table_phenotype_genotype_pairwise_synchronous_transitions_summary.csv
├── environments/
│ ├── phyloAMR.yaml
│ └── phyloAMR_installation_README
└── lib/
└── consistent_themes.R
data/
Input data for the empirical analysis.
data/CRKP_LTACH_df_041624.csv
Antibiotic susceptibility testing (AST) data for CRKP isolates collected from an LTACH. Each row is one bacterial isolate. For each of nine antibiotics, six columns are provided:
| Column suffix | Description |
|---|---|
[ABX] |
Raw MIC value as a string (e.g., ">16", "≤0.5") |
[ABX]_cat |
Categorical susceptibility: Susceptible, Intermediate, or Resistant |
[ABX]_dich |
Dichotomous susceptibility: Susceptible or Non-Susceptible |
[ABX]_dich_num |
Dichotomous susceptibility as integer: 0 = Susceptible, 1 = Non-Susceptible |
[ABX]_log_2 |
Log2-transformed MIC numeric value |
[ABX]_num |
Numeric MIC value |
Antibiotic abbreviations:
| Abbreviation | Antibiotic |
|---|---|
| AMK | Amikacin |
| CST | Colistin |
| CZA | Ceftazidime-avibactam |
| IMI | Imipenem |
| IR | Imipenem-relebactam |
| MERO | Meropenem |
| MVB | Meropenem-vaborbactam |
| PLZ | Plazomicin |
| TGC | Tigecycline |
Additional columns:
| Column | Description |
|---|---|
| (row index) | Row identifier |
isolate_no |
Unique isolate identifier (e.g., PCMP_H1) |
Patient_ID |
De-identified patient identifier (integer) |
clades |
Phylogenetic clade assignment: clade I, clade IIA, or clade IIB |
data/pr.tre
Phylogenetic tree of the CRKP isolates in Newick format. Tip labels correspond to isolate_no values in CRKP_LTACH_df_041624.csv. Branch lengths are in substitutions per site; they are multiplied by 1000 in the analysis scripts.
R/
Reusable R source functions sourced by the analysis scripts.
| File | Description |
|---|---|
simulation.R |
Functions for simulating discrete characters on trees and storing results |
analysis.R |
Functions for running corHMM reconstructions and extracting results |
utils.R |
Utility functions for loading results, computing metrics, and file I/O |
visualization.R |
Plotting functions used across scripts |
diff.R |
Functions for comparing joint and marginal distributions |
scripts/
Main analysis scripts intended to be run in the following order:
| Script | Description |
|---|---|
install_packages.R |
Installs required CRAN packages (geiger, corHMM, Rtsne, dbscan) |
01_run_simulations.R |
Simulates discrete characters on random trees under six rate conditions, saving raw simulation results to res/ |
02_run_recon_unc.R |
Computes joint confidence intervals for each simulation × method combination using corHMM:::compute_joint_ci(); saves per-simulation RDS files to res/p_ci_rds/ |
03_compare_jnt-mar.R |
Compares joint and marginal reconstruction accuracy across simulations |
04_analyze_unc_results.R |
Aggregates results across simulations and produces summary tables in tables/ |
05_run_empirical.R |
Runs the full empirical analysis on the CRKP dataset: corHMM ASR, joint CI sampling, tSNE + HDBSCAN clustering, and figure generation |
add_marginal_to_results.R |
Adds marginal reconstruction results to the existing simulation result objects |
viz_aid_plots.R |
Generates visualization aid plots |
The two .Rmd files in scripts/ are working documents used during manuscript preparation.
res/
Stored simulation and uncertainty results.
res/n101_k[nstates]_q[rate]_er_results.RDS (6 files)
Each file stores the raw simulation output for one rate condition under the equal-rates (er) model. The file name encodes the simulation condition (see naming convention above). Each RDS file contains an R list with one element per simulation replicate (100 replicates per condition). Each replicate element contains:
$res— acorHMMmodel fit object (the fitted ancestral state reconstruction)- Simulated tree and character data used to fit the model
res/unc/n101_k[nstates]_q[rate]_[model].RDS (12 files)
Per-condition uncertainty summary objects, covering both er and ard models across all six rate conditions. Each RDS file is an R list containing aggregated uncertainty metrics across the 100 simulation replicates for that condition.
res/p_ci_rds/p_ci_[method]_n101_k[nstates]_q[rate]_[model]_results_sim[NNNN].rds (1797 files)
One file per simulation replicate × method × condition combination. Each RDS file is the output of corHMM:::compute_joint_ci() for a single replicate, containing the joint probability distribution over ancestral state assignments. The three methods are simmap, pupko, and marginal. Replicates are numbered 0001–0100 per condition.
res/mar/n101_k[nstates]_q[rate]_[model]_intermediates.RDS (12 files)
Intermediate marginal reconstruction objects generated during pipeline development, covering all 12 conditions (both er and ard models). Each RDS file contains the marginal probability matrices at internal nodes for the 100 simulation replicates of that condition, stored prior to final aggregation.
res/mod/n101_k[nstates]_q[rate]_[model]_results.RDS (12 files)
Modified/updated simulation result objects covering all 12 conditions. These parallel the top-level _results.RDS files but include additional marginal reconstruction fields added by scripts/add_marginal_to_results.R. Each is an R list of 100 replicate elements, each containing the fitted corHMM object and associated marginal results.
res/old/ and res/older/
Superseded result files from earlier versions of the analysis pipeline. These are retained for provenance but are not used in any final analyses or figures. Users reproducing the analyses should ignore these directories.
tables/
CSV summary tables derived from the simulation results.
tables/agg_accuracy.csv
Aggregated reconstruction accuracy across simulations, summarized by condition. Columns:
| Column | Description |
|---|---|
rate |
Transition rate used in simulation (0.25, 0.5, or 1) |
nstate |
Number of discrete states (2 or 3) |
model |
Substitution model (ard or er) |
joint_correct |
Proportion of simulations where joint reconstruction recovered the true state at all nodes |
marginal_correct |
Proportion of simulations where marginal reconstruction recovered the true state at all nodes |
ci_exact_match_jnt |
Proportion of simulations where the joint confidence interval exactly matched the true state assignment |
ci_exact_match_mar |
Proportion of simulations where the marginal confidence interval exactly matched the true state assignment |
tables/full_sim_analysis_results.csv
Full per-simulation results (one row per replicate). Columns:
| Column | Description |
|---|---|
| (row index) | Row identifier |
ntaxa |
Number of taxa in the simulated tree |
nstate |
Number of discrete states |
rate |
Transition rate used in simulation |
model |
Substitution model (ard or er) |
marginal_correct |
Logical; whether marginal reconstruction was correct at all nodes |
joint_correct |
Logical; whether joint reconstruction was correct at all nodes |
true_in_ci |
Logical; whether the true state assignment fell within the joint confidence interval |
true_in_mar_ci |
Logical; whether the true state assignment fell within the marginal confidence interval |
true_in_lik_dist |
Logical; whether the true state fell within the joint likelihood distribution |
true_in_mar_lik_dist |
Logical; whether the true state fell within the marginal likelihood distribution |
ci_exact_match_jnt |
Logical; whether the joint CI exactly matched the true assignment |
ci_exact_match_mar |
Logical; whether the marginal CI exactly matched the true assignment |
joint_dist |
Distance between joint reconstruction and the true assignment |
marginal_dist |
Distance between marginal reconstruction and the true assignment |
tables/prop_table.csv
Summary table of credible interval sizes and coverage proportions across conditions. Each row is one condition (encoded in the row name, e.g., n101_k2_q0.25_er). Columns:
| Column | Description |
|---|---|
n_simmap |
Mean number of unique state assignments in the SIMMAP-based credible set |
n_pupko |
Mean number of unique state assignments in the Pupko joint-based credible set |
n_marginal |
Mean number of unique state assignments in the marginal-based credible set |
p_simmap |
Coverage probability of the SIMMAP credible set (proportion of simulations containing the true assignment) |
p_pupko |
Coverage probability of the Pupko credible set |
p_marginal |
Coverage probability of the marginal credible set |
doc/
doc/vignette/joint_unc_vignette.rmd / joint_unc_vignette.html
An R Markdown vignette demonstrating how to generate and interpret alternative joint ancestral state reconstructions using the methods described in the associated publication (Boyko et al. 2026). It walks through installation, data preparation, running compute_joint_ci(), tSNE dimensionality reduction, HDBSCAN clustering, and visualization of alternative reconstruction clusters. The .html is the pre-rendered output.
empirical_analysis/
Self-contained sub-project for the empirical CRKP analysis.
empirical_analysis/data/empirical/dataset/
df.csv / df.RDS
The primary dataset used in the empirical analysis. Each row is one CRKP isolate. Columns:
| Column | Description |
|---|---|
isolate_no |
Unique isolate identifier |
Patient_ID |
De-identified patient identifier (integer) |
Cx_date |
Culture collection date (ISO 8601 format) |
blbli_dich_num |
Dichotomous susceptibility to imipenem-relebactam (BL/BLI): 0 = Susceptible, 1 = Non-Susceptible |
The RDS file contains the same data as an R data frame.
empirical_analysis/data/empirical/resistance_genotypes/
resistance_genotypes.csv / resistance_genotypes.RDS
Resistance genotype data for each CRKP isolate. All genotype columns are binary: 0 = absent, 1 = present. Columns:
| Column | Description |
|---|---|
isolate_no |
Unique isolate identifier |
RamA |
Presence of ramA efflux pump activator putative function-altering variant (PFAV) |
RamR |
Presence of ramR efflux pump regulator PFAV |
PBP_any |
Presence of any penicillin-binding protein (PBP) mutation |
PBP2 |
Presence of PBP2 mutation specifically |
PBP4 |
Presence of PBP4 mutation specifically |
OmpK36_c25t |
Presence of cytosine-to-thymine transition at position 25 in ompK36 |
OmpK36_truncation_kleborate |
Presence of truncation in ompK36 as detected by Kleborate |
OmpK36_putative_function_altering |
Presence of any putative function-altering mutation in ompK36 |
OmpK36_L3_mutations |
Presence of loop 3 insertion in ompK36 |
OmpK36GD |
Presence of GD (Gly-Asp) insertion in ompK36 loop 3 |
OmpK36TD |
Presence of TD (Thr-Asp) insertion in ompK36 loop 3 |
OmpK36_promoter_IS |
Presence of insertion sequence at the ompK36 promoter |
OmpK36_disruptive |
Presence of putative disruptive mutation in ompK36 |
AA552 |
Presence of AA552 blaKPC-containing plasmid marker |
AA018 |
Presence of AA018 plasmid marker |
empirical_analysis/data/empirical/tree/
tree.treefile
Maximum-likelihood phylogenetic tree of the CRKP isolates in Newick format, used in the empirical analysis. Tip labels correspond to isolate_no values in the dataset files.
empirical_analysis/data/empirical/joint_reconstruction/
RDS files storing intermediate and final results from the joint uncertainty ASR analysis of the CRKP dataset.
| File | Description |
|---|---|
phenotype_genotype_joint_uncertainty_asrs.RDS |
List of sampled joint ancestral state reconstructions from compute_joint_ci(), covering both the phenotype (IR susceptibility) and genotype axes |
phenotype_genotype_focal_asrs.RDS |
The single maximum-likelihood joint reconstruction for each axis |
joint_uncertainty_asr_clustering_optimal.RDS |
Optimal HDBSCAN clustering result applied to the joint reconstruction space |
joint_uncertainty_asr_dbres_iterations.RDS |
HDBSCAN clustering results across a range of minPts parameter values |
joint_uncertainty_asr_phylo_dist_matrix.RDS |
Phylogenetically-informed pairwise distance matrix between sampled reconstructions |
alternative_reconstruction_tsne.RDS |
tSNE dimensionality reduction result of the sampled joint reconstruction space |
empirical_analysis/data/empirical/phyloAMR/
RDS files storing results from the phyloAMR analyses of phenotype-genotype transition associations.
| File | Description |
|---|---|
joint_states_parent_child_dfs.RDS |
Data frames of parent-child node pairs with joint ancestral state assignments across all sampled reconstructions |
joint_states_reconstruction_transition_statistics.RDS |
Transition count statistics (gains, losses, continuations) summarized across all sampled reconstructions |
pairwise_synchronous_transitions_comparisons.RDS |
Pairwise synchronous transition comparison results across all phenotype reconstructions |
pairwise_synchronous_transitions_genotype_as_comparitor.RDS |
Synchronous transition comparisons using each genotype axis as the comparitor |
pairwise_synchronous_transitions_phenotype_as_comparitor.RDS |
Synchronous transition comparisons using the phenotype axis as the comparitor |
phenotype_genotype_joint_uncertainty_asr_clustering.RDS |
HDBSCAN clustering of joint ASR space for the full phenotype-genotype analysis |
phenotype_genotype_joint_uncertainty_asr_clustering_statistics.RDS |
Summary statistics of the clustering result |
phenotype_genotype_transition_association_statistics.RDS |
Permutation-based association test statistics between phenotype and genotype transitions |
representative_blbli_joint_uncertainty_asr_clustering.RDS |
HDBSCAN clustering of the IR-susceptibility (BL/BLI) representative reconstruction space |
representative_blbli_joint_uncertainty_asr_clustering_df.RDS |
Data frame summarizing the representative BL/BLI clustering result |
representative_blbli_joint_uncertainty_asr_clustering_statistics.RDS |
Summary statistics of the representative BL/BLI clustering |
representative_blbli_joint_uncertainty_parent_child_dfs.RDS |
Parent-child data frames for representative BL/BLI reconstructions |
representative_phenotype_genotype_downstream_transition_statistics.RDS |
Downstream transition statistics for representative phenotype-genotype reconstruction pairs |
representative_phenotype_genotype_synchronous_transition_statistics.RDS |
Synchronous transition statistics for representative phenotype-genotype reconstruction pairs |
representative_phenotype_genotype_transition_association_statistics.RDS |
Permutation-based association test statistics for representative reconstruction pairs |
empirical_analysis/tables/empirical/
table_median_stats_across_axes.csv
Summary of reconstruction statistics across all phenotype reconstruction axes (i.e., alternative joint reconstruction clusters). Each row is one reconstruction axis. Values are reported as median (IQR). Columns include: number of reconstructions per axis, phylogenetic events, singleton counts, cluster counts and sizes, transition counts, gain events, loss events, and continuation counts.
table_phenotype_genotype_pairwise_synchronous_transitions_summary.csv
Summary of synchronous transition counts between the phenotype (IR susceptibility) axis and each resistance genotype feature, reported by reconstruction axis. Values are reported as median (IQR). Each column corresponds to one genotype feature (e.g., presence of a specific ompK36 variant or plasmid marker).
empirical_analysis/scripts/empirical/
R Markdown (.Rmd) scripts for all empirical analyses, organized into subdirectories by analysis type. Each .Rmd has a corresponding rendered .html output.
| Subdirectory | Description |
|---|---|
dataset_curation/ |
Scripts to curate and merge the phylogenetic tree and AST dataset |
joint_reconstruction/ |
Scripts to compute and cluster joint ASR samples |
phyloAMR_analyses/ |
Scripts to run phyloAMR association tests between phenotype and genotype transitions |
figures/ |
Scripts to generate all empirical figures |
tables/ |
Scripts to generate summary tables |
The .sbat files in joint_reconstruction/ and phyloAMR_analyses/ are SLURM batch submission scripts used to run computationally intensive steps on an HPC cluster (tested on the University of Michigan Great Lakes cluster). The most intensive steps (get_phenotype_genotype_ancestral_states.Rmd and phyloAMR_representative_phenotype_genotype_association_testing.R) were run with 15 CPUs and 75 GB RAM; adjust --cpus-per-task and --mem in the .sbat files as needed for your cluster.
empirical_analysis/environments/
phyloAMR.yaml
Conda environment specification for reproducing the empirical analysis software environment.
phyloAMR_installation_README
Instructions for installing the phyloAMR R package and its dependencies.
empirical_analysis/lib/consistent_themes.R
Shared ggplot2 theme settings used across empirical figure scripts.
Sharing/Access information
Data was derived from the following sources:
- CRKP isolate collection and AST data from the LTACH cohort described in the associated publication.
Code/Software
All analyses were performed in R. The development version of corHMM (available on GitHub) is required, as the compute_joint_ci() function used throughout is not yet on CRAN.
Required R packages:
| Package | Version tested | Source |
|---|---|---|
| corHMM | 2.10.1 (GitHub) | devtools::install_github("thej022214/corHMM") |
| phyloAMR | 0.1.0 (GitHub) | devtools::install_github("kylegontjes/phyloAMR") |
| ape | 5.8.1 | CRAN |
| geiger | 2.0.11 | CRAN |
| Rtsne | 0.17 | CRAN |
| dbscan | 1.2.4 | CRAN |
| proxy | 0.4.29 | CRAN |
| RColorBrewer | 1.1.3 | CRAN |
| phytools | 2.5.2 | CRAN |
| phangorn | 2.12.1 | CRAN |
| tidyverse | 2.0.0 | CRAN |
| pbmcapply | 1.5.1 | CRAN |
| ggtree | 4.0.4 | Bioconductor |
| parallel | (base R) | — |
The empirical analysis additionally requires a conda environment. Set it up using empirical_analysis/environments/phyloAMR.yaml:
conda env create -f empirical_analysis/environments/phyloAMR.yaml
conda activate joint_uncertainty_corHMM_phyloAMR
Then install the GitHub packages from within R (see empirical_analysis/environments/phyloAMR_installation_README for full instructions).
Install CRAN dependencies for the simulation analysis by running scripts/install_packages.R.
Recommended execution order for simulation analyses:
scripts/install_packages.R— install dependenciesscripts/01_run_simulations.R— simulate data and run reconstructionsscripts/02_run_recon_unc.R— compute joint confidence intervals (parallelized; uses 10 cores by default)scripts/03_compare_jnt-mar.R— compare joint and marginal accuracyscripts/04_analyze_unc_results.R— aggregate results and produce summary tablesscripts/05_run_empirical.R— run the empirical analysis
Scripts 02 and 05 are computationally intensive. Script 02 uses parallel::mclapply() with mc.cores = 10 by default; adjust as needed. The HPC batch scripts in empirical_analysis/scripts/empirical/joint_reconstruction/ and phyloAMR_analyses/ were used for the most intensive empirical steps.
R version: 4.5.2
