phyddle: Software for exploring phylogenetic models with deep learning
Data files
Sep 12, 2025 version files 223.45 KB
- phyddle_ms_examples.zip (24.98 KB)
- phyddle_ms_supp.pdf (191.73 KB)
- README.md (6.74 KB)
Abstract
Many realistic phylogenetic models lack tractable likelihood functions, prohibiting their use with standard inference methods. We present phyddle, a pipeline-based toolkit for performing phylogenetic modeling tasks using likelihood-free deep learning approaches. phyddle coordinates modeling tasks through five analysis steps (Simulate, Format, Train, Estimate, and Plot) that transform raw phylogenetic datasets as input into numerical and visualized model-based output. Benchmarks show that phyddle accurately performs a range of inference tasks, such as estimating macroevolutionary parameters, selecting among continuous trait evolution models, and passing coverage tests for epidemiological models, even for models that lack tractable likelihoods. phyddle has a flexible command-line interface, making it easy to integrate deep learning approaches for phylogenetics into research workflows.
https://doi.org/10.5061/dryad.612jm64ch
Description of the data and file structure
Files in this repository correspond to:
Landis, M.J. and Thompson, A. 2025. phyddle: software for exploring phylogenetic models with deep learning. Systematic Biology (in press). https://doi.org/10.1093/sysbio/syaf036
Contents:
- Supplementary Information for the study:
Figure S1: Comparison of Bayesian and phyddle estimates for the sampling rate, δ, when all pathogens are sampled during the exponential growth phase of an outbreak (A–D) or sampled at any time during an outbreak (E–H). True parameter values are plotted against phyddle (blue; A and E) and Bayesian (red; B and F) point estimates. Estimated support interval bounds (gold; C, D, G, and H) for phyddle and Bayesian methods are also plotted against each other. Any point that falls on a slope-1, intercept-0 line has perfectly matching x and y values. Data displayed is a random subsample of 50 values (roughly 50%). Intervals shown are 95% CPI (conformalized prediction interval) or HPD (highest posterior density). Bayesian estimates and test data for comparison of exponential phase data (A–D) are from (Thompson et al. 2024). See main text for analysis details (Landis and Thompson 2024).
Figure S2: Comparison of Bayesian and phyddle estimates for the migration rate, m, when all pathogens are sampled during the exponential growth phase of an outbreak (A–D) or sampled at any time during an outbreak (E–H). True parameter values are plotted against phyddle (blue; A and E) and Bayesian (red; B and F) point estimates. Estimated support interval bounds (gold; C, D, G, and H) for phyddle and Bayesian methods are also plotted against each other. Any point that falls on a slope-1, intercept-0 line has perfectly matching x and y values. Data displayed is a random subsample of 50 values (roughly 50%). Intervals shown are 95% CPI (conformalized prediction interval) or HPD (highest posterior density). Bayesian estimates and test data for comparison of exponential phase data (A–D) are from (Thompson et al. 2024). See main text for analysis details (Landis and Thompson 2024).
- Config files and simulator scripts for the three examples in the study.
Files and variables
- phyddle_ms_supp.pdf contains the Supplementary Information for the study. This file contains no data and no variables.
- phyddle_ms_examples.zip is a zipped version of the GitHub repository associated with the study: https://github.com/mlandis/phyddle_ms. The zip file contains three directories that correspond to the three examples in the manuscript. Each example directory contains the phyddle config file and simulator script used to produce the results presented in the paper. Users should consult the phyddle documentation (https://phyddle.org) to fully understand how to use these data.
Example directories:
- bisse_r: Compare maximum likelihood and phyddle parameter estimates for the binary state speciation and extinction (BiSSE) model using R.
- levy_r: Compare Akaike Information Criterion (AIC) and phyddle model classification to select among four continuous trait models using R.
- sirm_master/all_phases and sirm_master/exponential_phase: Compare Bayesian and phyddle parameter estimates for an epidemiological model with migration, with (all phases) and without (exponential phase) model violation, using MASTER.
Each example directory is the output of a phyddle analysis (https://phyddle.org):
bisse_r/
- bisse_r/config.py: phyddle config file that specifies how the phyddle analysis will be run (see phyddle documentation).
- bisse_r/sim_bisse.R: simulates trees and data under the binary state speciation and extinction (BiSSE) model using R. Network prediction targets are the numerical birth rate, death rate, and state transition rate parameters.
levy_r/
- levy_r/config.py: phyddle config file that specifies how the phyddle analysis will be run (see phyddle documentation).
- levy_r/sim_levy.R: simulates trees and data under univariate continuous trait models (Brownian motion, Ornstein-Uhlenbeck, Early Burst, and a Lévy process with jumps) using R. Network prediction target is a categorical model classification variable, named "model_type".
sirm_master/all_phases/
- sirm_master/all_phases/config.py: phyddle config file that specifies how the phyddle analysis will be run (see phyddle documentation).
- sirm_master/all_phases/post_exponentialphase_SIRM_sim_one.py: simulates trees and data under an SIR model with migration, with sampling throughout the entire outbreak. Network prediction targets are the numerical basic reproduction number (R0), sampling rate (delta), and migration rate (m) variables.
sirm_master/exponential_phase/
- sirm_master/exponential_phase/config.py: phyddle config file that specifies how the phyddle analysis will be run (see phyddle documentation).
- sirm_master/exponential_phase/thompsonetal2024_SIRM_sim_one.py: simulates trees and data under an SIR model with migration, with sampling during the exponential growth phase of the outbreak. Network prediction targets are the numerical basic reproduction number (R0), sampling rate (delta), and migration rate (m) variables.
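After unzipping, the layout described above can be checked with a short script. This is a minimal sketch and not part of the archive: the name missing_files is hypothetical, and it assumes the listed paths sit directly under the directory you pass in. If the archive unpacks into an extra top-level folder, point root at that folder instead.

```python
from pathlib import Path

# Relative paths copied from the file descriptions in this README.
EXPECTED_FILES = [
    "bisse_r/config.py",
    "bisse_r/sim_bisse.R",
    "levy_r/config.py",
    "levy_r/sim_levy.R",
    "sirm_master/all_phases/config.py",
    "sirm_master/all_phases/post_exponentialphase_SIRM_sim_one.py",
    "sirm_master/exponential_phase/config.py",
    "sirm_master/exponential_phase/thompsonetal2024_SIRM_sim_one.py",
]


def missing_files(root):
    """Return the expected files that are not present under `root`.

    `root` is assumed to be the directory that directly contains the
    example directories (bisse_r, levy_r, sirm_master).
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Default to the current directory when no path is given.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_files(root)
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files found.")
```

An empty result means every config file and simulator script named above was found.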
Usage
To re-run the phyddle analyses:
- Install the phyddle software following these instructions: link.
- Download and unzip the Dryad repository.
- Open shell and enter the Dryad repository directory.
- Enter the subdirectory for the example you want to try (e.g. bisse_r or levy_r).
- Run phyddle -c config.py -s SFTEP --end_idx 50000 to run the phyddle pipeline with 50,000 simulated training examples.
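The same run can be scripted, for example to repeat it across several example directories. The sketch below only assembles the command given in the steps above and, optionally, runs it with subprocess. The helper names (STEP_NAMES, build_command, run_example) are hypothetical. The mapping from the letters in SFTEP to the five steps named in the abstract is inferred from their initials and is not stated in this README, so check the phyddle documentation before relying on it.

```python
import subprocess

# Assumption: each letter passed to -s is the initial of one of the five
# pipeline steps named in the abstract.
STEP_NAMES = {
    "S": "Simulate",
    "F": "Format",
    "T": "Train",
    "E": "Estimate",
    "P": "Plot",
}


def build_command(config="config.py", steps="SFTEP", end_idx=50000):
    """Assemble the phyddle command line from the Usage section.

    Only the flags that appear in this README (-c, -s, --end_idx) are used.
    """
    unknown = [s for s in steps if s not in STEP_NAMES]
    if unknown:
        raise ValueError("unrecognized step letters: " + "".join(unknown))
    return ["phyddle", "-c", config, "-s", steps, "--end_idx", str(end_idx)]


def run_example(example_dir, **kwargs):
    """Run the pipeline inside `example_dir` (requires phyddle on the PATH)."""
    return subprocess.run(build_command(**kwargs), cwd=example_dir, check=True)


if __name__ == "__main__":
    # Print the command and the assumed step order without running anything.
    print(" ".join(build_command()))
    print(" -> ".join(STEP_NAMES[s] for s in "SFTEP"))
```

Calling run_example("bisse_r") from the unzipped repository directory would then be equivalent to entering bisse_r and running the command by hand.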
Code/software
The study uses the software package, phyddle. Learn more about phyddle here:
- Paper: https://doi.org/10.1093/sysbio/syaf036
- Documentation: https://phyddle.org
- Software: https://github.com/mlandis/phyddle
References
Landis, M. J., & Thompson, A. (2024). phyddle: Software for phylogenetic model exploration with deep learning. bioRxiv.
Thompson, A., Liebeskind, B. J., Scully, E. J., & Landis, M. J. (2024). Deep learning and likelihood approaches for viral phylogeography converge on the same answers whether the inference model is right or wrong. Systematic Biology, 73(1), 183–206.
