Data from: Torchtree: flexible phylogenetic model development and inference using PyTorch
Abstract
Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Python that allows developers to easily implement rich phylogenetic models and algorithms using a fixed tree topology. One can either use automatic differentiation, or leverage torchtree's plug-in system to compute gradients analytically for model components for which automatic differentiation is slow. We demonstrate that the torchtree variational inference framework performs similarly to BEAST in terms of speed and approximation accuracy. Furthermore, we explore the use of the forward KL divergence as an optimizing criterion for variational inference, which can handle discontinuous and non-differentiable models. Our experiments show that inference using the forward KL divergence is frequently faster per iteration compared to the evidence lower bound (ELBO) criterion, although the ELBO-based inference may converge faster in some cases. Overall, torchtree provides a flexible and efficient framework for phylogenetic model development and inference using PyTorch.
Mathieu Fourment, Matthew Macaulay, Christiaan J Swanepoel, Xiang Ji, Marc A Suchard, Frederick A Matsen IV. torchtree: flexible phylogenetic model development and inference using PyTorch. arXiv:2406.18044 (2024)
Description of the data
The SI.pdf file contains supplementary methods and figures referenced in the main manuscript (found on Zenodo under Supplemental Information).
The data.zip contains input files and phylogenetic trees used for analyses in the associated manuscript. The data are organized by dataset (HCV and SC2) and by tool (beast and torchtree), and include sequence alignments (see next section for SC2 alignment) and configuration files (xml and json files). torchtree uses variational Bayes while BEAST uses MCMC.
data/
├── HCV/
│   ├── HCV.fasta                     # Sequence alignment for HCV
│   ├── HCV.tree                      # Newick tree
│   └── beast/
│       ├── HCV_skyglide.xml          # BEAST XML input (Skyglide model)
│       └── HCV_skygrid.xml           # BEAST XML input (Skygrid model)
│
└── SC2/
    ├── beast/
    │   ├── SC2_GTR.xml               # BEAST XML input (GTR + Skygrid)
    │   ├── SC2_GTR_skyglide.xml      # BEAST XML input (GTR + Skyglide)
    │   ├── SC2_HKY-RE.xml            # BEAST XML input (HKY with random effect + Skygrid)
    │   └── SC2_HKY-RE_skyglide.xml   # BEAST XML input (HKY with random effect + Skyglide)
    └── torchtree/
        ├── ELBO/
        │   ├── SC2_GTR.json          # torchtree JSON input (ELBO and GTR)
        │   └── SC2_HKY-RE.json       # torchtree JSON input (ELBO and HKY-RE)
        └── KLpq-10/
            ├── SC2_GTR.json          # torchtree JSON input (KL(p||q) and GTR)
            └── SC2_HKY-RE.json       # torchtree JSON input (KL(p||q) and HKY-RE)
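After extracting data.zip, the layout listed above can be checked with a short script. The following is a minimal sketch, not part of the published pipeline: it assumes the archive was unpacked into a directory named data/ in the current working directory, and the names EXPECTED_FILES and missing_files are hypothetical helpers introduced here for illustration. The SC2 alignment is not checked because it is not distributed with the archive.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
# Assumption: data.zip was extracted so that these paths sit under one root.
EXPECTED_FILES = [
    "HCV/HCV.fasta",
    "HCV/HCV.tree",
    "HCV/beast/HCV_skyglide.xml",
    "HCV/beast/HCV_skygrid.xml",
    "SC2/beast/SC2_GTR.xml",
    "SC2/beast/SC2_GTR_skyglide.xml",
    "SC2/beast/SC2_HKY-RE.xml",
    "SC2/beast/SC2_HKY-RE_skyglide.xml",
    "SC2/torchtree/ELBO/SC2_GTR.json",
    "SC2/torchtree/ELBO/SC2_HKY-RE.json",
    "SC2/torchtree/KLpq-10/SC2_GTR.json",
    "SC2/torchtree/KLpq-10/SC2_HKY-RE.json",
]


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "data" is an assumed extraction directory; adjust it to the local path.
    missing = missing_files("data")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed files are present.")
```

An empty result from missing_files only confirms that the listed files exist; it does not validate their contents.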
Sharing/Access information
Due to data sharing limitations, sequences for the SARS-CoV-2 analysis need to be downloaded from GISAID, and the resulting alignment (FASTA format) needs to be provided to the pipeline. The GISAID accession IDs are available in acknowledgements_table.xlsx (GitHub).
The sequences for the HCV analysis can be retrieved from the BEAST tutorial.
Code/Software
The pipeline supporting the results is located at https://github.com/4ment/torchtree-experiments.