Training and test data with scripts for simulation-trained deep learning and likelihood-based phylogeography comparisons

Thompson, Ammon

Published Jan 03, 2024; Updated Jan 09, 2024 on Dryad. https://doi.org/10.25338/B8SH2J

Data files

Jan 03, 2024 version files (3.03 GB)
- dryad_deeplearning_phylogeo_thompson_etal_2024_scripts_data_revision2.tar.gz (3.03 GB)
- README.md (4.64 KB)

Jan 09, 2024 version files (3.03 GB)
- dryad_deeplearning_phylogeo_thompson_etal_2024_scripts_data_revision2.tar.gz (3.03 GB)
- README.md (4.64 KB)

Abstract

Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among five locations and found they achieve similar levels of accuracy to Bayesian inference under the true simulation model. We compared robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.

All experiments were run on the following platforms, with the software versions listed under Software and Libraries:

Simulations: On an AWS EC2 instance running Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-1092-aws x86_64).
Simulation Experiment Results Analyses:
- Platform: x86_64-pc-linux-gnu (64-bit) on Windows Subsystem for Linux v2.
- Running under: Ubuntu 20.04.3 LTS.

Training and Other Large Data Files

The scripts are also available on GitHub, without the large data files: https://github.com/ammonthompson/phylogeo_epi_cnn.

Software and Libraries

BEAST 2 v2.6.3 with the package MASTER v6.1.2 installed for simulation.
- The scripts assume an executable named beast2 is in a directory on your $PATH.
- To provide it, create the following soft link in a directory on your $PATH: ln -s /path/to/bin/beast beast2
R, Rscript v4.1.1 with the following libraries for simulation and analysis:
- vioplot v0.3.7
- expm v0.999.6
- BEST v0.5.4
- phytools v0.7.90
- rjson v0.2.20
Python3 v3.8.10 with the following packages for machine learning:
- numpy v1.19.5
- scipy v1.6.3
- pandas v1.3.4
- ete3 v3.1.2
- matplotlib v3.1.2
- keras v2.6.0
- scikit-learn v0.24.2
Seq-Gen v1.3.2 for simulating genome sequence evolution. The scripts assume it is on your $PATH.
RevBayes v1.1.0 for phylogenetic analysis
- TensorPhylo plugin (downloaded around July 2021; currently at https://bitbucket.org/mrmay/tensorphylo/src/master/)
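
Before launching the pipeline, it can help to confirm that the pinned Python packages and the external executables are visible from the current environment. The following is a minimal sketch and is not part of the archive. The package versions are copied from the list above and `beast2` is the soft link described there; the executable names `seq-gen`, `rb` (RevBayes) and `Rscript`, and both function names, are assumptions about a typical local install.

```python
import shutil
from importlib import metadata  # standard library from Python 3.8 onward

# Versions copied from the "Software and Libraries" list.
PINNED_PACKAGES = {
    "numpy": "1.19.5",
    "scipy": "1.6.3",
    "pandas": "1.3.4",
    "ete3": "3.1.2",
    "matplotlib": "3.1.2",
    "keras": "2.6.0",
    "scikit-learn": "0.24.2",
}

# "beast2" is the soft link from the list above; the other names are
# assumptions about how the tools are installed on your machine.
EXECUTABLES = ["beast2", "seq-gen", "rb", "Rscript"]


def missing_executables(names):
    """Return the executable names that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


def version_mismatches(pinned):
    """Map package name -> (wanted, found) for every package whose
    installed version differs from the pinned one (found is None if
    the package is not installed)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name in missing_executables(EXECUTABLES):
        print(f"not on $PATH: {name}")
    for name, (wanted, found) in version_mismatches(PINNED_PACKAGES).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

A version mismatch is not necessarily fatal; the check only reports where the local environment differs from the one the experiments were run in.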

Pipeline Overview:

The pipeline generates and analyzes phylogeographic and epidemiological data. Execution begins with the “randomParams_simulation.sh” script, which in turn invokes “simulateTreeAndAlignment.sh” and other scripts.

Directory Structure:

The root contains the simulation scripts for generating the training and testing data. The following directories contain scripts and data for conducting the analyses in the study. The scripts directory contains support scripts for simulation.

neural_network_dev:
- Dedicated to neural network development and related utilities.
- Contains scripts for extracting labels from parameter files (including a version 2 of the label extraction script) and for computing means.
- Python Modules:
  - cnn_utilities.py: Functions and utilities related to convolutional neural networks.
- uq_and_adequacy: Houses utilities for uncertainty quantification and model adequacy.
- Jupyter Notebooks for training and testing CNNs
phylo_analysis:
- Focuses on the analysis of phylogenetic trees and related data.
- Features Rev scripts for phylogenetic inference, handling and processing tree sets, fixing specific parameters, setting true values, and running IQ-TREE.
- Python Modules:
  - split_columns.py: A utility to split columns in a dataset.
real_data_analysis:
- Pertains to the analysis of real-world data sets from Nadeau et al. 2021.
- Contains scripts to adjust tree features such as branch lengths, tip ages, and polytomies.
scripts:
- General utility scripts for simulation and file processing tasks in the pipeline.
- Subdirectories include scripts for handling ‘cblv’ formatted data, generating XML files, extracting migration rates from XML, etc.
- Python Modules:
  - Several utilities including scripts to handle tree encoding, modify tree structures, and vectorize trees.
- R Scripts:
  - A suite of R scripts for generating random numbers, adjusting metadata, visualization, and performing specific analyses on population statistics, branch lengths, etc.
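
A quick way to confirm that an unpacked copy matches the layout described above is to check for the four named directories. This is a minimal sketch, not part of the archive: the function name `missing_directories` is hypothetical, and the root path is whichever directory you extracted the archive or cloned the repository into.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = [
    "neural_network_dev",
    "phylo_analysis",
    "real_data_analysis",
    "scripts",
]


def missing_directories(root, expected=EXPECTED_DIRS):
    """Return the expected subdirectory names that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # "." assumes the check is run from the root of the unpacked copy.
    absent = missing_directories(".")
    if absent:
        print("missing directories:", ", ".join(absent))
    else:
        print("all expected directories are present")
```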

Analysis Scripts:

The pipeline also offers a suite of standalone scripts and modules in Python and R for tasks like data visualization, parameter tuning, branch length computation, and more.

Execution:

To initiate the pipeline, execute the “randomParams_simulation.sh” script, which orchestrates the simulation and subsequent analysis.

Simulation

Simulation settings are passed into the program with a control file like the one in:
./control_files/testing_controlfile.txt

Basic command to simulate training and test trees:
```shell
randomParams_Simulation.sh path/to/control_file.txt num_locations sim_num_from,sim_num_to path/to/output_dir/output_file_prefix
```
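
When many simulation batches are launched, it may be convenient to assemble this command from Python instead of typing it by hand. The sketch below is an illustration under stated assumptions, not part of the archive: the function names are hypothetical, the script name is spelled as in the command above, and the example values (five locations, simulations 1 to 10, the output prefix) are placeholders.

```python
import subprocess

# Script name spelled as in the basic command above.
SCRIPT = "randomParams_Simulation.sh"


def build_simulation_command(control_file, num_locations, sim_from, sim_to,
                             output_prefix, script=SCRIPT):
    """Return the argument list for one simulation run.

    The argument order follows the basic command: control file, number of
    locations, "from,to" simulation numbers, output path prefix.
    """
    return [
        script,
        str(control_file),
        str(num_locations),
        f"{sim_from},{sim_to}",  # comma-separated, no spaces
        str(output_prefix),
    ]


def run_simulation(*args, **kwargs):
    """Build the command and run it, raising if the script exits non-zero."""
    cmd = build_simulation_command(*args, **kwargs)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values; only the control file path comes from this README.
    cmd = build_simulation_command(
        "./control_files/testing_controlfile.txt", 5, 1, 10,
        "path/to/output_dir/output_file_prefix",
    )
    print(" ".join(cmd))
```

If the script is not on your $PATH, pass its location explicitly, for example `script="./randomParams_Simulation.sh"`.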

Phylogenetic analysis

See README file in /phylo_analysis/

CNN training and testing

See README file in /neural_network_dev/

Analysis and Plotting

The final analysis script that generates results and figures for the manuscript is:

analysis_plotting/extant_analysis_and_plot_results.R