Online appendix and simulated data sets for assessment of Birth-Death Exposed-Infectious (BDEI) phylodynamic model estimators

Zhukova, Anna

Published Dec 13, 2022; Updated Sep 11, 2023 on Dryad. https://doi.org/10.5061/dryad.r7sqv9sgx

Data files

Dec 13, 2022 version files 7.88 MB

BDEI_data.zip
7.88 MB
README.md
4.03 KB

Sep 11, 2023 version files 15.45 MB

README.md
5.45 KB
Simulated_data.zip
15.44 MB

Abstract

The birth-death exposed-infectious (BDEI) phylodynamic model describes the transmission of pathogens that have an incubation period (a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits the estimation of this period, along with the other model parameters, from time-scaled phylogenetic trees.

We implemented a highly parallelizable estimator for the BDEI model in a maximum likelihood framework (PyBDEI) using a combination of numerical analysis methods for efficient equation resolution. This dataset contains the assessment of PyBDEI in comparison with a Bayesian implementation in BEAST2 (mtbd package) and a deep learning estimator PhyloDeep: the parameter values estimated by the 3 tools.

PyBDEI and the theoretical findings behind it are described in A Zhukova, F Hecht, Y Maday, and O Gascuel. Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees. Syst Biol 2023. This dataset contains the online Appendix (Fig S1-S3 and Table S1).

Online Appendix

Contains Fig S1-S3 and Table S1.

Simulated data

We assessed the performance of our estimator on two data sets from Voznica et al. 2021 (accessible at doi.org/10.5281/zenodo.7358555):

- medium, a data set of 100 medium-sized trees (200 − 500 tips),
- large, a data set of 100 large trees (5 000 − 10 000 tips).

The data were downloaded from github.com/evolbioinfo/phylodeep (also accessible at doi.org/10.5281/zenodo.7358555, under GNU GPL v3 licence).
To produce medium trees, Voznica et al. generated 10 000 trees with 200 − 500 tips under the BDEI model, with the parameter values sampled uniformly at random within the following boundaries:

- incubation period 1/µ ∈ [0.2, 50]
- basic reproductive number R_0 = λ/ψ ∈ [1, 5]
- infectious period 1/ψ ∈ [1, 10].

They then randomly selected 100 of those 10 000 trees to evaluate with the gold standard method, BEAST2.

To generate the 100 large trees, the same parameter values as for the 100 medium ones were used, but the tree size varied between 5 000 and 10 000 tips.
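As a rough illustration of the sampling ranges above, the sketch below draws one parameter set and derives the rates µ, λ and ψ from it. It assumes that the three listed quantities (1/µ, R_0 and 1/ψ) are the ones drawn uniformly; the function name is illustrative and is not taken from the Voznica et al. code. The sampling probability ρ is left out because its range is not given here.

```python
import random


def sample_bdei_parameters(rng):
    """Draw one BDEI parameter set within the ranges quoted in the text.

    Assumption: 1/mu, R_0 and 1/psi are each sampled uniformly, and the
    rates are then derived from them.
    """
    incubation_period = rng.uniform(0.2, 50)  # 1/mu
    r_naught = rng.uniform(1, 5)              # R_0 = lambda/psi
    infectious_time = rng.uniform(1, 10)      # 1/psi

    mu = 1 / incubation_period                # state change rate
    psi = 1 / infectious_time                 # becoming non-infectious rate
    la = r_naught * psi                       # transmission rate
    return {
        "mu": mu,
        "lambda": la,
        "psi": psi,
        "R_naught": r_naught,
        "infectious_time": infectious_time,
        "incubation_period": incubation_period,
    }


# Seeded generator, so that the draw is reproducible.
params = sample_bdei_parameters(random.Random(42))
```

The dictionary keys reuse the column names of the estimates.tab tables, which makes a sampled set easy to compare with a table row.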

Large forest data set

To evaluate PyBDEI performance on forests, we additionally generated two types of forests for the large data set.

Type 1 forests (e.g. health policy change)

The first type of forests was produced by cutting off the oldest (i.e., closest to the root) 25% of each full tree, and keeping the forest of bottom-75% subtrees (in terms of time). We hence obtained 100 forests representing sub-epidemics that all started at the same time. They can be found in the large/forests folder.
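The cut described above can be sketched on a minimal tree structure. This is an illustration under stated assumptions, not the pipeline used to produce the data: the Node class and function names are hypothetical, the tree's time span is measured from the start of the root branch to the most recent tip, and tips sampled before the cut are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    dist: float = 0.0  # length of the branch leading to this node
    children: list = field(default_factory=list)
    name: str = ""


def tree_height(root):
    """Time from the start of the root branch to the most recent tip."""
    height, stack = 0.0, [(root, 0.0)]
    while stack:
        node, start = stack.pop()
        end = start + node.dist
        height = max(height, end)
        stack.extend((child, end) for child in node.children)
    return height


def cut_oldest_fraction(root, fraction=0.25):
    """Remove the oldest `fraction` of the time span and return the subtrees left."""
    cut_time = fraction * tree_height(root)
    forest, stack = [], [(root, 0.0)]
    while stack:
        node, start = stack.pop()
        end = start + node.dist
        if end > cut_time:
            # The branch crosses the cut: keep the subtree below it, with its
            # root branch shortened to the part that lies after the cut.
            forest.append(Node(end - cut_time, node.children, node.name))
        else:
            # The node lies entirely in the removed part: descend to its
            # children (a tip reached here is discarded).
            stack.extend((child, end) for child in node.children)
    return forest


# Toy tree of total height 4, so the cut falls at time 1.
toy_tree = Node(0.0, [
    Node(0.5, [Node(3.5, name="a"), Node(2.0, name="b")]),
    Node(0.5, name="c"),  # sampled before the cut, hence dropped
    Node(2.0, [Node(1.0, name="d"), Node(2.0, name="e")]),
])
toy_forest = cut_oldest_fraction(toy_tree)
```

Both traversals use an explicit stack rather than recursion, since trees with thousands of tips can be deep enough to exceed Python's default recursion limit.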

Type 2 forests (e.g. multiple introductions to a country)

The second type of forests represented epidemics that started with multiple introductions happening at different times. To generate them, we

1. took the parameter values Θ_i corresponding to each tree Tree_i in the large data set (i ∈ {1, ..., 100});
2. calculated the time T_i between the start of the tree Tree_i and the time of its last sampled tip;
3. kept
   i. uniformly drawing a time T_i,j ∈ [0, T_i], and
   ii. generating a (potentially hidden) tree Tree_i,j under parameters Θ_i until reaching the time T_i,j.

Steps (3.i) and (3.ii) were repeated until the total number of sampled tips in the generated trees reached at least 5 000: ∑_j tips(Tree_i,j) ⩾ 5 000. The resulting forest F_i included those of the trees Tree_i,j that contained at least one sampled tip (i.e., observed trees). These forests can be found in the large/subepidemics folder.
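The repeat-until-enough-tips loop can be sketched as follows. The tree simulator is deliberately left as a caller-supplied function: `simulate_tree` and its signature are hypothetical, and the toy simulator below is only a stand-in that lets the loop run, not a BDEI simulator. The actual pipelines are the ones in the github.com/evolbioinfo/bdei repository.

```python
import random


def generate_type2_forest(simulate_tree, params, t_i, min_tips=5000, rng=None):
    """Simulate trees until the observed ones hold at least `min_tips` sampled tips.

    `simulate_tree(params, duration, rng)` is a hypothetical callable that
    returns a pair (tree, number_of_sampled_tips).
    """
    if rng is None:
        rng = random.Random()
    forest, total_tips = [], 0
    while total_tips < min_tips:
        t_ij = rng.uniform(0, t_i)                       # step 3.i
        tree, n_tips = simulate_tree(params, t_ij, rng)  # step 3.ii
        if n_tips > 0:
            # Hidden trees (no sampled tip) are simulated but not kept.
            forest.append((tree, n_tips))
            total_tips += n_tips
    return forest


def toy_simulator(params, duration, rng):
    """Stand-in simulator: pretends that longer durations yield more tips."""
    n_tips = int(duration)
    return f"toy_tree_{n_tips}_tips", n_tips


toy_forest = generate_type2_forest(
    toy_simulator, params={}, t_i=10.0, min_tips=50, rng=random.Random(1)
)
```

Note that the loop never terminates if the simulator can only return trees without sampled tips, so a real implementation may want a cap on the number of attempts.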
As the BDEI model requires one of the parameters to be fixed in order to become asymptotically identifiable, the sampling probability ρ was fixed to its real value.

Data preparation and parameter estimation pipelines are available at github.com/evolbioinfo/bdei.

This dataset contains:

- a forest of Type 1 for each large tree: large/forests/forest.[0-99].nwk
- a forest of Type 2 for each large tree: large/subepidemics/subepidemic.[0-99].nwk
- the estimated and real parameter values for fixed ρ: medium/estimates.tab and large/estimates.tab tab-separated tables, with the following columns:
  - (first column) - number of the tree/forest, between 0 and 99
  - type - tool used for the estimation: PyBDEI, BEAST2, PhyloDeep, PyBDEI (forest, i.e. PyBDEI applied to a Type 1 forest), PyBDEI (subepidemic, i.e. PyBDEI applied to a Type 2 forest), real (real parameter value)
  - mu - estimated value (or real value, for rows of type real) of the state change rate µ
  - mu_min - estimated 2.5% CI bound of the state change rate µ
  - mu_max - estimated 97.5% CI bound of the state change rate µ
  - lambda - estimated value (or real value, for rows of type real) of the transmission rate λ
  - lambda_min - estimated 2.5% CI bound of the transmission rate λ
  - lambda_max - estimated 97.5% CI bound of the transmission rate λ
  - psi - estimated value (or real value, for rows of type real) of the becoming non-infectious rate ψ
  - psi_min - estimated 2.5% CI bound of the becoming non-infectious rate ψ
  - psi_max - estimated 97.5% CI bound of the becoming non-infectious rate ψ
  - p - sampling probability ρ (real value, fixed parameter)
  - R_naught - estimated value (or real value, for rows of type real) of the basic reproductive number R_0
  - R_naught_min - estimated 2.5% CI bound of the basic reproductive number R_0
  - R_naught_max - estimated 97.5% CI bound of the basic reproductive number R_0
  - infectious_time - estimated value (or real value, for rows of type real) of the infectious period 1/ψ
  - infectious_time_min - estimated 2.5% CI bound of the infectious period 1/ψ
  - infectious_time_max - estimated 97.5% CI bound of the infectious period 1/ψ
  - incubation_period - estimated value (or real value, for rows of type real) of the incubation period 1/µ
  - incubation_period_min - estimated 2.5% CI bound of the incubation period 1/µ
  - incubation_period_max - estimated 97.5% CI bound of the incubation period 1/µ
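One possible way to use these tables is to compare each tool's estimates with the rows of type real. The sketch below uses only the Python standard library and computes a mean relative error per tool for one column. It assumes that the header row has one (possibly empty) entry for the unnamed first column and that every row has a numeric value in the chosen column; the function name is illustrative, and the demo rows are made up to show the layout, not taken from the data set.

```python
import csv
import io


def mean_relative_errors(tab_file, column="incubation_period"):
    """Mean relative error of `column` per tool, against the rows of type 'real'."""
    reader = csv.reader(tab_file, delimiter="\t")
    header = next(reader)
    type_idx, value_idx = header.index("type"), header.index(column)

    real, estimates = {}, {}
    for row in reader:
        # The first column holds the tree/forest number (0 to 99).
        tree_id, tool, value = row[0], row[type_idx], float(row[value_idx])
        if tool == "real":
            real[tree_id] = value
        else:
            estimates.setdefault(tool, {})[tree_id] = value

    return {
        tool: sum(abs(v - real[i]) / real[i] for i, v in values.items()) / len(values)
        for tool, values in estimates.items()
    }


# Made-up rows with the same layout as estimates.tab (only two value columns shown).
demo = io.StringIO(
    "\ttype\tmu\tincubation_period\n"
    "0\treal\t0.1\t10.0\n"
    "0\tPyBDEI\t0.09\t11.0\n"
    "0\tBEAST2\t0.11\t9.0\n"
    "1\treal\t0.25\t4.0\n"
    "1\tPyBDEI\t0.2\t5.0\n"
    "1\tBEAST2\t0.25\t4.0\n"
)
errors = mean_relative_errors(demo)
```

For the real tables, the same function can be called on an opened file, e.g. `with open("large/estimates.tab", newline="") as f: mean_relative_errors(f)`.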

