Online appendix and simulated data sets for assessment of Birth-Death Exposed-Infectious (BDEI) phylodynamic model estimators
Data files
Dec 13, 2022 version files (7.88 MB):
- BDEI_data.zip
- README.md
Sep 11, 2023 version files (15.45 MB):
- README.md
- Simulated_data.zip
Abstract
The birth-death exposed-infectious (BDEI) phylodynamic model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits its estimation along with other parameters, from time-scaled phylogenetic trees.
We implemented a highly parallelizable estimator for the BDEI model in a maximum likelihood framework (PyBDEI) using a combination of numerical analysis methods for efficient equation resolution. This dataset contains the assessment of PyBDEI in comparison with a Bayesian implementation in BEAST2 (mtbd package) and a deep learning estimator PhyloDeep: the parameter values estimated by the 3 tools.
PyBDEI and the theoretical findings behind it are described in A Zhukova, F Hecht, Y Maday, and O Gascuel. Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees. Syst Biol 2023. This dataset also contains the online Appendix (Fig S1-S3 and Table S1).
README: Online Appendix and simulated data for Zhukova et al. Syst Biol 2023
Online Appendix
Contains Fig S1-S3 and Table S1.
Simulated data
We assessed the performance of our estimator on two data sets from Voznica et al. 2021 (accessible at doi.org/10.5281/zenodo.7358555):
- medium, a data set of 100 medium-sized trees (200 − 500 tips),
- large, a data set of 100 large trees (5 000 − 10 000 tips).
The data were downloaded from github.com/evolbioinfo/phylodeep (also accessible at doi.org/10.5281/zenodo.7358555, under the GNU GPL v3 licence).
To produce medium trees, Voznica et al. generated 10 000 trees with 200 − 500 tips under the BDEI model, with the parameter values sampled uniformly at random within the following boundaries:
- incubation period 1/µ ∈ [0.2, 50]
- basic reproductive number R_0 = λ/ψ ∈ [1, 5]
- infectious period 1/ψ ∈ [1, 10].
They then randomly selected 100 of those 10 000 trees to evaluate with the gold-standard method, BEAST2.
To generate the 100 large trees, the same parameter values as for the 100 medium ones were used, but the tree size varied between 5 000 and 10 000 tips.
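The relation between the sampled quantities and the BDEI rates can be illustrated with a minimal sketch. It assumes that the uniform sampling is done directly on the incubation period, R_0, and the infectious period (as the boundaries above suggest), and it leaves out both the sampling probability and the tree simulation itself. The function name and the dictionary keys are illustrative only, not part of the published pipeline.

```python
import random


def sample_bdei_parameters(rng):
    """Draw one parameter set uniformly within the boundaries listed above."""
    incubation_period = rng.uniform(0.2, 50)  # 1/mu
    r0 = rng.uniform(1, 5)                    # R_0 = lambda/psi
    infectious_period = rng.uniform(1, 10)    # 1/psi
    mu = 1 / incubation_period                # state change rate
    psi = 1 / infectious_period               # becoming non-infectious rate
    la = r0 * psi                             # transmission rate, from R_0 = lambda/psi
    return {"mu": mu, "lambda": la, "psi": psi}


rng = random.Random(0)
# 10 000 candidate parameter sets, of which 100 are picked at random,
# mirroring the two-stage selection described above (trees themselves omitted).
candidates = [sample_bdei_parameters(rng) for _ in range(10_000)]
selected = rng.sample(candidates, 100)
```

The sketch only shows how µ, λ and ψ follow from the three sampled quantities. The actual trees were simulated by Voznica et al.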
Large forest data set
To evaluate PyBDEI performance on forests, we additionally generated two types of forests for the large data set.
Type 1 forests (e.g. health policy change)
The first type of forests was produced by cutting off the oldest (i.e., closest to the root) 25% of each full tree, and keeping the forest of bottom-75% subtrees (in terms of time). We hence obtained 100 forests representing sub-epidemics that all started at the same time. They can be found in the large/forests folder.
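The cut can be sketched on a toy tree structure. This is only an illustration under stated assumptions: the `Node` class and the `cut_forest` function are hypothetical, the tree height is taken as the time between the start of the root branch and the most recent tip, and recursion is used for brevity (very deep trees would need an iterative traversal). The actual pipeline is the one in the repository referenced in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    dist: float                                   # length of the branch above this node
    children: list = field(default_factory=list)  # empty for tips


def tree_height(node, start=0.0):
    """Time from the start of the root branch to the most recent tip."""
    end = start + node.dist
    return max([tree_height(c, end) for c in node.children], default=end)


def cut_forest(root, fraction=0.25):
    """Remove the oldest `fraction` of the tree (in time) and return the subtrees below the cut."""
    cut_time = fraction * tree_height(root)
    forest = []

    def visit(node, start):
        end = start + node.dist
        if end > cut_time:
            # The branch crosses the cut: keep only its part below the cut.
            forest.append(Node(node.name, end - cut_time, node.children))
        else:
            # The node is older than the cut: drop it and look at its children.
            for child in node.children:
                visit(child, end)

    visit(root, 0.0)
    return forest


tree = Node("root", 0.0, [
    Node("A", 1.0),                                    # tip sampled before the cut: dropped
    Node("X", 3.0, [Node("B", 5.0), Node("C", 1.0)]),  # branch crossing the cut
    Node("D", 8.0),                                    # long tip branch crossing the cut
])
forest = cut_forest(tree)
print([(t.name, t.dist) for t in forest])  # [('X', 1.0), ('D', 6.0)]
```

In this toy example the tree height is 8, so the cut is at time 2: tip A disappears, and the subtrees rooted at X and D start at the cut with shortened root branches.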
Type 2 forests (e.g. multiple introductions to a country)
The second type of forests represented epidemics that started with multiple introductions happening at different times. To generate them we
1. took the parameter values Θ_i corresponding to each tree Tree_i in the large dataset (i ∈ {1, ..., 100})
2. calculated the time T_i between the start of the tree Tree_i and the time of its last sampled tip
3. kept
   i. uniformly drawing a time T_{i,j} ∈ [0, T_i], and
   ii. generating a (potentially hidden) tree Tree_{i,j} under parameters Θ_i till reaching the time T_{i,j}.
Steps (3.i) and (3.ii) were repeated till the total number of sampled tips in the generated trees reached at least 5 000: Σ_j tips(Tree_{i,j}) ⩾ 5 000. The resulting forest F_i included those of the trees Tree_{i,j} that contained at least one sampled tip (i.e., observed trees). These forests can be found in the large/subepidemics folder.
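The repetition of steps (3.i) and (3.ii) can be sketched as a simple loop. The tree simulator is deliberately left as a parameter: `simulate_tree` and `toy_simulator` below are hypothetical placeholders (the toy one just returns a random tip count and is not a BDEI simulator), and how the drawn time is used inside the simulator is not specified by this sketch.

```python
import random


def build_type2_forest(params, total_time, simulate_tree, rng, min_tips=5000):
    """Repeat steps 3.i and 3.ii until the observed trees hold at least `min_tips` sampled tips."""
    forest = []
    n_tips = 0
    while n_tips < min_tips:
        t_ij = rng.uniform(0, total_time)        # step 3.i: draw a time in [0, T_i]
        tree = simulate_tree(params, t_ij, rng)  # step 3.ii: hypothetical simulator call
        if tree["sampled_tips"] > 0:             # hidden trees (no sampled tip) are discarded
            forest.append(tree)
            n_tips += tree["sampled_tips"]
    return forest


def toy_simulator(params, time, rng):
    """Placeholder only: returns a random number of sampled tips, possibly zero."""
    return {"time": time, "sampled_tips": rng.randint(0, int(10 * time))}


forest = build_type2_forest({}, 10.0, toy_simulator, random.Random(1))
total = sum(tree["sampled_tips"] for tree in forest)
```

Hidden trees contribute no sampled tips, so skipping them does not change the stopping criterion. The loop ends as soon as the running total reaches the threshold.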
As the BDEI model requires one of the parameters to be fixed in order to become asymptotically identifiable, ρ was fixed to the real value.
Data preparation and parameter estimation pipelines are available at github.com/evolbioinfo/bdei.
This dataset contains:
- a forest of Type 1 for each large tree: large/forests/forest.[0-99].nwk
- a forest of Type 2 for each large tree: large/subepidemics/subepidemic.[0-99].nwk
- the estimated and real parameter values for fixed ρ: medium/estimates.tab and large/estimates.tab tab-separated tables, with the following columns:
  - (first column) - number of the tree/forest, between 0 and 99
  - type - tool used for the estimation: PyBDEI, BEAST2, PhyloDeep, PyBDEI (forest, i.e. PyBDEI applied to a Type 1 forest), PyBDEI (subepidemic, i.e. PyBDEI applied to a Type 2 forest), real (real parameter value)
  - mu - estimated (or real, for rows of type real) value of the state change rate µ
  - mu_min - estimated 2.5% CI value of the state change rate µ
  - mu_max - estimated 97.5% CI value of the state change rate µ
  - lambda - estimated (or real, for rows of type real) value of the transmission rate λ
  - lambda_min - estimated 2.5% CI value of the transmission rate λ
  - lambda_max - estimated 97.5% CI value of the transmission rate λ
  - psi - estimated (or real, for rows of type real) value of the becoming non-infectious rate ψ
  - psi_min - estimated 2.5% CI value of the becoming non-infectious rate ψ
  - psi_max - estimated 97.5% CI value of the becoming non-infectious rate ψ
  - p - sampling probability ρ (real value, fixed parameter)
  - R_naught - estimated (or real, for rows of type real) value of the basic reproductive number R_0
  - R_naught_min - estimated 2.5% CI value of the basic reproductive number R_0
  - R_naught_max - estimated 97.5% CI value of the basic reproductive number R_0
  - infectious_time - estimated (or real, for rows of type real) value of the infectious period 1/ψ
  - infectious_time_min - estimated 2.5% CI value of the infectious period 1/ψ
  - infectious_time_max - estimated 97.5% CI value of the infectious period 1/ψ
  - incubation_period - estimated (or real, for rows of type real) value of the incubation period 1/µ
  - incubation_period_min - estimated 2.5% CI value of the incubation period 1/µ
  - incubation_period_max - estimated 97.5% CI value of the incubation period 1/µ
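As an illustration of how these tables might be used, the sketch below compares each tool's estimate with the real value of the same tree. It relies on assumptions that should be checked against the actual files: that the first column has an empty header, and that the chosen column is filled in for every row. The sample rows are made up to mimic the layout and are not real results. The relative error is just one possible metric, chosen here for simplicity.

```python
import csv
import io

# Made-up rows that only mimic the column layout described above (not real results).
SAMPLE = (
    "\ttype\tmu\tlambda\tpsi\n"
    "0\treal\t0.10\t0.50\t0.25\n"
    "0\tPyBDEI\t0.11\t0.48\t0.25\n"
    "0\tBEAST2\t0.09\t0.55\t0.20\n"
)


def relative_errors(handle, column):
    """Return {tool: [relative error per tree]} for one column of an estimates table."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    type_idx, col_idx = header.index("type"), header.index(column)
    # Real value of the column for each tree number (first column of the table).
    real = {r[0]: float(r[col_idx]) for r in body if r[type_idx] == "real"}
    errors = {}
    for r in body:
        if r[type_idx] != "real":
            err = abs(float(r[col_idx]) - real[r[0]]) / real[r[0]]
            errors.setdefault(r[type_idx], []).append(err)
    return errors


errors = relative_errors(io.StringIO(SAMPLE), "mu")
```

For a real table, the same function could be called on an opened file, e.g. `with open("large/estimates.tab", newline="") as handle:`.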
Usage notes
The data are uploaded as zip archives. The simulated data files within the archives can be opened with a text editor. The Appendix.pdf file can be opened with Adobe Acrobat or any other PDF viewer.