Online appendix and simulated data sets for assessment of Birth-Death Exposed-Infectious (BDEI) phylodynamic model estimators
Data files
Dec 13, 2022 version files 7.88 MB

BDEI_data.zip

README.md
Sep 11, 2023 version files 15.45 MB

README.md

Simulated_data.zip
Abstract
The birth-death exposed-infectious (BDEI) phylodynamic model describes the transmission of pathogens featuring an incubation period (a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits estimation of this period, along with other parameters, from time-scaled phylogenetic trees.
We implemented a highly parallelizable estimator for the BDEI model in a maximum likelihood framework (PyBDEI) using a combination of numerical analysis methods for efficient equation resolution. This dataset contains the assessment of PyBDEI in comparison with a Bayesian implementation in BEAST2 (mtbd package) and a deep learning estimator, PhyloDeep: the parameter values estimated by the three tools.
PyBDEI and the theoretical findings behind it are described in A Zhukova, F Hecht, Y Maday, and O Gascuel. Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees. Syst Biol 2023. This dataset contains the online Appendix (Fig. S1-S3 and Table S1).
README: Online Appendix and simulated data for Zhukova et al. Syst Biol 2023
Online Appendix
Contains Fig. S1-S3 and Table S1.
Simulated data
We assessed the performance of our estimator on two data sets from Voznica et al. 2021 (accessible at doi.org/10.5281/zenodo.7358555):
 medium, a data set of 100 medium-sized trees (200 − 500 tips),
 large, a data set of 100 large trees (5 000 − 10 000 tips).
The data were downloaded from github.com/evolbioinfo/phylodeep (also accessible at doi.org/10.5281/zenodo.7358555, under the GNU GPL v3 licence).
To produce medium trees, Voznica et al. generated 10 000 trees with 200 − 500 tips under the BDEI model, with the parameter values sampled uniformly at random within the following boundaries:
 incubation period 1/µ ∈ [0.2, 50]
 basic reproductive number R_0 = λ/ψ ∈ [1, 5]
 infectious period 1/ψ ∈ [1, 10].
They then randomly selected 100 of those 10 000 trees to evaluate them with the gold standard method, BEAST2.
To generate the 100 large trees, the same parameter values as for the 100 medium ones were used, but the tree size varied between 5 000 and 10 000 tips.
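The relationship between the sampled quantities and the BDEI rates can be illustrated with a minimal sketch. It assumes each of the three quantities is drawn independently and uniformly within the stated boundaries, and converts them to rates via µ = 1/(incubation period), ψ = 1/(infectious period) and λ = R_0 · ψ. The function name, the dictionary keys and the seed are illustrative choices, not taken from the Voznica et al. pipeline, and no trees are simulated here.

```python
import random


def sample_bdei_parameters(rng):
    """Draw one parameter set uniformly within the boundaries listed above.

    Illustrative helper (hypothetical name): it only samples parameter values,
    it does not simulate a tree.
    """
    incubation_period = rng.uniform(0.2, 50)  # 1/mu
    r_naught = rng.uniform(1, 5)              # lambda/psi
    infectious_period = rng.uniform(1, 10)    # 1/psi
    mu = 1 / incubation_period                # state change rate
    psi = 1 / infectious_period               # rate of becoming non-infectious
    la = r_naught * psi                       # transmission rate
    return {"mu": mu, "lambda": la, "psi": psi, "R_naught": r_naught}


rng = random.Random(42)  # arbitrary seed, for reproducibility of the sketch
candidates = [sample_bdei_parameters(rng) for _ in range(10_000)]
selected = rng.sample(candidates, 100)  # random subset of 100 parameter sets
```

Sampling R_0 and the infectious period and deriving λ from them (rather than sampling λ directly) keeps R_0 within [1, 5] by construction.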
Large forest data set
To evaluate PyBDEI performance on forests, we additionally generated two types of forests for the large data set.
Type 1 forests (e.g. health policy change)
The first type of forests was produced by cutting off the oldest (i.e., closest to the root) 25% of each full tree and keeping the forest of bottom-75% subtrees (in terms of time). We hence obtained 100 forests representing sub-epidemics that all started at the same time. They can be found in the large/forests folder.
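The cutting step can be sketched on a toy tree structure. This is only an illustration under stated assumptions: the `Node` class and both functions are hypothetical (the actual pipeline at github.com/evolbioinfo/bdei may work differently), time is measured from the start of the root branch, and a branch crossing the cut is trimmed so that every resulting subtree starts exactly at the cut time.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy tree node (hypothetical): dist is the length of the incoming branch."""
    name: str
    dist: float = 0.0
    children: list = field(default_factory=list)


def tree_height(node, start=0.0):
    """Time from the start of the node's incoming branch to its latest tip."""
    end = start + node.dist
    if not node.children:
        return end
    return max(tree_height(child, end) for child in node.children)


def cut_forest(root, fraction=0.25):
    """Drop the oldest `fraction` of the tree (in time) and return the subtrees below the cut."""
    threshold = fraction * tree_height(root)
    forest = []

    def visit(node, start):
        end = start + node.dist
        if end > threshold:
            # The branch crosses the cut: keep the node as a subtree root and
            # trim its branch to start at the threshold (children are shared, not copied).
            forest.append(Node(node.name, end - threshold, node.children))
        else:
            # The node lies entirely in the removed part: look further down.
            # Tips sampled before the cut are dropped.
            for child in node.children:
                visit(child, end)

    visit(root, 0.0)
    return forest


# Toy example: total height 4, so the cut is placed at time 1.
tree = Node("root", 0.0, [
    Node("X", 2.0, [Node("a", 2.0), Node("b", 2.0)]),
    Node("c", 4.0),
])
forest = cut_forest(tree)
```

In this toy example the forest contains two subtrees (rooted at `X` and `c`), both spanning the remaining 75% of the time.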
Type 2 forests (e.g. multiple introductions to a country)
The second type of forests represented epidemics that started with multiple introductions happening at different times. To generate them we
 1. took the parameter values Θ_{i} corresponding to each tree Tree_{i} in the large dataset (i ∈ {1, . . . , 100})
 2. calculated the time T_{i} between the start of the tree Tree_{i} and the time of its last sampled tip
 3. kept
   i. uniformly drawing a time T_{i,j} ∈ [0, T_{i}], and
   ii. generating a (potentially hidden) tree Tree_{i,j} under parameters Θ_{i} till reaching the time T_{i,j}.
Steps (3.i) and (3.ii) were repeated until the total number of sampled tips in the generated trees reached at least 5 000: ∑_{j} tips(Tree_{i,j}) ⩾ 5 000. The resulting forest F_{i} included those of the trees Tree_{i,j} that contained at least one sampled tip (i.e., observed trees). These forests can be found in the large/subepidemics folder.
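The repetition of the drawing and generation steps can be sketched as a simple loop. This is a schematic illustration only: `generate_type2_forest` and `toy_simulator` are hypothetical names, each tree is represented merely by the list of its sampled tip labels, and the toy simulator is NOT a BDEI simulator (it returns a random number of placeholder tips so that the loop can be run). The drawn time T_{i,j} is passed to the simulator as is.

```python
import random


def generate_type2_forest(simulate_tree, theta, t_i, min_tips=5000, rng=None):
    """Repeat the two sub-steps until at least `min_tips` sampled tips are accumulated.

    `simulate_tree(theta, t_ij, rng)` is a user-supplied callable (hypothetical
    interface) returning the sampled tips of one tree; an empty result stands
    for a hidden tree.
    """
    rng = rng or random.Random()
    observed, total_tips = [], 0
    while total_tips < min_tips:
        t_ij = rng.uniform(0, t_i)              # draw a time in [0, T_i]
        tips = simulate_tree(theta, t_ij, rng)  # generate a (potentially hidden) tree
        total_tips += len(tips)
        if tips:                                # keep observed trees only
            observed.append(tips)
    return observed


def toy_simulator(theta, t_ij, rng):
    """Placeholder, NOT a BDEI simulator: random tip count that grows with t_ij."""
    n = rng.randint(0, int(10 * t_ij))
    return [f"tip{k}" for k in range(n)]


forest = generate_type2_forest(toy_simulator, theta=None, t_i=50.0, rng=random.Random(1))
```

Passing the simulator as an argument keeps the stopping rule separate from the model, so a real BDEI tree simulator could be plugged in without changing the loop.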
As the BDEI model requires one of the parameters to be fixed in order to become asymptotically identifiable, the sampling probability ρ was fixed to its real value.
Data preparation and parameter estimation pipelines are available at github.com/evolbioinfo/bdei
This dataset contains:
 a forest of Type 1 for each large tree: large/forests/forest.[0-99].nwk
 a forest of Type 2 for each large tree: large/subepidemics/subepidemic.[0-99].nwk
 the estimated and real parameter values for fixed ρ: medium/estimates.tab and large/estimates.tab tab-separated tables, with the following columns:
 (first column): number of the tree/forest, between 0 and 99
 type: tool used for the estimation: PyBDEI, BEAST2, PhyloDeep, PyBDEI (forest), i.e. PyBDEI applied to a Type 1 forest, PyBDEI (subepidemic), i.e. PyBDEI applied to a Type 2 forest, or real (real parameter value)
 mu: estimated (or real, for rows of type real) value of the state change rate µ
 mu_min: estimated lower (2.5%) CI bound of the state change rate µ
 mu_max: estimated upper (97.5%) CI bound of the state change rate µ
 lambda: estimated (or real, for rows of type real) value of the transmission rate λ
 lambda_min: estimated lower (2.5%) CI bound of the transmission rate λ
 lambda_max: estimated upper (97.5%) CI bound of the transmission rate λ
 psi: estimated (or real, for rows of type real) value of the rate of becoming non-infectious ψ
 psi_min: estimated lower (2.5%) CI bound of the rate of becoming non-infectious ψ
 psi_max: estimated upper (97.5%) CI bound of the rate of becoming non-infectious ψ
 p: sampling probability ρ (real value, fixed parameter)
 R_naught: estimated (or real, for rows of type real) value of the basic reproductive number R_0
 R_naught_min: estimated lower (2.5%) CI bound of the basic reproductive number R_0
 R_naught_max: estimated upper (97.5%) CI bound of the basic reproductive number R_0
 infectious_time: estimated (or real, for rows of type real) value of the infectious period 1/ψ
 infectious_time_min: estimated lower (2.5%) CI bound of the infectious period 1/ψ
 infectious_time_max: estimated upper (97.5%) CI bound of the infectious period 1/ψ
 incubation_period: estimated (or real, for rows of type real) value of the incubation period 1/µ
 incubation_period_min: estimated lower (2.5%) CI bound of the incubation period 1/µ
 incubation_period_max: estimated upper (97.5%) CI bound of the incubation period 1/µ
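As an example of how these tables might be used, the sketch below computes, for one parameter, the relative error of each tool's estimate with respect to the real value of the same tree. It relies on several assumptions: the tree number is read from the first column by position (its header name is not specified above), numeric cells use a decimal point, and empty cells are skipped. The function name is hypothetical, and the three sample rows are invented for illustration only; they are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def relative_errors(handle, parameter):
    """Return {tool: {tree_id: |estimate - real| / real}} for one parameter column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    type_col, par_col = header.index("type"), header.index(parameter)
    real, estimates = {}, defaultdict(dict)
    for row in reader:
        if not row or not row[par_col]:
            continue  # skip blank lines and missing values
        tree_id, tool, value = row[0], row[type_col], float(row[par_col])
        if tool == "real":
            real[tree_id] = value
        else:
            estimates[tool][tree_id] = value
    return {
        tool: {t: abs(v - real[t]) / real[t] for t, v in values.items()}
        for tool, values in estimates.items()
    }


# Invented rows, for illustration only (not actual values from the dataset).
sample = "\ttype\tmu\n0\treal\t0.5\n0\tPyBDEI\t0.55\n0\tBEAST2\t0.45\n"
errors = relative_errors(io.StringIO(sample), "mu")
```

With the unpacked archive, the same function could be applied to a real table, e.g. `with open("large/estimates.tab") as f: errors = relative_errors(f, "mu")`.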
Usage notes
The data are uploaded as zip archives. The simulated data files within the archives can be opened with a text editor. The Appendix.pdf file can be opened with Adobe Acrobat or any other PDF viewer.