Data from: On the utility of deep learning for model classification and parameter estimation on complex diversification scenarios
Data files
May 12, 2026 version files 2.56 GB
- README.md (9.38 KB)
- simulated_trees_deepbd.zip (2.56 GB)
Abstract
Birth-Death models applied to dated phylogenies are a useful tool to study past diversification dynamics. Parameters in these stochastic models are typically inferred using likelihood-based methods such as Maximum Likelihood Estimation (MLE) or Bayesian Inference, though some of the most complex models present computational tractability issues. Recent years have witnessed the development of Deep Learning (DL) methods applied to evolutionary biology and phylogenetic inference. Here, we explore the power of Convolutional Neural Networks (CNNs), a type of DL method, to solve classification and regression (parameter estimation) tasks under six different rate-constant and rate-variable diversification scenarios: Constant Birth-Death, High-Extinction, Mass-Extinction, Diversity-Dependent, Stasis-and-Radiate, and Waxing-and-Waning. We simulated 10,000 phylogenetic trees under each diversification scenario, which were encoded using a vectorization procedure that captures the topology and branch length information. The encoded trees were used to train and test a set of CNN models that were designed to tailor three empirical case studies differing in the number of tips. We compared the CNN's performance with MLE inference. Our results show that CNNs exhibited classification accuracy levels of 90-80%, whereas maximum likelihood estimation achieved levels of 69-60%, using AIC as model selection criterion. The most difficult scenarios to predict for the CNNs were the high-extinction and mass-extinction scenarios, which were often misidentified as one another. For the regression tasks, CNN models obtained generally lower mean average errors than MLE inference, irrespective of the number of tips in the simulated phylogenies, though differences were small. The only exception was the discrete time event parameter in the episodic diversification scenarios (Mass-Extinction, Stasis-and-Radiate, and Waxing-and-Waning), in which MLE inference showed a lower error than the CNNs.
Finally, we illustrate and discuss the application of our CNNs to real-world phylogenies, using three classic empirical case studies: eucalypts, conifers, and cetaceans.
Dataset overview
The archive simulated_trees_deepbd.zip contains the simulated, extant-only phylogenetic trees used for the analyses in:
Gutiérrez de la Pena, P., Iglesias, G., Talavera, E., Sánchez Meseguer, A.,
and Sanmartin, I. 2026. On the Utility of Deep Learning for Model Classification and Parameter Estimation on Complex Diversification Scenarios. Systematic Biology. https://doi.org/10.1093/sysbio/syag030
Related code repository:
https://github.com/pablogpena/deep_birth_death
The data were generated to train and test convolutional neural networks for the classification of diversification scenarios and estimation of diversification parameters from dated phylogenies. Each row in a CSV file corresponds to one independent simulated tree. Trees are stored as Newick strings.
The files include six diversification scenarios:
| File prefix | Scenario in manuscript | Short description |
|---|---|---|
| BD | Constant Birth-Death, CBD | Constant speciation and extinction through time. |
| HE | High Extinction, HE | Constant birth-death with high relative extinction. |
| ME_rho | Mass Extinction, ME | Constant birth-death interrupted by a discrete mass-extinction event with survival probability rho. |
| SAT | Diversity Dependent, DD | Diversity-dependent birth-death simulation with declining speciation toward a carrying capacity. The filename prefix SAT is used in these files. |
| SR | Stasis and Radiate, SR | Piecewise-constant model with an older low-diversification phase and a later rapid-radiation phase. |
| WW | Waxing and Waning, WW | Piecewise-constant model with an older positive-diversification phase and a later declining-diversification phase. |
Folder organization
The root folder contains three subfolders, named by the target number of extant
tips in the simulated trees:
| Folder | Target tree size | Main contents |
|---|---|---|
| 87_10k/ | 87 extant tips | Raw and rescaled CSV files for all six scenarios, plus timing_log.txt. |
| 489_10k/ | 489 extant tips | Raw and rescaled CSV files for all six scenarios, plus timing_log.txt. |
| 674_10k/ | 674 extant tips | Raw and rescaled CSV files for all six scenarios, plus timing_log.txt. |
Each subfolder contains these CSV files:
BD_sim_no_fossil10000.csv
BD_sim_no_fossil10000_rescale.csv
HE_sim_no_fossil10000.csv
HE_sim_no_fossil10000_rescale.csv
ME_rho_sim_no_fossil10000.csv
ME_rho_sim_no_fossil10000_rescale.csv
SAT_sim_no_fossil10000.csv
SAT_sim_no_fossil10000_rescale.csv
SR_sim_no_fossil10000.csv
SR_sim_no_fossil10000_rescale.csv
WW_sim_no_fossil10000.csv
WW_sim_no_fossil10000_rescale.csv
The suffix no_fossil indicates that the Newick trees contain extant tips only; extinct or fossil lineages are not included in the tree strings. Files ending in _rescale.csv duplicate the original parameters and add their rescaled values used in the deep-learning workflow.
The files use a pipe character (|) as the field separator.
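The naming scheme and separator described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not code from the associated repository: the helper names `csv_path` and `read_rows` are hypothetical, the root folder name is whatever the archive extracts to, and the sketch assumes that each file starts with a header row.

```python
import csv
import io
from pathlib import Path

TREE_SIZES = (87, 489, 674)
PREFIXES = ("BD", "HE", "ME_rho", "SAT", "SR", "WW")


def csv_path(root, n_tips, prefix, rescaled=False):
    """Build the path to one CSV file from the folder and file naming scheme."""
    if n_tips not in TREE_SIZES or prefix not in PREFIXES:
        raise ValueError("unknown tree size or scenario prefix")
    suffix = "_rescale" if rescaled else ""
    return Path(root) / f"{n_tips}_10k" / f"{prefix}_sim_no_fossil10000{suffix}.csv"


def read_rows(handle):
    """Yield one dict per simulated tree from a pipe-separated file handle."""
    # Assumption: the first line of the file is a header row with column names.
    yield from csv.DictReader(handle, delimiter="|")


# Toy two-column example (made-up values, not taken from the dataset).
toy = io.StringIO("n_tips|tree\n3|((t1:1,t2:1):1,t3:2);\n")
first = next(read_rows(toy))
print(first["n_tips"], first["tree"])
```

For a real file, the handle would come from something like `open(csv_path(root, 87, "BD"), newline="")`. All values are returned as strings, so numeric columns still need converting with `float` or `int`.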
Simulation context
The simulations were generated with the R package TreeSim using scripts in
the associated code repository:
simulations/code/sim_phylogeny.r
simulations/code/sim_DD.r
simulations/config_sim.r
simulations/code/simulate_all.sh
The simulations targeted extant-only trees with 87, 489, or 674 living tips. For the constant-rate scenarios (BD, HE), trees were simulated with sim.bd.taxa. For ME_rho, SR, and WW, trees were simulated with sim.rateshift.taxa. For SAT / DD, trees were simulated with sim.rateshift.taxa using zero extinction and carrying capacity K = n_tips + 1.
The simulation scripts retained trees that matched the requested number of living tips and did not include dead tips in the output tree. For rate-shift and event-based models, the scripts also required the crown age to predate the event time by the configured buffer.
Parameter definitions
The manuscript parameterizes birth-death simulations using:
a = mu / lambda
r = lambda - mu
where lambda is the speciation rate, mu is the extinction rate, a is relative extinction, and r is net diversification. The simulation code derives
mu and lambda as:
mu = (a * r) / (1 - a)
lambda = r + mu
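These expressions follow from the two definitions: substituting lambda = mu / a into r = lambda - mu gives r = mu * (1 - a) / a, hence mu = (a * r) / (1 - a). A minimal Python sketch of the conversion is shown below. The function name `rates_from_a_r` is hypothetical and is not taken from the simulation scripts; the only guard is against a = 1, where the denominator is zero.

```python
def rates_from_a_r(a, r):
    """Return (lambda, mu) from relative extinction a and net diversification r."""
    if a == 1:
        # 1 - a is zero, so mu and lambda are undefined for this value.
        raise ValueError("a = 1 makes the denominator 1 - a equal to zero")
    mu = (a * r) / (1 - a)
    lam = r + mu
    return lam, mu


# Example: a = 0.5 and r = 0.1 correspond to lambda = 0.2 and mu = 0.1.
lam, mu = rates_from_a_r(0.5, 0.1)
print(lam, mu)
```

The formula also covers declining phases: with a > 1 and r < 0, both the numerator and the denominator are negative, so mu stays positive.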
For event-based and piecewise scenarios, time is the event or shift time measured before the present. Following the manuscript convention, r1 and a1 describe the interval before the modeled event, while r0 and a0 describe the interval after the event. For constant-rate models, r1 = r0 and a1 = a0.
Data: raw CSV files
The raw files are named <PREFIX>_sim_no_fossil10000.csv and contain these
columns:
| Column | Type | Description |
|---|---|---|
| n_tips | integer | Number of extant tips in the simulated tree. |
| r0 | numeric | Net diversification parameter for state/interval 0. For SAT, this column instead stores the DD speciation parameter lambda0. |
| r1 | numeric | Net diversification parameter for state/interval 1. Equal to r0 in BD, HE, and ME_rho; sentinel value in SAT. |
| a0 | numeric | Relative extinction for state/interval 0, a0 = mu0 / lambda0. Sentinel value in SAT. |
| a1 | numeric | Relative extinction for state/interval 1, a1 = mu1 / lambda1. Equal to a0 in BD, HE, and ME_rho; sentinel value in SAT. |
| time | numeric | Shift or event time before the present. It is 0 for BD, HE, and SAT. |
| frac0 | numeric | Sampling/survival fraction for state/interval 0. In these simulations it is usually 1. |
| frac1 | numeric | Sampling/survival fraction for state/interval 1. In ME_rho, this is the mass-extinction survival probability rho. |
| tree | character | Simulated phylogenetic tree in Newick format. Branch lengths are included. |
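A quick consistency check between the n_tips and tree columns can be sketched without a phylogenetics library. In a Newick string, every internal node with k children contributes k - 1 commas, so the number of tips equals the number of commas plus one. The helper names below are hypothetical, and the sketch assumes that tip labels are unquoted and contain no commas.

```python
def count_tips(newick):
    """Count the tips of a Newick tree as (number of commas) + 1."""
    # Assumption: labels are unquoted and contain no commas.
    return newick.count(",") + 1


def tips_match(row):
    """Check that the n_tips field of a row agrees with its Newick string."""
    return int(row["n_tips"]) == count_tips(row["tree"])


# Toy row with made-up values, not taken from the dataset.
toy_row = {"n_tips": "3", "tree": "((t1:1,t2:1):1,t3:2);"}
print(tips_match(toy_row))
```

A full parser such as ete3, which the rescaling step mentions, is the better choice whenever branch lengths or topology are needed rather than a simple count.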
Data: _rescale.csv files
The _rescale.csv files contain all raw columns plus derived columns used by
the deep-learning preprocessing workflow.
| Column | Type | Description |
|---|---|---|
| resc_factor | numeric | Average branch length of the original Newick tree, computed from all node.dist values returned by an ete3 traversal. |
| mu0, mu1 | numeric | Extinction rates derived from r0, r1, a0, and a1 using mu = (a * r) / (1 - a). |
| lambda0, lambda1 | numeric | Speciation rates derived as lambda = r + mu. |
| norm_r0, norm_r1 | numeric | r0 and r1 multiplied by resc_factor. For SAT, norm_r0 is the rescaled DD lambda0 target used by the neural-network workflow. |
| norm_a0, norm_a1 | numeric | a0 and a1 multiplied by resc_factor. These columns were generated for compatibility with the preprocessing table; model training used unscaled a values. |
| norm_time | numeric | time multiplied by resc_factor. |
| norm_frac0, norm_frac1 | numeric | frac0 and frac1 multiplied by resc_factor. These columns were generated for compatibility; model training used unscaled fraction values where needed. |
| norm_mu0, norm_mu1 | numeric | mu0 and mu1 multiplied by resc_factor. |
| norm_lambda0, norm_lambda1 | numeric | lambda0 and lambda1 multiplied by resc_factor. |
Important note for SAT / DD rescaled files: because a0 and a1 are sentinel values equal to 1, the generic formulas for mu and lambda divide by 1 - a and therefore produce infinite values in mu0, mu1, lambda0, lambda1, and their normalized counterparts. For SAT / DD analyses, use r0 as the DD lambda0 parameter and norm_r0 as its rescaled value.
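One way to guard against these infinite values when reading a _rescale.csv row is sketched below. It is an illustration only, not part of the published workflow: the helper name `finite_values` is hypothetical, and the sketch assumes that infinite entries are written in a spelling Python's `float` accepts (such as `Inf` or `inf`), treating anything unparsable as missing.

```python
import math


def finite_values(row):
    """Convert a row's numeric fields to float, dropping non-finite entries."""
    out = {}
    for name, text in row.items():
        if name == "tree":
            continue  # the Newick string is not numeric
        try:
            value = float(text)  # accepts "inf", "Inf", and "nan" spellings
        except ValueError:
            continue  # assumption: unparsable entries (e.g. "NA") are missing
        if math.isfinite(value):
            out[name] = value
    return out


# Toy SAT-like row with made-up values, not taken from the dataset.
toy_row = {
    "r0": "0.5", "a0": "1", "resc_factor": "2.0",
    "mu0": "Inf", "lambda0": "Inf", "norm_r0": "1.0",
    "tree": "(t1:1,t2:1);",
}
clean = finite_values(toy_row)
print(sorted(clean))
# norm_r0 should equal r0 * resc_factor, as defined in the table above.
print(math.isclose(clean["norm_r0"], clean["r0"] * clean["resc_factor"]))
```

For SAT rows this leaves r0 and norm_r0 available while the undefined mu and lambda columns are skipped.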
Timing logs
Each subfolder contains timing_log.txt, a plain-text runtime log generated by the simulation shell script. These files record the date of the run, the scenario being simulated, R package loading messages, and wall/user/system time reported by the Unix time command. They are provenance files rather than simulation data tables.
