Data from: On the utility of deep learning for model classification and parameter estimation on complex diversification scenarios

Gutiérrez de la Peña, Pablo 1; Iglesias, Guillermo 2; Talavera, Edgar 2; Meseguer, Andrea 1; Sanmartín, Isabel 1

Research facility: Consejo Superior de Investigaciones Científicas

Published May 12, 2026 on Dryad. https://doi.org/10.5061/dryad.f7m0cfz6b

Data files

May 12, 2026 version files (2.56 GB)

File	Size
README.md	9.38 KB
simulated_trees_deepbd.zip	2.56 GB

Abstract

Birth-Death models applied to dated phylogenies are a useful tool to study past diversification dynamics. Parameters in these stochastic models are typically inferred using likelihood-based methods such as Maximum Likelihood Estimation (MLE) or Bayesian Inference, though some of the most complex models present computational tractability issues. Recent years have witnessed the development of Deep Learning (DL) methods applied to evolutionary biology and phylogenetic inference. Here, we explore the power of Convolutional Neural Networks (CNNs), a type of DL method, to solve classification and regression (parameter estimation) tasks under six different rate-constant and rate-variable diversification scenarios: Constant Birth-Death, High-Extinction, Mass-Extinction, Diversity-Dependent, Stasis-and-Radiate, and Waxing-and-Waning. We simulated 10,000 phylogenetic trees under each diversification scenario, which were encoded using a vectorization procedure that captures the topology and branch length information. The encoded trees were used to train and test a set of CNN models tailored to three empirical case studies differing in the number of tips. We compared the CNNs' performance with MLE inference. Our results show that CNNs reached classification accuracy levels of 80-90%, whereas maximum likelihood estimation achieved levels of 60-69%, using AIC as the model selection criterion. The most difficult scenarios for the CNNs to predict were the high-extinction and mass-extinction scenarios, which were often misidentified as one another. For the regression tasks, CNN models obtained generally lower mean average errors than MLE inference, irrespective of the number of tips in the simulated phylogenies, though differences were small. The only exception was the discrete time event parameter in the episodic diversification scenarios (Mass-Extinction, Stasis-and-Radiate, and Waxing-and-Waning), in which MLE inference showed a lower error than the CNNs. Finally, we illustrate and discuss the application of our CNNs to real-world phylogenies, using three classic empirical case studies: eucalypts, conifers, and cetaceans.

Dataset overview

This dataset (simulated_trees_deepbd.zip) contains the simulated, extant-only phylogenetic trees used for the analyses in:

Gutiérrez de la Peña, P., Iglesias, G., Talavera, E., Sánchez Meseguer, A., and Sanmartín, I. 2026. On the Utility of Deep Learning for Model Classification and Parameter Estimation on Complex Diversification Scenarios. Systematic Biology. https://doi.org/10.1093/sysbio/syag030

Related code repository:
https://github.com/pablogpena/deep_birth_death

The data were generated to train and test convolutional neural networks for the classification of diversification scenarios and estimation of diversification parameters from dated phylogenies. Each row in a CSV file corresponds to one independent simulated tree. Trees are stored as Newick strings.

The files include six diversification scenarios:

File prefix	Scenario in manuscript	Short description
`BD`	Constant Birth-Death, CBD	Constant speciation and extinction through time.
`HE`	High Extinction, HE	Constant birth-death with high relative extinction.
`ME_rho`	Mass Extinction, ME	Constant birth-death interrupted by a discrete mass-extinction event with survival probability `rho`.
`SAT`	Diversity Dependent, DD	Diversity-dependent birth-death simulation with declining speciation toward a carrying capacity. The filename prefix `SAT` is used in these files.
`SR`	Stasis and Radiate, SR	Piecewise-constant model with an older low-diversification phase and a later rapid-radiation phase.
`WW`	Waxing and Waning, WW	Piecewise-constant model with an older positive-diversification phase and a later declining-diversification phase.

Folder organization

The root folder contains three subfolders, named by the target number of extant
tips in the simulated trees:

Folder	Target tree size	Main contents
`87_10k/`	87 extant tips	Raw and rescaled CSV files for all six scenarios, plus `timing_log.txt`.
`489_10k/`	489 extant tips	Raw and rescaled CSV files for all six scenarios, plus `timing_log.txt`.
`674_10k/`	674 extant tips	Raw and rescaled CSV files for all six scenarios, plus `timing_log.txt`.

Each subfolder contains these CSV files:

BD_sim_no_fossil10000.csv
BD_sim_no_fossil10000_rescale.csv
HE_sim_no_fossil10000.csv
HE_sim_no_fossil10000_rescale.csv
ME_rho_sim_no_fossil10000.csv
ME_rho_sim_no_fossil10000_rescale.csv
SAT_sim_no_fossil10000.csv
SAT_sim_no_fossil10000_rescale.csv
SR_sim_no_fossil10000.csv
SR_sim_no_fossil10000_rescale.csv
WW_sim_no_fossil10000.csv
WW_sim_no_fossil10000_rescale.csv

The suffix no_fossil indicates that the Newick trees contain extant tips only; extinct or fossil lineages are not included in the tree strings. Files ending in _rescale.csv duplicate the original parameters and add their rescaled values used in the deep-learning workflow.
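The naming scheme above is regular enough to enumerate programmatically. The following is a minimal sketch that builds the 18 raw and 18 rescaled paths from the folder names and prefixes listed in this README. The root folder name `simulated_trees_deepbd` and the helper `expected_csv_paths` are assumptions made for illustration; adjust the root to wherever the archive was extracted.

```python
from pathlib import PurePosixPath

# Folder names and scenario prefixes as listed in this README.
FOLDERS = ["87_10k", "489_10k", "674_10k"]
PREFIXES = ["BD", "HE", "ME_rho", "SAT", "SR", "WW"]


def expected_csv_paths(root="simulated_trees_deepbd", rescaled=False):
    """Return the CSV paths expected under `root`.

    `root` is a hypothetical extraction folder, not a name confirmed
    by the dataset.
    """
    suffix = "_rescale.csv" if rescaled else ".csv"
    return [
        PurePosixPath(root) / folder / f"{prefix}_sim_no_fossil10000{suffix}"
        for folder in FOLDERS
        for prefix in PREFIXES
    ]


raw_paths = expected_csv_paths()
rescaled_paths = expected_csv_paths(rescaled=True)
print(len(raw_paths), len(rescaled_paths))
```

Comparing this list against the extracted archive is a quick way to confirm that no file is missing.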

The files use a pipe character (|) as the field separator.
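Because Newick strings contain commas, a comma-based reader would split the tree column; the pipe separator avoids this. The sketch below reads a pipe-delimited table with Python's standard `csv` module. It assumes that each file starts with a header row and that the column names match those listed under "Data: raw CSV files"; the sample row and its tiny tree are made up for the example, and `read_trees` is a hypothetical helper.

```python
import csv
import io

# Illustrative stand-in for a raw file: header names follow this README,
# but the row values and the three-tip tree are invented for the example.
SAMPLE = (
    "n_tips|r0|r1|a0|a1|time|frac0|frac1|tree\n"
    "3|0.1|0.1|0.5|0.5|0|1|1|((t1:1.0,t2:1.0):0.5,t3:1.5);\n"
)

NUMERIC_COLUMNS = ["r0", "r1", "a0", "a1", "time", "frac0", "frac1"]


def read_trees(handle):
    """Yield one dict per simulated tree from a pipe-delimited file handle."""
    reader = csv.DictReader(handle, delimiter="|")
    for row in reader:
        # int(float(...)) tolerates both "87" and "87.0" spellings.
        row["n_tips"] = int(float(row["n_tips"]))
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        yield row


rows = list(read_trees(io.StringIO(SAMPLE)))
print(rows[0]["n_tips"], rows[0]["tree"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object.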

Simulation context

The simulations were generated with the R package TreeSim using scripts in
the associated code repository:

simulations/code/sim_phylogeny.r
simulations/code/sim_DD.r
simulations/config_sim.r
simulations/code/simulate_all.sh

The simulations targeted extant-only trees with 87, 489, or 674 living tips. For the constant-rate scenarios (BD, HE), trees were simulated with sim.bd.taxa. For ME_rho, SR, and WW, trees were simulated with sim.rateshift.taxa. For SAT / DD, trees were simulated with sim.rateshift.taxa using zero extinction and carrying capacity K = n_tips + 1.

The simulation scripts retained trees that matched the requested number of living tips and did not include dead tips in the output tree. For rate-shift and event-based models, the scripts also required the crown age to predate the event time by the configured buffer.

Parameter definitions

The manuscript parameterizes birth-death simulations using:

a = mu / lambda
r = lambda - mu

where lambda is the speciation rate, mu is the extinction rate, a is relative extinction, and r is net diversification. The simulation code derives
mu and lambda as:

mu = (a * r) / (1 - a)
lambda = r + mu
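These expressions follow from substituting mu = a * lambda into r = lambda - mu, which gives r = lambda * (1 - a). A minimal Python sketch of the conversion in both directions is shown below. The function names are hypothetical, and the explicit guard for a = 1 is an addition for illustration, not part of the simulation code.

```python
def rates_from_a_r(a, r):
    """Convert relative extinction a and net diversification r to (lambda, mu)."""
    if a == 1:
        # 1 - a is zero, so the rates are undefined (guard added for this sketch).
        raise ValueError("a = 1 makes mu and lambda undefined")
    mu = (a * r) / (1 - a)
    lam = r + mu
    return lam, mu


def a_r_from_rates(lam, mu):
    """Inverse conversion: a = mu / lambda, r = lambda - mu."""
    return mu / lam, lam - mu


# Positive diversification: a < 1 and r > 0.
lam, mu = rates_from_a_r(a=0.5, r=0.1)
print(lam, mu)

# Declining diversification: a > 1 and r < 0 still give positive rates.
lam_decline, mu_decline = rates_from_a_r(a=2.0, r=-0.1)
print(lam_decline, mu_decline)
```

The second call illustrates that the same formulas cover a declining phase, where extinction exceeds speciation.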

For event-based and piecewise scenarios, time is the event or shift time measured before the present. Following the manuscript convention, r1 and a1 describe the interval before the modeled event, while r0 and a0 describe the interval after the event. For constant-rate models, r1 = r0 and a1 = a0.

Data: raw CSV files

The raw files are named <PREFIX>_sim_no_fossil10000.csv and contain these
columns:

Column	Type	Description
`n_tips`	integer	Number of extant tips in the simulated tree.
`r0`	numeric	Net diversification parameter for state/interval 0. For `SAT`, this column instead stores the DD speciation parameter `lambda0`.
`r1`	numeric	Net diversification parameter for state/interval 1. Equal to `r0` in `BD`, `HE`, and `ME_rho`; sentinel value in `SAT`.
`a0`	numeric	Relative extinction for state/interval 0, `a0 = mu0 / lambda0`. Sentinel value in `SAT`.
`a1`	numeric	Relative extinction for state/interval 1, `a1 = mu1 / lambda1`. Equal to `a0` in `BD`, `HE`, and `ME_rho`; sentinel value in `SAT`.
`time`	numeric	Shift or event time before the present. It is `0` for `BD`, `HE`, and `SAT`.
`frac0`	numeric	Sampling/survival fraction for state/interval 0. In these simulations it is usually `1`.
`frac1`	numeric	Sampling/survival fraction for state/interval 1. In `ME_rho`, this is the mass-extinction survival probability `rho`.
`tree`	character	Simulated phylogenetic tree in Newick format. Branch lengths are included.
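As a quick sanity check that needs no phylogenetics library, the `n_tips` column can be compared with the number of leaves in the Newick string. The sketch below relies on the fact that a tree with L leaves contains L - 1 commas, and it assumes that tip labels contain no commas, quotes, or bracketed comments. The helper names and the example row are hypothetical.

```python
def count_tips(newick):
    """Count leaves as commas + 1.

    Assumes labels contain no commas, quotes, or bracketed comments.
    """
    return newick.count(",") + 1


def tips_match(row):
    """Return True if the `n_tips` column agrees with the Newick string."""
    return count_tips(row["tree"]) == int(row["n_tips"])


# Made-up three-tip tree, not taken from the dataset.
example_row = {"n_tips": 3, "tree": "((t1:1.0,t2:1.0):0.5,t3:1.5);"}
print(tips_match(example_row))
```

A dedicated Newick parser remains the safer choice if labels might contain special characters.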

Data: `_rescale.csv` files

The _rescale.csv files contain all raw columns plus derived columns used by
the deep-learning preprocessing workflow.

Column	Type	Description
`resc_factor`	numeric	Average branch length of the original Newick tree, computed from all `node.dist` values returned by an `ete3` traversal.
`mu0`, `mu1`	numeric	Extinction rates derived from `r0`, `r1`, `a0`, and `a1` using `mu = (a * r) / (1 - a)`.
`lambda0`, `lambda1`	numeric	Speciation rates derived as `lambda = r + mu`.
`norm_r0`, `norm_r1`	numeric	`r0` and `r1` multiplied by `resc_factor`. For `SAT`, `norm_r0` is the rescaled DD `lambda0` target used by the neural-network workflow.
`norm_a0`, `norm_a1`	numeric	`a0` and `a1` multiplied by `resc_factor`. These columns were generated for compatibility with the preprocessing table; model training used unscaled `a` values.
`norm_time`	numeric	`time` multiplied by `resc_factor`.
`norm_frac0`, `norm_frac1`	numeric	`frac0` and `frac1` multiplied by `resc_factor`. These columns were generated for compatibility; model training used unscaled fraction values where needed.
`norm_mu0`, `norm_mu1`	numeric	`mu0` and `mu1` multiplied by `resc_factor`.
`norm_lambda0`, `norm_lambda1`	numeric	`lambda0` and `lambda1` multiplied by `resc_factor`.
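Every `norm_*` column in the table above is the corresponding raw column multiplied by `resc_factor`. The sketch below applies that rule to a single row, which can help when checking a file or rescaling additional values. The helper `add_normalized` and the numbers in the example row are made up for illustration; `resc_factor` is taken as given rather than recomputed from the tree.

```python
def add_normalized(row, columns=("r0", "r1", "time")):
    """Return a copy of `row` with norm_<column> = <column> * resc_factor."""
    out = dict(row)
    for column in columns:
        out[f"norm_{column}"] = row[column] * row["resc_factor"]
    return out


# Invented values, chosen only to make the arithmetic easy to follow.
example = {"r0": 0.1, "r1": 0.2, "time": 10.0, "resc_factor": 0.5}
normalized = add_normalized(example)
print(normalized["norm_r0"], normalized["norm_r1"], normalized["norm_time"])
```

The default `columns` tuple is limited to quantities measured in time or rate units; pass other column names to rescale them as well.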

Important note for SAT / DD rescaled files: because a0 and a1 are sentinel values equal to 1, the generic formulas for mu and lambda divide by 1 - a and therefore produce infinite values in mu0, mu1, lambda0, lambda1, and their normalized counterparts. For SAT / DD analyses, use r0 as the DD lambda0 parameter and norm_r0 as its rescaled value.
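One way to follow this note in practice is to flag non-finite columns before analysis and to read the DD targets from `r0` and `norm_r0` only. The sketch below does both for a single row. The helper names and the example values are hypothetical, and how infinite values are spelled in the files is not stated in this README; Python's `float()` parses strings such as `inf` and `Inf`.

```python
import math


def nonfinite_columns(row):
    """Return the names of float-valued columns that are infinite or NaN."""
    return sorted(
        name
        for name, value in row.items()
        if isinstance(value, float) and not math.isfinite(value)
    )


def sat_targets(row):
    """For SAT / DD rows, read lambda0 from r0 and its rescaled value from norm_r0."""
    return {"lambda0": row["r0"], "norm_lambda0": row["norm_r0"]}


# Invented SAT-like row: a0 is the sentinel 1, so the derived rates are infinite.
sat_row = {
    "r0": 0.3,
    "a0": 1.0,
    "resc_factor": 0.5,
    "mu0": float("inf"),
    "lambda0": float("inf"),
    "norm_r0": 0.15,
    "norm_mu0": float("inf"),
}
print(nonfinite_columns(sat_row))
print(sat_targets(sat_row))
```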

Timing logs

Each subfolder contains timing_log.txt, a plain-text runtime log generated by the simulation shell script. These files record the date of the run, the scenario being simulated, R package loading messages, and wall/user/system time reported by the Unix time command. They are provenance files rather than simulation data tables.