Data from: Parameter estimation from phylogenetic trees using neural networks and ensemble learning
Data files
Jan 30, 2026 version files (27.40 MB total)
- All_Data.RData (27.33 MB)
- Plot.R (59.04 KB)
- README.md (4.15 KB)
Abstract
Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation methods (MLE) for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods to estimate parameters of diversification models using deep learning have started to emerge, but how robust neural network methods are at handling the intricate nature of phylogenetic data remains an open question. Here, we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times, and their summary statistics. Our best-performing ensemble neural network (which adjusts the graph neural network result using a recurrent neural network) delivers estimates faster than MLE and shows less sensitivity to tree size for constant-rate and diversity-dependent speciation scenarios. It performs well compared to an existing convolutional network approach. However, like MLE, our approach still fails to recover parameters precisely under a protracted birth-death process. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of effects shaping it. In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. If detectable phylogenetic signals are present, our approach delivers results that are comparable to MLE but without inherent biases.
https://doi.org/10.1093/sysbio/syaf060
Description of the data and file structure
- We provide all research outputs from a complete run on our computing cluster, covering simulation, maximum likelihood estimation, and neural network training and testing. The outputs are contained in two files: an R workspace (All_Data.RData) and an R script (Plot.R). To reproduce the figures presented in our paper, load the workspace in your preferred R-compatible IDE and run the code in the R script.
- For all functions and scripts related to simulations, maximum likelihood estimation, and neural network training and testing, please refer to our codebase, EvoLandEco/eveGNN: Codebase for Phylogenetic Tree Parameter Estimation with Neural Networks (https://github.com/EvoLandEco/eveGNN).
- For an illustration of using trained neural networks for phylogenetic tree parameter estimation, please refer to EvoLandEco/EvoNN: Functions to estimate phylogenetic tree parameters from pre-trained neural networks (https://github.com/EvoLandEco/EvoNN).
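The loading step described above can be sketched as follows. This is a minimal sketch, not the authors' workflow: it assumes that All_Data.RData and Plot.R sit in the current working directory and that Plot.R expects the workspace objects in the global environment. The helper `load_workspace()` is a hypothetical name introduced here for inspecting the workspace without cluttering the global environment.

```r
# Hypothetical helper: load an .RData file into its own environment so the
# stored objects can be inspected without touching the global workspace.
load_workspace <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env(parent = globalenv())
  load(path, envir = env)
  env
}

workspace_file <- "All_Data.RData"  # assumption: file is in the working directory
plot_script <- "Plot.R"             # assumption: script sits next to the workspace

if (file.exists(workspace_file) && file.exists(plot_script)) {
  # Peek at the first few object names before running anything
  results <- load_workspace(workspace_file)
  print(head(ls(results)))

  # Assumption: Plot.R reads the objects from the global environment
  load(workspace_file, envir = globalenv())
  source(plot_script)
} else {
  message("Place All_Data.RData and Plot.R in the working directory first.")
}
```

Inspecting the object names first makes it easier to match them against the data description before the plotting script runs.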
Data & File Overview
Package requirements
- R 4.2.1 (available from the official R website; the latest R release should also work)
- PBD (on CRAN)
- DDD (on CRAN)
- ape (on CRAN)
- tidyverse (on CRAN)
- patchwork (on CRAN)
- ggh4x (on CRAN)
- ggnewscale (on CRAN)
- ggtext (on CRAN)
- devtools (on CRAN)
- eveGNN (go to https://github.com/EvoLandEco/eveGNN, clone the repository to a local path, and use devtools::load_all() to load all functions required)
- EvoNN (go to https://github.com/EvoLandEco/EvoNN)
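Before running Plot.R, it may help to check which of the CRAN packages listed above are already installed. The sketch below only reports missing packages and installs nothing. `find_missing_packages()` is a hypothetical helper name, and the eveGNN path in the final comment is a placeholder for wherever the repository was cloned.

```r
# CRAN packages named in the requirements list above
cran_pkgs <- c("PBD", "DDD", "ape", "tidyverse", "patchwork",
               "ggh4x", "ggnewscale", "ggtext", "devtools")

# Hypothetical helper: return the packages that cannot be loaded
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- find_missing_packages(cran_pkgs)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  message("They can be installed with install.packages().")
} else {
  message("All listed CRAN packages are installed.")
}

# After cloning eveGNN, its functions can be loaded from the local clone,
# e.g. devtools::load_all("path/to/eveGNN")  # placeholder path
```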
Files included:
- Plot.R
- All_Data.RData
Data description:
- bd_boost_{arch_name}: Neural network performance results related to the birth-death model using a particular boost ensemble learning approach
- bd_mle_opt: Best-case MLE performance results related to the birth-death model
- bd_mle_typ: Typical/naive case MLE performance results related to the birth-death model
- ddd_deopim_opt: Best-case MLE performance results related to the birth-death model using the DEoptim optimizer
- ddd_deopim_typp: Typical/naive-case MLE performance results related to the birth-death model using the DEoptim optimizer
- ddd_new_robustness_{args} and ddd_new_robustness: Results on the robustness of neural network inference
- ddd_robustness and ddd_robustness_dnn_lstm: Results on the robustness of neural network inference
- ddd_simplex_opt: Best-case MLE performance results related to the birth-death model using the simplex optimizer
- ddd_simplex_typ: Typical/naive-case MLE performance results related to the birth-death model using the simplex optimizer
- ddd_subplex_opt: Best-case MLE performance results related to the birth-death model using the subplex optimizer
- ddd_subplex_typ: Typical/naive-case MLE performance results related to the birth-death model using the subplex optimizer
- empirical_uncertainty: Uncertainty range of neural network estimates when applied to the empirical trees
- pbd_boost_{arch_name}: Neural network performance results related to the protracted birth-death model using a particular boost ensemble learning approach
- pbd_durspec_{args}: Neural network performance results related to the protracted birth-death model, showing how parameter estimation accuracy changes with the mean duration of speciation
- test_bagging: Neural network performance results related to the diversity-dependent diversification model using the bagging ensemble learning approach
- test_boost_{arch_name}: Neural network performance results related to the diversity-dependent diversification model using a particular boost ensemble learning approach
- test_cnn1d: Neural network performance results related to the diversity-dependent diversification model using the 1-dimensional convolutional neural network architecture
Note: other variables are intermediate or auxiliary.
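Because several entries above are families of objects that share a prefix (for example bd_boost_{arch_name}), grouping the workspace's object names by prefix is one way to see which variants are present. This is a sketch only: `group_by_prefix()` is a hypothetical helper, and the suffixes in the demonstration ("gnn", "lstm") are made up for illustration and may not match the names stored in All_Data.RData.

```r
# Prefixes taken from the data description; the {arch_name} and {args}
# placeholders stand for whatever suffixes the workspace actually uses.
documented_prefixes <- c("bd_boost_", "pbd_boost_", "pbd_durspec_",
                         "test_boost_", "ddd_new_robustness")

# Hypothetical helper: split object names into groups by leading prefix
group_by_prefix <- function(object_names, prefixes) {
  groups <- lapply(prefixes, function(p) {
    object_names[startsWith(object_names, p)]
  })
  names(groups) <- prefixes
  groups
}

# Demonstration on made-up names; the suffixes here are hypothetical
demo_names <- c("bd_boost_gnn", "bd_boost_lstm", "test_boost_gnn", "bd_mle_opt")
str(group_by_prefix(demo_names, documented_prefixes))
```

With the real workspace loaded, `ls()` can be passed in place of `demo_names`.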
