PhyloCNN: Improving tree representation and neural network architecture for deep learning from trees in phylodynamics and diversification studies
Data files
Dec 03, 2025 version files 148.90 MB
- PhyloCNN_GitHub_12Nov2025.zip (148.87 MB)
- Primates-traits-ultrametric-tree.nwk (12.49 KB)
- README.md (8.09 KB)
- ZurichHIV-tree.nwk (9.09 KB)
Abstract
Phylodynamics and diversification studies using complex evolutionary models can be challenging, especially with traditional likelihood-based approaches. As an alternative, likelihood-free simulation-based approaches have been proposed due to their ability to incorporate complex models and scenarios. Here, we propose a new simulation-based deep learning (DL) method capable of selecting birth-death models and accurately estimating their parameters in both phylodynamics and diversification studies. We use a convolutional approach, where trees are encoded using the neighborhood of all nodes and leaves of the input phylogeny. We also developed a dedicated neural network architecture called PhyloCNN. Using simulations, we compared the accuracy of PhyloCNN when using a variable number of neighbors to describe the local context of nodes and leaves. The number of neighbors had a greater impact when considering smaller training sets, with a broader context showing higher accuracy, especially for complex evolutionary models. Compared to other recently developed DL approaches, PhyloCNN showed higher or similar accuracies for all parameters when used with training sets one or two orders of magnitude smaller (10,000 to 100,000 simulated training trees, instead of millions). PhyloCNN also compared favorably with state-of-the-art likelihood-based methods. We applied PhyloCNN with compelling results to two real-world phylodynamics and diversification datasets, related to HIV superspreaders in Zurich and to primates and their ecological role as seed dispersers. The high accuracy and computational efficiency of PhyloCNN open new possibilities for phylodynamics and diversification studies that need to account for idiosyncratic phylogenetic histories with specific parameter spaces and sampling scenarios.
Dryad DOI: https://doi.org/10.5061/dryad.prr4xgxx9
This folder contains the code and the two empirical phylogenies analyzed in the manuscript entitled "PhyloCNN: Improving tree representation and neural network architecture for deep learning from trees in phylodynamics and diversification studies", by Manolo Perez and Olivier Gascuel. These two phylogenies are in Newick format.
PhyloCNN_GitHub_12Nov2025.zip: This file contains scripts and notebooks to perform simulations, encoding, model selection, parameter estimation, and posterior distribution analyses using PhyloCNN. This file is a mirror of phyloCNN's GitHub (https://github.com/manolofperez/phyloCNN/) on 14 October 2025.
ZurichHIV-tree.nwk: 200 taxa, from (Rasmussen et al. PLOS CB, 2017), analyzed in (Voznica et al. Nature Com 2022).
Primates-traits-ultrametric-tree.nwk: 260 taxa, from (Fabre et al. Mol Phyl Evol 2009) and (Gomez and Verdu Syst Biol 2012), analyzed in (Lambert et al. Syst Biol 2023). This tree is ultrametric (dated), and the trait values are included in the taxon names using [&&NHX-t_s=1 or 2]; 1 stands for mutualistic (= 0 in the manuscript) and 2 for antagonistic (= 1 in the manuscript).
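As a rough illustration of the trait recoding described above, the sketch below extracts the t_s values from a Newick string and maps them to the manuscript's coding (1 becomes 0, 2 becomes 1). It only assumes that each annotation contains `t_s=` followed by 1 or 2; the exact annotation syntax in the real file, the helper name `recode_traits`, and the toy taxon names are assumptions made for this example.

```python
import re
from collections import Counter

# File coding -> manuscript coding, as stated above:
# 1 (mutualistic) -> 0, 2 (antagonistic) -> 1.
FILE_TO_MANUSCRIPT = {"1": 0, "2": 1}

def recode_traits(newick_text):
    """Return manuscript-coded trait states in order of appearance.

    Assumption: every trait annotation contains 't_s=1' or 't_s=2'.
    """
    return [FILE_TO_MANUSCRIPT[m] for m in re.findall(r"t_s=([12])", newick_text)]

# Toy string with made-up taxon names, not taken from the real tree file.
toy = "((taxonA[&&NHX-t_s=1]:1.0,taxonB[&&NHX-t_s=2]:1.0):0.5,taxonC[&&NHX-t_s=1]:1.5);"
states = recode_traits(toy)
print(states)
print(Counter(states))
```

Counting the recoded states this way can serve as a quick sanity check that all 260 taxa carry a trait value before encoding the tree.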
PhyloCNN
This repository contains scripts and notebooks to perform simulations, encoding, model selection, parameter estimation, and posterior distribution analyses using PhyloCNN.
Article
Perez M.F. and Gascuel O. 2025. PhyloCNN: Improving tree representation and neural network architecture for deep learning from trees in phylodynamics and diversification studies. Systematic Biology.
Installation
To set up the required Python environment from the file environment.yml, run the following commands:
conda env create -f environment.yml
conda activate phylocnn
Scripts and Notebooks
Simulations Folder
- Phylodynamics Birth-Death Model Simulations (Python)
  - generate_parameters.py: Generate input parameters for BD, BDEI, and BDSS models.
  - Command Examples:
    python generate_parameters.py -m BD_PhyDyn -r 1,5 -i 1,10 -s 200,500 -p 0.01,1 -n 10000 -o parameters_BD.txt
    For the BD model, where -m = model; -r = R0; -i = 1/γ; -s = tree size; -p = sampling probability; -n = number of samples; -o = output file.
    python generate_parameters.py -m BDEI -r 1,5 -i 1,10 -e 0.2,5 -s 200,500 -p 0.01,1 -n 10000 -o parameters_BDEI.txt
    For the BDEI model, where -m = model; -r = R0; -i = 1/γ; -e = incubation factor (ε/γ); -s = tree size; -p = sampling probability; -n = number of samples; -o = output file.
    python generate_parameters.py -m BDSS -r 1,5 -i 1,10 -x 3,10 -f 0.05,0.2 -s 200,500 -p 0.01,1 -n 10000 -o parameters_BDSS.txt
    For the BDSS model, where -m = model; -r = R0; -i = 1/γ; -x = XSS; -f = fSS; -s = tree size; -p = sampling probability; -n = number of samples; -o = output file.
  - The output from generate_parameters.py should then be used with the simulators from (Voznica et al. 2022) to simulate trees using BD, BDEI, or BDSS parameters. Call the simulator with the parameter file generated in the previous step (e.g., parameters_BD.txt) and the maximum simulation time (default of 500; Voznica et al. 2022).
  - Command Examples:
    python TreeGen_BD_refactored.py parameters_BD.txt <max_time=500> > BD_trees.nwk
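To make the "min,max" range arguments above more concrete, here is a minimal sketch of how such ranges could be turned into parameter draws for the BD model. It is not the code of generate_parameters.py: independent uniform sampling is an assumption, the function names are hypothetical, and the conversion to rates relies on the standard BD relation R0 = β/γ with 1/γ the infectious period.

```python
import random

def parse_range(text):
    """Turn a 'min,max' string such as '1,5' into a (min, max) tuple of floats."""
    low, high = (float(x) for x in text.split(","))
    return low, high

def sample_bd_parameters(r0_range, inf_period_range, n, seed=0):
    """Draw n (R0, infectious period, transmission rate, removal rate) tuples.

    Assumption: independent uniform draws; the real script may sample differently.
    """
    rng = random.Random(seed)
    r0_lo, r0_hi = parse_range(r0_range)
    ip_lo, ip_hi = parse_range(inf_period_range)
    rows = []
    for _ in range(n):
        r0 = rng.uniform(r0_lo, r0_hi)
        inf_period = rng.uniform(ip_lo, ip_hi)  # 1/gamma
        gamma = 1.0 / inf_period                # removal rate
        beta = r0 * gamma                       # transmission rate, from R0 = beta/gamma
        rows.append((r0, inf_period, beta, gamma))
    return rows

rows = sample_bd_parameters("1,5", "1,10", n=5)
for r0, inf_period, beta, gamma in rows:
    print(f"R0={r0:.2f}  1/gamma={inf_period:.2f}  beta={beta:.3f}  gamma={gamma:.3f}")
```

Fixing the seed keeps the draws reproducible, which helps when the same parameter file must be regenerated for a later simulation run.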
- Diversification Birth-Death Models Simulations (R + Python)
  - generate_parameters.py (Python): Generate input parameters for BD and BiSSE models.
  - Command Examples for BD:
    python generate_parameters.py -m BD_div -l 0.01,1.0 -t 0,1 -s 200,500 -p 0.01,1 -n 10000 -o parameters_BD_div.txt
    For the BD model, where -m = model; -l = λ; -t = τ; -s = tree size; -p = sampling probability; -n = number of samples; -o = output file.
  - The output from generate_parameters.py should then be used with the simulator from (Lambert et al. 2023). Call the simulator with the parameter file generated in the previous step (e.g., parameters_BD_div.txt) and the maximum simulation time (default of 500).
    python BD_simulator.py parameters_BD_div.txt <max_time=500> > BD_trees.nwk
  - Command Examples for BiSSE:
    python generate_parameters.py -m BISSE -l0 0.01,1.0 -t 0,1 -l1 0.1,1.0 -q 0.01,0.1 -s 200,500 -p 0.01,1 -n 10000 -o parameters_BiSSE.txt
    For the BiSSE model, where -m = model; -l0 = λ0; -t = τ; -l1 = ratio between λ1 and λ0; -q = ratio between q (= q01 = q10) and λ0; -s = tree size; -p = sampling probability; -n = number of samples; -o = output file.
  - Use the output to simulate trees with BiSSE_simulator.R (R) from (Lambert et al. 2023). The values between <> are the ones we used for the parameters required by the script (index, seed number, step, number of retrials, and output file names).
    Rscript BiSSE_simulator.R parameters_BiSSE.txt <indice=1> <seed_base=12345> <step=10> <nb_retrials=100> BiSSE_trees.nwk BiSSE_stats.txt BiSSE_params.txt
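Because the BiSSE options -l1 and -q are expressed as ratios relative to λ0, it can help to see the arithmetic that recovers absolute rates. The sketch below assumes that the -l1 ratio means λ1/λ0 and the -q ratio means q/λ0 (the direction of the ratios is not spelled out here), and the function name `bisse_rates` is hypothetical. The parameter τ is left untouched.

```python
def bisse_rates(lambda0, lambda1_ratio, q_ratio):
    """Convert ratio-based inputs into absolute rates.

    Assumptions: lambda1_ratio = lambda1 / lambda0 and q_ratio = q / lambda0,
    with q = q01 = q10 as stated in the option description.
    """
    q = q_ratio * lambda0
    return {
        "lambda0": lambda0,
        "lambda1": lambda1_ratio * lambda0,
        "q01": q,
        "q10": q,
    }

# Example values chosen inside the ranges used in the command above.
rates = bisse_rates(lambda0=0.5, lambda1_ratio=0.4, q_ratio=0.05)
print(rates)
```

Under these assumptions, a -l1 range of 0.1,1.0 constrains λ1 to be at most λ0, so state 1 never speciates faster than state 0 in the simulated trees.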
Encoding Folder
- Phylogenies Encoding (Python)
  - PhyloCNN_Encoding_PhyloDyn.py: Encode BD, BDEI, BDSS, and BD_div trees.
  - PhyloCNN_Encoding_BiSSE.py: Encode BiSSE trees.
  - Command Examples:
    python PhyloCNN_Encoding_PhyloDyn.py -t BD_trees.nwk -o Encoded_trees_BD.csv
    python PhyloCNN_Encoding_BiSSE.py -t BiSSE_trees.nwk -o Encoded_trees_BiSSE.csv
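The abstract describes trees as being encoded through the neighborhood of every node and leaf. The toy sketch below only illustrates what "one row per node, described by its local context" can look like; it is not the encoding produced by the scripts above. The chosen features (own branch length, parent branch length, first sibling's branch length, number of children), the tiny Newick parser, and all names are assumptions made for this example.

```python
class Node:
    """Minimal tree node: name, branch length, parent and children."""
    def __init__(self, parent=None):
        self.name = ""
        self.length = 0.0
        self.parent = parent
        self.children = []

def parse_newick(text):
    """Parse a plain Newick string (names and branch lengths only, no annotations)."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_subtree(parent):
        nonlocal pos
        node = Node(parent)
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            while True:
                node.children.append(parse_subtree(node))
                if text[pos] == ",":
                    pos += 1              # consume "," and read the next child
                else:
                    pos += 1              # consume ")"
                    break
        start = pos                       # read the "name:length" label, if any
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        node.name = name.strip()
        node.length = float(length) if length else 0.0
        return node

    return parse_subtree(None)

def neighborhood_rows(root):
    """Map each node name to a small feature row built from its neighbors.

    Assumes node names are unique; missing neighbors are filled with 0.0.
    """
    rows = {}
    stack = [root]
    while stack:
        node = stack.pop()
        parent = node.parent
        siblings = [c for c in parent.children if c is not node] if parent else []
        rows[node.name] = [
            node.length,                               # own branch length
            parent.length if parent else 0.0,          # parent branch length
            siblings[0].length if siblings else 0.0,   # first sibling's branch length
            float(len(node.children)),                 # number of children
        ]
        stack.extend(node.children)
    return rows

rows = neighborhood_rows(parse_newick("((A:1,B:2)X:3,C:4)R;"))
for name in sorted(rows):
    print(name, rows[name])
```

Widening the context (for example by adding grandparent or nephew features) would lengthen each row, which is the kind of trade-off the abstract refers to when it compares different numbers of neighbors.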
Preprocessing and Training Folder
- Preprocessing, Training, and Predictions (Jupyter Notebooks)
  - PhyloCNN_Train_PhyDyn_ModelSelection.ipynb: Model selection for BD, BDEI, BDSS.
  - PhyloCNN_Train_BD.ipynb: Parameter estimation for the BD model.
  - PhyloCNN_Train_BDEI.ipynb: Parameter estimation for the BDEI model.
  - PhyloCNN_Train_BDSS.ipynb: Parameter estimation for the BDSS model.
  - PhyloCNN_Train_BiSSE.ipynb: Parameter estimation for the BiSSE model.
CI folder
- Confidence Intervals and Posterior Distributions
  - CI_HIV.ipynb: Compute confidence intervals for the HIV dataset.
  - CI_primates.ipynb: Compute confidence intervals for the primates dataset.
Model Adequacy Folder
This folder has its own README file
- Posterior Sampling and Summary Statistics
  - SampleDistribution_kde.py: Samples parameter values from the posterior distribution using Gaussian Kernel Density Estimate (KDE).
  - BiSSE_SumStats.ipynb: Extract summary statistics from trees simulated under BiSSE.
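To illustrate the idea behind sampling from a Gaussian KDE, the sketch below draws new values for a single parameter: pick one of the observed values at random, then add Gaussian noise whose standard deviation is the kernel bandwidth. It is a one-dimensional, standard-library illustration and not the code of SampleDistribution_kde.py; the bandwidth rule (Scott's rule), the function name, and the toy posterior values are assumptions.

```python
import random
import statistics

def sample_from_gaussian_kde(values, n_samples, seed=0):
    """Draw n_samples from a 1-D Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE is equivalent to choosing a data point
    uniformly at random and perturbing it with Gaussian noise of standard
    deviation equal to the bandwidth.
    Assumption: Scott's rule for one dimension, bandwidth = stdev * n**(-1/5).
    """
    rng = random.Random(seed)
    bandwidth = statistics.stdev(values) * len(values) ** (-1 / 5)
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n_samples)]

# Made-up posterior draws for one parameter, for illustration only.
posterior_values = [1.8, 2.0, 2.1, 2.3, 2.4]
draws = sample_from_gaussian_kde(posterior_values, n_samples=1000)
print(f"mean of draws: {statistics.mean(draws):.3f}")
print(f"mean of input: {statistics.mean(posterior_values):.3f}")
```

For several parameters estimated jointly, a multivariate KDE would be needed so that correlations between parameters are preserved; the per-parameter version here ignores them.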
Empirical_Datasets Folder
This folder has its own README file
- Empirical Phylogenies for the HIV and the primates (with traits) datasets
- This folder contains the two empirical phylogenies to be analyzed.
Test_Sets folder
- Simulations from the test set
  - BD: Phylogenies (.nwk.gz) and parameter values (.csv.gz) for the BD model.
  - BDEI: Phylogenies (.nwk.gz) and parameter values (.csv.gz) for the BDEI model.
  - BDSS: Phylogenies (.nwk.gz) and parameter values (.csv.gz) for the BDSS model.
  - BiSSE: Phylogenies (.nwk.gz) and parameter values (.csv.gz) for the BiSSE model.
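The gzipped test-set files can be read without decompressing them to disk first. The sketch below shows one way to do this with the Python standard library; it assumes one Newick tree per line and a header row in the CSV files, and the file names and column names in the demo are made up for this example.

```python
import csv
import gzip
import os
import tempfile

def read_trees(path):
    """Return the non-empty lines of a gzipped Newick file (assumed one tree per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

def read_parameters(path):
    """Return the rows of a gzipped CSV file as dictionaries (assumes a header row)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))

# Self-contained demo on temporary toy files (hypothetical names and columns).
with tempfile.TemporaryDirectory() as tmp:
    tree_path = os.path.join(tmp, "toy.nwk.gz")
    param_path = os.path.join(tmp, "toy.csv.gz")
    with gzip.open(tree_path, "wt", encoding="utf-8") as handle:
        handle.write("(A:1,B:1);\n((A:1,B:1):1,C:2);\n")
    with gzip.open(param_path, "wt", encoding="utf-8", newline="") as handle:
        handle.write("param_a,param_b\n1.5,0.2\n2.5,0.4\n")
    trees = read_trees(tree_path)
    params = read_parameters(param_path)

print(len(trees), "trees;", len(params), "parameter rows")
```

Checking that the number of trees matches the number of parameter rows is a cheap guard against misaligned files before computing test-set accuracy.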
Trained_Models folder
This folder has its own README file
- Trained neural networks for model selection and parameter estimation
- This folder contains the trained neural network models obtained with PhyloCNN.
