Toward a semi-supervised learning approach to phylogenetic estimation

Silvestro, Daniele 1; Latrille, Thibault 2; Salamin, Nicolas 2

Published May 28, 2024 on Dryad. https://doi.org/10.5061/dryad.qz612jmn6

Data files

May 28, 2024 version files 763.26 MB

Chr1.WGAlign.FromBam.Filtered.fasta: 736.24 MB

data_and_scripts.zip: 26.97 MB

README.md: 5.16 KB

Table_S7.csv: 44.12 KB

Abstract

Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence, without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference, when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.

The latest phyloRNN code is available here: https://github.com/phyloRNN/

This repository hosts the version used in the manuscript and includes pre-trained models and the empirical data analyzed in the paper.

Empirical data

The file Chr1.WGAlign.FromBam.Filtered.fasta contains the alignment of chromosome 1 across clownfish species in FASTA format.
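As a rough illustration of how such an alignment can be split into consecutive blocks of 1000 sites (the batch size mentioned for the clownfish analysis), here is a minimal sketch using only the Python standard library. The function names read_fasta and iter_windows are hypothetical and are not part of phyloRNN. The sketch loads the whole alignment into memory, which may be impractical for a file of this size without adaptation.

```python
import io


def read_fasta(handle):
    """Parse a FASTA stream into a {sequence_name: sequence} dictionary."""
    seqs = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def iter_windows(alignment, window=1000):
    """Yield (start, block) pairs, where block holds `window` sites per taxon."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty alignment of equal-length sequences")
    n_sites = lengths.pop()
    for start in range(0, n_sites, window):
        yield start, {
            name: seq[start:start + window] for name, seq in alignment.items()
        }


# Toy example with two taxa and six sites (hypothetical data).
demo = io.StringIO(">sp1 sample\nACGT\nAC\n>sp2\nTTTT\nGG\n")
demo_alignment = read_fasta(demo)
demo_windows = list(iter_windows(demo_alignment, window=4))
```

With the real data, the handle would come from opening Chr1.WGAlign.FromBam.Filtered.fasta, and the last block may be shorter than the window size.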

Supplementary Table

The full Table S7 discussed in the article is available in the data_and_scripts/Supplementary_information directory.

The table shows the results of 600 simulations (rows) under different models of rate heterogeneity and different tree lengths. For each simulation, the log-likelihood of the simulated data is computed given the true tree, the nucleotide matrix estimated under gamma rates, and the site-specific rates. The site-specific rates are either posterior estimates under the gamma model, posterior estimates under the free-rates model, rates estimated by phyloRNN, or the true rates (used as input to the simulation).

phyloRNN scripts

The scripts used in this paper are available in the data_and_scripts/Scripts directory.

Simulations

Scripts to generate training and test datasets with 50 taxa and alignments of 1000 nucleotides.

1.simulate_training_test_data.py Simulate 60,000 training datasets and 600 test datasets, each with 50 tips and 1000 sites.

2.train_model.py Train a phyloRNN model.

3.compare_phyloRNN_w_phyML_1.py Compare site rates and tree lengths estimated on the test set by a trained phyloRNN model and by phyML optimization. The output is a tab-separated table with accuracy metrics calculated for each model of rate heterogeneity.

4.compare_phyloRNN_w_phyML_2.py Compare site rates and tree lengths estimated on the test set by a trained phyloRNN model and by phyML optimization using a fixed tree topology, constrained to the true one. The output is a tab-separated table with accuracy metrics calculated for each model of rate heterogeneity.

5.simulate_trainingset_mixed.py Simulate training datasets with a larger fraction of alignments based on a mixed model of rate heterogeneity, to assess the potential improvement in the prediction accuracy for this subset of simulations.
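To make the notion of rate heterogeneity concrete, the sketch below draws per-site relative rates from a gamma distribution with mean 1, the simplest of the heterogeneity patterns mentioned in the abstract. It is an illustration only and does not reproduce the simulation scripts listed above: the function name simulate_gamma_rates and the choice of shape parameter are assumptions.

```python
import random


def simulate_gamma_rates(n_sites, alpha, seed=None):
    """Draw per-site relative rates from a gamma distribution with mean 1.

    The shape is `alpha` and the scale is 1/alpha, so the expected rate is 1
    and the variance is 1/alpha (smaller alpha means stronger heterogeneity).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale) as its two arguments.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


# Hypothetical example: 1000 sites with moderate rate heterogeneity.
example_rates = simulate_gamma_rates(1000, alpha=2.0, seed=42)
```

Fixing the mean at 1 keeps the rates interpretable as multipliers of the branch lengths, so that heterogeneity among sites does not change the expected tree length.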

Clownfish_scripts

1.simulate_and_train.py Generate a training set for the analysis of an alignment spanning chromosome 1 of 28 species of clownfish in batches of 1000 sites, and train a phyloRNN model.

2.clownfish_predictions.py Predict site rates across chromosome 1 of the clownfish clade (file Chr1.WGAlign.FromBam.Filtered.fasta).

3.plot_results.py Parse exon annotation and plot results.

RevBayes_experiments

Scripts to generate training and test datasets and analyze them using RevBayes. Scripts 1-5 generate datasets with 100 sites, train a phyloRNN model, and analyze the datasets in RevBayes applying per-site rates and gamma-distributed rates to test their effect on the accuracy of the estimated trees.

Scripts 6-7 use a previously trained phyloRNN model with 50 taxa and 1000 sites, simulate a new test set, and analyze it in RevBayes applying per-site rates discretized into 4 or more rate categories.
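One simple way to discretize continuous per-site rates into a fixed number of categories is quantile binning, sketched below with the Python standard library. This is an assumption made for illustration: the binning rule used in script 6 is not described here, and the function name discretize_rates is hypothetical.

```python
import bisect
import statistics


def discretize_rates(rates, n_categories=4):
    """Assign each site to a rate category using quantile cut points.

    Returns (assignment, category_rates): the category index of every site
    and the mean rate of the sites falling in each category.
    """
    if n_categories < 2:
        raise ValueError("n_categories must be at least 2")
    # statistics.quantiles returns n_categories - 1 cut points.
    cuts = statistics.quantiles(rates, n=n_categories)
    assignment = [bisect.bisect_right(cuts, r) for r in rates]
    category_rates = []
    for k in range(n_categories):
        members = [r for r, a in zip(rates, assignment) if a == k]
        # A category can be empty when many sites share the same rate.
        category_rates.append(statistics.fmean(members) if members else float("nan"))
    return assignment, category_rates


# Hypothetical predicted rates for eight sites.
toy_rates = [0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3]
toy_assignment, toy_category_rates = discretize_rates(toy_rates, n_categories=4)
```

Each site then carries a category index, and each category a single representative rate, in the same spirit as discrete gamma categories but with the category membership of each site fixed in advance.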

1.simulate_train_data_revb.py Simulate training and test data with 20 taxa and 100 sites.

2.train_model_revb.py Train a new model based on the training set (a pre-trained model is available in the trained_models directory).

3.generate_RevBayes_scripts.py Predict per-site rates for the test set based on the trained model and create RevBayes scripts to run phylogenetic estimation using phyloRNN rates.

4.parallelize_revb.py Run RevBayes analyses in parallel. Note that the path to the alignment files is included in the generated RevBayes scripts, so their names and locations must not be changed.

5.1.parse_revb_res.py and 5.2.parse_revb_res.R Parse results of the RevBayes phylogenetic inference to assess the accuracy (weighted R-F distances) and compare models with gamma-distributed and phyloRNN rates.

6.sim_RevBayes_scripts_blocks.py Simulate test datasets with 50 taxa and 1000 sites, predict site rates with a pre-trained phyloRNN model (available in the trained_models directory), discretize them into 4 or more categories, and create RevBayes scripts using the estimated rate categories.

7.parse_revb_res_blocks.R Parse results of the RevBayes phylogenetic inference to assess the accuracy (weighted R-F distances) and compare models with gamma-distributed and phyloRNN rate categories.
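For readers unfamiliar with the accuracy metric named above, the sketch below computes a weighted Robinson-Foulds distance between two trees, each represented as a mapping from bipartitions (splits) to branch lengths. It assumes the common definition in which the distance is the sum, over all splits found in either tree, of the absolute difference in branch length, with a missing split counted as length 0. The parsing scripts may rely on a different implementation or normalization, and the function names here are hypothetical.

```python
def canonical_split(side, taxa):
    """Return a canonical representation of the bipartition side | rest.

    The side that does NOT contain the alphabetically first taxon is kept,
    so both halves of the same bipartition map to the same key.
    """
    side = frozenset(side)
    taxa = frozenset(taxa)
    if not side or side == taxa or not side <= taxa:
        raise ValueError("side must be a non-empty proper subset of taxa")
    anchor = min(taxa)
    return taxa - side if anchor in side else side


def weighted_rf(splits_a, splits_b):
    """Sum of absolute branch-length differences over the union of splits."""
    all_splits = set(splits_a) | set(splits_b)
    return sum(
        abs(splits_a.get(s, 0.0) - splits_b.get(s, 0.0)) for s in all_splits
    )


# Toy example with four taxa: ((A,B),(C,D)) versus ((A,C),(B,D)).
taxa = {"A", "B", "C", "D"}
tree_1 = {
    canonical_split({"A", "B"}, taxa): 0.3,  # internal branch
    canonical_split({"A"}, taxa): 0.1,
    canonical_split({"B"}, taxa): 0.1,
    canonical_split({"C"}, taxa): 0.2,
    canonical_split({"D"}, taxa): 0.2,
}
tree_2 = {
    canonical_split({"A", "C"}, taxa): 0.5,  # internal branch
    canonical_split({"A"}, taxa): 0.1,
    canonical_split({"B"}, taxa): 0.2,
    canonical_split({"C"}, taxa): 0.2,
    canonical_split({"D"}, taxa): 0.2,
}
toy_distance = weighted_rf(tree_1, tree_2)
```

Unlike the unweighted distance, which only counts splits that differ, this version also penalizes errors in branch lengths on splits that both trees share.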

Trained_models

Trained phyloRNN models used in our study.

t20_s100: 20 tips, 100 sites (used for RevBayes tests)

t28_s1000: 28 tips, 1000 sites (used for clownfish dataset)

t50_s1000: 50 tips, 1000 sites (used for general benchmarking and for RevBayes tests with discrete phyloRNN rate classes)

t50_s1000mixed20: 50 tips, 1000 sites, trained with a higher fraction of mixed rate-model datasets (used to test accuracy on mixed models)

phyloRNN code

The version of phyloRNN used in the paper is available in the phyloRNN directory.