Machine learning can be as good as maximum likelihood when reconstructing phylogenetic topologies and determining the best evolutionary model on four taxon alignments.

Phylogenetic tree reconstruction with molecular data is important in many fields of life science research. The gold standard in this discipline is the Maximum Likelihood tree reconstruction method. Here we show that for quartet trees, machine learning using neural networks can be as good as the Maximum Likelihood method at inferring the best tree topology and the best model of sequence evolution for nucleotide as well as amino acid sequences. For this purpose, we simulated data sets for a wide range of branch lengths, evolutionary models and model parameters, and compared the topologies and models inferred with machine learning with those obtained with the Maximum Likelihood and Neighbour Joining methods. Our results show that neural networks are a promising avenue for determining relatedness between taxa, which is likely to accelerate the construction of phylogenetic trees in the future while maintaining high accuracy.

This archive is part of the DeepNNPhylogeny project, whose software is available on GitHub. It contains pre-trained neural networks that predict (a) the best model of sequence evolution and (b) the best quartet tree topology for alignments of four nucleotide or amino acid sequences. For each use case, six neural networks with different architectures have been trained and saved for further use with the Python library TensorFlow. The neural networks have been saved with the tf.keras.Model.save function in the TensorFlow SavedModel format.

All neural networks have been trained with a large number of alignments simulated with the software PolyMoSim v1.1.4, which is available on GitHub. For each simulated data set, the branch lengths and the model parameters have been chosen by a random number generator within specified intervals. The model parameters include the proportion of invariant sites, the shape parameter of the gamma distribution for site heterogeneity and, where applicable, the transition/transversion ratio, the nucleotide base frequencies and the relative substitution rates.

Nucleotide alignments have been simulated with the JC (Jukes-Cantor, 1969), F81 (Felsenstein, 1981), F84 (Felsenstein, 1984), K2P (Kimura two-parameter), HKY (Hasegawa-Kishino-Yano, 1985) or GTR (general time reversible) model. Amino acid alignments have been simulated with the Dayhoff, JTT, LG or WAG model. These models are available for model prediction and for topology prediction. For more details on the simulation and training procedure, see the publication (will be available soon).
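The random choice of simulation settings described above can be illustrated with a minimal sketch. It is not the project's simulation code: the interval bounds, the parameter names and the use of a uniform distribution are assumptions made for illustration only, since the actual intervals are not stated here.

```python
import random

# Hypothetical intervals, chosen only for illustration; the intervals
# actually used for the simulations are not given in this description.
PARAMETER_INTERVALS = {
    "proportion_invariant": (0.0, 0.5),
    "gamma_shape": (0.1, 4.0),
    "ts_tv_ratio": (1.0, 3.0),
}
BRANCH_LENGTH_INTERVAL = (0.01, 0.5)
# An unrooted quartet tree has four terminal branches and one internal branch.
N_BRANCHES = 5


def draw_base_frequencies(rng):
    """Draw four positive base frequencies and normalise them to sum to 1."""
    weights = [rng.uniform(0.1, 1.0) for _ in range(4)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_simulation_setup(rng):
    """Draw one set of model parameters and branch lengths (uniform draws assumed)."""
    setup = {
        name: rng.uniform(low, high)
        for name, (low, high) in PARAMETER_INTERVALS.items()
    }
    setup["base_frequencies"] = draw_base_frequencies(rng)
    setup["branch_lengths"] = [
        rng.uniform(*BRANCH_LENGTH_INTERVAL) for _ in range(N_BRANCHES)
    ]
    return setup


rng = random.Random(42)  # fixed seed so the draw is reproducible
setup = draw_simulation_setup(rng)
```

Each call to the hypothetical `draw_simulation_setup` yields one parameter set that could then be handed to a simulator. Parameters that do not apply to a given model, such as the transition/transversion ratio under JC, would simply be ignored.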

In this project, neural networks have been trained to:

- predict/classify the correct topology for four nucleotide or amino acid sequences that evolved on a quartet tree.

- predict the best model of sequence evolution for four nucleotide or amino acid sequences that evolved on a quartet tree.
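Topology prediction on four taxa is a three-class problem, because four taxa admit exactly three unrooted binary tree topologies. The following sketch enumerates them. The taxon names and the mapping from class index to topology are hypothetical; the label order used by the pre-trained networks may differ.

```python
def quartet_topologies(taxa):
    """Return the three unrooted binary topologies of four taxa.

    Each topology is given as a pair of sister pairs, e.g. (("A", "B"), ("C", "D")).
    """
    if len(taxa) != 4 or len(set(taxa)) != 4:
        raise ValueError("exactly four distinct taxa are required")
    first = taxa[0]
    topologies = []
    # The first taxon is paired with each of the other three in turn;
    # the two remaining taxa form the second pair.
    for partner in taxa[1:]:
        left = (first, partner)
        right = tuple(t for t in taxa[1:] if t != partner)
        topologies.append((left, right))
    return topologies


def to_newick(topology):
    """Write a quartet topology as a Newick string without branch lengths."""
    (a, b), (c, d) = topology
    return f"(({a},{b}),({c},{d}));"


TAXA = ["A", "B", "C", "D"]  # hypothetical taxon names
# Hypothetical class indices, assigned in enumeration order.
CLASS_LABELS = {
    index: to_newick(topology)
    for index, topology in enumerate(quartet_topologies(TAXA))
}
```

A classifier for this task outputs one of three labels, which a mapping such as `CLASS_LABELS` translates back into a tree.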

Together with the software in the DeepNNPhylogeny project, the pre-trained neural networks can be used for both classification tasks: predicting the best model of sequence evolution and predicting the best quartet topology.

The GitHub repository DeepNNPhylogeny contains the software with which:

a) the neural networks presented here have been trained and with which new neural networks can be trained,

b) predictions can be made using the pre-trained neural networks available in this archive. For alignments of four nucleotide or amino acid sequences, they predict the best evolutionary model and the best topology with an accuracy close to or identical to that of the Maximum Likelihood method.

The neural networks stored in this repository can be used as follows:

(i) Download the neural network(s) for the desired classification tasks from the Dryad page.

(ii) Unzip the downloaded file, rename it to "PhylNNsaved" and place it in the home directory. For more detailed instructions, see README.md.

(iii) Download and install the software from the DeepNNPhylogeny repository.

(iv) Start predicting models and topologies as described in the DeepNNPhylogeny repository.
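Step (ii) can also be scripted. The sketch below is one possible way to do it and is not part of the DeepNNPhylogeny software; it assumes that the download is a zip archive that usually unpacks to a single top-level directory. The helper names `install_networks` and `load_network` are hypothetical. `load_network` uses `tf.keras.models.load_model`, which reads SavedModel directories in TensorFlow 2 releases that ship Keras 2; newer releases based on Keras 3 may not accept this format through that call, so an older TensorFlow 2 release may be needed. README.md remains the authoritative source for the expected directory layout.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

TARGET_NAME = "PhylNNsaved"  # directory name required by step (ii)


def install_networks(zip_path, home=None):
    """Unzip the archive and place its content at <home>/PhylNNsaved."""
    home = Path(home) if home is not None else Path.home()
    target = home / TARGET_NAME
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # Extract into a scratch directory first so that a failed extraction
    # does not leave a partial PhylNNsaved directory behind.
    with tempfile.TemporaryDirectory(dir=home) as scratch:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        entries = list(Path(scratch).iterdir())
        if len(entries) == 1 and entries[0].is_dir():
            # Assumed case: one top-level directory, which is renamed.
            shutil.move(str(entries[0]), str(target))
        else:
            # Fallback: collect all extracted entries in the target directory.
            target.mkdir()
            for entry in entries:
                shutil.move(str(entry), str(target / entry.name))
    return target


def load_network(model_dir):
    """Load one SavedModel directory; TensorFlow is imported only when called."""
    import tensorflow as tf

    return tf.keras.models.load_model(str(model_dir))
```

After calling `install_networks` on the downloaded file, predictions are normally made with the scripts from the DeepNNPhylogeny repository as described in step (iv); `load_network` is only useful for inspecting a single network directly.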

The training scripts, which are only needed to train new neural networks, have the following dependencies:
- Python, ideally version 3.7 or more recent.
- Installed Python packages: tensorflow, scikit-learn (optional, not used in the latest version)
- The software PolyMoSim.

The prediction scripts, which are needed to conduct predictions, have the following dependencies:
- Python, ideally version 3.7 or more recent.
- Installed Python packages: tensorflow, scikit-learn (optional, not used in the latest version)
- The software quartet-pattern-counter-v1.1

The required machine learning modules can be installed with pip or with the Anaconda package manager.
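Whether the dependencies listed above are present can be checked before running the scripts. The following sketch uses only the Python standard library. The executable names of the external tools are assumptions and may differ on a given installation; `sklearn` is the import name of scikit-learn.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 7)
REQUIRED_MODULES = ["tensorflow"]
OPTIONAL_MODULES = ["sklearn"]  # import name of the scikit-learn package
# Assumed executable names; adjust them to the names used on your system.
EXTERNAL_TOOLS = ["PolyMoSim", "quartet-pattern-counter-v1.1"]


def check_environment():
    """Return a dict mapping each dependency to True (found) or False."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in REQUIRED_MODULES + OPTIONAL_MODULES:
        # find_spec locates the module without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in EXTERNAL_TOOLS:
        # shutil.which searches the directories listed in PATH.
        report[tool] = shutil.which(tool) is not None
    return report


report = check_environment()
for name, found in report.items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

A missing Python package can then be installed with pip or with the Anaconda package manager, as noted above.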
