Re-evaluating deep neural networks for phylogeny estimation: the issue of taxon sampling

Cite this dataset

Grosshauser, Martin; Zaharias, Paul; Warnow, Tandy (2020). Re-evaluating deep neural networks for phylogeny estimation: the issue of taxon sampling [Dataset]. Dryad.


Deep neural networks (DNNs) are powerful machine learning models that are widely used for classification problems and have recently been proposed for quartet tree phylogeny estimation (Suvorov et al., Systematic Biology 2020; Zou et al., Molecular Biology and Evolution 2020). Here we present a study evaluating recently trained DNNs (from Zou et al., MBE 2020) against a collection of standard phylogeny estimation methods, including UPGMA, neighbor joining, maximum parsimony, and maximum likelihood, on a heterogeneous collection of 20-sequence datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution.

Our study shows that DNNs combined with quartet amalgamation (to assemble quartet trees into a tree on the full dataset) outperform only UPGMA and are less accurate than every other standard phylogeny estimation method we explore (maximum likelihood, neighbor joining, and maximum parsimony). We further find that while DNNs can provide good quartet tree accuracy, some standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and the induced quartet trees are then extracted from it).

Thus, our study provides evidence that a major challenge limiting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees, which must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees directly from the full set of sequences) are able to benefit from taxon sampling and hence achieve higher accuracy on large datasets.
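As a toy illustration of the "global" usage described above (this is not code from the dataset, and the tuple-based tree representation is purely hypothetical), the quartet topology induced by a full tree on four taxa can be read off by finding a clade of the tree that contains exactly two of the four taxa; those two taxa form one side of the induced quartet. A minimal Python sketch under those assumptions:

```python
def leaves(tree):
    """Return the frozenset of leaf labels in a nested-tuple tree."""
    if isinstance(tree, tuple):
        out = set()
        for child in tree:
            out |= leaves(child)
        return frozenset(out)
    return frozenset([tree])

def clades(tree):
    """Yield the leaf set of every subtree (clade) of a rooted tuple tree."""
    yield leaves(tree)
    if isinstance(tree, tuple):
        for child in tree:
            yield from clades(child)

def induced_quartet(tree, quartet):
    """Return the quartet topology induced by `tree` on the four taxa in
    `quartet`, as a pair of frozensets ({x, y}, {z, w}) meaning xy|zw.
    Returns None if no clade separates the quartet (possible only with
    polytomies)."""
    q = frozenset(quartet)
    assert len(q) == 4, "a quartet must contain exactly four taxa"
    for clade in clades(tree):
        pair = clade & q
        if len(pair) == 2:
            return pair, q - pair
    return None

# Example: a rooted binary tree on six taxa, written as nested tuples.
full_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
side1, side2 = induced_quartet(full_tree, ["A", "C", "D", "F"])
# "A" and "C" sit on one side of an edge separating them from "D" and "F",
# so the induced quartet topology is AC|DF.
```

Quartet accuracy in the study's sense can then be measured by comparing such induced quartets against the quartets of the true tree; a global method gets all of its quartets from one full-tree estimate, whereas a quartet-based DNN estimates each quartet independently.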

Usage notes


README file for data and results




National Science Foundation, Award: ABI-1458652