Data from: Accurate inference of tree topologies from multiple sequence alignments using deep learning
Data files
Sep 06, 2019 version files 7.17 MB
- S1_Suppl_BL_space.pdf (82.95 KB)
- S1_Suppl_tab_500regionaccuracy.docx (18.49 KB)
- S2_Suppl_3DSurface.pdf (1.28 MB)
- S2_Suppl_tab_consistency.docx (15.03 KB)
- S3_Suppl_Bias_gap.jpeg (2.70 MB)
- S4_Suppl_Bias_nogap.jpeg (2.79 MB)
- S5_Suppl_Precision_recall.pdf (152.22 KB)
- Suppl_fig_tab_legends_v1.docx (20.82 KB)
- Supplementary_Text_v1.docx (123.07 KB)
Abstract
Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with its own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where the accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e., to assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can assess support for the chosen topology more accurately than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
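To make the quartet-inference setup in the abstract concrete, the following is a minimal sketch of how a four-taxon alignment might be turned into numeric input for a classifier, and of the three unrooted topologies such a classifier would choose among. It is an illustration only, not the authors' code: the function names, the nucleotide alphabet, the one-hot layout, and the treatment of unknown symbols are all assumptions, and the published network may encode its input differently.

```python
# Hypothetical illustration; not the encoding used in the published work.
ALPHABET = "ACGT-"  # assumed nucleotide data; "-" is the gap symbol


def one_hot_alignment(seqs):
    """Encode aligned sequences as one row per site.

    Each row concatenates a one-hot vector per taxon, so four taxa and
    five symbols give 20 values per site. Symbols outside ALPHABET
    (for example N) become all-zero vectors, which is an assumption.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    rows = []
    for column in zip(*seqs):  # iterate over alignment sites
        row = []
        for ch in column:
            vec = [0] * len(ALPHABET)
            pos = index.get(ch.upper())
            if pos is not None:
                vec[pos] = 1
            row.extend(vec)
        rows.append(row)
    return rows


def quartet_topologies(taxa):
    """Return the three possible unrooted topologies for four taxa."""
    a, b, c, d = taxa
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


if __name__ == "__main__":
    alignment = ["AC-T", "ACGT", "A-GT", "TCGT"]
    encoded = one_hot_alignment(alignment)
    print(len(encoded), "sites,", len(encoded[0]), "values per site")
    for topology in quartet_topologies(["t1", "t2", "t3", "t4"]):
        print(topology)
```

Keeping the gap symbol as its own channel is one simple way to let the same encoding serve both gapped and ungapped alignments; with ungapped data that channel is simply always zero.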