Data from: Some limitations of public sequence data for phylogenetic inference (in plants)

Hinchliff, Cody E.; Smith, Stephen Andrew

Published May 16, 2015 on Dryad. https://doi.org/10.5061/dryad.450qq

Data files

May 16, 2015 version files, 185.85 MB

code.zip (22.87 KB)
File_S1.pdf (152.91 MB)
generic_tree_dated.config (1.83 KB)
generic.ultrametric.tre (407.14 KB)
RAxML_bestTree.generic_tree.rooted (717.77 KB)
Streptophytina_by_genus_alignment.phy.reduced.zip (31.78 MB)
Streptophytina_by_genus_partitions.part.reduced (3.42 KB)
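
For readers who want to inspect the partitions file alongside the alignment, the following is a minimal sketch of a parser for the RAxML-style partition format (`MODEL, name = start-end`, optionally with comma-separated ranges and a `\stride` suffix). The contents of the deposited file are not shown on this page, so the example lines, locus names, and coordinates below are hypothetical, and `parse_partitions` is an illustrative helper rather than part of the deposited scripts.

```python
import re

# One range such as "1-1400", "57", or "1-900\3" (codon-position stride).
RANGE = re.compile(r"^(\d+)(?:-(\d+))?(?:\\(\d+))?$")


def parse_partitions(text):
    """Return {partition_name: (model, number_of_sites)} for RAxML-style lines."""
    parts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        model_and_name, ranges = line.split("=", 1)
        model, name = (s.strip() for s in model_and_name.split(",", 1))
        n_sites = 0
        for chunk in ranges.split(","):
            m = RANGE.match(chunk.strip())
            if m is None:
                raise ValueError(f"unrecognised range: {chunk!r}")
            start = int(m.group(1))
            end = int(m.group(2) or start)
            stride = int(m.group(3) or 1)
            n_sites += len(range(start, end + 1, stride))
        parts[name] = (model, n_sites)
    return parts


# Hypothetical example lines; the real file may use different names and ranges.
example = "DNA, rbcL = 1-1400\nDNA, matK = 1401-2900\n"
partitions = parse_partitions(example)
```

With the real file, the text would be read from `Streptophytina_by_genus_partitions.part.reduced` instead of the inline string.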

Abstract

The GenBank database contains essentially all of the nucleotide sequence data generated for published molecular systematic studies, but for the majority of taxa these data remain sparse. GenBank has value for phylogenetic methods that leverage data mining and rapidly improving computational methods, but the limits imposed by the sparse structure of the data are not well understood. Here we present a tree representing 13,093 land plant genera, an estimated 80% of extant plant diversity, to illustrate the potential of public sequence data for broad phylogenetic inference in plants, and we explore the limits to inference imposed by the structure of these data using theoretical foundations from phylogenetic data decisiveness. We find that despite very high levels of missing data (over 96%), the present data retain the potential to inform over 86.3% of all possible phylogenetic relationships. Most of these relationships, however, are informed by small amounts of data: approximately half are informed by fewer than four loci, and more than 99% are informed by fewer than fifteen. We also apply an information theoretic measure of branch support to assess the strength of phylogenetic signal in the data, revealing many poorly supported branches concentrated near the tips of the tree, where data are sparse and the limiting effects of this sparseness are stronger. We argue that limits to phylogenetic inference and signal imposed by low data coverage may pose significant challenges for comprehensive phylogenetic inference at the species level. Computational requirements provide additional limits for large reconstructions, but these may be overcome by methodological advances, whereas insufficient data coverage can only be remedied by additional sampling effort. We conclude that public databases have exceptional value for modern systematics and evolutionary biology, and that a continued emphasis on expanding taxonomic and genomic coverage will play a critical role in developing these resources to their full potential.
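
To make the ideas of missing data and locus coverage in the abstract concrete, here is a toy sketch. It is not the authors' analysis: the taxon names, loci, and occupancy pattern are invented, and the quartet criterion used here (a locus can bear on the relationships among four taxa only if all four are sampled for it) is a simplified reading of data decisiveness.

```python
from itertools import combinations

# Hypothetical occupancy: taxon -> set of loci for which a sequence exists.
occupancy = {
    "A": {"rbcL", "matK", "ITS"},
    "B": {"rbcL", "matK"},
    "C": {"rbcL"},
    "D": {"rbcL", "ITS"},
    "E": {"matK", "ITS"},
}
loci = sorted(set().union(*occupancy.values()))


def missing_fraction(occ, loci):
    """Fraction of empty cells in the taxon-by-locus matrix."""
    total_cells = len(occ) * len(loci)
    filled_cells = sum(len(sampled) for sampled in occ.values())
    return 1 - filled_cells / total_cells


def loci_covering(occ, taxa):
    """Loci sampled for every taxon in `taxa`."""
    return set.intersection(*(occ[t] for t in taxa))


def quartet_coverage(occ):
    """Map each four-taxon subset to the number of loci covering all four."""
    return {q: len(loci_covering(occ, q)) for q in combinations(sorted(occ), 4)}


coverage = quartet_coverage(occupancy)
informed = sum(1 for n in coverage.values() if n > 0)
print(f"missing data: {missing_fraction(occupancy, loci):.1%}")
print(f"quartets covered by at least one locus: {informed} of {len(coverage)}")
```

In this toy matrix only the quartet A, B, C, D shares a locus, which shows how a matrix can be far from empty yet leave most four-taxon relationships with no locus that samples all of them.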
