Data from: A protocol for species delineation of public DNA databases, applied to the Insecta
Data files
May 14, 2014 version (files: 185.36 MB)
- delineation_matrix
- insecta.fas.tar.gz
- insecta.taxkey.tar.gz
- supplement_fig1.tiff
Abstract
Public DNA databases are composed of data from many different taxa, but the taxonomic annotation on sequences is not always complete, which impedes the use of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, but applying species clustering to whole databases requires consolidating results from numerous undefined gene regions and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After duplicated data were filtered out, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, and (iii) species units are matched across loci to form the matrix LxS, a set of global (multi-locus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences. Delineation of species units and assignment of species names was then performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial, at the COI locus in particular, but was made feasible through the development of software that performs pairwise alignments within the taxonomic framework while accounting for the different ranks at which sequences are labeled with taxonomic information.
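The locus-partitioning step (i) can be illustrated with a minimal sketch. It is not the authors' software: the input format (sequence lengths plus pairwise hit records with an aligned length), the weight formula (aligned length divided by the shorter sequence length), and the `min_weight` cutoff are all assumptions made for illustration, and a simple connected-components grouping is substituted for Markov clustering to keep the example short.

```python
from collections import defaultdict


def overlap_weight(aligned_len, len_a, len_b):
    """Assumed edge weight: fraction of the shorter sequence that is aligned."""
    return aligned_len / min(len_a, len_b)


def partition_loci(lengths, hits, min_weight=0.5):
    """Group sequences into putative homologous loci.

    lengths: {seq_id: sequence_length}            (hypothetical input)
    hits:    [(seq_a, seq_b, aligned_length)]     (hypothetical input)
    Connected components stand in for Markov clustering here.
    """
    parent = {s: s for s in lengths}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, aligned_len in hits:
        # Keep only edges whose overlap weight passes the cutoff.
        if overlap_weight(aligned_len, lengths[a], lengths[b]) >= min_weight:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    groups = defaultdict(set)
    for s in lengths:
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())


# Toy data: three COI-like fragments and two EF1a-like fragments.
lengths = {"coi_1": 650, "coi_2": 600, "coi_3": 300,
           "ef1a_1": 900, "ef1a_2": 850}
hits = [("coi_1", "coi_2", 580), ("coi_2", "coi_3", 290),
        ("ef1a_1", "ef1a_2", 800), ("coi_3", "ef1a_1", 40)]
print(partition_loci(lengths, hits))
```

In this toy input the short spurious hit between `coi_3` and `ef1a_1` falls below the cutoff, so the two groups stay separate. A real Markov clustering run would use the edge weights themselves rather than a hard threshold.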
Over 24 different homologs, the unidentified sequences numbered ~194,000, containing 41,525 species IDs (98.7 percent of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multi-partite matching algorithm to form multi-locus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. Of these units, 38,574 contained only species-labeled data and 34,891 contained only unlabeled data, leaving 4,626 units composed of both labeled and unlabeled sequences. In addition to giving estimates of the species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence datasets. In particular, the LxS matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
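The cross-locus matching step (iii) can be sketched in a similar spirit. This is a greedy stand-in, not the multi-partite matching algorithm used in the paper. It assumes each single-locus unit is described by a set of member identifiers (hypothetical here: species labels or specimen codes shared between loci), scores cross-locus pairs by Jaccard overlap, and merges the best pairs first while allowing at most one unit per locus in each multi-locus row of the LxS matrix.

```python
from itertools import combinations


def match_units(units):
    """Greedy cross-locus matching of single-locus species units.

    units: {(locus, unit_id): set_of_member_ids}   (hypothetical input)
    Returns a list of rows, each a {locus: unit_id} dict, i.e. one
    multi-locus species unit per row of an LxS-style matrix.
    """
    keys = sorted(units)

    # Score every cross-locus pair of units that shares at least one member.
    scored = []
    for ka, kb in combinations(keys, 2):
        shared = units[ka] & units[kb]
        if ka[0] != kb[0] and shared:
            jaccard = len(shared) / len(units[ka] | units[kb])
            scored.append((jaccard, ka, kb))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    # Start with every unit in its own row, then merge best pairs first.
    group_of = {k: i for i, k in enumerate(keys)}
    members = {i: {k} for k, i in group_of.items()}
    for _, ka, kb in scored:
        ga, gb = group_of[ka], group_of[kb]
        if ga == gb:
            continue
        loci_a = {k[0] for k in members[ga]}
        loci_b = {k[0] for k in members[gb]}
        if loci_a & loci_b:
            continue  # a row may hold at most one unit per locus
        members[ga] |= members.pop(gb)
        for k in members[ga]:
            group_of[k] = ga

    return [dict(sorted(m)) for _, m in sorted(members.items())]


# Toy data: two COI units, three 16S units, partly sharing members.
units = {
    ("COI", "c1"): {"sp_A", "v1", "v2"},
    ("COI", "c2"): {"sp_B", "v3"},
    ("16S", "r1"): {"sp_A", "v1"},
    ("16S", "r2"): {"sp_B", "v3", "v4"},
    ("16S", "r3"): {"v9"},
}
for row in match_units(units):
    print(row)
```

Rows that contain a single locus, such as the unmatched `r3` unit in the toy data, correspond to species units represented at only one locus. A greedy pass like this does not guarantee minimal incongruence between loci, which is what the matching algorithm described in the abstract aims for.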