Data from: Rapid and accurate taxonomic classification of insect (Class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier
Data files
Feb 18, 2014 version files 22.57 MB
- CanadianBenthosFastas.zip (257.47 KB)
- CustomizingInsectaTaxonomy.zip (97.06 KB)
- malaise.fasta (355.93 KB)
- README_for_CanadianBenthosFastas.docx (40.08 KB)
- README_for_CustomizingInsectaTaxonomy.docx (58.77 KB)
- README_for_TrainingTheClassifier.docx (120.27 KB)
- TrainingTheClassifier.zip (21.64 MB)
Abstract
Current methods to identify unknown insect (class Insecta) cytochrome c oxidase subunit 1 (COI) barcode sequences often rely on difficult-to-define distance thresholds, sequence similarity cutoffs, or monophyly. Most methods do not provide a measure of confidence for the taxonomic assignments they produce. The aim of this study is to use a naïve Bayesian classifier (Wang et al., 2007) to automate unsupervised taxonomic assignments for large batches of insect COI sequences, such as data obtained from environmental barcoding using next-generation sequencing platforms. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We developed three different training sets and rigorously tested their performance using leave-one-out cross-validation, two field datasets, and targeted testing of Lepidoptera, Diptera, and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates (incorrect taxonomic assignments with high bootstrap support) were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cutoffs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
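The idea of choosing a bootstrap support cutoff by query length and then summarizing an assignment to a more inclusive rank can be illustrated with a minimal Python sketch. It is not the classifier itself and not code from this dataset. The function names, the input layout, and above all the cutoff table are hypothetical placeholders chosen for illustration; neither the values nor their direction are taken from the study.

```python
# Hypothetical cutoff table (NOT from the study): each entry is
# (minimum query length in bp, bootstrap support cutoff), longest first.
HYPOTHETICAL_CUTOFFS = [(400, 0.60), (200, 0.70), (0, 0.80)]


def cutoff_for_length(query_length):
    """Return the illustrative bootstrap cutoff for a query of this length."""
    for min_length, cutoff in HYPOTHETICAL_CUTOFFS:
        if query_length >= min_length:
            return cutoff
    raise ValueError("query length must be non-negative")


def summarize_assignment(lineage, query_length):
    """Trim a lineage to the deepest rank whose support meets the cutoff.

    lineage: list of (rank, taxon, bootstrap_support) tuples, ordered from
    the most inclusive rank to the least inclusive rank.
    """
    cutoff = cutoff_for_length(query_length)
    kept = []
    for rank, taxon, support in lineage:
        if support < cutoff:
            break  # stop at the first rank that falls below the cutoff
        kept.append((rank, taxon))
    return kept


# Made-up example lineage with made-up support values.
example = [
    ("class", "Insecta", 1.00),
    ("order", "Diptera", 0.95),
    ("family", "Chironomidae", 0.75),
    ("genus", "Chironomus", 0.40),
]
print(summarize_assignment(example, query_length=150))
print(summarize_assignment(example, query_length=450))
```

With these placeholder values, the same lineage is reported at a more inclusive rank when the applicable cutoff is stricter, which is the trade-off between error and number of retained assignments that the abstract describes.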