
Longer is not always better: optimizing barcode length for large-scale species discovery and identification

Cite this dataset

Yeo, Darren; Meier, Rudolf; Srivathsan, Amrita (2020). Longer is not always better: optimizing barcode length for large-scale species discovery and identification [Dataset]. Dryad.


New techniques for the species-level sorting of millions of specimens are needed in order to accelerate species discovery, determine how many species live on earth, and develop efficient biomonitoring techniques. These sorting methods should be reliable, scalable and cost-effective, as well as being largely insensitive to low-quality genomic DNA, given that this is usually all that can be obtained from museum specimens. Mini-barcodes seem to satisfy these criteria, but it is unclear how well they perform for species-level sorting when compared to full-length barcodes. This is here tested based on 20 empirical datasets covering ca. 30,000 specimens (5,500 species) and six clade-specific datasets from GenBank covering ca. 98,000 specimens (>20,000 species). All specimens in these datasets had full-length barcodes and had been sorted to species-level based on morphology. Mini-barcodes of different lengths and positions were obtained in silico from full-length barcodes using a sliding window approach (3 windows: 100-bp, 200-bp, 300-bp) and by excising nine mini-barcodes with established primers (length: 94 – 407-bp). We then tested whether barcode length and/or position reduces species-level congruence between morphospecies and molecular Operational Taxonomic Units (mOTUs) that were obtained using three different species delimitation techniques (PTP, ABGD, objective clustering). Surprisingly, we find no significant differences in performance for either species- or specimen-level identification between full-length and mini-barcodes as long as they are of moderate length (>200-bp). Only very short mini-barcodes (<200-bp) perform poorly, especially when they are located near the 5’ end of the Folmer region. The mean congruence between morphospecies and mOTUs was ca. 75% for barcodes >200-bp and the congruent mOTUs contain ca. 75% of all specimens. Most conflict is caused by ca. 10% of the specimens that can be identified and should be targeted for re-examination in order to efficiently resolve conflict. Our study suggests that large-scale species discovery, identification, and metabarcoding can utilize mini-barcodes without any demonstrable loss of information compared to full-length barcodes.
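
The in silico sliding window step described in the abstract could be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function name `sliding_windows`, the `step` parameter (the abstract does not state the step size) and the placeholder sequence are all assumptions made for the example. Only the window lengths (100, 200 and 300 bp) come from the text.

```python
from typing import Iterator, Tuple


def sliding_windows(barcode: str, window: int, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, mini_barcode) for every window that fits in the barcode.

    `step` is a hypothetical parameter; the source does not give the step size.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # The last valid start position is len(barcode) - window.
    for start in range(0, len(barcode) - window + 1, step):
        yield start, barcode[start:start + window]


# Placeholder 658-bp sequence standing in for a full-length barcode
# (not real data; any aligned barcode string would be used instead).
full_length = "ACGT" * 164 + "AC"

# Window lengths taken from the abstract; step=50 is an arbitrary choice.
for window in (100, 200, 300):
    minis = list(sliding_windows(full_length, window, step=50))
    print(f"{window}-bp window: {len(minis)} mini-barcodes")
```

Each mini-barcode keeps its start coordinate, so that downstream results (for example, congruence with morphospecies) could be related to position along the full-length barcode as well as to length.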


gi/gb numbers and sample IDs of DNA sequences downloaded from GenBank and BOLDSystems.

Gel images generated with 313-bp and 658-bp PCR products of the same sample in each lane.

Supplementary figures and tables.


Ministry of Education, Award: R-154-000-A22-112