Evaluating the genetic variation of the COI gene of Insecta: Implications for DNA barcoding, metabarcoding and species delimitation studies

Cite this dataset

Zhang, Haiguang; Bu, Wenjun (2022). Evaluating the genetic variation of the COI gene of Insecta: Implications for DNA barcoding, metabarcoding and species delimitation studies [Dataset]. Dryad. https://doi.org/10.5061/dryad.qnk98sff2

Abstract

Genetic variation in the COI gene strongly affects the results of species delimitation studies. However, little research has comprehensively investigated genetic divergence in COI among Insecta. The fast-growing body of COI data in BOLD provides an opportunity to appraise this variation comprehensively. We calculated the K2P distances for 64,414 insect species downloaded from BOLD. The match ratios of the clustering analysis under different thresholds were compared among 4,288 genera (35,068 species). We also compared the match ratios obtained from two species delimitation methods: the clustering analysis (a distance-based method) and the bPTP analysis (a tree-based method). Furthermore, we tested the effectiveness of the two outputs of the bPTP analysis, bPTP_h and bPTP_ml. Approximately one-quarter of the species of Insecta showed high intraspecific genetic variation (> 3%), and a conservative estimate of this value is 12.05-22.58%. Applying empirical thresholds (e.g., 2% and 3%) in the clustering analysis may result in the overestimation of species diversity. In metabarcoding studies, a threshold of 3% can be used only to estimate insect diversity roughly. For the clustering analysis, the "threshOpt" or "localMinima" algorithms can provide the researcher with an a priori threshold value. Nevertheless, if the minimum interspecific genetic distance between congeneric species is greater than or equal to 2%, it is possible to avoid overestimating species diversity when using the empirical thresholds. The match ratios of the bPTP_ml results were higher than those of the bPTP_h results, so the bPTP_ml results are recommended for the bPTP analysis. If a proper threshold is selected, the clustering analysis may outperform the bPTP analysis.
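To make the distance-based approach concrete, the following is a minimal Python sketch of the standard Kimura 2-parameter (K2P) distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions, followed by a simple threshold clustering. It is an illustration only, not the authors' pipeline: the function names `k2p_distance` and `cluster_by_threshold`, the single-linkage rule, the use of "less than or equal to" at the threshold, and the toy sequences are all assumptions introduced here.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites with gaps or ambiguity codes are skipped (an assumption of this sketch).
    math.log raises ValueError if the pair is too divergent for the formula.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def cluster_by_threshold(seqs, threshold=0.03):
    """Single-linkage clustering: join any pair whose K2P distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if k2p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: 100-bp sequences, not taken from the dataset.
base = "ACGT" * 25
toy = {
    "a": base,
    "b": "G" + base[1:],           # one transition relative to "a"
    "c": "CATG" * 5 + base[20:],   # twenty transversions relative to "a"
}
clusters_3pct = cluster_by_threshold(toy, threshold=0.03)
```

Comparing the number of clusters returned at different thresholds with the number of named species in a genus is one way to see how a fixed 2% or 3% cutoff can split a species that has high intraspecific variation.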

Usage notes

Fasta_Files_Datasets1
The sequence files of 64,414 species after data filtering

Fasta_Files_Datasets2
The sequence files of 4,288 genera after data filtering

Funding

National Natural Science Foundation of China, Award: 31820103013

National Natural Science Foundation of China, Award: ZR2020QC053