Biodiversity soup II


Yu, Douglas et al. (2021), Biodiversity soup II, Dryad, Dataset,


1. Despite widespread recognition of its great promise to aid decision-making in environmental management, the applied use of metabarcoding requires improvements to reduce the multiple errors that arise during PCR amplification, sequencing, and library generation. We present a co-designed wet-lab and bioinformatic workflow for metabarcoding bulk samples that removes both false-positive (tag jumps, chimeras, erroneous sequences) and false-negative (‘dropout’) errors. However, we find that it is not possible to recover relative-abundance information from amplicon data, due to persistent species-specific biases. 

2. To present and validate our workflow, we created eight mock arthropod soups, all containing the same 248 arthropod morphospecies but differing in absolute and relative DNA concentrations, and we ran them under five different PCR conditions. Our pipeline includes qPCR-optimized PCR annealing temperature and cycle number, twin-tagging, multiple independent PCR replicates per sample, and negative and positive controls. In the bioinformatic portion, we introduce Begum, which is a new version of DAMe (Zepeda-Mendoza et al. 2016. BMC Res. Notes 9:255) that ignores heterogeneity spacers, allows primer mismatches when demultiplexing samples, and is more efficient. Like DAMe, Begum removes tag-jumped reads and removes sequence errors by keeping only sequences that appear in more than one PCR above a minimum copy number per PCR. The filtering thresholds are user-configurable. 

3. We report that OTU dropout frequency and taxonomic amplification bias are both reduced by using a PCR annealing temperature and cycle number on the low ends of the ranges currently used for the Leray-FolDegenRev primers. We also report that tag jumps and erroneous sequences can be nearly eliminated with Begum filtering, at the cost of only a small rise in dropouts. We replicate published findings that uneven size distribution of input biomasses leads to greater dropout frequency and that OTU size is a poor predictor of species input biomass. Finally, we find no evidence for ‘tag-biased’ PCR amplification.

4. To aid learning, reproducibility, and the design and testing of alternative metabarcoding pipelines, we provide our Illumina and input-species sequence datasets, scripts, a spreadsheet for designing primer tags, and a tutorial.
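The role of twin-tagging in removing tag jumps can be illustrated with a minimal Python sketch. It is not Begum's code. It assumes that each read has already been assigned a forward tag and a reverse tag, that twin-tagging means both ends of a valid amplicon carry the same tag, and that the set of tags used in the experiment is known. The function name `split_tag_jumps` and the input layout are hypothetical.

```python
from collections import Counter


def split_tag_jumps(reads, used_tags):
    """Separate reads with matching twin tags from likely tag jumps.

    reads: iterable of (forward_tag, reverse_tag, sequence) tuples
           (a hypothetical layout, not Begum's input format).
    used_tags: set of tags actually used in the experiment.

    Returns (kept, n_jumped), where kept maps (tag, sequence) to a read
    count and n_jumped is the number of discarded reads.
    """
    kept = Counter()
    n_jumped = 0
    for fwd_tag, rev_tag, seq in reads:
        # Assumption: a valid read carries the same, known tag on both ends.
        if fwd_tag == rev_tag and fwd_tag in used_tags:
            kept[(fwd_tag, seq)] += 1
        else:
            n_jumped += 1
    return kept, n_jumped


if __name__ == "__main__":
    example_reads = [
        ("A", "A", "ACGT"),
        ("A", "B", "ACGT"),  # mismatched tags: treated as a tag jump
        ("A", "A", "ACGT"),
        ("C", "C", "TTTT"),  # tag not used in this experiment
    ]
    kept, n_jumped = split_tag_jumps(example_reads, {"A", "B"})
    print(dict(kept), n_jumped)
```

Reads whose two tags disagree, or whose tag was never used, are counted as discarded rather than silently dropped, so the proportion of suspected tag jumps can be inspected.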


In this study, we tested the Begum pipeline with eight mock soups that differed in their absolute and relative DNA concentrations of 248 arthropod species. We metabarcoded the soups under five different PCR conditions that varied annealing temperatures (Ta) and PCR cycles, and we filtered the OTUs under different stringencies. 
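Filtering under different stringencies can be pictured with the following minimal Python sketch of the rule stated in the abstract: keep a sequence only if it appears in enough PCR replicates, each time at or above a minimum copy number. It is an illustration under stated assumptions, not Begum's implementation. The function name `filter_sequences`, the parameter names `min_pcrs` and `min_copies`, the default values, and the use of "at least" rather than "strictly more than" for the copy-number threshold are all assumptions.

```python
from collections import Counter


def filter_sequences(counts_per_pcr, min_pcrs=2, min_copies=2):
    """Keep sequences seen in >= min_pcrs replicates at >= min_copies each.

    counts_per_pcr: list of dicts, one per PCR replicate of a sample,
                    mapping sequence -> copy number (hypothetical layout).

    Returns a dict mapping each retained sequence to its total copy
    number, summed over the replicates that passed min_copies.
    """
    presence = Counter()  # number of replicates passing the copy threshold
    totals = Counter()    # copies summed over those replicates
    for counts in counts_per_pcr:
        for seq, n in counts.items():
            if n >= min_copies:
                presence[seq] += 1
                totals[seq] += n
    return {seq: totals[seq] for seq, k in presence.items() if k >= min_pcrs}


if __name__ == "__main__":
    replicates = [
        {"a": 10, "b": 1, "c": 5},
        {"a": 8, "b": 7},
        {"a": 1, "c": 4},
    ]
    # Lenient and stringent settings applied to the same three replicates.
    print(filter_sequences(replicates, min_pcrs=2, min_copies=2))
    print(filter_sequences(replicates, min_pcrs=3, min_copies=2))
```

Raising either threshold removes more erroneous sequences but also increases the risk of dropping true, low-abundance sequences, which is the trade-off between false positives and dropouts described above.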

Usage Notes

We have archived a tutorial with a reduced sequence dataset and simplified scripts (PCR B only, 253 MB), and we have archived the full dataset (~9.75 GB) with reference files, folder structure, output files, and scripts. To run the scripts, remove the included output files listed in the README file. 

The scripts are kept updated at


Chinese Academy of Sciences, Award: XDA20050202: Strategic Priority Research Program

National Natural Science Foundation of China, Award: 41661144002

National Natural Science Foundation of China, Award: 31670536

National Natural Science Foundation of China, Award: 31500305

National Natural Science Foundation of China, Award: 31400470

Key Research Program of Frontier Science, Chinese Academy of Sciences, Award: QYZDY-SSW-SMC024

Bureau of International Cooperation, Chinese Academy of Sciences, Award: GJHZ1754

Ministry of Science and Technology of the People's Republic of China, Award: 2012FY110800

State Key Laboratory of Genetic Resources and Evolution, Award: GREKF18-04

Leverhulme Trust, Award: RF-2017-342

Danish Council for Independent Research, Award: DFF-5051-00140
