Skip to main content

Extracting abundance information from DNA-based data

Cite this dataset

Luo, Mingjie; Ji, Yinqiu; Warton, David; Yu, Douglas (2022). Extracting abundance information from DNA-based data [Dataset]. Dryad.


The accurate extraction of species-abundance information from DNA-based data (metabarcoding, metagenomics) could contribute usefully to the reconstruction of diets and quantitative foodwebs, the inference of species interactions, the modelling of population dynamics and species distributions, the biomonitoring of environmental state and change, and the inference of false positives and negatives. However, capture bias, capture noise, species pipeline biases, and pipeline noise all combine to inject error into DNA-based datasets. This review focuses on methods for correcting the latter two error sources, as the first two are addressed extensively in the ecological survey literature. To extract abundance information from DNA-based data, it is useful to distinguish two concepts. (1) Across-species quantification describes relative species abundances within a single sample. (2) In contrast, within-species quantification describes how the abundance of each individual species varies across samples, where the samples could be a time series, an environmental gradient, or different experimental treatments. In the first part of this paper, we review methods to remove species pipeline biases and pipeline noise. In the second part, we provide a detailed protocol and demonstrate experimentally how to use a ‘DNA spike-in’ (an internal standard) to remove pipeline noise and recover within-species abundance information.


2.1 Mock soup construction.

286 arthropods were collected in Kunming, China (25°8’23” N, 102°44’17” E) (Luo et al., 2022). DNA was extracted from each individual using the DNeasy Blood & Tissue Kit (Qiagen GmbH, Germany). Genomic DNA concentration of each individual was quantified from three replicates using PicoGreen fluorescent dye. 658-bp COI barcoding sequences were PCR’d with Folmer primers (LCO1490 and HCO2198) (Folmer et al., 1994) and Sanger-sequenced. After the 658-bp COI sequences were trimmed to 313 bp based on our metabarcoding primers (see 2.4 Primer design), 286 arthropods were clustered to 168 OTUs at 97% similarity. We selected 52 individuals with genomic DNA > 20 ng/µl, representing 52 OTUs.

We created a mock-soup gradient of seven dilution levels. First, we created the highest concentration-level soup by pooling 61 ng of each of the 52 OTUs. The next soup was created by pooling 48.8 ng (= 0.8 x 61) of each of the 52 OTUs, and so on to create a gradient of seven mock soups of differing absolute abundances, stepping down 0.8X each time. To make it possible to check for mundane experimental error (as opposed to failure of the spike-in to recover the gradient), we independently created this mock-soup gradient three times, for ntot = 21 independent poolings.

2.2 Preparation of Malaise-trap samples.

244 Malaise-trap samples from 96 sites, using 99.9 % ethanol as the trapping liquid, were collected in and near a 384 km2 forested landscape containing the H.J. Andrews Experimental Forest (44.2° N, 122.2° W), Oregon, United States in July 2018 (Luo et al., 2022). Traps were left for 7 non-rainy days. To equalize biomass across individuals, we only kept the heads of large individuals (body lengths >2 cm) and then transferred the samples to fresh 99.9% ethanol to store at room temperature until extraction. The samples were air dried individually on filter papers for less than an hour and then transferred to 50 ml tubes or 5 ml tubes according to sample volume. The samples were then weighed. DNA was non-destructively extracted by soaking the samples in lysis buffer, using the protocol from Ji et al. (2020) and Nielsen et al. (2019). For this study, we selected seven samples spread over the study area, each of which is an independent test of our ability to recover the dilution gradient. After completion of lysis, we serially diluted the 7 samples by using 0.7X lysis buffer volume (500 µl, 350 µl, 245 µl, 171.5 µl, 120 µl and 84 µl) to create six soups per sample (ntot = 42). We used QIAquick PCR purification kit (Qiagen GmbH, Germany) following the manufacturer instructions to purify lysis buffer on one spin column per soup. We used a shallower gradient (0.7X) because our starting DNA amount was lower than with the mock soups.

2.3 Adding spike-in DNA

2.3.1 Spike-in DNA. For our spike-ins, we used three insect species from China (Lepidoptera:Bombycidae, Coleoptera:Elateridae, Coleoptera:Mordellidae), none of which is expected to appear in the Oregon Malaise-trap samples. An alternative is to use one or more synthetic, random DNA sequences(Tkacz et al., 2018. Each of our three spike-ins is represented by a 658-bp COI fragment (Table S1) with primer binding sites that match the Folmer primers HCO2198 and LCO1490. For long-term storage, we inserted the COI fragments as plasmids into monoclonal bacteria. Plasmids were extracted using TIANprep Mini Plasmid Kit (Beijing, China) following manufacturer’s instructions.

2.3.2 Adding spike-in to the mock soups.Adding too much spike-in wastes sequencing data, while adding too little risks loss of abundance information in at least some samples when the number of spike-in reads is too low to use as a reliable correction factor. Thus, we quantified the COI copy numbers of the mock soups and the spike-in DNA by qPCR (Table S2, Figure S1) and chose a volume so that spike-in reads should make up 1% of the total number of COI copies in the lowest-concentration mock soups, balancing efficiency with reliability. We used all three spike-in species here and mixed them (Bombycidae:Elateridae:Mordellidae) in a ratio of 1:2:4, which was added directly to the mock soups’ DNA since they were already purified.

2.3.3 Adding spike-in to the Malaise-trap samples. – From the 244 Malaise-trap samples, we first extracted 17 Malaise-trap samples without adding spike-ins, and then we used qPCR to quantify the mean COI concentrations of these 17 samples in order to decide how much spike-in to add. Before adding the spike-ins, we discovered that the Bombycid DNA spike-in had degraded, and so we used only two spike-in species for the Malaise trap samples, at a ratio of 1:9 (Mordellidae:Elateridae). We then chose 7 other samples for this study. In these samples, lysis buffer (500 µl, 350 µl, 245 µl, 171.5 µl, 120 µl, 84 µl) from each sample was transferred into clean 1.5 ml tubes, and the spike-in DNA was added. We then purified the DNA with the Qiagen QIAquick PCR purification kit, following the manufacturer instructions. DNA was eluted with 200 µl of elution buffer. In this way, the spike-in DNA was co-purified, co-amplified, and co-sequenced along with the sample DNA (Figure 3 B). We also recorded the total lysis buffer volume of each sample, for downstream correction.

2.4 Primer design. For this study, we simultaneously tested two methods for extracting abundance information:  spike-ins and UMIs (Unique Molecular Identifiers). UMI tagging requires a two-step PCR procedure (Hoshino & Inagaki, 2017; Lundberg et al., 2013), first using tagging primers and then using amplification primers (Figure S2). The tagging primers include (1) the Leray-FolDegenRev primer pair to amplify the 313-bp COI amplicon of interest, (2) a 1- or 2-nucleotide heterogeneity spacer on both the forward and reverse primers to increase sequence entropy for the Illumina sequencer, (3) the same 6-nucleotide sequence on both the forward and reverse primers to ‘twin-tag’ the samples for downstream demultiplexing, (4) a 5N random sequence on the forward primer and a 4N random sequence on the reverse primer (9N total) as the UMI tags, (5) and parts of the Illumina universal adapter sequences to anneal to the 3’ ends of the forward and reverse primers for the second PCR. By splitting the 9N UMI into 5N + 4N over the forward and reverse primers, we avoid primer dimers. The amplification primers include (1) an index sequence on the forward primer pair for Illumina library demultiplexing, and (2) the full length of the Illumina adapter sequences. For further explanation of the design of the tagging primers (except for the UMI sequences), see Yang et al. (2021).

2.5 PCR and the Begum pipeline. – The first PCR amplifies COI and concatenates sample tags and UMIs and runs for only two cycles using KAPA 2G Robust HS PCR Kit (Basel, Roche KAPA Biosystems). We used the mlCOIintF-FolDegenRev primer pair (Leray et al., 2013; Yu et al., 2012, p. 2012), which amplifies a 313-bp fragment of the COI barcode; and we followed the Begum protocol (Yang et al., 2021; Zepeda-Mendoza et al., 2016), which is a wet-lab and bioinformatic pipeline that combines multiple independent PCR replicates per sample, twin-tagging and false positive controls to remove tag-jumping and reduce erroneous sequences. Twin-tagging means using the same tag sequence on both the forward and reverse primers in a PCR, and we use this design because during library index PCR for Illumina sequencing, occasional incomplete extensions can create new primers that already contain the tag of one amplicon, resulting in chimeric sequences with tags from two different amplicons (Schnell et al., 2015). Tag jumps thus almost always result in non-matching tag sequences, and these are identified and removed in the Begum pipeline. We performed 3 PCR replicates per sample, which means we used 3 different twin-tags to distinguish the 3 independent PCR replicates. Begum removes erroneous sequences by filtering out the reads that appear in a low number of PCR replicates (e.g. only one PCR) at a low number of copies per PCR (e.g. only 2 copies), because true sequences are more likely to appear in multiple PCRs with higher copy numbers per PCR. The 20 µl reaction mix included 4 µl Enhancer, 4 µl Buffer A, 0.4 µl dNTP (10 mM), 0.8 µl per primer (10 mM), 0.08 µl KAPA 2G HotStart DNA polymerase (Basel, Roche KAPA Biosystems), 5 µl template DNA and 5 µl water. PCR conditions were initial denaturation at 95°C for 3 minutes, followed by two cycles of denaturation at 95°C for 1 minute, annealing at 50°C for 90 seconds, and extension at 72°C for 2 minutes. Then the products were purified with 14 µl of KAPA pure beads (Roche KAPA Biosystems, Switzerland) to remove the primers and PCR reagents and were eluted into 16 µl of water.

The second PCR amplifies the tagged templates for building the libraries that can be sequenced directly on Illumina platform. The 50 µl reaction mix included 5 µl TAKARA buffer, 4 µl dNTP (10 mM), 1.2 µl per primer (10 mM), 0.25 µl TAKARA Taq DNA polymerase, 15 µl DNA product from the first PCR, and 23.35 µl water. PCR conditions were initial denaturation at 95°C for 3 minutes, 5 cycles of denaturation at 95°C for 30 seconds, annealing at 59°C for 30 seconds (-1 °C per cycle), extension at 72°C for 30 seconds, followed by 25 cycles of denaturation at 95°C for 30 seconds, annealing at 55°C for 30 seconds, extension at 72°C for 30 seconds; a final extension at 72°C for 5 minutes, and cool down to 4°C.

From all second PCR products, 2 µl was roughly quantified on 2% agarose gel with Image Lab 2.0 (Bio-Rad, USA). For each set of PCR reactions with the same index, amplicons were mixed at equimolar ratios to make a pooled library. One PCR negative control were set for each library. We sent our samples to Novogene (Tianjin, China) to do PE250 sequencing on Illumina NovaSeq 6000, requiring a 0.8 GB raw data from each PCR reaction.

2.6 Bioinformatic processing. – AdapterRemoval 2.1.7 was used to remove any remaining adapters from the raw data (Schubert et al., 2016). Sickle 1.33 was used to trim away low-quality bases at the 3’ends. BFC V181 was used to denoise the reads (Li, 2015). Read merging was performed using Pandaseq 2.11 (Masella et al., 2012). Begum was used to demultiplex the reads by sample tag and to filter out erroneous reads (, accessed 07 Sep 2021). We allowed 2-bp primer mismatches to the twin-tags while demultiplexing, and we filtered at a stringency of accepting only reads that appeared in at least two PCRs at a minimum copy number of 4 reads per PCR, with minimum length of 300 bp. This stringency minimized the false positive reads in the negative PCR control.

For mock-soup data, we need to compare the UMI and read numbers in each PCR set. However, Begum cannot recognize UMIs. Also because of our complicated primer structure, there is no software available for our data to count the UMIs per OTU in each PCR set. Thus, we wrote a custom bash script to process the mock-soup data from the Pandaseq output files, which include all the UMIs, tags, and primers. First, we used Begum-filtered sequences as a reference to filter reads for each PCR set and put the UMI information on read headers. Then we carried out reference-based OTU clustering for each PCR set with QIIME 1.9.1 ( -m uclust_ref -s 0.99) (Caporaso et al., 2010; Edgar, 2010), using the OTU representative sequences from barcoding Sanger sequencing as the reference, counted UMIs and reads for each OTU in each PCR set, and generated two OTU tables, separately with UMI and read numbers.

For the Malaise-trap data, we directly used the Begum pipeline. After Begum filtering, vsearch 2.14.1 (--uchime_denovo) (Rognes et al., 2016) was used to remove chimeras. Sumaclust 1.0.2 was used to cluster the sequences of Malaise-trap samples into 97% similarity OTUs. The python script from the DAMe toolkit was used to generate the OTU table. Finally, we applied the R package LULU 0.1.0 with default parameters to merge oversplit OTUs (Frøslev et al., 2017). The OTU table and OTU representative sequences were used for downstream analysis.

2.6 Statistical analyses. – All statistical analyses were carried out in R 4.1.0 (R Core Team, 2021), and we used the {lme4} 1.1-27 package (Bates et al., 2015) to fit linear mixed-effects models, using OTU, soup replicate, and PCR replicates as random factors, to isolate the variance explained by the sole (fixed-effect) predictor of interest:  OTU size. Model syntax is given in the legend of Figure 4. We used the {MuMIn} 1.43.17 package (, accessed 2 Jan 2022) to calculate the variance explained by fixed effects only (marginal R2). To carry out spike-in correction, we first calculated a weighted mean from the added spike-ins (e.g. mean(Bombycidae + Elateridae/2 + Mordellidae/4)), rescaled the new mean spike-in so that the smallest value is equal to 1, and divided each row’s OTU size and UMI number by the weighted, scaled spike-in.

Usage notes

Submitted data files are fastq files. 


State Key Laboratory of Genetic Resources and Evolution, Award: GREKF19-01

State Key Laboratory of Genetic Resources and Evolution, Award: GREKF20-01

State Key Laboratory of Genetic Resources and Evolution, Award: GREKF21-01