Sources of uncertainty in DNA metabarcoding of whole communities: implications for its use in biomonitoring
Data files
Jun 12, 2025 version files 11.91 MB
- Bioinformatics_output_raw_READ_ME.txt (6.63 KB)
- Bioinformatics_output_raw.csv (11.70 MB)
- Morphological_analysis_raw_READ_ME.txt (2.26 KB)
- Morphological_analysis_raw.csv (8.67 KB)
- Morphological_Bioinformatics_Harmonised_Assemblage_Data_READ_ME.txt (3.30 KB)
- Morphological_Bioinformatics_Harmonised_Assemblage_Data.csv (158.28 KB)
- README.md (11.17 KB)
- Sample_location_details.csv (1.31 KB)
- SAS_code.txt (12.15 KB)
Abstract
These data were derived from a structured experiment to assess factors influencing the precision and accuracy of molecular methods for freshwater macroinvertebrate community assessment. Benthic macroinvertebrates were sorted, identified, and counted from nine individual kick-net samples using standard morphometric protocols, then reconstituted. These bulk specimen samples were then homogenised. From each bulk sample, aliquots of the homogenate were distributed among seven laboratories across Europe for DNA extraction, PCR amplification, library preparation, and sequencing. Additionally, each laboratory was provided with a DNA extract from the bulk homogenate to allow for amplification through to sequencing, in order to assess the influence of DNA extraction. The data consist of abundances of individual taxa identified using morphometric protocols and read counts of operational taxonomic units identified by the molecular methods. In addition, we provide a taxonomically harmonised matrix of the community composition derived from morphometric and molecular approaches to facilitate comparison between the two methods.
Dataset DOI: 10.5061/dryad.dncjsxm9t
Description of the data and file structure
Advancements have been made in the use of DNA-based methods for detection of single species. However, the routine application of DNA-based methods to monitor whole communities using a metabarcoding approach and derive ecosystem status continues to be limited. We undertook a structured experiment to assess factors influencing the precision and accuracy of molecular methods for freshwater macroinvertebrate community assessment. To quantify the contributions of sequencing, laboratory, DNA extraction and sample composition to uncertainty and biases in macroinvertebrate taxonomic detection and identification through metabarcoding, we compared data generated by a well-established biomonitoring protocol used in the UK from a set of nine kick-net samples with data simultaneously produced by seven molecular laboratories. The experimental design allowed us to: (1) compare the accuracy and precision of taxa identification within and between molecular laboratories and benchmark results to morphological outputs; (2) investigate factors influencing variability in the detection of taxa derived through DNA metabarcoding; and (3) utilise information gathered from our analyses to provide advice on best practice to support development of reproducible protocols for metabarcoding and identify future research requirements.
Files and variables
File: Morphological_analysis_raw.csv
Description: These are the abundances of the different taxa morphologically identified in each of the nine samples processed by qualified taxonomists at Queen Mary University of London. Here we provide an explanation for each of the column headers in the Morphological_analysis_raw.csv file.
Variables
- Major Group: Higher taxonomic level of each morphologically identified taxon
- Taxon Name: Taxon name of each morphologically identified taxon
- A - I: Count of each morphologically identified taxon in stream kick samples A - I
File: Morphological_Bioinformatics_Harmonised_Assemblage_Data.csv
Description: Here we provide an explanation for each of the column headers in the Morphological_Bioinformatics_Harmonised_Assemblage_Data.csv file. These are the morphological abundance data and DNA metabarcoding read count data following harmonisation. DNA metabarcoding data from each laboratory have also been corrected for the associated negative control results as detailed in the related work.
Variables
- Major Group: Higher taxonomic level of each morphologically identified taxon
- Taxon Name: Taxon name of each morphologically identified taxon
- A - I: Count of each morphologically identified taxon in samples A - I
- L2_a_Lab2_rep1 - L8_i_NHM_rep3: Read count for each PCR replicate, designated by a sample code made up of four elements separated by underscores. The four elements indicate (i) which laboratory processed the PCR replicate, (ii) which sample was processed, (iii) whether the DNA was extracted by the laboratory itself or by the Natural History Museum (NHM), and (iv) which replicate was processed. The table lists the values each element can take; each column is an independent list, so rows do not represent combinations.
| Molecular analysis laboratory | Stream kick sample | DNA Extraction | PCR Replicate |
|---|---|---|---|
| L2 | a | NHM | rep1 |
| L3 | b | Lab2 | rep2 |
| L4 | c | Lab3 | rep3 |
| L5 | d | Lab4 | |
| L6 | e | Lab5 | |
| L7 | f | Lab6 | |
| L8 | g | Lab7 | |
| | h | Lab8 | |
| | i | | |
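The four-element sample codes can be split programmatically. The following is a minimal sketch, assuming every read-count column follows the `laboratory_sample_extraction_replicate` pattern described above; the names `ReplicateCode`, `parse_replicate_code` and `extracted_in_house` are illustrative and are not part of the deposited files. Columns for control reactions, if present, may use a different pattern and would need separate handling.

```python
from typing import NamedTuple


class ReplicateCode(NamedTuple):
    """One PCR replicate column header, split into its four elements."""
    laboratory: str  # e.g. "L2"
    sample: str      # e.g. "a"
    extraction: str  # "NHM" or the laboratory's own code, e.g. "Lab2"
    replicate: str   # e.g. "rep1"


def parse_replicate_code(code: str) -> ReplicateCode:
    """Split a header such as 'L2_a_Lab2_rep1' on underscores."""
    parts = code.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected four underscore-separated elements: {code!r}")
    return ReplicateCode(*parts)


def extracted_in_house(rc: ReplicateCode) -> bool:
    """True when the DNA was extracted by the laboratory itself, not by the NHM."""
    return rc.extraction != "NHM"


first = parse_replicate_code("L2_a_Lab2_rep1")
last = parse_replicate_code("L8_i_NHM_rep3")
```

Parsing the headers once makes it straightforward to group replicates by laboratory, sample or extraction source.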
File: Bioinformatics_output_raw.csv
Description: Here we provide an explanation for each of the column headers in the Bioinformatics_output_raw.csv file. These are the data output by the bioinformatics pipeline as detailed in the related work. Raw read data were compiled with participating laboratories anonymised, and then processed with the APSCALE pipeline (v 1.6.3, https://github.com/DominikBuchner/apscale) using default settings. Taxonomic assignment was performed using BOLDigger (v 2.1.1, https://github.com/DominikBuchner/BOLDigger). The best hit was determined with the BOLDigger method and then further corrected using the API correction function (see related work for more details).
Variables
- OTU_STATUS: Indicates whether each OTU returned by the bioinformatics pipeline was retained for data analyses or excluded from them. Excluded OTUs were assigned to one of five categories: un-matched, non-target taxa, improbable taxa, poorly resolved taxa, and OTUs with <80% similarity to the library match
- ID: Unique identifier for each operational taxonomic unit (OTU) detected
- Phylum: Assigned phylum for OTU
- Class: Assigned class for OTU. The cell is left blank if the OTU could not be assigned to a Class.
- Order: Assigned order for OTU. The cell is left blank if the OTU could not be assigned to an Order.
- Family: Assigned family for OTU. The cell is left blank if the OTU could not be assigned to a Family.
- Genus: Assigned genus for OTU. The cell is left blank if the OTU could not be assigned to a Genus.
- Species: Assigned species for OTU. The cell is left blank if the OTU could not be assigned to a Species.
- Similarity: Percent similarity of OTU base pair sequence to that of assigned library match
- Status: Source of assigned library match
- Flags: Flags are added when assigning the library match via BOLDigger and can be looked up in the GitHub README: https://github.com/DominikBuchner/BOLDigger. A closer look at all flagged hits is advised, since flags indicate some uncertainty in the selected hit. If no flags were added when assigning the library match, the cell is left empty.
- L2_a_Lab2_rep1 - L8_i_NHM_rep3: Read count for each PCR replicate, designated by a sample code made up of four elements separated by underscores. The four elements indicate (i) which laboratory processed the PCR replicate, (ii) which sample was processed, (iii) whether the DNA was extracted by the laboratory itself or by the Natural History Museum (NHM), and (iv) which replicate was processed. The table lists the values each element can take; each column is an independent list, so rows do not represent combinations.
| Molecular analysis laboratory | Stream kick sample | DNA Extraction | PCR Replicate |
|---|---|---|---|
| L2 | a | NHM | rep1 |
| L3 | b | Lab2 | rep2 |
| L4 | c | Lab3 | rep3 |
| L5 | d | Lab4 | |
| L6 | e | Lab5 | |
| L7 | f | Lab6 | |
| L8 | g | Lab7 | |
| | h | Lab8 | |
| | i | | |
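Because Bioinformatics_output_raw.csv stores one column per PCR replicate, it is often convenient to reshape it into one record per OTU and replicate. The sketch below uses only the Python standard library and assumes that every column not listed under Variables above is a read-count column named with the four-element sample code. The two-row table in `EXAMPLE` is invented for illustration (including the placeholder `OTU_STATUS` and `Status` values) and does not reproduce real data; `wide_to_long` is a hypothetical helper name.

```python
import csv
import io

# Descriptive columns documented for Bioinformatics_output_raw.csv.
METADATA_COLUMNS = {
    "OTU_STATUS", "ID", "Phylum", "Class", "Order", "Family",
    "Genus", "Species", "Similarity", "Status", "Flags",
}

# Invented example rows; the real file has one column per PCR replicate.
EXAMPLE = (
    "OTU_STATUS,ID,Phylum,Class,Order,Family,Genus,Species,Similarity,Status,Flags,"
    "L2_a_Lab2_rep1,L8_i_NHM_rep3\n"
    "placeholder,OTU_1,Arthropoda,Insecta,Ephemeroptera,Baetidae,Baetis,,99.5,placeholder,,120,0\n"
)


def wide_to_long(handle):
    """Return one dict per (OTU, PCR replicate) pair with the parsed sample code."""
    records = []
    for row in csv.DictReader(handle):
        for column, value in row.items():
            if column in METADATA_COLUMNS:
                continue
            # Assumes the four-element code; other headers would raise ValueError.
            laboratory, sample, extraction, replicate = column.split("_")
            records.append({
                "ID": row["ID"],
                "laboratory": laboratory,
                "sample": sample,
                "extraction": extraction,
                "replicate": replicate,
                "reads": int(value),
            })
    return records


records = wide_to_long(io.StringIO(EXAMPLE))
```

For the real file, the same function can be called on `open("Bioinformatics_output_raw.csv", newline="")`, provided the assumption about column names holds.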
File: Sample_location_details.csv
Description: Location and environmental description of the river sampling locations from which the nine samples have been taken.
Variables
- Location and environmental description of the river sampling locations from which the nine samples have been taken.: Location and environmental description of the river sampling locations from which the nine samples have been taken
- River Name: Name of sampled river
- Latitude: Latitude
- Longitude: Longitude
- Distance from source (km): distance from sampled site along river channel to river source
- Elevation (m above sea level): elevation of sampled location
- Channel slope (m km-1): slope of river channel at sampled location
- Channel width (m): mean width of water surface at sampled location at time of sampling
- Channel depth (m): mean depth of water at sampled location at time of sampling
- Conductivity (microScm-1): measured conductivity of stream water at time of sampling
- Stream bed substrate composition1 (%): visually assessed percent cover of boulders and cobbles in stream bed at sampled location at time of sampling
- Stream bed substrate composition2 (%): visually assessed percent cover of pebbles and gravel in stream bed at sampled location at time of sampling
- Stream bed substrate composition3 (%): visually assessed percent cover of sand in stream bed at sampled location at time of sampling
- Stream bed substrate composition4 (%): visually assessed percent cover of silt and clay in stream bed at sampled location at time of sampling
File: Bioinformatics_output_raw_READ_ME.txt
Description: Associated information to aid understanding of Bioinformatics_output_raw.csv file
File: Morphological_analysis_raw_READ_ME.txt
Description: Associated information to aid understanding of Morphological_analysis_raw.csv file
File: Morphological_Bioinformatics_Harmonised_Assemblage_Data_READ_ME.txt
Description: Associated information to aid understanding of Morphological_Bioinformatics_Harmonised_Assemblage_Data.csv file
File: SAS_code.txt
Description: The SAS code used to run the hierarchical analysis of variance (ANOVA). Hierarchical ANOVA was used to estimate the variance in each of the calculated metrics. Read count data were log10 transformed before analysis to avoid heteroscedasticity. The analysis was repeated for each metric derived from the taxa detected, and the variance, its significance (i.e., a consistent directional difference), and the relative contributions to the total and within-sample variance were calculated to test the study hypotheses (see related work). All analyses were undertaken using General Linear Models in SAS/STAT® (SAS Institute Inc., Cary, NC, USA).
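To illustrate the idea of partitioning variance after a log10 transformation, the following sketch splits the total sum of squares of log-transformed read counts into a between-laboratory and a within-laboratory component. It is a single-level simplification written in Python, not a translation of the deposited SAS code, and the read counts are invented. All values are positive; how zero read counts were handled before transformation is not stated here, so the sketch does not address it.

```python
import math
from statistics import fmean


def partition_sum_of_squares(groups):
    """Split the total sum of squares into between-group and within-group parts.

    groups maps a group label (here a laboratory code) to a list of values.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (v - fmean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    return ss_total, ss_between, ss_within


# Invented read counts for three PCR replicates from three laboratories.
reads_by_lab = {
    "L2": [1200, 950, 1100],
    "L3": [400, 520, 610],
    "L4": [2300, 1800, 2100],
}
log_reads = {
    lab: [math.log10(r) for r in reads] for lab, reads in reads_by_lab.items()
}

ss_total, ss_between, ss_within = partition_sum_of_squares(log_reads)
share_between = ss_between / ss_total  # fraction attributable to laboratory
```

The hierarchical analysis in SAS_code.txt nests several such levels; this example shows only the basic identity that the between-group and within-group sums of squares add up to the total.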
Code/software
All files are in either comma-separated values (.csv) or text (.txt) format and as such should be readable by any spreadsheet or text-based software.
All analyses in the related work were undertaken in SAS/STAT® (SAS Institute Inc., Cary, NC, USA). The SAS code used to run the hierarchical analysis of variance is provided in SAS_code.txt.
Sample collection and processing
Field sampling
Nine samples of the benthic macroinvertebrate community were collected from seven river sites in southern England, UK (see Jones et al., 2025). Each site had been the subject of at least three years of study prior to the current sample collection, providing a comprehensive record of taxa likely to be present. Samples were collected using the UK standard macroinvertebrate biomonitoring protocol, comprising a 3-minute kick sample and 1-minute search with a 1 mm mesh size net sampling all habitats in proportion to their areal occurrence (BS EN 16150:2012). All material retained in the net was preserved on-site in 96% ethanol and returned to the laboratory.
Morphological identification
Macroinvertebrates were manually sorted from other material, identified to the lowest practicable taxonomic level (most to species or genus level, see Jones et al., 2025) and counted. Samples were processed by qualified, experienced freshwater biologists. All material and any picked animals were then reconstituted in the original 96% ethanol and stored at 4 °C. To ensure data obtained through morphological analysis did not influence molecular workflows or outputs, the nine samples were anonymised (A to I at random).
Molecular Analysis
Molecular analysis was carried out independently at seven laboratories across Europe, which were assigned a code (2–8) at random to preserve anonymity and eliminate bias.
At the Natural History Museum (NHM) in London, the nine samples were homogenised following Pereira-da-Conceicoa et al. (2021), with minor modifications (see Jones et al., 2025). Sub-samples of the homogenate were sent to the seven participating laboratories, where each performed their own DNA extraction. The NHM also extracted DNA from each of the homogenised samples.
Library preparation and sequencing
Each laboratory processed and prepared their library within their own institution following the two-step PCR approach as outlined by Buchner et al. (2021) using the BF3/BR2 freshwater invertebrate primer set developed for the cytochrome c oxidase subunit I (COI) gene (Elbrecht et al., 2019). Minor modifications to the protocol were made by individual laboratories due to facility constraints (see Jones et al., 2025); however, key steps were kept consistent: the DNA extraction kit and primers used, PCR profiles, and clean-up steps. Two-step PCRs were performed on each extraction replicate using the Qiagen Multiplex PCR Plus Kit (Qiagen, Germany) with 0.2 μM of each primer in a final volume of 25 µl. PCRs were run with the following conditions: 95 °C for five minutes followed by 30 cycles of 95 °C for 30 seconds; 50 °C for 30 seconds; 72 °C for 30 seconds with a final extension at 68 °C for 10 minutes. Three PCR replicates were carried out from each sub-sample of NHM/Lab extract. In PCR 2, 1 μl of amplicon from PCR 1 was used in the following PCR conditions: 95 °C for five minutes, followed by 20 cycles of 95 °C for 30 seconds; 61 °C for 30 seconds; 72 °C for 42 seconds and a final extension at 68 °C for 10 minutes. All positive and negative controls were processed alongside samples. PCR products were then cleaned using the recommended Agencourt AMPure beads at a 0.7x ratio. The PCR product size was then checked on an agarose gel. The concentration of each of these samples was quantified using Qubit, TapeStation or qPCR (dependent on laboratory facilities) and samples were then pooled in equimolar concentrations with the negative controls added at the maximum volume for any single sample. Libraries were loaded at 12 pM concentration, with 5% PhiX control. Samples were run on an Illumina MiSeq (MiSeq Reagent Kit v3, 600 cycles) following the manufacturer's run protocols for 300 bp PE sequencing (Illumina, Inc., San Diego, CA, USA).
Raw read data were compiled with participating laboratories anonymised, and then processed with the APSCALE pipeline (Buchner et al., 2022: v 1.6.3, https://github.com/DominikBuchner/apscale) using default settings. Taxonomic assignment was performed using BOLDigger (Buchner & Leese, 2020) (v 2.1.1). The best hit was determined with the BOLDigger method and then further corrected using the API correction function (see Jones et al., 2025).
Data harmonisation and index calculation
Data harmonisation
Output from the bioinformatics pipeline comprised 7680 operational taxonomic units (OTUs) identified from the 504 PCR replicates (six replicate PCRs for nine samples, as well as nine negative controls and nine positive controls processed by each of the seven laboratories). Records of non-target taxa were removed (see Jones et al., 2025), where target taxa were defined as the freshwater macroinvertebrates considered in the mixed taxonomic level (MTL) system of the Environment Agency (2014). The resultant matrix consisted of 519 OTUs detected across the 504 PCR replicates.
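The replicate total quoted above can be checked with a few lines of arithmetic. This is a small worked example using only the numbers given in the text; the variable names are illustrative. The six PCR replicates per sample are taken to be three from the laboratory's own extract plus three from the NHM extract, as described under Library preparation and sequencing.

```python
# Counts stated in the text.
LABORATORIES = 7
SAMPLES = 9
EXTRACTION_SOURCES = 2          # laboratory's own extract and NHM extract
REPLICATES_PER_EXTRACT = 3
NEGATIVE_CONTROLS_PER_LAB = 9
POSITIVE_CONTROLS_PER_LAB = 9

replicates_per_sample = EXTRACTION_SOURCES * REPLICATES_PER_EXTRACT
per_laboratory = (
    SAMPLES * replicates_per_sample
    + NEGATIVE_CONTROLS_PER_LAB
    + POSITIVE_CONTROLS_PER_LAB
)
total_pcr_replicates = LABORATORIES * per_laboratory

# Share of OTUs kept after removing non-target records.
OTUS_BEFORE_FILTERING = 7680
OTUS_AFTER_FILTERING = 519
retained_percent = 100 * OTUS_AFTER_FILTERING / OTUS_BEFORE_FILTERING
```

Each laboratory therefore contributes 72 reactions, giving 504 in total, and fewer than 7% of the OTUs remain after filtering.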
To facilitate comparison between morphologically derived and metabarcoding data, both datasets were harmonised to the same operational MTL, ensuring that the harmonised list contained only discrete taxa (see Jones et al., 2025). The final harmonised list comprised 162 discrete taxa.
