
Post-bioinformatic methods to identify and reduce the prevalence of artefacts in metabarcoding data

Citation

Drake, Lorna; Cuff, Jordan (2021), Post-bioinformatic methods to identify and reduce the prevalence of artefacts in metabarcoding data, Dryad, Dataset, https://doi.org/10.5061/dryad.2jm63xsp4

Abstract

Metabarcoding provides a powerful tool for investigating biodiversity and trophic interactions, but the high sensitivity of this methodology makes it vulnerable to errors, resulting in artefacts in the final data. Metabarcoding studies thus often utilise minimum sequence copy thresholds (MSCTs) to remove artefacts that remain in datasets; however, there is no consensus on best practice for the use of MSCTs. To mitigate erroneous reporting of results and inconsistencies, this study discusses and provides guidance for best-practice filtering of metabarcoding data for the ascertainment of conservative and accurate data. The most common MSCTs identified in the literature were applied to example datasets of Eurasian otter (Lutra lutra) and cereal crop spider (Araneae: Linyphiidae and Lycosidae) diets. Changes in both the method and threshold value considerably affected the resultant data. Of the MSCTs tested, it was concluded that the optimal method for the examples given combined a sample-based threshold with removal of maximum taxon contamination, providing stringent filtering of artefacts whilst retaining target data. Choice of threshold value differed between datasets due to variation in artefact abundance and sequencing depth, thus studies should employ controls (mock communities, negative controls with no DNA and unused MID-tag combinations) to select threshold values appropriate for each individual study.

Methods

Example dataset one: British otter diet

Faecal samples were collected during otter post-mortems by the Cardiff University Otter Project. Extracted faecal DNA was amplified using two metabarcoding primer pairs designed to amplify regions of the 16S rRNA and cytochrome c oxidase subunit I (COI) genes, each primer having ten-base-pair molecular identifier tags (MID tags) to facilitate post-bioinformatic sample identification. Extraction and PCR negative controls, unused MID tag combinations, repeat samples and mock communities were included alongside the focal eDNA samples. Mock communities comprised standardised mixtures of DNA of marine species not previously detected in the diet of Eurasian otters. The resultant DNA libraries for each marker were sequenced on separate MiSeq V2 chips with 2x250bp paired-end reads. 

Example dataset two: cereal crop spider diet

Money spiders (Bathyphantes, Erigone, Microlinyphia and Tenuiphantes; Araneae: Linyphiidae) and wolf spiders (Pardosa; Araneae: Lycosidae) were visually located on transects through barley fields. Gut DNA, extracted from the whole spider abdomen, was amplified using two COI metabarcoding primer pairs. One primer pair was selected for broad amplification of all invertebrates present, including the predator, and the other designed to exclude spider DNA to avoid predator amplification, each primer having ten-base-pair MID tags to facilitate post-bioinformatic sample identification. Extraction and PCR negative controls, unused MID tag combinations, repeat samples and mock communities were included alongside the focal eDNA samples. Mock communities comprised standardised mixtures of DNA of exotic species not previously recorded in Britain. The resultant DNA libraries for each marker were sequenced on a MiSeq V3 chip with 2x300bp paired-end reads. 

Sequence analysis 

Bioinformatic analyses were carried out using a custom pipeline. Sequences were first checked for truncation of MID tags by determining, for each sequence file, the proportion of reads containing exactly 10 bp before their respective primer. In all cases, the degree of truncation was deemed acceptable (≤10%).
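The truncation check can be sketched as follows. This is an illustrative Python sketch, not the authors' pipeline; the primer sequence and reads below are toy examples.

```python
# Illustrative sketch: estimate MID-tag truncation by checking how many reads
# carry exactly 10 bp ahead of the primer sequence (toy reads and primer).
def truncation_rate(reads, primer):
    """Proportion of primer-bearing reads whose primer is NOT preceded by exactly 10 bp."""
    intact = 0
    found = 0
    for seq in reads:
        pos = seq.find(primer)
        if pos == -1:
            continue  # primer absent; not informative for this check
        found += 1
        if pos == 10:  # an intact 10-bp MID tag precedes the primer
            intact += 1
    return 1 - intact / found if found else 0.0

reads = [
    "ACGTACGTAC" + "GGTCAACAAATCATAAAGATATTGG" + "TTTT",  # intact 10-bp tag
    "CGTACGTAC" + "GGTCAACAAATCATAAAGATATTGG" + "TTTT",   # truncated tag (9 bp)
]
rate = truncation_rate(reads, "GGTCAACAAATCATAAAGATATTGG")  # 0.5 here
```

In the study's terms, a file would pass this check when the rate is ≤0.10.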

FastP (Chen et al. 2018) was used to check the quality of reads, discard poor-quality reads (<Q30, <125 bp long or too many unqualified bases, denoted by 'N') and merge read pairs from MiSeq files (R1 and R2). Merged reads were assigned a sample ID based on the MID tags associated with each primer using the 'trim.seqs' function of Mothur (Schloss et al. 2009); this also removed the MID tag and primer sequences from the reads. Using the files created by Mothur, reads were demultiplexed to obtain one file per sample ID. Read headers were modified for each file to include the sample ID, and reads were then concatenated back into one file. Sequences were denoised (removal of PCR and sequencing errors) and clustered into zero-radius operational taxonomic units (zOTUs), and an OTU table was created, using the commands 'fastx_uniques', 'unoise3' and 'otutab' in Usearch (v. 11) (Edgar 2016; Edgar 2020). Taxonomic assignment for each zOTU was obtained using the 'blastn' command in BLAST+, with a threshold of 97% similarity and an e-value of 0.00001, against a downloaded database of DNA barcoding sequences submitted to online databases (e.g. GenBank; National Center for Biotechnology Information 2008; Camacho et al. 2009).
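The header-relabelling step (tagging each read with its sample ID so demultiplexed files can be pooled back into one) can be sketched as below. This is an illustrative Python sketch with FASTA-style headers and a made-up sample ID, not the authors' scripts.

```python
# Illustrative sketch: prefix each FASTA header with its sample ID so that
# per-sample files can be concatenated and later traced back to their sample.
def relabel_headers(lines, sample_id):
    """Return FASTA lines with sample_id prepended to every header line."""
    out = []
    for line in lines:
        if line.startswith(">"):
            out.append(f">{sample_id};{line[1:]}")  # keep original read name
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out

fasta = [">read1", "ACGT", ">read2", "TTGA"]
relabelled = relabel_headers(fasta, "Otter01")
# headers become ">Otter01;read1" and ">Otter01;read2"
```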

Before assigning taxonomic identities to each zOTU, BLAST results were filtered using the 'dplyr' package in R [version 3.6.0] via RStudio [version 1.2.1335] (R Core Team 2019) to retain only the accession codes with the top bit score for each zOTU. These data were then processed in MEGAN [version 6.12.3] (Huson et al. 2016) to assign taxonomic names to each zOTU. As erroneous entries on online databases can prevent species-level assignments, zOTUs for which the top BLAST hit (i.e. top bit score) was not resolved to species level were manually checked and assigned the most appropriate taxon. The taxonomic identity of each zOTU was added to the OTU table produced by Usearch, and reads were aggregated by taxonomic identity for each sample in R using the 'aggregate' function with a sum base function. OTUs were allocated taxonomic identities to overcome issues such as over-splitting of taxonomic groups, and to facilitate ecological interpretation of the data, particularly regarding identification of artefacts (e.g. identifying marine species in non-coastal otters).
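The top-hit filtering (keeping only accessions tied for the best bit score within each zOTU, done in the study with 'dplyr') can be sketched in plain Python. The hit tuples and accession codes below are placeholders.

```python
# Illustrative sketch: keep only BLAST hits tied for the top bit score per zOTU.
# Each hit is a (zotu, accession, bit_score) tuple; values are placeholders.
from collections import defaultdict

def top_hits(hits):
    """Return the hits whose bit score equals the best score for their zOTU."""
    best = defaultdict(float)
    for zotu, _, score in hits:
        best[zotu] = max(best[zotu], score)
    return [h for h in hits if h[2] == best[h[0]]]

hits = [
    ("zOTU1", "MK001", 480.0),
    ("zOTU1", "MK002", 480.0),  # tied top score: retained
    ("zOTU1", "MK003", 455.0),  # below top score: dropped
    ("zOTU2", "MK004", 390.0),
]
filtered = top_hits(hits)  # 3 hits retained
```

Ties are deliberately kept, which is why some zOTUs then need manual resolution (e.g. in MEGAN) when equally scoring hits disagree on species.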

Minimum Sequence Copy Thresholds (MSCTs)

The seven most common MSCTs identified from the literature (Table 1) were tested and their efficacy in cleaning all datasets was compared. Filtering methods were implemented in Excel using IF formulae.

Table 1: Seven post-bioinformatic filtering methods often applied to eDNA metabarcoding datasets, selected from those identified in a review of 154 metabarcoding studies (Table S1). The ‘method name’, herein used to refer to these methods, is given alongside the description (how the methods are executed) and the aim of each.

1. No filter
   Description: No OTU or sample filtering.
   Aim: No clean-up; maximum preservation of data.

2. Singletons
   Description: Remove any read counts of one.
   Aim: Remove extremely low-frequency artefacts (e.g. sequencing artefacts).

3. <10
   Description: Remove any read counts of less than ten.
   Aim: Remove low-frequency artefacts (e.g. sequencing artefacts, low-lying PCR contamination).

4. Max Contamination
   Description: Remove any read counts within each OTU that are lower than the highest read count within a negative/blank control for that OTU.
   Aim: Remove contamination detected by the negative controls (e.g. extraction/PCR contamination, tag-jumping).

5. Total %
   Description: Remove any read counts less than a set proportion of the total read count across the whole dataset.
   Aim: Remove low-frequency artefacts (e.g. sequencing artefacts, PCR contamination).

6. Sample %
   Description: Remove any read counts within a sample that are less than a set proportion of the total read count for that sample.
   Aim: Remove sample contamination (e.g. environmental, extraction or PCR contamination).

7. Taxon %
   Description: Remove any read counts with an abundance less than a set proportion of the total read count for that OTU.
   Aim: Remove cross-contamination between samples (e.g. tag-jumping).

If the read count (i.e. number of reads per sample per taxon) did not pass the designated threshold, then it was converted to zero (rather than subtracting the threshold, thus not altering the remaining read counts). For proportional methods (5-7, Table 1), a variety of thresholds were tested to explore how choice of threshold can affect data output. The range of thresholds tested were chosen based upon artefacts identified in control samples; we started with a low threshold and increased the value until most of the identifiable artefacts were removed. We also explored the effectiveness of using different MSCTs in pairwise combinations; this involved simultaneously applying ‘Max Contamination’ with each proportional threshold method (5-7), and ‘Sample %’ with ‘Taxon %’.
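The thresholding logic above, and its pairwise combination with 'Max Contamination', can be sketched as follows. This is an illustrative Python sketch of the rules in Table 1, not the authors' Excel formulae; the read counts and sample names are a toy example.

```python
# Illustrative sketch of two MSCTs applied to a {sample: {taxon: reads}} table.
# Counts failing a threshold are set to zero (not reduced by the threshold),
# so the surviving read counts are unaltered.

def sample_pct_filter(table, pct):
    """'Sample %': zero counts below pct of that sample's total read count."""
    out = {}
    for sample, counts in table.items():
        cutoff = sum(counts.values()) * pct
        out[sample] = {t: (n if n >= cutoff else 0) for t, n in counts.items()}
    return out

def max_contamination_filter(table, blanks):
    """'Max Contamination': zero counts lower than the highest blank count for that taxon."""
    max_blank = {}
    for counts in blanks.values():
        for t, n in counts.items():
            max_blank[t] = max(max_blank.get(t, 0), n)
    return {s: {t: (n if n >= max_blank.get(t, 0) else 0)
                for t, n in counts.items()}
            for s, counts in table.items()}

table = {"S1": {"fish": 950, "frog": 40, "bird": 10}}
blanks = {"neg1": {"frog": 60}}  # e.g. tag-jumped frog reads in a negative control
cleaned = max_contamination_filter(sample_pct_filter(table, 0.02), blanks)
# fish is kept; bird fails the 2% sample threshold; frog fails the blank maximum
```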

Basic statistics were calculated to assess the effectiveness of each filtering method; total read count was used to assess the loss of reads across the whole dataset, presence of singleton reads was used to assess removal of PCR and sequencing errors, reads in blanks (negative controls and unused MID-tags) were used to assess levels of contamination and tag-jumping, and mock communities were used to assess presence of false positives within samples. Artefacts could also be identified through taxa unexpectedly occurring in samples, such as taxa from dietary samples in controls, marine taxa associated with otters that did not have access to marine habitats, exotic taxa in British spider samples and mock community taxa in negative controls, unused MID tags or dietary samples.
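These summary checks can be sketched over the same toy table structure. This is an illustrative Python sketch; the blank sample IDs are assumptions.

```python
# Illustrative sketch: summary statistics used to judge a filtering method.
def filtering_summary(table, blank_ids):
    """Total reads, remaining singletons, and reads left in blanks."""
    all_counts = [n for counts in table.values() for n in counts.values()]
    return {
        "total_reads": sum(all_counts),          # loss of reads overall
        "singletons": sum(1 for n in all_counts if n == 1),  # sequencing/PCR error proxy
        "blank_reads": sum(n for s, counts in table.items()
                           if s in blank_ids for n in counts.values()),  # contamination proxy
    }

table = {"S1": {"fish": 950, "frog": 1}, "neg1": {"frog": 25}}
summary = filtering_summary(table, {"neg1"})
# {'total_reads': 976, 'singletons': 1, 'blank_reads': 25}
```

An effective MSCT should drive singletons and blank reads towards zero while leaving total reads largely intact.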

To visualise the results of each method, tables of reads were converted into heat charts using the 'ggplot2' package (Wickham 2016) in R. Frequency of occurrence for each taxon across all MID-tag combinations was also calculated for each filtering method and used to create heat charts. Relative frequencies were calculated by dividing frequency of occurrence by the total number of MID-tag combinations; these values then underwent non-metric multidimensional scaling (NMDS) to visualise dissimilarity between the taxa present following application of each MSCT. This was conducted using the 'metaMDS' function in the 'vegan' package (Oksanen et al. 2013) with two dimensions (stress <0.1) and a Bray-Curtis dissimilarity calculation (Bray and Curtis 1957). Ellipses were created using the 'ordiellipse' function with the default 'sd' setting (standard deviation).
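The relative-frequency step (frequency of occurrence divided by the number of MID-tag combinations) can be sketched as below. This is an illustrative Python sketch with a toy table; the NMDS itself was run with 'metaMDS' in the R 'vegan' package.

```python
# Illustrative sketch: relative frequency of occurrence per taxon across
# MID-tag combinations (each key in the table is one combination).
def relative_foo(table):
    """Share of MID-tag combinations in which each taxon occurs."""
    n = len(table)
    taxa = {t for counts in table.values() for t in counts}
    return {t: sum(1 for counts in table.values() if counts.get(t, 0) > 0) / n
            for t in sorted(taxa)}

table = {"S1": {"fish": 950}, "S2": {"fish": 3, "frog": 7}, "neg1": {}}
foo = relative_foo(table)
# fish occurs in 2 of 3 combinations, frog in 1 of 3
```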

Usage Notes

README.txt - README file

LD-16S-200519_S1_L001_R1_001.fastq - Otter 16S raw sequencing output for forward reads

LD-16S-200519_S1_L001_R2_001.fastq - Otter 16S raw sequencing output for reverse reads

LD-CO1-220519_S1_L001_R1_001.fastq - Otter COI raw sequencing output for forward reads

LD-CO1-220519_S1_L001_R2_001.fastq - Otter COI raw sequencing output for reverse reads

JC-G_S1_L001_R1_001.fastq - Spider general amplification raw sequencing output for forward reads

JC-G_S1_L001_R2_001.fastq - Spider general amplification raw sequencing output for reverse reads

JC-S_S2_L001_R1_001.fastq - Spider exclusion amplification raw sequencing output for forward reads

JC-S_S2_L001_R2_001.fastq - Spider exclusion amplification raw sequencing output for reverse reads

Otter Diet 16S Clean-up.xlsx - The raw files of cleaned data for the otter 16S dataset

Otter Diet COI Clean-up.xlsx - The raw files of cleaned data for the otter COI dataset

Spider Diet Exclusion Clean-up.xlsx - The raw files of cleaned data for the spider exclusion amplification dataset

Spider Diet General Clean-up.xlsx - The raw files of cleaned data for the spider general amplification dataset

Otter_Diet_16S_Oligos.txt - Molecular identifier tags and sample identities for the otter 16S dataset bioinformatics

Otter_Diet_COI_Oligos.txt - Molecular identifier tags and sample identities for the otter COI dataset bioinformatics

Spider_Diet_Exclusion_Oligos.txt - Molecular identifier tags and sample identities for the spider exclusion amplification dataset bioinformatics

Spider_Diet_General_Oligos.txt - Molecular identifier tags and sample identities for the spider general amplification dataset bioinformatics

R markdown Bioinformatics.pdf - The R markdown giving the script and explanations for the bioinformatics, specifically with the otter datasets as an example

R Markdown Statistics and Plots.pdf - The R markdown giving the script and explanations for the statistics, specifically with the otter datasets as an example

Otter Diet 16S NMDS FOO.csv - Frequency of occurrence in each clean-up method for each taxon for the otter 16S dataset

Otter Diet 16S presences.csv - Counts in each clean-up method for each taxon for the otter 16S dataset

Otter Diet 16S Reads.csv - Read counts in each clean-up method for each sample for the otter 16S dataset

Otter Diet COI NMDS FOO.csv - Frequency of occurrence in each clean-up method for each taxon for the otter COI dataset

Otter Diet COI presences.csv - Counts in each clean-up method for each taxon for the otter COI dataset

Otter Diet COI Reads.csv - Read counts in each clean-up method for each sample for the otter COI dataset

Spider Diet General NMDS FOO.csv - Frequency of occurrence in each clean-up method for each taxon for the spider general amplification dataset

Spider Diet General presences.csv - Counts in each clean-up method for each taxon for the spider general amplification dataset

Spider Diet General Reads.csv - Read counts in each clean-up method for each sample for the spider general amplification dataset

Spider Diet Exclusion NMDS FOO.csv - Frequency of occurrence in each clean-up method for each taxon for the spider exclusion amplification dataset

Spider Diet Exclusion presences.csv - Counts in each clean-up method for each taxon for the spider exclusion amplification dataset

Spider Diet Exclusion Reads.csv - Read counts in each clean-up method for each sample for the spider exclusion amplification dataset

Funding

Knowledge Economy Skills Scholarship

Biotechnology and Biological Sciences Research Council, Award: BB/M009122/1
