
Datasets from: Validated removal of nuclear pseudogenes and sequencing artefacts from mitochondrial metabarcode

Cite this dataset

Andujar, Carmelo; Arribas, Paula; Creedy, Thomas (2021). Datasets from: Validated removal of nuclear pseudogenes and sequencing artefacts from mitochondrial metabarcode [Dataset]. Dryad.


Metabarcoding of Metazoa using mitochondrial genes may be confounded by both the accumulation of PCR and sequencing artefacts and the co-amplification of nuclear mitochondrial pseudogenes (NUMTs). The application of read abundance thresholds and denoising methods is efficient in reducing noise accompanying authentic mitochondrial amplicon sequence variants (ASVs). However, these procedures do not fully account for the complex nature of concomitant sequences and the highly variable DNA contribution of specimens in a metabarcoding sample. We propose, as a complement to denoising, the metabarcoding Multidimensional Abundance Threshold Evaluation (metaMATE) framework, a novel approach that allows comprehensive examination of multiple dimensions of abundance filtering and the evaluation of the prevalence of unwanted concomitant sequences in denoised metabarcoding datasets. metaMATE requires a denoised set of ASVs as input, and designates a subset of ASVs as being either authentic (mtDNA haplotypes) or non-authentic ASVs (NUMTs and erroneous sequences) by comparison to external reference data and by analysing nucleotide substitution patterns. metaMATE (i) facilitates the application of read abundance filtering strategies, which are structured with regard to sequence library and phylogeny and applied for a range of increasing abundance threshold values, and (ii) evaluates their performance by quantifying the prevalence of non-authentic ASVs and the collateral effects on the removal of authentic ASVs. The output from metaMATE facilitates decision-making about required filtering stringency and can be used to improve the reliability of intraspecific genetic information derived from metabarcode data. The framework is implemented in the metaMATE software, available at


To ensure consistent treatment of the datasets, raw sequence reads were re-processed following a single protocol: primer removal, paired-end merging, quality filtering, and length filtering to retain reads of 416-420 bp (the expected 418 bp amplicon ± 2 bp), followed by library-by-library denoising with UNOISE3 in USEARCH v11 (Edgar, 2016). The final USEARCH step included chimera filtering, dereplication, and removal of all singleton reads, which were not considered further.
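As an illustration of the length-filtering step only, the following is a minimal Python sketch that keeps sequences of 416-420 bp from FASTA-formatted text. The helper names `parse_fasta` and `filter_by_length` are hypothetical and are not part of USEARCH or of the authors' processing scripts; the sketch only mirrors the length window stated above.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_length(records, min_len=416, max_len=420):
    """Keep records whose sequence length lies in [min_len, max_len].

    The defaults follow the 418 bp +/- 2 bp window given in the text;
    the function itself is an illustrative assumption.
    """
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


if __name__ == "__main__":
    example = ">read1\n" + "A" * 418 + "\n>read2\n" + "A" * 400 + "\n"
    kept = filter_by_length(parse_fasta(example))
    print([header for header, _ in kept])
```

The window is inclusive at both ends, so reads of exactly 416 bp and 420 bp are retained.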

The steps followed are given in the file "Standard processing from Raw reads to"

The sequences surviving the cleaning and denoising steps (hereafter ASVs) were classified to order or superfamily level, and only ASVs classified as the target taxon or taxa were retained: Apoidea (BEE dataset), Coleoptera (COL) and Coleoptera, Acari and Collembola (CAC).

These correspond to datasets:

"COL_Dataset. ASVs Coleoptera from Tenerife.fas"

"CAC_Dataset. ASVs Coleoptera Collembola Acari from Grazalema.fasta"

"BEE_Dataset. ASVs Apoidea from Mock communities.fasta"

To classify the ASVs, we generated a reference database comprising the NCBI nt database (downloaded 17 June 2018) combined with either (i) 1,011 additional reference COI sequences from Coleoptera, Acari and Collembola specimens collected in the Canary Islands and Sierra de Grazalema (COL and CAC datasets) or (ii) the BEEEE reference database described in Creedy et al. 2019. For each of the three datasets, searches against the reference database were performed using the BLASTn algorithm with the following settings: -evalue 0.001, -max_target_seqs 100. BLAST results were then processed with MEGAN6, using the weighted lowest common ancestor algorithm with default settings, to assign taxonomy to the ASVs.

The COL and CAC ASV datasets were obtained by processing COI metabarcode raw data from soil samples according to the procedures in Andújar et al. 2021, and then filtering the full set of ASVs with BLAST and MEGAN to retain only those classified as Coleoptera (COL) or as Coleoptera, Collembola and Acari (CAC). Details are given in Andújar et al. 2021.
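The BLASTn settings reported above can be assembled into a command line as in the following Python sketch. Only `-evalue 0.001` and `-max_target_seqs 100` come from the text; the function name `build_blastn_command`, the file names and the database name are placeholders assumed for illustration, and the output format passed on to MEGAN6 is not stated in the source, so it is left at the BLAST+ default here.

```python
import shlex


def build_blastn_command(query_fasta, database, out_path):
    """Return a blastn argument list using the settings reported in the text.

    query_fasta, database and out_path are placeholder values supplied by
    the caller; they are not taken from the dataset description.
    """
    return [
        "blastn",
        "-query", query_fasta,       # FASTA file of ASVs to classify
        "-db", database,             # reference database (assumed to be pre-built)
        "-evalue", "0.001",          # setting reported in the text
        "-max_target_seqs", "100",   # setting reported in the text
        "-out", out_path,            # where BLAST writes its results
    ]


if __name__ == "__main__":
    cmd = build_blastn_command("asvs.fasta", "reference_db", "blast_hits.txt")
    # Print the command for inspection; to run it, pass cmd to
    # subprocess.run(cmd, check=True) with BLAST+ installed.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems with file names that contain spaces, such as those in this dataset.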

Sequences used to identify va-ASVs are provided in the following files:

"Sanger sequences used to Identify va-ASVs on the CAC and COL datasets.fasta"

"Reference sequences used to identify va-ASVs on the BEE dataset.fasta"


MINECO, Award: CGL2015-74178-JIN
