
Data from: The role of taxonomic expertise in interpretation of metabarcoding studies

Cite this dataset

Pappalardo, Paula et al. (2021). Data from: The role of taxonomic expertise in interpretation of metabarcoding studies [Dataset]. Dryad.


The performance of DNA metabarcoding approaches for characterizing biodiversity can be influenced by multiple factors. Here we used morphological assessment of taxa in zooplankton samples to develop a large barcode database and to assess the congruence of taxonomic identification with metabarcoding under different conditions. We analyzed taxonomic assignment of metabarcoded samples using two genetic markers (COI, 18S V1-2), two types of clustering into molecular operational taxonomic units (OTUs, ZOTUs), and three methods for taxonomic assignment (RDP Classifier, BLASTn to GenBank, BLASTn to a local barcode database). The local database includes 1042 COI and 1108 18S (SSU) barcode sequences, and we added new high-quality sequences to GenBank for both markers, including 109 contributions at the species level. The number of phyla detected and the number of taxa identified to phylum varied between genetic markers and among the three methods used for taxonomic assignments. Blasting the metabarcodes to the local database generated multiple unique contributions to identify OTUs and ZOTUs. We argue that a multi-marker approach combined with taxonomic expertise to develop a curated, vouchered, local barcode database increases taxon detection with metabarcoding, and its potential as a tool for zooplankton biodiversity surveys.


This data package includes the data and code associated with the publication “The role of taxonomic expertise in interpretation of metabarcoding studies”. The package is organized as an R project: if the full package is downloaded, including the file “StreamCode-Rproject.Rproj”, the scripts provided can be run without modifying any paths. The project is organized as follows:

  1. Data folders (original-data, clean-data, input-data)
  2. Results folder
  3. R code, including two R files with the functions and objects that are needed, and four R Markdown files with code and text describing the analysis in the paper
  4. Bibliography files
  5. PDF versions of the R Markdown files, as an alternative for non-R users

The “original-data” folder includes the StreamCode_data.csv file with all the information for the zooplankton samples collected during the StreamCode project. The StreamCode_metadata.csv file describes the columns in the StreamCode_data.csv file.

The “input-data” folder includes the results of the taxonomic assignment for the metabarcoding data, BLASTn results, taxonomy dictionaries used to standardize taxonomy between disparate datasets, information on sample location, and availability of images for the StreamCode samples.

The “clean-data” folder has the final taxonomic assignment for the metabarcodes (OTUs and ZOTUs) for each genetic marker and for each method of taxonomic assignment. This is the final assignment after the confidence thresholds were applied and with the names matched to the WoRMS taxonomy (for marine organisms) or the NCBI taxonomy. The data in this folder are generated automatically when running the code in the file 01_CleanDatasets.Rmd. For the RDP Classifier, the confidence threshold was set at 0.9; for BLASTn, the thresholds were 85% coverage and 85% similarity.

The “results” folder includes several tables of results: some correspond to tables in the paper, and others were used as part of the analysis. These files were generated by running the code in the R Markdown files (01_CleanDatasets.Rmd, 02_Barcodes_success_contributions.Rmd, 03_Barcodes_MethodsValidation.Rmd, 04_MetabarcodingResults.Rmd).

The genetic data associated with this project have been deposited in GenBank and can be found under NCBI BioProject PRJNA421480. Specific information on accession numbers for each sample can be found in the publication's Supplementary Material.

All the details on data collection and data processing are available in the publication and its Supporting Information. The Supporting Information is included in the data package as an HTML file, which allows for interactive tables.

Usage notes

The R code used in the analysis is available in the R Markdown files, which contain explanations of the different steps. We also provide the code and explanations as PDF files for non-R users. Names of files and folders aim to be as descriptive as possible.

Regarding files and subfolders in the “input-data” folder:

The names of the columns of the BLASTn results are ("", "match.accnum", "", "evalue", "bitscore", "seq.length", "percent.match", "percent.coverage"), and are appended to the raw results files using the R code.
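As an illustration for non-R users, the step of attaching these column names to a raw results file, and then applying the 85% coverage and 85% similarity thresholds, can be sketched in Python. This is a minimal sketch and not the authors' R code. It assumes the raw files are tab-delimited with no header row and eight columns in the order listed, that the percentages are on a 0 to 100 scale, and that a hit exactly at a threshold is kept. Two of the names are blank in the listing above, so `col1` and `col3` are hypothetical placeholders, and the example rows are made up.

```python
import csv
import io

# Column names as listed above. The first and third names are blank in the
# listing, so "col1" and "col3" are hypothetical placeholders.
BLAST_COLUMNS = [
    "col1", "match.accnum", "col3", "evalue", "bitscore",
    "seq.length", "percent.match", "percent.coverage",
]


def read_blast_hits(handle, delimiter="\t"):
    """Read a header-less BLASTn table and attach the column names."""
    reader = csv.reader(handle, delimiter=delimiter)
    return [dict(zip(BLAST_COLUMNS, row)) for row in reader if row]


def passes_thresholds(hit, min_match=85.0, min_coverage=85.0):
    """Keep hits at or above both thresholds (assumed 0-100 scale)."""
    return (float(hit["percent.match"]) >= min_match
            and float(hit["percent.coverage"]) >= min_coverage)


# Made-up rows for illustration only; not taken from the dataset.
example = io.StringIO(
    "otu1\tACC0001\tx\t1e-50\t200\t313\t97.2\t100\n"
    "otu2\tACC0002\tx\t1e-10\t80\t313\t82.0\t90\n"
)
hits = read_blast_hits(example)
kept = [h for h in hits if passes_thresholds(h)]
print([h["match.accnum"] for h in kept])  # ['ACC0001']
```

To read a real file instead of the in-memory example, pass an open file handle, for example `open(path, newline="")`, to `read_blast_hits`.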

The results from Midori are the original text files obtained by running the query set of metabarcodes on the Midori server on 03-October-2020, using the Midori-LONGEST database and extracting the data in the “allrank” format. The raw output from Midori includes the full range of RDP Classifier probabilities, even though the default on the online server is fixed at 0.8. We used custom R functions (provided in the R file “00_FunctionsWeNeed.R”) to clean the Midori output and to filter out entries with a confidence lower than 0.9.
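The confidence filter can be illustrated with a short Python sketch. It is not a port of the functions in “00_FunctionsWeNeed.R”, and it does not parse the raw “allrank” text files: it assumes each sequence has already been parsed into a list of (taxon, rank, confidence) entries ordered from the highest rank down. The `RankAssignment` type, the function name, and the example confidences are hypothetical.

```python
from typing import List, NamedTuple


class RankAssignment(NamedTuple):
    """One rank of a classifier assignment (hypothetical structure)."""
    taxon: str
    rank: str
    confidence: float


def filter_by_confidence(assignments: List[RankAssignment],
                         threshold: float = 0.9) -> List[RankAssignment]:
    """Drop entries whose confidence is lower than the threshold."""
    return [a for a in assignments if a.confidence >= threshold]


# Made-up confidences for illustration only; not taken from the dataset.
example = [
    RankAssignment("Arthropoda", "phylum", 1.0),
    RankAssignment("Copepoda", "class", 0.97),
    RankAssignment("Calanoida", "order", 0.93),
    RankAssignment("Acartiidae", "family", 0.85),
    RankAssignment("Acartia", "genus", 0.62),
]
kept = filter_by_confidence(example)
# The last retained entry is the lowest rank that met the threshold.
print(kept[-1].rank if kept else "unassigned")  # order
```

With a threshold of 0.9, the family and genus entries in this example are dropped and the sequence would be reported at the order level.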


National Museum of Natural History, Award: Associate Director for Science Core Grant (2017)

National Museum of Natural History, Award: GGI-Rolling-2017-109a: Global Genome Initiative

Smithsonian Institution, Award: SIBN, FY2017 Award cycle: Barcode Network

Robert and Arlene Kogod Secretarial Scholar
