Data from: The role of taxonomic expertise in interpretation of metabarcoding studies
Cite this dataset
Pappalardo, Paula et al. (2021). Data from: The role of taxonomic expertise in interpretation of metabarcoding studies [Dataset]. Dryad. https://doi.org/10.5061/dryad.tdz08kpzx
This data package includes the data and code associated with the publication “The role of taxonomic expertise in interpretation of metabarcoding studies”. The package is organized as an R project, so R users who download the full package, including the file “StreamCode-Rproject.Rproj”, can run the scripts provided without modifying any paths. The project is organized as:
- Data folders (original-data, clean-data, input-data)
- Results folder
- R Code including two R files with functions and objects that are needed, and four rmarkdown files with code and text describing the analysis in the paper.
- Bibliography files
- PDF versions of the rmarkdown files, as an alternative for non-R users
The “original-data” folder includes the StreamCode_data.csv file with all the information for the zooplankton samples collected during the StreamCode project. The StreamCode_metadata.csv file describes the columns in the StreamCode_data.csv file.
The “input-data” folder includes the results of the taxonomic assignment for the metabarcoding data, BLASTn results, taxonomy dictionaries used to standardize taxonomy between disparate datasets, information on sample location, and availability of images for the StreamCode samples.
The “clean-data” folder has the final taxonomic assignment for the metabarcodes (OTUs and ZOTUs) for each genetic marker and for each method of taxonomic assignment. This is the final assignment after the confidence thresholds were applied and the names were matched to the WoRMS taxonomy (for marine organisms) or the NCBI taxonomy. The data in this folder are generated automatically when running the code in the file 01_CleanDatasets.Rmd. For the RDP Classifier, the confidence threshold was set at 0.9; for BLASTn, the thresholds were 85% coverage and 85% similarity.
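As a rough illustration of the BLASTn thresholds described above, the following base R sketch filters a small table of hits. The data frame and its values are hypothetical, the column names "percent.match" and "percent.coverage" are borrowed from the BLASTn column list given for the “input-data” folder, and treating "percent.match" as the percent similarity and the thresholds as inclusive are assumptions. This is not the code in 01_CleanDatasets.Rmd.

```r
# Hypothetical BLASTn hits (values invented for illustration only).
blast_hits <- data.frame(
  seq.id = c("OTU_1", "OTU_2", "OTU_3"),
  match.name = c("Taxon A", "Taxon B", "Taxon C"),
  percent.match = c(99.1, 84.0, 92.5),
  percent.coverage = c(100, 97, 80),
  stringsAsFactors = FALSE
)

# Thresholds stated in the text; using >= (inclusive) is an assumption.
min_match <- 85
min_coverage <- 85

# Keep only hits that pass both thresholds.
keep <- blast_hits$percent.match >= min_match &
  blast_hits$percent.coverage >= min_coverage
blast_confident <- blast_hits[keep, ]
```

In this toy table only the first hit passes both thresholds; the second fails on similarity and the third on coverage.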
The “results” folder includes several tables with results: some correspond to tables in the paper, and others were used as part of the analysis. These files were generated by running the code in the rmarkdown files (01_CleanDatasets.Rmd, 02_Barcodes_success_contributions.Rmd, 03_Barcodes_MethodsValidation.Rmd, 04_MetabarcodingResults.Rmd).
The genetic data associated with this project have been deposited in GenBank and can be found under NCBI BioProject PRJNA421480 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA421480). Accession numbers for each sample can be found in the Supplementary Material of the publication.
All the details on data collection and data processing are available in the publication and its Supporting Information. The Supporting Information is included in the data package as an HTML file, which allows for interactive tables.
The R code used in the analysis is available in the R markdown files, which contain explanations of the different steps. We also provide the code and explanations as PDF files for non-R users. Names of files and folders aim to be as descriptive as possible.
Regarding files and subfolders in the “input-data” folder:
The names of the columns of the BLASTn results are ("seq.id", "match.accnum", "match.name", "evalue", "bitscore", "seq.length", "percent.match", "percent.coverage"), and are appended to the raw results files using the R code.
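A minimal sketch of how these column names could be attached to a headerless results file in base R is shown below. The example line is hypothetical, and the tab-separated layout is an assumption about the raw files, so the `sep` argument may need adjusting. The scripts in the package remain the reference for how this step is actually done.

```r
# Column names as listed in the text.
blast_cols <- c("seq.id", "match.accnum", "match.name", "evalue",
                "bitscore", "seq.length", "percent.match", "percent.coverage")

# Hypothetical headerless line standing in for a raw results file
# (tab-separated layout is an assumption).
raw_text <- "OTU_1\tAB123456.1\tExample taxon\t1e-150\t540\t313\t99.1\t100"

# Read the raw results and assign the column names on import.
blast_raw <- read.table(
  text = raw_text,          # replace with file = "<path to raw results>" for a real file
  sep = "\t",
  header = FALSE,
  col.names = blast_cols,
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE
)
```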
The results from Midori are the original text files obtained by running the query set of metabarcodes on the Midori server (http://www.reference-midori.info/server.php) on 03-October-2020, using the Midori-LONGEST database and extracting the data in the “allrank” format. The raw output from Midori includes the full range of RDP Classifier probabilities, even though the default on the online server is fixed at 0.8. We used custom R functions (provided in the R file “00_FunctionsWeNeed.R”) to clean the Midori output and to filter out entries with a confidence lower than 0.9.
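The confidence filter can be pictured with the short base R sketch below. It assumes the Midori output has already been reshaped into one row per sequence and taxonomic rank; the data frame, its column names, and its values are all hypothetical, and this is not the implementation in “00_FunctionsWeNeed.R”.

```r
# Hypothetical cleaned classifier output: one row per sequence and rank.
midori_long <- data.frame(
  seq.id = c("OTU_1", "OTU_1", "OTU_2", "OTU_2"),
  rank = c("family", "genus", "family", "genus"),
  taxon = c("Family A", "Genus A", "Family B", "Genus B"),
  confidence = c(1.00, 0.95, 0.92, 0.41),
  stringsAsFactors = FALSE
)

# Threshold stated in the text; entries below it are dropped.
min_confidence <- 0.9

midori_confident <- midori_long[midori_long$confidence >= min_confidence, ]
```

Here the genus-level assignment for OTU_2 is dropped, while its family-level assignment is retained.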
National Museum of Natural History, Award: Associate Director for Science Core Grant (2017)
National Museum of Natural History, Award: GGI-Rolling-2017-109a: Global Genome Initiative
Smithsonian Institution, Award: SIBN, FY2017 Award cycle: Barcode Network
Robert and Arlene Kogod Secretarial Scholar