Bioinformatic pipeline from: Increasing confidence for discerning species and population compositions from metabarcoding assays of environmental samples: case studies of fishes in the Laurentian Great Lakes and Wabash River
Snyder, Matt (2021), Bioinformatic pipeline from: Increasing confidence for discerning species and population compositions from metabarcoding assays of environmental samples: case studies of fishes in the Laurentian Great Lakes and Wabash River, Dryad, Dataset, https://doi.org/10.5061/dryad.7m0cfxprx
Community composition data are essential for conservation management, facilitating identification of rare native and invasive species, along with abundant ones. However, traditional capture-based morphological surveys require considerable taxonomic expertise, are time-consuming and expensive, can kill rare taxa and damage habitats, and often are prone to false negatives. In contrast, metabarcode assays can be used to assess the genetic identity and compositions of entire communities from environmental samples, offering a more sensitive, less damaging, and relatively time- and cost-efficient approach. However, there is a trade-off between the stringency of bioinformatic filtering needed to remove false positives and the potential for false negatives. The present investigation thus evaluated the use of four mitochondrial (mt) DNA metabarcode assays and a customized bioinformatic pipeline to increase confidence in species identifications by removing false positives, while achieving high detection probability. Positive controls were used to calculate sequencing error, which served as the cutoff value; results that fell below the cutoff were removed, unless found with multiple assays. The performance of this approach was tested to discern and identify North American freshwater fishes using lab experiments (mock communities and aquarium experiments) and processing of a bulk ichthyoplankton sample. The method then was applied to field environmental (e)DNA water samples taken concomitant with electrofishing surveys and morphological identifications. This protocol detected 100% of species present in concomitant electrofishing surveys in the Wabash River and an additional 21 that were absent from traditional sampling.
Using single 1 L water samples collected from just four locations, the metabarcoding assays detected 73% of the fish species recorded during four months of extensive electrofishing surveys in the Maumee River, along with an additional nine species. In both rivers, total fish species diversity was best resolved when all four metabarcode assays were used together, which identified 35 additional species missed by electrofishing. Ecological distinction and diversity levels among the fish communities also were better resolved with the metabarcode assays than with morphological sampling and identifications, especially with the combined assays. At the population level, metabarcode analyses targeting the invasive round goby Neogobius melanostomus and the silver carp Hypophthalmichthys molitrix identified all population haplotype variants found using Sanger sequencing of morphologically sampled fish, along with additional intra-specific diversity, meriting further investigation. Overall, the findings demonstrated that using multiple metabarcode assays with custom bioinformatics that separate potential error from true positive detections improves confidence in evaluating biodiversity.
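The filtering rule summarized in the abstract (remove results below an error-derived cutoff unless the species is found with multiple assays) can be illustrated with a minimal sketch. This is not the authors' code: the function name filter_detections, the use of per-sample read proportions, the per-assay cutoff dictionary, and the "two or more assays" reading of "multiple assays" are all assumptions made for illustration, and the numbers are invented.

```python
def filter_detections(props, cutoffs):
    """props: {assay: {species: proportion of reads}} (hypothetical layout).
    cutoffs: {assay: error-derived cutoff} (assumed to be one value per assay)."""
    # Count the number of assays in which each species was detected at all.
    n_assays = {}
    for species_props in props.values():
        for sp in species_props:
            n_assays[sp] = n_assays.get(sp, 0) + 1

    kept = {}
    for assay, species_props in props.items():
        kept[assay] = {
            sp: p
            for sp, p in species_props.items()
            # Keep if above the cutoff, or if supported by 2+ assays (assumption).
            if p >= cutoffs[assay] or n_assays[sp] >= 2
        }
    return kept


# Invented example values, not data from the study.
props = {
    "assayA": {"sp1": 0.20, "sp2": 0.0005, "sp3": 0.0004},
    "assayB": {"sp1": 0.30, "sp2": 0.0008},
}
cutoffs = {"assayA": 0.001, "assayB": 0.001}
kept = filter_detections(props, cutoffs)
```

Here sp2 falls below the cutoff in both assays but is retained because two assays detected it, while sp3 is removed because only one assay detected it below the cutoff.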
These scripts were written, and the databases curated, by Matthew Snyder during his PhD dissertation research in Dr. Carol Stepien's Genetics and Genomics Group at the Pacific Marine Environmental Laboratory, National Oceanic and Atmospheric Administration, Seattle, WA.
- Demultiplexed raw reads returned from an Illumina HTS platform were trimmed with MetaTrim.py (see MetaTrim_README.md)
- Trimmed reads were merged in the R package DADA2 following Dada2Workflow.R
- The resulting sequence table was split into one FASTA file per sample with SeqTabToFasta.pl
- The FASTA files were searched with BLAST against multiple custom databases using BlastCycle500.pl
- BLAST results were summarized with SummarizeBlast.pl
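The sequence-table-to-FASTA step can be sketched in Python as follows. This is an illustration of the idea only, not a port of SeqTabToFasta.pl: the in-memory layout of the sequence table, the rule for numbering ASVs, the skipping of zero-count ASVs, and the exact spacing of the title line are assumptions. The title follows the "ASV # | # of reads" format described for SeqTabToFasta.pl.

```python
def seqtab_to_fastas(seqtab):
    """seqtab: {sample: {ASV sequence: read count}} (hypothetical layout).
    Returns {sample: FASTA text}, one entry per sample."""
    # Number ASVs once, in order of first appearance (assumption).
    asv_ids = {}
    for counts in seqtab.values():
        for seq in counts:
            asv_ids.setdefault(seq, len(asv_ids) + 1)

    fastas = {}
    for sample, counts in seqtab.items():
        records = []
        for seq, n_reads in counts.items():
            if n_reads > 0:  # skip ASVs absent from this sample (assumption)
                records.append(f">{asv_ids[seq]} | {n_reads}\n{seq}\n")
        fastas[sample] = "".join(records)
    return fastas


# Toy sequence table with invented sequences and counts.
demo = {"S1": {"ACGT": 10, "GGCC": 0}, "S2": {"GGCC": 5}}
fastas = seqtab_to_fastas(demo)
```

Carrying the read count in each title lets the downstream summary compute per-sample read proportions without reopening the sequence table.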
Scripts and usage:
- MetaTrim.py: See MetaTrim_README.md
- Dada2Workflow.R: workflow for Dada2 R package
- SeqTabToFasta.pl: Run in the directory containing the sequence table returned from DADA2. The sequence table must be named SeqTab.txt. Creates a subdirectory called Dada2ASVs and places one FASTA file per sample in it. Sequence titles in these FASTA files have the format > <ASV #> | <# of reads>.
- BlastCycle500.pl: Run in Dada2ASVs. Performs a BLAST search for each ASV in each FASTA file against the custom databases, returning the top 500 results per ASV. Results are saved as <Sample name>Res.txt.
- SummarizeBlast.pl: Run in Dada2ASVs. Calculates the summarized species string: for each ASV, the list of species whose hits share the lowest e-value (best match) in the BLAST results. Hits with <90% identity or <90% query cover are removed. Creates a tabular file with structure: Sample \t Summarized Species String \t N ASVs with identical Summarized Species String \t Number of reads corresponding to the ASV \t Proportion of reads in the sample corresponding to the ASV \t ASV sequence title from FASTA.
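The summarized species string can be illustrated with a short Python sketch. It is not the logic of SummarizeBlast.pl itself: the dictionary keys, the ";" delimiter, the alphabetical ordering, and the empty string returned when no hit passes the filter are assumptions, and the hits are placeholders rather than real BLAST output.

```python
def summarized_species_string(hits, min_pct=90.0):
    """hits: list of dicts with keys species, evalue, pident, qcov
    (a hypothetical parsed form of one ASV's BLAST results)."""
    # Remove hits with <90% identity or <90% query cover.
    kept = [h for h in hits if h["pident"] >= min_pct and h["qcov"] >= min_pct]
    if not kept:
        return ""  # no hit passed the filter (return value is an assumption)
    # The best match is the lowest e-value among the remaining hits.
    best = min(h["evalue"] for h in kept)
    # Report every species tied at that e-value, de-duplicated and sorted.
    species = sorted({h["species"] for h in kept if h["evalue"] == best})
    return ";".join(species)


# Placeholder hits for a single ASV.
hits = [
    {"species": "Species A", "evalue": 1e-80, "pident": 99.0, "qcov": 100.0},
    {"species": "Species B", "evalue": 1e-80, "pident": 98.5, "qcov": 100.0},
    {"species": "Species C", "evalue": 1e-60, "pident": 92.0, "qcov": 95.0},
    {"species": "Species D", "evalue": 1e-90, "pident": 85.0, "qcov": 100.0},
]
result = summarized_species_string(hits)
```

Species D has the lowest e-value overall but is dropped by the 90% identity filter, so the string lists the two species tied at the next-best e-value. An ASV that cannot be resolved to one species is therefore reported as a set of equally good candidates.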
Each database is a tar.gz file containing the files required for a BLAST database.
- Cytb_blast_db: Formatted database built from all sequences returned by GenBank searches for "Actinopterygii AND Cyt b" and "Actinopterygii AND Cytochrome b"
- Cytb_GL_blast_db: Formatted database built from the Cytb_blast_db sequences of all species present in the Laurentian Great Lakes
- 12S_blast_db: Formatted database built from all sequences returned by a GenBank search for "Actinopterygii AND 12S"
- 12S_GL_blast_db: Formatted database built from the 12S_blast_db sequences of all species present in the Laurentian Great Lakes
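Because each database is distributed as a tar.gz archive, it has to be unpacked before BLAST can read it. A minimal sketch using Python's standard tarfile module is shown below. The helper name extract_blast_db, the archive file name in the usage comment, and the destination directory are hypothetical, and the names of the files inside each archive are not documented here.

```python
import tarfile
from pathlib import Path


def extract_blast_db(archive, dest):
    """Unpack a .tar.gz database archive into dest and list the extracted files."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives from a trusted source.
        tar.extractall(dest)
    # Return the extracted file names so the database files can be checked.
    return sorted(p.name for p in dest.rglob("*") if p.is_file())


# Hypothetical usage (archive name assumed from the database name above):
# files = extract_blast_db("Cytb_blast_db.tar.gz", "blast_dbs/Cytb")
```

The returned list gives a quick check that the archive unpacked into the expected directory before BlastCycle500.pl is run.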
U.S. Environmental Protection Agency, Award: GL-00E01149-0
U.S. Environmental Protection Agency, Award: GL-00E01898