Improving metabarcoding taxonomic assignment: A case study of fishes in a large marine ecosystem
Gold, Zachary (2021), Improving metabarcoding taxonomic assignment: A case study of fishes in a large marine ecosystem, Dryad, Dataset, https://doi.org/10.5068/D1H963
DNA metabarcoding is an important tool for molecular ecology. However, its effectiveness hinges on the quality of reference sequence databases and classification parameters employed. Here we evaluate the performance of MiFish 12S taxonomic assignments using a case study of California Current Large Marine Ecosystem fishes to determine best practices for metabarcoding. Specifically, we use a taxonomy cross-validation by identity framework to compare classification performance between a global database comprised of all available sequences and a curated database that only includes sequences of fishes from the California Current Large Marine Ecosystem. We demonstrate that the curated, regional database provides higher assignment accuracy than the comprehensive global database. We also document a tradeoff between accuracy and misclassification across a range of taxonomic cutoff scores, highlighting the importance of parameter selection for taxonomic classification. Furthermore, we compared assignment accuracy with and without the inclusion of additionally generated reference sequences. To this end, we sequenced tissue from 597 species using the MiFish 12S primers, adding 252 species to GenBank’s existing 550 California Current Large Marine Ecosystem fish sequences. We then compared species and reads identified from seawater environmental DNA samples using global databases with and without our generated references, and the regional database. The addition of new references allowed for the identification of 16 native taxa and 17.0% of total reads from eDNA samples, including species with vast ecological and economic value. Together these results demonstrate the importance of comprehensive and curated reference databases for effective metabarcoding and the need for locus-specific validation efforts.
Reference Barcode Generation from Fish Tissue Samples
To generate a more complete 12S barcode reference database for California Current Large Marine Ecosystem fishes, we assembled a list of the 1,144 marine teleost and elasmobranch species that occur in this system (Allen & Horn, 2006; Froese & Pauly, 2010; Hastings & Burton, 2008; Love & Passarelli, 2020) (Table S1). From this list, we acquired 741 ethanol-preserved voucher specimens representing 597 species (Table S1, Table S2) from the Scripps Institution of Oceanography Marine Vertebrate Collection at the University of California San Diego. DNA was extracted from each tissue sample using a Chelex 100 extraction method (Walsh, Metzger, & Higuchi, 1991), as described in the Supplemental Methods. We amplified all teleost DNA extracts (n=701) using the MiFish Universal Teleost primers (Miya et al., 2015) and all elasmobranch DNA extracts (n=55) using the MiFish Elasmobranch primers (Miya et al., 2015), following the thermocycler profile of Curd et al. (2019) (Table S3). We Sanger sequenced purified amplicons (see Supplemental Methods for details), and aligned and trimmed forward and reverse sequences in Sequencher version 5.4.6 (Nishimura, 2000). We used the R package taxize (version 0.9.99) (Chamberlain & Szöcs, 2013) to synonymize the taxonomic names of all vouchered specimens and GenBank sequences. We then checked the accuracy of the generated reference barcodes by building a UPGMA phylogenetic tree of all reference sequences and California Current Large Marine Ecosystem fishes using phangorn (version 2.5.5). In addition, we queried each sequence using blastn (Camacho et al., 2009) and removed any sequence that did not cluster with or align to known taxonomic lineages (data available at https://doi.org/10.5068/D1H963). The resulting 12S reference barcodes were deposited into GenBank (SAMN19289093–SAMN19289810; Table S2).
Reference Database Creation
To test variation in taxonomic assignment among reference databases, we generated three distinct reference sequence databases: “CRUX-GenBank”, “global”, and “regional” (Table 1 and Table 2). CRUX-GenBank is a custom 12S reference database generated with the Creating Reference libraries Using eXisting tools (CRUX) module of the Anacapa Toolkit, which we ran with standard search parameters to query GenBank for reference barcodes (Benson et al., 2018; Curd et al., 2019), using the MiFish Universal 12S sequences (Table S1) as the user-defined primers. Briefly, we created this reference database by running in silico PCR (Ficetola et al., 2010) on the European Molecular Biology Laboratory (EMBL) standard nucleotide database (Stoesser et al., 2002) to generate a seed library of 12S references. Next, we used blastn (Camacho et al., 2009) to query the seed database against the NCBI non-redundant nucleotide database (Gold, 2020; Pruitt et al., 2005; sequences downloaded in October 2019), which captures reference barcodes that do not include primer sequences. The resulting blastn hits were de-replicated by retaining only the longest version of each sequence, and the taxonomy for each accession was retrieved using Entrez-qiime (Baker, 2016). The resulting CRUX-GenBank database therefore included every GenBank reference barcode that amplified in silico with the MiFish 12S primers at the time of this analysis.
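The de-replication step can be illustrated with a minimal sketch. It is not the CRUX implementation: it assumes that "versions" of a sequence are grouped by accession, and the accession strings and the function name dereplicate_longest are hypothetical.

```python
def dereplicate_longest(hits):
    """Keep only the longest sequence seen for each accession.

    hits: iterable of (accession, sequence) pairs, e.g. parsed blastn hits.
    Grouping by accession is an assumption made for this illustration.
    """
    longest = {}
    for accession, sequence in hits:
        # Replace the stored sequence only when the new one is strictly longer.
        if accession not in longest or len(sequence) > len(longest[accession]):
            longest[accession] = sequence
    return longest


# Hypothetical hits: two versions of ACC0001 and one of ACC0002.
example_hits = [
    ("ACC0001", "ACGTAC"),
    ("ACC0001", "ACGTACGTAC"),
    ("ACC0002", "GGCCTT"),
]
dereplicated = dereplicate_longest(example_hits)
```

Ties keep the first version encountered, which is an arbitrary choice in this sketch.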
We created the global database to evaluate whether increasing database completeness improves taxonomic assignment. To create the global database, we supplemented the CRUX-GenBank database with 741 additional California Current Large Marine Ecosystem fish 12S barcodes generated for this study (Table S2). Thus, the global database includes all fish 12S reference sequences available at the time of download. From this global database, we created the regional database, which includes only 12S sequences of fishes known to occur in the California Current Large Marine Ecosystem. We created this database specifically to test whether databases curated to specific ecosystems enhance taxonomic assignment performance relative to more comprehensive databases (“global”). Because of the high degree of similarity between the MiFish Universal and Elasmobranch loci and the flexibility built into CRUX, a single CRUX-generated 12S reference database performs well for both markers (Curd et al., 2019), so we did not create separate teleost and elasmobranch databases. Additionally, because the MiFish primer set amplifies nearly all vertebrate taxa (Miya et al., 2015; Valsecchi et al., 2019), the global database includes teleosts, elasmobranchs, mammals, reptiles, amphibians, birds, etc. All databases are available at https://doi.org/10.5068/D1H963.
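Subsetting a global database to a regional one amounts to filtering by a species list. The sketch below is illustrative only: it assumes sequences and taxonomy are held in dictionaries keyed by accession and that each taxonomy string is semicolon-delimited with the species name last. The accessions and the function name build_regional are hypothetical.

```python
def build_regional(sequences, taxonomy, regional_species):
    """Return (sequences, taxonomy) restricted to species on the regional list.

    sequences: dict of accession -> sequence
    taxonomy: dict of accession -> "rank1;rank2;...;species" (assumed layout)
    regional_species: set of species names known from the region
    """
    keep = {
        accession
        for accession, path in taxonomy.items()
        if path.split(";")[-1].strip() in regional_species
    }
    regional_sequences = {a: s for a, s in sequences.items() if a in keep}
    regional_taxonomy = {a: taxonomy[a] for a in keep}
    return regional_sequences, regional_taxonomy


# Hypothetical records: one regional species, one non-regional species.
global_sequences = {"ACC0001": "ACGTACGT", "ACC0002": "GGCCTTAA"}
global_taxonomy = {
    "ACC0001": "Engraulidae;Engraulis;Engraulis mordax",
    "ACC0002": "Gadidae;Gadus;Gadus morhua",
}
regional_sequences, regional_taxonomy = build_regional(
    global_sequences, global_taxonomy, {"Engraulis mordax"}
)
```

Matching on exact species names only works if names have been synonymized beforehand, so in practice the filter depends on that earlier step.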
Taxonomy cross-validation by identity comparisons
We implemented the taxonomy cross-validation by identity (TAXXI) framework developed by Edgar (2018a) to 1) compare taxonomic assignment performance metrics for global versus regional reference databases, 2) determine the resolution of taxonomic assignments for all available MiFish barcodes in the global database, and 3) understand the performance of the MiFish barcode across taxonomic classifier cutoff scores. Although we use three databases (global, CRUX-GenBank, and regional) on our test dataset below, we did not include the CRUX-GenBank database in the taxonomy cross-validation comparisons because the global database contains all of its sequences.
The TAXXI analyses were implemented using scripts from Curd et al. (2019), which adapted TAXXI to the Anacapa Toolkit (https://drive5.com/taxxi/doc/index.html and https://github.com/limey-bean/Anacapa). We conducted taxonomic assignments using the Anacapa Toolkit classifier, which implements the Bayesian Lowest Common Ancestor (BLCA) classifier (Gao et al., 2017) modified to incorporate alignments from Bowtie2 (Langmead & Salzberg, 2012). In brief, amplicon sequence variants (ASVs; exact unique sequences dereplicated from generated metabarcoding data) are first aligned to reference barcodes using Bowtie2, retaining the top 100 alignments. The BLCA classifier then conducts a multiple sequence alignment for each query ASV to inform a weighted Bayesian posterior probability of taxonomic assignment. Taxonomy is ultimately assigned based on the lowest common ancestor of the total weighted reference database matches; reliability is evaluated through bootstrap confidence scores, which are analogous to the percent identity metrics provided by other metabarcoding classifiers (Gao et al., 2017; see Curd et al., 2019 for a full description).
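The idea of assigning taxonomy from weighted reference matches can be illustrated with a deliberately simplified weighted-vote consensus. This is not BLCA and not the Anacapa classifier: it performs no multiple sequence alignment or bootstrapping, and the weights, lineages, and function name weighted_consensus are hypothetical.

```python
from collections import defaultdict


def weighted_consensus(hits):
    """Return [(taxon, percent_support), ...] from the highest rank downward.

    hits: list of (lineage, weight), where lineage is a list of taxon names
    ordered from highest to lowest rank and weight is a positive number.
    At each rank the taxon with the most weight wins, and only hits that
    agree with it are carried to the next rank, so the path stays consistent.
    """
    total = sum(weight for _, weight in hits)
    depth = min(len(lineage) for lineage, _ in hits)
    remaining = hits
    consensus = []
    for rank in range(depth):
        votes = defaultdict(float)
        for lineage, weight in remaining:
            votes[lineage[rank]] += weight
        taxon, support = max(votes.items(), key=lambda item: item[1])
        consensus.append((taxon, 100.0 * support / total))
        remaining = [(l, w) for l, w in remaining if l[rank] == taxon]
    return consensus


# Hypothetical matches for one ASV, with made-up integer weights.
example_hits = [
    (["Sebastidae", "Sebastes", "Sebastes mystinus"], 5),
    (["Sebastidae", "Sebastes", "Sebastes melanops"], 3),
    (["Sebastidae", "Sebastes", "Sebastes mystinus"], 2),
]
consensus = weighted_consensus(example_hits)
```

In this toy case support is complete at family and genus level and drops at species level, which is the pattern a confidence cutoff is meant to act on.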
We evaluated taxonomic assignment performance by comparing the following metrics: 1) true positive rate - the number of correct taxonomic assignments divided by the total opportunities for correct classification; 2) over-classification rate - the number of assignments incorrectly made to additional lower taxonomic ranks divided by the total opportunities to make an over-classification error; 3) under-classification rate - the number of assignments incorrectly made to fewer taxonomic ranks divided by the total opportunities to make an under-classification error; 4) misclassification rate - the number of assignments incorrectly predicted divided by the opportunities for correct classification; and 5) accuracy - the number of correct assignments divided by the taxonomic assignment opportunities for which correctness can be determined (Edgar, 2018a). We additionally calculated 6) sensitivity as true positive rate / (true positive rate + under-classification rate), because under-classification is analogous to a false negative rate, and 7) specificity as 1 - (misclassification rate + over-classification rate), because the sum of the misclassification and over-classification rates is analogous to the false positive rate.
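The two derived metrics follow directly from the stated formulas. The sketch below assumes all rates are expressed as fractions between 0 and 1; the function names and example values are hypothetical and do not come from the study.

```python
def sensitivity(true_positive_rate, under_classification_rate):
    """Sensitivity = TPR / (TPR + UCR), treating UCR as a false negative rate."""
    return true_positive_rate / (true_positive_rate + under_classification_rate)


def specificity(misclassification_rate, over_classification_rate):
    """Specificity = 1 - (MCR + OCR), treating MCR + OCR as a false positive rate."""
    return 1.0 - (misclassification_rate + over_classification_rate)


# Made-up rates for illustration only.
example_sensitivity = sensitivity(0.80, 0.20)
example_specificity = specificity(0.05, 0.10)
```

Note that sensitivity is undefined when both the true positive rate and the under-classification rate are zero; this sketch does not guard against that case.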
Taxonomic Resolution of the MiFish 12S primer
To provide insights into which fishes can be resolved to species level using the MiFish 12S primer set, we conducted TAXXI comparisons using the global database as both the test and training database to assign taxonomy to itself. We then calculated the seven taxonomic assignment metrics described above. Additionally, we identified families and genera of fishes for which the MiFish 12S locus performed poorly, defined as frequently failing to assign species level identification. Although all vertebrate sequences in the global database were used in the taxonomic cross validation, only results for fishes are discussed here.
Regional vs. global reference databases
To compare the relative ability of regional versus global reference databases to accurately assign taxonomy, we conducted two additional TAXXI comparisons using the reference databases created for this study. First, we used the global reference database as a training database to assign taxonomy to the regional reference database that only contained sequences for fishes known from the California Current Large Marine Ecosystem. Second, we used the regional reference database as both the test and training database to assign taxonomy against itself. The taxonomic assignments made by the global and regional reference databases were compared across the taxonomic assignment metrics described above.
Effect of Bootstrap Confidence Scores on Taxonomic Assignment
To understand the performance of the MiFish barcode across a range of taxonomic classifier cutoff scores, we repeated each of the three TAXXI analyses described above (global-regional, regional-regional, global-global) using bootstrap confidence cutoff scores of 40, 50, 60, 70, 80, 90, 95, and 100. We then evaluated the effect of bootstrap confidence cutoff scores across the various taxonomic assignment metrics, as described above.
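The effect of a cutoff on a single assignment can be sketched as follows. The sketch assumes that ranks are retained from the highest rank downward until the first rank whose bootstrap confidence falls below the cutoff; that rule, the example scores, and the function name truncate_at_cutoff are assumptions made for illustration.

```python
CUTOFFS = [40, 50, 60, 70, 80, 90, 95, 100]


def truncate_at_cutoff(assignment, cutoff):
    """Keep ranks, highest first, until confidence drops below the cutoff.

    assignment: list of (taxon, bootstrap_confidence) ordered from the
    highest rank to the lowest rank.
    """
    kept = []
    for taxon, confidence in assignment:
        if confidence < cutoff:
            break
        kept.append(taxon)
    return kept


# Hypothetical assignment with made-up confidence scores.
example_assignment = [
    ("Sebastidae", 100),
    ("Sebastes", 98),
    ("Sebastes mystinus", 65),
]
# Number of ranks retained at each cutoff.
depth_by_cutoff = {
    cutoff: len(truncate_at_cutoff(example_assignment, cutoff))
    for cutoff in CUTOFFS
}
```

Raising the cutoff shortens the retained path, which is how stricter cutoffs trade species-level assignments for fewer incorrect ones.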
eDNA Metabarcoding Case Study
Seawater Sample Collection, DNA Extraction, and Library Generation
To test the impact of 12S database design on taxonomic assignment in real-world applications, we compared the performance of the three databases in assigning taxonomy to existing eDNA sequence data as a test case. Briefly, we used MiFish 12S metabarcoding sequence data generated from three seawater samples collected at 10 m depth from three sites off eastern Santa Cruz Island, CA in 2017 as part of a larger ecological study of biodiversity patterns within rocky reef ecosystems. These sequences were generated using standard eDNA collection, processing, and sequencing methods, as outlined in Gold et al. (2021).
We processed these eDNA metabarcoding data three separate times using the Anacapa Toolkit (Curd et al., 2019), assigning taxonomy using the CRUX-GenBank, global, and regional reference databases (Table 2). We used the default Anacapa Toolkit parameters and a bootstrap confidence cutoff score of 60. We then examined the total number of ASVs and taxonomic ranks identified by each of the three reference databases. We also investigated differences in taxonomic assignment between single-direction ASVs (composed of forward-only or reverse-only sequence reads) and merged ASVs (merged paired-end sequence reads) to understand the importance of full-length vs. partial-length sequences for taxonomic assignment (see Supplemental Results and Discussion).
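Comparing the taxa recovered by each database reduces to set operations. The sketch below assumes each run has already been summarized as a set of assigned taxon names. The taxa listed are placeholders rather than results from this study, and the function name compare_assignments is hypothetical.

```python
def compare_assignments(taxa_by_database):
    """Return (shared, unique) for a dict of database name -> set of taxa.

    shared: taxa assigned by every database
    unique: dict of database name -> taxa assigned by that database only
    """
    shared = set.intersection(*taxa_by_database.values())
    unique = {}
    for name, taxa in taxa_by_database.items():
        others = [t for other, t in taxa_by_database.items() if other != name]
        unique[name] = taxa - set().union(*others)
    return shared, unique


# Placeholder taxa, not results from the study.
example = {
    "CRUX-GenBank": {"taxon A", "taxon B"},
    "global": {"taxon A", "taxon B", "taxon C"},
    "regional": {"taxon A", "taxon C", "taxon D"},
}
shared_taxa, unique_taxa = compare_assignments(example)
```

The same comparison could be weighted by read counts to measure how much of the sequencing effort each database assigns.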
Data sets included:
1. List of California Current fish species
2. List of California Current species lacking 12S MiFish reference barcodes
3. Regional Reference database of only California fish species with matching 12S MiFish reference barcodes. Stored as both a .fasta and taxonomy .txt file.
4. CRUX-12S reference database including additional 12S California fish reference barcodes generated in this study. Stored as both a .fasta and taxonomy .txt file.
5. 12S MiFish metabarcoding data from 3 seawater samples collected off Santa Cruz Island, CA, USA. Data were prepared following the methods of Curd et al. (2019). Previously published: https://doi.org/10.1371/journal.pone.0238557
National Science Foundation, Award: 2015204395, GRFP
National Science Foundation, Award: 2015204395, GRIP
University of California, Los Angeles, Award: La Kretz Center for Conservation Genomics
University of California, Los Angeles, Award: Department of Ecology and Evolutionary Biology