Data for: Optimizing a metabarcoding marker portfolio for species detection from complex mixtures of globally diverse fishes
Data files
Oct 02, 2023 version files 12.21 KB
- accession_to_lineage_blast.sh
- example_loci_file.txt
- example_settings_file.txt
- multilocus_metabarcoding_pipeline.sh
- ncbi_database_download.sh
- README.md
Abstract
DNA metabarcoding is used to enumerate and identify taxa in both environmental samples and tissue mixtures, but the effectiveness of particular markers depends on their sensitivity to the taxa involved. Using multiple primer sets that amplify different genes can mitigate biases in amplification efficiency, sequence resolution, and reference data availability, but few empirical studies have evaluated markers for complementary performance. Here, we assess the individual and joint performance of 22 markers for detecting species in a DNA pool of 98 species of marine and freshwater bony fishes from geographically and phylogenetically diverse origins. We find that a portfolio of four markers targeting 12S, 16S, and two regions of COI identifies 100% of reference taxa to family and nearly 60% to species. We then use these four markers to evaluate metabarcoding of heterogeneous tissue mixtures, using experimental fishmeal to test: 1) the tissue input threshold to ensure detection; 2) how read depth scales with tissue abundance; and 3) the effect of non-target material in the mixture on recovery of target taxa. We consistently detect taxa that make up >1% of fishmeal mixtures and can detect taxa at the lowest input level of 0.01%, but rare taxa (<1%) are detected inconsistently across markers and replicates. Read counts show only a weak correlation with tissue input, suggesting they are not a reliable quantitative proxy for relative abundance. Despite the limitations arising from primer specificity and reference data availability, our results demonstrate that a modest portfolio of markers can perform well in detecting and identifying aquatic species in complex mixtures, even when those species vary in tissue representation, phylogenetic affinity, and geographic origin.
README: Data for: Optimizing a metabarcoding marker portfolio for species detection from complex mixtures of globally diverse fishes
https://doi.org/10.5061/dryad.w3r2280xm
This repository includes:
- the NCBI SRA information where the sequencing data are deposited,
- the bioinformatic scripts to process the sequencing data, and
- a link to the Git repository that includes intermediate data files (metabarcoding feature and taxonomy tables) and the R markdown files for analyzing the data for the manuscript and generating figures for the publication.
Data
Sequencing fastq files are deposited in the NCBI SRA under BioProject PRJNA1021650.
All analyses begin with these fastq files and follow the steps outlined in the scripts below.
Analysis
Scripts for initially processing the raw data files are included in this Dryad repository.
- ncbi_database_download.sh - This script provides code for downloading the NCBI nucleotide databases to be used with local BLAST for generating taxonomy assignments. The NCBI databases must be downloaded prior to running the analysis scripts.
- example_settings_file.txt - An example format for the Settings file required for the multilocus_metabarcoding_pipeline.sh script.
- example_loci_file.txt - An example format for the Locus file required for the multilocus_metabarcoding_pipeline.sh script.
- accession_to_lineage_blast.sh - A script that uses the software `taxonkit` to link the BLAST accession number to taxonomic lineage.
- multilocus_metabarcoding_pipeline.sh - The script that performs the metabarcoding data analysis for the loci in the Locus file based on the settings in the Settings file.
Outputs from those scripts are the metabarcoding feature and taxonomy tables that are used as inputs for the R markdown files accessible on GitHub at the following repository:
- https://github.com/DianaBaetscher-NOAA/marker-portfolio
Methods
Metabarcoding markers
Twenty-two markers for mitochondrial (COI, 12S, 16S) and nuclear (18S, 28S) barcoding genes were selected from metabarcoding, eDNA, and Sanger sequencing barcoding studies of marine and freshwater fishes, including seafood products (Table 1). Most of these markers were designed to target bony fishes (teleosts), but we added markers targeting elasmobranchs, crustaceans, and cephalopods, taxonomic groups that are often poorly resolved by universal fish barcodes. Only markers that amplified targets <300 bp were selected because shorter fragments are more likely to amplify from degraded DNA (Devloo-Delva et al., 2019; Shokralla et al., 2015; Staats et al., 2016), and DNA is expected to be degraded in highly processed fishmeal and oil.
Reference DNA pools
To compare the amplification and resolution of the 22 markers before determining complementarity, we constructed two pools with equal concentrations of extracted DNA from 98 marine and freshwater teleost fishes and five elasmobranch, crustacean, and cephalopod species, together spanning 88 genera and 60 families (full reference pool; Table 2). Samples were obtained primarily from vouchered collections, but also from fish markets to encompass commercially important groups. For DNA extractions, we sampled muscle tissue from inside the body wall (i.e., no surface contact) to avoid trace contamination from contact with other species. To further minimize the potential for detecting false positives from tissue contamination, we constructed a second, more restricted reference pool including only the 73 DNA extracts from vouchered museum specimens (vouchered reference pool).
Experimental tissue mixture samples
Metabarcoding is typically used to detect both rare and abundant constituents in mixtures, and most applications include species in unequal proportions along with varying amounts and types of non-target material. In the context of aquaculture feeds, we refer to this non-target material as “filler.” To evaluate detection power in actual tissue mixtures (as opposed to pools of DNA extracts), we used fishmeal mixed with different fillers. The fillers allowed us to test whether non-target material degrades metabarcoding data, either through a loss of on-target sequencing reads or through PCR inhibition. Similarly complex and heterogeneous mixtures of tissue sources might be expected in gut content or fecal samples in more ecological applications.
To create experimental fishmeal, we freeze-dried tissue from 30 of the unvouchered fish species in the full reference pool (muscle tissue from market samples; whole fish from research samples), coarsely homogenized each sample in a coffee grinder, and then finely ground it in a freezer mill, which pulverizes each tissue sample within a container submerged in liquid nitrogen. Species were added one at a time, and between samples we decontaminated all containers and tools by wiping them with a 10% bleach solution followed by 70% ethanol. Species were assigned to one of six abundance levels: 13.33%, 3.65%, 1.91%, 1%, 0.1%, or 0.01% of the mixture (by weight), thereby spanning >3 orders of magnitude in representation (Table 3). Each abundance level was represented by five species, which were assigned to balance freshwater and marine habitats, major phylogenetic groups, and degree of fishery interest across levels. This experimental design allowed us to assess how the proportions at which dominant and rare taxa were added to a heterogeneous mixture relate to the proportion of sequencing reads attributed to each taxon, and to compare amplification biases across multiple taxa added in the same amount to the fishmeal.
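The composition described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the deposited scripts: it confirms that six abundance levels with five species each account for the whole fishmeal and that the levels span more than three orders of magnitude. The function and variable names are illustrative.

```python
from math import log10

# Abundance levels (% of fishmeal by weight), as listed in the text.
ABUNDANCE_LEVELS_PCT = [13.33, 3.65, 1.91, 1.0, 0.1, 0.01]
SPECIES_PER_LEVEL = 5  # five species were assigned to each level


def total_percent(levels, species_per_level):
    """Sum the contribution of every species across all levels."""
    return species_per_level * sum(levels)


def orders_of_magnitude(levels):
    """log10 of the ratio between the largest and smallest level."""
    return log10(max(levels) / min(levels))


n_species = SPECIES_PER_LEVEL * len(ABUNDANCE_LEVELS_PCT)
total = total_percent(ABUNDANCE_LEVELS_PCT, SPECIES_PER_LEVEL)
span = orders_of_magnitude(ABUNDANCE_LEVELS_PCT)

print(f"species in fishmeal: {n_species}")
print(f"total composition: {total:.2f}%")
print(f"span: {span:.2f} orders of magnitude")
```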
To test the effect of the non-target material, the fishmeal mixtures were combined with two unique fillers for a total of seven individual experimental feeds with low (2%), medium (10%), and high (25%) proportions of fishmeal relative to filler (Table 3). Fillers included plant-derived materials – grain and grass flours – and animal byproducts – bloodmeal and feathermeal – to represent mixture constituents used in aquaculture feeds. Fishmeal proportions also mimicked potential levels of fish tissue added to aquaculture feeds, from low (0%-2%) to high (25%) proportions of fish in the feed mixture. By multiplying the proportion of fishmeal in the experimental feed by the proportion of a particular fish species in the fishmeal, we could test the detection threshold for individual taxa down to 0.0002% of total experimental mixture mass (i.e., minimum of 0.01% of a particular species in the fishmeal and 2% fishmeal in the feed).
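The detection-threshold calculation above is a product of two proportions. As a minimal sketch (the function and variable names are illustrative, not taken from the deposited scripts), the share of each species in the final feed can be tabulated for every combination of fishmeal level and species abundance level:

```python
# Values taken from the text; names are illustrative.
FISHMEAL_IN_FEED_PCT = {"low": 2.0, "medium": 10.0, "high": 25.0}
SPECIES_IN_FISHMEAL_PCT = [13.33, 3.65, 1.91, 1.0, 0.1, 0.01]


def species_pct_of_feed(species_pct_in_fishmeal, fishmeal_pct_in_feed):
    """Percent of total feed mass contributed by one species."""
    return species_pct_in_fishmeal * fishmeal_pct_in_feed / 100.0


# One row per fishmeal level, one value per species abundance level.
feed_table = {
    level: [species_pct_of_feed(s, f) for s in SPECIES_IN_FISHMEAL_PCT]
    for level, f in FISHMEAL_IN_FEED_PCT.items()
}

lowest = min(min(row) for row in feed_table.values())
highest = max(max(row) for row in feed_table.values())

for level, row in feed_table.items():
    print(level, [f"{v:.4f}%" for v in row])
print(f"lowest species share of feed: {lowest:.4f}%")
```

The smallest entry corresponds to a species at 0.01% of the fishmeal in a feed containing 2% fishmeal, which is the 0.0002% threshold quoted in the text.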
DNA extracts were quantified with a Qubit fluorometer (high-sensitivity or broad-range dsDNA assay depending on concentration range), diluted with DNase-free water, and added in equal proportion to the full reference and vouchered reference DNA pools. DNA extracts from the 30 fishmeal species were combined in two additional mock DNA pools: one with equal concentration among all taxa (mock equal) and the other in which DNA extract concentration was proportional to the amount of tissue included in the fishmeal (mock variable). Like the reference DNA pools, the mock equal and mock variable pools were prepared in triplicate (Fig. S1).
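One way to picture the pooling arithmetic is sketched below. This is an assumption-laden illustration rather than the authors' protocol: it supposes that each extract is first diluted to a common concentration using C1·V1 = C2·V2, and that a variable pool is then built by adding the normalized extracts in volumes proportional to tissue percentage. The concentrations, volumes, and species labels are hypothetical.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul, final_ul):
    """Apply C1*V1 = C2*V2 and return (extract_ul, water_ul)."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is more dilute than the target concentration")
    extract_ul = target_ng_per_ul * final_ul / stock_ng_per_ul
    return extract_ul, final_ul - extract_ul


def proportional_volumes(tissue_pct, pool_ul):
    """Split a pool volume among species in proportion to tissue percentage.

    Assumes every extract has already been normalized to the same
    concentration, so volume is proportional to DNA mass.
    """
    total = sum(tissue_pct.values())
    return {sp: pool_ul * pct / total for sp, pct in tissue_pct.items()}


# Hypothetical extract at 50 ng/ul diluted to 5 ng/ul in 20 ul.
extract_ul, water_ul = dilution_volumes(50.0, 5.0, 20.0)
print(f"extract: {extract_ul:.1f} ul, water: {water_ul:.1f} ul")

# Hypothetical three-species subset at tissue levels from the text.
volumes = proportional_volumes(
    {"species_A": 13.33, "species_B": 1.0, "species_C": 0.01}, pool_ul=100.0
)
for species, vol in volumes.items():
    print(f"{species}: {vol:.3f} ul")
```

In practice the rarest species would require sub-microliter volumes under this scheme, so serial dilution of those extracts before pooling would likely be needed.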
Metabarcoding sequencing libraries were prepared from each pool using a two-step amplicon protocol (D’Aloia, Bogdanowicz, Harrison, & Buston, 2017) in which an initial PCR targets the gene region of interest using locus-specific primers with Nextera 5’ tails (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG appended to each forward primer and 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG to each reverse primer, details in the SI). Equal volumes of the locus-specific PCR products for each sample were pooled and a second PCR added Nextera-style sequencing adapters with unique i5 and i7 indexes that allow sequencing reads to be assigned to samples during analysis (details about reagent concentrations and PCR conditions in the SI). Rather than using combinatorial indexing, which can lead to mis-assigned reads caused by index-swapping (Carøe & Bohmann, 2020; Schnell, Bohmann, & Gilbert, 2015), we used custom-synthesized adapters with unique dual indexes (Table S1) that can unequivocally identify samples by 8-base indexes on both ends of the molecule.
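The advantage of unique dual indexes over combinatorial indexing can be illustrated with a toy demultiplexing check. The sketch below is not the pipeline's demultiplexer, and the sample names and 8-base index sequences are hypothetical (the real indexes are in Table S1). It shows that a read carrying a swapped index pair is discarded under unique dual indexing but silently assigned to the wrong sample under a combinatorial design.

```python
# Hypothetical sample sheets: sample -> (i7 index, i5 index).
UNIQUE_DUAL = {
    "pool_A": ("ACGTACGT", "TGCATGCA"),
    "pool_B": ("GATCGATC", "CTAGCTAG"),
    "pool_C": ("TTAACCGG", "GGCCAATT"),
}
COMBINATORIAL = {
    "s1": ("ACGTACGT", "TGCATGCA"),
    "s2": ("ACGTACGT", "CTAGCTAG"),
    "s3": ("GATCGATC", "TGCATGCA"),
    "s4": ("GATCGATC", "CTAGCTAG"),
}


def is_unique_dual(sheet):
    """True if no i7 and no i5 index is shared between samples."""
    i7s = [i7 for i7, _ in sheet.values()]
    i5s = [i5 for _, i5 in sheet.values()]
    return len(set(i7s)) == len(i7s) and len(set(i5s)) == len(i5s)


def assign_read(i7, i5, sheet):
    """Return the sample with an exact (i7, i5) match, else None."""
    lookup = {pair: name for name, pair in sheet.items()}
    return lookup.get((i7, i5))


# A swapped read: i7 from one sample paired with the i5 of another.
swapped_udi = assign_read("ACGTACGT", "CTAGCTAG", UNIQUE_DUAL)
swapped_comb = assign_read("ACGTACGT", "CTAGCTAG", COMBINATORIAL)

print("unique dual design:", is_unique_dual(UNIQUE_DUAL))
print("swapped read, unique dual ->", swapped_udi)
print("swapped read, combinatorial ->", swapped_comb)
```

Because every index appears in exactly one sample in the unique dual design, any pair that mixes indexes from two samples matches nothing and can be filtered out.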
For each sample, PCR products for all markers were pooled into a single indexed library and sequenced with paired-end 150-bp reads on one lane of a HiSeq X Ten (Novogene, Inc.), with 15% PhiX added to compensate for moderately low library complexity (following Aizpurua et al., 2018).