Data for: Optimizing a metabarcoding marker portfolio for species detection from complex mixtures of globally diverse fishes

Baetscher, Diana

Research facility: Cornell University

Published Oct 02, 2023 on Dryad. https://doi.org/10.5061/dryad.w3r2280xm

Data files

Oct 02, 2023 version files 12.21 KB

accession_to_lineage_blast.sh (2.33 KB)
example_loci_file.txt (479 B)
example_settings_file.txt (208 B)
multilocus_metabarcoding_pipeline.sh (6.41 KB)
ncbi_database_download.sh (832 B)
README.md (1.95 KB)

Abstract

DNA metabarcoding is used to enumerate and identify taxa in both environmental samples and tissue mixtures, but the effectiveness of particular markers depends on their sensitivity to the taxa involved. Using multiple primer sets that amplify different genes can mitigate biases in amplification efficiency, sequence resolution, and reference data availability, but few empirical studies have evaluated markers for complementary performance. Here, we assess the individual and joint performance of 22 markers for detecting species in a DNA pool of 98 species of marine and freshwater bony fishes from geographically and phylogenetically diverse origins. We find that a portfolio of four markers targeting 12S, 16S, and two regions of COI identifies 100% of reference taxa to family and nearly 60% to species. We then use these four markers to evaluate metabarcoding of heterogeneous tissue mixtures, using experimental fishmeal to test: 1) the tissue input threshold to ensure detection; 2) how read depth scales with tissue abundance; and 3) the effect of non-target material in the mixture on recovery of target taxa. We consistently detect taxa that make up >1% of fishmeal mixtures and can detect taxa at the lowest input level of 0.01%, but rare taxa (<1%) were detected inconsistently across markers and replicates. Read counts showed only a weak correlation with tissue input, suggesting they are not a reliable quantitative proxy for relative abundance. Despite the limitations arising from primer specificity and reference data availability, our results demonstrate that a modest portfolio of markers can perform well in detecting and identifying aquatic species in complex mixtures that are heterogeneous in tissue representation, phylogenetic affinity, and geographic origin.

README: Data for: Optimizing a metabarcoding marker portfolio for species detection from complex mixtures of globally diverse fishes

Methods

Metabarcoding markers

Twenty-two markers for mitochondrial (COI, 12S, 16S) and nuclear (18S, 28S) barcoding genes were selected from metabarcoding, eDNA, and Sanger sequencing barcoding studies of marine and freshwater fishes, including seafood products (Table 1). Most of these markers were designed to target bony fishes (teleosts), but we added markers targeting elasmobranchs, crustaceans, and cephalopods – taxonomic groups that are often poorly resolved by universal fish barcodes. Only markers that amplified targets <300 bp were selected because shorter fragments are more likely to amplify from degraded DNA (Devloo-Delva et al., 2019; Shokralla et al., 2015; Staats et al., 2016), which is expected to be the case for highly processed fishmeal and oil.

Reference DNA pools

To compare the amplification and resolution of the 22 markers before determining complementarity, we constructed two pools of extracted DNA, each with equal concentrations of every extract. The first (full reference pool; Table 2) contained DNA from 98 marine and freshwater teleost fishes and five elasmobranch, crustacean, and cephalopod species, in total spanning 88 genera and 60 families. Samples were obtained primarily from vouchered collections, but also from fish markets to encompass commercially important groups. We sampled muscle tissue from inside the body wall (i.e., no surface contact) for DNA extractions to avoid trace contamination from contact with other species. To further minimize the potential for detecting false positives from tissue contamination, we constructed a second, more restricted reference pool including only the 73 DNA extracts from vouchered museum specimens (vouchered reference pool).

Experimental tissue mixture samples

Metabarcoding is typically used to detect both rare and abundant constituents in mixtures, and most applications include species in unequal proportions along with varying amounts and types of non-target material. In aquaculture feeds, we will refer to the non-target material as “filler.” To evaluate detection power in actual tissue mixtures (as opposed to pools of DNA extracts), we used fishmeal mixed with different fillers. The purpose of the filler was to test whether metabarcoding data are negatively impacted by fillers, either because of a loss of on-target sequencing reads or because of potential PCR inhibition. Similarly complex and heterogeneous mixtures of tissue sources might be expected in gut content or fecal samples in more ecological applications.

To create experimental fishmeal, we freeze-dried tissue from 30 of the unvouchered fish species in the full reference pool (muscle tissue from market samples; whole fish from research samples), coarsely homogenized each sample in a coffee grinder, and then finely ground it in a freezer mill, in which each tissue sample is pulverized within a container submerged in liquid nitrogen. Species were added one by one, and between samples we decontaminated all containers and tools by wiping them with a 10% bleach solution followed by 70% ethanol. Species were assigned to one of six abundance levels: 13.33%, 3.65%, 1.91%, 1%, 0.1%, or 0.01% of the mixture (by weight), thereby spanning >3 orders of magnitude variation in representation (Table 3). Each abundance level was represented by five species, which were assigned to balance freshwater and marine habitats, major phylogenetic groups, and degree of fishery interest across levels. This experimental design allowed us to assess how the proportion at which dominant and rare taxa were added to a heterogeneous mixture relates to the proportion of sequencing reads attributed to each taxon, and to compare amplification biases across multiple taxa added in the same amount to the fishmeal.
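As a quick arithmetic check of the design described above, the following minimal sketch confirms that six abundance levels with five species each account for 30 species and the whole fishmeal, and that the levels span a little more than three orders of magnitude. The variable and function names are illustrative and do not come from the dataset's scripts.

```python
import math

# Abundance levels (% of fishmeal by weight) and species per level, as stated in the text.
LEVELS_PCT = [13.33, 3.65, 1.91, 1.0, 0.1, 0.01]
SPECIES_PER_LEVEL = 5


def design_summary(levels_pct, species_per_level):
    """Return (number of species, total % of fishmeal, orders of magnitude spanned)."""
    n_species = len(levels_pct) * species_per_level
    total_pct = sum(level * species_per_level for level in levels_pct)
    span = math.log10(max(levels_pct) / min(levels_pct))
    return n_species, total_pct, span


n_species, total_pct, span = design_summary(LEVELS_PCT, SPECIES_PER_LEVEL)
print(f"{n_species} species, {total_pct:.2f}% of fishmeal, "
      f"{span:.2f} orders of magnitude between highest and lowest level")
```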

To test the effect of the non-target material, the fishmeal mixtures were combined with two unique fillers for a total of seven individual experimental feeds with low (2%), medium (10%), and high (25%) proportions of fishmeal relative to filler (Table 3). Fillers included plant-derived materials – grain and grass flours – and animal byproducts – bloodmeal and feathermeal – to represent mixture constituents used in aquaculture feeds. Fishmeal proportions also mimicked potential levels of fish tissue added to aquaculture feeds, from low (0%-2%) to high (25%) proportions of fish in the feed mixture. By multiplying the proportion of fishmeal in the experimental feed by the proportion of a particular fish species in the fishmeal, we could test the detection threshold for individual taxa down to 0.0002% of total experimental mixture mass (i.e., minimum of 0.01% of a particular species in the fishmeal and 2% fishmeal in the feed).
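The multiplication described above can be sketched as follows. This is an illustrative calculation only, using the fishmeal proportions and abundance levels given in the text; the dictionary keys and function name are assumptions made for the example.

```python
# Proportion of fishmeal in each experimental feed (% of feed mass), from the text.
FISHMEAL_IN_FEED_PCT = {"low": 2.0, "medium": 10.0, "high": 25.0}

# Proportion of a species in the fishmeal (% of fishmeal mass), from the text.
SPECIES_IN_FISHMEAL_PCT = [13.33, 3.65, 1.91, 1.0, 0.1, 0.01]


def species_in_feed_pct(species_pct, fishmeal_pct):
    """Percent of total feed mass contributed by one species."""
    return species_pct * fishmeal_pct / 100.0


# One row per feed type, one column per abundance level.
for feed, fishmeal_pct in FISHMEAL_IN_FEED_PCT.items():
    row = [species_in_feed_pct(s, fishmeal_pct) for s in SPECIES_IN_FISHMEAL_PCT]
    print(feed, ", ".join(f"{value:.4g}%" for value in row))

# Rarest species in the most dilute feed: the lowest input level tested.
lowest = species_in_feed_pct(min(SPECIES_IN_FISHMEAL_PCT),
                             min(FISHMEAL_IN_FEED_PCT.values()))
```

Here `lowest` corresponds to the 0.0002% threshold quoted above (0.01% of the fishmeal in a feed containing 2% fishmeal).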

DNA extracts were quantified with a Qubit fluorometer (high-sensitivity or broad-range dsDNA assay depending on concentration range), diluted with DNase-free water, and added in equal proportion to the full reference and vouchered reference DNA pools. DNA extracts from the 30 fishmeal species were combined in two additional mock DNA pools: one with equal concentration among all taxa (mock equal) and the other in which DNA extract concentration was proportionate to the amount of tissue included in the fishmeal (mock variable). As with the reference DNA pools, the mock equal and mock variable pools were prepared in triplicate (Fig. S1).
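One way to picture the difference between the mock equal and mock variable pools is as a weighting of how much DNA each extract contributes. The sketch below is not the authors' protocol: the sample names, Qubit readings, and total DNA mass are hypothetical placeholders, and it assumes that "proportionate to the amount of tissue" means weighting DNA mass by each species' percentage in the fishmeal.

```python
def pooling_volumes(conc_ng_per_ul, weights, total_ng):
    """Return the volume (µL) of each extract so DNA mass is split according to weights."""
    weight_sum = sum(weights.values())
    volumes = {}
    for sample, conc in conc_ng_per_ul.items():
        target_ng = total_ng * weights[sample] / weight_sum
        volumes[sample] = target_ng / conc
    return volumes


# Hypothetical Qubit readings (ng/µL) for three placeholder extracts.
conc = {"sp_A": 20.0, "sp_B": 5.0, "sp_C": 10.0}

# Mock equal: every extract contributes the same DNA mass.
equal = pooling_volumes(conc, {sample: 1.0 for sample in conc}, total_ng=30.0)

# Mock variable: DNA mass weighted by the species' share of the fishmeal (%).
tissue_pct = {"sp_A": 13.33, "sp_B": 1.0, "sp_C": 0.01}
variable = pooling_volumes(conc, tissue_pct, total_ng=30.0)

for name, volumes in (("mock equal", equal), ("mock variable", variable)):
    print(name, {sample: round(vol, 4) for sample, vol in volumes.items()})
```

In the weighted pool the rarest extracts need very small volumes, which is one practical reason to dilute extracts before pooling.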

Metabarcoding sequencing libraries were prepared from each pool using a two-step amplicon protocol (D’Aloia, Bogdanowicz, Harrison, & Buston, 2017) in which an initial PCR targets the gene region of interest using locus-specific primers with Nextera 5’ tails (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG appended to each forward primer and 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG to each reverse primer, details in the SI). Equal volumes of the locus-specific PCR products for each sample were pooled and a second PCR added Nextera-style sequencing adapters with unique i5 and i7 indexes that allow sequencing reads to be assigned to samples during analysis (details about reagent concentrations and PCR conditions in the SI). Rather than using combinatorial indexing, which can lead to mis-assigned reads caused by index-swapping (Carøe & Bohmann, 2020; Schnell, Bohmann, & Gilbert, 2015), we used custom-synthesized adapters with unique dual indexes (Table S1) that can unequivocally identify samples by 8-base indexes on both ends of the molecule.
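To illustrate how the tailed first-round primers are assembled, the sketch below prepends the two tail sequences given above (written without internal spaces) to a pair of locus-specific primers. The primer sequences in the example are arbitrary placeholders, not any of the 22 markers, and the function name is an assumption for illustration.

```python
# Nextera 5' tails quoted in the text (forward and reverse), without internal spaces.
NEXTERA_FWD_TAIL = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
NEXTERA_REV_TAIL = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

# IUPAC nucleotide codes, so degenerate primers are accepted.
IUPAC_BASES = set("ACGTRYSWKMBDHVN")


def add_nextera_tails(forward_primer, reverse_primer):
    """Return (tailed forward, tailed reverse) primers, both written 5'->3'."""
    tailed = []
    for tail, primer in ((NEXTERA_FWD_TAIL, forward_primer),
                         (NEXTERA_REV_TAIL, reverse_primer)):
        primer = primer.strip().upper()
        invalid = set(primer) - IUPAC_BASES
        if not primer or invalid:
            raise ValueError(f"Not a valid primer sequence: {primer!r}")
        tailed.append(tail + primer)
    return tuple(tailed)


# Placeholder locus-specific primers (hypothetical, for illustration only).
tailed_fwd, tailed_rev = add_nextera_tails("ACGTACGTACGTACGTAC", "TGCATGCATGCATGCATG")
print(tailed_fwd)
print(tailed_rev)
```

Both tails end in the same 19-base segment (`AGATGTGTATAAGAGACAG`), which is what the Nextera-style indexing adapters anneal to in the second PCR.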

For each sample, PCR products for all markers were pooled into a single indexed library and sequenced as paired-end 150-bp reads on one lane of a HiSeq X Ten (Novogene, Inc.) with 15% PhiX to account for moderately low library complexity (following Aizpurua et al., 2018).