Ichthyoplankton metabarcoding: an efficient tool for early detection of invasive species establishment
Data files
Aug 25, 2025 version files (284.33 MB total)
- 00_initial_database.zip (14.79 MB)
- 01_sample_map.zip (2.52 MB)
- 02_database_build.zip (16.85 MB)
- 03_qiime2_analysis.zip (229.88 MB)
- 04_read_correction.zip (42.10 KB)
- 05_sampling_stats.zip (80.55 KB)
- 06_river_richness.zip (268.11 KB)
- 07_gear_richness.zip (27 KB)
- 08_primer_consistency.zip (4.31 MB)
- 09_sampling_depth.zip (20.42 KB)
- 10_database_comparison.zip (8.52 MB)
- 11_amplicon_distance.zip (7 MB)
- README.md (25.68 KB)
Abstract
Detection of invasive species is critical for management but is often limited by challenges associated with capture, processing, and identification of early life stages. DNA metabarcoding facilitates large-scale monitoring projects to detect establishment early. Here, we test the use of DNA metabarcoding to monitor invasive species by sequencing over 5000 fishes in bulk ichthyoplankton samples (larvae and eggs) from four rivers of ecological and cultural importance in southern Canada. We were successful in detecting species known from each river, and three invasive species in two of the four rivers. This includes the first detection of early life-stage rudd in the Credit River. We evaluated whether sampling gear affected detection of invasive species and estimates of species richness and found that light traps outperform bongo nets in both cases. We also found that the primers used for amplification of target sequences and the number of sequencing reads generated per sample affect the consistency of species detections. However, these factors have less impact on detections and species richness estimates than the number of samples collected and analyzed. Our analyses also show that incomplete reference databases can result in incorrectly attributing DNA sequences to invasive species. Overall, we conclude that DNA metabarcoding is an efficient tool for monitoring the early establishment of invasive species by detecting evidence of reproduction, but requires careful consideration of sampling design and the primers used to amplify, sequence, and classify the diversity of native and potential invasive species.
In this study, we used DNA metabarcoding of bulk samples of early life-stage fishes to analyze species composition and monitor invasive fishes in four major rivers in southern Ontario, Canada. Samples were collected using two different gear types, and we analyzed each sample by amplifying and sequencing two different barcode sequence markers in the COI and 12S mitochondrial genes, allowing comparison of how differences in methodology affect our results.
DNA was extracted from homogenized bulk ichthyoplankton samples using a salt extraction protocol. The first PCR amplified the target regions using MiFish 12S or modified PS1 teleost primers carrying heterogeneity spacers and Nextera adaptor sequences. The second PCR added combinatorial Nextera i5/i7 indices. Products were purified using AMPure XP magnetic beads and quantified with a Qubit broad-range kit, then sequenced on an Illumina MiSeq using V2 chemistry (2 x 150 bp).
We compared invasive species detections, species richness, and community composition among the four rivers sampled. We also evaluated invasive species detections and species richness in relation to the number of individuals sequenced, gear type, primers, sequence markers, depth of sequencing, and reference database completeness. Based on the results of this study, we evaluated the most effective methods for detecting invasive species early in establishment using DNA metabarcoding.
Description of the data and file structure
ABBREVIATIONS AND ACRONYMS:
ASV : amplicon sequence variant
NTC : no template control
├── 00_initial_database.zip # Files from an initial analysis using a reference database that did not include Notropis hudsonius, Notropis buchanani, and Notropis volucellus 12S sequences
│ ├── Database_build # Files used to generate the reference database for QIIME2/VSEARCH classification of sequencing reads
│ │ ├── GL_mito_seq.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region
│ │ ├── GL_mito_tax.txt # Taxonomy file for the reference mitochondrial DNA sequences NOT including Notropis hudsonius, Notropis buchanani, and Notropis volucellus
│ │ ├── GL_mito_seq_plus.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region and Notropis hudsonius, Notropis buchanani, and Notropis volucellus 12S sequences generated in this study
│ │ ├── GL_mito_tax_plus.txt # Taxonomy file for the reference mitochondrial DNA sequences including Notropis hudsonius, Notropis buchanani, and Notropis volucellus
│ │ ├── sequences_plus.qza # QIIME2 formatted sequence feature file of GL_mito_seq_plus.fas
│ │ ├── sequences.qza # QIIME2 formatted sequence feature file of GL_mito_seq.fas
│ │ ├── taxonomy_plus.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax_plus.txt
│ │ └── taxonomy.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax.txt
│ └── Read_correction # Files from the classification of sequencing reads using both reference databases
│ ├── qiime2_*_out.csv # Raw output tables of reads per ASV by sample from QIIME2 sequence classification pipeline
│ ├── _Read_correction.R # R code used to correct reads for contamination from NTCs and single fish samples and generate supp_fig1_corrected_data.pdf
│ ├── sample_data_DFO.csv # Table containing location data and collection information for each sampling effort
│ ├── supp_fig1_corrected_data.pdf # Output figure visualizing the information from "metabarcoding_results_corrected.csv"
│ └── metabarcoding_results_corrected.csv # Output table containing read counts by classified species for each sample after correction for contamination
├── 01_sample_map.zip # Files used to generate the map in figure 1 showing sampling locations and the drainages sampled
│ ├── sample_data_DFO.csv # Table containing location data and collection information for each sampling effort
│ ├── _sample_location_map.R # R code used to generate map figures
│ ├── thames_river.pdf # Output plot for the Thames River samples
│ ├── ausable_river.pdf # Output plot for the Ausable River samples
│ ├── credit_river.pdf # Output plot for the Credit River samples
│ ├── grand_river.pdf # Output plot for the Grand River samples
│ └── ont_map.pdf # Output plot for the map of Ontario with the sampled rivers and watersheds displayed
├── 02_database_build.zip # Files used to generate the reference database of mitochondrial sequences for ASV classification
│ │ ├── GL_mito_seq.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region
│ │ ├── GL_mito_tax.txt # Taxonomy file for the reference mitochondrial DNA sequences NOT including Notropis hudsonius and Notropis volucellus
│ │ ├── GL_mito_seq_plus.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region and the Notropis hudsonius and Notropis volucellus 12S sequences generated in this study
│ │ ├── GL_mito_tax_plus.txt # Taxonomy file for the reference mitochondrial DNA sequences including Notropis hudsonius and Notropis volucellus
│ ├── Ontario_freshwater_fishes_db.csv # Table of fish species that inhabit or are likely invaders of rivers in Ontario
│ ├── _Qiime2_database_formatter.sh # Bash script used to generate the QIIME2 formatted files for ASV classification
│ ├── _Reference_database_build.R # R code used to format the fasta files and taxonomy files for QIIME2
│ ├── sequences_plus.qza # QIIME2 formatted sequence feature file of GL_mito_seq_plus.fas
│ ├── sequences.qza # QIIME2 formatted sequence feature file of GL_mito_seq.fas
│ ├── taxonomy_plus.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax_plus.txt
│ └── taxonomy.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax.txt
├── 03_qiime2_analysis.zip # Files used to run the QIIME2 pipeline that trims/denoises/classifies metabarcoding reads (raw sequencing reads are available from the NCBI SRA under BioProject PRJNA948129)
│ ├── 01_sample_manifest.sh # Bash script used to generate the tab-delimited file required for QIIME2 sample processing
│ ├── 02_qiime2_denoising.sh # Bash script used to trim and denoise samples using cutadapt and dada2 in QIIME2
│ ├── 03_qiime2_classification.sh # Bash script used to classify ASVs using VSEARCH in QIIME2
│ ├── glcoi_results # Results from the analysis of the sequencing reads generated using the GLcoi primers
│ │ ├── qiime2_output # Output files from QIIME2 pipeline
│ │ │ ├── classified_rep_seqs_98.qza # Qiime2 output data file of the representative sequences classified using 98% cutoff for classification and the reference database WITHOUT addition of Notropis species
│ │ │ ├── classified_rep_seqs_98.qzv # Qiime2 output visualization file of the representative sequences using 98% cutoff for classification and the reference database WITHOUT addition of Notropis species
│ │ │ ├── classified_rep_seqs_98_supp.qza # Qiime2 output data file of the representative sequences classified using 98% cutoff for classification and the reference database WITH additional Notropis species
│ │ │ ├── classified_rep_seqs_98_supp.qzv # Qiime2 output visualization file of the representative sequences using 98% cutoff for classification and the reference database WITH additional Notropis species
│ │ │ ├── dada2_rep_set.qza # Qiime2 output data file of the representative sequences from denoised raw sequencing reads
│ │ │ ├── dada2_stats.qza # Qiime2 output data file containing the dada2 denoising statistics
│ │ │ ├── dada2_table.qza # Qiime2 output data file containing the dada2 denoised reads per sample by ASV
│ │ │ ├── dada2_table.qzv # Qiime2 output visualization file of the denoised reads per sample by ASV
│ │ │ ├── MiSeq_raw_reads.qza # Qiime2 output data file of the raw reads
│ │ │ ├── MiSeq_raw_reads.qzv # Qiime2 output visualization file of the raw reads
│ │ │ ├── MiSeq_trimmed_reads.qza # Qiime2 output data file of the trimmed reads (trimmed using cutadapt)
│ │ │ ├── MiSeq_trimmed_reads.qzv # Qiime2 output visualization file of the reads remaining after trimming
│ │ │ ├── rep-seqs.qzv # Qiime2 output visualization file of the representative sequences after denoising
│ │ │ ├── taxa_barplot_98.qzv # Qiime2 output visualization file summarizing the number of reads by classified species per sample for the reference database WITHOUT addition of Notropis species
│ │ │ └── taxa_barplot_98_supp.qzv # Qiime2 output visualization file summarizing the number of reads by classified species per sample for the reference database WITH additional Notropis species
│ │ ├── sample_manifest.txt # tab-delimited file containing sample IDs and path to the forward and reverse reads
│ │ └── sample_metadata.txt # tab-delimited file containing sample IDs and other pertinent sample metadata
│ ├── mifish_results # Results from the analysis of the sequencing reads generated using the MiFish-U primers (all output files are in the same format as for the GLcoi primer analysis above)
│ │ ├── qiime2_output
│ │ │ ├── classified_rep_seqs_98.qza
│ │ │ ├── classified_rep_seqs_98.qzv
│ │ │ ├── classified_rep_seqs_98_supp.qza
│ │ │ ├── classified_rep_seqs_98_supp.qzv
│ │ │ ├── dada2_rep_set.qza
│ │ │ ├── dada2_stats.qza
│ │ │ ├── dada2_table.qza
│ │ │ ├── dada2_table.qzv
│ │ │ ├── MiSeq_raw_reads.qza
│ │ │ ├── MiSeq_raw_reads.qzv
│ │ │ ├── MiSeq_trimmed_reads.qza
│ │ │ ├── MiSeq_trimmed_reads.qzv
│ │ │ ├── rep-seqs.qzv
│ │ │ ├── taxa_barplot_98.qzv
│ │ │ └── taxa_barplot_98_supp.qzv
│ │ ├── sample_manifest.txt # tab-delimited file containing sample IDs and path to the forward and reverse reads
│ │ └── sample_metadata.txt # tab-delimited file containing sample IDs and other pertinent sample metadata
│ └── sequence_db
│ ├── sequences_plus.qza # QIIME2 formatted sequence feature file of GL_mito_seq_plus.fas
│ ├── sequences.qza # QIIME2 formatted sequence feature file of GL_mito_seq.fas
│ ├── taxonomy_plus.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax_plus.txt
│ └── taxonomy.qza # QIIME2 formatted taxonomy feature file of GL_mito_tax.txt
├── 04_read_correction.zip # Files used to remove erroneous sequencing reads through comparison to no template controls and single-individual samples
│ ├── _Read_correction.R # R code used to correct reads for contamination from NTCs and single fish samples and generate Supp_table_X.csv and supp_fig1_corrected_data.pdf
│ ├── metabarcoding_results_corrected.csv # Output table containing read counts by classified species for each sample after correction for contamination
│ ├── qiime2_glcoi_out.csv # Tabular output of the QIIME2 visualization file taxa_barplot_98.qzv from the glcoi analysis
│ ├── qiime2_glcoi-supp_out.csv # Tabular output of the QIIME2 visualization file taxa_barplot_98_supp.qzv from the glcoi analysis
│ ├── qiime2_mifish_out.csv # Tabular output of the QIIME2 visualization file taxa_barplot_98.qzv from the MiFish-U analysis
│ ├── qiime2_mifish-supp_out.csv # Tabular output of the QIIME2 visualization file taxa_barplot_98_supp.qzv from the MiFish-U analysis
│ ├── sample_data_DFO.csv # Table containing location data and collection information for each sampling effort
│ ├── supp_fig1_corrected_data.pdf # Output figure visualizing the information from "metabarcoding_results_corrected.csv" (Supplementary Figure S1 in manuscript)
│ └── Supp_table_X.csv # Simplified table of results in "metabarcoding_results_corrected.csv" for manuscript (Supplementary table S3 in manuscript)
├── 05_sampling_stats.zip # Files used to summarize and generate simple statistics relating to the number of fish collected from each river and the number of reads generated per sample
│ ├── metabarcoding_results_corrected.csv # Output table containing read counts by classified species for each sample after correction for contamination
│ ├── _Sample_stats.R # R code used to summarize the number of samples and fishes collected in each river and the number of reads generated per sample
│ └── Table2.numbers # Table containing the summary of the number of samples and fishes collected in each river (Table 1 in manuscript)
├── 06_river_richness.zip # Files used for the comparison of species detections (total and per sample) across rivers (Figure 2 in manuscript)
│ ├── metabarcoding_results_corrected.csv # Table containing read counts by classified species for each sample after correction for contamination
│ ├── Fig2A_sp_richness_per_river.pdf # Output barplot showing the incidence of each species by river
│ ├── Fig2B_sp_richness_per_sample.pdf # Output scatterplot showing the number of species detected in each sample by the total number of individuals in the sample
│ ├── Fig2C_jaccard_dist.pdf # Output violin plots visualizing the pairwise Jaccard distances of species incidence between samples in each river
│ ├── regression line.png # Screenshot of the output from the regression analysis in Figure 2B
│ └── _Species_richness_by_river.R # R code used to generate figure 2 showing the comparisons of species richness in the four rivers sampled
├── 07_gear_richness.zip # Files used for the analysis of differences in species detections between gear types (Figure 3 in manuscript)
│ ├── metabarcoding_results_corrected.csv # Table containing read counts by classified species for each sample after correction for contamination
│ ├── Fig3A_frequency_by_gear.pdf # Output dumbbell plot comparing the detection frequency of species with each gear type
│ ├── Fig3B_sp_accumulation_plot.pdf # Output scatter plot showing the accumulation of species and individuals after X number of random samples
│ └── _Species_richness_by_gear.R # R code used to generate figure 3 showing the differences in frequencies of species detection by gear type
├── 08_primer_consistency.zip # Files used for the comparison of the COI (GLcoi) and 12S (MiFish-U) primer efficiency and consistency in species detection across samples (Figure 4 in manuscript)
│ ├── metabarcoding_results_corrected.csv # Table containing read counts by classified species for each sample after correction for contamination
│ ├── Fig4A_primer_consistency_plot.pdf # Output barplot showing the frequency each species was detected with each primer type
│ ├── Fig4B_primers_missed_species.pdf # Output scatterplot showing the sample properties associated with the failure to detect specific species
│ ├── GL_mito_seq_supp.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region
│ ├── GL_mito_tax_supp.txt # Taxonomy file for the reference mitochondrial DNA sequences including Notropis hudsonius, Notropis buchanani, and Notropis volucellus
│ ├── in_silico_PCR_df.csv # Output table of the results of the in silico analysis of the species detected using both primer sets
│ └── _Primer_efficiency_and_consistency.R # R code used to run the in silico PCR and generate figure 4 in the manuscript
├── 09_sampling_depth.zip # Files used to compare if species incidence data are better estimated from fewer samples with more reads or more samples with fewer reads (Figure 5 in manuscript)
│ ├── metabarcoding_results_corrected.csv # Table containing read counts by classified species for each sample after correction for contamination
│ ├── Fig5_reads_vs_samples.pdf # Scatter plot showing the reduction in the number of species detected after a reduction in the number of reads or samples
│ └── _Sample_frequency_vs_depth.R # R code used to run simulations, subsampling reads and samples to compare the importance of each in the total number of species detected across rivers (figure 5 in manuscript)
├── 10_database_comparison.zip # Files used to identify the number of individuals of each species represented in the reference sequence database for each gene fragment
│ ├── _Amplicon_matches_in_database.sh # Shell script used to call QIIME2 in silico processes used to identify which sequences in the reference database overlap with each sequence fragment targeted by each pair of metabarcoding primers (Figure 6 in manuscript)
│ ├── glcoi_is_output # Output file from the QIIME2 analysis of sequence overlap with the target region of the GLcoi primers
│ │ └── dna-sequences.fas # FASTA file of glcoi sequences in the reference database
│ ├── Glcoi_is_pcr_seq.qza # QIIME2 output data file from the QIIME2 analysis of sequence overlap with the GLcoi primers
│ ├── glcoi_species_detected.txt # List of individuals of each species detected in database with sequences overlapping the target region of the GLcoi primers
│ ├── GL_mito_seq.fas # Fasta file containing mitochondrial DNA sequences for fishes from the Great Lakes region
│ ├── mitofish_is_output # Output file from the QIIME2 analysis of sequence overlap with the target region of the MiFish-U primers
│ │ └── dna-sequences.fasta # FASTA file of MiFish-U sequences in the reference database
│ ├── Mitofish_is_pcr_seq.qza # QIIME2 output data file from the QIIME2 analysis of sequence overlap with the MiFish-U primers
│ ├── mitofish_species_detected.txt # List of individuals of each species detected in database with sequences overlapping the target region of the MiFish-U primers
│ ├── Primer_list.tsv # List of primers used in this study (GLcoi and MiFish-U)
│ ├── sequence_frequency_plot.pdf # Output barplot showing the number of representative individuals for each species included in the reference database
│ └── sequences.qza # QIIME2 formatted sequence feature file of GL_mito_seq.fas
├── 11_amplicon_distance.zip # Files used to compare the genetic distance between species for each sequence fragment targeted by the two primer sets used and generate a gene tree for comparing distances among Notropis species (Figure 6 in manuscript)
│ ├── Database_distance_differences.R # R code used to align the metabarcoding sequence fragment libraries and calculate and compare pairwise distances between species
│ ├── dist_df.csv # Output data table of all pairwise distances between species
│ ├── glcoi_is_output # Output file from the QIIME2 analysis of sequence overlap with the target region of the GLcoi primers
│ │ └── dna-sequences.fas # FASTA file of glcoi sequences in the reference database
│ ├── glcoi_seq_aln.fas # Sequence alignment of the glcoi sequence fragments
│ ├── glcoi_seq.fas # Unaligned glcoi sequence fragments
│ ├── mitofish_is_output # Output file from the QIIME2 analysis of sequence overlap with the target region of the MiFish-U primers
│ │ └── dna-sequences.fasta # FASTA file of MiFish-U sequences in the reference database
│ ├── mitofish_seq_aln.fas # Sequence alignment of the MiFish-U sequence fragments
│ ├── sequence_distance_plot.pdf # Output scatter plot of the proportion of species with pairwise distances less than a series of frequently used cutoffs for species classification
│ ├── mitofish_seq.fas # Unaligned MiFish-U sequence fragments
│ ├── Notropis_12S.fas # Sequence alignment of the MiFish-U 12S sequence fragments for the Notropis species that were difficult to classify in our analyses
│ ├── Notropis_12S_tree.pdf # Phylogenetic tree of Notropis species
│ └── notropis_gene_tree # Files used to generate the Notropis gene tree
│ ├── 12S_distance_tree.R # R code used to reconstruct and plot the Notropis gene tree
│ ├── Notripis_distance_tree.pdf # Output plot of the Notropis 12S gene tree
│ ├── Notropis_12S.fas # Notropis MiFish-U 12S gene fragment alignment (FASTA file)
│ └── Notropis_12S.fas.nex # Notropis MiFish-U 12S gene fragment alignment nexus file for Mesquite
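Fig2C_jaccard_dist.pdf in 06_river_richness.zip summarizes pairwise Jaccard distances of species incidence between samples. The sketch below illustrates that calculation in Python on made-up rows. It is not the code in _Species_richness_by_river.R. It assumes that a species counts as detected when Corrected_reads is greater than zero and that samples are keyed by Extraction_code. The species labels and read counts are hypothetical.

```python
from itertools import combinations


def incidence_by_sample(rows):
    """Map each sample to the set of species with at least one corrected read."""
    detected = {}
    for row in rows:
        if row["Corrected_reads"] > 0:
            detected.setdefault(row["Extraction_code"], set()).add(row["Species"])
    return detected


def jaccard_distance(a, b):
    """Jaccard distance: 1 - |intersection| / |union| of two species sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(detected):
    """Distance for every unordered pair of samples."""
    return {
        (s1, s2): jaccard_distance(detected[s1], detected[s2])
        for s1, s2 in combinations(sorted(detected), 2)
    }


# Hypothetical rows using the column names documented for
# metabarcoding_results_corrected.csv; all values are invented.
rows = [
    {"Extraction_code": "E1", "Species": "sp_a", "Corrected_reads": 120},
    {"Extraction_code": "E1", "Species": "sp_b", "Corrected_reads": 15},
    {"Extraction_code": "E1", "Species": "sp_c", "Corrected_reads": 0},
    {"Extraction_code": "E2", "Species": "sp_b", "Corrected_reads": 60},
    {"Extraction_code": "E2", "Species": "sp_c", "Corrected_reads": 9},
    {"Extraction_code": "E2", "Species": "sp_d", "Corrected_reads": 4},
]
distances = pairwise_jaccard(incidence_by_sample(rows))
```

Samples with no detected species do not appear in the incidence map, so they are left out of the pairwise comparison in this sketch.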
File/Folder Details
Details for: sample_data_DFO.csv
- Description: a comma-delimited file containing location data and collection information for each sampling effort
- Format(s): .csv
- Size(s): 7 KB
- Dimensions: 73 rows by 12 columns
- Variables:
- Extraction_code: Code relating sample information to the DNA extraction
- Project_code: River that the sample(s) was collected from
- Date: Date of collection; dd-mm-yyyy
- Latitude: Initial latitude for sample collection (DD)
- Longitude: Initial longitude for sample collection (DD)
- Gear: Gear type used to collect sample
- Total_fish: Total number of larval fishes collected
- Total_eggs: Total number of eggs collected; blank cells = 0 eggs
- Total_weight: Total weight of the ichthyoplankton sample (g)
- Start_time: Start time of the sampling effort; blank cells = not recorded; hh:mm
- Stop_time: Stop time of the sampling effort; blank cells = not recorded; hh:mm
- Notes: Additional habitat information pertaining to each sample; "voucher combined" refers to the pooling of larval fish from each of the three light traps set at a location; blank cells = nothing reported
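The conventions listed above (dd-mm-yyyy dates, blank Total_eggs cells meaning 0 eggs, blank times meaning not recorded) can be applied when loading the table. The following is a minimal Python sketch using only the standard library. The two example rows, including the river names, gear labels, dates, and counts, are invented for illustration and are not taken from sample_data_DFO.csv.

```python
import csv
import io
from datetime import datetime


def parse_sample_row(row):
    """Convert one row of the sample table into typed values."""
    eggs = row["Total_eggs"].strip()
    return {
        "Extraction_code": row["Extraction_code"],
        "Project_code": row["Project_code"],
        # Dates are documented as dd-mm-yyyy.
        "Date": datetime.strptime(row["Date"], "%d-%m-%Y").date(),
        "Gear": row["Gear"],
        "Total_fish": int(row["Total_fish"]),
        # Blank Total_eggs cells are documented as 0 eggs.
        "Total_eggs": int(eggs) if eggs else 0,
        # Blank times mean "not recorded", so they are kept as None.
        "Start_time": row["Start_time"].strip() or None,
        "Stop_time": row["Stop_time"].strip() or None,
    }


def read_samples(handle):
    """Read every row from an open CSV handle."""
    return [parse_sample_row(r) for r in csv.DictReader(handle)]


# Hypothetical rows with a subset of the documented columns.
EXAMPLE = """Extraction_code,Project_code,Date,Gear,Total_fish,Total_eggs,Start_time,Stop_time
X001,Credit,05-06-2019,Light trap,12,,21:30,
X002,Grand,17-06-2019,Bongo net,40,3,,
"""
samples = read_samples(io.StringIO(EXAMPLE))
```

For the real file, the same function could be called as `read_samples(open("sample_data_DFO.csv", newline=""))`, assuming the column headers match the variable names listed above.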
Details for: metabarcoding_results_corrected.csv
- Description: a comma-delimited file containing read counts by classified species for each sample after correction for contamination
- Format(s): .csv
- Size(s): 73 KB
- Dimensions: 479 rows by 17 columns
- Variables:
- index: Sequencing code referring to the demultiplexed forward and reverse reads from the MiSeq sequencer
- Extraction_code: Code relating sample information to the DNA extraction
- Species: Species classification in scientific format
- Reads: Number of denoised sequencing reads generated for each classified species
- Primers: Refers to primer set and reference database; glcoi/mitofish refer to the primer set used to amplify the extracted DNA; (-supp) identifies which reference database was used for the species classification; no suffix = original reference database was used; -supp = supplemented database was used.
- Corrected_reads: Number of denoised sequencing reads remaining after correcting for contamination (by subtracting maximum number of reads detected for each species in NTCs)
- Project_code: River that the sample(s) was collected from
- Date: Date of collection; dd-mm-yyyy
- Latitude: Initial latitude for sample collection (DD)
- Longitude: Initial longitude for sample collection (DD)
- Gear: Gear type used to collect sample
- Total_fish: Total number of larval fishes collected
- Total_eggs: Total number of eggs collected; blank cells = 0 eggs
- Total_weight: Total weight of the ichthyoplankton sample (g)
- Start_time: Start time of the sampling effort; blank cells = not recorded; hh:mm
- Stop_time: Stop time of the sampling effort; blank cells = not recorded; hh:mm
- Read_sums: Total number of denoised sequencing reads in each sample after correcting for contamination
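The Primers and Corrected_reads variables above can be illustrated with a short Python sketch. It splits a Primers label into primer set and reference database, then subtracts the maximum number of reads seen for each species in the NTCs. This is not the code in _Read_correction.R. Two choices here are assumptions: the NTC maximum is taken separately for each Primers value, and corrected counts are floored at zero. The sketch does not cover the correction based on single fish samples, and all read counts are invented.

```python
from collections import defaultdict


def split_primers(label):
    """Split a Primers value such as 'glcoi-supp' into (primer_set, database)."""
    if label.endswith("-supp"):
        return label[: -len("-supp")], "supplemented"
    return label, "original"


def correct_reads(sample_rows, ntc_rows):
    """Subtract the per-species NTC maximum from each sample's read count."""
    # Maximum reads seen for each (Primers, Species) pair in any NTC.
    ntc_max = defaultdict(int)
    for row in ntc_rows:
        key = (row["Primers"], row["Species"])
        ntc_max[key] = max(ntc_max[key], row["Reads"])

    corrected = []
    for row in sample_rows:
        key = (row["Primers"], row["Species"])
        # Assumption: counts that would become negative are set to zero.
        value = max(row["Reads"] - ntc_max.get(key, 0), 0)
        corrected.append({**row, "Corrected_reads": value})
    return corrected


# Hypothetical read counts for illustration only.
sample_rows = [
    {"index": "S1", "Species": "Notropis hudsonius", "Primers": "mitofish", "Reads": 500},
    {"index": "S2", "Species": "Notropis hudsonius", "Primers": "mitofish", "Reads": 8},
    {"index": "S1", "Species": "Notropis volucellus", "Primers": "mitofish", "Reads": 40},
]
ntc_rows = [
    {"index": "NTC1", "Species": "Notropis hudsonius", "Primers": "mitofish", "Reads": 10},
    {"index": "NTC2", "Species": "Notropis hudsonius", "Primers": "mitofish", "Reads": 3},
]
corrected = correct_reads(sample_rows, ntc_rows)
```

In this example the NTC maximum for Notropis hudsonius is 10 reads, so the sample with 8 reads is corrected to zero while the species absent from the NTCs keeps its original count.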
Usage notes for QIIME formatted files
- Files with the ".qza" extension are zipped QIIME2 data artifacts that hold data and metadata from QIIME2 analyses. They serve as input for further processing or for exporting data to other tools with the software included in the QIIME2 bioinformatics platform.
- Files with the ".qzv" extension are zipped QIIME2 archives containing interactive graphical output from QIIME2 analyses; they can be opened with the online QIIME2 viewer (https://view.qiime2.org).
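Because both file types are zip archives, their contents can be inspected without a QIIME2 installation. The sketch below uses Python's standard zipfile module and assumes the usual QIIME2 archive layout, in which payload files sit under a top-level folder named after the artifact UUID, in a "data" subfolder. The demo archive is built in memory with a placeholder UUID and is not one of the files in this dataset.

```python
import io
import zipfile


def list_artifact_data(source):
    """Return payload file names stored under '<uuid>/data/' in a .qza or .qzv.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    payload = []
    for name in names:
        parts = name.split("/")
        # Assumed layout: <uuid>/data/<file>; skip directory entries.
        if len(parts) >= 3 and parts[1] == "data" and parts[-1]:
            payload.append("/".join(parts[2:]))
    return sorted(payload)


# Build a tiny stand-in archive in memory (placeholder UUID and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("1234-uuid/metadata.yaml", "uuid: 1234-uuid\n")
    demo.writestr("1234-uuid/data/dna-sequences.fasta", ">seq1\nACGT\n")
buffer.seek(0)
demo_payload = list_artifact_data(buffer)
```

With a real artifact, `list_artifact_data("sequences.qza")` would list the files that a standard unzip tool would extract from the data folder.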
Sharing/Access information
Raw metabarcoding sequencing data available on NCBI SRA (BioProject PRJNA948129)
GenBank accession numbers for the Notropis 12S sequences: OQ679414-OQ679415
Sequencing data were processed using the QIIME2 software package with cutadapt, DADA2, and VSEARCH. Processed sequencing data were analyzed using custom scripts in R and the software packages vegan, MUSCLE, ape, and DECIPHER. Plots were generated using ggplot2.
Qiime2
R
Mesquite
