Malaise-trap metabarcoding dataset from temperate-zone forest Oregon, USA
Data files
Aug 24, 2023 version files 6.40 GB
Abstract
DNA-based biodiversity surveys involve collecting physical samples from survey sites and assaying the contents in the laboratory to detect species via their diagnostic DNA sequences. DNA-based surveys are increasingly being adopted for biodiversity monitoring and decision-making. The most commonly employed method is metabarcoding, which combines PCR with high-throughput DNA sequencing to amplify and then read 'DNA barcode' sequences. This process generates count data indicating the number of times each DNA barcode was read. However, DNA-based data are noisy and error-prone, with several sources of variation. In this paper, we present a unifying modelling framework for DNA-based survey data, eDNAPlus, for the first time simultaneously allowing for key sources of variation, error and noise in the data-generating process. As we discuss, metabarcoding data alone cannot be used to estimate the species-specific amount of DNA present, or DNA concentration, at surveyed sites. Instead, we estimate changes in DNA biomass within species, across sites, and link those changes to environmental covariates, while accounting for between-species and between-sites correlation. Inference is performed using MCMC, where we employ Gibbs or Metropolis-Hastings updates with Laplace approximations. We further implement a re-parameterisation scheme, appropriate for crossed-effects models, leading to improved mixing, and an adaptive approach for updating latent variables, which reduces computation time. We discuss study design and present theoretical and simulation results to guide decisions on replication at different survey stages and on the use of quality control methods. Finally, we demonstrate the new framework on a dataset of Malaise-trap samples.
Specifically, we quantify the effects of elevation and distance-to-road on each species, infer species correlations, and produce maps identifying areas of high biodiversity and species DNA biomass, which can be used to rank areas by conservation value. We also estimate the level of noise between sites and within sample replicates, and the probabilities of error at the PCR stage, which are found to be close to zero for most species considered, validating the employed laboratory processing.
Methods
Sample information: We collected 121 Malaise-trap samples from 89 sample sites in and around the HJ Andrews Experimental Forest, Oregon, USA. Each sample was processed with the Begum metabarcoding pipeline (described in `Yang, C.Y., Bohmann, K., Wang, X.Y., Wang, C., Wales, N., Ding, Z.L., Gopalakrishnan, S., Yu, D.W. (2021) Biodiversity Soup II: A bulk-sample metabarcoding pipeline emphasizing error reduction. Methods in Ecology and Evolution 12:1252-1264. doi: 10.1111/2041-210X.13602`). In short, each sample was DNA-extracted and then PCR-amplified for a 313 base-pair fragment of the COI DNA-barcode gene (using the Leray-FolDegenRev primer pair, described in Yang et al. 2021). In the Begum pipeline, each sample is independently PCR-amplified three times, and the PCR products are then library-prepped and sequenced on an Illumina sequencer (amplicon sequencing). Finally, we processed the Illumina output files to trim low-quality sequences, merge read pairs, and assign the originating sample name to each read.
Unlike in the standard Begum pipeline, we did not use the 3 separate PCRs per sample to detect and filter out erroneous sequences. Instead, we accepted all reads (i.e. we applied no stringency filtering: a read was accepted if it appeared in ≥1 PCR with ≥1 read). The resulting fasta-format read dataset is the starting dataset for this archive (`data/seqs/folderhja2/hja_filtered_1pcr_1read.fna.gz`).
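The acceptance rule above (≥1 PCR with ≥1 read) can be illustrated with a minimal Python sketch. This is not code from the Begum pipeline or from the archived scripts; the function name `passes_filter`, its parameters, and the example read counts are hypothetical and chosen only to show how the two thresholds interact.

```python
from typing import Mapping


def passes_filter(reads_per_pcr: Mapping[str, int],
                  min_pcrs: int = 1,
                  min_reads: int = 1) -> bool:
    """Return True if a sequence has >= min_reads reads in >= min_pcrs PCRs.

    Hypothetical helper for illustration only; not part of the Begum pipeline.
    """
    # Count the PCR replicates in which the sequence reaches the read threshold.
    supporting_pcrs = sum(1 for n in reads_per_pcr.values() if n >= min_reads)
    return supporting_pcrs >= min_pcrs


# Made-up read counts for one sequence in one sample across its three PCRs.
counts = {"PCR_A": 0, "PCR_B": 1, "PCR_C": 0}

# Thresholds used for this archive: >=1 PCR with >=1 read, so the read is kept.
lenient = passes_filter(counts, min_pcrs=1, min_reads=1)

# A stricter, purely illustrative setting would discard the same sequence.
strict = passes_filter(counts, min_pcrs=2, min_reads=4)
```

With both thresholds set to 1, any sequence observed at least once in any PCR is retained, which is why the starting dataset contains all reads.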
We wrote a custom pipeline to process `hja_filtered_1pcr_1read.fna.gz` and generate 3 separate sample x species tables (known technically as OTU tables, where OTU means Operational Taxonomic Unit), one table per PCR.
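As a rough illustration of what "one sample x species table per PCR" means, the following Python sketch tallies read counts into three tables. The archived pipeline itself is written in shell and R; the record layout `(sample, pcr, otu, n_reads)`, the function names, and the example values here are assumptions made for this sketch only.

```python
from collections import defaultdict


def build_otu_tables(records):
    """Sum read counts into nested dicts: {pcr: {sample: {otu: reads}}}.

    `records` is an iterable of (sample, pcr, otu, n_reads) tuples,
    a hypothetical layout assumed for this sketch.
    """
    tables = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for sample, pcr, otu, n_reads in records:
        tables[pcr][sample][otu] += n_reads
    return tables


def to_rows(table, samples, otus):
    """Return one row per sample: [sample, count for each OTU], zeros if absent."""
    return [
        [sample] + [table.get(sample, {}).get(otu, 0) for otu in otus]
        for sample in samples
    ]


# Made-up records: two samples, two OTUs, three PCRs (A, B, C).
records = [
    ("S1", "A", "OTU1", 10),
    ("S1", "A", "OTU2", 3),
    ("S1", "B", "OTU1", 7),
    ("S2", "A", "OTU2", 5),
    ("S2", "C", "OTU1", 1),
    ("S1", "A", "OTU1", 2),  # additional reads are summed into the same cell
]

samples = ["S1", "S2"]
otus = ["OTU1", "OTU2"]
tables = build_otu_tables(records)

# One sample x OTU table per PCR, all with the same rows and columns.
rows_a = to_rows(tables["A"], samples, otus)
rows_b = to_rows(tables["B"], samples, otus)
rows_c = to_rows(tables["C"], samples, otus)
```

Keeping the same sample rows and OTU columns in all three tables means a zero in one PCR can be compared directly against non-zero counts for the same sample and OTU in the other two.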
The 3 OTU tables are the raw inputs to eDNAPlus (<https://github.com/alexdiana1992/eDNAplus>).
Usage notes
The README file and the comments within the scripts listed below provide step-by-step instructions on how to reconstruct the 3 OTU tables from the raw sequencing file, which is also provided.
The first script (1_reformat_and_cluster_Begum_output_pipeline_20230806.sh) is a Unix bioinformatic pipeline that uses the utility software packages seqkit (https://bioinf.shenwei.me/seqkit) and vsearch (https://github.com/torognes/vsearch). The first script also calls a sub-script (1.1_sum_reads_for_usearch.R) written in R (https://cran.r-project.org), which requires these R packages: tidyverse, seqinr, here, and glue.
The second script (2_parse_uc_file_tidyverse.Rmd) is an R pipeline and uses these R packages: tidyverse, here, glue, tictoc, beepr, and sjmisc.
These scripts have also been archived on Zenodo: https://doi.org/10.5281/zenodo.8220863