Systematic review of marine environmental DNA metabarcoding studies: Toward best practices for data usability and accessibility
Citation
Shea, Meghan et al. (2023), Systematic review of marine environmental DNA metabarcoding studies: Toward best practices for data usability and accessibility, Dryad, Dataset, https://doi.org/10.5061/dryad.95x69p8pd
Abstract
The emerging field of environmental DNA (eDNA) research lacks universal guidelines for ensuring data produced are FAIR (findable, accessible, interoperable, and reusable), despite growing awareness of the importance of such practices. In order to better understand these data usability challenges, we systematically reviewed 60 peer-reviewed articles conducting a specific subset of eDNA research: metabarcoding studies in marine environments. For each article, we characterized approximately 90 features across several categories: general article attributes and topics, methodological choices, types of metadata included, and availability and storage of sequence data. Analyzing these characteristics, we identified several barriers to data accessibility, including a lack of common context and vocabulary across the articles, missing metadata, supplementary information limitations, and a concentration of both sample collection and analysis in the United States. While some of these barriers require significant effort to address, we also found many instances where small choices made by authors and journals could have an outsized influence on the discoverability and reusability of data. Promisingly, articles also showed consistency and creativity in data storage choices as well as a strong trend toward open access publishing. Our analysis underscores the need to think critically about data accessibility and usability as marine eDNA metabarcoding studies, and eDNA projects more broadly, continue to proliferate.
Methods
Literature Selection
Using standard systematic review protocols (Moher et al., 2015), we conducted a literature search of peer-reviewed articles indexed in Web of Science, PubMed, and Scopus (see Figure 1). On all platforms, we used the search string (“environmental DNA” OR eDNA) AND (marine OR ocean* OR seawater OR saltwater OR sea) across titles, abstracts, and keywords to broadly identify articles using eDNA in marine environments, published up to 31 December 2020. This search strategy selected only articles that self-identified as studying “environmental DNA,” rather than articles that used the same or similar methods under other terminology. Because many eDNA articles are published in the journal Environmental DNA, which at the time of searching was not yet indexed in any of the databases above, we additionally searched that journal’s corpus using the same search string and date range and added the returned articles to our sample. After removing duplicates using the systematic review organizational platform Covidence, 1,014 articles remained to be screened.
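As an illustration of the deduplication step, the sketch below merges exported search records and drops duplicates, matching on DOI where available and on a normalized title otherwise. In practice this was handled within Covidence; the CSV filename and field names here are assumptions.

```python
import csv

# Boolean search string applied to titles, abstracts, and keywords
SEARCH_STRING = ('("environmental DNA" OR eDNA) AND '
                 '(marine OR ocean* OR seawater OR saltwater OR sea)')

def deduplicate(records):
    """Drop duplicate records, keying on DOI when present and on a
    normalized title otherwise."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get("doi") or rec.get("title", "").casefold().strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical combined export of the four database searches
with open("search_exports_combined.csv", newline="", encoding="utf-8") as f:
    print(len(deduplicate(list(csv.DictReader(f)))), "articles remain to screen")
```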
We utilized a two-phase screening process, first identifying potentially relevant articles from the title and abstract, and then further investigating the full text of articles passing the initial screen. During both phases, all articles were screened by two members of the research team (MS, JK, MR, or DS); any disagreements over the relevancy of a given article in either phase were resolved with the full screening team (MS, JK, MR, and DS). Selected peer-reviewed articles met the following five criteria: 1) were published in English, 2) primarily reported novel scientific findings (no book chapters, review papers, perspective pieces, or similar), 3) collected eDNA samples directly (no modeling papers), 4) reported eDNA data from at least one water sample (no papers with samples exclusively from sediment, tissue, gut, or similar) from a marine environment (no papers with exclusively freshwater sampling), and 5) utilized a metabarcoding approach to sequence their sample(s). The title/abstract screening yielded 276 potentially relevant articles, which were narrowed to 120 relevant articles by the full-text screening. Sixty of these articles were selected for inclusion in the analysis, as detailed below. A list of all articles screened, and the resulting decision, can be found in this dataset.
Data Collection
Elements to be extracted from relevant articles were developed from criteria used in a review of freshwater eDNA metadata practices (Nicholson et al., 2020) and expanded using criteria from existing eDNA metadata frameworks, including the Global Biodiversity Information Facility (GBIF) recommendations for metabarcoding data (Andersson et al., 2021), as well as the research team’s knowledge of environmental DNA research. Extraction elements were clarified and refined via a pilot using approximately 10% of the relevant articles, with the full data collection team (MS, JK, MR, and DS) independently extracting each pilot article and discussing all differences.
Ultimately, we compiled a list of approximately 90 elements to extract, which fell into several broad categories. The first set of categories provided overarching context. For each article, we first recorded general article characteristics, including basic publication information (authors, year published, journal name, open access status) and information about the methodological scope (target taxa, metabarcoding loci used, type of environment sampled, complementary methods used beyond eDNA metabarcoding). We also extracted information about the geographic scope of the articles, including the institutions of the first and last authors (to represent the dominant location where the full project was conducted) and where samples were collected. For papers that gave geographic coordinates for sampling sites, sampling location was recorded as the centermost coordinate of each distinct geographic area. When specific coordinates were not given, sampling location was estimated from included maps or location information in the text. For each article, we then mapped the relationship between institutions and sampling location(s), using whichever associated institution (first- or last-author) had the smallest average distance to the article’s sampling sites, as sketched below. Because crucial metadata and data storage information were often contained in the supplementary information of the articles we analyzed, we also cataloged the number and type of supplementary information files associated with each article.
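A minimal sketch of this institution-to-sites assignment, assuming coordinates are available as (latitude, longitude) pairs; great-circle distances are computed with the haversine formula, and the function names are our own, not from the study:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def closest_institution(institutions, sites):
    """Return the label ("first" or "last") of the author institution with
    the smallest mean distance to the article's sampling sites.

    institutions: {"first": (lat, lon), "last": (lat, lon)}
    sites:        list of (lat, lon) sampling coordinates
    """
    def mean_dist(coord):
        return sum(haversine_km(*coord, *site) for site in sites) / len(sites)

    return min(institutions, key=lambda label: mean_dist(institutions[label]))
```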
We extracted several elements related to data storage. We recorded whether articles stated that they had published their underlying sequence data (that is, uploaded a FASTQ or similar file as supplementary information or to an external repository) and where in the article that statement was made (in a data availability statement, in the methods section, or elsewhere). If articles published underlying sequence data, we then extracted information about the platform on which the data were published, and followed the link or accession number given to record whether the sequence data were indeed accessible at that location, the file format of the data, and whether the platform provided an easily accessible citation for the dataset. In cases where the link or accession number did not lead to a valid dataset, we emailed the corresponding author(s) and asked whether they were aware that the data were not available as stated, and why that might have occurred.
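We followed links and accession numbers by hand, but a first-pass automated check along these lines could flag obviously dead locations. This is a sketch only: a live page still has to be inspected manually to confirm the sequence files themselves are present and downloadable.

```python
import requests

def link_resolves(url: str, timeout: int = 30) -> bool:
    """Return True if the stated data location resolves at all; a successful
    response does not guarantee the dataset itself is actually there."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.ok
    except requests.RequestException:
        return False
```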
Beyond understanding how articles stored and cataloged their sequence data, we also wanted to understand how articles captured and recorded metadata related to the project, a category we termed metadata inclusion. Across all articles in the sample, we recorded whether papers included 60 different types of metadata across 13 categories (see Figure 6). Importantly, we assessed only the presence or absence of the information, making no value claim about the validity of the information included (cf. Dickie et al., 2018). For example, we recorded whether articles included any information about filter size and type, not whether articles used particular filter sizes and types. We then averaged the percent inclusion across the elements within each metadata category; these averages help show general trends across the different categories, but we do not intend to suggest that all of these metadata elements are equally important. Because of our interest in data accessibility, we also recorded additional information about two of our metadata elements, statistics and bioinformatics analysis scripts, including where the scripts were published.
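The category averaging amounts to a two-step mean: the percentage of articles including each element, then the mean of those percentages within each category. A sketch, assuming a boolean presence/absence table (one row per article, one column per metadata element) and an element-to-category mapping; both layouts are assumptions, not the study’s actual data sheets.

```python
import pandas as pd

def category_inclusion(presence: pd.DataFrame, categories: dict) -> pd.Series:
    """Average percent inclusion across the elements within each category.

    presence:   boolean DataFrame, rows = articles, columns = metadata elements
    categories: mapping of element name -> category name
    """
    pct_by_element = presence.mean() * 100            # % of articles including each element
    return pct_by_element.groupby(categories).mean()  # mean of those % per category
```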
Additionally, because there have already been some efforts to provide metadata guidance for eDNA metabarcoding studies, we wanted to further use our metadata inclusion data to assess how easily existing studies would be able to comply with new standards: that is, whether studies already include the information these standards recommend, or whether adopting them would likely prove challenging. We selected one illustrative standard to study: the GBIF guide for publishing DNA-derived data (Andersson et al., 2021). GBIF is a global repository of biodiversity observations originally designed for traditional biodiversity sampling records: where an organism has been collected and observed, and then taxonomically identified, either visually or morphologically (Andersson et al., 2021). In contrast, occurrences derived from eDNA sampling involve many additional steps between the collection of material and a final list of species, steps that all necessitate additional metadata in order for the final occurrence to be sufficiently contextualized (Andersson et al., 2021). Recognizing that DNA-derived occurrences need specialized standards, GBIF released a set of additional recommended fields for submitting DNA-derived data, including separate guidance for metabarcoding and for ddPCR/qPCR (Andersson et al., 2021). While journals and funders sometimes require that eDNA sequence data be submitted to sequence read archives, there are rarely mandates that the associated biodiversity occurrences be submitted to repositories like GBIF. The GBIF recommendations for metabarcoding data therefore represent a case where the gap between what studies already include and what the recommendations require really matters: if eDNA projects do not already have the metadata on hand to upload their data to GBIF, it seems unlikely that they ever will. To assess this potential discrepancy, we included 13 metadata elements that corresponded with GBIF recommendations in order to see how well existing studies would be able to comply with these proposed standards; these criteria are asterisked in Figure 6. These 13 elements were only a subset of the full list of GBIF recommendations; we excluded all fields that would have been identical across all studies in our sample (such as environmental medium) or that were broader than other metadata criteria we investigated (such as sampling protocol).
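One way to summarize this readiness is to count, per article, how many of the 13 GBIF-corresponding elements appear in the same presence/absence table used above. The element names below are placeholders, not the actual asterisked criteria from Figure 6.

```python
import pandas as pd

# Placeholder names; the actual 13 GBIF-corresponding elements are the
# asterisked criteria in Figure 6.
GBIF_ELEMENTS = ["target_gene", "primer_sequences", "pcr_conditions"]

def gbif_readiness(presence: pd.DataFrame) -> pd.Series:
    """For each article (row), count how many GBIF-recommended metadata
    elements it already reports."""
    return presence[GBIF_ELEMENTS].sum(axis=1)
```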
Finally, beyond studying the different types of metadata included in the articles, we were also interested in what studies themselves referred to as metadata, which we termed metadata language: that is, what types of information authors designated as “metadata.” We anticipated that different articles might use the word “metadata” to refer to very different kinds of information; for example, one article might call a supplementary table of temperature and salinity measurements “metadata,” whereas another might use the term to describe all of the information needed to construct reference libraries.
Due to the comprehensive nature of these elements, we opted to fully extract half of our sample of 120 relevant articles. These were selected via a stratified random sample by publication year, so that the included articles would be representative of any changes in metadata or data storage practices over time. Each article was extracted independently by two of four researchers (MS, JK, MR, DS); a third researcher (MS, JK, or MR) compared these extractions and resolved any differences across all elements. A sample article extraction data sheet can be found in this dataset. While the particular configuration of researchers extracting and resolving each article varied to reduce bias, one researcher (MS) either extracted or resolved every article in the sample to ensure consistency. Articles were analyzed using basic descriptive statistics. A combined spreadsheet of all extracted information by article can be found in this dataset.
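For reference, a stratified half-sample by publication year can be drawn directly with pandas. This sketch assumes a table of the 120 relevant articles with a "year" column; the column name and seed are assumptions, and it illustrates the sampling design rather than reproducing the study’s exact draw.

```python
import pandas as pd

def stratified_half(articles: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Draw half of the articles within each publication year, so the
    extracted subsample mirrors the year distribution of the full set."""
    return articles.groupby("year").sample(frac=0.5, random_state=seed)
```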
Funding
Stanford University