Data from: Integrating environmental DNA metabarcoding and remote sensing reveals known and novel fish diversity hotspots in a World Heritage Area

Bizzozzero, Manuela R.1 ; Marfurt, Svenja M.1 ; Altermatt, Florian1 2; Willems, Erik P.1; Damm-Reiser, Alexander1; Allen, Simon J.1 3 4; Walser, Jean-Claude5; Krützen, Michael1

Research facility: Swiss National Science Foundation

Published Nov 12, 2025 on Dryad. https://doi.org/10.5061/dryad.kh18932jk

Data files

Nov 12, 2025 version files 13.92 GB

- eDNA_unveils_fine-scale_fish_biodiversity_datasets.zip (13.92 GB)
- README.md (2.78 KB)
Abstract

Aim
Shark Bay, a UNESCO World Heritage site in Western Australia, is highly vulnerable to climate change, yet its fish biodiversity remains poorly understood at fine spatial scales. We integrated environmental DNA (eDNA) metabarcoding with high-resolution remote sensing to assess and extrapolate fish diversity patterns, providing a scalable framework for biodiversity monitoring in dynamic coastal ecosystems.

Location

Shark Bay, Western Australia.

Methods

We analysed 270 water samples across 560 km² using fish-specific 16S and 12S rRNA metabarcoding, linking biodiversity patterns to key environmental variables—including depth, salinity, sea surface temperature, and habitat characteristics—derived from high-resolution satellite imagery. To predict fish biodiversity across unsampled areas, we employed machine-learning models, enabling spatial extrapolation of eDNA data across the seascape.

Results

eDNA metabarcoding identified 107 fish species across 132 genera and 71 families, with substantial overlap with conventional monitoring but broader coverage at higher taxonomic levels. Fish richness increased with decreasing salinity, high channel habitat coverage, and moderate depths with high seagrass coverage. We delineated five distinct fish communities (A–E): two shallow seagrass communities, one in sparse seagrass (A) and another in dense seagrass (B); one in channel habitats (C) with the greatest fish diversity; one in deep sandy waters (D); and one in medium-depth, seagrass-free areas (E). Additionally, we detected several tropical species, suggesting poleward shifts due to rising water temperatures.

Main conclusions

This study highlights the utility of combining marine eDNA metabarcoding with remote sensing to detect fine-scale biodiversity. The integration of machine learning enables spatial upscaling and timely responses to habitat changes, enhancing marine conservation and management. By identifying key environmental drivers of fish diversity, this approach supports proactive conservation strategies, providing a scalable model for biodiversity monitoring under climate change.

This Dryad repository contains all eDNA raw data, filtering steps, and metadata used in the study associated with the manuscript "Integrating Environmental DNA Metabarcoding and Remote Sensing Reveals Known and Novel Fish Diversity Hotspots in a World Heritage Area" (DDI-2025-0112).
The data consist of 273 eDNA samples and 17 negative controls.
All samples were sequenced in two sequencing runs: p751_run_250522 for the MiFish12S metabarcode and p751_run_220617 for the Fish16S metabarcode.

Datasets:

All datasets are contained in the eDNA_unveils_fine-scale_fish_biodiversity_datasets.zip file

Sample_Overview.csv: an overview of all samples and their metadata used in the study:
- Sample.NR: unique Sample ID
- Location: ID of location, samples taken from the same location have the same ID
- Extraction Date: Date of DNA extraction
- Sampling Date: Date of eDNA sample collection
- Study Site: Which gulf (western or eastern) of Shark Bay the sample was collected in. WG = Western Gulf, EG = Eastern Gulf
- longitude/latitude: GPS location of sample taken
- Sample.Depth: Depth at which the sample was taken [m]
- type: S for Sample, FNC for Field negative Control, ENC for Extraction negative Control
- Bathymetry [m], Channel_Perc. [% within 500 m cell], Complexity [Bathymetry SD within 500 m cell], Distance_to_Shore [m], Sand_Perc.[% within 500 m cell], Sand_Silt_Perc.[% within 500 m cell], Seagrass_Perc.[% within 500 m cell], Slope [Degree], Salinity [psu], Sea surface temperature daily difference (SST_Daily_Diff)[°C], Sea surface temperature (SST) [°C], Turf_Algae_Perc.[% within 500 m cell]: Environmental variables extracted from remote sensing data at the sampling location
- in_rich: if yes, the sample was included in the richness analysis in our study
- in_comp: if yes, the sample was included in the composition analysis in our study
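
The columns above can be used to subset the overview table before any analysis. The following is a minimal sketch using only the Python standard library. It assumes that the flag columns hold the literal value "yes" for included samples and that true samples carry the type "S"; the inline rows and sample IDs are made up for illustration and do not come from the dataset.

```python
import csv
import io


def select_samples(rows, flag="in_rich"):
    """Return rows that are true samples (type 'S') and flagged 'yes' for one analysis.

    Assumption: flags are stored as the text 'yes' (case-insensitive).
    """
    return [
        row for row in rows
        if row["type"].strip() == "S" and row[flag].strip().lower() == "yes"
    ]


# Tiny in-memory stand-in for Sample_Overview.csv (all values are invented).
demo_csv = io.StringIO(
    "Sample.NR,Location,type,in_rich,in_comp\n"
    "X001,L01,S,yes,yes\n"
    "X002,L01,S,yes,no\n"
    "X003,L02,FNC,no,no\n"
)
rows = list(csv.DictReader(demo_csv))

richness_rows = select_samples(rows, "in_rich")
composition_rows = select_samples(rows, "in_comp")
print(len(richness_rows), len(composition_rows))
```

To run this against the real file, replace `demo_csv` with an opened handle to Sample_Overview.csv and check the actual spelling of the flag values first.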

The rest of the data is organised by metabarcode sequencing runs. There is one folder for each run:

- Fish16S
- MiFish12S

In each folder there are:
- Mapfile: containing the index, primer and run information for each sample
- xx__RawData.zip: contains an 'a_data' folder with the raw reads for all samples (R1 denotes forward reads, R2 reverse reads) and a 'y_help' folder containing md5sums for each sample.
- WorkflowSummaryLog: Log file of the data filtering, ZOTU clustering and taxonomic assignment steps
- xx_ZOTU_tax: Taxonomic assignments for each ZOTU with assignment confidence in brackets
- xx_ZOTU_Count_TH90: ZOTU count table and taxonomic assignment as used in the study
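
Because the raw-data archives ship with md5sums, downloaded reads can be checked for corruption before processing. The sketch below assumes the checksum files follow the common `md5sum` layout (hash, whitespace, file name per line); the actual layout inside 'y_help' has not been verified here, and the demo file name is hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_lines(text):
    """Parse 'hash  filename' lines (assumed md5sum layout) into {filename: hash}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary mode with a leading '*', which is dropped here.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


def verify(directory, expected):
    """Return {filename: True/False} for every file listed in `expected`."""
    return {
        name: md5_of_file(os.path.join(directory, name)) == checksum
        for name, checksum in expected.items()
    }


# Demo on a throwaway file (hypothetical name, not from the dataset).
with tempfile.TemporaryDirectory() as tmp:
    name = "demo_R1.fastq"
    with open(os.path.join(tmp, name), "wb") as handle:
        handle.write(b"@read1\nACGT\n+\nIIII\n")
    listing = f"{md5_of_file(os.path.join(tmp, name))}  {name}\n"
    results = verify(tmp, parse_md5_lines(listing))
print(results)
```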

1 Environmental DNA

1.1 Sampling Design

Our sampling areas (combined ca. 557 km²) comprised two long-term dolphin research sites within the eastern (ca. 230 km²) and western (ca. 327 km²) gulfs of Shark Bay, Western Australia. To support future research on the feeding ecology of Shark Bay’s iconic bottlenose dolphins (Connor and Krützen, 2015), we focused our biodiversity assessment on fish taxa. As dolphin behavioural data is typically collected during austral winter, we aimed to capture a representative snapshot of fish biodiversity during this season.

To maximise the biological signal while minimising sampling effort, we employed a stratified random sampling design, thus enhancing sample representativeness and efficient capture of underlying biological patterns (Altermatt et al., 2023; Carvalho et al., 2016). Sampling units were derived from the 2016 “Shark Bay Marine Habitat Classification”, a byproduct of the 2016 seagrass extent from Strydom et al. (2020), published as a map in Sutton and Shaw (2020).

To account for the diffuse nature of eDNA samples, we divided both gulf study sites into 500 x 500 m grid cells (hereafter sampling grid), ensuring a minimum sampling distance of 500 m. We considered this distance adequate as other eDNA studies in nearshore marine environments report effective sampling ranges from less than 100 m (O’Donnell et al., 2017; Port et al., 2016) to 800 m (Yamamoto et al., 2017).
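
The grid and the stratified random draw described above can be illustrated with a short sketch. It assumes coordinates are already projected to metres (not longitude/latitude), uses an arbitrary grid origin, and treats the habitat class of each cell as its stratum; the function names, the toy strata, and the number of cells drawn per stratum are all hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

CELL_SIZE_M = 500.0  # grid cell edge length stated in the text


def cell_index(x_m, y_m, origin=(0.0, 0.0)):
    """Map projected coordinates (metres) to a (column, row) grid cell."""
    col = int((x_m - origin[0]) // CELL_SIZE_M)
    row = int((y_m - origin[1]) // CELL_SIZE_M)
    return col, row


def cell_centre(col, row, origin=(0.0, 0.0)):
    """Return the geographic centre of a grid cell in projected metres."""
    return (
        origin[0] + (col + 0.5) * CELL_SIZE_M,
        origin[1] + (row + 0.5) * CELL_SIZE_M,
    )


def stratified_draw(cell_strata, n_per_stratum, seed=0):
    """Randomly pick up to n cells from each stratum (reproducible via seed)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for cell, stratum in sorted(cell_strata.items()):
        by_stratum[stratum].append(cell)
    return {
        stratum: rng.sample(cells, min(n_per_stratum, len(cells)))
        for stratum, cells in by_stratum.items()
    }


# Toy 4 x 3 grid with two invented strata.
cells = {
    (c, r): ("seagrass" if c < 2 else "sand")
    for c in range(4)
    for r in range(3)
}
picked = stratified_draw(cells, n_per_stratum=2)
print(cell_index(1260.0, 740.0), cell_centre(2, 1))
```

Since each cell is sampled at its centre, any two sampled cells are at least 500 m apart, which matches the minimum sampling distance given above.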

1.2 Sampling and extraction

All eDNA samples were collected between August 30 and September 27, 2021, by filtering seawater through 0.45 µm CN (Cellulose-Nitrate) filters using a peristaltic pump (GeoPump™, Geotech Environmental Equipment, Inc., Denver, Colorado) on site. In total, we sampled 45 locations and collected 274 samples, including four field negative controls. Samples were collected at the geographical centre of the selected grid cells at mid-water depth.

At each location, we collected six samples of 3 L each, filtering a total of 18 L of sea water per location. We immediately stored the filter papers in Longmire’s solution (Longmire et al., 1997) at room temperature until eDNA extraction, following the procedure described by Bizzozzero et al. (2024).

1.3 PCR, library preparation, and sequencing

To cover a broad range of fish species, we amplified the samples targeting two fish-specific metabarcodes in different genomic regions as recommended by Kumar et al. (2022): a 16S rRNA gene fragment, hereafter Fish16S, and a 12S rRNA gene fragment, hereafter MiFish12S. For each metabarcode, we generated and sequenced separate libraries following a published protocol (Bizzozzero et al., 2024). We checked for possible contaminants by including several negative controls and two positive controls: a mock community (MC) and a positive index control (PC_Index).

1.4 Data processing and taxonomic assignments

To facilitate data processing, the UNOISE3 workflow, as part of the USEARCH framework (v11.0.667_i86linux64), was applied (Edgar, 2016). After removing PhiX-related reads and those with low complexity, the paired-end reads were merged. To improve the merging process, low-quality read ends were trimmed. The primer sites were then removed, and the amplicon reads were filtered based on standard quality criteria (e.g., minimum mean quality, length range, and GC-content range). The cleaned amplicon reads were processed into operational taxonomic units (OTUs) using the zero-radius clustering approach (ZOTUs). Finally, the cleaned amplicon reads were mapped to the ZOTUs to generate count tables (detailed workflow and thresholds provided in the summary script of this repository).
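
The quality criteria named above (minimum mean quality, length range, GC-content range) can be illustrated with a small stand-alone filter. This is a sketch only and not the USEARCH implementation; the threshold values are hypothetical placeholders, Phred+33 quality encoding is assumed, and the thresholds actually applied are those recorded in the WorkflowSummaryLog of this repository.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    scores = [ord(ch) - offset for ch in qual]
    return sum(scores) / len(scores)


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(seq, qual,
                   min_mean_q=20.0,          # hypothetical threshold
                   length_range=(100, 300),  # hypothetical threshold
                   gc_range=(0.3, 0.7)):     # hypothetical threshold
    """Return True if a merged amplicon read meets all three criteria."""
    if not seq or len(seq) != len(qual):
        return False
    if not length_range[0] <= len(seq) <= length_range[1]:
        return False
    if mean_quality(qual) < min_mean_q:
        return False
    return gc_range[0] <= gc_fraction(seq) <= gc_range[1]


# Invented reads: one good, one with low quality, one too short.
good = ("ACGT" * 30, "I" * 120)
low_quality = ("ACGT" * 30, "#" * 120)
too_short = ("ACGT" * 5, "I" * 20)

flags = [passes_filters(*read) for read in (good, low_quality, too_short)]
print(flags)
```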

Taxonomic classification was performed using SINTAX, a k-mer-based method (Edgar, 2016). ZOTUs from the MiFish12S dataset were annotated with the MIDORI2 srRNA database (GB248), whereas the Fish16S dataset was enriched through annotations from multiple sources, including MIDORI2 (GB259), MitoFish (v397), and NCBI RefSeq (Fish-16S-v240202).
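
SINTAX reports a confidence value per taxonomic rank, which allows an assignment to be cut back to the deepest rank that is still well supported. The sketch below parses a string in the usual SINTAX style (`rank:name(confidence)` separated by commas) and truncates it at a cutoff of 0.90. Both the string layout in the xx_ZOTU_tax files and the reading of "TH90" as a 0.90 cutoff are assumptions, and the example assignment is invented.

```python
import re

# One rank entry, e.g. "g:Pagrus(0.97)".
RANK_PATTERN = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def parse_assignment(text):
    """Split a SINTAX-style string into (rank, name, confidence) tuples."""
    return [
        (rank, name, float(conf))
        for rank, name, conf in RANK_PATTERN.findall(text)
    ]


def truncate_at(ranks, threshold=0.90):
    """Keep ranks from the top down until confidence drops below threshold."""
    kept = []
    for rank, name, conf in ranks:
        if conf < threshold:
            break
        kept.append((rank, name))
    return kept


# Invented example: species-level support is too weak, so genus is retained.
example = "f:Sparidae(1.00),g:Pagrus(0.97),s:Pagrus_auratus(0.62)"
parsed = parse_assignment(example)
supported = truncate_at(parsed)
print(supported)
```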

We processed and analysed our data in RStudio V2022.07.2 (RStudio Team, 2022), using R 4.3.0 (R Core Team, 2023). To improve data quality, we used read counts from positive and negative controls to remove non-target taxa and external contaminants. Negative controls helped identify and exclude contamination. ZOTUs were filtered using a false assignment threshold (MiFish12S: 0.155%, Fish16S: 0.048%) based on PC_Index reads to correct for sequencing errors (Galan et al., 2018). Finally, samples with dysfunctional PCRs (MiFish12S and Fish16S: M2046, M2117; MiFish12S only: M1027) were visually identified and removed following Taberlet et al. (2018).
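
One way to apply a false assignment threshold to a count table is sketched below. It assumes the rate is multiplied by the total reads of each ZOTU across the run and that per-sample counts below this cutoff are set to zero, which is one reading of the approach in Galan et al. (2018); whether the study applied the rates exactly this way is not stated here. The analysis itself was done in R, so this Python version is purely illustrative, and the counts are invented.

```python
def filter_false_assignments(counts, rate):
    """Zero out per-sample counts below rate * total reads of the ZOTU.

    counts: {zotu: {sample: reads}}
    rate:   false assignment rate as a fraction (0.155% -> 0.00155)
    """
    filtered = {}
    for zotu, per_sample in counts.items():
        cutoff = rate * sum(per_sample.values())
        filtered[zotu] = {
            sample: (reads if reads >= cutoff else 0)
            for sample, reads in per_sample.items()
        }
    return filtered


# Invented counts: 10 reads in S3 may reflect index misassignment.
counts = {"Zotu1": {"S1": 9000, "S2": 990, "S3": 10}}

mifish = filter_false_assignments(counts, rate=0.00155)   # cutoff 15.5 reads
fish16s = filter_false_assignments(counts, rate=0.00048)  # cutoff 4.8 reads
print(mifish["Zotu1"], fish16s["Zotu1"])
```

With the same counts, the stricter MiFish12S rate removes the 10 reads in S3 while the Fish16S rate keeps them, which shows how strongly the result depends on the run-specific rate.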

We evaluated correctness of taxonomic assignments by checking whether the identified taxa were documented in the tropical Indo-West Pacific marine bioregion (Briggs and Bowen, 2012). This was based on data from the Global Biodiversity Information Facility (GBIF, 2001; accessed: 24.04.2024), the Australian Faunal Directory (ABRS, 2020; accessed: 24.04.2024) and FishBase (FishBase, 2021; accessed 24.04.2024). If a taxon was not recorded in the region, we reassigned it to a lower taxonomic level that more plausibly occurs in Shark Bay.

2 Environmental data acquisition and processing

We extracted marine habitat types, i.e., channel, sand, sand/silt, seagrass, and turf algae, from the 2016 “Shark Bay Marine Habitat Classification”, a byproduct of the 2016 seagrass extent from Strydom et al. (2020), published as a map in Sutton and Shaw (2020), which also informed our sampling design. Given the potential variability in the extent of seagrass in Shark Bay across different years (Strydom et al., 2020), we adjusted the seagrass extent in the habitat map using 2021 Sentinel-2 (level 2A) satellite imagery (Copernicus Marine Service Information, 2023a), applying a random forest algorithm in the Google Earth Engine (Gorelick et al., 2017) with a custom JavaScript script. We acquired the highest available resolution (10–1000 m) of satellite-derived data describing SST, chlorophyll-a (Chlo-a), and total suspended matter (TSM) for our region of interest. Furthermore, we derived bathymetric variables (depth, slope, complexity) from publicly available bathymetries (Beaman, 2023; Lebrec et al., 2021). Environmental variables were aggregated to the 500 × 500 m eDNA sampling grid. Of the continuous variables with higher resolution, bathymetry was processed using ‘bilinear’ reprojection (‘raster’ package; Hijmans, 2023), while median values were calculated for Chlo-a and TSM. Percentage coverage of categorical habitat types (10 × 10 m grid) was calculated within the sampling grid. For further details on data acquisition and processing, refer to the Supplementary Information of the associated manuscript DDI-2025-0112 (in Diversity and Distributions).
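
The aggregation to the 500 × 500 m grid can be illustrated for a single cell. The sketch below computes percentage cover from 10 × 10 m habitat pixels (2,500 pixels per cell), a median for a continuous variable such as TSM, and complexity as the standard deviation of bathymetry within the cell. The use of the population standard deviation, the function names, and all input values are assumptions for illustration; the study itself used R and the 'raster' package.

```python
from statistics import median, pstdev


def percent_cover(pixel_classes, target):
    """Percentage of pixels in a cell that belong to one habitat class."""
    hits = sum(1 for c in pixel_classes if c == target)
    return 100.0 * hits / len(pixel_classes)


def summarise_cell(pixel_classes, depths_m, tsm_values):
    """Aggregate fine-resolution values to one 500 x 500 m grid cell."""
    classes = sorted(set(pixel_classes))
    return {
        "cover": {c: percent_cover(pixel_classes, c) for c in classes},
        # Complexity: SD of bathymetry within the cell (population SD assumed).
        "complexity": pstdev(depths_m),
        "tsm_median": median(tsm_values),
    }


# Invented cell: 50 x 50 habitat pixels of 10 m, plus a few coarser values.
pixels = ["seagrass"] * 1500 + ["sand"] * 1000
summary = summarise_cell(
    pixels,
    depths_m=[4.0, 6.0, 8.0, 6.0],
    tsm_values=[1.0, 3.0, 2.0],
)
print(summary["cover"], summary["tsm_median"])
```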