Data from: Identification of plankton habitats in the North Sea
Abstract
The definition of an ecological niche makes it possible to anticipate the responses of a species to changing environmental conditions. Broad tolerance limits and a paucity of readily observable niches in the pelagic zone make it difficult to anticipate responses of the plankton community related to anthropogenic or environmental changes. Plankton distributions are closely linked to climate change and shape the seascape for higher trophic levels, so monitoring plankton distributions and defining ecological niches will help to understand and predict ecosystem responses. Here we apply a machine learning autoencoder and a density‐based clustering algorithm to high‐frequency datasets sampled with a ROTV Triaxus in the North Sea. The results indicate that in this highly dynamic environment, local hydrography prevents niche‐based separation of plankton species at the sub‐mesoscale, despite the availability of different habitats. Plankton patches were associated with naturally occurring frontal systems and anthropogenically induced upwelling‐downwelling dipoles in the vicinity of offshore wind farms (OWFs).
README: Identification of plankton habitats in the North Sea
[Dataset DOI link](https://doi.org/10.5061/dryad.34tmpg4s4)
This dataset includes environmental variables sampled using a ROTV TRIAXUS during various cruises in multiple years. The TRIAXUS is towed behind a research vessel and can move vertically through the water column. In addition to sensors for temperature, oxygen, salinity, and chlorophyll, it was equipped with a Video Plankton Recorder (VPR). The plankton abundances presented here are based on the automatic classification of the VPR images using the model presented in Plonus et al. (2021) [https://doi.org/10.1002/lom3.10413].
The data are provided as .rda files, which can be loaded into the statistical software R:
load('filename.rda')  # assumes 'filename.rda' is in the current working directory of your R session; getwd() prints that directory to the R console
Description of the data and file structure
File: OWF_positions.rda
Description: The file contains the positional data of the wind turbines in two offshore wind farms: BARD Offshore 1 (BARD) and Global Tech I (GTI).
Variables:
WEA.positionsLAT: Latitudinal Coordinate.
WEA.positionsLON: Longitudinal Coordinate.
File: history_AE_190922_4nodes.csv
Description: Documentation of the training process. Generated using AE_euclidean.py.
Variables:
Epoch: One full pass during which all training data are presented to the model once.
RMSE: The Root Mean Squared Error between the original input and the model-generated output during training. Further
information can be found in [https://doi.org/10.3389/fmars.2021.754375].
Val_RMSE: The RMSE during the validation process that follows the training phase in each Epoch.
LR: The current learning rate used to adjust the model weights during this specific Epoch.
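As a rough illustration of how this training log can be used, the sketch below reads rows with the four columns listed above and reports the Epoch with the lowest Val_RMSE. It is written in Python because the training script is a Python file. The comma delimiter, the header row, and the numeric values in the inline sample are assumptions made for illustration; they are not taken from history_AE_190922_4nodes.csv.

```python
import csv
import io

# Hypothetical excerpt; the real file's delimiter and values may differ.
SAMPLE = """Epoch,RMSE,Val_RMSE,LR
1,0.90,0.95,0.001
2,0.60,0.70,0.001
3,0.50,0.72,0.0005
"""


def best_epoch(handle):
    """Return (epoch, val_rmse) for the row with the lowest Val_RMSE."""
    rows = list(csv.DictReader(handle))
    best = min(rows, key=lambda row: float(row["Val_RMSE"]))
    return int(best["Epoch"]), float(best["Val_RMSE"])


epoch, val_rmse = best_epoch(io.StringIO(SAMPLE))
print(epoch, val_rmse)
```

To try this on the real log, pass an opened file handle instead of the inline sample and adjust the delimiter if needed.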
File: trainAE4.rda
Description: Contains the data used to train the Autoencoder. The original data of multiple transects were combined in one file for easier access during training. Physical variables (temperature - chlorophyll) and plankton abundances (appendicularia - marinesnow) were calculated using Ocean Data View with the embedded spatial interpolation software DIVA.
Variables:
Section_Distance_grid_km [km]: The distance along the transect.
Depth_grid_m [m]: The position in the water column.
day_in_seconds [s]: The time of day in seconds since midnight.
temperature [°C]: Ambient water temperature.
oxygen [mol/l]: Ambient oxygen concentration.
salinity [psu]: Ambient salinity.
density_kg_m [kg/m³]: Ambient density calculated using temperature and salinity in ODV.
chlorophyll [rfu]: Ambient chlorophyll concentration.
Lat_N [°N]: Current position of the ship.
Lon_E [°E]: Current position of the ship.
appendicularia [N/l]: Abundance of appendicularia estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
copepoda [N/l]: Abundance of copepoda estimated using the classifications of a machine learning classifier using the
plankton images derived by the VPR mounted on the TRIAXUS.
dinoflagelattes [N/l]: Abundance of dinoflagellates estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
pluteus [N/l]: Abundance of pluteus larvae estimated using the classifications of a machine learning classifier using
the plankton images derived by the VPR mounted on the TRIAXUS.
marinesnow [N/l]: Abundance of marine snow particles estimated using the classifications of a machine learning
classifier using the plankton images derived by the VPR mounted on the TRIAXUS.
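The day_in_seconds column can be turned into a clock time with simple integer arithmetic. The following is a minimal sketch; the function name is hypothetical, and the time zone of the recorded values is not stated in this README, so the result is only a time of day in whatever zone the data use.

```python
def seconds_to_clock(day_in_seconds):
    """Format seconds since midnight as HH:MM:SS."""
    total = int(day_in_seconds) % 86400  # wrap values at or beyond 24 h
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)


print(seconds_to_clock(45296))  # 12:34:56
```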
Folder: checkpoint
Description: This folder was created by a TensorFlow manager and contains the weights of the last three Epochs before the best Val_RMSE was reached. Do not modify any of these files, since they were created automatically (see AE_euclidean.py).
File: sobol_HE446_H06T1.txt (in Folder: sensitivity)
Description: All files in the sensitivity folder have an identical structure and contain the results of a Sobol sensitivity analysis for the transect indicated by the file name. Generated with sensitivity.py.
Variables:
ST: The total Sobol sensitivity index includes the sensitivity of both first order effects as well as the sensitivity
due to interactions (covariance) between a given parameter Qi and all other parameters.
ST_conf: The confidence interval for ST.
S1: The first order Sobol sensitivity index tells us the expected reduction in the variance of the model when we fix
parameter Qi. The sum of the first order Sobol sensitivity indices cannot exceed one.
S1_conf: The confidence interval for S1.
dim: The output dimension of the Autoencoder (d1-d4).
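Because ST contains both the first order effect and the interaction effects, the difference ST - S1 gives a rough measure of how much of a parameter's influence arises only through interactions. The sketch below illustrates this for one output dimension. The parameter names and index values are hypothetical and are not taken from the files in the sensitivity folder.

```python
def interaction_share(s1, st):
    """Return ST - S1 for one parameter, floored at zero.

    Sampling noise can make the estimated difference slightly negative,
    so such values are reported as zero here.
    """
    return max(st - s1, 0.0)


# Hypothetical indices for two parameters and one output dimension.
indices = {
    "temperature": {"S1": 0.50, "ST": 0.75},
    "salinity": {"S1": 0.25, "ST": 0.25},
}

for name, values in indices.items():
    print(name, interaction_share(values["S1"], values["ST"]))

# The first order indices should sum to at most one.
first_order_total = sum(values["S1"] for values in indices.values())
print(first_order_total <= 1.0)
```

A parameter with ST close to S1, like the second one here, acts mostly on its own; a large gap points to interactions with other parameters.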
File: resultsHEshar.rda (in Folder: shar)
Description: Contains the results of the species-habitat-association analysis performed using the R package 'shar'. Generated with shar_analysis.R.
Variables:
cruise: Indicates cruise number (HEXXX), haul number (HXX) and transect (TX) within the haul.
species: Indicates the analysed plankton group.
habitat: The macro-habitat estimated for this specific cell.
breaks: NA.
count: The actual abundance of the respective group in this specific habitat.
lo: The lower threshold below which an avoidance of the habitat by the plankton group is detected.
hi: The upper threshold above which an accumulation of the plankton group within the habitat is detected.
significance: The nature of the habitat-species-associations. Either 'negative' (count < lo), 'positive' (count > hi),
or 'n.s.' (lo < count < hi).
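The rule behind the significance column can be written as a short function. This is a minimal sketch of the comparison described above, not the code of the 'shar' package; the function name is hypothetical, and treating a count exactly equal to a threshold as 'n.s.' is an assumption, since the description only uses strict inequalities.

```python
def classify_association(count, lo, hi):
    """Label a species-habitat association from count and thresholds."""
    if count < lo:
        return "negative"  # fewer individuals than expected: avoidance
    if count > hi:
        return "positive"  # more individuals than expected: accumulation
    return "n.s."  # within the thresholds: not significant


print(classify_association(count=3, lo=5, hi=20))   # negative
print(classify_association(count=25, lo=5, hi=20))  # positive
print(classify_association(count=10, lo=5, hi=20))  # n.s.
```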
File: sharHE.rda (in Folder: shar)
Description: The randomly generated maps for the species-habitat associations analysed using 'shar'. Because generating 100 randomized maps was computationally expensive, the maps were generated once and saved, which saves time and improves the repeatability and reproducibility of the analysis. Generated with shar_analysis.R.
Variables:
file: Indicates cruise number (HEXXX), haul number (HXX) and transect (TX) within the haul.
data: The original habitat map
Columns:
Depth_grid_m [m]: The depth in the water column.
Section_Distance_grid_km [km]: The distance along the transect.
Appendicularia [N/l]: Abundance of appendicularia estimated using the classifications of a machine learning
classifier using the plankton images derived by the VPR mounted on the TRIAXUS.
Copepods [N/l]: Abundance of copepoda estimated using the classifications of a machine learning classifier using the
plankton images derived by the VPR mounted on the TRIAXUS.
Pluteus [N/l]: Abundance of pluteus larvae estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
my_mh: The macro-habitat estimated for this specific grid cell.
random_maps: The 100 randomly generated habitat maps used to perform the shar species-habitat-association analysis.
Created using the 'randomize_raster' function.
File: HE466_H05T1.rda (in Folder: HE466)
Description: This file contains all data accumulated for transect 1 (T1) of haul 5 (H05) from our cruise in June 2016 (HE466). The other files in the same folder contain the data for the respective transects while the other similar folders contain the data for the respective cruises.
Variables:
Section_Distance_grid_km [km]: Distance along the transect.
Depth_grid_m [m]: Position in the water column.
day_in_seconds [s]: The time of day in seconds since midnight.
temperature_degc [°C]: Ambient water temperature.
oxygen_mol_l [mol/l]: Ambient oxygen concentration.
salinity_psu [psu]: Ambient salinity.
density_kg_m [kg/m³]: Ambient density calculated using temperature and salinity in ODV.
chlorophyll_a_rfu [rfu]: Ambient chlorophyll concentration.
Lat_N [°N]: Current position of the ship.
Lon_E [°E]: Current position of the ship.
Appendicularia [N/l]: Abundance of appendicularia estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
Copepods [N/l]: Abundance of copepoda estimated using the classifications of a machine learning classifier using the
plankton images derived by the VPR mounted on the TRIAXUS.
Dinoflagelattes [N/l]: Abundance of dinoflagellates estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
Pluteus [N/l]: Abundance of pluteus larvae estimated using the classifications of a machine learning classifier using
the plankton images derived by the VPR mounted on the TRIAXUS.
Snow [N/l]: Abundance of marine snow particles estimated using the classifications of a machine learning classifier
using the plankton images derived by the VPR mounted on the TRIAXUS.
euclidean: The Euclidean distance calculated between the first row (Appendicularia - Snow) and all following rows to
indicate the similarity of the plankton community between the respective grid cells.
cluster: The cluster assigned to the grid cell after both the Autoencoder and HDBSCAN were applied to the data. -1
indicates that the cell belongs to no specific cluster.
probabilities [0-1]: The probability for this grid cell to belong to the assigned cluster, provided by the HDBSCAN
algorithm.
d1: The first output of the final encoding layer.
d2: The second output of the final encoding layer.
d3: The third output of the final encoding layer.
d4: The fourth output of the final encoding layer.
my_mh: The final (manual) classification into across-transect consistent macro-habitats.
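The euclidean column compares the plankton community of each grid cell with that of the first row. The sketch below shows that calculation for rows holding the five abundance columns (Appendicularia - Snow). The abundance values are hypothetical, and including the first row itself (with distance 0) in the result is an assumption made here so that every row receives a value.

```python
import math


def distances_to_first(rows):
    """Euclidean distance of every row to the first row."""
    reference = rows[0]
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(reference, row)))
        for row in rows
    ]


# Hypothetical abundances in the column order Appendicularia, Copepods,
# Dinoflagelattes, Pluteus, Snow.
rows = [
    [3.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0, 0.0],
]

print(distances_to_first(rows))  # [0.0, 5.0, 0.0]
```

Small distances indicate grid cells whose plankton community resembles that of the first row.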
File: grid_HE466_H05T1.rda (in Folder: HE466)
Description: This is an example file containing only the original data after they were exported as a grid from Ocean Data View (ODV). All variables are identical to the ones explained in the previous section.
Variables:
Section_Distance_grid_km, Depth_grid_m, day_in_seconds, temperature_degc, oxygen_mol_l, salinity_psu, density_kg_m, chlorophyll_a_rfu, Lat_N, Lon_E, Appendicularia, Copepods, Pluteus, Snow, Dinoflagelattes
Note: The content of the other folders is similar to HE466
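File names such as HE466_H05T1.rda encode the cruise, haul, and transect, so they can be split into these parts when looping over folders. The following sketch assumes that all transect files follow the pattern of the two HE466 examples named above, with an optional grid_ prefix; the pattern and the function name are assumptions, not part of the dataset.

```python
import re

# Assumed naming scheme: [grid_]<cruise>_H<haul>T<transect>.rda
PATTERN = re.compile(r"^(?:grid_)?(HE\d+)_H(\d+)T(\d+)\.rda$")


def parse_transect_file(name):
    """Split a transect file name into cruise, haul and transect."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError("unexpected file name: " + name)
    cruise, haul, transect = match.groups()
    return {"cruise": cruise, "haul": int(haul), "transect": int(transect)}


print(parse_transect_file("HE466_H05T1.rda"))
# {'cruise': 'HE466', 'haul': 5, 'transect': 1}
```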
Code/Software
All files necessary to repeat the analysis are provided in scripts.zip [https://doi.org/10.5281/zenodo.13375879]. The folder also includes a README that explains the purpose of each file.
Methods
Physical and biological oceanographic measurements were recorded on different North Sea surveys with the RV Heincke (Knust et al., 2017) using a MacArtney TRIAXUS ROTV, complemented by a Video Plankton Recorder (VPR). The TRIAXUS was towed behind the research vessel in an undulating fashion between the sea surface and bottom.
Data were processed using a machine learning Autoencoder and the density-based clustering algorithm HDBSCAN. Analysis and data handling were performed with the statistical software R 4.4.0 and Python 3.7.
A detailed description can be found in 'Identification of plankton habitats in the North Sea' (the DOI can be found at 'Related works').
Knust, R., Nixdorf, U. and Hirsekorn, M. 2017 ‘Research Vessel HEINCKE operated by the Alfred-Wegener-Institute’, Journal of large-scale research facilities JLSRF, 3, A120. doi: 10.17815/jlsrf-3-164.