Data from: Biodiversity forecasting in natural plankton communities reveals temperature and biotic interactions as key predictors

Merz, Ewa 1 2 ; Pomati, Francesco1; Saavedra, Serguei3; Gilarranz, Luis J.1

Published Jun 11, 2025 on Dryad. https://doi.org/10.5061/dryad.0zpc86775

Data files

Jun 11, 2025 version files 1.44 GB

code_data.zip

1.44 GB
README.md

36.68 KB

Abstract

As natural ecosystems face unprecedented human-made degradation, it is urgent to provide quantitative forecasts of changes in biodiversity and identify relevant biotic and abiotic predictors. Forecasting natural ecosystems has proven challenging due to their complexity, chaotic nonlinear nature, and lack of adequate data. In this study, we investigate the predictability of lake plankton biodiversity using four years of daily data of environmental predictors and community metrics derived from state-dependent models. Our findings show that presence-absence-based biodiversity metrics are more predictable than abundance-based metrics. For short-term forecasts, the most significant predictor of species richness is prior richness, while the key predictors for Jaccard dissimilarity are prior richness and prior Jaccard dissimilarity. In long-term forecasts of both metrics, water temperature emerges as the primary predictor, with community connectance (number of interactions) also contributing to improved predictions. We found that richness, connectance, and water temperature can interact in nonlinear and synergistic ways depending on the forecast horizon, enhancing each other's effects on richness and Jaccard dissimilarity. These results underscore the challenges of forecasting biodiversity in natural ecosystems and highlight the importance of monitoring key community metrics and abiotic predictors to anticipate long-term changes.

https://doi.org/10.5061/dryad.0zpc86775

Description of the data and file structure

We obtained high-frequency data on physical and chemical water parameters, plankton abundance, nutrients, and weather from an autonomous lake monitoring platform in the middle of Lake Greifen, a small peri-alpine eutrophic lake in central Switzerland. All missing data represented as NA.

Files and variables

File: code_data.zip

Description:

data

This folder contains the raw and formatted data to reproduce the analyses.

analysis_variables.csv: Contains the variables used for the analysis (i.e., biodiversity change variables, biotic and abiotic predictors. NA values in the data indicate that no values are available for that given time point.

date: The date when the measurement was taken in the format year-month-day.
richness: The number of taxa at a given time point.
diversity: Shannon diversity at a given time point.
bray_turnover: Turnover in community composition calculated using the Bray-Curtis method
trace: Trace of the Jacobian interaction matrix, indicating a community to sensitivity to perturbations. The value has been scaled between 0 and 1, where 1 means high sensitivity to perturbations.
connectance: The number of links in percentage at a given time point when compared to the total number of possible links.
temperature: Water temperature in °C measured with a CTD probe from the lake's surface to the bottom (1-17.5 meters) and then averaged over the whole water column.
mixed_layer: The depth of the mixed layer in meters. The mixed layer depht was calculated based on the temperature profile by the CTD probe using the R package rLakeAnalyzer
depth_at 5PAR: Light penetration in meters, measures the light available in the water column. We calculated light penetration by defining the depth at which PAR (photosynthetically active radiation) measured by a CTD probe reaches 5, which is the lowest detectable value.
max_windspeed: Maximum wind speed reached at a given time point in meters per second (m/s).
nitrate: Nitrate concentration at a given time point in miligrams per liter (mg/L)
ammonium: Ammonium concentration at a given time point in micrograms per liter (ug/L.
phosphate: Phosphate concentration at a given time point in ug/L.

formatted: Contains the formatted plankton taxa abundance data generated by the script: 1_calibrate_abundances_exclude_rare_taxa.R.

DSPC_abundance_2019_2022_excluded.csv: In this file, we excluded taxa that did not occur in more than 50% of the time points.

date: The date when the measurement was taken in the format year-month-day.
taxon: The name of the taxon, classified to the lowest possible level.
abundance: Abundance for each taxon at a given time point in ROI (raw objects of interest) per second. NAs indicate that the instrument was not working correctly and no data could be collected.
magnification: The magnification of the underwater microscope used to derive this taxon's abundance, whereas 0p5x is the lower magnification, mainly used for zooplankton and large phytoplankton colonies, whereas 5p0x is the higher magnification, mainly used for small phytoplankton.
excluded: States if this taxon should be excluded based on a 50% occurrence threshold. All values in this data set should be set to "no".

DSPC_abundance_2019_2022_threshold_calibrated_scaled.csv: We scaled all plankton abundances to z0 mean and a standard deviation of 1 using the data in DSPC_abundance_2019_2022_threshold_calibrated.csv.

date: The date when the measurement was taken in the format year-month-day.
aphanizomenon to vorticella: Plankton taxa abundances in counts/mL. The taxonomic classification and other attributes of each taxon are described in the Supplementary Material of the manuscript or data/raw/taxa_information.csv.

DSPC_abundance_2019_2022_threshold_calibrated.csv: We converted all plankton abundances in the data set DSPC_abundance_2019_2022_threshold.csv from ROI/s (raw objects of interest per second) to counts/mL (counts per milliliter) using a calibration function described in Merz et al. 2021. This data set was used to calculate the interaction coefficients.

date: The date when the measurement was taken in the format year-month-day.
aphanizomenon to vorticella: Plankton taxa abundances in counts/mL. The taxonomic classification and other attributes of each taxon are described in the Supplementary Material of the manuscript or data/raw/taxa_information.csv.

DSPC_abundance_2019_2022_threshold.csv: In this file, we set abundances to zero that did not meet a set threshold (2 images per day) using the data in DSPC_abundance_2019_2022_excluded.csv.

date: The date when the measurement was taken in the format year-month-day.
taxon: The name of the taxon, classified to the lowest possible level.
abundance: Abundance for each taxon at a given time point in ROI (raw objects of interest) per second. NAs indicate that the instrument was not working correctly, and no data could be collected.
magnification: The magnification of the underwater microscope used to derive this taxon's abundance, whereas 0p5x is the lower magnification, used primarily for zooplankton and large phytoplankton colonies, whereas 5p0x is the higher magnification, used primarily for small phytoplankton.
excluded: States if this taxon should be excluded based on a 50% occurrence threshold. All values in this data set should be set to "no".

raw: Contains the raw data used for the study (i.e., CTD, meteo, nutrients, and plankton abundance) and information on the plankton taxa.

CTD_meteo_nutrients_daily_2019_2022.csv: Contains all abiotic parameters used in the study. NA values in the data indicate that no values are available for that given time point.

date: The date when the measurement was taken in the format year-month-day.
temperature: Water temperature in °C measured with a CTD probe from the lake's surface to the bottom (1-17.5 meters) and then averaged over the whole water column.
mixed_layer: The depth of the mixed layer in meters. The mixed layer depth was calculated based on the temperature profile by the CTD probe using the R package rLakeAnalyzer
depth_at 5PAR: Light penetration in meters, measures the light available in the water column. We calculated light penetration by defining the depth at which PAR (photosynthetically active radiation) measured by a CTD probe reaches 5, which is the lowest detectable value.
max_windspeed: Maximum wind speed reached at a given time point in meters per second (m/s).
nitrate: Nitrate concentration at a given time point in miligrams per liter (mg/L)
ammonium: Ammonium concentration at a given time point in micrograms per liter (ug/L.
phosphate: Phosphate concentration at a given time point in ug/L.

DSPC_abundance_2019_2022.csv: Contains the plankton abundances in ROI/s used in the study. We filtered, scaled, and transformed this data as described in the script 1_calibrate_abundances_exclude_rare_taxa.R and saved the output data/formatted (see above).

date: The date when the measurement was taken in the format year-month-day.
aphanizomenon to vorticella: Plankton taxa abundances in ROI/s (raw objects of interest per second). The taxonomic classification and other attributes of each taxon are described in the Supplementary Material of the manuscript or data/raw/taxa_information.csv.

taxa_information.csv: Contains information (e.g., taxonomic classification or average size) of all plankton taxa used in the study.

ID: Unique ID going from 1 to 69 for each plankton taxon.
taxon: Name for each taxon used in the study.
name: Formatted name for each taxon in the study (e.g., for illustration purposes).
occurrence: Occurrence over time for each taxon, where 1 means a taxon was present in each sample (time point) and 0 means the taxon was never observed.
abundance: Average abundance of each taxon for a given time point in counts/mL.
size: Average size (area) derived from the images for each taxon in mm^2.
training_images: Number of images per taxon used for training of the classifier to identify taxa from the images of the underwater microscope.
magnification: The magnification of the underwater microscope used to derive this taxon's abundance, whereas 0p5x is the lower magnification, mostly used for zooplankton and large phytoplankton colonies, whereas 5p0x is the higher magnification, mainly used for small phytoplankton.
empire to species: Taxonomic classification to the lowest possible resolution for each taxon.
guild: Abbrevation for the higher taxonomic guild for each taxon described in guild_name.
guild_name: Higher taxonomic guild following Merz et al. 2023, describing higher-order taxonomic groups and food-web interactions.
guild_color: Distinctive color for each guild described in guild_name that can be used for plotting.
excluded: States if this taxon should be excluded based on a 50% occurrence threshold.

model_out

This folder contains the model outputs, e.g., interaction coefficients and biodiversity forecasts.

biodiversity_forecasting: This folder contains the model outputs for the biodiversity forecasting part using regularized S-maps (e.g., coefficients, model performances).

coefficients: This folder contains the estimated regularized S-map coefficients for all best-performing forecast models. The file names (e.g., coefficients_target_bray_turnover_tp_1_theta_1_alpha_0.1_lambda_0.032.csv) contain the forecasted variable (target, e.g., bray_turnover), time horizon (tp, e.g., 1 = one day ahead forecast) and regularized regression parameters theta, alpha and lambda (e.g., 1, 0.1 and 0.032) used in the regularized regression to produce those coefficients. The files in this folder were created by the script: 4.2_biod_forecast_smap_coefficients.R.

date: The date for when the coefficients were estimated in the format year-month-day.
intercept: The regularized regression's intercept, which was not considered in the analysis.
autoreg_term to phosphate: The estimated model coefficients, where autoreg_term defines the autoregressive term and the other variables are the biotic and abiotic predictors, described in more detail in data/analysis_data.csv.

contour_slopes: This folder contains the estimated interactive effects between two variables on biodiversity metrics, calculated by determining the local slopes of the contour lines. The file names (e.g., contour_slopes_jacc_turnover_temperature_richness_1.csv) contain the forecasted variable (e.g., jacc_turnover), the two variables for which we calculated the interactive effects (e.g., temperature, richness), and the time horizon (e.g., 1 = one-day-ahead forecast). The files in this folder were created by the scripts: 5.3a_two_way_interaction_slopes_richness.R and 5.3b_two_way_interaction_slopes_jaccard.R.

slope_1 and slope_1: The linear approximation (slope) of the contour lines for a given value of x_var_1 and x_var_2.
prediction: The predicted value for the target variable (e.g., jacc_turnover) for the given combination of x_var_1 and x_var_2
cont_line: The contour line number (identifier), which goes to 10 in this case, as we predicted the target biodiversity metric in a 10 by 10 grid.
target: The predicted biodiversity metric, e.g., jacc_turnover.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
x_var_1 and x_var_2: the two variables for which we calculated the interactive effects (e.g., temperature, richness)
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization.

interactions: This folder contains the averaged interaction coefficients from model_out/coefficients, weighted based on model performances. The file names (e.g., interaction_coefficients_target_bray_turnover_tp_1.csv) contain the forecasted variable (target, e.g., bray_turnover) and time horizon (tp, e.g., 1 = one-day-ahead forecast). The files in this folder were created by the script: 4.3_biod_forecast_smap_coefficients_assemble_results.R.

date: The date for when the coefficients were estimated in the format year-month-day.
target: The predicted biodiversity metric, e.g., bray_turnover.
expl_var: The variable for which the coefficient was estimated, i.e., the biotic and abiotic drivers. NA means that no coefficient could be estimated (because of missing values in the data set used for the analysis).
coefficient_w_mean: The averaged coefficient over the best models, weighted by the model's forecast performance.
coefficient_w_se: The standard error of the mean estimated in coefficient_w_mean, weighted by the model's forecast performance.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.

model_coefficients.csv: Contains the assembled data in a .csv file from model_output/biodiversity_forecasting/interactions and was created by using the script 4.3_biod_forecast_smap_coefficients_assemble_results.R.

date: The date for when the coefficients were estimated in the format year-month-day.
target: The predicted biodiversity metric, e.g., bray_turnover.
expl_var: The variable for which the coefficient was estimated, i.e., the biotic and abiotic drivers. NA means that no coefficient could be estimated (because of missing values in the data set used for the analysis).
coefficient_w_mean: The averaged coefficient over the best models, weighted by the model's forecast performance.
coefficient_w_se: The standard error of the mean estimated in coefficient_w_mean, weighted by the model's forecast performance.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.

model_var_importance.csv: Contains the predictive importance (e.g., drop in forecast skill or error when excluding a variable) of all biotic and abiotic predictors for forecasting all target biodiversity metrics. This .csv was assembled from the data in model_output/biodiversity_forecasting/var_importance by the script 4.5_biod_forecast_smap_var_importance_assemble_results.R.

tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
target: The predicted biodiversity metric, e.g., bray_turnover.
excl_var: The excluded biotic or abiotic variable from the forecasts.
rho: Pearson's correlation between predictions and observations.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations.
rmse_null: Rooted mean squared error of the full model (no variable was excluded).
rho_null: Pearson's correlation between predictions and observations of the full model (no variable was excluded).
rho_diff: Difference between rho and rho_null, indicating how much rho would drop if excluding the variable in excl_var.
r_2_diff: Difference between r^{2 and r}2 from a full model, where r^{2 is the explained variance and was calculated by rho}2. This variable indicates how much r^2 would drop if we exclude the variable in excl_var from the forecast model.
rmse_diff: Difference between rmse and rmse_null, indicating how much rmse would increase if excluding the variable in excl_var.

parametrization: This folder contains the results from the parameterization step of the regularized S-map for each target biodiversity metrics. It demonstrates the model’s predictive performance using future-leave-out cross-validation with various parameter combinations. The file names (e.g., smap_parametrization_bray_turnover_tp_1.csv) contain the forecasted variable (target, e.g., bray_turnover) and time horizon (tp, e.g., 1 = one-day-ahead forecast). The files in this folder were created by the script: 2.3_inter_coeffs_smap_parametrization.R.

target: The predicted biodiversity metric, e.g., bray_turnover.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization.
rho: Pearson's correlation between predictions and observations, estimated using leave-future-out cross-validation.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations, estimated using leave-future-out cross-validation.
const_pred_rmse: Rooted mean squared error from a persistence model, assuming no change from one time point to the next.
const_pred_rho: Rho from a persistence model, assuming no change from one time point to the next.

performance: This folder contains the performance of the best regularized S-map models during cross-validation and when using the entire dataset for each biodiversity metric. The file names (e.g., smap_performance_target_bray_turnover_tp_1.csv) contain the forecasted variable (target, e.g., bray_turnover) and time horizon (tp, e.g., 1 = one-day-ahead forecast). The files in this folder were created by the script: 4.2_biod_forecast_smap_coefficients.R.

target: The predicted biodiversity metric, e.g., bray_turnover.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations, using all observations.
rho: Pearson's correlation between predictions and observations, estimated using all observations.
const_pred_rmse: Rooted mean squared error from a persistence model, assuming no change from one time point to the next.
const_pred_rho: Rho from a persistence model, assuming no change from one time point to the next.
cv_rmse: Rooted mean squared error, calculated from the difference between predictions and observations, estimated using leave-future-out cross-validation.
cv_rho: Pearson's correlation between predictions and observations, estimated using leave-future-out cross-validation.

predictions: This folder contains the predicted values from the regularized S-map models generated during leave-future-out cross-validation. The file names (e.g., smap_predictions_bray_turnover_tp_1_theta_0_alpha_0.1_lambda_0.001.csv) contain the forecasted variable (target, e.g., bray_turnover), time horizon (tp, e.g., 1 = one day ahead forecast) and regularized regression parameters theta, alpha and lambda (e.g., 1, 0.1 and 0.032) used in the regularized regression to produce those predictions. The files in this folder were created by the script: 4.1_biod_forecast_smap_parametrization.R.

date: The date for which the forecast were made in the format year-month-day.
target: The predicted biodiversity metric, e.g., bray_turnover.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization.
predicted: The predicted value for that given time point by the forecast model. NAs indicate that no predictions could be made.
observed: The observed (true) values for that given time point.
const_pred: The value from a persistence model, assuming no change from one time point to the next.

two_way_scenexplor: This folder contains the predicted changes in biodiversity metrics when varying two of the explanatory variables. The file names (e.g., two_way_scenexplor_jacc_turnover_temperature_richness_1_1_0.5_0.032.csv) contain the forecasted variable (target, e.g., jacc_turnover), the two explanatory variables (e.g., temperature and richness), time horizon (tp, e.g., 1 = one day ahead forecast) and regularized regression parameters theta, alpha and lambda (e.g., 1, 0.1 and 0.032) used to make the predictions. The files in this folder were created by the script: 5.1a_two_way_interaction_richness.R and 5.1b_two_way_interaction_jaccard.R as well as the function in functions/scenario_exploration_function.R.

timepoint: The time point for which the prediction was made, ranging from 1 to the number of time points used in the study.
columns 2 to 3: Values for the explanatory variables used to predict the response of the biodiversity metric, ranging from the lowest to the highest values observed in the data.
autoreg_term: The autoregressive term for the predicted biodiversity metric.
columns 5 to 12: The other explanatory variables, held constant at their current value for the given time point.
column 13: The predicted biodiversity metric (e.g., richness or jaccard turnover).
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization
rmse: Rooted mean squared error of the model used to create the predictions for the given parameter combination of theta, alpha and lambda.

var_importance: This folder contains the importance of each explanatory variable when predicting biodiversity metrics. The file names (e.g., smap_var_importance_bray_turnover_excl_var_ammonium_tp_1.csv) contain the forecasted variable (target, e.g., bray_turnover), the excluded explanatory variable (excl_var, e.g., ammonium) and time horizon (tp, e.g., 1 = one-day-ahead forecast). The files in this folder were created by the script: 4.4_biod_forecast_smap_var_importance.R.

target: The predicted biodiversity metric, e.g., bray_turnover.
excl_var: The excluded biotic or abiotic variable from the forecasts.
tp: The time horizon used for the forecast, e.g., 1 corresponds to a one-day-ahead forecast.
theta: Regularized S-map parameter that controls the degree of nonlinearity.
alpha: Regularized S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso).
lambda: Regularized S-map parameter that controls the strength of the coefficient's penalization.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations when leaving out excl_var.
rho: Pearson's correlation between predictions and observations when leaving out excl_var.
const_pred_rmse: Rooted mean squared error from a persistence model, assuming no change from one time point to the next.
const_pred_rho: Rho from a persistence model, assuming no change from one time point to the next.

interaction_coefficients: This folder contains the model outputs for the estimation of interactions among plankton taxa part using MDR S-maps (e.g., embeddings, interaction coefficients, model performances).

coefficients: This folder contains the estimated MDR S-map coefficients for all best-performing models. The file names (e.g., coefficients_target_aphanizomenon_tp_1_theta_1_alpha_0.1_lambda_0.032.csv) contain the predicted taxa for which we estimated interactions (target, e.g., aphanizomenon), time horizon (tp, e.g., 1 = one day ahead prediction) and regularized regression parameters theta, alpha and lambda (e.g., 1, 0.1 and 0.032) used in the regularized regression to produce those coefficients. The files in this folder were created by the script: 2.4_inter_coeffs_smap_coefficients.R.

intercept: The MDR S-map's intercept, which has no biological meaning.
other columns: The estimated model coefficients, where column 2 defines the autoregressive term (i.e., the effect of the target taxon on itself) and the other variables show the effect of other taxa on the target taxon.

distance_matrix: This folder contains the estimated multiview distances for each taxon used in the MDR S-map analysis. The file names (e.g., dist.matrix_target_aphanizomenon.csv) contain the target taxa for which we estimated the distances to other points (target, e.g., aphanizomenon). The .csv file contain a matrix showing the distance of one point to all other points. Points with lower distance metrics are closer and thus more similar. Those .csv files were created by the script: 2.2_inter_coeffs_multiview_distance.R.

embeddings: This folder contains the embeddings used to calculate the multiview distances. The file names (e.g., multivariate_embeddings_target_aphanizomenon.csv) contain the target taxa for which we estimated the distances to other points (target, e.g., aphanizomenon). The .csv files in this folder were created using the script: 2.1_inter_coeffs_multivariate_embeddings.R.

interaction_coefficients.csv: Contains the assembled data in a .csv file from model_output/interaction_coefficients/interactions and was created by using the script 2.5_inter_coeffs_model_averaging.R.

date: The date for when the coefficients were estimated in the format year-month-day.
taxon_1: The target taxon, e.g., aphanizomenon.
taxon_2: The taxon interacting with target_1
coefficient_w_mean: The averaged coefficient over the best models, weighted by the model's forecast performance. Shows the effect (interaction) of target_2 on target_1.
coefficient_w_se: The standard error of the mean estimated in coefficient_w_mean, weighted by the model's forecast performance.

interactions: This folder contains the estimated interaction strengths (weighted average of coefficients based on performance) for all taxa. The file names (e.g., interaction_coefficients_target_aphanizomenon.csv) contain the target taxa for which we estimated the interactions (target, e.g., aphanizomenon). The .csv files in this folder were created using the script: 2.4_inter_coeffs_smap_coefficients.R.

time: Time point for which the coefficients were estimated, we later match them with the corresponding date.
taxon_1: The target taxon, e.g., aphanizomenon.
taxon_2: The taxon interacting with target_1
coefficient_w_mean: The averaged coefficient over the best models, weighted by the model's forecast performance. Shows the effect (interaction) of target_2 on target_1.
coefficient_w_se: The standard error of the mean estimated in coefficient_w_mean, weighted by the model's forecast performance.

parametrization: This folder contains the results from the parameterization step of the MDR S-map for each target taxon. It evaluates the model’s predictive performance using future-leave-out cross-validation with various parameter combinations. The file names (e.g., smap_parametrization_aphanizomenon.csv) contain the target taxa for which we parametrized the model (target, e.g., aphanizomenon). The .csv files in this folder were created using the script: 2.3_inter_coeffs_smap_parametrization.R.

target: The predicted taxon, e.g., aphanizomenon
theta: MDR S-map parameter that controls the degree of nonlinearity.
alpha: MDR S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: MDR S-map parameter that controls the strength of the coefficient's penalization.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations, estimated using leave-future-out cross-validation.

rho: Pearson's correlation between predictions and observations, estimated using leave-future-out cross-validation.

performance: This folder contains the performance of the best MDR S-map models during cross-validation and when using the entire dataset for each taxpm. The file names (e.g., smap_performance_target_aphanizomenon.csv) contain the target taxa for which we assembled the model performance results (target, e.g., aphanizomenon). The .csv files in this folder were created using the script: 2.4_inter_coeffs_smap_coefficients.R.

target: The predicted taxon, e.g., aphanizomenon
theta: MDR S-map parameter that controls the degree of nonlinearity.
alpha: MDR S-map parameter that controls the type of regression penalization used (i.e., ridge versus lasso)
lambda: MDR S-map parameter that controls the strength of the coefficient's penalization.
rmse: Rooted mean squared error, calculated from the difference between predictions and observations, using all observations.
rho: Pearson's correlation between predictions and observations, estimated using all observations.
const_pred_rmse: Rooted mean squared error from a persistence model, assuming no change from one time point to the next.
const_pred_rho: Rho from a persistence model, assuming no change from one time point to the next.
cv_rmse: Rooted mean squared error, calculated from the difference between predictions and observations, estimated using leave-future-out cross-validation.
cv_rho: Pearson's correlation between predictions and observations, estimated using leave-future-out cross-validation.

plots

This folder contains the produced plots by the code in scripts. Initially, it only contains plots for the interactive effects of two explanatory variables on biodiversity metrics (two_way_scenexplor) generated by the scripts 5.2a_two_way_interaction_heatmaps_richness.R and 5.2b_two_way_interaction_heatmaps_jaccard.R. Running these scripts in the scripts folder will generate all the main and some supplementary figures, saving them in this folder.

r_environment.Rproj

Running the code from inside this R environment ensures correct dependency paths.

scripts

This folder contains all the necessary scripts to reproduce the analysis, including those used to recover interaction coefficients and forecast biodiversity metrics. Below, we provide a brief description of what each script does. More detailed comments can be found in the script itself or in the methods section of the paper.

1.1_calibrate_abundances_exclude_rare_taxa.R: This script excludes rare taxa, sets taxa abundances to 0 based on an image threshold for a given day (to correct for biases in mislabelling by the classifier), and uses a calibration function to transform ROI/s (raw images per second) into counts/mL.

2.1_inter_coeffs_multivariate_embeddings.R: This script generates random multivariate embeddings for each plankton taxa and estimates their predictive performance using multivariate simplex projection.

2.2_inter_coeffs_multiview_distance.R: This script estimates the distance between sampling points from the top multivariate embeddings from 2.1 and averages them, creating the multiview distance. The multiview distance is used to weight points in the MDR S-map.

2.3_inter_coeffs_smap_parametrization.R: This script performs the parametrization of the MDR S-map model, evaluating the predictive performance of various parameter combinations.

2.4_inter_coeffs_smap_coefficients.R: This script uses the previously estimated best parameter combinations to fit the MDR S-map using the whole data. The coefficients of those models approximate the interactions between taxa.

2.5_inter_coeffs_model_averaging.R: This script is used to average the coefficients generated in 2.4 by weighting them based on their predictive performance.

3.1_estimate_biodiversity_turnover.R: This script is used to estimate the turnover in biodiversity, i.e., Jaccard and Bray-Curtis turnover.

3.2_gather_explanatory_variables.R: This script is used to calculate richness, Shannon diversity, the trace of the interaction matrix, and to create a data frame together with the previously estimated explanatory variables and abiotic drivers (e.g., temperature and nutrients). The generated .csv file can be found in data/analysis.variables.csv.

3.3_Fig_1bc_time_series_plots.R: This script generates plots of all the variables in data/analysis.variables.csv and corresponds to Figure 1bc in the main article.

4.1_biod_forecast_smap_parametrization.R: This script performs the parametrization of the regularized S-map model, which was used to forecast four different biodiversity metrics up to 30 days in advance.

4.2_biod_forecast_smap_coefficients.R: This script uses the previously estimated best parameter combinations for the regularized S-maps to forecast four different biodiversity metrics up to 30 days ahead and save the model's coefficients.

4.3_biod_forecast_smap_coefficients_assemble_results.R: This script assembles the model coefficients generated in the previous script (4.2).

4.4_biod_forecast_smap_var_importance.R: This script estimates the importance of each explanatory variable by leaving it out of the forecasts and recording the change in predictive performance.

4.5_biod_forecast_smap_var_importance_assemble_results.R: This script assembles the results from the previous script (4.4).

4.6_forecast_performance_plots_Fig_2_S4_S6.R: This script generates the plots evaluating the forecast performance of the regularized S-map models shown in Figure 2 of the main article and in the Supplementary Materials Figures S4 and S6.

4.7_biod_forecast_var_importance_and_model_coefficients_plots_Fig_3_S7_S8.R: This script generates the plots showing the importance of all biotic and abiotic explanatory variables in forecasting biodiversity metrics as well as their model coefficients (shown in Figure 3 of the main article and Supplementary Materials Figures S7 and S8).

5.1a_two_way_interaction_richness.R and 5.1 b_two_way_interaction_jaccard.R: The two scripts are used to explore the interactive effects of two explanatory variables on biodiversity metrics. They predicts the biodiversity metric over a gradient of the two explanatory variables while keeping all other explanatory variables at their current values.

5.2a_two_way_interaction_heatmaps_richness.R and 5.2 b_two_way_interaction_heatmaps_jaccard.R: The two scripts are used to plot the results from the previous scripts (5.1a and 5.1b) as a heatmap (see Fig. 4 in the main article).

5.3a_two_way_interaction_slopes_richness.R and 5.3 b_two_way_interaction_slopes_jaccard.R: The two scripts are used to estimate the slope of the interactive effects of the two explanatory variables, depending on their values. The slope is estimated from the contour lines.

functions: This folder contains the scenario_exploration_function.R, which is used to predict the target biodiversity metric over a gradient of two chosen explanatory variables using regularized S-maps.

Code/software

All analyses were done in R (R version 4.4.0 (2024-04-24) -- "Puppy Cup"). The original code for the MDR S-map can be accessed on GitHub (https://github.com/biozoo/MDR_S-map).

Access information

Other publicly accessible locations of the data:

Data was derived from the following sources:

https://doi.org/10.25678/000C2G