Dryad

Data from: Integrating diverse data for robust species distribution models in a dynamic ocean

Cite this dataset

Farchadi, Nima et al. (2024). Data from: Integrating diverse data for robust species distribution models in a dynamic ocean [Dataset]. Dryad. https://doi.org/10.5061/dryad.7sqv9s51c

Abstract

Aim: Species distribution models (SDMs) are an important tool for marine conservation and management, yet guidance on leveraging diverse data to build robust models is limited. While various approaches can be used to integrate different datasets, studies comparing their performance, particularly for highly migratory and mobile species, are scarce. Here, we assess whether a model-based integrative framework improves performance over traditional data pooling or ensemble approaches when synthesizing multiple data types.

Location: North Atlantic Ocean

Time Period: 1993 - 2019

Major Taxa Studied: Blue shark (Prionace glauca)

Methods: We trained traditional, correlative SDMs and integrated SDMs (iSDMs) with three distinct data types: fishery-dependent marker tags, fishery observer records, and fishery-independent electronic tag data. We evaluated data pooling and ensemble approaches in a correlative SDM framework and compared performance to an iSDM approach designed to explicitly account for data-specific biases while retaining the strengths of each dataset.

Results: While each integration approach yielded robust models, model performance varied among data types, with all models predicting fishery-dependent data more accurately than fishery-independent data. Differences in performance were primarily attributed to each model’s ability to explain the spatiotemporal dynamics of the training data. iSDMs that explicitly accounted for seasonal variability yielded the most accurate and ecologically realistic estimates. However, such approaches are computationally intensive and warrant identifying model purpose as an important step in the data-integration process.

Main Conclusions: Our findings reveal important trade-offs among the current techniques for integrating data in SDMs, including variability in accurately estimating species distributions, generating ecologically realistic predictions, and practical feasibility. With increasing access to growing and diverse data sources, maximizing our ability to leverage available data with robust analytical approaches will be instrumental in enhancing conservation and management efforts and for understanding current and future species distributions in a dynamic ocean.

README: Data and code for the article "Integrating diverse data for robust species distribution models in a dynamic ocean"

https://doi.org/10.5061/dryad.7sqv9s51c

This repository contains data and code to:

  1. Run cross-validation for each integration approach (data pooling, ensemble, integrated SDM with constant spatial effect, and integrated SDM with seasonal spatial effects) and measure performance metrics (predictive skill, ecological realism, computational demand)
  2. Develop full models for each integration approach using all the data
  3. Visualize and interpret results

The data in this repository include the raw data, representing species presence and pseudo-absences, used to construct the different integration approaches presented in the paper. 

Note that the raw data sets in this repository include only the fishery-dependent marker tag and fishery-independent electronic tag data sets. The fishery-dependent observer data set used in this study is considered confidential under the U.S. Magnuson-Stevens Act. Qualified researchers may request these data from the NOAA Pelagic Observer Program office by contacting popobserver@noaa.gov; we requested data representing all pelagic longline sets between 1993 and 2019. Additionally, each of the included data sets has a presence:pseudo-absence ratio of about 1:3, to ensure there are enough pseudo-absences to filter down to a 1:1 ratio. Filtering to a 1:1 ratio is done at the beginning of each script, before every analysis.
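The 1:1 filtering step can be illustrated with a short sketch. The repository's scripts are written in R; the Python below is only an illustration of the logic, and it assumes the filter is a random subsample of pseudo-absences without replacement (the README does not state how rows are chosen). The function name, the seed, and the toy rows are hypothetical.

```python
import random


def balance_presence_absence(rows, seed=0):
    """Keep all presences and an equal-sized random subsample of pseudo-absences.

    Assumption: pseudo-absences are drawn at random without replacement.
    `rows` is a list of dicts that each have a `pres_abs` field (1 or 0).
    """
    presences = [r for r in rows if int(r["pres_abs"]) == 1]
    absences = [r for r in rows if int(r["pres_abs"]) == 0]
    if len(absences) < len(presences):
        raise ValueError("not enough pseudo-absences for a 1:1 ratio")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return presences + rng.sample(absences, len(presences))


# Toy rows with a 1:3 presence:pseudo-absence ratio (made-up data).
toy_rows = [{"pres_abs": "1"}] * 5 + [{"pres_abs": "0"}] * 15
balanced = balance_presence_absence(toy_rows)
```

Fixing the seed keeps the subsample identical between runs, which matters when comparing integration approaches on the same rows.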

Description of the data and file structure

Data File Details

Details for: bsh_marker_tag.csv

  • Description: a comma-delimited file containing the raw data for the fishery-dependent marker tags for blue sharks in the North Atlantic Ocean. 
  • Format(s): .csv
  • Size(s): 4.608 Mb
  • Dimensions: 62894 rows by 8 columns
  • Variables:
    • long: longitude
    • lat: latitude
    • pres_abs: a binary integer where 1 indicates presence and 0 indicates absence
    • year_mon: year-month of presence or absence
    • sst: sea surface temperature (degrees C)
    • sst_sd: spatial standard deviation of sst
    • bathy: bathymetry, bottom depth (in meters)
    • dataset: character value of the data type

Details for: bsh_electronic_tag.csv

  • Description: a comma-delimited file containing the raw data for the fishery-independent electronic tags for blue sharks in the North Atlantic Ocean. 
  • Format(s): .csv
  • Size(s): 1.959 Mb
  • Dimensions: 26798 rows by 8 columns
  • Variables:
    • long: longitude
    • lat: latitude
    • pres_abs: a binary integer where 1 indicates presence and 0 indicates absence
    • year_mon: year-month of presence or absence
    • sst: sea surface temperature (degrees C)
    • sst_sd: spatial standard deviation of sst
    • bathy: bathymetry, bottom depth (in meters)
    • dataset: character value of the data type
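As a rough guide to working with the two files described above, the sketch below reads a CSV with the eight documented columns and casts each to a sensible type. It is an illustration under assumptions, not part of the repository: the function name is hypothetical, missing values are assumed to appear as empty strings or `NA`, and `year_mon` is kept as text because its exact format is not stated here. The sample rows are invented for the example.

```python
import csv
import io

COLUMNS = ["long", "lat", "pres_abs", "year_mon", "sst", "sst_sd", "bathy", "dataset"]
NUMERIC = ["long", "lat", "sst", "sst_sd", "bathy"]


def read_records(handle):
    """Read rows with the documented columns and cast numeric fields."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        rec = dict(row)  # year_mon and dataset stay as strings
        for name in NUMERIC:
            # Assumption: missing values are written as "" or "NA".
            rec[name] = None if row[name] in ("", "NA") else float(row[name])
        rec["pres_abs"] = int(row["pres_abs"])
        if rec["pres_abs"] not in (0, 1):
            raise ValueError(f"pres_abs must be 0 or 1, got {rec['pres_abs']}")
        records.append(rec)
    return records


# Invented sample rows; real use would pass open("bsh_marker_tag.csv", newline="").
sample = io.StringIO(
    "long,lat,pres_abs,year_mon,sst,sst_sd,bathy,dataset\n"
    "-70.5,38.2,1,2005-07,22.4,0.8,-2500,marker\n"
    "-65.0,40.1,0,2005-07,NA,0.5,-3100,marker\n"
)
records = read_records(sample)
```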

Script File Details

Details for: ModelPerformanceAnalysis

  • Description: contains all code to run the cross-validation to evaluate model performance for each integration approach. Scripts in the folder call on functions from the functions folder to run analyses
    • BRT_Ensemble_Pooling_Analysis.r - runs cross-validation for BRT data pooling and ensemble model approaches
    • iSDM_Constant_Analysis.r - runs cross-validation for iSDM constant model
    • iSDM_Seasonal_Analysis.r - runs cross-validation for iSDM seasonal model

Details for: FullModels

  • Description: contains all the code to run the model with all the data for each integration approach. Scripts in the folder call on functions from the functions folder to run analyses
    • Full_Models.r - fits each integration approach with all the data. The BRT pooling and ensemble functions return the BRT model. The INLA functions for the iSDM constant and seasonal models return spatial predictions, Gaussian Markov Random Field (GMRF) values, marginal effect values for covariates, summary outputs, and range and variance parameters for the GMRFs

Details for: functions

  • Description: contains wrapper functions to execute analyses in the ModelPerformanceAnalysis and FullModels folders

Sharing/Access information

Links to other publicly accessible locations of the data:

Environmental data was derived from the following sources:

Code/Software

Model performance analysis was conducted in R version 4.2.3 on the University of California, Davis high-performance computing cluster, Farm (https://hpc.ucdavis.edu/farm-cluster). Model fitting for the pooling and ensemble approaches was conducted using default computational specifications in R, without explicit user-defined alterations to CPU cores (i.e., 1 CPU core utilized) or RAM. For both iSDMs, model fitting during cross-validation was run in parallel on 20 CPUs with 2-5 GB of memory per CPU on a single node (depending on the model).
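For anyone sizing a cluster request to reproduce the iSDM cross-validation, the figures above imply a total memory range per node. The arithmetic below is a minimal sketch that assumes the per-CPU memory is reserved for all 20 CPUs at once; the function name is hypothetical.

```python
def total_memory_gb(n_cpus, gb_per_cpu):
    """Total memory for a job that reserves gb_per_cpu on each of n_cpus."""
    return n_cpus * gb_per_cpu


low = total_memory_gb(20, 2)   # lighter model: 20 CPUs x 2 GB
high = total_memory_gb(20, 5)  # heavier model: 20 CPUs x 5 GB
print(f"{low}-{high} GB")      # prints "40-100 GB"
```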

Methods

See the manuscript for details.

Funding

National Aeronautics and Space Administration, Award: 80NSSC19K0187, Ecological Forecasting

National Oceanic and Atmospheric Administration, Award: NA21OAR4170247