Assessing patterns and risk to Chilean freshwater fish distributions using multi-species occupancy models
Data files
Aug 21, 2025 version files 890.58 KB
Abstract
To advance our understanding of freshwater biodiversity in data-limited systems, this study used multispecies occupancy models to predict species richness and individual species occupancy, providing critical insights for conservation of these rapidly declining ecosystems. We tested several model types and ultimately pursued latent spatial multi-species occupancy models, which gained popularity in wildlife ecology but are relatively underutilized in fisheries ecology. Advantages include simultaneously modeling multiple species to infer both species-specific and assemblage-level responses to hydro-geomorphological conditions while also accounting for imperfect species detections. Data used in this model include fish sampling data and hydro-geomorphological data.
https://doi.org/10.5061/dryad.7d7wm385c
Description of the data and file structure
Fishes were sampled from 2015 to 2023 using a variety of surveys unequally distributed across the 11 river basins. Fishes were collected using standard backpack electrofishing and netting methodologies (i.e., seine, fyke, and gill nets). A suite of abiotic data at the catchment (e.g., elevation, geology, rainfall), valley (e.g., valley and trough width and slope), and channel scales (e.g., belt width, sinuosity, wavelength) were also collated and analyzed. We used ArcGIS tools to extract geomorphic data from our study system at the catchment, valley, and channel scales at 2-5 km intervals.
Because this dataset includes IUCN endangered, vulnerable, and data-deficient species, latitude and longitude are rounded to the nearest degree, which will not allow for the replication of the spatial aspect of the model. Original GPS coordinates will be available upon request from the authors.
Files and variables
File: electrofish_2021_lowflow_Dryad_subset_coords_rounded.csv
Description:
Variables
- scientific_name: fish species
- month: month of survey
- river_seg: unique river segment value associated with map location
- lat: latitude degree
- long: longitude degree
- elevation: elevation (meters)
- mean_annual_ppt: mean annual precipitation (millimeters)
- ratio_valleywidth_to_valleyfloorwidth: river confinement (meters)
- ord_str: Strahler order
- qd_sd: hydropeaking
File: Fishdata_geo_hydro_cleaned_allmethods_subset_Dryad.csv
Description: The description of certain variables are same as above.
Variables
- river_seg: site ID
- elevation:
- mean_annual_ppt:
- ratio_valleywidth_to_valleyfloorwidth:
- qf_sd: standard deviation of streamflow
Code/software
Code file: spatial_MSOM_Dryad
Overview
This workflow fits a spatial, latent-factor multi-species occupancy model using spOccupancy::sfMsPGOcc, then predicts and maps species richness (all species and a native subset) across river segments. It constructs detection histories from species × site data (can include month but did not benefit this model), includes site-level covariates, uses a spatial approximation, and produces summary diagnostics and maps (optionally with dam locations).
Required inputs
- Detection & site-covariate CSV: electrofish_2021_lowflow_Dryad_subset.csv
Required columns (by name or position as used in the script):- river_seg (site ID), scientific_name, month
- Coordinates in columns 4–5 (assumed long, lat)
- Site covariates in columns 6–10, including:
- elevation, mean_annual_ppt, ratio_valleywidth_to_valleyfloorwidth
(ord_str may be present but is not scaled/used in the final formula.)
- elevation, mean_annual_ppt, ratio_valleywidth_to_valleyfloorwidth
- Prediction CSV: Fishdata_geo_hydro_cleaned_allmethods_subset_Dryad.csv
- Must include river_seg, long, lat, and the same covariates as above.
- River network shapefile: Res_FPZs11C_OSt_IHidro_CKoppen_05092022.shp
- Fields: Elevacion, Rat_valle, Pp_anual (renamed in the script).
- (Optional) Dam point layers: Hidroelectricas., Potencial_Hidro.
- (Optional) your_data.csv for the logit→probability utility.
Outputs
- summary_out_sfMsPGOcc.csv — model summary.
- WAIC and PPC results printed to console.
- preds.RData — workspace with prediction objects.
- 9_sp_occ.png — map figure(s) of predicted richness.
- converted_data.csv — if you run the logit→probability utility.
Code File: individ_species_maps_Dryad
Overview
This code specifically maps individual species occurrence using the info from the spOccupancy::sfMsPGOcc described above.
Inputs
- Detection data: A 3D array (species × site × replicate) of detection/nondetection observations.
- Site coordinates: Latitude/longitude or other spatial information to model spatial correlation.
- Model settings:
- Number of species, sites, and replicates
- Priors for occupancy and detection parameters
- Number of spatial factors to capture site correlations
- MCMC settings (chains, iterations, burn-in, thinning)
Outputs
- Posterior estimates of:
- Species-level occupancy probabilities
- Detection probabilities
- Site-level latent occupancy states
- Spatial random effects (factor loadings)
- Model diagnostics (traceplots, convergence summaries)
R Packages:
sf, ggmap, ggplot2, viridis, dplyr, spOccupancy, tidyverse, coda, stars, lme4
External Services: Stadia Maps or equivalent mapping service for base maps
Data collection
The initial dataset used in this analysis included fish sampled from 2015 through 2023 using a variety of standard methods (backpack electrofishing, seines, fykes, and gill nets) with sampling efforts unequally distributed across the 11 river drainage basins. After collection, fish specimens were kept in containers filled with cold, oxygenated water while in the field. All specimens were sedated using BZ-20® (ethyl p-aminobenzoate), then measured for total length and weighed at the collection site. Following this, they were placed back in clean water containers. Once they fully recovered, all specimens were returned to their natural habitats (Orrell et al. 2025). Euthanasia was performed only when necessary, such as if a fish did not recover satisfactorily after sedation. In such cases, an overdose of the same anesthetic was used for euthanasia. Fish data were then subset to include only electrofishing data collected during the low flow period (Dec-April) of 2021 to meet the closed population assumption of an MSOM. The year 2021 was chosen as it had the most repeat samples (a little over a third of all sites were sampled more than once). 10,428 fish individuals representing 25 species (19 native, 6 non-native) were collected in 122 sampling sites in 9 river drainage basins (Maipo, Rapel, Mataquito, Maule, Itata, Biobío, Imperial, Toltén, and Valdivia). (Table 1).
A suite of abiotic data at the catchment (e.g., elevation, geology, rainfall), valley (e.g, valley and trough width and slope), and channel scales (e.,g. belt width, sinuosity, wavelength) were also collated and analyzed from existing datasets using software tools. We used ArcGIS tools to extract geomorphic data from our study system at the catchment, valley, and channel scales at 2-5 km intervals (Harris et al. 2009). Catchment-scale variables, including elevation, were estimated from a 12.5 m digital elevation model (DEM) from the Natural Resources Information Centre of Chile, geology measured from a 1:1.000.000 scale vector geology map aggregated into 5 sediment types, and mean annual precipitation from WorldClim (https://www.worldclim.org/data/index.html). We calculated valley and channel scale variables from transects using the Digital Shoreline Analysis System (Habit et al. 2022).** Finally, the current and future barrier distribution was obtained from the Chilean Ministry of Energy based on hydropower potential and water rights with current dams as of 2018 and planned dams by 2050.
Analysis
Multi-species Occupancy Models
Given the large spatial distribution of the data in this study, it was imperative to explicitly account for spatial autocorrelation by including spatially structured random effects in the model (Banerjee et al. 2003; Shirota et al. 2019; Doser et al. 2023). Additionally, because sampling methodology was not always consistent between surveys and the spatial distribution of sampling was uneven, species presence-absence data were used instead of abundance, and imperfect detection was accounted for in the model (Doser et al. 2023). For example, while the data was subset to only include fish collected with the same standard method during the low flow period of 2021, some regions of the study area were sampled more heavily than others due to different accessibility or resource constraints, introducing potential bias. Studies show that accounting for residual species correlations, imperfect detection, and spatial autocorrelation when present in the data and model leads to better predictive performance, and therefore, the spatial factor MSOM was tested along with several others, including a multi-species occupancy model (MSOM) and a latent factor MSOM (Doser et al. 2022, 2023).
To find the most parsimonious model with the best predictive performance across the region, we followed a model selection procedure to identify the most influential covariates. To do this, we calculated a matrix of Pearson product-moment correlations using an initial set of 12 geomorphological variables at multiple spatial scales (valley, channel). Results indicated several variables were significantly correlated with one another; thus, each correlated variable was separately run through the models to assess its Widely Applicable Information Criterion (WAIC) using the spOccupancy R package, and the correlated variable with the best score was retained (Doser et al. 2022). Variables with the lowest WAIC were then combined in a model, and several iterations of that model were tested until the best WAIC score was found. This technique followed prior research (Hooten and Hobbs 2015; Stewart et al. 2023), indicating information criteria such as WAIC values can be informative towards selecting occupancy covariates, if the purpose of the model is prediction. Finally, before integrating covariates into the model, they were scaled (mean of 0 and standard deviation of 1) using the R scale function.
Prediction and mapping
We used the predict function in the spOccupancy R package to generate a series of posterior predictive samples at new locations, given a set of covariates and spatial locations. When predicting new values using a latent factor, MSOM predictions were made at both sampled and non-sampled locations using latent factors. Using this function, predicted values of detection probability for each species, given their relationship to the covariates, were generated (Doser et al. 2022). Then, maps of species richness predictions for native species and all species, as well as individual species' predicted occupancy maps, were made. Endangered, data-deficient, and non-native species were selected for the creation of individual species occupancy maps from species that had sufficient (>10) detections (Table 1). Finally, the native species richness map was then used as the base layer with a current and planned dams layer overlayed to descriptively assess potential impacts.
