Data from: Advancing single species abundance models by leveraging multi-species data to reveal lake-specific patterns for fisheries predictions
Data files (Jan 12, 2026 version, 182.52 MB in total):
- Application.zip (182.51 MB)
- README.md (12.56 KB)
This readme file was generated on 2026-01-08 by Dr. Stahl
# GENERAL INFORMATION
Title of Dataset: Advancing single species abundance models by leveraging multi-species data to reveal lake-specific patterns for fisheries predictions.
Author
Name: Stahl, Aliénor
ORCID: 0000-0002-2297-7379
Institution: Concordia University, Montreal, Canada
Email: alienor.stahl@uqtr.ca
Author
Name: Eric Pedersen
ORCID: 0000-0003-1016-540X
Institution: Concordia University, Montreal, Canada
Email: eric.pedersen@concordia.ca
Author
Name: Pedro Peres-Neto
ORCID: 0000-0002-5629-8067
Institution: Concordia University, Montreal, Canada
Email: pedro.peres-neto@concordia.ca
# SHARING/ACCESS INFORMATION
Links to publications that cite or use the data:
Data and code for:
Stahl, A., Pedersen, E., Peres-Neto, P. Advancing single species abundance models by leveraging multi-species data to reveal lake-specific patterns for fisheries predictions. Accepted in Canadian Journal of Fisheries and Aquatic Sciences (2026). https://doi.org/10.32942/X23H13
See the preprint for methodological details.
## Abstract
Predicting species abundance is critical for understanding ecological dynamics and guiding conservation and management strategies. Traditional species abundance models (SAMs) rely on environmental variables and the presence or absence of key species, but often overlook community context and unmeasured environmental variation. Community composition can serve as a proxy for both unobserved environmental variables and biotic interactions influencing focal species. Here, we tested whether incorporating community composition via latent variables improves abundance predictions of sport fish using a large-scale dataset. We assessed how latent variable selection and lake characteristics influence model accuracy in predicting abundance across species. Our results show that low-abundance species were better predicted by models based solely on environment, while high-abundance species benefited from latent variables. Lake contributions to accuracy were correlated among species with similar occurrence, but unrelated to environmental characteristics. Model performance varied by species, with no consistent association with trophic level, occurrence, or abundance. These findings underscore the need to tailor models to species-specific contexts and integrate community composition into abundance modelling.
## Data
All data, whether original or generated by a script, is available in the Data folder.
The raw data is kept in the RawData subfolder and contains:
- A_species_code : correspondence between species code (used in Abundance_biomass_perSpc_lake) and species name (for interpretation of the patterns found)
- Abundance_biomass_perSpc_lake : abundance and biomass data per species and per lake. Fish abundance (CPUE) was collected in 707 lakes by the Ontario Broadscale Monitoring Program. The lakes were sampled during the summers (June to September) from 2008 to 2012 (see Lester et al. 2021 and Sandstrom et al. 2011 for more details on sampling methods).
- dictionary : Contains information on the environmental variables (e.g., what each is and the unit)
- E_Clean_lake_data_CSV : Environmental variables and their values. They were measured in each lake at the same time as the fish abundance sampling (see Sandstrom et al. 2011 on the choice of variables to measure, and the sampling methods used for each variable). These variables included measurements of local climate conditions (16 variables), hydromorphology (13 variables), lake chemistry (11 variables), lake productivity (10 variables), human activity on the lake (seven variables), watershed characteristics (five variables), as well as latitude and longitude. See dictionary for definitions and units.
- NAFMFD_finalcopy : information on anadromy and migratory behavior of the species
The remaining files contained in Data are modified versions of the data. The prefix of each file indicates which script was used to generate it. For example, 01_A_species_code_updated_names was generated with the script 01_Contribution.
- 01_A_species_code_updated_names : correspondence between species names from the raw data and updated versions that correct typos and account for changes in names.
- 01_species_DF : information on whether we consider a species a sport fish or not, along with notes from Dr. Dylan Fraser (Concordia University, Montréal) with his opinion.
- 01_species : information on whether we consider a species a sport fish, bait fish and whether fishing it is legal or not.
- 04B_traits_sport: trait information gathered on sport fish
- Map.C2 : Lat/Long of each lake to generate maps
- Pca.loadings : loadings of the environmental variables on the PCA both as csv and xlsx
Any empty cell in a dataset should be treated as N/A. These cells were deliberately left empty to avoid interfering with the scripts.
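For readers who load the tables outside the provided scripts, the following is a minimal Python sketch of the empty-cell convention, using only the standard library. It is an illustration and not part of the repository; the column names (`lake_id`, `depth_max`, `ph`) and the sample values are hypothetical.

```python
import csv
import io


def clean(value):
    """Return None for an empty or missing cell, otherwise the stripped text."""
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def read_rows(handle):
    """Read a CSV file and convert every empty cell to None (treated as N/A)."""
    reader = csv.DictReader(handle)
    return [{key: clean(value) for key, value in row.items()} for row in reader]


# Hypothetical two-lake table with one empty cell per row.
sample = io.StringIO("lake_id,depth_max,ph\nL001,12.5,\nL002,,7.1\n")
rows = read_rows(sample)
```

Here `rows[0]["ph"]` and `rows[1]["depth_max"]` are `None`, while the filled cells keep their text values and can be converted to numbers afterwards.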
Further data is contained in the folder Map, which holds shapefiles of the Ontario watersheds. These shapefiles can be downloaded from Statistics Canada (see https://www150.statcan.gc.ca/n1/pub/92-160-g/92-160-g2021001-eng.htm) and were used to generate maps of the study zone.
## Results
The folder Results contains two sub-folders of RData files that were saved during long-running loops, so that the whole loop does not need to be rerun after a crash.
## Code
Below, we outline the main questions addressed in this study and summarize the corresponding results. Details on the analytical approaches used to obtain these results, as well as the scripts in which each analysis can be found, are provided in the following section.
(1) Does the inclusion of latent variables improve prediction accuracy?
Not all target species models benefitted from the inclusion of latent variables. Importantly, the method used to generate these latent variables did not affect the direction of the LE values and consistently produced the same overall effect on predictive ability, whether as an improvement or a decline relative to the environmental model. A clear trend emerged: species with low occurrences were predicted more accurately by the environmental model, whereas species with higher occurrences were better predicted by models that included latent variables.
(2) Are predictions of sport fish abundances more accurate when using sport fish, non-sport fish, or all fish species as predictors?
Our analysis showed that the best-performing set of species used to build the latent variables varied by target species. Cisco, lake whitefish, largemouth bass, northern pike, and smallmouth bass were best predicted by the model using latent variables incorporating all fish species. In contrast, black crappie, lake trout, rainbow smelt, walleye, and yellow perch were better predicted by the model using non-sport fish species. The remaining four species were most accurately predicted by the model that included only sport fish species. Taken together, these results indicate that our models are robust against variations in lake rarity, whether defined by environmental characteristics or community composition, and are not strongly influenced by any single environmental factor.
(3) What types of lakes significantly increase or decrease predictive ability, and are these lakes rare or common in terms of environment and/or species composition?
The LE metric showed no correlation with lake rarity, whether defined by environmental characteristics (Mahalanobis distance) or by species composition (LCBD). This suggests that predictive ability is not primarily driven by whether lake types are common or rare, although certain lakes may still influence predictive ability through their overall characteristics, regardless of their rarity (or commonness).
(4) To what extent do species share lakes that either improve or reduce predictive accuracy?
Visual analysis revealed three distinct groups with similar correlations across models: (i) rainbow smelt, muskellunge, and sauger; (ii) burbot, lake trout, black crappie, brook trout, and largemouth bass; and (iii) yellow perch, smallmouth bass, northern pike, walleye, lake whitefish, and cisco. The species groups also appear to be correlated with their occurrence rates (i.e., the number of lakes in which the species was present): group (i) consisted of low-occurrence species, group (ii) included medium-occurrence species, and group (iii) represented high-occurrence species.
(5) Are sport fish abundances better predicted using all lakes or only those where the species is present?
The results varied by species but were extremely consistent across models. For rainbow smelt, lake trout, and lake whitefish, models fitted using only the lakes where the species occurred performed better on average. In contrast, for black crappie, brook trout, largemouth bass, burbot, smallmouth bass, cisco, walleye, northern pike, and yellow perch, predictions were more accurate when models included data from all lakes in the dataset.
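Question 3 above measures lake rarity in species composition with LCBD (local contribution to beta diversity). As a rough illustration of what that index captures, here is a minimal Python sketch: each lake's LCBD is its share of the total sum of squared deviations from the column means of a transformed community matrix. The Hellinger transformation used here is an assumption, since this README does not state which transformation was applied, and the toy matrix is hypothetical.

```python
import math


def lcbd(presence_absence):
    """Compute LCBD for each lake from a lake-by-species 0/1 matrix.

    Assumes at least two lakes differ in composition, so the total
    sum of squares is non-zero.
    """
    # Hellinger transformation (assumed): square root of row-wise relative frequencies.
    transformed = []
    for row in presence_absence:
        total = sum(row)
        transformed.append([math.sqrt(v / total) if total else 0.0 for v in row])

    n_lakes = len(transformed)
    n_species = len(transformed[0])
    col_means = [sum(r[j] for r in transformed) / n_lakes for j in range(n_species)]

    # Sum of squared deviations from the species means, per lake.
    ss_lake = [
        sum((r[j] - col_means[j]) ** 2 for j in range(n_species))
        for r in transformed
    ]
    ss_total = sum(ss_lake)
    return [s / ss_total for s in ss_lake]


# Hypothetical example: three lakes share two species, one lake holds a unique species.
toy = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
values = lcbd(toy)
```

In this toy case the fourth lake receives the largest LCBD value (0.75), reflecting its unusual composition; the values sum to 1 across lakes.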
To answer these questions, we adapted the original approach from Stahl et al. (2024) to our dataset and implemented the following steps:
- Using all lakes (n = 594), we derived three sets of latent variables from the presence-absence data of: (1) sport fish species, (2) non-sport fish species, and (3) all fish species.
- The dataset was randomly split into a calibration set and a validation set, representing respectively 70 % (n = 416 lakes) and 30 % (n = 178 lakes) of the dataset considered. This split was performed multiple times for each target sport fish species to assess uncertainty over model performance.
- Environmental variables of the calibration set were summarized by PCA with a sparsification step (Zou et al. 2006), and the environmental variables of the validation set were subsequently projected onto the same PCA axes (see section Environmental predictors of the manuscript for rationale).
- The calibration set was used to fit (train) statistical models for predicting lake abundance of each of the 14 sport fish species. The trained models varied in their inclusion of different sets of predictors: (1) environmental variables summarized by sparse PCA axes, (2) environmental PCA axes combined with latent variables generated from presence-absence of the 14 sport fish species, (3) environmental PCA axes with latent variables generated from presence-absence of all non-sport fish species, and (4) PCA environmental axes and latent variables from the presence-absence of all fish species. This approach aimed to contrast the effects of different species groups on predictive ability and provide a comparison with models relying only on environmental data, as is commonly done in abundance modelling.
- The validation set was used to evaluate the performance of each model in predicting species abundance, with accuracy measured by the log error.
- The process of cross validation was replicated 1000 times. To determine the contribution of each lake to the dataset, we calculated the difference in error between two scenarios (1) when the lake was included in the calibration dataset, and (2) when the lake was excluded from the calibration dataset. This step allowed us to assess how influential a particular lake is on model performance and to identify whether certain lakes have a disproportionate effect on prediction accuracy.
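The repeated split and the lake-contribution step above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the code in 01_Contribution: the model is a stand-in that predicts the calibration mean, the error is assumed to be the mean absolute difference of log(x + 1) values, and the contribution is assumed to be the mean validation error over replicates with the lake in the calibration set minus the mean over replicates without it. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def split_lakes(lake_ids, calib_fraction=0.7, rng=random):
    """Randomly split lake IDs into calibration and validation sets."""
    shuffled = list(lake_ids)
    rng.shuffle(shuffled)
    n_calib = round(calib_fraction * len(shuffled))
    return shuffled[:n_calib], shuffled[n_calib:]


def log_error(observed, predicted):
    """Assumed metric: mean absolute difference between log(x + 1) values."""
    return mean(abs(math.log1p(p) - math.log1p(o)) for o, p in zip(observed, predicted))


def lake_contributions(abundance, n_replicates=1000, seed=1):
    """Estimate each lake's contribution for one target species.

    abundance maps lake ID to observed abundance. A positive value means
    validation error was higher, on average, when the lake was used for
    calibration than when it was not.
    """
    rng = random.Random(seed)
    lakes = sorted(abundance)
    err_in = {lake: [] for lake in lakes}
    err_out = {lake: [] for lake in lakes}
    for _ in range(n_replicates):
        calib, valid = split_lakes(lakes, 0.7, rng)
        # Stand-in model: predict the calibration mean for every validation lake.
        prediction = mean(abundance[lake] for lake in calib)
        err = log_error([abundance[lake] for lake in valid], [prediction] * len(valid))
        calib_set = set(calib)
        for lake in lakes:
            (err_in if lake in calib_set else err_out)[lake].append(err)
    return {
        lake: mean(err_in[lake]) - mean(err_out[lake])
        for lake in lakes
        if err_in[lake] and err_out[lake]
    }


# With 594 lakes, a 70/30 split gives 416 calibration and 178 validation lakes.
calib, valid = split_lakes(range(594))
```

In the actual analysis the stand-in model would be replaced by the fitted abundance models described above, with the environmental PCA recomputed on each calibration set.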
Scripts should be run in the order of their numeric prefix (e.g., run script 01_Contribution before script 02_New). The goal of each script is the following:
- 01_Contribution : calculates the prediction error per lake and species (generates results to answer questions 1 to 4)
- 01_functionsBIS : functions necessary to run script 01_Contribution
- 02_New : script generating the plots to answer all questions of the article (see section above)
- 02_functions : functions required to run 02_New
- 03_Present.contribution : script to calculate the prediction error but only if the species is present in the lake (generates results to answer question 5)
- Edited.stackedsdm : function stackedsdm from ecoCopula where we edited what the function returns to extract information on the data transformation
- Moran : calculates the spatial correlation
## Figures
Finally, all figures generated by the scripts are found in this folder.
