The best of two worlds: using stacked generalisation for integrating expert range maps in species distribution models

Data files

Sep 25, 2024 version, files total 268.99 MB:

iucn_dists_agg.tif (28.81 MB)
model_df.csv (203.27 MB)
predictor_stack_agg.tif (36.92 MB)
README.md (4.87 KB)

Abstract

Aim

Species distribution models (SDMs) are powerful tools for assessing suitable habitats across large areas and at fine spatial resolution. Yet, the usefulness of SDMs for mapping species' realised distributions is often limited since data biases or missing information on dispersal barriers or biotic interactions hinder them from accurately delineating species' range limits. One way to overcome this limitation is to integrate SDMs with expert range maps, which provide coarse-scale information on the extent of species' ranges and thereby range limits that are complementary to information offered by SDMs.

Innovation

Here, we propose a new approach for integrating expert range maps in SDMs based on an ensemble method called stacked generalisation. Specifically, our approach relies on training a meta-learner regression model using predictions from one or more SDM algorithms alongside the distance of training points to expert-defined ranges as predictor variables. We demonstrate our approach with an occurrence dataset for 49 bat species covering four biodiversity hotspots in the Eastern Mediterranean, Western Asia and Central Asia.

Main Conclusions

Our approach offers a flexible method to integrate expert range maps with any combination of SDM modelling algorithms, thus facilitating the use of algorithm ensembles. In addition, it provides a novel, data-driven way to account for uncertainty in expert-defined ranges not requiring prior knowledge about their accuracy, which is often lacking. Integrating expert range maps into SDMs for bats resulted in more realistic predictions of distribution patterns that showed narrower niche breadths and smaller range overlaps between species compared to traditional SDMs. Our approach holds promise to improve assessments of species distributions, while our work highlights the overlooked potential of stacked generalisation as an ensemble method in species distribution modelling.

https://doi.org/10.5061/dryad.6q573n65m

This repository contains Supporting Information for the article "The best of two worlds: using stacked generalisation for integrating expert range maps in species distribution models" (https://doi.org/10.1111/geb.13911).

It contains three files:

"model_df.csv": CSV table containing the modeling data frame with occurrence information (presence/background) for 49 bat species alongside the values of the predictors used for building SDMs
"predictor_stack_agg.tif": GeoTIFF containing the raster stack of predictor variables used in SDMs, re-sampled to a reduced spatial resolution of 10 km for demonstration purposes
"iucn_dists_agg.tif": GeoTIFF containing a raster stack of distance layers describing the distance of raster cells to the boundary of IUCN ranges for 49 bat species, re-sampled to a reduced spatial resolution of 10 km for demonstration purposes

A companion R script is available on Zenodo:

"stacked_generalization_Rcode_example.R": Commented R code demonstrating the use of stacked generalization for integrating expert range maps with an SDM algorithm ensemble

Description of the data and file structure

The R script demonstrates the use of stacked generalization for integrating information on species range limits into SDMs based on expert range maps, such as those provided by the IUCN (https://www.iucnredlist.org/resources/spatial-data-download). For more details on the approach, please refer to the paper.
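As a rough illustration of the idea, the following Python sketch covers only the meta-learner step of stacked generalization. It is not the companion R script and not the authors' implementation. It assumes that out-of-fold predictions from two base SDM algorithms are already available for every training point, and it swaps in a plain logistic regression fitted by gradient descent as the meta-learner. The function names (fit_meta_learner, predict_meta), the toy values, and the distance unit (rescaled to hundreds of km) are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_learner(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent.

    rows   : feature lists [sdm_pred_a, sdm_pred_b, distance_to_range]
    labels : 1 for presence, 0 for background
    Returns (intercept, weights).
    """
    n, p = len(rows), len(rows[0])
    b, w = 0.0, [0.0] * p
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * p
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err
            for j in range(p):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w


def predict_meta(model, x):
    """Predicted probability of presence from the fitted meta-learner."""
    b, w = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy table: two base-model predictions plus the distance
# to the expert range (0 = inside the range), one row per training point.
train_x = [
    [0.90, 0.80, 0.0], [0.80, 0.90, 0.0], [0.70, 0.60, 0.1], [0.85, 0.70, 0.0],
    [0.80, 0.90, 3.0], [0.20, 0.10, 0.0], [0.30, 0.20, 2.0], [0.90, 0.85, 4.0],
    [0.10, 0.30, 0.5], [0.25, 0.20, 0.0],
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

model = fit_meta_learner(train_x, train_y)

# Same base-model suitability, once inside and once far outside the range
p_inside = predict_meta(model, [0.9, 0.9, 0.0])
p_outside = predict_meta(model, [0.9, 0.9, 3.0])
print(p_inside > p_outside)
```

In this toy setting the meta-learner learns a negative weight for the distance feature, so a site that the base models rate as suitable receives a lower final prediction when it lies far outside the expert range. How strongly distance is penalised is estimated from the data rather than set in advance.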

Running the R code requires adjusting the file paths so that the data frame ("model_df.csv") and the two raster stacks ("predictor_stack_agg.tif" and "iucn_dists_agg.tif") can be loaded. The .tif raster files can be opened and viewed in any GIS software (e.g., QGIS) or in R using packages such as "raster" or "terra".
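For readers who want to inspect the CSV table outside R, a minimal Python sketch using only the standard library could look as follows. The function name iter_model_df is hypothetical, and the demonstration writes a tiny stand-in file with made-up values instead of reading the real 203 MB table. Only the three column names used here ("occ", "speces_latin", "iucn_dist") are taken from the variable description in this README.

```python
import csv
import os
import tempfile


def iter_model_df(path):
    """Yield one dict per row, converting numeric fields to float.

    Rows are yielded lazily so that a large table does not have to be
    held in memory all at once.
    """
    with open(path, newline="") as fh:
        for rec in csv.DictReader(fh):
            row = {}
            for key, value in rec.items():
                try:
                    row[key] = float(value)
                except (ValueError, TypeError):
                    row[key] = value  # keep text fields such as the species name
            yield row


# Stand-in file with made-up values; replace demo_path with the actual
# location of "model_df.csv" on your machine.
demo = (
    "occ,speces_latin,iucn_dist\n"
    "1,Genus species,0\n"
    "0,Genus species,12500.5\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model_df.csv")
    with open(demo_path, "w", newline="") as fh:
        fh.write(demo)
    rows = list(iter_model_df(demo_path))

print(len(rows), rows[0]["occ"], rows[1]["iucn_dist"])
```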

The CSV table containing the modeling data frame ("model_df.csv") is made up of the following variables (numbers correspond to column numbers, text in parentheses to variable names or naming structure):

1 ("occ"): Binary variable indicating whether an observation is a presence point (value = 1) or background point (value = 0)

2 ("speces_latin"): (Latin) binomial bat species name

3-22 ("class_*"): Land cover fractions of seven land cover classes, derived from a land cover map. Explanation of class names: "forest" = forest, "agriculture" = agricultural lands, "shrubs" = shrubland, "herbaceous" = herbaceous vegetation, "sparseopen" = sparse and open vegetation, "water" = water bodies, "ice" = ice and permanent snow. Variables were calculated at three scales (except for the "ice" class), indicated by variable name endings "small", "medium", or "large".

23-79 (CHELSA_bio*): Values of 19 bioclimatic predictor variables derived from the CHELSA climate dataset. For a description of the dataset and the bioclimatic variables see: https://chelsa-climate.org/bioclim/. Variables were calculated at three scales, indicated by variable name endings "small", "medium", or "large".

80-82 (NTL_*): "Night-time lights" variable, calculated at three scales, indicated by variable name endings "small", "medium", or "large".

83-85 (accessibility_*): "Accessibility" variable (travel time to cities), calculated at three scales, indicated by variable name endings "small", "medium", or "large".

86-88 (flii_*): "Forest landscape integrity index" variable, calculated at three scales, indicated by variable name endings "small", "medium", or "large".

89-91 (ghm_*): "Human modification" variable, calculated at three scales, indicated by variable name endings "small", "medium", or "large".

92-94 (karst_*): "Presence of karstifiable rock" variable, calculated at three scales, indicated by variable name endings "small", "medium", or "large".

95-121 ("index*_metric"): Spectral-temporal metrics derived from Landsat imagery. Three indices ("greenness", "brightness", and "wetness") were used to derive three metrics each (minimum, cumulative, and variability). All variables (index + metric combinations) were calculated at three scales, indicated by variable name endings "small", "medium", or "large".

122-124 ("tri_*"): "Topographic ruggedness index" variable, calculated at three scales, indicated by variable name endings "small", "medium", or "large".

125 ("iucn_dist"): (Euclidean) distance to the IUCN expert range (in m). A value of 0 indicates a point lying inside the expert range for that species; points with values > 0 lie outside it.
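The meaning of "iucn_dist" can be illustrated with a short Python sketch that counts, per species, how many presence points fall inside (distance 0) and outside (distance > 0) the expert range. The function name summarise_range_position and the records below are hypothetical; only the column names and the 0 versus > 0 convention come from the description above.

```python
from collections import defaultdict


def summarise_range_position(records):
    """Count presence points inside and outside the expert range per species."""
    counts = defaultdict(lambda: {"inside": 0, "outside": 0})
    for rec in records:
        if float(rec["occ"]) != 1:
            continue  # background points are skipped
        position = "inside" if float(rec["iucn_dist"]) == 0 else "outside"
        counts[rec["speces_latin"]][position] += 1
    return dict(counts)


# Made-up records; distances are in metres, as in the CSV table
records = [
    {"occ": 1, "speces_latin": "Genus species", "iucn_dist": 0},
    {"occ": 1, "speces_latin": "Genus species", "iucn_dist": 0},
    {"occ": 1, "speces_latin": "Genus species", "iucn_dist": 42000.0},
    {"occ": 0, "speces_latin": "Genus species", "iucn_dist": 150000.0},
]

summary = summarise_range_position(records)
print(summary)
```

A summary of this kind gives a first impression of how well an expert range agrees with the occurrence records before any model is fitted.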

For a detailed description of the predictor variables and their data sources, please refer to the paper.
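The column numbering listed above can also be written down as a small lookup table, which makes it easy to check that the ranges are contiguous and that the group sizes match the description (for example, 19 bioclimatic variables at three scales give 57 columns). The following Python sketch is an illustrative aid only. The names COLUMN_GROUPS, group_for_column, and is_contiguous are hypothetical, and the column numbers are 1-based as in the list above.

```python
# (first column, last column, variable name or naming pattern), 1-based
COLUMN_GROUPS = [
    (1, 1, "occ"),
    (2, 2, "speces_latin"),
    (3, 22, "class_*"),
    (23, 79, "CHELSA_bio*"),
    (80, 82, "NTL_*"),
    (83, 85, "accessibility_*"),
    (86, 88, "flii_*"),
    (89, 91, "ghm_*"),
    (92, 94, "karst_*"),
    (95, 121, "index*_metric"),
    (122, 124, "tri_*"),
    (125, 125, "iucn_dist"),
]


def group_for_column(number):
    """Return the naming pattern of the group a 1-based column belongs to."""
    for first, last, label in COLUMN_GROUPS:
        if first <= number <= last:
            return label
    raise ValueError(f"column {number} is outside the documented range")


def is_contiguous(groups):
    """Return True if the ranges cover 1..last without gaps or overlaps."""
    expected = 1
    for first, last, _ in groups:
        if first != expected or last < first:
            return False
        expected = last + 1
    return True


widths = {label: last - first + 1 for first, last, label in COLUMN_GROUPS}
print(is_contiguous(COLUMN_GROUPS), sum(widths.values()))
print(group_for_column(50), widths["CHELSA_bio*"], widths["index*_metric"])
```

When indexing rows read in Python, subtract 1 from these numbers, since Python sequences are 0-based.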