Data from: How far can I extrapolate my species distribution model? Exploring Shape, a novel method
Data files
Oct 31, 2023 version files (354.50 MB total):
- Experiment.zip (354.49 MB)
- README.md (9.78 KB)
Abstract
Species distribution and ecological niche models (hereafter SDMs) are popular tools with broad applications in ecology, biodiversity conservation, and environmental science. Many SDM applications require projecting models onto environmental conditions non-analogous to those used for model training (extrapolation), yielding predictions that may be statistically unsupported and biologically meaningless. We introduce Shape, a novel, model-agnostic method that calculates the extrapolation degree of a given projection data point from its multivariate distance to the nearest training data point. These distances are relativized by a factor that reflects the dispersion of the training data in environmental space. Unlike other approaches, Shape incorporates an adjustable threshold to control the binary discrimination between acceptable and unacceptable extrapolation degrees. We compared Shape’s performance to that of five extrapolation metrics based on their ability to detect analog environmental conditions in environmental space and to improve SDM suitability predictions. To do so, we used 760 virtual species to define different modeling conditions determined by species niche tolerance, distribution equilibrium condition, sample size, and algorithm. All algorithms had trouble predicting species niches. However, we found a substantial improvement in model predictions when model projections were truncated, regardless of the extrapolation metric used. Shape’s performance depended on the extrapolation threshold used to truncate models. Because of this versatility, our approach showed similar or better performance than previous approaches and could better deal with all modeling conditions and algorithms. Our extrapolation metric is simple to interpret, captures the complex shapes of the data in environmental space, and can use any extrapolation threshold to define whether model predictions are retained based on their extrapolation degrees.
These properties make this approach more broadly applicable than existing methods for creating and applying SDMs. We hope this method and the accompanying tools help modelers explore, detect, and reduce extrapolation errors to achieve more reliable models.
README: Virtual species and code used to run the experiment evaluating different extrapolation approaches in "How far can I extrapolate my species distribution model? Exploring Shape, a novel method"
https://doi.org/10.5061/dryad.r2280gbk5
Testing the performance of extrapolation metrics
To compare the performance of Shape (based on Mahalanobis and Euclidean distances) with that of existing extrapolation detection methods, i.e., MESS, EO, MOP, ExDet, and AOA, we used a virtual species approach. We tested the extrapolation metrics under different modeling conditions determined by species niche tolerance, distribution equilibrium condition, sample size (number of presences), and algorithm.
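The general idea behind a Shape-style metric (distance to the nearest training point, relativized by the dispersion of the training data, then compared against an adjustable threshold) can be sketched as follows. This is only an illustration in Python, whereas the experiment itself is written in R. The function names, the toy coordinates, and in particular the dispersion factor (here assumed to be the mean Euclidean distance of training points to their centroid) are assumptions; the published definition and its implementation may differ.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points in environmental space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dispersion(train):
    """ASSUMED dispersion factor: mean distance of training points to their centroid."""
    n = len(train)
    centroid = [sum(col) / n for col in zip(*train)]
    return sum(euclidean(p, centroid) for p in train) / n


def extrapolation_degree(point, train):
    """Distance to the nearest training point, relativized by the dispersion factor."""
    nearest = min(euclidean(point, t) for t in train)
    return nearest / dispersion(train)


def is_acceptable(point, train, threshold):
    """Binary discrimination controlled by an adjustable threshold."""
    return extrapolation_degree(point, train) <= threshold


# Hypothetical training data on two environmental axes (e.g. two PCA components).
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
inside = (1.0, 1.0)   # surrounded by training points
far = (10.0, 10.0)    # well outside the training cloud

for name, pt in [("inside", inside), ("far", far)]:
    degree = extrapolation_degree(pt, train)
    print(name, round(degree, 3), is_acceptable(pt, train, threshold=2.0))
```

Raising or lowering `threshold` changes which projection points are treated as acceptable, which is the adjustable behavior described in the abstract.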
We used a factorial experimental design in which all levels of the evaluated factors were combined; therefore, each extrapolation metric was tested on 4,560 SDMs (760 virtual species × 6 algorithms; Table 1). We used the same virtual species as Andrade et al. (2019) because their approach generates species distributions using realistic stochastic processes based on dispersal simulation and population dynamics at the cell level (see Appendix S2 and Andrade et al., 2019 for details about the virtual species approach).
Table 1. Combination of factor levels used in the experiment to test the performance of different extrapolation metrics.
Niche tolerance | Distribution condition | Sample size | Final number of species | Algorithm |
---|---|---|---|---|
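Broad | Equilibrium | 100 | 100 | GAM, GLM, GP, Maxent, RF, and SVM |
Broad | Equilibrium | 20 | 100 | GAM, GLM, GP, Maxent, RF, and SVM |
Broad | Non-equilibrium | 100 | 96 | GAM, GLM, GP, Maxent, RF, and SVM |
Broad | Non-equilibrium | 20 | 100 | GAM, GLM, GP, Maxent, RF, and SVM |
Narrow | Equilibrium | 100 | 92 | GAM, GLM, GP, Maxent, RF, and SVM |
Narrow | Equilibrium | 20 | 97 | GAM, GLM, GP, Maxent, RF, and SVM |
Narrow | Non-equilibrium | 100 | 84 | GAM, GLM, GP, Maxent, RF, and SVM |
Narrow | Non-equilibrium | 20 | 91 | GAM, GLM, GP, Maxent, RF, and SVM |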
GAM: Generalized Additive Model; GLM: Generalized Linear Model; GP: Gaussian Process; Maxent: Maximum Entropy; RF: Random Forest; SVM: Support Vector Machine.
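The totals quoted for the factorial design can be checked directly from Table 1. The short Python sketch below (illustrative only; the variable names are not taken from the experiment's R scripts) sums the final number of species per factor combination and multiplies by the number of algorithms.

```python
# Final number of species per (niche tolerance, distribution condition, sample size),
# copied from Table 1.
species_per_condition = {
    ("Broad", "Equilibrium", 100): 100,
    ("Broad", "Equilibrium", 20): 100,
    ("Broad", "Non-equilibrium", 100): 96,
    ("Broad", "Non-equilibrium", 20): 100,
    ("Narrow", "Equilibrium", 100): 92,
    ("Narrow", "Equilibrium", 20): 97,
    ("Narrow", "Non-equilibrium", 100): 84,
    ("Narrow", "Non-equilibrium", 20): 91,
}
algorithms = ["GAM", "GLM", "GP", "Maxent", "RF", "SVM"]

total_species = sum(species_per_condition.values())
total_sdms = total_species * len(algorithms)

print(total_species, total_sdms)  # 760 4560
```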
Extrapolation values in environmental and geographical spaces can be used to explore the relationship between the degree of extrapolation and suitability values. We used truncation to measure how much model performance can be improved using different extrapolation metrics. The improvement in SDM prediction was measured by Root Mean Square Error (RMSE) using worldwide cells (n=584,521), comparing the values of the known niche of each virtual species with the suitability predicted by SDMs before and after model truncation (i.e., projection data assumed to be unacceptably extrapolative were assigned a suitability value of 0).
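The truncation-and-RMSE comparison described above can be sketched as follows. This is a minimal Python illustration, not the experiment's R code: the function names, the threshold, and the four-cell toy vectors are hypothetical and stand in for the worldwide grid of cells.

```python
import math


def rmse(known, predicted):
    """Root Mean Square Error between known niche values and predicted suitability."""
    return math.sqrt(sum((k - p) ** 2 for k, p in zip(known, predicted)) / len(known))


def truncate(suitability, extrapolation, threshold):
    """Assign suitability 0 to cells whose extrapolation degree exceeds the threshold."""
    return [0.0 if e > threshold else s for s, e in zip(suitability, extrapolation)]


# Hypothetical values for four cells.
known = [0.9, 0.8, 0.0, 0.0]        # known niche of a virtual species
predicted = [0.8, 0.7, 0.6, 0.5]    # SDM suitability before truncation
extrap = [0.1, 0.3, 2.5, 4.0]       # extrapolation degree per cell

truncated = truncate(predicted, extrap, threshold=1.0)

rmse_before = rmse(known, predicted)
rmse_after = rmse(known, truncated)
print(rmse_before > rmse_after)
```

In this toy case the two strongly extrapolative cells are overpredicted by the model, so setting them to 0 lowers the RMSE; with real data, truncation can also remove correct predictions, which is why the choice of threshold matters.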
For further details, see https://doi.org/10.1111/ecog.06992
Data files
Data is structured in three folders.
1) Folder functions contains R scripts with extra functions used to run the experiment.
2) Folder VirtualSpecies contains the following system of subfolders. All raster and tabular files are in GeoTIFF (.tif) and .txt format, respectively.
The Broad and Narrow folders contain data on species with broad and narrow niches, respectively.
Within each of those folders are the following subfolders:
│   ├── Adeq (rasters with the species' fundamental niches)
│   ├── Distribution
│   │   ├── Equilibrium (rasters with species distributions under equilibrium conditions, plus txt files with presences, pseudo-absences, and background points)
│   │   │   └── Model_perf (txt files with the SDM and extrapolation performance for each species, and the AOA thresholds)
│   │   └── Nonequilibrium (the same files, but for species distributions under non-equilibrium conditions)
│   │       └── Model_perf
└────── Env (rasters with the first and second principal components of the 19 bioclimatic variables used to define species niches and create SDMs)
A summarised folder and file organization is presented below:
./VirtualSpecies/
├── Broad
│   ├── Adeq
│   │   ├── sp1.tif
│   │   └── sp100.tif
│   ├── Distribution
│   │   ├── Equilibrium
│   │   │   ├── 0_backg_100.txt
│   │   │   ├── 0_backg_20.txt
│   │   │   ├── 0_prab_100.txt
│   │   │   ├── 0_prab_20.txt
│   │   │   ├── 1_prab_100.txt
│   │   │   ├── 1_prab_20.txt
│   │   │   ├── Distrib_sp1.tif
│   │   │   ├── Distrib_sp100.tif
│   │   │   └── Model_perf
│   │   │       ├── aoa_threshold_100_sp99.txt
│   │   │       ├── aoa_threshold_20_sp99.txt
│   │   │       ├── aoa_threshold_glm_100_sp99.txt
│   │   │       ├── aoa_threshold_glm_20_sp1.txt
│   │   │       ├── aoa_threshold_glm_20_sp99.txt
│   │   │       ├── extraperf_100_sp1.txt
│   │   │       ├── extraperf_100_sp100.txt
│   │   │       ├── extraperf_20_sp1.txt
│   │   │       ├── extraperf_20_sp100.txt
│   │   │       ├── pred_20_sp1.gz
│   │   │       ├── sdmperf_100_sp1.txt
│   │   │       ├── sdmperf_100_sp100.txt
│   │   │       ├── sdmperf_20_sp1.txt
│   │   │       └── sdmperf_20_sp100.txt
│   │   └── Nonequilibrium
│   │       ├── 0_backg_100.txt
│   │       ├── 0_backg_20.txt
│   │       ├── 0_prab_100.txt
│   │       ├── 0_prab_20.txt
│   │       ├── 1_prab_100.txt
│   │       ├── 1_prab_20.txt
│   │       ├── Distrib_sp1.tif
│   │       ├── Distrib_sp100.tif
│   │       └── Model_perf
│   │           ├── aoa_threshold_100_sp99.txt
│   │           ├── aoa_threshold_20_sp99.txt
│   │           ├── extraperf_100_sp1.txt
│   │           ├── extraperf_100_sp100.txt
│   │           ├── extraperf_20_sp1.txt
│   │           ├── extraperf_20_sp100.txt
│   │           ├── sdmperf_100_sp1.txt
│   │           ├── sdmperf_100_sp100.txt
│   │           ├── sdmperf_20_sp1.txt
│   │           └── sdmperf_20_sp100.txt
│   └── Env
│       └── PCA_2PC
│           ├── PCA_1.tif
│           └── PCA_2.tif
└── Narrow
    ├── Adeq
    │   ├── sp1.tif
    │   └── sp100.tif
    └── Distribution
        ├── Equilibrium
        │   ├── 0_backg_100.txt
        │   ├── 0_backg_20.txt
        │   ├── 0_prab_100.txt
        │   ├── 0_prab_20.txt
        │   ├── 1_prab_100.txt
        │   ├── 1_prab_20.txt
        │   ├── Distrib_sp1.tif
        │   ├── Distrib_sp100.tif
        │   └── Model_perf
        │       ├── aoa_threshold_100_sp99.txt
        │       ├── aoa_threshold_20_sp99.txt
        │       ├── extraperf_100_sp1.txt
        │       ├── extraperf_100_sp100.txt
        │       ├── extraperf_20_sp1.txt
        │       ├── extraperf_20_sp100.txt
        │       ├── sdmperf_100_sp1.txt
        │       ├── sdmperf_100_sp100.txt
        │       ├── sdmperf_20_sp1.txt
        │       └── sdmperf_20_sp100.txt
        └── Nonequilibrium
            ├── 0_backg_100.txt
            ├── 0_backg_20.txt
            ├── 0_prab_100.txt
            ├── 0_prab_20.txt
            ├── 1_prab_100.txt
            ├── 1_prab_20.txt
            ├── Distrib_sp1.tif
            ├── Distrib_sp100.tif
            └── Model_perf
                ├── aoa_threshold_glm_100_sp99.txt
                ├── aoa_threshold_glm_20_sp99.txt
                ├── extraperf_100_sp100.txt
                ├── extraperf_20_sp1.txt
                ├── sdmperf_100_sp1.txt
                ├── sdmperf_100_sp100.txt
                ├── sdmperf_20_sp1.txt
                └── sdmperf_20_sp100.txt
3) Folder Performance contains the txt files summarizing extrapolation (performance_ext.txt) and SDM (performance_sdm.txt) performance. The txt files with the results of Tukey's HSD test (prefixed post_hoc_) are also provided.
./Performance/
├── performance_best.txt
├── performance_best_2023.txt
├── performance_best_ONLY_Shapes.txt
├── performance_ext.txt
├── performance_ext_tabular.txt
├── performance_ext_tabular_ONLY_Shapes.txt
├── performance_rmse_ins_out_niche.txt
├── performance_sdm.txt
├── post_hoc_gam.txt
├── post_hoc_gau.txt
├── post_hoc_glm.txt
├── post_hoc_max.txt
├── post_hoc_raf.txt
└── post_hoc_svm.txt
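When working with the Model_perf folders listed earlier, the file names can be split into their parts programmatically. The Python sketch below is illustrative only and rests on two assumptions inferred from the listing and Table 1, not stated in the README: that the number after the prefix (100 or 20) is the sample size, and that the number after "sp" is the virtual species identifier. The function name is hypothetical.

```python
import re

# ASSUMED naming scheme: <kind>_<sample size>_sp<species id>.txt
PATTERN = re.compile(r"^(?P<kind>[a-z_]+?)_(?P<sample_size>\d+)_sp(?P<species>\d+)\.txt$")


def parse_model_perf_name(filename):
    """Return kind, sample size, and species id, or None if the name does not fit."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "sample_size": int(match.group("sample_size")),
        "species": int(match.group("species")),
    }


# File names taken from the listing above.
for name in ["extraperf_100_sp1.txt", "aoa_threshold_glm_20_sp99.txt", "pred_20_sp1.gz"]:
    print(name, parse_model_perf_name(name))
```

Names that do not follow the assumed scheme, such as pred_20_sp1.gz or 0_backg_100.txt, return None instead of raising an error, so the function can be mapped over a whole directory listing.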
Code/Software
All code is written in R v.4.3.1.
Unzip the Experiment.zip file and open the extrapolation_peformance.Rproj R project to make the code easier to run. The code is organized into two scripts.
1_run_experiment_2023.R
This script contains the code for: i) sampling presences, pseudo-absences, and background points of virtual species; ii) partitioning the data using spatially structured partition approaches; iii) fitting SDMs; iv) evaluating model extrapolation with the different approaches; and v) validating the performance of the different extrapolation metrics.
In this script, the following packages were used:
Package | Version |
---|---|
CAST | 0.8.1 |
caret | 6.0.94 |
doParallel | 1.0.17 |
dplyr | 1.1.3 |
dsmextra | 1.1.5 |
ecospat | 4.0.0 |
flexsdm | 1.3.3 |
ggplot2 | 3.4.3 |
kuenm | 1.1.10 |
parallel | 4.3.1 |
patchwork | 1.1.3 |
raster | 3.6.23 |
readr | 2.1.4 |
terra | 1.7.46 |
2_data_analysis_2023.R
This script contains the code for: i) compiling the txt files with SDM and extrapolation performance into different tables; ii) performing exploratory data analysis; iii) creating figures of extrapolation performance based on box plots and bar plots; and iv) fitting GAMLSS and performing Tukey's HSD test.
In this script, the following packages were used:
Package | Version |
---|---|
dplyr | 1.1.3 |
emmeans | 1.8.8 |
fishualize | 0.2.3 |
gamlss | 5.4.18 |
ggplot2 | 3.4.3 |
ggrepel | 0.9.3 |
magrittr | 2.0.3 |
multcomp | 1.4.25 |
patchwork | 1.1.3 |
progress | 1.2.2 |
readr | 2.1.4 |
stringr | 1.5.0 |
tidyr | 1.3.0 |
Methods
The experiment was based on virtual species created with the same protocol as Andrade et al. (2019). The code used to run the experiment is also provided.