Data from: Cross-validation matters in species distribution models: a case study with goatfish species
Data files
Sep 05, 2024 version files, 187.97 MB
- raw_data.zip
- README.md
- scripts.zip
Abstract
In an era of ongoing biodiversity loss, it is critical to map biodiversity patterns in space and time to better inform conservation and management. Species distribution models (SDMs) are widely applied in such biodiversity assessments. Cross-validation is a prevalent approach to assess the discrimination capacity of a target SDM algorithm and determine its optimal parameters. Several alternative cross-validation methods exist; however, the influence of choosing a specific cross-validation method on SDM performance and predictions remains unresolved. Here, we tested the performance of random versus spatial cross-validation methods for SDMs using goatfishes (Actinopteri: Syngnathiformes: Mullidae), which are recognized as indicator species for coastal waters, as a case study. Our results showed that the random and spatial cross-validation methods produced different optimal model parameterizations for 57 of the 60 modeled species. Predictive performance differed significantly between the two methods, and they yielded different present-day spatial distributions and future projections of goatfishes under climate change exposure. Despite these disparities, both approaches consistently identified the Indo-Australian Archipelago as the hotspot of goatfish species richness and as the area most vulnerable to climate change. Our findings highlight that the choice of cross-validation method is an overlooked source of uncertainty in SDM studies, while the consistency in richness predictions underscores the usefulness of SDMs in marine conservation. Together, these findings emphasize that special attention should be paid to the selection of cross-validation methods in SDM studies.
Methods
According to best-practice standards (Araújo et al., 2019; Feng et al., 2019), constructing an SDM requires due attention to hyperparameter optimization of the modeling algorithms in order to maximize predictive performance. Cross-validation is a key approach for comparing the predictive performance of competing models with different hyperparameters, and hence for determining the optimal parameter configuration (Araújo & Guisan, 2006; Hijmans, 2012; Guisan et al., 2017). Taking the widely used five-fold cross-validation approach as an example, 80% of the data is used for model training and the remaining 20% is withheld for model validation; this step is repeated five times, rotating which fold serves as the validation set. To date, most SDM studies have adopted this random cross-validation strategy for model evaluation during hyperparameter optimization (Guisan et al., 2017; Roberts et al., 2017). Recently, however, researchers have argued that random cross-validation ignores spatial autocorrelation between the training and validation datasets, especially when data is temporally or spatially structured (Roberts et al., 2017; Valavi et al., 2019); thus, random cross-validation may affect model parameter configuration (e.g., Roberts et al., 2017; Valavi et al., 2019) and often results in the overestimation of predictive performance (e.g., Veloz, 2009; Guillaumot et al., 2019). To address this issue, researchers have proposed a spatial cross-validation strategy, in which species distribution data is split into spatial blocks (see detailed descriptions in Roberts et al., 2017; Valavi et al., 2019). The spatial cross-validation strategy captures the spatial heterogeneity in distribution data, reducing spatial autocorrelation issues. This approach is implemented in several R packages, such as ENMeval (Kass et al., 2021) and blockCV (Valavi et al., 2019).
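The contrast between the two strategies can be sketched as follows. This is a minimal illustrative example, not the authors' code (which used R packages such as blockCV): a random five-fold split assigns each occurrence record to a fold independently, whereas a spatial-block split first grids the coordinates into blocks of an assumed size (here 5 degrees) and assigns whole blocks to folds, so that spatially close, autocorrelated records end up in the same fold. All function and variable names are hypothetical.

```python
import random

def random_folds(n_points, k=5, seed=0):
    """Random k-fold CV: assign each record independently to a fold."""
    rng = random.Random(seed)
    folds = [i % k for i in range(n_points)]  # balanced fold sizes
    rng.shuffle(folds)
    return folds

def spatial_block_folds(lons, lats, block_size=5.0, k=5):
    """Spatial-block CV sketch: grid coordinates into square blocks
    (in degrees) and assign whole blocks, not individual points, to
    folds, so nearby points always share a fold."""
    block_to_fold = {}
    folds = []
    for lon, lat in zip(lons, lats):
        block = (int(lon // block_size), int(lat // block_size))
        if block not in block_to_fold:
            # round-robin assignment of new blocks to folds
            block_to_fold[block] = len(block_to_fold) % k
        folds.append(block_to_fold[block])
    return folds

# Two records 0.1 degrees apart fall in the same spatial block,
# so spatial blocking never splits them across training/validation;
# a random split may place them in different folds.
lons = [0.1, 0.2, 10.1, 20.3]
lats = [0.1, 0.2, 10.1, 20.3]
print(spatial_block_folds(lons, lats))  # first two entries are equal
```

In practice, the block size should reflect the range of spatial autocorrelation in the predictors, which packages such as blockCV can estimate from the data.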