
Data from: Effects of input data sources on species distribution model predictions across species with different distributional ranges


Arenas-Castro, Salvador (2022), Data from: Effects of input data sources on species distribution model predictions across species with different distributional ranges, Dryad, Dataset,


Species distribution models (SDMs) are a popular tool in theoretical and quantitative ecology, and constitute the most widely used modelling framework in global change science and biodiversity conservation. As main data sources, SDMs require georeferenced biodiversity observations as a response or dependent variable (e.g. species occurrence, species richness, etc.) and geographic layers of environmental information as predictors or independent variables (e.g. climate, land cover, vegetation indices derived from remote sensing, etc.). However, although SDMs have become one of the most important quantitative tools for regular and timely biodiversity assessments worldwide, these techniques are still subject to different sources of uncertainty that have been unequally assessed. Although uncertainty related to niche-based or distribution-based models has been addressed at different stages of the modelling process, analyses of how uncertainty from alternative data sources affects the predictive ability of SDMs remain limited.

Citizen-collected species occurrence data (e.g. eBird) are often used for fitting SDMs when data from standardized and expert-supported surveys (e.g. Atlases) are unavailable. In addition, macroclimate variables are much more commonly used as predictors in SDMs than information derived from remote sensing data. We assessed the effects of using different data sources (for both response and predictor variables) on SDM performance across a wide range of bird species with contrasting distributional ranges in the Iberian Peninsula (Portugal and Spain). To do so, we implemented an SDM ensemble-forecasting approach using bird data from two different data sources: the semi-structured eBird project and standardized Atlases. We fitted SDMs with three predictor types: macroclimate, remotely sensed ecosystem functional attributes (EFAs) from vegetation indices, and their combination. Species were grouped into four range size classes. We also used different evaluation metrics to better assess the uncertainty of model predictions. We then applied generalized linear mixed-effects models to test the effect of input data source on model performance across distributional range sizes while accounting for different accuracy metrics. Pairwise comparisons between range projections were used to assess their spatial similarity.

Our models demonstrated the usefulness and complementarity of different input data sources when modelling species distributions across different distributional ranges. Citizen science and remote sensing data help to update knowledge of the distribution of the most threatened bird species by increasing model accuracy. These findings highlight the need to integrate different data sources to improve model predictions at the regional scale. Our framework also underlines that model uncertainty should be examined more exhaustively at early stages of the modelling process.

To perform and replicate this study, this dataset provides all the files (as tables) needed to fit SDMs: i) the Iberian bird species occurrences at 10-km UTM squares as a response or dependent variable; ii) the geographic layers of environmental information at 10-km UTM squares for the Iberian Peninsula as predictors or independent variables, namely climate data, ecosystem functioning attributes (EFAs), and the combined climate and EFA data. The dataset is provided as four *.csv files named:

1) The_Iberian_bird_species_occurrences_dataset_10km.csv

2) CHELSA_bioclimate_variables_IP10km.csv

3) MODIS_EVI-based_EFAs_IP10km.csv

4) Combined_bioclimate_EFA_dataset_IP10km.csv

For a more detailed description of the main dataset and each of these subdatasets, please refer to the attached README file.
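
As a starting point for working with these files, the following is a minimal sketch (not part of the original dataset) that reads whichever of the four tables are present in a local folder and reports their row counts and column headers. The file names come from the list above; the function names, the dictionary keys, and the comma delimiter are assumptions, and the actual column names should be taken from the README file.

```python
import csv
from pathlib import Path

# File names as listed in the dataset description.
DATASET_FILES = {
    "occurrences": "The_Iberian_bird_species_occurrences_dataset_10km.csv",
    "climate": "CHELSA_bioclimate_variables_IP10km.csv",
    "efa": "MODIS_EVI-based_EFAs_IP10km.csv",
    "combined": "Combined_bioclimate_EFA_dataset_IP10km.csv",
}


def load_tables(data_dir, delimiter=","):
    """Read each available CSV into a list of dicts keyed by column header.

    The delimiter is an assumption; change it if the files use another one.
    """
    tables = {}
    for key, filename in DATASET_FILES.items():
        path = Path(data_dir) / filename
        if not path.exists():
            continue  # skip files that have not been downloaded
        with path.open(newline="", encoding="utf-8") as handle:
            tables[key] = list(csv.DictReader(handle, delimiter=delimiter))
    return tables


def summarize(tables):
    """Return {table name: (row count, column names)} as a quick sanity check."""
    return {
        key: (len(rows), list(rows[0].keys()) if rows else [])
        for key, rows in tables.items()
    }
```

Calling `summarize(load_tables("path/to/download"))` before any modelling is a cheap way to confirm that the tables were read with the expected headers.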

Keywords: bird atlas, eBird data, ecosystem functional attributes (EFAs), Iberian Peninsula, IUCN categories, model accuracy, MODIS EVI, narrow-ranged species, remote sensing, species distribution models (SDMs), widespread species


Methods for processing the data: 

1) The Iberian bird species occurrences dataset (10km): Bird occurrence data were collected from two different biodiversity data sources: i) a standardized dataset based on national Bird Atlases (Atlas); and ii) a citizen science (i.e. non-standardized) dataset based on the EOD - eBird Observation Dataset from the Global Biodiversity Information Facility – GBIF (eBird). To reduce potential geographical errors in species records, which can strongly influence model results, we filtered the original dataset, removing duplicates and positional/spatial errors such as outliers using R and QGIS, and correcting nomenclature errors and taxa misidentifications with the support of expert knowledge. In addition, we harmonized the species records in grid cells with a resolution higher than 10 km before the modelling procedures. We adjusted the eBird data to the spatial resolution of both Atlases (10-km UTM square) to standardize the input data and make both datasets comparable, which also matches the spatial resolution of the predictor variables (10-km UTM square). We also matched the eBird data to the years and months covered by the Atlas data (1999-2012), from late February to mid-August, the breeding season in the Iberian Peninsula. Additionally, to perform subsequent comparisons, we built a full dataset from the combination of the Atlas and eBird datasets. This full dataset represents the best knowledge of bird distribution in Iberia, as it includes all occurrences recorded in the two datasets (Species name, Species code, and Data source are provided). To assess the effect of species distributional range (from narrow-ranged to widespread species) on modelling performance, we grouped the species records from the full dataset into four size classes based on the number of occurrences at 10-km UTM squares: i) Class I (10 – 100); ii) Class II (101 – 500); iii) Class III (501 – 1000); iv) Class IV (> 1000).
We also considered the conservation categories of species in the IUCN Red List to assess the potential effects of model uncertainty on decision-making for bird conservation. Because the full dataset represents the actual distributional range of each species, each species (n = 236) always belongs to the same range size class. The number of 10-km UTM squares and the accuracy metrics (AUC and BoyceI) for each species and each type of predictor (climate, EFAs, and combined climate+EFAs) are also provided.
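
The temporal matching, duplicate removal and range size classification described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' R/QGIS workflow: the record layout (species, square identifier, date) and all function names are hypothetical, and the exact window limits (20 February to 15 August) are assumed, since the text only says "late February to mid-August". The class boundaries are those given in the text.

```python
from datetime import date

# Range size classes by number of occupied 10-km UTM squares, as given in the text.
RANGE_CLASSES = [("I", 10, 100), ("II", 101, 500), ("III", 501, 1000)]


def range_class(n_squares):
    """Return the range size class, or None for fewer than 10 squares."""
    if n_squares > 1000:
        return "IV"
    for label, low, high in RANGE_CLASSES:
        if low <= n_squares <= high:
            return label
    return None


def in_breeding_window(day, start=(2, 20), end=(8, 15), years=(1999, 2012)):
    """True if `day` falls within the Atlas years and the breeding season.

    The (month, day) limits are assumptions, not values from the dataset.
    """
    if not years[0] <= day.year <= years[1]:
        return False
    return start <= (day.month, day.day) <= end


def occupied_squares(records):
    """Count distinct 10-km squares per species after date filtering.

    `records` is an iterable of (species, square_id, date) tuples; storing
    squares in a set removes duplicate species/square records.
    """
    squares = {}
    for species, square_id, day in records:
        if in_breeding_window(day):
            squares.setdefault(species, set()).add(square_id)
    return {species: len(cells) for species, cells in squares.items()}
```

For example, `range_class(occupied_squares(records)["some_species"])` would assign a class to a species from its filtered records, where `"some_species"` stands for whatever species identifier the records use.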

2) CHELSA bioclimate variables (IP-10km): Nineteen (bio-)climate predictors derived from monthly temperature and rainfall data were obtained for historical conditions (1979 to 2013) from the CHELSA V1.2 database at a spatial resolution of 30 arc-sec (~1-km pixel size) and later aggregated to 10-km UTM squares. Based on the pairwise correlations and the collinearity assessment of the initial nineteen (bio-)climate variables, three temperature-related variables (bio4: Temperature Seasonality; bio8: Mean Temperature of Wettest Quarter; bio9: Mean Temperature of Driest Quarter) and two precipitation variables (bio16: Precipitation of Wettest Quarter; bio17: Precipitation of Driest Quarter) were selected to build this climate dataset for the Iberian Peninsula.
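
The description does not state which correlation threshold or collinearity procedure was used, so the following is only a generic sketch of pairwise-correlation screening, not the authors' method. It keeps a variable when its absolute Pearson correlation with every previously kept variable is below a cutoff; the function names, the greedy order-dependent strategy, and the 0.7 cutoff are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_collinear(variables, threshold=0.7):
    """Greedily keep variables with |r| below `threshold` against all kept ones.

    `variables` maps a variable name to its values across grid cells.
    Earlier entries are preferred, so the result depends on insertion order.
    The 0.7 cutoff is an assumed value, not one taken from the dataset.
    """
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter is greedy, placing ecologically preferred variables first in the mapping determines which member of a correlated pair survives.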

3) MODIS EVI-based EFAs (IP-10km): Satellite remote sensing data from the Moderate Resolution Imaging Spectroradiometer (MODIS) on board the Terra satellite platform were used to derive remotely sensed ecosystem functioning attributes (EFAs) that characterize species habitat dynamics as a counterpoint/complement to climate data. To compute EFAs, we used the MODIS Enhanced Vegetation Index (EVI) (MOD13Q1.v006; 232 m pixel every 16 days) as a proxy for vegetation greenness, biomass, and leaf area index for the 2000–2012 period. We used the Google Earth Engine (GEE) cloud-based platform to derive an initial set of eleven metrics of the EVI seasonal dynamics. These statistical measures were calculated for each complete year. To capture the multi-year normal conditions of each EFA variable, thus reducing the effect of stochastic interannual climatic fluctuations, we computed the overall mean across years. EFAs were exported from GEE at a final spatial resolution of 1-km squares and later aggregated to 10-km UTM squares. As with the climate variables, and based on the pairwise correlations and the collinearity assessment, we selected five remotely sensed EFAs as descriptors of species habitat dynamics for the Iberian Peninsula: EVI annual mean (EVImean, as a surrogate for the annual total amount of primary production), EVI annual minimum (EVImin, as an indicator of the annual extremes), EVI seasonal standard deviation (EVIsd, as a descriptor of variation between seasons), and dates of maximum (EVIdmax) and minimum (EVIdmin) EVI (indicators of phenology, i.e. the growing season).
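
To make the five selected EFAs concrete, the sketch below computes them for a single pixel from one year of EVI composites and then averages the yearly values. It is a simplified illustration in plain Python, not the Google Earth Engine workflow used for the dataset: the function names and input layout are hypothetical, the use of the population standard deviation is an assumption, and the dates of maximum and minimum are averaged linearly even though day-of-year is a circular quantity.

```python
from statistics import mean, pstdev


def annual_efas(evi, days):
    """Summarize one year of EVI composites for a single pixel.

    `evi` holds the EVI values and `days` the matching day of year of each
    16-day composite.
    """
    i_max = max(range(len(evi)), key=evi.__getitem__)
    i_min = min(range(len(evi)), key=evi.__getitem__)
    return {
        "EVImean": mean(evi),
        "EVImin": min(evi),
        "EVIsd": pstdev(evi),  # population SD (assumed); sample SD is equally plausible
        "EVIdmax": days[i_max],
        "EVIdmin": days[i_min],
    }


def multi_year_mean(yearly):
    """Average each metric over years to approximate multi-year normal conditions."""
    return {name: mean(year[name] for year in yearly) for name in yearly[0]}
```

Passing the list of per-year dictionaries for 2000–2012 to `multi_year_mean` gives one value per EFA and pixel, which would then be aggregated spatially to the 10-km grid.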

4) Combined bioclimate and EFA dataset (IP-10km): This file contains the five most significant (and uncorrelated) predictors selected from the previous partial climate models (Temperature Seasonality - bio4, Precipitation of Wettest Quarter - bio16, Precipitation of Driest Quarter - bio17) and EFA-based models (EVI annual mean - EVImean, EVI seasonal standard deviation - EVIsd), combined as CLIM_EFAs, at 10-km pixel size for the Iberian Peninsula.


Fundação para a Ciência e a Tecnologia, Award: POCI-01-0145-FEDER-022127

The Spanish Ministry of Universities and the EU-NextGenerationEU fund, Award: GA.IV.9H.21.02 541A.C.RH 649.00