Data from: Integrated species distribution models to account for sampling biases and improve range wide occurrence predictions
Abstract
Aim
Species distribution models (SDMs) that integrate presence-only and presence-absence data offer a promising avenue to improve information on species’ geographic distributions. The use of such ‘integrated SDMs’ on a species range-wide extent has been constrained by the often-limited presence-absence data and by the heterogeneous sampling of the presence-only data. Here, we evaluate integrated SDMs for studying species ranges with a novel expert range map-based evaluation. We build a new understanding about how integrated SDMs address issues of estimation accuracy and data deficiency and thereby offer advantages over traditional SDMs.
Location
South and Central America.
Time period
1979-2017.
Major taxa studied
Hummingbirds.
Methods
We build integrated SDMs by linking two observation models – one for each data type – to the same underlying spatial process. We validate SDMs with two schemes: i) cross-validation with presence-absence data and ii) comparison with respect to the species’ whole range as defined with IUCN range maps. We also compare models relative to the estimated response curves and compute the association between the benefit of the data integration and the number of presence records in each data set.
Results
The integrated SDM accounting for the spatially varying sampling intensity of the presence-only data was one of the top-performing models in both model validation schemes. Presence-only data alleviated overly large niche estimates, and data integration was beneficial compared to modelling solely presence-only data for species that had few presence points when predicting the species’ whole range. On the community level, integrated models improved the species richness prediction.
Main conclusions
Integrated SDMs combining presence-only and presence-absence data successfully borrow strength from both data types and offer improved predictions of species’ ranges. Integrated SDMs can potentially alleviate the impacts of taxonomically and geographically uneven sampling and leverage the detailed sampling information in presence-absence data.
README
This data set is a collection of publicly available species and environmental data. The data have been used to study populations and distributions of species and their associations with the environment. To avoid risks to species populations, coordinates of the presence records for three species with near-threatened IUCN status and all checklist coordinates have been rounded to 0.01 degrees. The species are Little Woodstar (Chaetocercus bombus), Black-thighed Puffleg (Eriocnemis derbyi) and Hoary Puffleg (Haplophaedia lugens).
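As an illustration of the rounding described above, the following is a minimal sketch in base R. The records are made up for the example, and the column names `lat` and `lon` are borrowed from the point-record files in this data set; this is not the script used to prepare the repository.

```r
# Hypothetical presence records; the values are invented for illustration.
records <- data.frame(
  lat = c(-0.123456, 4.987654),
  lon = c(-78.456789, -75.012345)
)

# Round both coordinate columns to two decimal places (0.01 degrees).
rounded <- records
rounded$lat <- round(records$lat, 2)
rounded$lon <- round(records$lon, 2)

# Largest displacement introduced by the rounding, in degrees.
max_shift <- max(abs(rounded$lat - records$lat),
                 abs(rounded$lon - records$lon))
```

Rounding to 0.01 degrees moves a point by at most 0.005 degrees along each axis, which is roughly half a kilometer in latitude.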
Description of the data and file structure
- File 1 Name: Cloud_cover/MODCF_intraannualSD_resampled_masked_NA30x30_americas.tif
- File 1 Description: Intra-annual variability of cloud cover in 1 square kilometer resolution over the study area.
- Folder 2 Name: Domains_new/
- Folder 2 Description: Folder contains a GeoTIFF file for each species showing the study domain, which is a buffer around the species' expert range map and point records.
- File 3 Name: Elevation_range/topography_elevation_1KMmi_GMTEDmi_NA30x30_americas.tif
- File 3 Description: Areal elevation from sea level in meters in 1 square kilometer resolution over the study area.
- File 4 Name: Environment/Chelsa_SA.tif
- File 4 Description: Chelsa Bioclim (v. 1.2) of annual mean temperature (cov1), mean annual diurnal range (cov2), mean annual precipitation sum (cov3) and precipitation seasonality (cov4) in 1 square kilometer resolution.
- File 5 Name: EVI/Annual_EVI_resampled_NA30x30_americas.tif
- File 5 Description: Enhanced vegetation index in 1 square kilometer resolution.
- File 6 Name: PA_observations/Juan_parra_checklists/Sites_8Feb2011.csv
- File 6 Description: List of checklist locations.
- Columns:
- Communities_IDComm: identifying number of communities
- CommunityName: name of communities
- LatDecDeg: latitude in geographical coordinates
- LongDecDeg: longitude in geographical coordinates
- Country: country
- MinElevation_m: minimum elevation from sea level in meters
- MaxElevation_m: maximum elevation from sea level in meters
- AreaSampled: sampled area in square meters
- AllSpp: number of species recorded
- SppWithPhy: number of species recorded with phylogenetic information
- SppWithMorph: number of species recorded with morphological information
- File 7 Name: PA_observations/Juan_parra_checklists/SpeciesxSite8Feb2011.csv
- File 7 Description: List of species sightings in the checklist locations.
- Columns:
- Communities_IDComm: identifying number of communities
- Spname: Species name
- CommunityName: name of communities
- Country: country
- SpID: identifying number of species
- Folder 8 Name: PO_observations_non_thinned/
- Folder 8 Description: Folder contains species point records as species-specific CSV files.
- Columns:
- lat: latitude in geographical coordinates
- lon: longitude in geographical coordinates
- Folder 9 Name: Study_area/
- Folder 9 Description: Folder contains the study area (a rectangle whose extent matches the union of all species domains) in geographical (WGS 84) and planar (LAEA) coordinate systems.
- File 10 Name: TRI/tri_1KMmd_GMTEDmd_resampled_masked_NA30x30_americas.tif
- File 10 Description: Topographical ruggedness index in 1 square kilometer resolution.
- Folder 11 Name: Compiled_data_v9/
- Folder 11 Description: Folder contains two .RData files for two example species, Coeligena lutetiae (Buff-winged Starfrontlet) and Ensifera ensifera (Sword-billed Hummingbird). The .RData files contain compiled data for the species: presence-only, presence-absence and background point information with environmental associations.
- File 12 Name: Compiled_data_v9/'species_name'_not_thin...
- File 12 Description: .RData file contains species and environmental data along with prediction data across the whole study domain:
- env_data:
- covariates: environmental covariates over the whole study area
- coordinates: cell coordinates over the whole study area
- min_coordinates: minimum coordinates of the study area
- offsets: possible offsets computed from expert range maps and species' elevational limits
- ind_na: index for study cells which are not used for range predictions, such as sea areas
- PO_training_data (presence points data):
- covariates: environmental covariates over the presence and background points
- coordinates: cell coordinates over the presence and background points
- cov_mean: mean of each covariate for covariate standardization
- cov_sd: standard deviation of each covariate for covariate standardization
- min_coordinates: minimum coordinates of the study area
- response: species presence status
- weights: statistical weights
- offset_expert: offset from the species' range map
- offset_elevation: offset from the species' elevational limits
- PA_training_data (checklist data):
- covariates: environmental covariates over the presence and background points
- coordinates: cell coordinates over the presence and background points
- response: species presence status
- offset_expert: offset from the species' range map
- offset_elevation: offset from the species' elevational limits
- proj_raster: coordinate reference system
- scale_out: resolution at which background points are sampled
- File 13 Name: Compiled_data_v9/INLA_list_6_'species_name'...
- File 13 Description: Contains data for running models with the whole data set and with cross-validation folds, along with the spatial mesh structure for running spatial SPDE-type INLA models:
- po_coordinates: presence and background point coordinates
- po_covariates: presence and background point covariates
- po_response: presence status
- po_weights: statistical weights for the presence and background points
- po_offset: offsets from species' range maps and elevational limits
- pa_coordinates: checklist coordinates
- pa_covariates: checklist covariates
- pa_response: presence status in checklists
- pa_offset: offsets from species' range maps and elevational limits
- cov_mean: mean of each covariate for covariate standardization
- cov_sd: standard deviation of each covariate for covariate standardization
- spde_hull: geographical hull for defining SPDEs for the INLA model
- spde_mesh: geographical mesh for defining SPDEs for the INLA model
- pa_testing_folds: cross-validation folds for the checklists
- po_training_folds: presence points used for model training
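The file descriptions above can be made concrete with a short sketch in base R. It uses toy stand-ins for the two checklist tables (with the column names `Communities_IDComm`, `LatDecDeg`, `LongDecDeg` and `Spname` listed above) to build a presence-absence response for one species, and then standardizes covariates with stored means and standard deviations in the manner suggested by the `cov_mean` and `cov_sd` entries. All values are invented, and treating unlisted sites as absences is an assumption of this sketch rather than a statement of how the repository scripts work.

```r
# Toy stand-ins for Sites_8Feb2011.csv and SpeciesxSite8Feb2011.csv.
# With the real files, read.csv() on the paths listed above would be used.
sites <- data.frame(
  Communities_IDComm = c(1, 2, 3),
  LatDecDeg = c(-0.50, 4.60, -13.20),
  LongDecDeg = c(-78.50, -74.10, -72.00)
)
sightings <- data.frame(
  Communities_IDComm = c(1, 1, 3),
  Spname = c("Ensifera ensifera", "Chaetocercus bombus", "Ensifera ensifera")
)

# Assumption: a species is present where it is listed on a checklist
# and absent at every other checklist site.
target <- "Ensifera ensifera"
present_ids <- sightings$Communities_IDComm[sightings$Spname == target]
pa <- sites
pa$response <- as.integer(pa$Communities_IDComm %in% present_ids)

# Standardize covariates column by column with stored means and SDs.
# The covariate values are hypothetical.
covariates <- cbind(cov1 = c(10, 20, 30), cov3 = c(500, 1500, 2500))
cov_mean <- colMeans(covariates)
cov_sd <- apply(covariates, 2, sd)
standardized <- sweep(sweep(covariates, 2, cov_mean, "-"), 2, cov_sd, "/")
```

Keeping the means and standard deviations alongside the data, as the compiled files do, allows the same transformation to be applied to prediction cells so that training and prediction covariates share one scale.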
Sharing/Access information
Data was derived from the following sources:
- Cloud cover: https://www.earthenv.org/cloud, (https://doi.org/10.1371/journal.pbio.1002415)
- Elevation: https://www.earthenv.org/topography, (https://www.nature.com/articles/sdata201840)
- Chelsa Bioclim: https://doi.org/10.5061/dryad.kd1d4, (https://www.nature.com/articles/sdata2017122)
- EVI: https://lpdaac.usgs.gov/products/mod13a3v006/, (https://modis-land.gsfc.nasa.gov/pdf/MOD13_User_Guide_V61.pdf)
- Presence-absence species observations: https://mol.org/datasets/769f3b99-214e-4056-8c39-1200a6855943, (Parra, J. L. et al. 2019. Data from: Continental-scale 1km hummingbird diversity derived from fusing point records with lateral and elevational expert information.)
- Presence-only species observations: Originates from the Global Biodiversity Information Facility (https://gbif.org) and accessed through Map of Life (https://mol.org) on 7 June 2017.
- TRI: https://www.earthenv.org/topography, (https://www.nature.com/articles/sdata201840)
- Expert range map-based validation requires expert range maps which are not included in this repository. They can be downloaded from Map of Life (https://mol.org).
Code/Software
This code and software have been used to create the analysis and results of the study. All analyses were run with R 4.3.1 (RStudio Team (2020)). Inference is conducted with INLA (version 23.05.30, https://www.r-inla.org/).
The R workflow is run using the iSDM.R script, which calls specific R scripts to associate species data with environmental variables, fit models with the full data and with cross-validation folds, and validate the models.
- File 1 Name: Cross_validation.R
- File 1 Description: Run cross-validation with specific validation metrics over all folds.
- File 2 Name: Cross_validation_range_map.R
- File 2 Description: Run range map-based validation.
- File 3 Name: Data_model_summaries_all_species.R
- File 3 Description: Compute summaries of data distribution and range predictions for all study species.
- File 4 Name: Data_model_summaries_cv_species.R
- File 4 Description: Compute summaries of data distribution and range predictions for cross-validated study species.
- File 5 Name: Figure1.R
- File 5 Description: Draw figure 1 of the original publication.
- File 6 Name: Figure2.R
- File 6 Description: Draw figure 2 of the original publication.
- File 7 Name: Figure3.R
- File 7 Description: Draw figure 3 of the original publication.
- File 8 Name: Get_data.R
- File 8 Description: A collective script, which calls specific data-scripts.
- File 9 Name: Get_domain.R
- File 9 Description: Create a study domain given the point records and expert range map.
- File 10 Name: Get_env_data.R
- File 10 Description: Read environmental rasters and create a list of data covering the whole species domain for making predictions.
- File 11 Name: Get_species_pa_training_data.R
- File 11 Description: Read species checklists and create a presence-absence data set.
- File 12 Name: Get_species_po_data.R
- File 12 Description: Read species point records.
- File 13 Name: Get_species_po_training_data.R
- File 13 Description: Associate point records and background points with environmental variables.
- File 14 Name: iSDM.R
- File 14 Description: Script for running the whole analysis pipeline.
- File 15 Name: occurrence.r
- File 15 Description: Spatial thinning script for the point records.
- File 16 Name: RunInference_cv_inla_non_spat.R
- File 16 Description: Fit non-spatial models for each cross-validation fold.
- File 17 Name: RunInference_cv_inla_spat.R
- File 17 Description: Fit spatial models for each cross-validation fold.
- File 18 Name: RunInference_cv_inla_spat_rsr.R
- File 18 Description: Fit spatial models with restricted spatial latent effect for each cross-validation fold.
- File 19 Name: RunInference_cv_inla_spat_samp.R
- File 19 Description: Fit spatial models with additional spatial latent effect for each cross-validation fold.
- File 20 Name: RunInference_inla_non_spat.R
- File 20 Description: Fit non-spatial models for whole data set.
- File 21 Name: RunInference_inla_spat.R
- File 21 Description: Fit spatial models for whole data set.
- File 22 Name: RunInference_inla_spat_rsr.R
- File 22 Description: Fit spatial models with restricted spatial latent effect for whole data set.
- File 23 Name: RunInference_inla_spat_samp.R
- File 23 Description: Fit spatial models with additional spatial latent effect for whole data set.
- File 24 Name: Set_data_for_inference.R
- File 24 Description: Add features in the data set (INLA mesh and cross-validation folds).
- File 25 Name: Validation_functions.R
- File 25 Description: Functions for computing different validation metrics.
- File 26 Name: Species_list.csv
- File 26 Description: List of all study species.
- File 27 Name: Species_list_cross_validation.csv
- File 27 Description: List of cross-validated species.
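To give a sense of what a validation metric for presence-absence predictions looks like, here is a minimal sketch in base R that computes the area under the ROC curve (AUC) from observed presence status and predicted probabilities. The choice of AUC, the function name `auc` and the example values are assumptions made for illustration; the metrics actually implemented in Validation_functions.R are not described in this README.

```r
# AUC via the Mann-Whitney rank formulation; tied predictions
# receive average ranks through rank()'s default behaviour.
auc <- function(observed, predicted) {
  n1 <- sum(observed == 1)  # number of presences
  n0 <- sum(observed == 0)  # number of absences
  r <- rank(predicted)
  (sum(r[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical held-out checklist responses and model predictions.
observed <- c(1, 0, 1, 0, 0, 1)
predicted <- c(0.9, 0.2, 0.6, 0.7, 0.1, 0.8)

auc_value <- auc(observed, predicted)
```

In this toy example eight of the nine presence-absence pairs are ranked correctly, so `auc_value` equals 8/9. In a cross-validation setting such a function would be applied to each held-out fold in turn.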
Methods
A detailed methodology associated with the environmental variables and species data can be found in the references used in the original publication.