Data from: Model-based data integration improves species distribution models for data deficient and narrow-ranged hummingbird species

Mäkinen, Jussi; Cohen, Jeremy; Jetz, Walter

Published Jan 27, 2026 on Dryad. https://doi.org/10.5061/dryad.q83bk3jrs

Data files

Jan 27, 2026 version files (1.45 GB):
- Data.7z (1.45 GB)
- README.md (8.07 KB)

Abstract

For species with narrow ranges or low population sizes, a deficiency of species occurrence records can limit the capacity to build accurate species distribution models (SDMs). Model-based integration of data from multiple sources has been offered as a solution to improve predictions of species’ distributions at large scales, especially for data-deficient species, but clear empirical demonstrations of this are lacking. The study location was South and Central America. We applied a state-of-the-art data integration technique to model the distributions of 98104 hummingbird species. We fitted SDMs using either presence-absence (PA) data from eBird or presence-only (PO) data from eBird and the Global Biodiversity Information Facility (GBIF) and compared them to integrated SDMs, which utilize both PA and PO data. We fitted generalized linear mixed-effects models and validated them with spatial block cross-validation and expert range map adjusted validation. We also conducted an experiment using artificially thinned datasets of 47 sufficiently abundant species to assess model performance under different levels of data deficiency. Data integration improved model performance compared to PA models for species whose PA data poorly covered the environmental conditions in the study area. The thinning experiment showed that even a small amount of PO data in data integration improved predictive accuracy compared to PA models, which was not clear in the cross-validation results with the full data. In comparison to PO models, data integration improved models over all species, but especially for data-rich species with large geographical ranges. Overall, data integration enables a more comprehensive capture of available species information and can improve range predictions in comparison to conventional modeling methods.

This data set is a collection of publicly available species and environmental data. They have been used to study populations and distributions of species and their associations with the environment.

Description of the data set (Data.7z)

File 1 Name: Model_objects/INLA_spat_Heliomaster furcifer_fits_thin_not_thin_PA_quad_n_20000_wa_not_offsets_inla.RData
- File 1 Description: An RData file containing INLA model objects for three species distribution models, fitted with presence-absence (PA) data,
  presence-only (PO) data, or their combination (PA+PO model). The models are for the Blue-tufted starthroat (Heliomaster furcifer):
  - model_fits: model objects of the fitted INLA models
  - stk_coll: data stack objects used for fitting the models
File 2 Name: Prediction_objects/Heliomaster furcifer.csv
- File 2 Description: A CSV file containing spatially explicit model predictions for the Blue-tufted starthroat (Heliomaster furcifer). Table columns:
  - lon: longitude
  - lat: latitude
  - full_effect: prediction with covariate and spatial random effects
  - cov_effect: prediction with covariate effects
  - spt_effect: prediction with spatial random effects
  - samp_effect: prediction with the spatial random effects related to the survey effort of the PO data. NA in this column denotes that the effect was not estimated for the specific model.
  - model: model (PA, PO or integrated model)
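A minimal sketch of how a prediction table with the columns listed above could be read and split by model, using only the Python standard library. The sample rows, the model labels "PA" and "integrated", and the "NA" encoding of missing values are assumptions made for illustration; check them against the actual file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Numeric columns as listed in the File 2 description.
NUMERIC_COLUMNS = ("lon", "lat", "full_effect", "cov_effect", "spt_effect", "samp_effect")


def parse_value(text):
    """Convert a CSV field to float; treat empty fields and 'NA' as missing (None)."""
    text = text.strip()
    if text in ("", "NA"):  # assumption: missing values are written as NA
        return None
    return float(text)


def read_predictions(handle):
    """Group prediction rows by the value of the `model` column."""
    by_model = defaultdict(list)
    for row in csv.DictReader(handle):
        record = {name: parse_value(row[name]) for name in NUMERIC_COLUMNS}
        by_model[row["model"]].append(record)
    return dict(by_model)


# Made-up rows for illustration only; the model labels are hypothetical.
SAMPLE = """lon,lat,full_effect,cov_effect,spt_effect,samp_effect,model
-60.5,-20.25,0.8,0.5,0.3,NA,PA
-60.5,-20.25,0.9,0.5,0.2,0.2,integrated
"""

predictions = read_predictions(io.StringIO(SAMPLE))
for model, rows in predictions.items():
    print(model, len(rows), rows[0]["samp_effect"])
```

To read the real table, the same function can be given an open file handle, for example `open("Prediction_objects/Heliomaster furcifer.csv", newline="")`, assuming Data.7z has been extracted into the working directory.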

Sharing/Access information

Species and environmental data can be derived from the following sources:

Cloud cover: https://www.earthenv.org/cloud, (https://doi.org/10.1371/journal.pbio.1002415) (CC-BY 4.0)
Elevation: https://www.earthenv.org/topography, (https://www.nature.com/articles/sdata201840); Amatulli, Giuseppe; Domisch, Sami; Tuanmu, Mao-Ning; Parmentier, Benoit; Ranipeta, Ajay; Malczyk, Jeremy; Jetz, Walter (2018): A suite of global, cross-scale topographic variables for environmental and biodiversity modeling, links to files in GeoTIFF format [dataset publication series]. PANGAEA, https://doi.org/10.1594/PANGAEA.867115 (CC-BY 3.0)
Chelsa Bioclim: https://doi.org/10.5061/dryad.kd1d4, (https://www.nature.com/articles/sdata2017122) (CC0 1.0)
EVI: https://lpdaac.usgs.gov/products/mod13a3v006/, (https://modis-land.gsfc.nasa.gov/pdf/MOD13_User_Guide_V61.pdf) (CC0)
Presence-absence species observations originate from eBird (https://ebird.org/home). The data can be downloaded using their data platform.
Presence-only species observations originate from the Global Biodiversity Information Facility (https://gbif.org) (CC0, CC BY, CC BY-NC).
TRI: https://www.earthenv.org/topography, (https://www.nature.com/articles/sdata201840); Amatulli, Giuseppe; Domisch, Sami; Tuanmu, Mao-Ning; Parmentier, Benoit; Ranipeta, Ajay; Malczyk, Jeremy; Jetz, Walter (2018): A suite of global, cross-scale topographic variables for environmental and biodiversity modeling, links to files in GeoTIFF format [dataset publication series]. PANGAEA, https://doi.org/10.1594/PANGAEA.867115 (CC-BY 3.0)
Expert range map-based validation requires expert range maps, which are not included in this repository. They can be downloaded from Map of Life (https://mol.org).

Code/Software (R.7z)

This code and software have been used to create the analysis and results of the study. All analyses were run with R 4.3.1 (RStudio Team (2020)). Inference was conducted with INLA (version 23.05.30, https://www.r-inla.org/).

The R workflow is run using the Integrated_ebird.R script, which calls specific R scripts to associate species data with environmental variables, fit models with the full data and with cross-validation folds, and validate the models.

File 1 Name: Create_species_data.R
- File 1 Description: Create species-level data tables from the data table which covers the whole species pool.
  RUN THIS CODE BEFORE THE ACTUAL ANALYSIS
File 2 Name: Get_data.R
- File 2 Description: A collective script, which calls specific data-scripts.
File 3 Name: Get_domain.R
- File 3 Description: Create a study domain given the point records and expert range map.
File 4 Name: Get_env_data.R
- File 4 Description: Read environmental rasters and create a list of data covering the whole species domain for making predictions.
File 5 Name: Get_species_all_gbif_data.R
- File 5 Description: Download species observations from GBIF.
File 6 Name: Get_species_all_gbif_training_data.R
- File 6 Description: Associate GBIF point records and background points with environmental variables.
File 7 Name: Get_species_gbif_data.R
- File 7 Description: Download species observations from GBIF which do not originate from eBird.
File 8 Name: Get_species_pa_training_data.R
- File 8 Description: Read species checklists and associate presence-absence data set with environmental variables.
File 9 Name: Get_species_po_data.R
- File 9 Description: Read species eBird incomplete checklists.
File 10 Name: Get_species_po_training_data.R
- File 10 Description: Combine eBird incomplete checklists with GBIF observations that do not originate from eBird, and associate them and background points with environmental variables.
File 11 Name: Get_species_range_data.R
- File 11 Description: Read in species range map, coarsen to 50 km x 50 km resolution, turn cell center points into presence-absence points and associate them with environmental variables.
File 12 Name: Integrated_ebird_rarity_experiment.R
- File 12 Description: Script for running the whole analysis pipeline for the thinning experiment.
File 13 Name: Integrated_ebird.R
- File 13 Description: Script for running the whole analysis pipeline.
File 14 Name: Model_outputs.R
- File 14 Description: Compute summaries of data distribution and range predictions for all study species.
File 15 Name: RunInference_cv_inla_spat_extrap.R
- File 15 Description: Fit spatial models for each cross-validation fold.
File 16 Name: RunInference_cv_inla_spat_rarity_exp_extrap.R
- File 16 Description: Fit spatial models for each cross-validation fold in the thinning experiment.
File 17 Name: RunInference_inla_spat.R
- File 17 Description: Fit spatial models for the whole data set.
File 18 Name: RunInference_inla_spat_rarity_exp.R
- File 18 Description: Fit spatial models for the whole data set in the thinning experiment.
File 19 Name: Set_data_for_inference.R
- File 19 Description: Add features to the data set (INLA mesh and cross-validation folds).
File 20 Name: Species_data_from_ebird_folder.R
- File 20 Description: Function for setting eBird data into species-level tables.
File 21 Name: Species_check.R
- File 21 Description: A function for checking that models have converged and species data fulfill the data requirements.
File 22 Name: Validation_functions.R
- File 22 Description: Functions for computing different validation metrics.
File 23 Name: Validation_cv_full_HB.R
- File 23 Description: Compute validation metrics for each cross-validation fold.
File 24 Name: Validation_cv_full_HB_rarity_exp.R
- File 24 Description: Compute validation metrics for each cross-validation fold in the thinning experiment.
File 25 Name: Vignettes/Vignette.Rmd
- File 25 Description: Rmarkdown-script for running models for an example species (Gilded hummingbird).
Folder 1 Name: Manuscript
- Folder 1 Description: R scripts for creating the figures and tables of the original manuscript and supporting information.