Spatial biases are a common feature of presence-absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non-detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e., rarer) class. Poor predictions can result, the severity of which may vary by modeling technique. To explore the consequences of spatial bias and class imbalance in presence-absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing, and majority-only thinning (i.e., retaining all samples of the minority class). We created SDMs using two parametric or semi-parametric techniques (generalized linear models and generalized additive models) and two machine-learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operating characteristic curve; true skill statistic; area under the precision-recall curve) and calibration (Brier score; Cohen’s kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modeling technique, performance metric, and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence < 0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes, but hindered model calibration.
Baseline sample prevalence, sample size, modeling approach, and the intended application of SDM output – whether discrimination or calibration – should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis-à-vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
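To make the idea of majority-only thinning concrete, the following is a minimal Python sketch, not the procedure used in the manuscript. It assumes grid-cell thinning (one randomly chosen majority-class record per square cell), projected coordinates in meters, and a hypothetical function name `majority_only_thin`; the cell size and the synthetic data are illustrative only.

```python
import numpy as np


def majority_only_thin(x, y, presence, cell_size, rng):
    """Return a boolean mask of records to keep.

    All minority-class records are kept. Majority-class records are
    thinned to one randomly chosen record per square grid cell.
    Grid-cell thinning is an assumption made for this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    presence = np.asarray(presence, dtype=bool)

    # The minority class is whichever of presence/absence is rarer.
    minority_is_presence = presence.mean() < 0.5
    is_minority = presence == minority_is_presence

    keep = is_minority.copy()
    majority_idx = np.flatnonzero(~is_minority)
    if majority_idx.size == 0:
        return keep

    # Shuffle so that the "first" record found in each cell is random.
    order = rng.permutation(majority_idx)
    cells = np.column_stack(
        (np.floor(x[order] / cell_size), np.floor(y[order] / cell_size))
    ).astype(np.int64)
    _, first = np.unique(cells, axis=0, return_index=True)
    keep[order[first]] = True
    return keep


# Synthetic example: a rare species in a 10 km x 10 km square.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 10_000, n)
y = rng.uniform(0, 10_000, n)
presence = rng.random(n) < 0.05

keep = majority_only_thin(x, y, presence, cell_size=1_000, rng=rng)
prevalence_before = presence.mean()
prevalence_after = presence[keep].mean()
print(f"sample prevalence: {prevalence_before:.3f} -> {prevalence_after:.3f}")
```

Because only the majority class is thinned, sample prevalence after thinning can never be lower than before. That shift is one reason the match between sample prevalence and true prevalence matters for calibration.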

This test dataset includes presence/absence data for 102 bird species based on point-counts conducted throughout Connecticut and Rhode Island from three different studies. One study covered interior forest habitats primarily in eastern Connecticut, where at each of 32 forested sites, seven point-count stations were spaced a minimum of 160 m apart. A total of 358 surveys were conducted in 2018 at 220 locations, with 87 locations surveyed once, 128 surveyed twice, and five surveyed three times. In another study, point-count stations were randomly stratified by habitat throughout Connecticut with a maximum of 10 point-counts spaced at least 500 m apart within 200 randomly selected 5 x 5 km grid cells. A total of 932 surveys were conducted in 2018 at 690 locations, with 450 locations surveyed once, 238 surveyed twice, and two surveyed three times. In the remaining study, point-counts were conducted throughout Rhode Island at 3,664 randomly selected locations spaced a minimum of 250 m apart. Each location was surveyed once in 2015. Point-counts were conducted between May 19th and July 5th to capture the peak breeding period for most species in the dataset. This dataset is presented with associated observation and environmental covariates. See the manuscript and Table S2 for additional details.
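The survey totals reported for the two Connecticut studies follow directly from the number of visits per location. The short Python check below reproduces that arithmetic; the dictionary and function names are hypothetical and chosen only for illustration.

```python
# Number of locations by visit count, as reported for each Connecticut study.
forest_interior = {1: 87, 2: 128, 3: 5}
statewide_stratified = {1: 450, 2: 238, 3: 2}


def tally(locations_by_visits):
    """Return (number of locations, number of surveys)."""
    n_locations = sum(locations_by_visits.values())
    n_surveys = sum(v * n for v, n in locations_by_visits.items())
    return n_locations, n_surveys


forest_totals = tally(forest_interior)          # (220, 358)
statewide_totals = tally(statewide_stratified)  # (690, 932)
print(forest_totals, statewide_totals)
```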

Environmental covariate sources: National Landcover Database (NLCD), the National Wetlands Inventory (NWI), NOAA's Coastal Change Analysis Program 2010 Forest Fragmentation Data (C-CAP), and the USGS National Elevation Dataset (NED). Environmental covariates are summarized as proportional coverage (NLCD, NWI, and C-CAP) or mean values (NED) within a circle of 250-m radius centered on each point. See the ColumnHeaderTable txt file for column descriptions.
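As an illustration of what proportional coverage within a 250-m radius means, here is a minimal Python sketch on a toy categorical raster. It is not the workflow used to build this dataset: the function name `proportion_within_radius`, the 30-m cell size, the class codes 1 and 2, and the rule of counting a cell when its center falls inside the circle are all assumptions made for the example.

```python
import numpy as np


def proportion_within_radius(raster, cell_size, px, py, radius, class_code):
    """Proportion of raster cells of `class_code` within `radius` of (px, py).

    The raster origin is taken to be (0, 0) at the corner of cell [0, 0],
    and a cell counts as inside when its center lies within the circle.
    """
    rows, cols = raster.shape
    cx = (np.arange(cols) + 0.5) * cell_size  # cell-center x coordinates
    cy = (np.arange(rows) + 0.5) * cell_size  # cell-center y coordinates
    inside = (cx[None, :] - px) ** 2 + (cy[:, None] - py) ** 2 <= radius ** 2
    if not inside.any():
        raise ValueError("no cell centers fall within the radius")
    return float((raster[inside] == class_code).mean())


# Toy raster: 40 x 40 cells of 30 m; left half is class 1, right half class 2.
toy = np.full((40, 40), 2, dtype=np.int16)
toy[:, :20] = 1

# A point on the boundary between the two classes, at the raster center.
p_class1 = proportion_within_radius(toy, 30.0, 600.0, 600.0, 250.0, class_code=1)
print(p_class1)
```

A mean-value covariate, such as elevation from NED, could be summarized the same way by replacing the class comparison with `raster[inside].mean()`.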

Avian point-counts from Rhode Island and Connecticut used to test species distribution models
