Supporting data from: When birding hotspots get too hot: A geographic evaluation of wildfire-related disturbance on spatiotemporal biases in citizen science data
Data files
Jul 08, 2025 version files 98.43 MB
- 1_Locality_Selection_Data.csv (95.43 MB)
- 1_Locality_Selection.R (113.66 KB)
- 2_Probability_of_Return_Data.csv (1.88 MB)
- 2_Probability_of_Return.R (29.25 KB)
- 3_Long_Term_Probability_of_Return_Data.csv (964.76 KB)
- 3_Long_Term_Probability_of_Return.R (4.79 KB)
- README.md (3.50 KB)
Abstract
Long-term monitoring is critical for ecology and conservation, especially as non-stationary climatic conditions increase. Citizen science projects offer long-term georeferenced data from thousands of observers across diverse geographic areas. Despite the attraction of these datasets for biogeographical research and conservation planning, data collection commonly lacks standardized probabilistic sampling, which can increase observer bias, decrease precision of parameter estimates, and increase risk of spurious results when using the associated species data. Additionally, environmental disturbance may affect observer behavior, confounding the observed patterns in species responses. We aimed to test the effects of wildfire disturbance on observer biases in locality selection and return rates by citizen scientists registered with eBird, a globally available bird observation database. Location. Western USA. We used a long-term (10-yr) dataset of 47,662 localities from 1,788 eBird observers to calculate resource selection functions and explain observer locality selection as a function of wildfire and non-fire-related environmental covariates. We calculated spatiotemporally explicit covariates from the Monitoring Trends in Burn Severity program, and also developed generalized linear mixed models to predict the probability of observers returning to localities in response to fire and non-fire variables. Our results show that fire characteristics predicted locality selection and the probability of returning to a locality. Closer, more recent, larger, and more severe fires showed the greatest effects on spatiotemporal patterns of observer sampling bias. Other non-fire-related variables related to locality attractiveness, land use, convenience, and accessibility were also important. Our results demonstrate that landscape disturbance introduces spatiotemporal biases in citizen scientist locality selection and revisitation. 
Researchers using citizen science data can follow our modeling approach to quantify disturbance-related observer sampling biases and estimate bias-corrected parameters necessary for ecological studies. Without this, observer biases inherent in these data can lead to increased bias, decreased precision in parameter estimates, and spurious results. We propose recommendations to enhance the value of citizen science data for biological monitoring and conservation.
https://doi.org/10.5061/dryad.0vt4b8h5x
Description of the data and file structure
Observer data were obtained from the eBird database and fire metrics from the MTBS program to identify and quantify disturbance-related sources of bias in citizen science data. 'NA' values in the CSV files are removed where appropriate by the attached R code.
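For readers working outside R, the following is a minimal sketch of how 'NA' entries could be dropped before analysis. It is not the authors' code. The sample rows are invented for illustration, and only the column names (taken from the variable list for 2_Probability_of_Return_Data.csv) come from this dataset; the helper `complete_cases` is a hypothetical name.

```python
import csv
import io

# Invented sample rows; column names follow 2_Probability_of_Return_Data.csv.
SAMPLE = """loc_ID,obs_ID,return,fire_size,fire_severity
L1,O1,1,12.5,2.1
L2,O1,0,NA,NA
L3,O2,1,3.0,NA
"""

def complete_cases(rows, required):
    """Keep rows whose required columns are present and not 'NA' or empty."""
    return [r for r in rows
            if all(r.get(col, "NA") not in ("NA", "") for col in required)]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = complete_cases(rows, ["fire_size", "fire_severity"])
print(len(rows), len(kept))
```

Which columns count as required depends on the model being fitted, so the list passed to `complete_cases` would change from one analysis to the next.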
Files and variables
File: 3_Long_Term_Probability_of_Return_Data.csv
Description:
Variables
- join_row: internal analysis
- tsfire: time since fire (years)
- Join_Count: internal analysis
- TARGET_FID: internal analysis
- LOCALIT: observation locality ID
- OBSERVA: observation date
- OBSERVE: observer ID
- YEAR: Year of observation
- DAY_OF_: Julian calendar day (day of year)
- Evnt_ID: observation eBird ID
- BrnBndA: burned acreage
- Ig_Date: fire ignition date
- fire_sz: fire size (sq km)
- fr_svrt: fire severity index (see MTBS)
- geometry: GIS control field
- yfire: year of fire
- yobserv: year of first visit
- return: binomial, 1=return, 0=no return
- returny: year of return
File: 2_Probability_of_Return_Data.csv
Description:
Variables
- date: date of observation
- fire_severity: fire severity index (see MTBS)
- fire_size: fire size (sq km)
- loc_ID: locality ID
- obs_ID: observer ID
- return: binomial, 1=return, 0=no return
- state: political state boundaries
- year: year of observation
- pop_den: population density (inhabitants/sq km)
- distance_to_road: distance to road (m)
- land_cov: land cover type, see tables in paper
- longitude: longitude WGS84
- latitude: latitude WGS84
- spp: species richness in locality
File: 1_Locality_Selection_Data.csv
Description:
Variables
- locality.id: locality id
- observation.date: observation date
- observer.id: observer ID
- number.observers: number of observers
- all.species.reported: logical, whether all species were reported
- year: year of observation
- day_of_year: Julian day (day of year) of observation
- use: binomial, 1=locality used, 0=locality available but not used
- obs.id: Observer ID
- lon: longitude WGS84
- lat: latitude WGS84
- Event_ID: fire event ID
- BurnBndAc: burned acreage
- Ig_Date: fire ignition date
- fire_size: fire size (sq km)
- fire_severity: fire severity index (see MTBS)
- nearest: nearest fire ID
- distance_to_fire: distance to fire (m)
- pop_den: population density (inhabitants/sq km)
- id_road: id of closest road
- distance_to_road: distance to road (m)
- land_cover.ID: land cover type, see tables in paper
- land_cover.land_cover: reclassified land cover type (see R code)
- sp_rich: species richness
Code/software
Three R scripts are provided with this submission:
1_Locality_Selection.R
2_Probability_of_Return.R
3_Long_Term_Probability_of_Return.R
In conjunction with the CSV files provided, it is possible to replicate the analysis presented in the paper. The packages used are listed in the scripts provided. eBird raw data is not provided but can be accessed through: https://ebird.org/download
R version 4.2.3 (2023-03-15 ucrt)
Access information
Data was derived from the following sources:
- https://ebird.org/download
- https://www.mtbs.gov/direct-download
- See paper citations
Study area
This study focuses on the Western United States (USA), which has experienced increased wildfire frequency, size, and severity in recent decades (Figure 2, Weber and Yadav 2020), with these increases predicted to continue due to climate change (Abatzoglou et al. 2021, Wasserman and Mueller 2023, McGinnis et al. 2023). This region supports the widest elevation range in the continental USA (86 m below sea level to 4,418 m above sea level), containing large tracts of public land and a high diversity of biomes, including forested mountains, coastal environments, arid and semi-arid plateaus and plains, high-elevation subalpine and alpine areas, and rainforests. The region also supports high avian species diversity (> 600 bird species), and thousands of bird observers report sightings across it each year.
Citizen-based wildlife observation data
We used data freely available from the eBird repository of citizen-based bird observations as a case study (Sullivan et al. 2009). This data source currently contains over 1.6 billion bird records worldwide from more than 930,000 observers and has contributed to more than 900 scientific publications (ebird.org). To avoid potential confounding effects of the 2020-2021 COVID-19 pandemic-related lockdown, we used data reported by eBird observers from 2010 to 2019. We used only bird observations recorded during May of each year to avoid potential effects of seasonal changes in survey effort and to capture spring bird migration, when birding activity peaks. We included checklists (species lists submitted) regardless of species because we were interested in explaining decisions made by observers rather than the birds reported. We focused on environments within the continental US and removed observations occurring in the Pacific Ocean because we did not expect fire to impact marine environments. The filtered eBird observation database contained observer ID, location, date, and time of each observation event.
Fire and environmental data
We calculated spatiotemporally explicit wildfire covariates, including burn status (burned/unburned), distance to closest fire (km), time since fire (years), fire severity index, and fire size (ha) across our study area using data from the Monitoring Trends in Burn Severity (MTBS) program (Eidenshink et al. 2007, Table 1). To account for other potential sources of variability in eBird data, we included eight land cover types (Friedl and Sulla-Menashe 2019), human population density (inhabitants/km², CIESIN - Columbia University 2017), and distance from paved roads (km, U.S. Geological Survey 2014). We extracted these covariates for each eBird observation locality using the `sf`, `terra`, and `exactextractr` packages in R (v.4.2.3; Pebesma 2018, Baston 2022, Hijmans 2022, R Core Team 2022).
Data analysis
To avoid problems with multicollinearity among independent covariates in our linear modeling framework, we calculated Pearson’s correlation coefficients among the various predictors (R Core Team 2022) and found no evidence of significant correlations (all r < 0.5). Additionally, visual inspections of boxplots and scatterplots revealed no evidence of potential outliers in our predictors and response variables.
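The correlation screen described above can be sketched as follows. This is an illustrative stand-in for the R workflow, not the authors' script: the predictor values are invented, `flag_collinear` is a hypothetical helper, and comparing the absolute value of r against the 0.5 cutoff is an assumption.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def flag_collinear(predictors, threshold=0.5):
    """Return predictor pairs whose |r| meets or exceeds the threshold."""
    flagged = []
    for a, b in combinations(sorted(predictors), 2):
        r = pearson(predictors[a], predictors[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Invented values for illustration only.
predictors = {
    "distance_to_fire": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fire_size":        [2.0, 1.0, 4.0, 3.0, 5.0],
    "pop_den":          [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = flag_collinear(predictors)
print(flagged)
```

An empty list from `flag_collinear` would correspond to the outcome reported here, where no pair of predictors reached the cutoff.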
Probability of selecting a locality. Locality selection by citizen science observers is a complex process involving multiple factors, including locality attractiveness (Romo et al. 2006, Mancini et al. 2019), land cover type (Tiago et al. 2017, Mandeville et al. 2022), accessibility (Tiago et al. 2017, Zhang 2020), opportunity, and safety (Mancini et al. 2019, Zhang 2020), among other socio-economic factors. To evaluate post-fire effects on locality selection by observers, we calculated third-order resource selection functions (RSFs) following a use-availability design (Johnson et al. 2006). We used records located only in public lands to ensure that all modeled areas were available to observers. To avoid temporal and spatial autocorrelation of observations that can bias RSFs (Alston et al. 2023), we randomly selected a single locality (i.e., point coordinates where an observation was recorded in eBird) per day per observer. We then used eBird data from observers with ≥ 10 distinct observation localities within a given year to remove effects from incidental observers. We also removed localities containing only incidental protocols because those generally do not represent active locality selection by observers (e.g., locality not selected for wildlife viewing, observations recorded during other activities).
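The two filtering steps described above (one random locality per observer per day, then observers with at least 10 distinct localities in a year) can be sketched as follows. This is a simplified illustration under assumed inputs, not the authors' R code: the record layout, the function name `thin_and_filter`, and the demo data are all hypothetical, and the public-land and incidental-protocol filters are omitted.

```python
import random
from collections import defaultdict

def thin_and_filter(records, min_localities=10, seed=0):
    """records: iterable of (observer_id, 'YYYY-MM-DD', locality_id).
    Step 1: keep one randomly chosen locality per observer per day.
    Step 2: keep observer-years with >= min_localities distinct localities."""
    rng = random.Random(seed)
    by_day = defaultdict(list)
    for obs, date, loc in records:
        by_day[(obs, date)].append(loc)
    thinned = [(obs, date, rng.choice(sorted(locs)))
               for (obs, date), locs in sorted(by_day.items())]
    distinct = defaultdict(set)
    for obs, date, loc in thinned:
        distinct[(obs, date[:4])].add(loc)
    return [(obs, date, loc) for obs, date, loc in thinned
            if len(distinct[(obs, date[:4])]) >= min_localities]

# Invented demo: observer A visits 12 localities (two on the first day),
# observer B visits only 3 and is dropped as an incidental observer.
records = [("A", f"2015-05-{d:02d}", f"loc{d}") for d in range(1, 13)]
records.append(("A", "2015-05-01", "locX"))
records += [("B", f"2015-05-{d:02d}", f"b{d}") for d in range(1, 4)]
kept = thin_and_filter(records)
print(len(kept), sorted({obs for obs, _, _ in kept}))
```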
For each year, we used the coordinates from localities reported by each observer to represent ‘used’ localities and generated random locations to represent available ‘unused’ localities. We generated the random unused localities on public land within the median distance traveled by an observer between consecutively-dated localities, representing the average distance an observer traveled between used localities and reducing the effects of long-distance travel unrelated to birdwatching. We generated ten random locations for each used locality to characterize available environmental conditions while maintaining an adequate class balance for calculating our binomial regression-based resource selection functions (Northrup et al. 2013, Salas-Eljatib et al. 2018).
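A minimal sketch of the availability sampling described above is given below. It makes several assumptions that go beyond the text: random points are drawn around each used locality, the draw is uniform over a disc, a local flat-earth approximation converts kilometres to degrees, and the public-land mask is omitted. The function names and coordinates are invented for illustration.

```python
import math
import random
from statistics import median

EARTH_RADIUS_KM = 6371.0
KM_PER_DEG = math.pi * EARTH_RADIUS_KM / 180.0  # length of one degree of latitude

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def median_step_km(track):
    """Median distance between consecutive points of a date-ordered (lon, lat) track."""
    return median(haversine_km(*a, *b) for a, b in zip(track, track[1:]))

def random_available(lon, lat, radius_km, n=10, rng=random):
    """Draw n points uniformly within radius_km of (lon, lat)."""
    points = []
    for _ in range(n):
        r = radius_km * math.sqrt(rng.random())  # sqrt keeps density uniform over the disc
        theta = rng.uniform(0.0, 2.0 * math.pi)
        dlat = r * math.sin(theta) / KM_PER_DEG
        dlon = r * math.cos(theta) / (KM_PER_DEG * math.cos(math.radians(lat)))
        points.append((lon + dlon, lat + dlat))
    return points

# Invented observer track (lon, lat), ordered by date.
track = [(-120.0, 40.0), (-120.1, 40.0), (-120.1, 40.2), (-119.9, 40.2)]
radius = median_step_km(track)
available = random_available(-120.0, 40.0, radius, n=10, rng=random.Random(1))
print(round(radius, 1), len(available))
```

In a full workflow, each candidate point would also be checked against a public-land layer and redrawn if it fell outside it.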
At each used and random locality, we measured the distance from, size, and average severity of the nearest wildfire that occurred during the previous fire season using data from the MTBS database. We only considered wildfires from the previous season because we expected fire effects to be stronger in the year immediately following a fire event than in subsequent years. To account for other potential covariates that we postulated could influence locality selection, we measured human population density, land cover type (categorical variable with eight classes: forest = A, shrubland = B, grassland = C, wetland = D, cropland = E, urban = F, barren = G, and waterbody = H), distance from the nearest paved road, and bird species richness. We calculated expected bird species richness annually in each locality from a ~15x15 km raster of eBird observations in May of each year from 2010-2019. We found no evidence of a latitudinal pattern (R² = 0.003, p < 0.01) or spatial autocorrelation (Moran’s I = 0.0225, p < 0.01) in our species richness variable at this resolution. This supports our objective of representing the expected species richness that an observer anticipates when selecting a locality, which reflects a combination of biological processes (true species richness) and sampling effort (number of checklists submitted to eBird for that locality). Localities with missing covariates were excluded from this analysis, resulting in 47,662 used and 433,864 available localities from 1,788 observers distributed across the entire study area, covering a wide range of wildfire and environmental conditions.
Following the information-theoretic approach (Burnham and Anderson 2002), we developed a series of ecologically grounded hypotheses in conjunction with our set of covariates to construct an a-priori set of candidate generalized linear mixed-effects models (n = 432) with a binomial error distribution and logit link function, where each model represented a competing hypothesis explaining locality selection by observers. Our candidate set of models represented a small proportion of the thousands possible. This, together with our large sample sizes, minimized problems with model selection uncertainty and overparameterized models (Anderson and Burnham 2002). We defined observer ID as a random intercept to account for individual locality preferences (i.e., repeated visits by individuals; Gillies et al. 2006). Our saturated model included singular, additive, and interaction effects of distance to fire, fire size, and fire severity. Distance to fire was modeled as a linear, quadratic, or log function to test for different curvilinear observer responses to distance from fire. We reasoned that the log function represented a probability of selecting a locality that reached a threshold or asymptote at some distance from fire, whereas the quadratic function represented a possible peak in probability followed by a subsequent decline at increasing distance from fire due to a reduction in some desired condition (e.g., too remote). Other variables in the saturated model were bird species richness, land cover type, population density, and distance to road as a quadratic term (i.e., probability of selecting a locality was lowest at distances closest to and farthest from roads). Statistical analyses were conducted using the `lme4` package in R (Bates et al. 2015, R Core Team 2022).
We used the receiver operating characteristic (ROC) curve and its area under the curve (AUC) to evaluate the capability of our full model to distinguish between used and available locations using the `pROC` package in R (Robin et al. 2011). We used the Bayesian Information Criterion (BIC) for model selection (Schwarz 1978), which outperforms other information criteria when using large samples (Kass and Raftery 1995, Aho et al. 2014). We considered candidate models with ΔBIC ≤ 2 equally supported by the data and used BIC weight (wi) to measure the weight of evidence or support for each model (Burnham and Anderson 2002).
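The ΔBIC and BIC weight calculations can be sketched as follows, using the standard definitions BIC = k ln(n) - 2 ln(L) and wi = exp(-ΔBICi / 2) / Σ exp(-ΔBICj / 2). The model names, log-likelihoods, parameter counts, and sample size below are invented for illustration and do not come from the paper.

```python
import math

def bic(log_likelihood, k, n):
    """Schwarz's criterion for a model with k parameters fitted to n observations."""
    return k * math.log(n) - 2.0 * log_likelihood

def bic_table(models, n):
    """models: {name: (log_likelihood, k)}.
    Returns rows of (name, BIC, delta_BIC, weight), best model first."""
    scores = {name: bic(ll, k, n) for name, (ll, k) in models.items()}
    best = min(scores.values())
    rel = {name: math.exp(-0.5 * (s - best)) for name, s in scores.items()}
    total = sum(rel.values())
    rows = [(name, s, s - best, rel[name] / total) for name, s in scores.items()]
    return sorted(rows, key=lambda row: row[2])

# Invented fits for three hypothetical candidate models.
models = {
    "fire_only": (-520.0, 4),
    "fire_plus_access": (-505.0, 6),
    "full": (-504.5, 9),
}
table = bic_table(models, n=1000)
for name, score, delta, weight in table:
    print(f"{name:18s} BIC={score:8.2f} dBIC={delta:6.2f} w={weight:.4f}")
supported = [name for name, _, delta, _ in table if delta <= 2.0]
```

The `supported` list collects the models within the ΔBIC ≤ 2 band that the text treats as equally supported.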
Short-term probability of returning to a locality. To analyze short-term changes in the probability of observers returning to the same locality, given the occurrence of a wildfire the previous year, we applied a before-after fire study design using eBird data from observers with ≥ 20 observations per year from 2010 to 2019. We restricted the query to observers in localities on public lands to ensure adequate sample sizes and reduce the influence of inconsistent property access and incidental observers. We selected localities within fire boundaries that were visited one year before a fire event (burned localities, n = 8,095) and randomly selected a similar number of localities outside fire boundaries (unburned localities, n = 10,309) as a control group. Using observer identities in each locality, we recorded a 'return' event when a given observer revisited the same locality the following year and a 'no-return' event when the observer did not. This produced a sample of 18,403 records from 4,886 unique observers for this analysis.
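The return/no-return labelling described above can be sketched as follows. The tuple layout, the function name `label_returns`, and the example visits are assumptions made for illustration and are not taken from the authors' scripts.

```python
def label_returns(visits, baseline):
    """visits: set of (observer_id, locality_id, year) for all recorded visits.
    baseline: list of (observer_id, locality_id, year) first-year visits.
    Returns (observer_id, locality_id, year, returned), where returned is 1
    if the same observer visited the same locality in the following year."""
    return [(obs, loc, year, int((obs, loc, year + 1) in visits))
            for obs, loc, year in baseline]

# Invented visits: O1 comes back to L1 in 2015, O2 moves to another locality.
visits = {
    ("O1", "L1", 2014), ("O1", "L1", 2015),
    ("O2", "L1", 2014), ("O2", "L2", 2015),
}
baseline = [("O1", "L1", 2014), ("O2", "L1", 2014)]
labels = label_returns(visits, baseline)
print(labels)
```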
We measured the burn status (burned/unburned), fire size, and fire severity at each locality. We also measured additional covariates that we expected might influence the probability of a return; these were human population density, distance from roads, land cover type, and expected bird species richness. Given that little is known about how fire can affect observer behavior, we hypothesized that fire can have an absolute effect (burned/unburned) and/or a gradual effect (fire size) on return probability. We fit a candidate set of generalized linear models (n = 127) with a binomial error structure and logit link function to the return (1) and no return (0) data, where each model predicted the probability of returning to a locality in response to a unique set of covariates. The saturated model included burn status, fire severity, fire size, bird species richness, land cover type, population density, and distance to road. We used BIC for model selection following the approach described above.
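One way to enumerate a candidate set of this kind is to take every non-empty subset of the seven covariates in the saturated model, which yields 2^7 - 1 = 127 additive models. That this matches the reported n = 127 is an observation made here; whether the authors built their set this way is an assumption. The covariate labels and the formula strings are illustrative only.

```python
from itertools import combinations

# Labels are illustrative; they mirror the covariates named in the saturated model.
COVARIATES = [
    "burn_status", "fire_severity", "fire_size", "species_richness",
    "land_cover", "pop_density", "distance_to_road",
]

def candidate_formulas(response, covariates):
    """Build one additive model formula per non-empty subset of covariates."""
    formulas = []
    for size in range(1, len(covariates) + 1):
        for subset in combinations(covariates, size):
            formulas.append(f"{response} ~ " + " + ".join(subset))
    return formulas

formulas = candidate_formulas("return", COVARIATES)
print(len(formulas))
print(formulas[0])
print(formulas[-1])
```

Each formula string could then be passed to a binomial GLM fitting routine and scored with BIC.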
Long-term probability of returning to a locality. Lastly, we examined the longer-term (multi-year) effects of time since wildfire on the probability of returning to a locality within a fire boundary over subsequent years, with our inference here focused on burned localities. To avoid small sample bias associated with incidental observers during this sampling period, we considered only observers with ≥ 10 species lists during each year from 2010 to 2019. We identified locations where observers reported species lists within the fire boundary before and after a fire event, and selected only locations surveyed by the same observer one year before and ≤ 8 years after a fire. This generated a sample of 5,163 records from 916 observers and 732 burned localities for this analysis. Here, we recorded a return (1) or no return (0) each year after a fire event for each observer. Additionally, we calculated time since fire, fire size, and fire severity index for each return and no return event. We did not include other environmental variables that we assumed remained constant for observers who visited a locality before and after a fire (e.g., distance from road). We fit a candidate set of generalized linear models (n = 9) with a binomial error structure and log link function to explain revisitation probability over time in response to time since fire, fire severity, their interaction, and/or fire size, and used BIC for model selection. We included time since fire as a quadratic term to account for the predicted initial disinterest in burned sites among birders soon after a fire and the eventual attraction to those sites after some time had elapsed and vegetation/birding conditions improved.
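The yearly return records described above can be sketched as follows, producing one row per post-fire year for an observer and locality. Several details are assumptions made for illustration: time since fire starts at 1 in the year after the fire, the series stops at 2019 or 8 years after the fire (whichever comes first), and the function name and example years are hypothetical.

```python
def yearly_returns(fire_year, visit_years, last_year=2019, max_lag=8):
    """Return (time_since_fire, returned) rows for each post-fire year.
    An observer qualifies only if they visited one year before the fire."""
    if fire_year - 1 not in visit_years:
        return []
    rows = []
    for lag in range(1, max_lag + 1):
        year = fire_year + lag
        if year > last_year:
            break  # beyond the study period
        rows.append((lag, int(year in visit_years)))
    return rows

# Invented example: fire in 2014, visits in 2013 (pre-fire), 2015, and 2018.
rows = yearly_returns(2014, {2013, 2015, 2018})
print(rows)
```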