Data for: Wild bees and landcover: bee species’ body size does not predict the scale of effect, but bee phenology predicts association with landcover type

Simpson, Dylan; Smith, Colleen; Winfree, Rachael

Published Jul 07, 2025 on Dryad. https://doi.org/10.5061/dryad.ngf1vhj55

Data files

Jul 07, 2025 version files (200.69 MB total)

Data_and_Code.zip: 200.68 MB
README.md: 10.02 KB

Abstract

Habitat is a key aspect of any species’ niche and can affect populations at multiple spatial scales. Basic ecology and effective conservation thus require understanding which habitats matter and at which scales. Yet, habitat studies are rarely scale-optimized, and what determines the scale(s) at which populations are affected by surrounding habitat (the “scale of effect”) is poorly understood. In this study, we test the “mobility hypothesis,” which predicts that species with larger foraging ranges should have larger scales of effect. The mobility hypothesis is the most popular explanation of what determines species’ scales of effect, but empirical support is mixed. We test the mobility hypothesis using wild bee species and, in doing so, also assess the landscape-scale habitat associations of 84 bee species. We collected 30,376 specimens of 84 bee species from 165 sites in the northeastern USA and used linear models to determine landcover associations and scales of effect for each species. To test the mobility hypothesis, we asked whether scales of effect varied with two mobility-related traits, body size and sociality, which are the strongest known predictors of bee foraging ranges. Controlling the false discovery rate at 5%, we found 193 significant species-landcover associations across 60 (of 84) species. Scales of effect ranged from 100 to 8000 m (mode = 200 m; median = 1000 m) and, counter to the mobility hypothesis, were not associated with body size or sociality. As a result, we argue that ecologists should reconsider making assumptions about species’ scales of effect and should instead explicitly measure scales of effect for their particular study organism and system. Considering the landcover associations themselves, we found these were broadly explained by phenology, with spring-flying bees being associated with forests and summer-flying bees being associated with more open, non-forested habitats.

Description of the data and file structure

Data and analysis for "Habitat associations of wild bees: on the importance of phenology and the idiosyncrasy of scale" by Dylan T. Simpson, Colleen Smith, and Rachael Winfree.

Inside Data_and_Code.zip, the data needed to run the analysis are in ~/Data/In, and the scripts are in ~/Code/. The file paths used in the scripts assume that your working directory is the parent folder (i.e., the location of this readme).

For convenience, we have also included the intermediate products and outputs so one can start anywhere without having to run all the analyses (which are rather time- and memory-intensive).

Details on the datasets are included in the supplement, but in short:

Datasets are bee abundance data from surveys conducted by pan traps and, in some cases, vane traps.
Data are compiled from 5 studies. These are described in the supplement, and most have been published elsewhere. Use of these data should cite the primary citations where possible (citations available in the supplement and main text).
Collectively, data were collected at 165 sites.
We focus on 84 species that were represented by ≥ 30 individuals at ≥ 10 sites. These 84 species were represented by ~30 K specimens.
Additionally, the data folder includes species trait data: body size, which was measured on specimens in the Winfree Lab; phenology, which was estimated from data collected by the Winfree Lab and from American Museum of Natural History data curated by Ignasi Bartomeus (Bartomeus et al. 2013, PNAS); and sociality, which is based on natural history information from a variety of sources, e.g., Michener, Discover Life, and personal communications with natural historians like John Ascher and Jason Gibbs.
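
The focal-species rule above (≥ 30 individuals at ≥ 10 sites) can be sketched as follows. This is a minimal illustration in Python rather than the R used by the archived scripts; the `(species, site, count)` record layout and the function name `focal_species` are assumptions made for the example, not the structure of the actual data tables.

```python
from collections import defaultdict


def focal_species(records, min_individuals=30, min_sites=10):
    """Return species meeting both thresholds.

    records: iterable of (species, site, count) tuples.
             This layout is hypothetical, chosen for illustration.
    """
    totals = defaultdict(int)   # total individuals per species
    sites = defaultdict(set)    # distinct sites where each species was caught
    for species, site, count in records:
        if count > 0:
            totals[species] += count
            sites[species].add(site)
    return sorted(
        sp for sp in totals
        if totals[sp] >= min_individuals and len(sites[sp]) >= min_sites
    )
```

A species with many individuals at only a few sites, or a few individuals spread over many sites, is excluded; both thresholds must be met.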

Details on the data tables themselves and the purpose of each R script are found in the two metadata files.

Files and folders

The .zip file included is meant to act as a working directory to recreate the analyses from the paper.

The following describes the contents of the files and folders. Additionally, there is another metadata file included in the folder's main directory, "metadata_FilesAndFolders.rtf", that includes descriptions of individual columns from each data table.

-Dylan Simpson, December 2024

Code: folder containing R scripts to compile data and run the analysis

0.1_ModelData.R: R script to combine survey data with species info (phenology, body size, sociality info) and landcover data, and then break apart data into the form used as model inputs
1.1_GeneratingNullExpectations.R: R script to run a null model version of the main analysis, in which the abundance of 12 example species is repeatedly modeled as a function of randomly generated data that have the same structure as our habitat data. The performance of these “null models” is used to define a null expectation of model performance given the scale-optimization procedure and the cross-scale correlation structure of our landcover data.
1.2_DefiningNullDistribution.R: An R script that processes the output of the previous script and saves as an output a table in which each row is an iteration of the randomization-based null model. The column of log-likelihood ratios is the numerical estimation of the null distribution.
2.1_ScaleSelectionAnalysis.R: This script runs the analysis. Within a set of nested map() functions, this script determines the appropriate null model for each species (i.e., which covariates should be included and in what form), models the abundance of each species as a function of each habitat at each scale, and measures the performance of every model against the null (i.e., covariate only) model.
2.2_SelectingSignificantModels.R: A processing script that determines which habitat associations were significant. The script pulls the scale-optimized models from the previous step, calculates p-values by comparing to the simulated null distribution, determines the critical p-value that maintains a false discovery rate of 5%, and then filters the results to only those models with a p-value ≤ the critical p-value.
3_Results.R: A script that summarizes and visualizes the results from the significant models. All figures are generated within this script, and all additional statistical tests (e.g., tests of the mobility hypothesis) are also done in this script.
Functions: a folder of custom R functions that are called by earlier scripts
- FDR_pCalculator.R: A function to determine the critical p-value that maintains a pre-determined false discovery rate. The algorithm to identify the critical p-value is taken from Benjamini et al. 2006 (Biometrika), as summarized by Pike 2011 (Methods in Ecology & Evolution).
- Plotting_functions.R: Functions to make two plots that get called by the Results script. The first is the ‘scale selection bar plot,’ which plots the AIC weight of habitat associations at each scale. The scale of the model with the highest weight is taken as the scale of effect for that species-habitat association. These are the plots in panel 3 of Figure 1 and in Figures S4 and S9. The other is a function to make the tile plot of species-habitat associations. This is Figure 2 in the main text, though the final figure was modified in Adobe Illustrator to make it more readable.
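
The significance step described for 2.2_SelectingSignificantModels.R (p-values from a simulated null distribution, then a critical p-value for a 5% false discovery rate) can be sketched roughly as below. This is an illustration in Python, not the archived R code, and it rests on two assumptions: the empirical p-value uses the common "(1 + count) / (1 + n)" convention, and the critical p-value uses the classic Benjamini-Hochberg step-up rule as a simpler stand-in for the procedure of Benjamini et al. 2006 that FDR_pCalculator.R implements. The function names are hypothetical.

```python
from bisect import bisect_left


def empirical_p(observed_llr, null_llrs):
    """One-sided p-value: share of null log-likelihood ratios >= the observed one.

    Uses the (1 + count) / (1 + n) convention so that p is never exactly 0.
    This convention is an assumption, not necessarily what the R scripts do.
    """
    null_sorted = sorted(null_llrs)
    n = len(null_sorted)
    n_at_least = n - bisect_left(null_sorted, observed_llr)
    return (1 + n_at_least) / (1 + n)


def bh_critical_p(p_values, fdr=0.05):
    """Critical p-value under the Benjamini-Hochberg step-up rule.

    Returns the largest p-value p_(k) satisfying p_(k) <= fdr * k / m,
    or None if no p-value qualifies. Stand-in for the Benjamini et al. 2006
    procedure, which differs in detail.
    """
    m = len(p_values)
    critical = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * rank / m:
            critical = p
    return critical


def significant(p_values, fdr=0.05):
    """Keep only p-values at or below the critical p-value."""
    critical = bh_critical_p(p_values, fdr)
    if critical is None:
        return []
    return [p for p in p_values if p <= critical]
```

Because the step-up rule keeps every p-value at or below the largest one that passes, a p-value that misses its own rank threshold can still be retained when a larger p-value passes further down the ranking.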

Data / In: Folder with all input data. If run from the beginning, all R scripts and analysis should work with only the contents of this folder.

Bee_bodySize.csv: table containing intertegular distance (ITD) measurements for each species. Note that this table only contains measurements for 75 of our 84 species. Some species had been measured for a previous project by the lab group. Of the remaining species, we only measured those that had significant habitat associations and were thus included in the mobility hypothesis test.

Bee_phenology.csv: table containing phenology data for each species.
Bee_sociality.csv: table identifying the social strategy of each species.
Habitat_proportions.csv: table with the proportion of each focal habitat/landcover around each site, measured at each of 30 spatial scales (or, technically, spatial extents), and measured using each NLCD release between 2001 and 2021.

Data / Intermediate: folder to contain intermediate data products. If running the R scripts from the beginning, this will become populated with the following files. For convenience, I include here copies of these products produced by my runs of the code, so that a user can pick up anywhere they like in the workflow.

Model_Data: a folder containing .rds files of model input data for each species. Each .rds file is a list of survey data; each element is the same survey data paired with habitat/landcover data measured at a different spatial scale.

Null_LLR_distribution.csv: a table in which each row is the result of one iteration of the randomization-based null analysis (described in the supplement and run by Code/1.1_GeneratingNullExpectations.R and Code/1.2_DefiningNullDistribution.R).
Species_summary.csv: A table containing sample size and observed site occupancy for each species. Occupancy here is the number of sites at which a species was observed. The total possible is 165. Important note, however: Some sites were only measured in one season or another (e.g., spring but not summer, or vice versa). As a result, not every species was available to be sampled at every site. A robust estimate of occupancy would first filter to sites that were sampled during that species’ flight window (as we did in the main analysis for our models of abundance).
Species_to_test_null.csv: a subset of the previous table, containing the species that were used to generate the null distribution of LLR values. Ten of these were randomly selected. We additionally included Augochlora pura and Ceratina calcarata.
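
The caveat on occupancy in Species_summary.csv can be made concrete with a small sketch: restrict the denominator to sites that were sampled during the species' flight window before counting detections. The example is in Python rather than the R used by the archived scripts, and the season labels, the input structures, and the function name `seasonal_occupancy` are all hypothetical.

```python
def seasonal_occupancy(flight_season, seasons_by_site, sites_observed):
    """Occupancy restricted to sites sampled during the species' flight window.

    flight_season:   label such as "spring" or "summer" (hypothetical labels)
    seasons_by_site: dict mapping site -> set of seasons in which it was sampled
    sites_observed:  collection of sites where the species was detected

    Returns (occupied, available): sites with detections, and sites at which
    the species could have been detected at all.
    """
    available = {
        site for site, seasons in seasons_by_site.items()
        if flight_season in seasons
    }
    occupied = available & set(sites_observed)
    return len(occupied), len(available)
```

For a spring-flying species, a site sampled only in summer drops out of the denominator instead of counting as an absence.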

Data / Out: folder containing analytical outputs. As above, this would become populated by running the R scripts in Code/, but I included my copies here for convenience.

Model_fits: a folder containing individual .csv files with results for each species. Each of these .csvs contains summary info for the model of bee abundance as a function of each habitat at each scale.
nullResults.rds: an R list containing the results of the null analysis for each species. Each element is a dataframe for one species, in which each row is an iteration of the null analysis for that species. This is a post-processed file; for each iteration, the best model has been pulled out and the performance of that model has been measured.
nullRuns: a folder containing the raw outputs from the null analysis for each species. Each .rds file contains a list of 1000 elements. Each element is the scale selection table for that run of the null.
scaleSelectionTable.csv: a table with the raw output of the main analysis. It is the concatenated scale selection tables for every species and habitat. For each species and habitat, the table contains the model results for how that habitat, at each scale, predicts the abundance of that species.
significantModels.csv: A subset of the table above that contains only the final, significant models. This was generated by first selecting the best model for each species-habitat combination (each model represents a scale, so this is the scale optimization step), then determining the significance of that scale-optimized habitat model.
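
The scale optimization step can be illustrated with Akaike weights: for one species-habitat combination, each scale's model gets a weight from its AIC, and the scale with the highest weight is taken as the scale of effect. The sketch below is in Python rather than the R used by the archived scripts; it uses the standard Akaike weight formula, and the dictionary layout and function names are assumptions for the example.

```python
import math


def akaike_weights(aic_by_scale):
    """Akaike weights for a set of candidate models, one per spatial scale.

    aic_by_scale: dict mapping scale (e.g. radius in metres) -> AIC.
    w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j),
    where delta_i is the AIC difference from the best (lowest-AIC) model.
    """
    best = min(aic_by_scale.values())
    relative = {
        scale: math.exp(-0.5 * (aic - best))
        for scale, aic in aic_by_scale.items()
    }
    total = sum(relative.values())
    return {scale: r / total for scale, r in relative.items()}


def scale_of_effect(aic_by_scale):
    """Scale whose model carries the highest Akaike weight."""
    weights = akaike_weights(aic_by_scale)
    return max(weights, key=weights.get)
```

Picking the highest weight is equivalent to picking the lowest AIC; the weights are mainly useful for showing how sharply one scale is preferred over its neighbours, as in the scale selection bar plots.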

Figures: an empty folder to contain figures created by the Results script.

Access information

Many of these data were collected for studies that have been previously published (see Related Works and the Data Description in this readme), but the data are not currently available elsewhere in this form.

The analysis in this paper used a dataset amalgamated from five previous studies, each of which collected bees by pan or vane trap. Following is a brief summary of each of these studies. Data from three of these studies were previously published, and the publications are noted below and in the "Related Works" section, while data from the other two are being published here for the first time. Any future analyses using these data should cite the original articles.

Dataset 1: Pinelands (Winfree et al. 2007)

This study was designed to ask about the effects of human land use on bee communities. The study region was the Pine Barrens of southern New Jersey. There were 44 sites placed along a human land use gradient. In this region, the natural land cover is predominantly forested ericaceous heath, and human land cover is predominantly agriculture (blueberry, cranberry) and suburbs.

The 44 sites were visited 2-5 times each in 2003, for a total of 167 site visits. At each site visit, 44 pan traps (plastic bowls painted with white or fluorescent blue or yellow) were placed along a 110-m transect and left for 8 hours between 07:00 and 17:00. Data were only collected on sunny or partly sunny days.

Of our 84 focal species, 59 were detected in this study, represented by 1470 specimens.

Dataset 2: Biotic homogenization (Harrison et al. 2018a, b)

This study was designed to assess the role of human land use in biotic homogenization across space and ecoregion. The study region was New Jersey, New York, and Pennsylvania. There were 36 sites in a nested block design, with three blocks of three sites each nested within four ecoregions. Each block had one site embedded in each of three dominant landcover types, namely agriculture, (sub)urban, and forest, while local habitat was standardized as mown grass.

Sites were visited 1-6 times per year, from spring to autumn, in the years 2013-2015. Across sites and years, there were a total of 377 site visits. At each site visit, 36 pan traps and two vane traps were set out and left for 24 hours.

Of our 84 focal species, 83 were detected in this study, represented by 11923 specimens.

Dataset 3: SWG (previously unpublished)

This study was designed to examine differences in bee abundance, diversity, and community composition among different habitat types in New Jersey. There were 37 sites visited a total of 79 times between March and September of 2016. Sites were haphazardly located across the state, with representation of many major landcover and habitat types, including different forest types, crops, sub/urbanization, and near wetlands. Each site was visited at least twice, once during the spring and once during the summer. At each site visit, 35 white, blue, and yellow pan traps were placed along a 50-m transect for 5-7 hours between 08:00 and 15:00.

Of our 84 focal species, 77 were detected in this study, represented by 4154 specimens.

Dataset 4: Forests 1 (Smith et al. 2021)

This study was designed to assess the effects of forest age, area, and fragmentation on bee communities. The study region was the Piedmont ecoregion of New Jersey. There were 32 sites, all embedded within forests but with varying landscape contexts.

Sites were visited 2-4 times in 2017 and 2018, except 5 sites that were only visited in 2017. At each site visit, 39 pan traps were placed in a 40 m x 100 m grid. In 2018, four vane traps were also placed. Traps were left for ca. 8 hours between 05:30 and 20:00.

Of our 84 focal species, 64 were detected in this study, represented by 10550 specimens.

Dataset 5: Forests 2 (Winfree et al. 2014)

This study was designed to assess the effects of forest vs. non-forest “matrix” habitat on bee communities. The study region was the Piedmont region of New Jersey. There were 16 forest sites embedded within forest fragments of differing size, or within sub/urban or agricultural matrix.

Sites were visited 4 times each in April and May of 2006. On each site visit, an array of 39 white, yellow, and blue pan traps was placed and left for 4 hours.

Of our 84 focal species, 55 were detected in this study, represented by 2279 specimens.
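
As a quick consistency check, the per-dataset site and specimen counts given above can be summed and compared with the totals stated in the abstract (165 sites and 30,376 specimens). The snippet is in Python for illustration only; the dictionary keys are labels chosen here, and the check assumes that no site is shared between datasets.

```python
# Site and focal-specimen counts as listed in the dataset summaries above.
datasets = {
    "Pinelands": {"sites": 44, "specimens": 1470},
    "Biotic homogenization": {"sites": 36, "specimens": 11923},
    "SWG": {"sites": 37, "specimens": 4154},
    "Forests 1": {"sites": 32, "specimens": 10550},
    "Forests 2": {"sites": 16, "specimens": 2279},
}

total_sites = sum(d["sites"] for d in datasets.values())
total_specimens = sum(d["specimens"] for d in datasets.values())

print(total_sites, total_specimens)  # 165 30376
```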