Data from: Warming temperatures to expand spruce beetle outbreak distribution in northern but not southern Picea forests of western North America
Data files
Jun 02, 2026 version files 5.29 GB
-
README.md
4.98 KB
-
sb_ecography.Rmd
115.82 KB
-
spruce.data.csv
3.15 GB
-
spruce.grid.gpkg
2.13 GB
Abstract
Spruce beetle outbreaks in western North American forests are strongly influenced by climate and host availability, making it critical to understand the biogeographic drivers shaping their distribution for accurate forecasting under changing environmental conditions. This study examines the subcontinental drivers of spruce beetle outbreak patterns, compares regional differences in constraints on outbreak occurrence, and develops geospatial hindcasts and forecasts to assess changes in outbreak distribution over time. Using aerial survey data of spruce beetle–caused tree mortality from 1962 to 2023, along with forest inventory and downscaled climate data, ensemble machine learning models were constructed to analyze historical patterns and predict both past and future distributions. The results indicate that minimum winter temperature, snow-derived precipitation, and mean growing season temperatures are the primary factors explaining outbreak distribution across western North America, although the limiting factors vary regionally. In southern regions, host tree availability restricts outbreaks, while in western coastal areas, high moisture and mixed-species forests limit their spread. In northern regions, cold minimum winter temperatures serve as the primary constraint. Model hindcasts and forecasts reveal that spruce beetle outbreaks have expanded across all regions since 1900 and are projected to continue shifting poleward as warming trends persist. These findings highlight the importance of climate variability in shaping forest insect dynamics and suggest that continued warming will facilitate the expansion of spruce beetle outbreaks into previously unaffected areas. Such large-scale irruptions in forests with historically low activity may significantly alter spruce-dominated ecosystems, with potentially profound but uncertain consequences for forest structure, ecosystem processes, and long-term disturbance regimes.
Dataset DOI: 10.5061/dryad.jm63xsjpk
Description of the data and file structure
Code and datasets necessary for creating model results. Please contact me (Mike Howe: mike.howe@usu.edu) for any ancillary data (i.e., spatial boundary data, etc).
Files and variables
File: sb_ecography.Rmd
Description: Analysis and figure script. This R Markdown script loads ecological/spatial data, defines custom sampling and interaction-analysis functions, then preprocesses spruce suitability data to train and tune machine-learning classifiers (like XGBoost and Random Forest)
File: spruce.data.csv
Description: Primary data file.
Variables
- i5: spatial index
- log.spruce: spruce basal area log([m2/ha]+1)
- name: state/province
- region: study region
- period: 20-year time period
- scenario: historical or climate emissions scenario
- sb: spruce beetle presence/absence.
- CMD: climate moisture deficit (mm; climateNA)
- DD5: growing degree days (degree-days; climateNA)
- PAS: precip. as snow (cm; climateNA)
- Tave_sp: mean spring temp. (˚C; climateNA)
- Tave_sm: mean summer temp. (˚C; climateNA)
- gs.tave: mean spring and summer temp. (˚C; climateNA)
- ow.tave: mean autumn and winter temp. (˚C; climateNA)
- gs.ppt: mean spring and summer precip. (cm; climateNA)
- ow.ppt: mean autumn and winter precip. (cm; climateNA)
- NFFD: number of frost-free days (days; climateNA)
- Tmin_wt: min. winter temp. (˚C; climateNA)
- spr.suitability: spruce suitability index (modeled; see supporting information)
- period.region: string to identify period and region.
File: spruce.grid.gpkg
Description: spatial information, left join by i5. Use the sf package to open; st_read("spruce.grid.gpkg"). Can be visualized in QGIS.
Code/software
All analyses were performed using R 4.4.2 using R Studio (2024.09.1+394) as our IDE.
Access information
Data was derived from the following sources:
Table 1. Model data sources, extent, format, and justification for inclusion.
| Variable | Source(s) | Spatial/Temporal Extent | Format | Justification |
|---|---|---|---|---|
| P/A spruce beetle mortality | Aerial detection/overview surveys | Continental United States (1997–2022)1 | Point | Response variable |
| Predictors | ||||
| Minimum winter temperatures (°C) | ClimateNA6 | North America; 1901–1920 … 2001–2020 | Raster (800 m) | Cold temperatures cause mortality to larvae and adults. |
| Precipitation as snow (cm) | ClimateNA6 | North America; 1901–1920 … 2001–2020 | Raster (800 m) | Snowpack helps insulate larvae and adults. |
| Mean spring and summer temperatures (°C) | ClimateNA6 | North America; 1901–1920 … 2001–2020 | Raster (800 m) | Beetle development rate is tied to growing-season temperatures. |
| Climate moisture deficit (mm) | ClimateNA6 | North America; 1901–1920 … 2001–2020 | Raster (800 m) | Related to host distribution and drought stress. |
| Spruce basal area (log[m²/ha + 1]) | Individual Tree Parameter Maps7 | Continental US/Alaska | Raster (240 m) | Bark beetle outbreak likelihood is tied to host availability. |
| Spruce suitability index (unitless) | Regression model (Supplement 2) | Study Area | Polygon | Bark beetle outbreak likelihood is tied to spruce suitability. |
Note: Data sources and links.
1: USA Detection Survey Data (‘Aerial Detection Surveys, USDA Forest Service’, 1997).
2: USA Detection Survey Data (‘Aerial Detection Surveys, USDA Forest Service’, 1997).
3: Aerial Overview Surveys (BC).
4: Aerial Overview Surveys (AB).
5: Forest Health Aerial Overview (YK).
6: ClimateNA (Wang et al. 2016).
7: Individual Tree Species Parameter Maps.
8: Vegetation Resource Inventory (BC).
9: Satellite-Based Forest Inventory (Canada); Wulder et al. (2024).
Aerial surveys—We compiled aerial detection survey data describing the extent of tree damage from insects and diseases. We considered all tree mortality observed from the air to be caused by an irruptive dynamic, as the detection probability of endemic damage is extremely low. We calculated the presence/absence of spruce beetle outbreak populations by intersecting annual polygon- and point-detection survey data with a 0.25 km2 grid. Grid cells that contained any visible mortality were listed as affected.
Forest Structure—We compiled spruce (spp.) basal area estimates from both raster and polygon-based data (see Table 1). Raster-based inventory data were aggregated by upscaling to 0.25 km2 and transforming units to m2/ha. Polygon-based inventory data were aggregated by summing the area-weighted mean (per hectare) of polygons that intersected each cell. In those regions with two data sources (i.e., British Columbia), we used the higher estimate of spruce basal area because the satellite-based forest inventory (SBFI) appears to underestimate Sitka spruce in coastal British Columbia.
Historical Climate—We compiled 20-year climate data (1901-1920…2001-2020) from decadal raster-based estimates of annual and seasonal climate variables (1901-1910…2011-2020). We extracted raster values using 0.25 km2 grid cell centroids. The full list of climatic variables considered is provided in Supplement 1.
Projected Future Climate—We compiled 20-year climate projections (from ClimateNA) using the 8 General Circulation Model (GCM) ensemble (Mahony, 2022). Emissions scenarios were based on the Shared Socioeconomic Pathways (SSP) from the Coupled Model Intercomparison Project (CMIP6) that was included in the 6th Intergovernmental Panel on Climate Change report (IPCC, 2023). We considered three emissions scenarios: medium-low (SSP 2-4.5˚C; 2.7˚C warming by 2081-2100); medium-high (SSP 3-7.0˚C; 3.6˚C warming by 2081-2100), and high (SSP 5-8.5˚C; 4.4˚C warming by 2081-2100).
Spruce Suitability Index—We created an estimate of spruce climatic suitability (‘spruce suitability index’ hereafter) (Jaime et al., 2022) based on spruce basal area and bioclimatic drivers. Spruce suitability index is a unitless measure bounded between 0 and 4 and represents the gradient of bioclimatic conditions associated with higher spruce basal area. Models were constructed using a similar machine learning framework as detailed below. Methods and results for our spruce suitability index model are provided in Supplement 2.
Biogeographic Boundaries—We compiled biogeographic boundaries based on the ecoregions provided by the United States Environmental Protection Agency (North American Ecoregions) and created a depiction of spruce distribution by joining our compiled estimates of spruce basal area and the distribution of spruce in Little’s Atlas of United States trees (Little, 1971), and buffering the joined polygons by 20 km.
Preprocessing—All predictors considered were rounded and Winsorized (tails were trimmed at quantiles of 0.005 and 0.995) to control for the distribution and number of distinct observations. For example, precipitation as snow was rounded to the nearest 5 cm and Winsorized to decrease the number of distinct observations from 107 to 41 and to trim the long tail of the distribution and range of precipitation as snow from 0-545 cm to 0-200 cm. These preprocessing steps decrease model overfitting, help us explore model interactions, and allow for easier estimation of prediction errors.
Analysis
Model—We constructed a classification machine learning model stack (i.e., a combination of multiple models; ‘classification model’ hereafter) with two ensemble models: A) random forest (Breiman, 2001) and B) gradient boosting (Friedman, 2001). Both ensemble models are tree-based and predicated on constructing many simple classification trees in parallel (bagging; random forest) or sequentially by reducing residual error (boosting; gradient boosting). We used the ranger implementation of random forest (Wright and Ziegler, 2017) and the XGBoost implementation of ‘extreme gradient boosting’ (Chen et al., 2024; Chen and Guestrin, 2016). Both models were constructed using 3-fold cross-validation, and predictions from each model were averaged to compute a final prediction.
Model details—Classification models were tuned based on a grid approach (Supplement 3), and variable selection was performed using variable clustering and ‘design points’ based on holdout models (i.e., testing different combinations of related variables; Supplement 1). Since aerial detection programs contain errors of omission (missing damage caused by insects and pathogens) and commission (incorrect identification)(Coleman et al., 2018), we designed a bootstrapped workflow. For each constructed model, we randomly selected a training and a test dataset using two different stratification methods: A) numeric, and B) proportional (Supplement 4). For the numeric stratification method, we randomly selected 10,000 presences and 10,000 absences from each region and 20-year period (sampling was capped at 33% of the total number of observations for each region period combination). For the proportional stratification method, we randomly selected 10% of presences and an equal number of absences from each region and 20-year period.
Model effects—We assessed model effects based on each stratified test dataset. We assessed model performance using confusion matrix metrics, predictor importance using 1-AUC (area under the receiver-operator curve) dropout loss (i.e., how much does performance decrease if each predictor is removed)(Fisher et al., 2019), relative mean effects using accumulated local effects (Apley and Zhu, 2020), and predictor interaction strength using Friedman’s h2 (higher values correspond to greater interaction strength)(Friedman and Popescu, 2008).
Model fitting—All models were trained and tested based on data subsampled from Alaska, British Columbia, and the contiguous United States. We constructed models 100 times (50 times for each stratification method) to predict spruce beetle outbreak distribution for all 28,643,494 observations in our dataset and constructed models 1000 times (500 times for each stratification method) to estimate the relative mean effects of predictors and associated prediction error based on 95% confidence intervals. Regional models for Alaska, British Columbia, and the contiguous United States, as well as models fitted using only data from low or high basal area spruce forest,s were each constructed 250 times (125 times for each stratification method).
Thermal envelope—We developed thermal envelope breakpoints (Supplement 5) to describe the outbreak distribution of spruce beetle based on modeled results and a priori information about spruce beetle biology. Ultimately, we developed our envelope based on the mean spring and summer temperature, the minimum winter temperature, and the climate moisture deficit. Climate moisture deficit was included in the mapped thermal envelope because it helps distinguish between coastal and interior forests, but was not included in most tabular summaries. Thermal envelopes were delineated as either ‘optimal’ (within the breakpoints), ‘likely’ (outside the envelope, but possibly suitable given our a priori understanding of spruce beetle biology), or ‘not suitable’. Together, we refer to spruce forests that are either ‘optimal’ or ‘likely’ as ‘suitable’. For example, we considered the ‘optimal’ thermal envelope for mean spring and summer temperature to be 4-9.5˚C based on model results, while the ‘likely’ thermal envelope included mean spring and summer temperature between 9.5-15˚C because: A) the development time of many Dendroctonus sp. is flexible and increasing temperatures are related to either more generations per year or developmental delays through facultative diapause (Bleiker and Willsey, 2020; Hansen et al., 2011; Schebeck et al., 2017); B) field- and laboratory-based studies evaluating the effects of heat on Dendroctonus species such as D. frontalis (southern pine beetle) suggest that chronic extreme heat does not cause high levels of mortality (Lombardo et al., 2022) and laboratory studies suggest that spruce beetles can tolerate temperatures of 43˚C (Mitchell and Schmid, 1973); and C) our model results depict that the predicted upper limit of mean spring and summer temperature (where profiles and predictions decrease below 0) tracks the distribution of temperatures across the range of spruce trees (i.e., our dataset contains few observations where host basal area is high and temperatures are warm [Supplement 6]). Finally, we rounded the envelope breakpoints for minimum winter temperature to the nearest 5˚C (i.e., -15˚C instead of -16˚C) for ease of depiction.
Model uncertainty—Large-scale spatially and temporally explicit analyses and machine learning models have several sources of uncertainty. First, our models are based on a stratified sample of observations that were collected using different protocols and with highly unequal sampling effort across regions and 20-year periods. We are most confident in our predictions from 2001 to 2020 across the contiguous United States, British Columbia, and Southcentral Alaska. Second, tree-based machine learning models are not robust for predicting new observations outside of the model parameter space, which informs our decision to hindcast and forecast based on thermal envelopes.
