Incorporating plant phenological responses into species distribution models (SDMs) reduces estimates of future species loss and turnover
Data files
Mar 26, 2024 version files 3.57 MB
-
Raw_data_and_code.zip
3.56 MB
-
README.md
1.81 KB
Abstract
Anthropogenetic climate change has caused distribution shifts of many species, and species distribution models (SDMs) are central for documenting this relationship. However, most SDMs rarely consider the evolution of climate-sensitive functional traits, such as phenology, which strongly affect species fitness. Using >120,000 herbarium specimens representing 360 plant species across the eastern United States, we developed a novel “phenology-informed” SDM that integrates dynamic phenological responses to changing climates. Compared to standard SDMs, our phenology-informed SDMs forecast lower species habitat loss and less species turnover under climate change. These results suggest that phenotypic plasticity or local adaptation in phenology may help species adjust their ecological niches and persist in their habitats under rapid environmental change. Our findings reveal how phenology variation mediates species distributions and affects regional biodiversity patterns. Our newly developed model also circumvents the need for mechanistic models, facilitating the deployment of trait-based SDMs across unprecedented spatial and taxonomic scales.
We have submitted raw species phenology (phenology.total.csv), growth form (Growth_form.csv), geographical location (grid40km.csv), species distribution data (distribution_data.csv) as well as R script (RCode.R)
File Descriptions
phenology_total.csv
- Species: Currently valid species scientific name
- bud: the number of buds
- flower: the number of flowers
- fruit: the number of fruits
- state: the state in which a specimen was collected
- longitude: concrete coordiates where a specimen was collected
- latitude: concrete coordiates where a specimen was collected
- year: the year in which a specimen was collected
- month: the month in which a specimen was collected
- day: the day on which a specimen was collected
growth_form.csv
- Species: Currently valid species scientific name
- Growth_form: the species were classified as Herbaceous perennial, Vine perennial, Woody or Herbaceous annual
- Status: the species were classied as native or invasive species in the eastern United States
distribution_data.csv
species distribution matrix includes 1158 rows (number of grid cells), and 360 colnumns (species)
Key Information Sources
Herbarium specimens, climate and species distribution data were derived from the following public sources:
- Consortium of Northeastern Herbaria
- The Southeast Regional Network of Expertise and Collections
- The Biota of North America Program’s North American Plant Atlas
- Global Biodiversity Information Facility (GBIF)
- PRISM
Code/Software
R script was created using version 4.2.1.
Species selection and phenological data collection
We used digitized specimens from two of the most comprehensive digitized regional floras in the world: the Consortium of Northeastern Herbaria (CNH, https://neherbaria.org /portal/) and the Southeast Regional Network of Expertise and Collections (SERNEC, https://sernecportal.org/portal/). These two online portals include more than six million digitized herbarium specimens from across the eastern United States. Following previously established protocol (Park et al., 2019), we selected species and specimens for analysis when i) there were at least 50 unique collections of the species across space and time in the eastern United States, ii) the specimens included both an exact collection date and at least county-level location information, and iii) the specimens had easily identifiable and quantifiable buds, flowers, and fruits. Applying these criteria yielded 124,847 total herbarium specimens representing 360 species that we used in our models and analyses. These species vary in life history, growth form, and native status (i.e., native or not to the eastern United States) (Table S1 in Supporting Information).
Phenological data extraction
Crowdsourcers hired by Amazon’s Mechanical Turk service (https://www.mturk.com/) counted the number of each type of structure (i.e., buds, flowers, and fruits) present on the specimen using the CrowdCurio platform (Willis et al., 2008). Each specimen was scored independently by at least three crowdsourcers. We estimated the reliability of each crowdsourcer and their data following Park et al., (2022) (Supporting Information).
We calculated the median number of buds, flowers, and fruits from the multiple phenological observations for each specimen. We then estimated the day of year (DOY) of “peak” budding, flowering, and fruiting if their buds, open flowers, or fruits separately represented ≥ 50% of the total number of reproductive structures [as in Park et al., 2019]. Two species each were excluded from the “peak budding” category (Primula mistassinica and Sarracenia rubra) and the “peak fruiting” category (Calopogon barbatus and Narcissus poeticus) because there were no specimens of these species at these developmental stages.
Species distribution data
County-level species’ distribution data were obtained from the Biota of North America Program’s (BONAP; http://www.bonap.org/) North American Plant Atlas (NAPA; Kartesz, 2015), which included 19,039 taxa from 227 families (accessed November 2022). These data are available as binary occurrences (i.e., presence/absence) for 3,067 counties in the USA, excluding Alaska and Hawai’i. We supplemented the species’ distribution records from BONAP using additional data from the GBIF database (https://www.gbif.org) and all available specimen records in the CNH and SERNEC databases. Most (≈ 70%) of these additional data points (i.e., GBIF+NH+SERNEC) represented specimens collected between 1960 and 2000. We then transformed the county-level species distribution data into 40 × 40-km grid cell distributions (Supporting Information).
Environmental data
Climatic data of specimen localities – We extracted average monthly temperature and precipitation data (1895–2018) at a 4-km resolution from PRISM (product AN81m, http://prism.oregonstate.edu/) to characterize the climate of the locality where and when each specimen was collected (Supporting Information).
Environmental data used for modeling species’ distributions – “Recent” (1970–2000) and future (2061–2080; henceforth referred to as “2070s'') climatic data at 2.5-arc-minute (ca. 5-km) resolutions were obtained from WorldClim (https://www.worldclim.org/, ver. 2.1; all 19 climatic variables: bio1–bio19; see Table S2) (Supporting Information). We also included elevation and soil data, which can improve the fitness of SDMs (Supporting Information).
Statistical modeling
Relationships between climate and phenology
We first applied LMMs to all herbarium specimens pooled across the 360 species to identify species-specific relationships between plant phenology and climatic variables. Separate but identically structured models were built for peak budding, flowering, and fruiting. For each full model, we used the DOY for the phenological state recorded for each specimen as the response variable. Predictor variables in the full model included: the year of specimen collection, four climatic variables (bio1, bio4, bio12, and bio15; all centered and scaled), and two species-level plant traits (growth form and native status) and their interactions (bio. * Growth form; bio. * Native status) as fixed terms. The full model also included random intercepts among the species and sampling locations, and random slopes among species for bio1, bio4, bio12, and bio15. The eight interaction terms between native status or growth form and climate indicated whether changes in plant phenology along climatic gradients differed among species with different growth forms and native status. We did not find obvious nonlinear relationships between DOY and any climatic variables in the preliminary analysis for most species, so we did not include quadratic terms in our models.
All models were fitted using the “lmer” function in the “lmerTest” package (Kuznetsova et al., 2017; version 3.1-2) of the R software system (version 4.2.1). We also checked the residuals of all models to ensure that all assumptions were met and used Moran’s I to confirm that there was no evidence of spatial autocorrelation. No models showed significant spatial autocorrelation (p > 0.05). To predict peak budding, flowering, and fruiting time of each current and future species-grid cell combination, we used the “predict” function in the base “stats” package (version 4.0.0) with fits of the models previously described and the estimated values of recent and future bio1, bio4, bio12, and bio15 of each grid cell. We further checked our model with phylogenetic LMM to examine whether our results were affected by phylogenetic relationships among species (Supporting Information; Table S3-S6).
Constructing phenology-informed SDMs
We quantified how plant peak budding, flowering, and fruiting time (DOY) separately influenced the effects of environmental change on the species’ probability of occurrence. We first divided the 19 WorldClim environmental variables and the five soil variables into four groups (Table S2): (1) mean temperature (bio1, bio5, bio6, bio8, bio9, bio10, and bio11); (2) mean precipitation (bio12, bio13, bio14, bio16, bio17, bio18, and bio19); (3) climatic fluctuation (bio2, bio3, bio4, bio7, and bio15); and (4) soil variables. Principal component analysis (PCA) was then used to reduce the dimensionality of each group using the “prcomp” function from the “stats” package (version 4.0.0). In the subsequent analyses, we used either one or two principal component scores (PCs) as predictor variables in our models (see Supporting Information; Fig. S1). Elevation was additionally included as a single variable. Our final phenology-informed SDM used elevation, the aforementioned seven PCs, and the predicted grid-cell level phenology, which all were weakly correlated (|r| < 0.7; Fig. S2).
We used generalized linear mixed models (GLMMs) as the framework for our phenology-informed SDMs (Pollock et al., 2012). We fitted GLMMs using restricted maximum likelihood (REML) with the R package “lmer4” version 1.1-27). We fitted a random intercept, random slope binomial model with a logit link function to species × grid cell presence/absence data (Equations 1-4).
In this base model, the logit probability that species i occurs at the jth of 1, 2, 3, …,1158 grid cells is equal to an intercept term plus the product of a matrix of eight environmental variables ( ) and a vector of eight coefficients ( ). Here, k represents environmental variables. The parameters differed among species (see manuscript).
The submodel includes the parameter, which represents the average probability of occurrence (on a logit scale) among species within a hypothetical grid cell, and the parameter, which is the degree to which a given species departs from the average probability of occurrence. The parameter indicates the response of a given species to the relevant environmental variables.
In the submodel, the estimate is calculated as the intercept plus the coefficient matrix and matrix of trait values. The intercept is modeled as: Bk[i] ~ normal(U[k], τ[k])
where U[k] indicates the average response of species to environmental variables and τ[k] reflects the degree to which each species departs from the among-species average response. Ck describes how plant phenology regulates the response of an individual species to the environmental variables (i.e., DOY * PC[k] interaction coefficients). For example, positive values suggest that a high value for DOY (i.e., later phenology) increases the probability of occurrence of a species under high values of the focal environmental variable. We also added species as a random component for the slope of occurrence vs. environment, which allowed us to explore species-specific differences in responses to environmental variables. We also constructed abiotic SDMs without phenological information (i.e., only equations 1 and 2) to compare the results of phenology-informed and abiotic SDMs. In abiotic SDMs, we also included species as a random component.
To assess model performance, we calculated the conditional R2 reflecting the total variance explained (Nakagawa & Schielzeth, 2013). We also calculated the area under the curve (AUC) with the “pROC” package (Robin et al., 2011; version 1.16.1). AUC represents the predictive accuracy of the model and ranges from 0.5 to 1 (1 is the highest accuracy). We then converted the probability of occurrence of each species × grid-cell combination into binary maps with the threshold that maximized the specificity and the sensitivity, using the “ROCR” package (Sing et al., 2005; version 1.0-11).
Model validation through hindcasting
To validate our model, we created phenology-informed SDMs using distribution and phenology records of annual herbs after 1970 and then hindcasted the models using historic climatic data to project species distributions before 1950 (Supporting Information).
Data analyses
We calculated three metrics of changes in suitable areas of species occurrence under future climate change (combining all the SSP5-85 GCMs): i.) currently occupied range forecasted to have high suitability (i.e., range persistence; probability of occurrence > threshold); ii.) current occupied range forecasted to have low suitability (i.e., range retreat; probability of occurrence ≤ threshold); and iii.) current unoccupied range forecasted to have high suitability (i.e., range expansion) (Supporting Information). Based on species distribution maps under current and future conditions, we further calculated the habitat fragmentation index for each species using the “lsm_c_pd” function in the R package “landscapemetrics” (Hesselbarth et al., 2019; version 1.4.3).
To assess the percentage of local extinctions for a given grid cell, we then examined the geographical patterns in the proportion of species loss based on the species’ current and future distributions, which was calculated as the number of species lost (L) divided by the current species richness (S) of each grid cell (L/S). The same procedure was also applied to evaluate the proportion of species gained (G) in each grid cell under the assumption that species could colonize any new suitable climate space within the eastern United States and that their future distributions would not be limited by dispersal. The proportion of species turnover is then given by T = (L + G) / (G + S). Relationships between the proportion of species gains, losses, and turnover as a function of different degrees of climate change (e.g., changes in mean annual temperature) were further explored to examine the stability of species composition under climate fluctuations. All of these analyses were separately conducted for the abiotic and phenology-informed SDMs. We compared the changes in species’ suitable areas and geographical patterns in species gains (losses and turnover) between abiotic and phenology-informed SDMs.