Simulated Herbarium data for testing the accuracy with which specimen data can predict the timing and duration of population-level flowering displays

Park, Isaac 1 ; Ramirez-Parada, Tadeo 1 ; Record, Sydne2 ; Davis, Charles 3 ; Ellison, Aaron3 ; Mazer, Susan 1

Published Mar 26, 2024 on Dryad. https://doi.org/10.5061/dryad.dbrv15f79

Data files

Mar 26, 2024 version files 167.66 MB

Full_Simulation_30Taxa_1kSamples.csv

167.62 MB
README.md

40.51 KB

Abstract

This dataset provides code and example data for simulating specimen collections of flowering plants across North America, and for developing phenological predictions of population-level flowering onset and termination for these data. It further presents code for assessing the accuracy of these predictions relaticve to known (simulated) population-level flowering dates at the location of each collection.

This README.txt file was generated on 2023-10-20 by Isaac Park

GENERAL INFORMATION

Title of Dataset: Phenological models inferring population-level flowering onset and termination from simulated herbarium data
Author Information witheld for review*
Date of data collection 10/10/2023:
Geographic location of data collection: NA, simulations based on climate data from continental USA
Development of these datasets and associated python code were supported by the National Science Foundation

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: Public Domain
Links to publications that cite or use the data:
Links to other publicly accessible locations of the data:
Links/relationships to ancillary data sets: Raw observation locations used to underpin simulations in this data can be acquired from USANPN.org. Climate data is available through https://www.prism.oregonstate.edu/
Was data derived from another source? yes. Raw observation locations used used to underpin simulations in this data can be acquired from USANPN.org. Climate data is available through https://www.prism.oregonstate.edu/
Recommended citation for this dataset:Park (2023), North American herbarium data for Phenological Assessment, Dryad, Dataset

DATA & FILE OVERVIEW

File List:

Data/Full_Simulation_30Taxa_1kSamples.csv (File): Locations from which to simulate local populations were selected, and associated climate data for those locations
Data/SimHerb_Data (Folder): contains files consisting of simulated plant collections (and the simulated local populations from which they were drawn). Different files represent sets of simulated species, that exhibit differring magnitudes of phenological responsiveness, intrapopulation variation in flowering DOY (Day Of Year), and the magnitude of random noise added to simulated collection dates.
- RQ_Sim_Distribution_slope4_skewDistributiontest21.csv: simulated collections data with both population-level biases and biases in collection dates within the flowering period of selected individuals follow the following name convention:
- RQ_Sim_Distribution_Variableduration_slope.csv simulated data with biases in collection dates a magnitude of population-level phenological responseiveness of 4 days/degree Celsius, and a variable duration of flowering by individual plants within that population.
- Individual files simulating data with biases in collection dates a magnitude of population-level phenological responsiveness of 4 days/degree Celsius, a variable duration of flowering by individual plants within that population, and mild random noise in collection dates: RQ_Sim_Distribution_Variableduration_slope_noise10.csv
- Individual files simulating data with biases in collection dates a magnitude of population-level phenological responseiveness of 4 days/degree Celsius, a variable duration of flowering by individual plants within that population, and high random noise in collection dates: RQ_Sim_Distribution_Variableduration_slope_noise30.csv
- Individual files simulating data with biases in collection dates a magnitude of population-level phenological responseiveness and stable magnitudes of intrapopulation phenological variation as well as constant durations of individual flowering: RQ_Sim_Distribution_XXsigma_slope_Y, where XX represents the magnitude of intrapopulation phenological variation, and Y represents the magnitude of population-level phenological responsiveness in mean pdate of peak flowering (in days/degree Celsius). In cases where random noise was added into simulated data, the suffix _noise10 (in the case of mild magnitudes of noise) or _noise30 (in the case of high magnitues of noise) were appended to these file names.
- RQ_Sim_Distribution_slope4_skewDistribution.csv: simulated data with biases in collection dates a magnitude of population-level phenological responseiveness of 4 days/degree Celsius, incorporating left and right-skewed population-level collection biases into sample collection simulations.
Data/Simulated_Distribution_flatDist_Models (Folder): contains files pertaining to the output of phenoclimate models run on simulated plant collections that exhibit no bias in collection timing within the flowering period of the sampled individuals. These data incorporate many of the fields from the simulated plant collections data in the folder Data/SimHerb_Data, as well as predictions of population-level flowering onset and termination at each location from which populations were simulated. This folder also includes files listing the selected parameters and coefficients for all species-specific phenoclimate models.
Data/Simulated_Distribution_Models (Folder): contains files pertaining to the output of phenoclimate models run on simulated plant collections that are analogous to those located in the folder Data/Simulated_Distribution_flatDist_Models, but pertaining to phenoclimate models developed on data with collections that are biased towards collection proximate to the individual peak flowering dates of the sampled individuals.
Data/Simulated_Distribution_leftDist_Models (Folder): contains files pertaining to the output of phenoclimate models run on simulated plant collections that are analogous to those located in the folder Data/Simulated_Distribution_flatDist_Models, but pertaining to phenoclimate models developed on data with collections that are biased towards collection proximate to the individual flowering onset dates of the sampled individuals.
Data/Simulated_Distribution_rightDist_Models (Folder): contains files pertaining to the output of phenoclimate models run on simulated plant collections that are analogous to those located in the folder Data/Simulated_Distribution_flatDist_Models, but pertaining to phenoclimate models developed on data with collections that are biased towards collection proximate to the individual flowering termination dates of the sampled individuals.
Data/Simulated_Distribution_Models/Summaries (Folder): contains aggregated files summarizing the magnitude of R2, RMSE, MAE, and raw differences between simulated population phenological dates and the predicted timing of those dates generated by phenoclimate models for simulated speces that were collected with different forms of sampling bias, or with different magnitudes of noise.

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:
Creating a reference dataset: generating sample locations representing known population-level phenological distributions and individual phenological parameters
We simulated phenological data for 1200 hypothetical “species” in the coterminous USA that varied in the attributes of their individual- and population-level flowering phenology. For each of these simulated species, we selected 1000 locations within the continental United States, each representing a local population observed during a single year from which a simulated specimen was later obtained. The coordinates for each location, year, and associated mean annual temperature in the year of collection were randomly selected without replacement from 4-km2 PRISM pixels (PRISM Climate Group 2011) between the years 1901 to 2020, and were restricted to locations with 1991–2020 temperature normals of 1–20 °C and mean annual precipitation normals for the same period of 60–3800 mm.
Each species generated this way was assigned a series of attributes defining its individual- and population-level flowering phenology. The peak flowering date of an individual was assumed to coincide with its mean flowering date. We then defined a linear equation describing the relationship between the mean date of peak flowering among individuals within a population and local temperature conditions. Each species was assigned a median population flowering DOY of 50 at 0˚C (i.e., the intercept) as well as a phenological responsiveness (i.e., slope) of median flowering DOY to mean annual temperature: advancing by 1, 4, or 8 days per increase in °C. Next, we assigned each species a low or high magnitude of intrapopulation variation in phenological timing (i.e., in peak flowering DOYs) among individuals (based on normal distributions with standard deviations (σ) of either 10 or 30 days), representing the magnitude of variation in the flowering times of early- to late-flowering individuals within each local population. Then, each species was assigned a short, moderate, or long duration of the flowering period by each individual within each population (15, 30, or 60 days, representing the duration of time each individual plant was in flower. Fifty species were simulated for each of these 18 combinations of phenological responsiveness, flowering duration, and intrapopulation variation in phenological timing.
To accommodate the possibility that the magnitude of variation in phenological timing within a population could depend on local climate conditions, we also simulated 50 species with temperature-sensitive intrapopulation phenological variation (σ) ranging from 10 to 30 days. For these species, σ of the DOY among individuals in a given population increased by 1 day for every 1 °C increase in the mean annual temperature of its location. For these simulated species, individual flowering duration was fixed at 30 days. Additionally, to accommodate the possibility that individual flowering durations could exhibit linear relationships with local climate conditions, we also simulated 50 species that exhibited individual-level variation in flowering duration resulting from changes in temperature (increasing by 1 day per °C increase in mean annual temperature, and ranging from 10 days to 30 days). For these species, the degree of intrapopulation variation in peak flowering dates was held constant at σ = 30 days (i.e., high intrapopulation variation).
Calculation of population-level onset, median, and termination dates of flowering
For each population of each species described above, we calculated a distribution of individual-level peak flowering dates—assumed to be normally distributed (Clark and Thompson 2011)—based on the flowering attributes of the species and the temperature conditions corresponding to its site and year of observation. First, we calculated the median flowering DOY at the location and year from which each specimen was collected based on its pre-defined intercept and phenological responsiveness to mean annual temperature (i.e., 1, 4, and 8 days per °C). Then, we obtained the standard deviation of each local population (i.e., its degree of intrapopulation variation in flowering dates) based on the flowering attributes of the simulated species as outlined above. Next, we arbitrarily defined population-level flowering onset DOYs for each population and year as the 10th percentile of a normally distributed population whose mean and standard deviation we obtained in the previous steps (i.e., the DOYs by which the first 10% of individuals in a local population at a given location and year would have reached their median flowering dates). Similarly, the population-level flowering termination dates were calculated as the 90th percentile of a normally distributed population with the same characteristics as described above (i.e., the DOYs by which all but 10% of individuals in a local population at a given location and year would have reached their peak (or mean) flowering dates).
Through this process, we obtained a sample of 1000 annual population-level distributions of flowering dates for each of 1200 hypothetical species. For each of these populations, the quantiles of their flowering distribution—representing the nth individual reaching peak flowering within a population—were known a priori, representing a benchmark against which to compare estimates derived from simulated specimen data.
Simulating randomly selected (unbiased) phenological snapshots from pre-defined populations
For each species, we then generated simulated specimens by: (1) randomly selecting an individual within each population and (2) selecting a random DOY within its individual-level flowering period that emulated the phenological snapshot provided by real herbarium specimens. Specifically, using the distribution of peak flowering dates of each population, we selected an individual at random. From its peak flowering date, we then obtained onset and termination dates by subtracting (for flowering onset) or adding (for flowering termination) half the individual’s flowering duration for that species to the sampled date of peak flowering. To simulate a phenological snapshot for that individual, we then randomly selected a DOY between the onset and termination of that individual’s flowering period. As a result, the simulated datum represented a simulated herbarium specimen generated accounting for uncertainty in both the timing of the individual relative to its source population, and in the timing of the collection relative to the onset and termination of that individual’s flowering period. This procedure was repeated across all locations for each simulated species, generating 1000 data points (i.e., simulated specimens or phenological snapshots) per species.
Simulating biases in collection effort across population-level flowering periods
To simulate biases towards collection of specimens during the early or late portion of their local population-level flowering displays, we selected an individual at random within each population and year using both left- and right-skewed normal probability distributions. These distributions were constructed by modulating the parameter α in the python package scipy.stats.skewnorm v1.10.1 (Azzalini and Capitanio 1998), such that if the underlying plant population was treated as exhibiting a normal distribution (α = 0), samples were collected from that population with a left-skewed (α = -1.0) or right-skewed (α = -1.0) probability distribution. Once an individual was selected from these skewed distributions, the timing of sample collection from within the individual flowering durations of these ‘specimens’ were generated using similar methods as unbiased specimens. We then determined the accuracy of the model predictions generated from datasets exhibiting biased and unbiased sampling of local populations by comparing predicted population-level flowering onset and termination dates with the actual (i.e., known, simulated) flowering dates that were produced using a normal distribution. To minimize computation time, population-level biases were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/˚C (representing moderate responsiveness to climate stimuli), intrapopulation variation was high (σ = 30), and individual flowering duration was moderate (30 days).
Simulating biases in the timing of collection within flowering periods of individuals
In addition to biases towards collection of early or late individuals within a local population, botanists may also preferentially collect individuals from the early or late portion of their individual flowering period (i.e., individual collection bias). In some cases, collectors may preferentially collect individuals that are proximate to their peak flowering date because this is when the most flowers are displayed. In other cases, collectors may preferentially collect specimens that have only recently begun to flower, when floral structures may exhibit less damage from inclement weather or herbivores, or proximate to flowering termination in cases where the collector prefers specimens that include both flowers and fruits. Accordingly, for each population of each species, we simulated DOYs within each individual’s flowering period both at random (i.e., without bias) and with three different types of bias. Unbiased collections were simulated by selecting a random date chosen uniformly within the flowering period of each sampled individual. To represent a bias toward collection of individuals close to their peak (median) flowering DOY, we sampled collection dates from a truncated normal distribution centered on an individual’s mean flowering date and with σ = 25% of the flowering duration for that species and location (henceforth referred to as mean-biased collection data). To represent a bias toward collection dates shortly after flowering onset (henceforth, onset-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% earlier than the mean flowering onset date of that individual (σ = 25%). Finally, to represent a bias toward collection on dates shortly before flowering termination (henceforth termination-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% later than the mean flowering onset date of that individual (σ = 25%). As with examinations of population-level bias, collection biases within the flowering periods of individuals were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/˚C, intrapopulation variation was high (σ = 30), individual flowering duration was moderate (30 days), and no population-level bias was present.
Estimating population-level flowering onsets and terminations from simulated herbarium data
We generated phenoclimate models for each species from each set of simulated specimen collection dates using quantile regression (Koenker et al. 2018) in RStudio (R Team 2020). In all cases, each model regressed observed DOYs of the phenological snapshots of all sampled individuals of a given species against mean annual temperature. From these 1450 models (representing each of the species-specific models for all 1200 species plus the additional 150 models exhibiting population-level collection biases and the 100 models exhibiting individual-level collection biases), we predicted the 10th, 50th, and 90th percentiles of flowering DOYs for each species from mean annual temperatures corresponding to the years and locations of their source populations. We then calculated the mean absolute error (MAE) of the linear regression of the known timing of the onset (or termination) of the peak flowering period for each reference population on the predicted DOYs produced by each phenoclimate model based on the simulated herbarium data. For each metric of population-level phenology (i.e., flowering onset, peak (i.e., median DOY), and termination), we then used Tukey HSD tests to compare the mean accuracies (estimated as MAE) of these predicted DOYs versus the actual population-level metrics among models constructed from species that differed in their phenological sensitivities to climate, flowering durations, degrees of intrapopulation variation in phenological timing, and collection biases.
Similarly, we tested whether the mean MAE of estimated peak flowering onset and termination dates among groups of species that exhibited the same flowering duration, phenological responsiveness, and intrapopulation phenological variation differed significantly from the mean MAE of estimated median flowering dates for each group of simulated species that exhibited the same flowering duration, phenological responsiveness, and intrapopulation phenological variation. We used Tukey HSD tests to compare the accuracy of estimated onset, median, and termination dates of the peak flowering period among all species produced from each of the simulated datasets.
Finally, we re-fit all 1200 models (including all 24 combinations of species parameters but excluding models constructed to test the effects of collection biases) with randomly selected subsets of data (100–1000 specimens per species) to determine how sample size affected model performance and predictive accuracy. To evaluate whether more data would be needed when variation in phenology among populations is not perfectly explained by the climate variables included in the model, we ran additional simulations in which population-level mean DOYs (and associated onset and termination DOYs of the flowering period) of each species at each sampled location and year included random variation not associated with local climate: adding either ±5 days (i.e., a low-noise scenario) or ±15 days (i.e., a high-noise scenario) to the DOYs of the onset, median, and termination of flowering DOYs. For each location and year, the random offsets of the DOYs of flowering onset, median flowering DOY, and flowering termination were identical, such that random variation was incorporated only into the timing of flowering, and not its duration.
Methods for processing the data:
Python and R scripts should be run in the following order: (Note that these scripts require setup of the python environment according to the file pheno.yml, included in this repository)

01_Simulated_distribution-popskew.ipynb: generates randomized collection locations and simulates population-level flowering DOYs, as well as simulating specimen collection dates from those populations.
02_NPN_qreg_Functions_SIM_SlopeComp_debug3.Rmd: using simulated collection data, models and predicts population-level flowering onset and termination dates- calculates separate predictions for specimens collected under biases towards collection proximate to the dates of flowering onset, mean flower, or termination by the individuals being 'collected'
03_NPN_qreg_Functions_SIM_SlopeComp_skew.Rmd: using simulated collection data, models and predicts population-level flowering onset and termination dates among species collected with biases towards the early or late portions of the population-level bloom display
04_NPN_qreg_Functions_SIM_SlopeComp_noise10.Rmd: using simulated collection data, models and predicts population-level flowering onset and termination dates similarly to 02_NPN_qreg_Functions_SIM_SlopeComp_debug3.Rmd, but for simulated collections in which mild quantities of additional random variation (noise) have been added to collection dates.
05_NPN_qreg_Functions_SIM_SlopeComp_noise30.Rmd: using simulated collection data, models and predicts population-level flowering onset and termination dates similarly to 02_\NPN_qreg_Functions_SIM_SlopeComp_debug3.Rmd, but for simulated collections in which high quantities of additional random variation (noise) have been added to collection dates.
06_Onset_Term_Model_Assessment-Simulated_Distributions-skew.ipynb: Extracts mean absolute error, R2, and other perfomance metrics for phenoclimate models developed using data from each simulated species while also recording the attributes of each species. Conducted for cases in which biases towards the early or late portions of the population-level bloom display. Also compares the magnitude of MAE across species exhibiting different traits or collection biases.
07_Onset_Term_Model_Assessment-Simulated_Distributions-left_right_flat.ipynb: Extracts mean absolute error, R2, and other perfomance metrics for phenoclimate models developed using data from each simulated species while also recording the attributes of each species. Conducted for cases in which biases towards the collection proximate to the DOY of flowering onset, peak flowering, or flowering termination of the collected individual.
08_Onset_Term_Model_Assessment-Simulated_Distributions-left_right_flat-Noise.ipynb: Simulated_Distributions-left_right_flat.ipynb: Extracts mean absolute error, R2, and other perfomance metrics for phenoclimate models developed using data from each simulated species while also recording the attributes of each species. Conducted for cases in which biases towards the collection proximate to the DOY of flowering onset, peak flowering, or flowering termination of the collected individual, and in which mild or high magnitudes of random noise were added into collection dates.
09_Simulated_Data_Noise_MAE_Samplecount_Assessment-Copy1.ipynb: Assesses MAE of predictions made by models constructed from different numbers of specimen collections, and from specimen collections that exhibit differing magnitudes of random variation in collection dates.

Instrument- or software-specific information needed to interpret the
data:
Python code was run python version 3.9.7, and requires the installation packages listed in the file pheno.yml included with this data. R code was run using Rstudio version 4.0.0 and requires installation of the following packages:

data.table v1.14.2
plyr v1.8.7
MCMCpack v1.6-2
quantreg v5.88
DataCombine v0.2.21

DATA-SPECIFIC INFORMATION

`Data/Full_Simulation_30Taxa_1kSamples.csv`

This dataset represents the point locations (and associated normal climate conditions) from which simulated specimen locations can be randomly drawn.

Variables: 366 (6 used in analysis)
5000 rows/cases
Variable List:

X: decimal longitude of location
Y: decimal latitude of collection
Ann_tmean_norm: annual mean temperature normal (1991-2020)
Ann_ppt_norm: annual total annual precipitation normal (1991-2020)
year: year of collection (not used in analysis)
genus_species: marker delineating simulated genus and species to which data pertains
Ann_tmean_13: mean annual temperature
other climate data was incorporated into this dataset using standard PRISM climate nomenclature, excepting that the prefix 'Ann_' was added to data pertaining to annual climate, 'Prev_' was added to data pertaining to climate in the previous year, 'Norm_' was added to data pertaining to longterm normals, and 'Anomaly_' was added to data pertaining to annual deviations from long-term normals. This data was retained for completeness - however, it was not utilized in subsequent analyses.

Missing data: NAN
No specialized formats or other abbreviations used

`Data/SimHerb_Data/RQ_Sim_Distribution_10sigma_slope_4.csv`

(file names vary as specified below)

These datasets represent simulated collections of specimen collections, as well as information recording the population-level phenological timings and durations of the populations from which they were drawn. Each file records multiple simulated collection dates which reflect collection dates drawn under different forms of collector bias, as specified below.
(Note that, this data structure data structure is identical to all other files within the folder Data/SimHerb_Data/):

Files that are labeled 10sigma represent simulated collections from species that exhibit low intrapopulation variation in individual peak flowering dates, while files labeled 30sigma indicate high intrapopulation variation and files labeled variableSigma exhibit variable intrapopulation variation that increases with mean local temperature.
Files labeled slope_1 represent simulated collections of species that exhibit 1 day/degree Celsius advances with mean annual temperature, while files labeled slope_4 or slope_8 represent simulations of species that exhibit 4 or 8 days/degree celsius advances with mean annual temperature respectively.
Files labeled VariableDuration represent simulations of species that exhibit increasing individual flowering durations with local temperature.
Files appended with noise_10 or noise_30 have small or large magnitudes of random variation (respectively) in collection dates added to the simulated data.)

Variables: 385 (26 used in analysis)
150000 rows/cases (5000 in case of files pertaining to species exhibiting variable duration or variable intrapopulation variation)
Variable List:
Same as listed in description of Data/Full_Simulation_30Taxa_1kSamples, with the following additions:

local_50pct_mean_DOY: simulated median date of peak flowering for a population of the simulated species at the selected location
local_sigma: the magnitude of intrapopulation variation in peak flowering dates within the simulated population
local_duration: the individual flowering duration of individuals at that location
Local_10pct_mean_DOY: date of peak flowering for the 10th percentile of individuals in the local population
Local_10pct_onset_DOY: date of flowering onset for the 10th percentile of individuals in the local population
Local_90pct_mean_DOY: date of peak flowering for the 90th percentile of individuals in the local population
Local_90pct_termination_DOY: date of flowering termination for the 90th percentile of individuals in the local population
slope: the magnitude of phenological response (based on changes in mean date of peak flowering within each population) in days/degree C
param: the climate parameter to which the species responds
sigma: magnitude of intrapopulation variation in dates of peak flowering among individuals
duration: flowering duration of each individual within the simulated population
skew: skew of collection bias (when selecting individuals from the population to sample from). Only present in file RQ_Sim_Distribution_slope4_skewDistribution, which simulates according to multiple skew types.
Local_indiv_mean_DOY: peak (mean) flowering date of simulated individual from which a specimen was collected
Local_indiv_onset_DOY: date of flowering onset for the selected (simulated) individual
Local_indiv_termination_DOY: date of flowering termination for the selected (simulated) individual
Local_sample: DOY of simulated collection assuming a bias towards collection proximate to individual peak flowering date
Local_sample_leftDist: DOY of simulated collection assuming a bias towards collection proximate to individual flowering onset date
Local_sample_rightDist: DOY of simulated collection assuming a bias towards collection proximate to individual flowering termination date
Local_sample_flatDist: DOY of simulated collection assuming no bias towards collection proximate any part of the individual flowering period
noise: magnitude of random noise (in DOYS) added to the sample. noise is only present in files that compare skew or introduce additional noise

`Data/Simulated_Distribution_Models/`

RQ_SimHerb_ZZ_Simulated_Distribution_SlopeX_YY_WW_RQ_Predict_Test_simData_Simulated_Distribution_SlopeComp_Local_sample.csv
These datasets represent predictions of population-level peak flowering onset and termination dates derived from quantile regression models at each location from which specimens were collected for that simulated species, as well as retaining the 'actual' phenological timings of the simulated local populations from which the simulated specimens were drawn.

Replace components of this filename according to the following rubric:

ZZ: represents the number of specimens used to develop these models
X: represents the magnitude of phenological responsiveness of the simulated species under assessment (in days/degree celsius)
YY: either 10sigma (in case of low intrapopulation variation), 30sigma (in the case of high intrapopulation variation), variablesigma (in the case of variable intrapopulation variation (see notes for previous data files), or variableDuration (in cases where species exhibited variable individual durations)
WW: replace with Noise10 if low magnitudes of random noise were added to collections, noise 30 if large quantities of random noise were added, or remove if no noise was added.
(Note that, this data structure data structure is identical to all other files across the folders Data/Simulated_Distribution_Models, Data/Simulated_Distribution_flatDist_Models, Data/Simulated_Distribution_leftDist_Models, and Simulated_Distribution_rightDist_Models)

Variables: 22
15000 rows/cases (5000 in case of files pertaining to species exhibiting variable duration or variable intrapopulation variation)
Variable List:

Unique ID: Unique ID for simulated specimen
genus_species: name of simulated species (contains all information on species attributes)
slope: the magnitude of phenological response (based on changes in mean date of peak flowering within each population) in days/degree C
param: the climate parameter to which the species responds
sigma: magnitude of intrapopulation variation in dates of peak flowering among individuals
duration: flowering duration of each individual within the simulated population
skew: skew of collection bias (when selecting individuals from the population to sample from). Only present in file RQ_SimHerb_1000_Simulated_Distribution_Slope4_30sigma_skew_normalSample_RQ_Predict_Test_simData_Simulated_Distribution_SlopeComp_Local_sample, which simulates according to multiple skew types.
Local_10pct_mean_DOY: date of peak flowering for the 10th percentile of individuals in the local population
Local_10pct_onset_DOY: date of flowering onset for the 10th percentile of individuals in the local population
Local_90pct_mean_DOY: date of peak flowering for the 90th percentile of individuals in the local population
Local_90pct_termination_DOY: date of flowering termination for the 90th percentile of individuals in the local population
Latitude: Lattude of specimen collection
Longitude: Longitude of specimen collection
First_Yes_Year: Year of specimen collection
Local sample: DOY of sample collection
RQ_Pred_0.1_Local_sample: Predicted DOY of population-level peak flowering onset
RQ_Pred_0.5_Local_sample: Predicted DOY of population-level peak flowering
RQ_Pred_0.9_Local_sample: Predicted DOY of population-level peak flowering termination

`Data/Simulated_Distribution_Models/RQ_SimHerb_ZZ_Simulated_Distribution_SlopeX_YY_WW_RQ_SelectedParams_Simulated_Distribution_SlopeComp_Local_sample.csv`

This dataset records the selected cliamte parameters used to produce phenoclimate models for peak flowering onset and termination for each species - note that, for the purposes of these simulations, the selected parameters were always identical and were set manually to mean annual temperature.

Replace components of this filename according to the following rubric:

ZZ: represents the number of specimens used to develop these models
X: represents the magnitude of phenological responsiveness of the simulated species under assessment (in days/degree celsius)
YY: either 10sigma (in case of low intrapopulation variation), 30sigma (in the case of high intrapopulation variation), variablesigma (in the case of variable intrapopulation variation (see notes for previous data files), or variableDuration (in cases where species exhibited variable individual durations)
WW: replace with Noise10 if low magnitudes of random noise were added to collections, noise 30 if large quantities of random noise were added, or remove if no noise was added.
(Note that, this data structure data structure is identical to all other files across the folders Data/Simulated_Distribution_Models, Data/Simulated_Distribution__flatDist_Models, Data/Simulated_Distribution_leftDist_Models, and Simulated_Distribution_rightDist_Models.)

Variables: 7
500
Variable List:

genus_species: name of simulated species (contains all information on species attributes)
parameter: selected climate parameter (set to Ann_tmean for simulations)
month: month (or time period) of selected parameter (set to 13, or annual, for all simulations)
quantiles: the quantile of the regression to which the selected parameter pertains
AIC: the AIC of the selected model
Slope: the predicted response of the speciespopulation level flowering onset/peak/termination to that climate parameter.

`Data/Simulated_Distribution_Models/Summaries/FlatDist_Overall_mae_multiSampleCounts.csv`

These datasets represents mean absolute error (relative to 'actual' population-level phenological timings) of predictions made by phenoclimate models developed using data from each simulated species while also recording the attributes of each species. Analogous files were also produced examining R2, rmse, and raw error, which follow identical data structure as described below. This file represents assessments of model accuracy among collections that exhibited no bias towards collection during the early, mid, or late portion of the flowering period of the individuals being collected.

Similarly, analogoous files were generated to evaluate models conducted on samples collected with different biases towards collection at the early or late portions of the population-levle bloom display (Data/Simulated_Distribution_Models/Summaries/normDist_Skew_test_Overall_mae_multiSampleCounts.csv), among models conducted with skews towards collection proximate to individual flowering onset dates (Data/Simulated_Distribution_Models/Summaries/LeftDist_Overall_mae_multiSampleCounts.csv), among models conducted with skews towards collection proximate to individual peak floweringdates (Data/Simulated_Distribution_Models/Summaries/Overall_mae_multiSampleCounts.csv), among models conducted with skews towards collection proximate to individual flowering termination dates (Data/Simulated_Distribution_Models/Summaries/RightDist_Overall_mae_multiSampleCounts.csv). Additionally, similar datasets were generated for species exhibiting low (Noise10_Overall_mae_multiSampleCounts.csv) and high (Noise30_Overall_mae_\multiSampleCounts.csv) magnitudes of random error/noise in sampling. In all cases, these datasets exhibit similar structures and variable names to those described below.

Variables: 17
350
Variable List:

genus_species: name of simulated species (contains all information on species attributes)
sample_count: actual number of samples at which predictions were made
sample_count_enforced: maximum possible number of samples used to develop models
slope: the magnitude of phenological response (based on changes in mean date of peak flowering within each population) in days/degree C
param: the climate parameter to which the species responds
sigma: magnitude of intrapopulation variation in dates of peak flowering among individuals
duration: flowering duration of each individual within the simulated population
skew: skew of collection bias
MAE_indiv_timing_Onset_inferred: mean absolute error between predicted DOY and simulated date of flowering onset by sampled individual
MAE_indiv_timing_Median_inferred: mean absolute error between predicted DOY and simulated median date of flowering by sampled individual
MAE_indiv_timing_Termination_inferred: mean absolute error between predicted DOY and simulated date of flowering termination by sampled individual
MAE_Population_mean_timing_Onset_inferred: mean absolute error between predicted DOY and simulated date by which the earliest 10% of individuals reached their peak flowering dates
MAE_Population_mean_timing_Median_inferred: mean absolute error between predicted DOY and simulated date by which the earliest 50% of individuals reached their peak flowering dates
MAE_Population_mean_timing_Termination_inferred: mean absolute error between predicted DOY and simulated date by which the earliest 90% of individuals reached their peak flowering dates
MAE_Population_timing_Onset_inferred: mean absolute error between predicted DOY and simulated date by which the earliest 10% of individuals reached their flowering onset dates
MAE_Population_timing_Median_inferred: mean absolute error between predicted DOY and simulated date by which the earliest 90% of individuals reached their peak flowering dates
MAE_Population_timing_Termination_inferred: mean absolute error between predicted DOY and simulated date by which the earlest 90% of individuals reached their flowering termination dates
Sample_Type: type of bias in sampling within individual flowering period

Creating a reference dataset: generating sample locations representing known population-level phenological distributions and individual phenological parameters

We simulated phenological data for 1200 hypothetical “species” in the coterminous USA that varied in the attributes of their individual- and population-level flowering phenology. For each of these simulated species, we selected 1000 locations within the continental United States, each representing a local population observed during a single year from which a simulated specimen was later obtained. The coordinates for each location, year, and associated mean annual temperature in the year of collection were randomly selected without replacement from 4-km² PRISM pixels (PRISM Climate Group 2011) between the years 1901 to 2020, and were restricted to locations with 1991–2020 temperature normals of 1–20 °C and mean annual precipitation normals for the same period of 60–3800 mm.

Each species generated this way was assigned a series of attributes defining its individual- and population-level flowering phenology. The peak flowering date of an individual was assumed to coincide with its mean flowering date. We then defined a linear equation describing the relationship between the mean date of peak flowering among individuals within a population and local temperature conditions. Each species was assigned a median population flowering DOY of 50 at 0˚C (i.e., the intercept) as well as a phenological responsiveness (i.e., slope) of median flowering DOY to mean annual temperature: advancing by 1, 4, or 8 days per increase in °C. Next, we assigned each species a low or high magnitude of intrapopulation variation in phenological timing (i.e., in peak flowering DOYs) among individuals (based on normal distributions with standard deviations (σ) of either 10 or 30 days), representing the magnitude of variation in the flowering times of early- to late-flowering individuals within each local population. Then, each species was assigned a short, moderate, or long duration of the flowering period by each individual within each population (15, 30, or 60 days, representing the duration of time each individual plant was in flower. Fifty species were simulated for each of these 18 combinations of phenological responsiveness, flowering duration, and intrapopulation variation in phenological timing.

To accommodate the possibility that the magnitude of variation in phenological timing within a population could depend on local climate conditions, we also simulated 50 species with temperature-sensitive intrapopulation phenological variation (σ) ranging from 10 to 30 days. For these species, σ of the DOY among individuals in a given population increased by 1 day for every 1 °C increase in the mean annual temperature of its location. For these simulated species, individual flowering duration was fixed at 30 days. Additionally, to accommodate the possibility that individual flowering durations could exhibit linear relationships with local climate conditions, we also simulated 50 species that exhibited individual-level variation in flowering duration resulting from changes in temperature (increasing by 1 day per °C increase in mean annual temperature, and ranging from 10 days to 30 days). For these species, the degree of intrapopulation variation in peak flowering dates was held constant at σ = 30 days (i.e., high intrapopulation variation).

Calculation of population-level onset, median, and termination dates of flowering

For each population of each species described above, we calculated a distribution of individual-level peak flowering dates—assumed to be normally distributed (Clark and Thompson 2011)—based on the flowering attributes of the species and the temperature conditions corresponding to its site and year of observation. First, we calculated the median flowering DOY at the location and year from which each specimen was collected based on its pre-defined intercept and phenological responsiveness to mean annual temperature (i.e., 1, 4, and 8 days per °C). Then, we obtained the standard deviation of each local population (i.e., its degree of intrapopulation variation in flowering dates) based on the flowering attributes of the simulated species as outlined above. Next, we arbitrarily defined population-level flowering onset DOYs for each population and year as the 10^th percentile of a normally distributed population whose mean and standard deviation we obtained in the previous steps (i.e., the DOYs by which the first 10% of individuals in a local population at a givenlocation and year would have reached their median flowering dates). Similarly, the population-level flowering termination dates were calculated as the 90^th percentile of a normally distributed population with the same characteristics as described above (i.e., the DOYs by which all but 10% of individuals in a local population at a given location and year would have reached their peak (or mean) flowering dates).

Through this process, we obtained a sample of 1000 annual population-level distributions of flowering dates for each of 1200 hypothetical species. For each of these populations, the quantiles of their flowering distribution—representing the n_thindividual reaching peak flowering within a population—were known a priori, representing a benchmark against which to compare estimates derived from simulated specimen data.

Simulating randomly selected (unbiased) phenological snapshots from pre-defined populations

For each species, we then generated simulated specimens by: (1) randomly selecting an individual within each population and (2) selecting a random DOY within its individual-level flowering period that emulated the phenological snapshot provided by real herbarium specimens. Specifically, using the distribution of peak flowering dates of each population, we selected an individual at random. From its peak flowering date, we then obtained onset and termination dates by subtracting (for flowering onset) or adding (for flowering termination) half the individual’s flowering duration for that species to the sampled date of peak flowering. To simulate a phenological snapshot for that individual, we then randomly selected a DOY between the onset and termination of that individual’s flowering period. As a result, the simulated datum represented a simulated herbarium specimen generated accounting for uncertainty in both the timing of the individual relative to its source population, and in the timing of the collection relative to the onset and termination of that individual’s flowering period. This procedure was repeated across all locations for each simulated species, generating 1000 data points (i.e., simulated specimens or phenological snapshots) per species.

Simulating biases in collection effort across population-level flowering periods

To simulate biases towards collection of specimens during the early or late portion of their local population-level flowering displays, we selected an individual at random within each population and year using both left- and right-skewed normal probability distributions. These distributions were constructed by modulating the parameter α in the python package scipy.stats.skewnorm v1.10.1 (Azzalini and Capitanio 1998), such that if the underlying plant population was treated as exhibiting a normal distribution (α = 0), samples were collected from that population with a left-skewed (α = -1.0) or right-skewed (α = -1.0) probability distribution. Once an individual was selected from these skewed distributions, the timing of sample collection from within the individual flowering durations of these ‘specimens’ were generated using similar methods as unbiased specimens. We then determined the accuracy of the model predictions generated from datasets exhibiting biased and unbiased sampling of local populations by comparing predicted population-level flowering onset and termination dates with the actual (i.e., known, simulated) flowering dates that were produced using a normal distribution. To minimize computation time, population-level biases were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/˚C (representing moderate responsiveness to climate stimuli), intrapopulation variation was high (σ = 30), and individual flowering duration was moderate (30 days).

Simulating biases in the timing of collection within flowering periods of individuals

In addition to biases towards collection of early or late individuals within a local population, botanists may also preferentially collect individuals from the early or late portion of their individual flowering period (i.e., individual collection bias). In some cases, collectors may preferentially collect individuals that are proximate to their peak flowering date because this is when the most flowers are displayed. In other cases, collectors may preferentially collect specimens that have only recently begun to flower, when floral structures may exhibit less damage from inclement weather or herbivores, or proximate to flowering termination in cases where the collector prefers specimens that include both flowers and fruits. Accordingly, for each population of each species, we simulated DOYs within each individual’s flowering period both at random (i.e., without bias) and with three different types of bias. Unbiased collections were simulated by selecting a random date chosen uniformly within the flowering period of each sampled individual. To represent a bias toward collection of individuals close to their peak (median) flowering DOY, we sampled collection dates from a truncated normal distribution centered on an individual’s mean flowering date and with σ = 25% of the flowering duration for that species and location (henceforth referred to as mean-biased collection data). To represent a bias toward collection dates shortly after flowering onset (henceforth, onset-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% earlier than the mean flowering onset date of that individual (σ = 25%). Finally, to represent a bias toward collection on dates shortly before flowering termination (henceforth termination-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% later than the mean flowering onset date of that individual (σ = 25%). As with examinations of population-level bias, collection biases within the flowering periods of individuals were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/˚C, intrapopulation variation was high (σ = 30), individual flowering duration was moderate (30 days), and no population-level bias was present.

Estimating population-level flowering onsets and terminations from simulated herbarium data

We generated phenoclimate models for each species from each set of simulated specimen collection dates using quantile regression (Koenker et al. 2018) in RStudio (R Team 2020). In all cases, each model regressed observed DOYs of the phenological snapshots of all sampled individuals of a given species against mean annual temperature. From these 1450 models (representing each of the species-specific models for all 1200 species plus the additional 150 models exhibiting population-level collection biases and the 100 models exhibiting individual-level collection biases), we predicted the 10^th, 50^th, and 90^th percentiles of flowering DOYs for each species from mean annual temperatures corresponding to the years and locations of their source populations. We then calculated the mean absolute error (MAE) of the linear regression of the known timing of the onset (or termination) of the peak flowering period for each reference population on the predicted DOYs produced by each phenoclimate model based on the simulated herbarium data. For each metric of population-level phenology (i.e., flowering onset, peak (i.e., median DOY), and termination), we then used Tukey HSD tests to compare the mean accuracies (estimated as MAE) of these predicted DOYs versus the actual population-level metrics among models constructed from species that differed in their phenological sensitivities to climate, flowering durations, degrees of intrapopulation variation in phenological timing, and collection biases.

Similarly, we tested whether the mean MAE of estimated peak flowering onset and termination dates among groups of species that exhibited the same flowering duration, phenological responsiveness, and intrapopulation phenological variation differed significantly from the mean MAE of estimated median flowering dates for each group of simulated species that exhibited the same flowering duration, phenological responsiveness, and intrapopulation phenological variation. We used Tukey HSD tests to compare the accuracy of estimated onset, median, and termination dates of the peak flowering period among all species produced from each of the simulated datasets.

Finally, we re-fit all 1200 models (including all 24 combinations of species parameters but excluding models constructed to test the effects of collection biases) with randomly selected subsets of data (100–1000 specimens per species) to determine how sample size affected model performance and predictive accuracy. To evaluate whether more data would be needed when variation in phenology among populations is not perfectly explained by the climate variables included in the model, we ran additional simulations in which population-level mean DOYs (and associated onset and termination DOYs of the flowering period) of each species at each sampled location and year included random variation not associated with local climate: adding either ±5 days (i.e., a low-noise scenario) or ±15 days (i.e., a high-noise scenario) to the DOYs of the onset, median, and termination of flowering DOYs. For each location and year, the random offsets of the DOYs of flowering onset, median flowering DOY, and flowering termination were identical, such that random variation was incorporated only into the timing of flowering, and not its duration.

Simulated Herbarium data for testing the accuracy with which specimen data can predict the timing and duration of population-level flowering displays

Data files

Abstract

README: Simulated Herbarium data for testing the accuracy with which specimen data

GENERAL INFORMATION

SHARING/ACCESS INFORMATION

DATA & FILE OVERVIEW

METHODOLOGICAL INFORMATION

DATA-SPECIFIC INFORMATION

Data/Full_Simulation_30Taxa_1kSamples.csv

Data/SimHerb_Data/RQ_Sim_Distribution_10sigma_slope_4.csv

Data/Simulated_Distribution_Models/

Data/Simulated_Distribution_Models/RQ_SimHerb_ZZ_Simulated_Distribution_SlopeX_YY_WW_RQ_SelectedParams_Simulated_Distribution_SlopeComp_Local_sample.csv

Data/Simulated_Distribution_Models/Summaries/FlatDist_Overall_mae_multiSampleCounts.csv

Methods

Works referencing this dataset

`Data/Full_Simulation_30Taxa_1kSamples.csv`

`Data/SimHerb_Data/RQ_Sim_Distribution_10sigma_slope_4.csv`

`Data/Simulated_Distribution_Models/`

`Data/Simulated_Distribution_Models/RQ_SimHerb_ZZ_Simulated_Distribution_SlopeX_YY_WW_RQ_SelectedParams_Simulated_Distribution_SlopeComp_Local_sample.csv`

`Data/Simulated_Distribution_Models/Summaries/FlatDist_Overall_mae_multiSampleCounts.csv`