Herbarium-Derived Phenological Data in North America
Data files
Sep 20, 2023 version files 22.41 MB
-
README.md
20.03 KB
-
reproductive_for_Assessment_modified.csv
201.47 KB
-
TNRS_Concat.csv
22.18 MB
Abstract
We present infrastructure for developing large-scale and long-term phenological datasets across multiple herbaria, as well as a sample dataset that has been acquired from the digital archives of 440 distinct herbaria across North America and further processed to evaluate phenological status. This dataset contains 2,319,672 specimen records of plants collected while reproductively active. These data have been modified to explicitly codify the observed phenological status of each specimen at the time of collection, and to remove specimens for which information essential to assessing their phenology or the corresponding climate conditions in the year and location of collection was missing. As different collectors have used distinct taxonomic schemas over space and time in documenting the specimens being collected, these data were also rectified into a single unified taxonomic schema to ensure that consistent taxon names were used throughout the dataset. Further, these data have been united with long-term and annual climate conditions in the year and location of collection, as derived from PRISM climate data (https://www.prism.oregonstate.edu/). To date, the dataset includes 2,319,672 specimens across 25,429 plant taxa. However, this represents a living dataset that will continue to be updated as digitization efforts proceed and additional digital specimen records become available.
Author Information
Principal Investigator Contact Information
- Name: Isaac W. Park
- Institution: University of California - Santa Barbara
- Address: 4117 Life Sciences Building, University of California Santa Barbara 93116
- Email: isaac_park@ucsb.edu
Date of data collection
07-01-2021 through 08-30-2021
Geographic location of data collection:
North America
Information about funding sources that supported the collection of the data
Munging of this data from raw herbarium data was supported by the National Science Foundation (NSF) through:
- NSF DEB-1556768 (to S.J.M., I.W.P.)
- NSF DEB-2105932 (to S.J.M., I.W.P.)
- NSF DEB-2105907 (to S.R.)
- NSF DEB-2105903 (to C.C.D.)
SHARING/ACCESS INFORMATION
- Licenses/restrictions placed on the data: Creative Commons Zero (CC0 1.0). Note that supplementary data deposited in Zenodo, which include the rectified phenological data developed using these data and the associated Python code, are shared under the CC BY-NC-SA 4.0 license due to licensing restrictions on the herbarium data from which they are derived.
- Links to other publicly accessible locations of the data: Raw specimen records used in this data can be acquired from https://swbiodiversity.org/seinet/, https://www.cch2.org, and https://pnwherbaria.org. Climate data is available through https://www.prism.oregonstate.edu/
- Links/relationships to ancillary data sets: Raw specimen records used in this data can be acquired from SEINET.org, https://www.cch2.org, and https://pnwherbaria.org. Climate data is available through https://www.prism.oregonstate.edu/
- Herbarium specimen data were accessed and downloaded from the following platforms:
SEINET, the Consortium of Pacific Herbaria, the Consortium of California Herbaria, the
Consortium of Northeastern Herbaria, the Consortium of Pacific Northwest Herbaria, the
Consortium of MidAtlantic Herbaria, the Consortium of Canadian Herbaria, the Consortium
of Midwest Herbaria, the North American Network of Small Herbaria, the Red de Herbarios
de Noroeste de México, the SouthEast Regional Network of Expertise and Collections
(SERNEC), and the Texas Oklahoma Regional Consortium of Herbaria (TORCH).
Herbaria hosting the specimen data used in this study (Darwin Core Institution Codes):
ACAD, AMES, BALT, BLMRD, BSCA, CalBG, SFV, UCR, SD, SDSU, SBBG,
UCJEPS, IRVC, LOB, CDA, OBI, RSA, CSUSB, CSLA, GMDRC, FSC, LA, JOTR,
POM, DAV, OBS, HSC, UCSC, UCSB, MACF, PUA, JROH, SCFS, SOC, UNLV,
OSC, CHRB, CM, NY, University of Connecticut, Université de Montréal Biodiversity
Centre, ANSP, Rutgers University, UConn, University of New Hampshire, NEBC, GH,
A, YPM, Yale Peabody Museum of Natural History, Woods Hole Oceanographic
Institution, ECON, Harvard University, University of Maine, Memorial University of
Newfoundland, VT, ELH, Harvard, HUDC, Utah State University, ENLC, RENO, USU,
BRY, SUU, RM, UT, NTS, EDOBLM, NDOA, BLM, SEINet, WSCO, NPS, SLCTNL,
MARY, UNISON, HCIB, CIAD, BCMEX, IBUG, UJED, URUZA, UAEH, UADY,
DEK, EIU, IND, MOR, MSC, WIS, MU, BUT, CINC, MICH, MWI, NC, F, LUC,
CHIC, KE, MIN, ILL, PH, TAWES, PAC, APSC, USFWS, BHSC, USFS/BHSC, LCDI,
MISU, CSCN, KSP, MSUB, FHKSC, MSUNH, PPWD, ASU, ARIZ, ASC, NAVA,
DES, RHNM, USFS, UNM, SNM, MNA, SJNM, JEMEZ, KAIB, AWC, NHI, BTA,
TAF, MUR, USCH, TROY, NCU, USF, MISSA, FLAS, DUKE, SWSL, SEL, WILLI,
NCSM, BOON, MMNS, ANHC, AUA, LSU, CLEMS, NCSC, VDB, MISS, UNCC,
USMS, TENN, PEMB, MUHW, GA, UARK, NLU, KNK, WVA, STAR, ODU, WWC,
HBSH, NO, ENO, SC, ECUH, UAM, NCZP, CATU, HTTU, CAU, WEWO, WCUH,
GMUF, URV, UOS, APCR, EKY, UCHT, MEM, SBAC, VPI, FARM, SFSU, dtnm,
gtnp, WYAC, FBNM, USNH, YELLO, GRTE, dtnp, DTNM, RMBL, DBG, FLD, WSC,
MESA, PUSC, COLO, ALAM, CIBO, SAT, SRSC, BRIT, TAC, PAUH, TTC, TLU,
JWC, OKL.
Climate data
Climate data was produced by the PRISM climate group, https://www.prism.oregonstate.edu/
Recommended citation for this dataset:
Code and python packages presented here should be cited as:
Park (2023), North American herbarium data for Phenological Assessment, Dryad, Dataset
Please note that any use of this data should also acknowledge the constituent herbaria from which this data was drawn. Usage of associated climate data should also cite the PRISM climate group: PRISM Climate Group, Oregon State University, https://prism.oregonstate.edu, data created 4 Feb 2014, accessed 30 Sep 2021.
DATA and FILE OVERVIEW
1. File/Folder List:
Data shared via DRYAD:
(These data represent files created to assist in developing phenologically assessed data from herbarium records)
TNRS_Concat.csv (File): Table of matched taxonomic nomenclature for standardizing taxon names
reproductive_for_Assessment_modified.csv (File): Manually constructed table of phenological status derived from unique text strings in Darwin Core data
Data shared via Zenodo:
These data represent data files drawn from other sources (notably the PRISM climate group and the contributing herbaria), which represent an example phenologically assessed dataset that can be created from the files and data associated with this repository.
Example_Data (Folder):
AllData_Core.csv (File): Processed herbarium data with rectified DOY (day of year), Year, additional phenological status data columns, and standardized taxonomic nomenclature (columns described below). Note that this data retains all information from original specimen records, including uncorrected DOY and year fields, original fields used to identify phenological status, and original taxonomic nomenclature, as well as all data columns required to re-acquire these specimen records from source herbaria.
Point_Years_Core.csv (File): Processed data extracted from AllData_Core.csv, consisting of all unique combinations of collection year and collection location (as defined by decimalLatitude and decimalLongitude). All other data fields have been removed to conserve active memory.
ClimateData (Folder):
PRISM_PointYears.csv (File): Point_Years_Core.csv data with associated annual and long-term normal climate data estimated from PRISM climate data.
These data sets can be accessed on Zenodo: https://doi.org/10.5281/zenodo.8323155
METHODOLOGICAL INFORMATION
Methods for processing the data:
Raw herbarium data was processed in the following ways:
First, the DOY (day of year) corresponding to each specimen record in the raw specimen data (within the folder Data/Raw) was calculated, as DOY had been calculated only by some herbaria or for some records during the original digitization process.
Records for which specific DOYs could not be calculated, or for which decimal latitudes and longitudes of collection were missing, were then eliminated. The resulting date-corrected data were placed in the folder Data/DoyCorr_Dev/.
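The date-correction step can be illustrated with a minimal, dependency-free sketch. It assumes that the raw records carry the Darwin Core `year`, `month`, and `day` fields as text; the source does not state which raw fields the project scripts actually use, so the field choice, the helper names `day_of_year` and `has_coordinates`, and the sample records are all hypothetical.

```python
from datetime import date

def day_of_year(year, month, day):
    """Return the day of year (1-366), or None if the date is incomplete or invalid."""
    try:
        return date(int(year), int(month), int(day)).timetuple().tm_yday
    except (TypeError, ValueError):
        return None

def has_coordinates(record):
    """True only if both coordinate fields parse as numbers."""
    try:
        float(record.get("decimalLatitude"))
        float(record.get("decimalLongitude"))
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical raw records (illustration only, not taken from the dataset).
records = [
    {"year": "1998", "month": "3", "day": "1", "decimalLatitude": "34.41", "decimalLongitude": "-119.85"},
    {"year": "2000", "month": "", "day": "15", "decimalLatitude": "40.0", "decimalLongitude": "-105.3"},
    {"year": "2005", "month": "6", "day": "10", "decimalLatitude": "", "decimalLongitude": "-110.9"},
]

kept = []
for rec in records:
    doy = day_of_year(rec["year"], rec["month"], rec["day"])
    # Drop records with no computable DOY or with missing coordinates.
    if doy is None or not has_coordinates(rec):
        continue
    kept.append({**rec, "DOY_Rect": doy, "Year_Rect": int(rec["year"])})
```

Only the first sample record survives: the second has no month and the third has no latitude.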
Specific phenological statuses were then extracted from the date-corrected data in the folder Data/DoyCorr_Dev/, based on the data fields ‘reproductiveCondition’ and ‘lifeStage’, using the corresponding reproductive assignment data: Data/Pheno_Categories/reproductive_for_Assessment_modified.csv. The resulting data files were then exported to the folder Data/Pheno_Munged/.
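As a rough illustration of this lookup step, the sketch below codes a specimen by matching its ‘reproductiveCondition’ and ‘lifeStage’ text against rows shaped like reproductive_for_Assessment_modified.csv. Exact string matching is an assumption here (the project code may normalize or match text differently), and the lookup rows, `make_row`, and `code_phenology` are hypothetical.

```python
PHENO_COLUMNS = ["bud", "flower", "fruit", "strobilus", "fertile", "inflorescence", "cone"]

def make_row(field, *positive):
    """Build a lookup row: '1' for positive statuses, '' otherwise (the table's missing-data code)."""
    return {"Field": field, **{col: ("1" if col in positive else "") for col in PHENO_COLUMNS}}

# Hypothetical lookup rows in the shape of the assignment table.
lookup_rows = [
    make_row("flowering", "flower"),
    make_row("flowers and fruit", "flower", "fruit"),
    make_row("fertile", "fertile"),
]
lookup = {row["Field"]: row for row in lookup_rows}

def code_phenology(specimen):
    """Return 0/1 flags for each phenological status of one specimen record."""
    coded = {col: 0 for col in PHENO_COLUMNS}
    for source_field in ("reproductiveCondition", "lifeStage"):
        match = lookup.get(specimen.get(source_field, ""))
        if match is None:
            continue  # text string not present in the lookup table
        for col in PHENO_COLUMNS:
            if match[col] == "1":
                coded[col] = 1
    return coded

example = code_phenology({"reproductiveCondition": "flowers and fruit", "lifeStage": ""})
```

A status found in either source field sets the flag, so a record with "flowering" in one field and "fertile" in the other would be coded positive for both.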
The resulting data (in Data/Pheno_Munged/) were then aggregated into a single file and exported to the location Data/Pheno_Categories/reproductive_for_Assessment_modified.csv.
As part of this aggregation, specimen records not coded as in flower, strobilating, or fertile were eliminated, as were records with erroneous DOYs (DOY < 1 or > 365).
Duplicate records (defined as specimen records with identical values in the “scientificName”, DOY, and year of collection fields, as well as identical decimalLatitude and decimalLongitude fields) were also eliminated. The resulting data were exported to the location Data/Filtered/Concat_Data.csv.
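The filtering and de-duplication rules above can be sketched as follows. This is a simplified stand-in rather than the project's own code: the function name `filter_records` is hypothetical, and the use of `DOY_Rect` and `Year_Rect` as column names at this stage of processing is assumed.

```python
def filter_records(records):
    """Keep reproductively active records with a valid DOY, dropping duplicates."""
    seen = set()
    kept = []
    for rec in records:
        # Keep only records coded as in flower, strobilating, or fertile.
        if not (rec["flower"] or rec["strobilus"] or rec["fertile"]):
            continue
        # Treat DOY < 1 or > 365 as erroneous, following the rule stated in the text.
        if not 1 <= rec["DOY_Rect"] <= 365:
            continue
        # A duplicate shares taxon name, DOY, year, latitude, and longitude.
        key = (rec["scientificName"], rec["DOY_Rect"], rec["Year_Rect"],
               rec["decimalLatitude"], rec["decimalLongitude"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

# Hypothetical records for illustration.
base = {"scientificName": "Eschscholzia californica", "DOY_Rect": 95, "Year_Rect": 1998,
        "decimalLatitude": 34.41, "decimalLongitude": -119.85,
        "flower": 1, "strobilus": 0, "fertile": 0}
sample = [
    base,
    dict(base),                      # exact duplicate
    {**base, "DOY_Rect": 400},       # erroneous DOY
    {**base, "flower": 0, "Year_Rect": 2001},  # not reproductively active
]
filtered = filter_records(sample)
```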
Using species lists provided by the Taxonomic Name Resolution Service (tnrs.biendata.org), located at Data/TNRS/Names_Rectified_test.csv, we then matched these ‘rectified’ taxon names to the ‘raw’ taxon names in the field “scientificName” within the dataset Data/Filtered/Concat_Data.csv to enforce standardized nomenclature throughout the dataset. Specimen records with no matched taxon nomenclature were eliminated.
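A minimal sketch of this name-matching step is shown below. It assumes the TNRS table can be keyed on the `Name_Submitted` column documented for TNRS_Concat.csv and joined to `scientificName` by exact string equality; the example rows and the function `rectify_names` are hypothetical, and records with an empty accepted name are treated here as unmatched.

```python
# Hypothetical TNRS rows using the key columns documented for TNRS_Concat.csv.
tnrs_rows = [
    {"Name_Submitted": "Eschscholzia californica Cham.",
     "Accepted_name": "Eschscholzia californica",
     "Accepted_species": "Eschscholzia californica"},
    {"Name_Submitted": "Quercus agrifolia var. agrifolia",
     "Accepted_name": "Quercus agrifolia var. agrifolia",
     "Accepted_species": "Quercus agrifolia"},
    {"Name_Submitted": "Unresolved name", "Accepted_name": "", "Accepted_species": ""},
]
tnrs = {row["Name_Submitted"]: row for row in tnrs_rows}

def rectify_names(specimens):
    """Attach rectified names to each specimen; drop specimens with no match."""
    rectified = []
    for spec in specimens:
        match = tnrs.get(spec["scientificName"])
        if match is None or not match["Accepted_name"]:
            continue  # no matched taxon nomenclature
        rectified.append({**spec,
                          "Accepted_name": match["Accepted_name"],
                          "Accepted_species": match["Accepted_species"]})
    return rectified

specimens = [
    {"scientificName": "Eschscholzia californica Cham."},
    {"scientificName": "Quercus agrifolia var. agrifolia"},
    {"scientificName": "Unresolved name"},
    {"scientificName": "Name absent from table"},
]
rectified = rectify_names(specimens)
```

The original `scientificName` value is left untouched, consistent with the statement that raw columns are preserved alongside the rectified ones.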
The PRISM data were then automatically downloaded to the folder “Data/PRISM_Data/” by the Python script, climate parameters were extracted at all point locations, and the resulting data were merged back into the specimen data, creating the file Data/ClimateData/PRISM_Specimen_Data.csv.
Software
The Python scripts contained here were run using Python version 3.9.7 and require the installation of the Python packages listed in Pheno.yml, in addition to the included package Phenocoll, which can be installed by entering the Anaconda environment, navigating to the folder Phenocoll, and typing ‘pip install -e .’. The script and Python package developed for this project have been archived on Zenodo: https://doi.org/10.5281/zenodo.8323153
QA/QC procedures performed on the data
DOY of collection was corrected using raw fields. Specimens with missing (or unparseable) year, DOY, decimalLatitude, decimalLongitude, or taxonomic ID were eliminated. All specimen records were rectified to a single taxonomic schema using the Taxonomic Name Resolution Service.
DATA-SPECIFIC INFORMATION FOR:
Data Shared Via Dryad:
TNRS_Concat.csv:
- Number of variables: 25
- Number of cases/rows: 64257
- Variable List: The majority of column names are QA columns referencing the match quality of names rectified through TNRS, and are not relevant to further data processing. Key fields are as follows:
- Name_Submitted: unique taxonomic identifier for specimen(s) derived from herbarium data
- Accepted_name: Rectified taxon name produced by TNRS, to finest ID possible
- Accepted_species: Rectified taxon name produced by TNRS, genus and species only
- Name_matched_rank: rank of rectified taxon ID
- Name_matched_accepted_family: family of rectified taxon
- Missing data codes:’’
- Specialized formats or other abbreviations used: NA
reproductive_for_Assessment_modified.csv
- Number of variables: 10
- Number of cases/rows: 4671
- Variable List:
- Field: text string to be matched with reproductiveCondition or lifeStage fields in darwinCore herbarium data, recording phenological status
- Count: number of times this unique text string was observed in example dataset
- bud: ‘’ if not in bud, 1 if in bud
- flower: ‘’ if not in flower, 1 if in flower
- fruit: ‘’ if not in fruit, 1 if in fruit
- strobilus: ‘’ if not strobilating, 1 if strobilating
- fertile: ‘’ if not fertile, 1 if fertile
- inflorescence: ‘’ if no inflorescence present (or not documented), 1 if inflorescence documented to be present
- cone: ‘’ if no cones documented, 1 if cones present
- odd: 1 if a notable but unusual/unclear phenological status was observed
- Missing data codes:’’
- Specialized formats or other abbreviations used: NA
Data Shared Via Zenodo:
Data/Raw/occurrences.csv:
- Number of variables: 28
- Number of cases/rows: Variable
- Variable List: All column names are Darwin Core terms. Definitions for each variable are taken from the Darwin Core website (https://dwc.tdwg.org/list/#2-use-of-terms).
- Missing data codes:’’
- Specialized formats or other abbreviations used: NA
Data/Pheno_Categories/reproductive_for_Assessment_modified.csv
- Number of variables: 10
- Number of cases/rows: 4668
- Variable List:
- Field: Unique string matching record in specimen data from
- Count: Number of unique instances of that Field value observed in raw specimen data
- bud: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as in bud
- flower: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as in flower
- fruit: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as in fruit
- strobilus: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as in strobilus (includes both spore-bearing plants and some gymnosperms displaying pollen cones)
- fertile: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as fertile (without more specific reproductive phenological information)
- inflorescence: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as having an active inflorescence
- cone: Column indicating that specimens containing this unique Field value in the darwincore columns ‘reproductiveCondition’ or ‘lifeStage’ should be coded as exhibiting cones (may be pollen or seed cones)
- Missing data codes:’’
- Specialized formats or other abbreviations used: NA
Data/AllData_Core.csv
This dataset represents raw herbarium specimens collected from all contributing herbaria, with the additional data added through the processing described in the attached software; notably, the corrected year and DOY of each collection as well as binary presence/absence for all phenological phases of interest. These data include all identification fields for re-acquiring original specimen records from contributing herbaria. While additional data fields were added (as listed above), no data columns present in the raw data (as downloaded from contributing herbaria) were edited in any way.
- Number of variables: 365
- Number of cases/rows: 2319672
- Variable List: Variable names for columns 7:39 are Darwin Core terms. We provide the definition for each term in the README document. These definitions come directly from the Darwin Core website (https://dwc.tdwg.org/list/#2-use-of-terms).
- Note: All columns of the original Darwin Core data were preserved, even if supplanted by modified data columns (most notably those fields associated with the taxonomic identification of each specimen or the DOY and year of its collection). In such cases, we recommend ignoring the original Darwin Core data fields in favor of the modified data fields, listed below.
- DOY_Rect: Rectified day of year (DOY) on which the specimen was recorded to have been collected
- Year_Rect: Rectified year on which the specimen was recorded to have been collected
- bud: determination of whether the specimen was recorded as being in bud. Values of 1 indicate presence of buds, values of 0 indicate no documentation of buds.
- flower: determination of whether the specimen was recorded as being in flower. Values of 1 indicate presence of flowers, values of 0 indicate no documentation of flowers.
- fruit: determination of whether the specimen was recorded as being in fruit. Values of 1 indicate presence of fruits, values of 0 indicate no documentation of fruits.
- strobilus: determination of whether the specimen was recorded as being in strobilus. Values of 1 indicate presence of strobili, values of 0 indicate no documentation of strobili.
- cone: determination of whether the specimen was recorded as bearing cones. Values of 1 indicate presence of seed or pollen cones, values of 0 indicate no documentation of cones.
- fertile: determination of whether the specimen was recorded as being fertile. Values of 1 indicate specimen was collected when fertile (most commonly this occurs in graminoids, as well as some spore-bearing plants), values of 0 indicate no record of the specimen being fertile when collected.
- phenocolumn_sum: marker field indicating that one or more of the aforementioned phenological statuses was positive.
- Accepted_name: Genus, species, and subspecies/variety (if applicable) of specimen according to a standardized taxonomic schema using the Taxonomic Name Resolution Service (tnrs.biendata.org).
- Accepted_species: Genus and species of specimen according to a standardized taxonomic schema using the Taxonomic Name Resolution Service (tnrs.biendata.org).
- Accepted_name_author: Author of the specimen’s species identification according to a standardized taxonomic schema using the Taxonomic Name Resolution Service (tnrs.biendata.org).
- Accepted_name_rank: taxonomic rank to which specimen was identified according to a standardized taxonomic schema using the Taxonomic Name Resolution Service (tnrs.biendata.org).
- Name_matched_accepted_family: family to which specimen was identified according to a standardized taxonomic schema using the Taxonomic Name Resolution Service (tnrs.biendata.org).
Data/Point_Years_Core.csv
This dataset represents a streamlined subset of the data included in AllData_Core.csv, which includes only unique combinations of collection location (based on the decimalLatitude and decimalLongitude fields) and year of collection (based on the DOY_Rect and Year_Rect fields).
- Number of variables: 3
- Number of cases/rows: 778012
- Variable List:
- decimalLatitude: decimal latitude of collection(s)
- decimalLongitude: decimal longitude of collection(s)
- Year_Rect: Year of collection(s) as described above.
These data fields are identical to the decimalLatitude, decimalLongitude, and Year_Rect fields as described in the dataset AllData_Core.csv, but include only unique latitude/longitude/year combinations in order to minimize processor time and memory usage during acquisition of PRISM climate data.
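The reduction to unique latitude/longitude/year combinations can be sketched in a few lines. This is an illustration under stated assumptions rather than the project's code: the function `unique_point_years` and the sample records are hypothetical.

```python
def unique_point_years(records):
    """Return the sorted unique (latitude, longitude, year) combinations."""
    combos = {
        (rec["decimalLatitude"], rec["decimalLongitude"], rec["Year_Rect"])
        for rec in records
    }
    return sorted(combos)

# Hypothetical specimen records; two share the same location and year.
records = [
    {"decimalLatitude": 34.41, "decimalLongitude": -119.85, "Year_Rect": 1998, "flower": 1},
    {"decimalLatitude": 34.41, "decimalLongitude": -119.85, "Year_Rect": 1998, "flower": 1},
    {"decimalLatitude": 40.00, "decimalLongitude": -105.30, "Year_Rect": 2001, "flower": 1},
]
point_years = unique_point_years(records)
```

Climate values then need to be extracted only once per point-year rather than once per specimen.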
Data/ClimateData/PRISM_PointYears.csv
This dataset represents the annual and long-term normal climate conditions at each collection location and year from which a specimen was collected (based on Point_Years_Core.csv). This data was created as a separate data file and not merged directly into AllData_Core.csv due to the high memory requirements of working with the entire dataset at once. Instead, data from AllData_Core.csv can be subset to those records/taxa/years of interest and then merged with the PRISM_PointYears.csv data based on the decimalLatitude, decimalLongitude, and Year_Rect fields.
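The subset-then-merge workflow described here might look like the following sketch. It is dependency-free for clarity; with a library such as pandas the same join would be a three-key merge. The rows, the column `Ann_tmean_04`, its values, and the helper `point_year_key` are hypothetical.

```python
KEYS = ("decimalLatitude", "decimalLongitude", "Year_Rect")

def point_year_key(row):
    """Build the join key. Coordinates must be written identically in both files to match."""
    return (float(row["decimalLatitude"]), float(row["decimalLongitude"]), int(row["Year_Rect"]))

# Hypothetical PRISM_PointYears.csv rows; the climate values are illustrative only.
prism_rows = [
    {"decimalLatitude": "34.41", "decimalLongitude": "-119.85", "Year_Rect": "1998", "Ann_tmean_04": "14.2"},
    {"decimalLatitude": "40.00", "decimalLongitude": "-105.30", "Year_Rect": "2001", "Ann_tmean_04": "8.9"},
]
climate_index = {point_year_key(row): row for row in prism_rows}

# Hypothetical subset of AllData_Core.csv restricted to one taxon of interest.
subset = [
    {"Accepted_species": "Eschscholzia californica", "DOY_Rect": "95",
     "decimalLatitude": "34.41", "decimalLongitude": "-119.85", "Year_Rect": "1998"},
    {"Accepted_species": "Eschscholzia californica", "DOY_Rect": "101",
     "decimalLatitude": "35.00", "decimalLongitude": "-118.00", "Year_Rect": "1975"},
]

merged = []
for spec in subset:
    climate = climate_index.get(point_year_key(spec))
    if climate is None:
        continue  # no climate row for this point-year
    climate_only = {k: v for k, v in climate.items() if k not in KEYS}
    merged.append({**spec, **climate_only})
```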
- Number of variables: 365
- Number of cases/rows: 2319672
- Variable List:
Due to the large number of climate variables included in this data, we instead document them using this schema:
Climate parameters are indicated by variables with one of the following prefixes:
- Ann : conditions in the year of collection
- Norm : average conditions from the year 1901 through 2000
- Prev : conditions in the year prior to collection
followed by text indicating the climate variable being estimated from PRISM climate data:
- ppt : total monthly precipitation
- tmean : monthly mean temperature
- tmin : monthly mean minimum temperature
- tmax : monthly mean maximum temperature
- tdmean : monthly mean dew point temperature
- vpdmin : monthly minimum vapor pressure deficit
- vpdmax : monthly maximum vapor pressure deficit
followed by the month being estimated as a two-digit numeral (01 through 12).
Each of these three components is always separated by the character ‘_’.
- Missing data codes: ‘’
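The naming schema above lends itself to generating or parsing column names programmatically. The sketch below assumes names of the form prefix, variable, and month joined by ‘_’ (for example `Norm_tmax_07`); the exact casing in the file should be checked against its header, and the functions `climate_columns` and `parse_climate_column` are hypothetical.

```python
from itertools import product

PREFIXES = ("Ann", "Norm", "Prev")
VARIABLES = ("ppt", "tmean", "tmin", "tmax", "tdmean", "vpdmin", "vpdmax")
MONTHS = tuple(f"{month:02d}" for month in range(1, 13))

def climate_columns():
    """List every column name implied by the prefix_variable_month schema."""
    return [f"{p}_{v}_{m}" for p, v, m in product(PREFIXES, VARIABLES, MONTHS)]

def parse_climate_column(name):
    """Split a column name into (prefix, variable, month), or return None if it does not fit."""
    parts = name.split("_")
    if len(parts) != 3:
        return None
    prefix, variable, month = parts
    if prefix in PREFIXES and variable in VARIABLES and month in MONTHS:
        return prefix, variable, int(month)
    return None

# Example: select all long-term normal precipitation columns.
norm_ppt = [c for c in climate_columns() if parse_climate_column(c)[:2] == ("Norm", "ppt")]
```

Columns that do not follow the schema, such as `decimalLatitude` or `Year_Rect`, parse to `None` and can be skipped when selecting climate variables.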
Phenological data pertaining to flowering times in this dataset consist of 2,319,672 specimen records of plant species collected in flower, while strobilating, or while fertile (this last category primarily applied to graminoids). These data were derived from the digital archives of 440 herbaria (see Readme for full listing), and subsequently cleaned and modified using several criteria described below to facilitate their use in phenological assessment.
To ensure the quality of the data used in this study, specimens were included in the dataset analyzed here only if, at the time of digitization, herbarium personnel had: verified that the specimens were collected when in flower, strobilating, or fertile; recorded GPS coordinates of the location from which the specimen was collected; and provided the precise date of collection (including month, day, and year). Only those specimens whose reproductive status was explicitly recorded within either the Darwin Core “reproductiveCondition” or “lifeStage” fields of their source's database were included in this study.
The taxonomic nomenclature used to describe each specimen was standardized using the Taxonomic Name Resolution Service, iPlant Collaborative, Version 4.0 (Boyle et al., 2013, Accessed: 30 August 2021; https://tnrs.biendata.org/). Duplicate collections of a species at the same location, DOY, and year were also removed. The resulting dataset included 2,319,672 specimens distributed throughout North America.
Climate data associated with the year and location of each specimen collection was then integrated into this data. All climate data was drawn from PRISM climate data (https://www.prism.oregonstate.edu/) and incorporated both long-term normal conditions at the location of each collection as well as the predicted conditions in the year and location of each collection.
Code was written in Python 3.7.
Multiple Python packages are required to run this code (see attached .yml file for full list). We recommend the usage of Anaconda for constructing the Python environment and installing the Python packages required to produce this dataset, including the PhenoColl package that was developed for this project (https://doi.org/10.5281/zenodo.8323153).