Data from: Phenological patterns of tropical mountain forest trees across the neotropics: Evidence from herbarium specimens
Data files
Jan 27, 2025 version files 40.29 MB
- README.md (25.99 KB)
- S1_db_raw_n_53793_Allspecies_GBIF_HCNQ_fieldnotes_RF_predictions.csv (30.53 MB)
- S1_species_list_TMF_Andes.csv (14.43 KB)
- S2_data_herbarium_records_calendars.csv (8.11 MB)
- Stratified_val_df_img_nondup_def.csv (1.61 MB)
Abstract
The flowering phenology of many Tropical Mountain Forest tree species remains poorly understood, including flowering synchrony and its drivers across Neotropical ecosystems. We obtained herbarium records for 427 tree species from a long-term monitoring transect on the north-western Ecuadorian Andes, sourced from the Global Biodiversity Information Facility (GBIF) and the Herbario Nacional del Ecuador (QCNE). Using machine learning algorithms, we identified flowering phenophases from digitized specimen labels and applied circular statistics to build phenological calendars across six climatic regions within the Neotropics. We found 47,939 herbarium records, of which 14,938 were classified as flowering by Random Forest Models. We constructed phenological calendars for six regions and 86 species with at least 20 flowering records across the six regions. Phenological patterns varied considerably across regions, among species within regions, and within species across regions. There was limited interannual synchronicity in flowering patterns within regions, primarily driven by bimodal species whose flowering peaks coincided with irradiance peaks. The predominantly high variability of phenological patterns among species and within species likely confers adaptive advantages by reducing interspecific competition during reproductive periods and promoting species coexistence in highly diverse regions with little or no seasonality.
README: Phenological patterns of tropical mountain forest trees across the neotropics: Evidence from herbarium specimens
https://doi.org/10.5061/dryad.08kprr59w
The datasets and code provided in this repository are the underlying data for the results presented in the article. The repository includes the species list used to search for herbarium records in GBIF and the Herbario Nacional del Ecuador.
We also provide a dataset of herbarium records (S1_db_raw_n_53793_Allspecies_GBIF_HCNQ_fieldnotes_RF_predictions.csv) that has been cleaned and prepared for running machine learning models. From this dataset, we selected a subset of 3,000 records (Stratified_val_df_img_nondup_def.csv) that was used as a training and validation dataset. In this file, NaN means data not available; empty cells were left as they are, since filling them may interfere with the code.
To train models, users will need the validation dataset and the script (random_forest_models_phenology2022.ipynb). The results of this analysis were used to select the best machine-learning model. In our case, the best was the random forest model.
To run random forest models for the whole dataset, users will need the complete dataset of herbarium records (S1_db_raw_n_53793_Allspecies_GBIF_HCNQ_fieldnotes_RF_predictions.csv) and the script (random_forest_predictions_phenology2022.ipynb). Please note that the herbarium dataset in this repository already includes the predictions from our study in its last 5 columns. Users can ignore these columns in the analysis or remove them from the input dataset. The fitted flowering and fruiting models from our analysis were saved in the files (RandomForest_flowering_nestimators100.bin) and (RandomForest_fruiting_nestimators100.bin), which users can load to access the model estimators directly without running all the models again.
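As an illustration of removing the stored predictions before re-running the models, the following is a minimal sketch using only the Python standard library. It assumes that the "last 5 columns" are the four `*_pred_*` columns plus `Filter`, as listed at the end of the variable table for this file; the in-memory sample row is made up and only stands in for the real CSV.

```python
import csv
import io

# Assumption: these are the five trailing columns described in the variable table.
PREDICTION_COLUMNS = [
    "Flowering_pred_RF",
    "Flowering_pred_prob_RF",
    "Fruiting_pred_RF",
    "Fruiting_pred_prob_RF",
    "Filter",
]


def drop_prediction_columns(rows):
    """Return copies of the rows without the stored prediction/filter columns."""
    return [
        {key: value for key, value in row.items() if key not in PREDICTION_COLUMNS}
        for row in rows
    ]


# Tiny made-up stand-in for the real file; replace with open("<path to csv>").
sample_csv = io.StringIO(
    "gbifID,fieldNotes,Flowering_pred_RF,Flowering_pred_prob_RF,"
    "Fruiting_pred_RF,Fruiting_pred_prob_RF,Filter\n"
    "1,flores blancas,YES,0.93,NO,0.08,Pass\n"
)
rows = list(csv.DictReader(sample_csv))
clean_rows = drop_prediction_columns(rows)
print(list(clean_rows[0].keys()))
```

The same selection could be done with a data-frame library; the standard-library version is used here only to keep the sketch dependency-free.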
To build circular calendars, users will need the dataset (S2_data_herbarium_records_calendars.csv) and the script (S2-scripts circular analysis phenology.R).
DATASETS
S1_species_list_TMF_Andes.csv
Supplementary material S1. Species list from tropical mountain forests on the north-western slope of the Ecuadorian Andes
This dataset contains the species list used to search for herbarium records. The list covers all species found across 16 permanent plots along a transect that spans forests between 600 and 3500 m asl on the north-western slope of the Ecuadorian Andes.
S1_db_raw_n_53793_Allspecies_GBIF_HCNQ_fieldnotes_RF_predictions.csv
Supplementary material S1. Processed dataset with records from GBIF and HCNQ. 53,793 records.
This dataset contains selected digitized and processed data from herbarium specimens obtained from GBIF and the Herbario Nacional del Ecuador. Variable descriptions follow the Darwin Core standard. This dataset is the input to the script that makes predictions with the Random Forest models (random_forest_predictions_phenology2022.ipynb).
Stratified_val_df_img_nondup_def.csv
Validation dataset of 3,000 records with labels for flowering and fruiting, obtained from field notes and images. This dataset was used to train the ML models and is the input to the script for model selection and training (random_forest_models_phenology2022.ipynb). Variable descriptions follow the Darwin Core standard.
S2_data_herbarium_records_calendars.csv
Supplementary material S2. Dataset of 47,939 unique herbarium records corresponding to 427 species (from 80 families) across the Neotropics, filtered (by complete dates and location within the Neotropics) for constructing phenological calendars.
Description of the data and file structure
S1_db_raw_n_53793_Allspecies_GBIF_HCNQ_fieldnotes_RF_predictions.csv
Empty cells are identified with a string "NA" = no data
VARIABLE | DESCRIPTION |
---|---|
gbifID | ID from GBIF, Darwin Core |
hcnqID | ID from Herbario Nacional Ecuador (herbarium record number) |
institutionCode | GBIF variable: An identifier for the institution having custody of the object(s) or information referred to in the record. |
recordedBy | GBIF variable: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original dwc:Occurrence. The primary collector or observer, especially one who applies a personal identifier (dwc:recordNumber), should be listed first. |
eventDate | GBIF variable: The date-time or interval during which a dwc:Event occurred. For occurrences, this is the date-time when the dwc:Event was recorded. Not suitable for a time in a geological context. |
year | GBIF variable: The four-digit year in which the dwc:Event occurred, according to the Common Era Calendar. |
month | GBIF variable: The integer month in which the dwc:Event occurred. |
day | GBIF variable: The integer day of the month on which the dwc:Event occurred. |
Year_interval | Category interval created to stratify the data for the validation dataset. 3 categories: <=1970, 1971-2010, >=2011 |
country | GBIF variable: The name of the country or major administrative unit in which the dcterms:Location occurs. |
decimalLongitude | GBIF variable: The geographic longitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive. |
Binned_longitude | Category interval created to stratify the data for the validation dataset. 5 categories from -50 to -170: (-50.0, -25.0], (-70.0, -50.0], (-90.0, -70.0], (-110.0, -90.0], (-130.0, -110.0], (-170.0, -150.0] |
decimalLatitude | GBIF variable: The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. |
Binned_latitude | Category interval created to stratify the data for the validation dataset. 5 categories from -50 to 50: (-50.0, -30.0], (-30.0, -10.0], (-10.0, 10.0], (10.0, 30.0], (30.0, 50.0] |
elevation | GBIF variable, original name verbatimElevation: The original description of the elevation (altitude (m), usually above sea level) of the Location. |
family | GBIF variable: The full scientific name of the family in which the dwc:Taxon is classified. |
genus | GBIF variable: The full scientific name of the genus in which the dwc:Taxon is classified. |
species | Corresponds to GBIF variable specificEpithet: the name of the first or species epithet of the dwc:scientificName. |
acceptedScientificName | Species scientific name, built by concatenating genus and species |
scientificNameAuthorship | GBIF variable: The authorship information for the dwc:scientificName formatted according to the conventions of the applicable dwc:nomenclaturalCode. |
image_url | For GBIF datasets links to images of herbarium specimens, for HCNQ, link to information of the specimen in HCNQ database |
reproductiveCondition | The reproductive condition of the biological individual(s) represented in the dwc:Occurrence. |
n_records | number of specimen records per species in this dataset. |
nrecords_interval | category interval created to perform data stratification for creation of the validation dataset. 4 categories: 1-10; 11-100; 101-500; >500 |
occurrenceRemarks | GBIF variable: Comments or notes about the dwc:Occurrence. |
dynamicProperties | GBIF variable: A list of additional measurements, facts, characteristics, or assertions about the record. Meant to provide a mechanism for structured content. |
fieldNotes | GBIF variable: One of a) an indicator of the existence of, b) a reference to (publication, URI), or c) the text of notes taken in the field about the dwc:Event. Notes correspond to the original language in which they were collected. Possible options: English, Spanish, French, Dutch. |
FieldNotes_processed | A processed variable from field notes, eliminating punctuation marks and numbers. Original language of labels are kept. Possible options: English, Spanish, French, Dutch. |
Flowering_pred_RF | Binary variable result of random forest models, YES=flowering, NO=not flowering |
Flowering_pred_prob_RF | Estimated probabilities of the results from random forest models, P values > 0.5 = YES, P values <0.5 = NO |
Fruiting_pred_RF | Binary variable result of random forest models, YES = fruiting, NO = not fruiting |
Fruiting_pred_prob_RF | Estimated probabilities of the results from random forest models, P values > 0.5 = YES, P values <0.5 = NO |
Filter | Filter used to select records for constructing calendars. "Pass" = records used to construct the calendars; "error no year data" = records without a year; "year or month 0" = records with no data for month or year; "out of latitude" = records outside the Neotropics |

For full details of the GBIF variables, please see the Darwin Core Quick Reference Guide: https://dwc.tdwg.org/terms/#dwc:fieldNotes
S2_data_herbarium_records_calendars.csv
Empty cells are identified with a string "NA" = no data
VARIABLE | DESCRIPTION |
---|---|
gbifID | ID from GBIF, Darwin Core |
hcnqID | ID from Herbario Nacional Ecuador (herbarium record number) |
eventDate | GBIF variable: The date-time or interval during which a dwc:Event occurred. For occurrences, this is the date-time when the dwc:Event was recorded. Not suitable for a time in a geological context. |
year | GBIF variable: The four-digit year in which the dwc:Event occurred, according to the Common Era Calendar. |
month | GBIF variable: The integer month in which the dwc:Event occurred. |
day | GBIF variable: The integer day of the month on which the dwc:Event occurred. |
country | GBIF variable: The name of the country or major administrative unit in which the dcterms:Location occurs. |
decimalLongitude | GBIF variable: The geographic longitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive. |
decimalLatitude | GBIF variable: The geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. |
elevation | GBIF variable, original name verbatimElevation: The original description of the elevation (altitude, usually above sea level) of the Location. |
family | GBIF variable: The full scientific name of the family in which the dwc:Taxon is classified. |
genus | GBIF variable: The full scientific name of the genus in which the dwc:Taxon is classified. |
species | Corresponds to GBIF variable specificEpithet: the name of the first or species epithet of the dwc:scientificName. |
acceptedScientificName | Species scientific name, built by concatenating genus and species |
Flowering_pred_RF | Binary variable result of random forest models, YES=flowering, NO=not flowering |
Flowering_pred_prob_RF | Estimated probabilities of the results from random forest models, P values > 0.5 = YES, P values <0.5 = NO |
Fruiting_pred_RF | Binary variable result of random forest models, YES = fruiting, NO = not fruiting |
Fruiting_pred_prob_RF | Estimated probabilities of the results from random forest models, P values > 0.5 = YES, P values <0.5 = NO |
clusters | Climatic region where the record is located. There are 6 regions. NA values correspond to locations that fall on water bodies and cannot be assigned to cluster maps. |
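
To illustrate how the columns of this file feed a circular analysis, here is a minimal Python sketch (the repository's own calendar script is written in R and is not reproduced here). It keeps records predicted as flowering, groups them by `clusters`, and computes the circular mean direction and mean resultant length of the collection months. Mapping each month to the mid-point of its 30° sector, the string cluster labels, and the sample records are assumptions made for illustration.

```python
import math
from collections import defaultdict


def month_to_angle(month):
    """Map month 1..12 to radians, using the mid-point of each month (assumption)."""
    return 2 * math.pi * (month - 0.5) / 12


def circular_summary(months):
    """Return (mean angle in radians, mean resultant length r) for a list of months."""
    angles = [month_to_angle(m) for m in months]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    r = math.hypot(c, s)  # r near 1: concentrated flowering; r near 0: spread out
    return mean_angle, r


# Made-up records using the column names of the S2 dataset.
records = [
    {"clusters": "1", "month": "12", "Flowering_pred_RF": "YES"},
    {"clusters": "1", "month": "1", "Flowering_pred_RF": "YES"},
    {"clusters": "1", "month": "1", "Flowering_pred_RF": "YES"},
    {"clusters": "1", "month": "2", "Flowering_pred_RF": "YES"},
    {"clusters": "1", "month": "7", "Flowering_pred_RF": "NO"},
    {"clusters": "2", "month": "6", "Flowering_pred_RF": "YES"},
    {"clusters": "NA", "month": "3", "Flowering_pred_RF": "YES"},
]

months_by_cluster = defaultdict(list)
for rec in records:
    # Skip non-flowering records and records that could not be assigned a region.
    if rec["Flowering_pred_RF"] == "YES" and rec["clusters"] != "NA":
        months_by_cluster[rec["clusters"]].append(int(rec["month"]))

summary = {cl: circular_summary(m) for cl, m in months_by_cluster.items()}
for cluster, (angle, r) in sorted(summary.items()):
    print(cluster, round(math.degrees(angle), 1), round(r, 3))
```

Because months are treated as angles, December and January count as neighbours, which an ordinary arithmetic mean of month numbers would miss.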
CODE/SOFTWARE
random_forest_models_phenology2022.ipynb
A Jupyter notebook written in Python to preprocess the validation dataset, create ML models, compare them, and choose the best ML model to make fruiting and flowering predictions (random forest). This script can be viewed and edited with a compatible IDE such as VSCode.
random_forest_predictions_phenology2022.ipynb
A Jupyter notebook written in Python to make predictions for the final dataset using the random forest models. This script can be viewed and edited with a compatible IDE such as VSCode.
RandomForest_flowering_nestimators100.bin
RF_models_file: Binary file of the random forest model for flowering prediction. https://github.com/sayalaruano/Phenology_from_herbarium_records/blob/main/ML_models/Final_models/RandomForest_flowering_nestimators100.bin
RandomForest_fruiting_nestimators100.bin
RF_models_file: Binary file of the random forest model for fruiting prediction. https://github.com/sayalaruano/Phenology_from_herbarium_records/blob/main/ML_models/Final_models/RandomForest_fruiting_nestimators100.bin
The full code used to run the machine learning models can be found on GitHub: https://github.com/sayalaruano/Phenology_from_herbarium_records
S2-scripts circular analysis phenology
Code used for the circular analysis and for generating the figures included in the article. For specific questions regarding this script, please write to jennyordonez@gmail.com
SHARING/ACCESS INFORMATION
Original data was derived from the following sources:
- Global Biodiversity Information Facility – database: GBIF API https://api.gbif.org/v1/
- Herbario Nacional del Ecuador (QCNE) - Instituto Nacional de Biodiversidad: Data portal https://bndb.sisbioecuador.bio/bndb/collection
Methods
Species selection and retrieval of herbarium records
We selected species from tropical mountain forests on the north-western slope of the Ecuadorian Andes, using data on tree inventories from 16 permanent plots from the ‘Pichincha long-term forest dynamics and carbon monitoring transect’. The transect covers forests between 600 and 3500 m asl, at the equator (latitude 0°11.32’ N – 0°7.6’ S), characterized by high tree alpha and beta diversity. The initial flora list included 516 unique taxa unequivocally identified to the level of subspecies, species, or genus, and 82 taxa with ambiguous identification at the species level (conferatur or affinis). From these 598 taxa, we eliminated duplicates (n=35 entries identified as conferatur or affinis that were already in the list of 516 taxa) and entries identified only to genus level (n=123). Finally, for 8 taxa that were identified to the level of subspecies or variety, we added 8 entries identified only to the species level, to increase the chance of finding suitable herbarium specimens (for instance, for the entry “Aegiphila lopez-palacii var. pubescens” we added an entry “Aegiphila lopez-palacii”). The final species list included 444 species from 80 families (see Supplementary material S1). All species’ names were validated against the Checklist of the Vascular Plants of the Americas. For each species, we retrieved its synonyms from the TROPICOS database (https://www.tropicos.org/home, accessed on 07/30/2022) using the taxize R package. The final list of 2,908 entries, comprising the original species names and their synonyms, was used to search for herbarium specimens.
We searched for herbarium specimens in the Global Biodiversity Information Facility (GBIF) database using the GBIF API (https://api.gbif.org/v1/). The search parameters matched our species list to names in the GBIF backbone. We applied filters to retrieve only specimens that: (1) had complete geographical coordinates and no GBIF-identified geospatial issues, (2) had complete dates or dates with at least the month and year, (3) corresponded to locations within the Neotropics, i.e. latitude between -23 and 23 (23° S to 23° N) and longitude between -160 and -20 (160° W to 20° W), and (4) had information in at least one of the columns with field notes (i.e., "fieldNotes", "occurrenceRemarks", and "dynamicProperties" according to the Darwin Core Standard from GBIF). The original GBIF dataset had 54,146 specimen records. We removed duplicates from this initial dataset: we found 5,886 actual duplicates by matching on scientific name, collector name, year, latitude, and longitude. We also removed records corresponding to GBIF-added subspecies and varieties that did not have the species name as a synonym in TROPICOS, and records that only listed field notes in the "dynamicProperties" column. The total number of records in the final GBIF dataset was 41,004.
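The geographic filter and the duplicate check described above can be sketched in a few lines of Python. This is an illustration only: it assumes inclusive bounds, exact matching on the five fields named in the text, and Darwin Core column names from the variable tables; the sample records are made up.

```python
def in_neotropics(latitude, longitude):
    """True if the point lies inside the bounding box given in the text (bounds assumed inclusive)."""
    return -23 <= latitude <= 23 and -160 <= longitude <= -20


def remove_duplicates(records):
    """Keep the first record for each (name, collector, year, latitude, longitude) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec["acceptedScientificName"],
            rec["recordedBy"],
            rec["year"],
            rec["decimalLatitude"],
            rec["decimalLongitude"],
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up records for illustration.
records = [
    {"acceptedScientificName": "Aegiphila lopez-palacii", "recordedBy": "A. Collector",
     "year": 1990, "decimalLatitude": 0.1, "decimalLongitude": -78.5},
    {"acceptedScientificName": "Aegiphila lopez-palacii", "recordedBy": "A. Collector",
     "year": 1990, "decimalLatitude": 0.1, "decimalLongitude": -78.5},  # exact duplicate
    {"acceptedScientificName": "Aegiphila lopez-palacii", "recordedBy": "B. Collector",
     "year": 1990, "decimalLatitude": 0.1, "decimalLongitude": -78.5},  # different collector
    {"acceptedScientificName": "Aegiphila lopez-palacii", "recordedBy": "C. Collector",
     "year": 2001, "decimalLatitude": 40.0, "decimalLongitude": -75.0},  # outside the box
]

inside = [r for r in records if in_neotropics(r["decimalLatitude"], r["decimalLongitude"])]
kept = remove_duplicates(inside)
print(len(records), len(inside), len(kept))
```

Real collector names and coordinates often differ in spelling or precision between providers, so exact matching as used here would likely miss some duplicates.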
We also retrieved data from the Herbario Nacional del Ecuador (QCNE) - Instituto Nacional de Biodiversidad (INABIO https://bndb.sisbioecuador.bio/bndb/collections), rendering an initial dataset of 10,881 records. The cleaning and filtering protocol described above for the GBIF records was also applied to the QCNE dataset, obtaining a dataset of 6,935 records. We merged the GBIF and QCNE datasets into one and made a final check for duplicates and potential errors or incomplete data in dates of collection and species names. Lastly, we merged the three columns with field-notes information into one column used as input to run the machine learning models (see below). The final dataset included 47,939 unique records corresponding to 427 species (from 80 families) across the Neotropics (Supplementary material S2). The dataset covered the period 1821 to 2022, but most records (89.5%) were gathered from 1980 onwards (Supplementary material S3).
Machine learning approaches to determine phenological status
We used natural language processing (NLP), a machine learning approach, to determine the phenological status of each specimen based on the information in the field notes, as these commonly contain words related to phenology (‘flowers’, ‘buds’, among others). First, we created a training and evaluation dataset of 3,000 specimen records to compare the performance of different machine-learning models and select the best. We selected the records for this dataset from our final dataset by stratified sampling on the year, latitude, and longitude of all specimens with links to images. We visually checked the agreement between images and field note labels to assess whether labels that included flowering information corresponded to a flowering specimen. Only 1,913 specimens had valid links to images, of which 97% contained information about flowering on the label and had a good correspondence to images (80% of the flowering labels).
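The stratified selection step can be illustrated with a short Python sketch. It assumes proportional allocation across strata defined by the binned columns of the dataset (`Year_interval`, `Binned_latitude`, `Binned_longitude`); the allocation rule, the seed, and the toy records are assumptions, not the procedure used in the notebooks.

```python
import random
from collections import defaultdict


def stratified_sample(records, strata_keys, n_total, seed=0):
    """Draw about n_total records, allocating draws to strata in proportion to their size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in strata_keys)].append(rec)
    sample = []
    for members in groups.values():
        # Rounding means the final size can differ slightly from n_total.
        share = round(n_total * len(members) / len(records))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample


# Toy data: 60 records in one stratum and 40 in another (bin labels are illustrative).
records = (
    [{"id": i, "Year_interval": "1971-2010", "Binned_latitude": "(-10.0, 10.0]",
      "Binned_longitude": "(-90.0, -70.0]"} for i in range(60)]
    + [{"id": 60 + i, "Year_interval": ">=2011", "Binned_latitude": "(-10.0, 10.0]",
        "Binned_longitude": "(-90.0, -70.0]"} for i in range(40)]
)

subset = stratified_sample(
    records, ["Year_interval", "Binned_latitude", "Binned_longitude"], n_total=10
)
print(len(subset))
```

Sampling within each stratum keeps rare combinations of period and location represented in the labelled subset, which a simple random draw might under-sample.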
Next, we cleaned the field notes by removing punctuation, numbers, special symbols, and certain repetitive expressions that were not informative (i.e. “na”, “ca”, “PORT US”, etc.). Then, we used the Natural Language Toolkit (NLTK) Python package [30] to delete stop words in different languages, including Spanish, English, Portuguese, and French. Since machine learning algorithms usually require numerical matrices as input, we converted the text of the field notes into a numerical matrix using the “bag of words” method, which describes the occurrence of words within a text. This vectorization method splits the text into single words and counts the frequency of each word in a piece of text. The “bag of words” output is a numerical matrix in which columns are words from the training dataset, rows are the observations from the training dataset, and each cell is the number of times a word appears in a particular observation. We applied the “bag of words” method to our training and evaluation dataset using “CountVectorizer” from the scikit-learn Python package.
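To make the cleaning and "bag of words" steps concrete, the following is a small standard-library sketch of the idea. It is not a reproduction of the NLTK stop-word lists or of the exact tokenisation used by `CountVectorizer`; the tiny stop-word set, the regular expression, and the example notes are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the study used NLTK's language-specific lists.
STOP_WORDS = {"con", "de", "y", "the", "with", "and"}


def clean_note(text):
    """Lower-case a field note, drop punctuation and digits, and remove stop words."""
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]


def bag_of_words(notes):
    """Return (vocabulary, matrix): one row per note, one column per vocabulary word."""
    tokenised = [clean_note(note) for note in notes]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up field notes in two languages.
notes = [
    "Flores blancas, 3 m.",
    "Tree 8 m; white flowers and buds, flowers fragrant",
]
vocabulary, matrix = bag_of_words(notes)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is the feature vector a classifier would see for one specimen; the word "flowers" appears twice in the second note, so its cell holds 2.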
Finally, we evaluated three different approaches to predict whether a specimen was flowering from field notes data. First, we created a baseline model for flowering using the scikit-learn “DummyClassifier”, which is a simple classifier that always predicts the most frequent class in the data. Then, we applied the naïve Bayes and random forest algorithms (RFM) for the same purposes, applying “GaussianNB” and “RandomForestClassifier” from scikit-learn.
We estimated the performance of the models using 5-fold cross-validation and evaluated them using five metrics: accuracy, the total proportion of ‘flowering’ and ‘not flowering’ predictions that were correct; precision, the proportion of ‘flowering’ predictions that were correct; recall, the proportion of true ‘flowering’ records correctly predicted; ROC-AUC which quantifies the ability of a binary classifier to distinguish between flowering and non-flowering classes, and F1, the harmonic mean of precision and recall.
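The threshold-based metrics defined above can be computed directly from the counts of true and false positives and negatives. The sketch below does this in plain Python and compares a hypothetical model with a most-frequent-class baseline (the idea behind "DummyClassifier"); the labels are made up, and ROC-AUC and the cross-validation loop are left out for brevity.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive="YES"):
    """Accuracy, precision, recall, and F1 for a binary flowering / not-flowering task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 flowering and 5 non-flowering specimens.
y_true = ["YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO"]
y_model = ["YES", "YES", "NO", "YES", "NO", "NO", "NO", "NO"]

# Baseline: always predict the most frequent class in the labels.
most_frequent = Counter(y_true).most_common(1)[0][0]
y_baseline = [most_frequent] * len(y_true)

model_scores = classification_metrics(y_true, y_model)
baseline_scores = classification_metrics(y_true, y_baseline)
print(model_scores)
print(baseline_scores)
```

The baseline reaches a respectable accuracy simply because non-flowering records are the majority, yet its recall is zero, which is why accuracy alone is not enough to compare models on imbalanced labels.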
Once we had determined the best-performing model, we retrained it on the entire training and evaluation dataset and used it to predict flowering for all records in the final dataset (n = 47,939). We applied the same cleaning procedure to the whole dataset as to the training and evaluation dataset. For all subsequent analyses, we considered only records predicted to be flowering (n = 14,938).