Data and code from: Data integration advances reproductive phenology research across temporal, spatial, and taxonomic scales
Data files
Feb 20, 2026 version files (61.41 MB)
- README.md (27.95 KB)
- STR_repro_phen_data.zip (61.38 MB)
Abstract
Climate change is altering plant reproductive phenology; however, a scarcity of long-term, systematic monitoring hinders our ability to quantify and predict these responses in many parts of the world. We addressed this gap by demonstrating how data integration can be used to produce a synthesised record of reproductive phenology observations (flowering and fruiting) that spans longer time periods, larger spatial scales, and includes more species than any single source alone. Using Australian subtropical rainforest trees as a case study, we integrated reproductive phenology observations from both common data sources—published datasets, herbarium specimens, and citizen science records—and previously untapped expert botanical knowledge, including private photographic collections, field notes, and seed collections. Data integration yielded 110,657 records of flowers or fruits from 915 species (representing half of all subtropical rainforest tree species) spanning 255 years (1770-2025). We found that different data sources provided unique information across temporal, spatial, and taxonomic dimensions. Herbarium specimens provided the longest taxonomic coverage, while citizen science contributed the most recent observations. Critically, 197 species (21.5 %) were represented from only a single source, including 154 species represented solely by herbarium specimens and 46 species in expert botanist collections. While 46.6 % of species had fewer than 50 observations, for many species, these represent the only available historical phenology data. This integrated dataset may be the only available resource for establishing pre-industrial baselines for the reproductive phenology of Australian subtropical rainforest trees. This would not have been possible without the engagement and contributions of the local botanical community, which greatly expanded the research capacity beyond conventional data sources.
Ella Cathcart-van Weeren
Dataset overview
This dataset contains reproductive phenology data (observations of flowering and fruiting events) and code to replicate the analysis in Cathcart-van Weeren et al. (2026), which tests the hypothesis that different sources of reproductive phenological data provide complementary, rather than redundant, observations of plant reproductive phenology.
The dataset integrates multiple data sources (published datasets, herbarium specimens, iNaturalist observations, expert botanist field notes, photo collections, and seed collection records) representing 915 species of Australian subtropical rainforest trees spanning 255 years (1770-2025). It provides comprehensive temporal and spatial coverage of (mostly) presence-only reproductive observations.
The data support analyses of phenological patterns across many spatial and temporal scales, including seasonal flowering and fruiting patterns, longer-term patterns such as reproductive frequency, and responses to environmental drivers, including shifts linked to climate change.
Repository Structure
Repo Name: STR_repro_phen_data.zip (download and unzip to access RProject)
Folders
- data
  - STR_repro_phen_data_anon_2026-02-19.csv - The synthesised phenology dataset with data sources and locations anonymised. Each row of the data is a flowering or fruiting observation for a given subtropical rainforest tree species, date and location.
  - STR_tree_species_list.csv - a list of all tree species found in the Subtropical Rainforest of Australia
  - GIS sub-folder
    - MVS_Numeric_Order.csv - a lookup table of the mapped Major Vegetation Groups
    - NVIS_Tiff.tif - a TIFF file with the mapped locations of extant vegetation
- outputs - sub-folders to save figures and tables
- scripts
  - cathcart_van_weeren_et_al_2026_analysis.R - script to recreate results from Cathcart-van Weeren et al. (2026), Data integration advances reproductive phenology research across temporal, spatial, and taxonomic scales
  - cathcart_van_weeren_et_al_2026_custom_functions.R - custom functions for plotting
Description of the synthesis data: STR_repro_phen_data_anon_2026-02-19.csv
Note: All coordinates in this dataset have been fuzzed to two decimal places (in decimal degrees) for privacy protection, regardless of original precision.
| Variable | Description | Format / Units/Categories |
|---|---|---|
| family | Plant family | text |
| genus | Genus name | text |
| species_standard | Standardised species name (Genus species, with subspecies removed) | text |
| year | Year of observation | numeric (range: 1770-2025) |
| month | Month of observation | numeric (range: 1-12) |
| day | Day of month if provided | numeric (range: 1-31, NA if no day provided) |
| flower | Flowering status | 1 = present, 0 = absent, NA = not recorded |
| fruit | Fruiting status | 1 = present, 0 = absent, NA = not recorded |
| latitude_fuzzy | Latitude in decimal degrees (fuzzed for privacy) | numeric |
| longitude_fuzzy | Longitude in decimal degrees (fuzzed for privacy) | numeric |
| spatial_confidence | A categorical variable indicating the location precision of the original observation before fuzzing. This variable can be used to filter to specific resolutions if you require more precise locations | categories: < 1 km, 1-5 km, 5-10 km, 10-15 km, 15-20 km, 20-30 km, 30-40 km, 40-50 km, > 50 km, Unknown |
| spatial_resolution_km | Original spatial precision of coordinates; NA if none was provided | numeric (kilometers) |
| data_type | published_data - data from published papers. Herbarium_specimens - preserved specimens with collection data. iNaturalist_record - research-grade and needs-ID-grade community science observations. Field_observations - direct field observations made by experts. photo_collection - records from photos taken by expert botanists and not published on iNaturalist. seed_collection - records from native nurseries and individual seed collectors. | Categories: Field_observations, Herbarium_specimens, iNaturalist_record, photo_collection, published_data, seed_collection |
| source_anon | Anonymised data source or contributor. | GBIF, iNaturalist, Contributor_1, Contributor_2, Contributor_3, Contributor_4, Contributor_5, Contributor_6, Contributor_7, Contributor_8, Contributor_9, Contributor_10, published_data_Mo_et_al, published_data_Innis_et_al |
| basisofrecord | Basis of record as given by the source database, using Darwin Core categories | HUMAN_OBSERVATION, PRESERVED_SPECIMEN, OBSERVATION |
| publisher | Data publishing institution. For GBIF data, this identifies the herbarium that published the observations | 29 different publishers that contributed data |
| scientific_name | Full currently accepted scientific name with taxonomic authority | text |
| time_period | Temporal period classification | factor: 1770-1970, 1971-2000, 2001-2025 |
| within_10km_rainforest | Located within 10 km of rainforest vegetation | logical: TRUE/FALSE |
| inatid | Unique numeric identifier assigned by iNaturalist to each observation record. Can be used to retrieve the original observation at https://www.inaturalist.org/observations/{inat_id}. Populated for iNaturalist records sourced directly from the iNaturalist platform; NA for all other data sources. | numeric |
| gbifid | Unique numeric identifier assigned by the Global Biodiversity Information Facility (GBIF) to each occurrence record. Can be used to retrieve the original record at https://www.gbif.org/occurrence/{gbifID}. Populated for records sourced from GBIF; NA for all other data sources. | numeric |
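As a minimal sketch of how the two identifier columns might be used, the helper below builds a link back to the original record from the URL patterns given in the table. The function name `record_url()` and the toy identifiers are illustrative assumptions and are not part of the repository.

```r
# Hypothetical helper (not part of the repository): build a URL to the
# original record from the inatid / gbifid columns described above.
record_url <- function(inatid, gbifid) {
  url <- rep(NA_character_, length(inatid))
  has_inat <- !is.na(inatid)
  has_gbif <- !is.na(gbifid) & !has_inat

  # format(..., scientific = FALSE) avoids identifiers such as 1e+05
  url[has_inat] <- paste0(
    "https://www.inaturalist.org/observations/",
    format(inatid[has_inat], scientific = FALSE, trim = TRUE)
  )
  url[has_gbif] <- paste0(
    "https://www.gbif.org/occurrence/",
    format(gbifid[has_gbif], scientific = FALSE, trim = TRUE)
  )
  url
}

# Toy rows with made-up identifiers, for illustration only
toy <- data.frame(inatid = c(123, NA, NA), gbifid = c(NA, 456, NA))
toy$url <- record_url(toy$inatid, toy$gbifid)
print(toy)
```

Rows with neither identifier (for example, expert contributor records) keep an NA URL.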
Missing Data Codes
- NA - Data not available or not recorded for this observation
- Empty cells indicate no data collected for that variable
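The sketch below shows one way to read a file that follows these missing-data conventions, treating both `NA` and empty cells as missing, and then filtering by `spatial_confidence`. It uses a small invented CSV written to a temporary file so that it runs on its own; the rows are made up and only a subset of the columns is shown.

```r
# Toy CSV using the same missing-data conventions (NA and empty cells).
# All values are invented for illustration.
csv_lines <- c(
  "species_standard,year,month,day,flower,fruit,spatial_confidence",
  "Ficus watkinsiana,2019,3,14,1,NA,< 1 km",
  "Ficus watkinsiana,1985,11,,NA,1,Unknown",
  "Argyrodendron trifoliolatum,2021,7,2,1,1,1-5 km"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Treat both "NA" and empty cells as missing values
obs <- read.csv(tmp, na.strings = c("NA", ""), stringsAsFactors = FALSE)

# Build a full date only where a day was recorded; other rows stay NA
obs$date <- as.Date(ISOdate(obs$year, obs$month, obs$day))

# Keep records whose original location precision was 5 km or better
precise <- obs[obs$spatial_confidence %in% c("< 1 km", "1-5 km"), ]
print(precise)
```

For the real dataset, the same `read.csv()` call would presumably point at `data/STR_repro_phen_data_anon_2026-02-19.csv` inside the unzipped project.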
Methods
Data Sources
Data were compiled from six primary sources:
- Published data - Mo and Waterhouse 2015 and Innis et al. 1989
- Herbarium specimens - Digital records from herbaria accessed through GBIF
- iNaturalist - community science observations of flowering and fruiting
- Expert photo collections - Botanical surveys and targeted collections by researchers and taxonomists. Photos were either assessed manually or run through a machine learning model (93 % accurate) to classify reproductive material
- Expert field observations - Observations of flowering or fruiting made by expert botanists
- Expert seed collections - Records of dates, locations, and species of seed collections
Data Processing
- All taxonomic names were standardised using the APCalign R package to align with the Australian Plant Census nomenclature, and taxonomic synonyms were resolved to currently accepted names
- Duplicate records were identified and removed based on species, date, and location
- Only records with verifiable reproductive status (flowering or fruiting) were included
- Spatial coordinates were checked for errors and outliers
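The filtering and de-duplication steps above can be sketched roughly as follows. This is an illustration on invented rows, not the authors' processing code: the choice of key columns, the reading of "verifiable reproductive status" as "flower or fruit is recorded", and the use of the stated study-region bounds as a coordinate check are all assumptions.

```r
# Invented example rows; column names follow the variable table
obs <- data.frame(
  species_standard = rep("Ficus watkinsiana", 4),
  year  = c(2019, 2019, 2019, 2020),
  month = c(3, 3, 4, 1),
  day   = c(14, 14, 14, NA),
  latitude_fuzzy  = c(-28.21, -28.21, -28.21, -48.00),
  longitude_fuzzy = c(153.12, 153.12, 153.12, 153.12),
  flower = c(1, 1, NA, 1),
  fruit  = c(NA, NA, 1, NA)
)

# Assumption: a record is usable if flower or fruit status is recorded
has_status <- !is.na(obs$flower) | !is.na(obs$fruit)

# Assumption: coordinates outside the study region (21-35 S, 150-155 E)
# are treated as errors or outliers
in_region <- obs$latitude_fuzzy >= -35 & obs$latitude_fuzzy <= -21 &
  obs$longitude_fuzzy >= 150 & obs$longitude_fuzzy <= 155

kept <- obs[has_status & in_region, ]

# Drop duplicates defined by species, date, and location
key_cols <- c("species_standard", "year", "month", "day",
              "latitude_fuzzy", "longitude_fuzzy")
deduped <- kept[!duplicated(kept[, key_cols]), ]
print(deduped)
```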
Spatial Coverage
Spatial data files are used to identify rainforest vegetation and validate the proximity of phenology observations to mapped rainforest areas.
Region: Southeast Queensland and northeast New South Wales, Australia
Vegetation type: Subtropical rainforest communities including cool temperate and tropical or subtropical rainforest and vine thickets.
Coordinate range: Latitude 21-35 °S and Longitude 150-155 °E
Files from GIS sub-folder
NVIS_Tiff.tif - National Vegetation Information System (NVIS) Version 6.0 extant vegetation layer
- Source: Department of Climate Change, Energy, the Environment and Water (DCCEEW)
- Download URL: https://digital.atlas.gov.au/maps/national-vegetation-information-system-nvis-version-6-0-extant-vegetation/
- Projection: GDA94 / Australian Albers (EPSG:3577)
- Resolution: 100 m × 100 m
- Coverage: Cropped to subtropical study region (150-155 °E, 21-35 °S)
- Data type: Raster with Major Vegetation Group (MVG) classifications
MVG_Numeric_Order.csv - Lookup table for Major Vegetation Group classifications
- Maps numeric raster values to vegetation type names
- Used to identify rainforest vegetation types (cool temperate, tropical/subtropical rainforest, and vine thickets)
- Classification reference: https://www.dcceew.gov.au/sites/default/files/documents/mvg-introduction.pdf
Purpose and Usage
These GIS files are used in the analysis script to:
1. Identify pixels representing rainforest vegetation
2. Create a 10 km buffer around mapped rainforest areas
3. Determine if each observation falls within 10 km of rainforest (stored in within_10km_rainforest variable)
4. Provide spatial validation of observation locations
This spatial classification helps users filter observations based on proximity to rainforest habitat and assess data quality across the landscape.
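To illustrate the idea behind the 10 km proximity flag, here is a simplified base R sketch that measures the distance from each observation to the nearest rainforest pixel centre in projected coordinates (metres). It is a stand-in, not the analysis script's method: the real workflow operates on the NVIS raster, presumably via the terra and sf packages, and the coordinates and the function name `nearest_rainforest_m()` below are invented.

```r
# Invented projected coordinates in metres (e.g. an equal-area CRS such as
# EPSG:3577); these stand in for rainforest pixel centres and observations.
rainforest_xy <- cbind(x = c(0, 1000, 2000), y = c(0, 0, 1000))
obs_xy        <- cbind(x = c(500, 20000),    y = c(8000, 20000))

# Hypothetical helper: distance (m) from each point to its nearest
# rainforest pixel centre
nearest_rainforest_m <- function(pts, rf) {
  apply(pts, 1, function(p) {
    min(sqrt((rf[, 1] - p[1])^2 + (rf[, 2] - p[2])^2))
  })
}

dist_m <- nearest_rainforest_m(obs_xy, rainforest_xy)

# Equivalent in spirit to the within_10km_rainforest variable
within_10km_rainforest <- dist_m <= 10000
print(data.frame(dist_m = round(dist_m), within_10km_rainforest))
```

Distances must be computed in a projected coordinate system; applying this directly to latitude and longitude in degrees would not give metres.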
Data Usage Notes
Known Limitations
- Temporal sampling is uneven, with greater observation density in recent decades
- Spatial coverage is concentrated in accessible rainforest areas
- Herbarium specimens may be biased toward peak flowering/fruiting periods
- Species-level sample sizes vary considerably
- Earlier historical records (pre-1900) are sparse
Analysis Considerations
- Account for temporal and spatial sampling bias in statistical models
- Distinguish between systematic monitoring data and opportunistic observations when appropriate
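Before modelling, it can help to quantify how uneven the sampling is. The sketch below counts observations per decade and lists species below a minimum sample size; the default of 50 mirrors the threshold mentioned in the abstract. The toy rows and the helper name `sparse_species()` are assumptions made for illustration.

```r
# Invented example rows
obs <- data.frame(
  species_standard = c("Ficus watkinsiana", "Ficus watkinsiana",
                       "Ficus watkinsiana", "Argyrodendron trifoliolatum"),
  year = c(1895, 2019, 2021, 2022)
)

# Observations per decade show how sampling effort changes through time
obs$decade <- floor(obs$year / 10) * 10
obs_per_decade <- table(obs$decade)
print(obs_per_decade)

# Hypothetical helper: species with fewer than min_obs observations
sparse_species <- function(obs, min_obs = 50) {
  n <- table(obs$species_standard)
  names(n)[n < min_obs]
}

# A small threshold is used here only because the toy data are tiny
print(sparse_species(obs, min_obs = 2))
```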
Software Requirements
R and System Requirements
- R version: 4.5.2 or higher
- System dependencies:
- GDAL (for spatial data processing with terra and sf packages)
- PROJ library (for coordinate transformations)
Note: Most users can install these dependencies automatically when installing the terra and sf packages.
Required R Packages
Essential packages for core analysis:
- tidyverse 2.0.0 (includes dplyr 1.1.4, tidyr 1.3.1, ggplot2 4.0.1, readr 2.1.6, and others)
- terra 1.8-93 (raster and vector spatial data)
- sf 1.0-23 (simple features for vector data)
- V.PhyloMaker2 0.1.0 (phylogenetic tree construction)
- ape 5.8-1 (phylogenetic analyses)
- phangorn 2.12.1 (phylogenetic reconstruction)
- phytools 2.5-2 (phylogenetic comparative methods)
Optional packages for figure creation:
- patchwork 1.3.2 (multi-panel figure layouts)
- ozmaps 0.4.5 (Australian basemap for figures)
Installation
Install required packages using:
# Install from CRAN
install.packages(c("tidyverse", "patchwork", "terra", "sf",
"ozmaps", "ape", "phangorn", "phytools"))
# V.PhyloMaker2 requires installation from GitHub
install.packages("remotes")
remotes::install_github("jinyizju/V.PhyloMaker2")
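Note: On Windows and macOS, the CRAN binary versions of terra and sf bundle the system dependencies (GDAL, PROJ). On Linux, these libraries usually need to be installed with the system package manager before installing the R packages.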
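As an optional check before running the analysis, the sketch below reports which of the listed packages are not yet installed. The helper name `missing_packages()` is an assumption for illustration; it relies only on base R's `requireNamespace()`.

```r
# Packages listed under Software Requirements
required <- c("tidyverse", "terra", "sf", "V.PhyloMaker2", "ape",
              "phangorn", "phytools", "patchwork", "ozmaps")

# Hypothetical helper: return the packages that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

still_needed <- missing_packages(required)
if (length(still_needed) > 0) {
  message("Not yet installed: ", paste(still_needed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```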
Replication Instructions
Setup
- Download the entire repository maintaining the folder structure
- Open STR_repro_phen_data.Rproj in RStudio (ensures correct file paths)
- Install required packages (see Software Requirements)
Running the Analysis
Open and run scripts/cathcart_van_weeren_et_al_2026_analysis.R
The script:
- Automatically sources custom functions
- Processes phenology data and adds temporal/spatial variables
- Generates all figures and tables from Cathcart-van Weeren et al. (2026), Data integration advances reproductive phenology research across temporal, spatial, and taxonomic scales
- Saves outputs to outputs/ folder
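As one example of a temporal variable, the `time_period` classification from the variable table could be derived from `year` roughly as below. This is a sketch using base R's `cut()`; the analysis script may construct the variable differently.

```r
# Example years at the boundaries of each period
years <- c(1770, 1970, 1971, 2000, 2001, 2025)

# cut() uses right-closed intervals by default, so the breaks give
# (1769, 1970], (1970, 2000], and (2000, 2025]
time_period <- cut(
  years,
  breaks = c(1769, 1970, 2000, 2025),
  labels = c("1770-1970", "1971-2000", "2001-2025")
)
print(data.frame(year = years, time_period = time_period))
```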
Figures (PDF format, saved to outputs/ subfolders), see below for descriptions:
- Figure 3 — Temporal distribution of observations across three time periods (1770–1970, 1971–2000, 2001–2025). For each period: a donut chart showing the proportional contribution of each data source type to total observations, and a bar histogram showing the monthly distribution of flowering and fruiting observations. Month is shown as a numeric value (1–12) corresponding to January–December.
- Figure 4 — Spatial distribution of flowering and fruiting observations across the study region (150–155 °E, 21–35 °S) for each of three time periods. Points are coloured by data source type. Observations are restricted to records within 10 km of mapped subtropical rainforest extent based on the National Vegetation Information System (NVIS).
- Figure 5 — Taxonomic and phylogenetic coverage of the dataset. Intersection plots show species overlap between data source types per time period. Fan phylogeny shows family-level coverage, with tip labels coloured by observation count category (1–100, 101–500, 501–1000, > 1000 observations; grey = families not represented in synthesis). This figure uses the STR_tree_species_list.csv (a list of all tree species found in the Subtropical Rainforest of Australia) to determine the species not represented in the phenology dataset.
- Figure 6 — Distribution of species by total observation count category (1–50, 50–150, 150–500, 500–1000, > 1000). Bars are subdivided by whether species have flowering only, fruiting only, both flowering and fruiting, or no reproductive observations recorded.
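The observation count categories used in Figure 6 can be sketched with `cut()`. Because the listed ranges share boundary values (for example, 50 appears in two categories), this sketch assumes that a boundary value falls in the lower category; that choice is an assumption, not something stated in the source.

```r
# Invented per-species observation counts
n_obs <- c(1, 50, 51, 1000, 1001)

# Right-closed bins: (0, 50], (50, 150], (150, 500], (500, 1000], (1000, Inf)
count_category <- cut(
  n_obs,
  breaks = c(0, 50, 150, 500, 1000, Inf),
  labels = c("1-50", "50-150", "150-500", "500-1000", "> 1000")
)
print(table(count_category))
```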
Table (CSV format, saved to outputs/):
- Table 1: Summary statistics by data source (observations, species, temporal range). Generated by summarising the full dataset (see Description of the synthesis data) by data_type, reporting total observations, species richness, number of sources, temporal range, and counts of flowering and fruiting records per source type.
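A summary of this kind can be sketched in base R as follows. The rows are invented, the output column names are illustrative, and the analysis script (which loads the tidyverse) may well compute the table with dplyr instead.

```r
# Invented example rows; column names follow the variable table
obs <- data.frame(
  data_type = c("Herbarium_specimens", "Herbarium_specimens",
                "iNaturalist_record", "iNaturalist_record", "iNaturalist_record"),
  source_anon = c("GBIF", "GBIF", "iNaturalist", "iNaturalist", "iNaturalist"),
  species_standard = c("Ficus watkinsiana", "Argyrodendron trifoliolatum",
                       "Ficus watkinsiana", "Ficus watkinsiana", "Ficus watkinsiana"),
  year   = c(1895, 1960, 2019, 2021, 2022),
  flower = c(1, NA, 1, 0, 1),
  fruit  = c(NA, 1, 1, 1, NA)
)

# Summarise one data_type group (hypothetical helper)
summarise_type <- function(d) {
  data.frame(
    data_type      = d$data_type[1],
    n_observations = nrow(d),
    n_species      = length(unique(d$species_standard)),
    n_sources      = length(unique(d$source_anon)),
    first_year     = min(d$year),
    last_year      = max(d$year),
    n_flowering    = sum(d$flower == 1, na.rm = TRUE),
    n_fruiting     = sum(d$fruit == 1, na.rm = TRUE)
  )
}

# Split by data_type, summarise each group, and stack the results
table1 <- do.call(rbind, lapply(split(obs, obs$data_type), summarise_type))
rownames(table1) <- NULL
print(table1)
```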
Custom Functions
The file scripts/cathcart_van_weeren_et_al_2026_custom_functions.R contains three custom functions used for figure generation:
- theme_ella(): Custom ggplot2 theme for consistent figure formatting
- create_period_map(): Generates paired flowering and fruiting spatial distribution maps for a given time period
- create_intersection_DOT_plot(): Creates dot matrix visualisation showing species overlap between data sources
These functions are automatically sourced at the beginning of the main analysis script.
Related Publications
Cathcart-van Weeren E, Dwyer JM, Hawkins B, Holmes J, Holmes G, Leiper G, McDonald W, Mo M, Nicholson H, Price R, Shaw K, Shaw S, Waterhouse D, Weber L, Luskin MS (2026) Data integration advances reproductive phenology research across temporal, spatial, and taxonomic scales. Ecography.
Funding
No funding was used in this project.
Contact Information
For questions about this dataset, please contact Ella Cathcart-van Weeren, University of Queensland, Brisbane, Queensland, Australia.
Acknowledgments
We acknowledge the traditional custodians of the lands on which this research was conducted and pay our respects to Elders past, present, and emerging.
We thank the many botanists, naturalists, and community scientists who contributed observations over the past 255 years.
README last updated: 20/02/2026
Thanks for checking out my project! :) E
