Climate change alters the future of natural floristic regions of deep evolutionary origins

Daru, Barnabas 1 ; Minev-Benzecry, Samuel1

Published Mar 10, 2025 on Dryad. https://doi.org/10.5061/dryad.xd2547dqc

Data files

Mar 10, 2025 version files 1.24 GB

DRYAD_UPLOAD.zip

1.24 GB
README.md

3.09 KB

Abstract

Biogeographic regions reflect the organization of biotas over long evolutionary timescales but face alterations from recent anthropogenic climate change. Here, we model species distributions for 189,269 vascular plant species of the world under present and future climates and use this data to generate biogeographic regions based on phylogenetic dissimilarity. Our analysis reveals declines in phylogenetic beta diversity for years 2040 to 2100, leading to a future homogenization of biogeographic regions. While some biogeographic boundaries will persist, climate change will alter boundaries separating biogeographic realms. Such boundary alterations will be determined by altitude variation, heterogeneity of temperature seasonality, and past climate velocity. Our findings suggest that human activities may now surpass the geological forces that shaped floristic regions over millions of years, calling for the mitigation of climate impacts to meet international biodiversity targets.

General Information

This repository contains datasets and R scripts used in the analysis of Climate change alters the future of natural floristic regions of deep evolutionary origins. The dataset includes geographic distance calculations, optimal clustering, spatial correlations, and other supporting files.

Data Structure

The dataset is organized into the following directories:

1. CODES/

This directory contains all the R scripts used in the analyses.

DATA_ANALYSIS/
- Contains R scripts for computing various biogeographic indices (Sørensen and Simpson).
- Includes scripts for mapping, clustering, boundary determination, spatial correlations, and statistical models (HGLM).
- Example scripts:
  - 2a_mapping_phyloregions_present.R: Maps present-day phyloregions.
  - 12a_optimal_clusters_present.R: Determines optimal clusters for present conditions.
  - 14a_spatial_correlations_phyloregions_SORENSEN.R: Computes spatial correlations for Sørensen index.
  - 18a_hglm_phyloregions_FUTURE.R: Hierarchical generalized linear modeling for future projections.
FIGURES/
- Contains scripts to generate figures used in publications.
- Example figures:
  - Figure1_NMDS.R: Generates NMDS visualization.
  - SI_FigureS2_optimal_clusters.R: Plots optimal clustering results.

2. DATA/

This directory contains raw and processed data files.

phylogeny/
- plant_phylogeny_20trees.tre: Phylogenetic tree file for 20 plant species.
CSVs/
- Contains presence-absence data at different spatial resolutions.
- Future projections are categorized by climate models (e.g., MIROC6) and time periods (e.g., 2021-2040, 2041-2060, etc.).
- Example files:
  - PRESAB/PRESENT/100km/presab.csv: Presence-absence matrix for present conditions.
  - PRESAB/FUTURE/MIROC6/2081-2100/585.csv: Future projections under MIROC6 climate model, SSP 5-8.5 scenario.

File Formats

.R: R scripts for analysis.
.csv: Comma-separated values for tabular data.
.tre: Newick format phylogenetic tree file.

How to Use the Data

Clone or download the dataset.
Ensure you have R and required packages installed.
Run analysis scripts in the CODES/DATA_ANALYSIS/ folder.
Use data files from DATA/CSVs/ as input where necessary.
Modify paths in scripts if needed to match local directory structures.

License and Citation

This dataset is made available under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. If using this dataset, please cite the associated publication:

Minev-Benzecry, S. & Daru, B.H. (2024) Climate change alters the future of natural floristic regions of deep evolutionary origins. Nature Communications 15: 9474. DOI: 10.1038/s41467-024-53860-8

Contact

For any questions regarding the dataset, please contact:

Name: Barnabas Daru
Email: bdaru@stanford.edu
Institution: Stanford University

Distribution data and species distribution modeling

We explored shifts in plant biogeographic regions under climate change using species distribution modeling under alternative climate change scenarios. Our models were constructed based on the standard protocol for reporting species distribution models using the ODMAP (Overview, Data, Model, Assessment and Prediction) protocol⁵⁴ (Supplementary Note 1), along with open-source data and codes for scientific reproducibility (see Data Availability). We used maximum entropy (MaxEnt v.3.4.3)⁵⁵ to model plant species distributions. MaxEnt is not computationally expensive⁵⁶ and has been shown to outperform other algorithms in modeling species distributions for computational efficiency especially when dealing with a huge number of species spanning hundreds of thousands of species as in this study and is robust for modelling distributions for species with relatively few occurrence records⁵⁷. Predictor variables for the modeling were downloaded from WorldClim v.2.1 (ref.⁵⁸) at a spatial grain resolution of 5-arcmin (equivalent to ~9 km at the equator) for present-day conditions (1970-2000) and four future climate scenarios (T1: 2021-2040, T2: 2041-2060, T3: 2061-2080, and T4: 2081-2100) based on MIROC6 and four Shared Socioeconomic Pathways (SSP 126, 245, 370 and 585). These pathways represent varying levels of climate mitigation, ranging from strong mitigation (SSP126) to moderate (SSP245 and SSP370) and high emissions (SSP585) scenarios^32,33. We considered 20 predictor variables (Supplementary Table 4) which are hypothesized to be important for plant distributions and diversity in previous studies^59,60. From these predictor variables, we removed areas corresponding to inland waters, i.e., lakes (using vector polygons from https://naturalearthdata.com). Variance Inflation Factor (VIF) was calculated among predictor pairs to remove highly autocorrelated predictors using the R package usdm version 2.1-6 (ref.⁶¹).

Plant occurrence records used for the modeling were compiled from the Global Biodiversity Information Facility (GBIF, https://doi.org/10.15468/dl.jqqjba, accessed 2 June 2024) using the query term “Tracheophyta”. This yielded 454 million records from 14,405 published datasets. However, we previously showed that raw point occurrences of plants suffer from inherent coverage gaps and sampling biases⁶² that can hinder their ability to accurately represent global biodiversity patterns^62,63. To this end, we used a multi-step workflow to address these limitations as follows: (i) Source data: The raw occurrence data used to produce the species’ range polygons were obtained from GBIF (https://doi.org/10.15468/dl.jqqjba, accessed 2 June 2024). (ii) Data cleaning: These records were thoroughly cleaned by matching species names from the GBIF occurrences to those in the World Checklist of Vascular Plants (WCVP) and keeping only verified names from WCVP⁶⁴. At the same time, the point records were refined to capture native distributions by intersecting them with WCVP's native range maps of vascular plants within country borders⁶⁴ and retaining points that overlap WCVP’s range maps. (iii) Polygon maps: After data cleaning, we converted the point records into polygon maps by modeling with alpha hulls using the R package rangeBuilder v.2.1 (ref.⁶⁵). We cropped each species’ polygon map to land areas using a basemap from naturalearth (https://naturalearthdata.com). Finally, we systematically sampled these polygon maps to generate 500 points per species for input into the species distribution model (SDM) as in previous studies^66–69 rather than using the raw and biased point occurrences. (iv) Species-specific dispersal rate: We incorporated a partial-dispersal model to prevent erroneous predictions in suitable but unoccupied areas^70,71. Specifically, we calculated species-specific dispersal rates using a spherical Brownian motion model (SBM)^72,73 implemented with the R package castor v.1.7.10 (ref.⁷⁴). Unlike the widely used Brownian Motion models of continuous trait evolution that encode geographic locations in orthogonal space^75,76, the SBM model quantifies the dispersal of a clade over time as a diffusion-like process based on a single diffusion coefficient D, while accounting for Earth's spherical geometry^73,74. The SBM model was fitted using the function fit_sbm_const in the R package castor v.1.7.10 (ref.⁷⁴). (v) Calibration area: The resulting dispersal rate, defined as the expected dispersal distance traversed by a species in a year (expressed in km/year), was used to define calibration areas (i.e., training areas) for modelling species distributions for each species across different timeframes. This was achieved by buffering the dispersal rates around the alpha hull polygons of each species and intersecting the buffered zones with maps of the terrestrial ecoregions of the world³ overlapped by the species to predict habitat suitability of each species. This latter step was intended as an additional fine-tuning process to allow us capture the natural habitats of each species based on their overlap with the ecoregions. (vi) Background points: We generated background points as a function of global plant sampling intensity to account for the biased sampling in the input occurrence records using spatial kernel density estimation and probabilistic sampling of 10,000 background points for each species within their calibration areas. (vii) Species distribution modeling: Species distribution modeling was conducted to estimate species distributions based on environmental conditions that correlate with known occurrences, and calibrated to species’ realized niche based on the calibration area defined using the species-specific dispersal rates. From the occurrence data as input, we used a 75% random sample for model development, while retaining the remaining 25% sample for model evaluation. For each species, we built models using a combination of hyperparameters in terms of the feature classes and regularization multiplier settings in MaxEnt v.3.4.3 (ref.⁵⁴) as follows: linear, threshold, and hinge responses, and tested a set of regularization multiplier values (2, 5, 10, 15, 20) under a 5-folds cross-validation framework⁷⁷. Our models were predicted over each species’ occurrences as a function of present-day bioclimatic variables and using these combinations of settings on a continuous scale between 0 and 1 using the sdm function in phyloregion v.1.0.9 (ref.⁷⁸). We generated five sets of models for each species and took the median to account for uncertainties across different model runs. While ensemble methods that integrate and average predictions from multiple models are valuable, this can be computationally expensive and impractical for large datasets as in our study. With hundreds of thousands of species spanning five time horizons and four climate scenarios, creating ensembles for each would be computationally expensive and time-consuming. Moreover, if the individual models within the ensemble are highly similar, the ensemble may not provide much additional benefit compared to a single model. Therefore, we decided to run each model five times and take the median as a more efficient approach. For future climate scenarios, we modelled plant distributions as a function of present climate variables, and then used these models to predict plant distributions at new values of climate under different future scenarios for T1–T4 and SSP126, SSP245, SSP370, and SSP585. The model prediction consisted of a range map stored in raster format at a 5-arc minute grid cell resolution. The suitability of the models for each species was converted to binary presences by using the 95% quantile of the suitability values extracted from the underlying occurrences records as presence threshold. The final dataset contains range maps for 189,269 species, for the present, and four future time horizons each with four SSPs resulting in a total of 3,217,573 range maps for the analysis of biogeographical regionalization under climate change.

Biogeographical regionalization

For each climate scenario, we overlaid each species modelled distributions onto equal area grid cells of 100 km ´ 100 km (Mollweide global projection system) to convert the predicted distribution data to a community matrix of 189,269 species ´ 14,810 grid cells using the polys2comm function in the R package phyloregion v.1.0.9 (ref.⁷⁸). This resulted in a total of 17 different community matrices of 189,269 species ´ 14,810 grid cells (one for the present condition and four for the four different future time horizons ´ four SSPs). We then matched the community data to the most updated phylogenetic tree of the world's vascular plants⁷⁹. The phylogenetic tree was generated using the R package V.PhyloMaker2 v.0.1.0 (ref.⁸⁰) with function phylo.maker based on the expanded megaphylogeny (GBOTB.extended.TPL) of ref.⁷⁹ as a backbone and the function build.nodes.1 in the R package V.PhyloMaker2 v.0.1.0 (ref.⁸⁰). Missing taxa from the megaphylogeny were added using V.PhyloMaker2 under scenario "S2” that allows generation of random trees. We generated one tree and used it to analyze compositional turnover (phylogenetic beta diversity) based on Simpson beta diversity index (βsim)⁸¹. We used the βsim here to measure beta diversity as in previous studies of biogeographical regionalization^{4,7,8,12,36,37} because it is insensitive to differences in species richness among sites⁸², and therefore provides unbiased estimation of species composition among sites⁴. Simpson beta diversity is expressed as: (1)

where a is the number of species shared between two sites, and b and c are the numbers of species unique to each site. Values of βsim vary from 0 (high similarity in species composition between sites) to 1 (no shared taxa). We additionally tested the robustness of our results by comparing our findings from βsim (Simpson’s index) with those obtained from the Sorensen index, which is known to be more sensitive to variations in species richness⁸². We found generally similar results (Supplementary Fig. 6 to 8), suggesting the reliability of our conclusions based on βsim. Matrices of beta diversity were calculated using the function phylobeta in the R package phyloregion v.1.0.9 (ref.⁷⁸).

Next, we contrasted the performance of eight different hierarchical clustering algorithms (UPGMA, Single, Complete, ward.D, ward.D2, WPGMA, WPGMC, and UPGMC) on each of the 17 βsim matrices for degree of data distortion using ref.⁸³’s cophenetic correlation coefficient using the select_linkage function in phyloregion v.1.0.9 (ref.⁷⁸). For all the distance matrices of phylogenetic beta diversity, the unweighted pair group method with arithmetic mean (UPGMA) was identified as the best clustering algorithm (Supplementary Fig. 1) and was used to cluster the distance matrices in downstream analyses.

To determine the optimal number of clusters that best describes the observed βsim matrices, we adapted ref.¹²’s approach and selected two different thresholds to define floristic regions corresponding to the minimum number of regions that explained 90% of between-cluster βsim (sum of between-cluster βsim/total βsim) and 85% of between-cluster βsim. We refer to the clusters explaining 90% of phylogenetic dissimilarity as ‘regions’, while clusters explaining 85% of dissimilarity correspond to the ‘realms’. The outputs were stored as vector polygon for mapping and visualizations.

Climate change and plant biogeographic boundaries

We assessed the potential effects of climate change on biogeographic boundaries as the boundaries between floristic regions for which at least one adjacent cell belongs to a different floristic region³⁵. This delineation involved four steps. First, for each time horizon and climate scenario, we split the vector polygons based on floristic regions. Second the vector polygons of each floristic region were rasterized and assigned a value of 1 using the rasterize function in the R package terra v.1.7-55 (ref.⁸⁴). Third, buffers of 200 km were generated around the cells using the function buffer in the R package terra v.1.7-55 (ref.⁸⁴). Finally, the rasterized polygons were summed. Raster cells with a cumulative value greater than one correspond to plant biogeographic boundaries.

Spatial congruence across regionalizations

We measured the degree of spatial association between present versus future regionalizations using a quantitative measure known as v-measure⁸⁵. The v-measure evaluates spatial congruence using two criteria: homogeneity and completeness. A spatial association satisfies homogeneity criteria if all of regions contain only cells which have a single label. An association satisfies the completeness criteria if all cells having the same label belong to a single region. To this end, we projected each map in the Equal Earth projection system (+proj=eqearth code) and assessed congruence between the maps using the V-measure statistic in the R package sabre v.0.4.3 (ref.⁸⁵). Spatial association was computed using the function vmeasure_calc and setting the option B > 1 so that completeness is weighted more than homogeneity. V-measure scores range from 0 (incongruence) to 1 (indicating perfect similarity).

We additionally evaluated the similarities or differences of present biogeographic boundaries versus future boundaries. This was achieved by comparing the position of terrestrial boundaries across the different regionalizations for both regions and realms. From the spatial raster boundaries described above, we computed the geographic distance of each cell to the nearest boundary using the function distance in terra v.1.7-55 (ref.⁸⁴). We then conducted a spatially corrected correlation between the position of the present biogeographic boundaries versus future boundaries. The correlations were conducted using a corrected Pearson’s correlation for spatial autocorrelation using the function modified.ttest in the R package SpatialPack v.0.4 (ref.⁸⁶).

Determinants of biogeographical boundaries

We tested the potential of various biogeographical drivers such as orographic barriers, tectonic movements and variations in climate seasonality (temperature and precipitation) in explaining the position of present and future plant biogeographic boundaries. Along these lines, we considered biogeographical drivers hypothesized to affect biogeographical boundaries in previous studies^35–37 and grouped them into four categories: climate heterogeneity, tectonic movements, orographic barriers, and instability of past climate. 1) To assess climate heterogeneity, we obtained four bioclimatic variables known to explain biogeographic patterns in previous global studies⁸⁷ including: mean annual temperature, temperature seasonality, mean annual precipitation and precipitation seasonality, from WorldClim v.2.1 (ref.⁵⁸). We acknowledge that our original input maps are modeled estimations based on bioclimatic variables, and using variables solely from WorldClim could introduce some circularity. To address this, we calculated climate heterogeneity for each grid cell as the coefficient of variation between the focal cell and its eight neighbors. This approach assumes that biogeographic boundaries often occur in areas with sharp climate variations. Grid cells with higher heterogeneity values indicate they are different from neighboring ones. 2) Tectonic movement was determined by reconstructing the geographic locations of present-day grid cell centroids in timesteps of 1 Ma back to their historical positions 65 Ma. For each grid cell, we calculated the historical distance between a grid cell and eight neighboring cells at each timestep, and then computed the standard deviation of these distances across the last 65 million years, representing the variability of geographical distances between grid cells across time. Our historical distance reconstruction was based on SETON2012’s tectonic plate model⁸⁸ which reconstructs global plate motion for coastlines and topological plate polygons since the break-up of the supercontinent Pangea 200 Ma. This was implemented using the function reconstruct in the R package rgplates v.0.4.0 (ref.⁸⁹), an interface for the GPlates plate reconstruction software⁸⁸. 3) To estimate orographic barriers, we obtained elevation data from WorldClim v.2.1 (ref.⁵⁸) and calculated the mean absolute difference in elevation between a focal grid cell and 8 neighboring cells. 4) We also included past climate velocity during the Quaternary (reconstructed by ref.⁹⁰) at 5-arcmin resolution by extracting values across grid cells. We standardized all predictor variables to have a mean of 0 and variance of 1. All predictor variables showed variance inflation factors less than 2 indicating minimal collinearity.

Next, we used a simultaneous autoregressive spatial (SAR) model with binomial error distribution to assess relationships of the position of plant biogeographic boundaries and the predictors while accounting for spatial autocorrelation. This was achieved by building our models with the response variable being a binomially distributed binary variable determining whether a given cell is in contact with a biogeographic boundary or not. Along these lines, for each time horizon and climate scenario, we split the vector polygons based on floristic regions, rasterized the positions of boundaries between floristic regions, and generated buffers of 200 km around the boundary cells. The cells directly touching boundary lines were coded as “YES” and the remaining buffered cells not touching the boundary lines were coded “NO”. To ensure the best performance of our spatial regression, we incorporated spatial autocorrelation in the error term using neighborhood matrices. The neighborhood matrices were defined based on the diameter of a circle extending from one grid cell centroid to another cell centroid, corresponding to 283 km, being the shortest cell that kept all cells connected in the study area (functions dnearneigh and nb2listw in the R package spdep v.1.3-1, ref.⁹¹). We then used the function hglm in the R package hglm v.2.2-1 (ref.⁹²) to fit hierarchical generalized linear mixed models (HGLMs)⁹³ with spatially autocorrelated random effects with 200 bootstrap replicates. For each variable, the effect size of the HGLM coefficients was converted to Fisher’s z, which measures effect size independent of differences in sample sizes⁹⁴. Values of Fisher's z were calculated using the t-values from the output of HGLM models using the tes function in the R package compute.es v.0.2-5 (ref.⁹⁵). See ‘Data availability’ to access the data and analysis codes⁹⁶.