Data from: Predicting undetected native vascular plant diversity at a global scale

Daru, Barnabas 1

Research facility: Stanford University

Published Sep 23, 2024 on Dryad. https://doi.org/10.5061/dryad.5x69p8d9w

Data files

Sep 23, 2024 version files 15.18 GB

DRYAD_DATA.zip

15.18 GB
README.md

9.13 KB

Abstract

Vascular plants are diverse and a major component of terrestrial ecosystems, yet their geographic distributions remain incomplete. Here, I present a global database of vascular plant distributions by integrating species distribution models calibrated to species’ dispersal ability and natural habitats to predict native range maps for 201,681 vascular plant species into unsurveyed areas. Using these maps, I uncover unique patterns of native vascular plant diversity, endemism, and phylogenetic diversity revealing hotspots in under-documented biodiversity-rich regions. These hotspots, based on detailed species-level maps, show a pronounced latitudinal gradient, strongly supporting the theory of increasing diversity towards the equator. I trained random forest models to extrapolate diversity patterns under unbiased global sampling and identify overlaps with modelled estimations but unveiled cryptic hotspots that were not captured by modelled estimations. Only 29-36% of extrapolated plant hotspots are inside protected areas, leaving more than 60% outside and vulnerable. However, the unprotected hotspots harbor species with unique attributes that make them good candidates for conservation prioritization.

Experimental design

The experimental design is divided into six steps (Fig. S1) that include: (i) Source data, the raw occurrence data used to produce the global species’ range polygons; (ii) Data cleaning, which describes how errors and inconsistencies were minimized in the input data; (iii) Geographic modeling of polygons, describing how alpha hull polygons were generated from point occurrences; (iv) Species dispersal capacity and calibration area, describes how I incorporated species-specific dispersal ability in defining a calibration area for modeling the species distributions; (v) Background points, describes how background points were selected for the species distribution modeling; and (v) Species distribution modeling, describes the estimation of species distributions based on environmental conditions that correlate with known occurrences, and calibrated to species’ realized niche.

Step1: Source data. The primary source of occurrence data used in this study was downloaded from Global Biodiversity Information Facility (GBIF; https://doi.org/10.15468/dl.vgvc3z, accessed 15 May 2023) by querying "TaxonKey = Tracheophyta" in a full data export. The GBIF represents the largest effort to mobilize vascular plant diversity data at a global extent. The resulting occurrences consisted of 402 million records from 11,517 published datasets.

Step 2: Data cleaning. From the occurrences, I used a two-step cleaning process to reconcile taxonomic names. First, I used the World Checklist of Vascular Plants (WCVP) (1) because it has the most comprehensive source for species names along with their native ranges within broad political divisions. I then matched names from the GBIF occurrence records to those in WCVP and retained only species names that exist in WCVP (1). This approach ensures data consistency and minimizes the tendency to include inaccurate or outdated names. Second, I manually harmonized four frequently interchanged family names to be consistent with the Angiosperm Phylogeny Group classification system. Specifically, Xanthorrhoeaceae was changed to Asphodelaceae, Cephalotaxaceae to Taxaceae, Leguminosae to Fabaceae and Compositae to Asteraceae. Subsequently, I analyzed the validity of the point data by running the occurrence records for each species through the function clean_coordinates in the R package CoordinateCleaner v.3.0.1 (2) and removing potentially erroneous records such as duplicates, records with doubtful or imprecise localities (e.g., points outside ±180° or in the sea), and on country or province centroids. The clean_coordinates function also has a built-in global database of over 10,055 geo-referenced biodiversity institutions, including herbaria, botanical gardens, zoos, museums, GBIF contributors, and university collections. By using a 100-meter radius filter around these institutions, the function allowed me to remove records that likely originate in proximity to these institutions indicating records from cultivation. Next, I refined the resulting point records to accurately reflect species distributions within their known native ranges. This was achieved by intersecting point records with WCVP’s database of native distribution maps of vascular plants within country borders, i.e., level 3 of the biogeographic regions recognized by the Biodiversity Information Standards (also known as the Taxonomic Databases Working Group (1, 3). Points that intersected with the boundaries of WCVP were retained for further analysis. Subsequently, for species with ≥5 unique localities, I used spatial thinning to reduce the effects of uneven, or biased sampling. This involved subsampling records present in each cell using the R package dismo v.1.3-14 (4) with function gridSample. The resulting records from spatial thinning represent the final cleaned point occurrence records utilized for subsequent analyses.

Step 3: Geographic modeling of alpha hull polygons. Following data cleaning, I incorporated the thinned point records into polygon maps by modeling with alpha hulls using the R package rangeBuilder v.2.1 (5) with options initialAlpha = 2. For each species, the alpha hyperparameter value was incrementally increased until a range was formed that encompassed at least 99% of the occurrence records. All other parameters were left at their default settings. From these alpha hull polygons, I cropped them to land areas by removing areas falling in the sea and inland water bodies (using vector polygons from https://naturalearthdata.com). I additionally finetuned the alpha hull polygons by clipping them using the polygons of vascular plant families (6, 7) under the assumption that a species does not occur outside its family’s range.

Step 4: Species dispersal capacity and calibration area. Predicting a species’ habitat suitability and ability to occupy a space requires knowledge of inherent dispersal capacity (8). I implemented a partial-dispersal model which incorporates demographic or life history factors into the species distribution modeling thereby preventing erroneous predictions in areas that might contain suitable habitat but not occupied by the species (8, 9). I started by fitting a phylogeographic Spherical Brownian Motion (SBM) model (10, 11), which quantifies the dispersal of a clade over time as a diffusion-like process based on a single diffusion coefficient D, while accounting for Earth's spherical geometry (9, 12) using the function fit_sbm_const in the R package castor v.1.7.10 (11, 13). This yields a diffusivity score D, measured in km² Myr^–1 and it is based on maximum-likelihood and independent contrasts between sister lineages (11). The fit_sbm_con function calculates diffusivity using a rooted phylogenetic tree and geographic coordinates (latitudes and longitudes) for the tips of the tree. I then computed species dispersal rates as the expected dispersal distance traveled by a species in a year (in km yr^–1) using the castor function expected_SBM_distance. The phylogenetic tree for the SBM modeling was generated using the R package V.PhyloMaker2 v.0.1.0 (14) with function phylo.maker based on the expanded megaphylogeny (GBOTB.extended.TPL) of Smith and Brown (15) as a backbone and the build.nodes.1 function (14). Missing taxa from the megaphylogeny were added using V.PhyloMaker2 under scenario "S2” that allows generation of random trees. I generated 2 random trees and used it for downstream analysis. The tip coordinates were estimated by taking the centroid of each species’ alpha hull polygon. Finally, I used the species dispersal rates to define a calibration area by buffering the dispersal rates around the alpha hull polygons of each species before intersecting the buffered zones with the terrestrial ecoregions occupied by the species (16, 17) to predict habitat suitability of each species. The SBM-derived dispersal rates (km/yr) allow me to account for dispersal ability and restrict predictions to the realized niche (suitable and accessible areas) rather than the fundamental niche (all potentially suitable areas).

Step 5: Background points. Due to the inherent sampling biases in vascular plant occurrence records (18, 19) and the lack of absence data, I generated background localities randomly for modeling distributions for each species but accounting for sampling biases in vascular plant occurrence data. This was achieved using a spatial kernel density estimate (20) through the R package spatialEco v.2.0-1 (21) to generate sampling density grid standardized to 0-1 (22). I used the sampling density grid as estimate of vascular plant sampling bias in the occurrence records (Fig. S6). I then used the sampling bias grid to probabilistically sample 10,000 background points for each species within their respective calibration areas, representing the natural areas where vascular plants can occur.

Step 6: Species distribution modeling. Previous empirical analyses (18, 19) have revealed that multidimensional gaps and biases in vascular plant occurrence records can influence accurate estimation of vascular plant species distributions. For these reasons, I modeled species distributions using alpha hull polygons rather than the raw biased point records, following previous studies (23–27). Here, I used alpha hull polygons in similar context to how IUCN expert-range polygons are generally used in macroecology and biogeography. IUCN expert-range polygons incorporate diverse information sources such as point records, species ecology and distribution, local inventories, atlases, and relevant literature (26, 27). However, unlike IUCN expert-range polygons, alpha hulls are purely data-driven. Building on the success of IUCN expert-range polygons in generating accurate SDMs across various taxonomic groups (24–27) and their demonstrated agreement with point data (23), I used alpha hulls for modelling vascular plant distributions in the absence of expert-range polygons for vascular plants. Following these approaches, I sampled these alpha hull polygons to generate 500 points per species for input into the species distribution model (SDM) as in previous studies (24–27) rather than using the raw and biased point occurrences. To achieve this, I used a regular sampling procedure that ensured samples were systematically aligned by transferring values associated with the geometries of the alpha hull polygons to a raster at a resolution matching the environmental data and then thinning to 500 points per species, as in previous studies (24). This ensures a reproducible set of points for modeling. I acknowledge that actual species distributions in nature are not necessarily evenly distributed within its range and could potentially lead to commission errors, previous studies suggest minimal impact for modelling species distributions at macroecological scales as in this study. Furthermore, as IUCN range polygons generally represent expected species distributions (28), and my aim is to predict plant distribution at a scale comparable to IUCN polygons, this approach is appropriate for the data and goal of predicting expected native vascular plant diversity.

Using the occurrence data as input, I employed a 75% random sample for model development, while keeping the remaining 25% sample for model evaluation. To predict the distribution of species into unsampled areas, I used species distribution models to predict the relationship between the observed occurrences and environmental factors while restricted to the calibration area that was defined by species’ dispersal rates and intersected with terrestrial ecoregions. Environmental factors were obtained from the WorldClim database v.2.1 (29) which represent historical (near current) climate layers for a total of 19 variables at a spatial grain resolution of 5 arcmin (~9 km). These variables represent climate data for 1970-2000 and were selected because they correlate with key physical and biological attributes of many plant species. I additionally included elevation from WorldClim v.2.1 (29), as a variable in the model. I then computed the variance inflation factor for the predictor variables to remove highly correlated variables in a stepwise manner using the vifstep function in the R package usdm v.2.1-6 (30) by setting the VIF threshold = 5. From these predictor variables, I excluded pixels corresponding to inland water bodies (i.e., lakes) from Natural Earth (https://www.naturalearthdata.com/) and then masked them to each species' calibration area.

I used maximum entropy (MaxEnt) to model plant species distributions because it generally outperforms other methods based on predictive accuracy and does not generate response curves that may lead to unpredictable behavior when applied to novel climatic conditions (31). MaxEnt is also widely used for species distribution modelling with presence-only data and works well for several taxonomic groups including plants. For the model settings, I considered important decision steps in the input data and parameter settings. This was achieved using the sdm function in the R package phyloregion v.1.0.9 (32) by evaluating the model with a range of different complexities in terms of the feature classes and regularization multiplier settings in MaxEnt as follows: linear, threshold, and hinge responses, and tested a set of regularization multiplier values (2, 5, 10, 15, 20) under a 5-folds cross-validation framework (33).

To convert the continuous predicted probabilities into binary presence-absence maps, I used the equal training sensitivity (true positive rate) and specificity threshold (true negative rate; 34). I also used three evaluation measures for quantifying the predictive accuracy of the models: area under the curve (AUC), True Skill Statistic (TSS), and Boyce Index. The AUC measures how well predictions differentiate between presence and non-presence sites (35, 36), TSS measures the ability of the model to discriminate between positive and negative cases while accounting for both correct and incorrect classifications (37), and Boyce Index assesses how well predictions serve as an index of the probability of presence (38, 39). AUC scores vary between 0 (random model) to 1 (best discriminating model), whereas both TSS scores and Boyce Index vary between −1 and +1, with positive values indicating model predictions better than random and negative values corresponding to poor prediction (40).

For each species, I generated five sets of models and reported the median to account for uncertainties across different model runs. The model prediction consisted of a range map stored in raster format at a 5 arcmin grid cell resolution. The suitability of the models for each species was converted to binary presences by using the 95% quantile of the suitability values extracted from the underlying occurrence records as presence threshold. I did not exclude models with lower test scores because while such values may mean poor prediction (40), I found that the performance of the models was generally high across the evaluation metrics with a median AUC of 0.91 (range 0.36 to 0.999), Boyce index = 0.87 (range -0.55 to 0.99) and TSS = 0.42 (range -0.16 to 0.88) (Fig. S2). These high evaluation scores indicate that my models’ predictions are a reliable indicator of where the plant species are likely to be found. I provided an ODMAP standard report of my protocol (41) in the supplementary information (Dataset S1). I provided the model outputs in three formats: 1) raster files of raw suitability scores to allow users to explore alternative approaches to thresholding the predictions, 2) raster files of binary presences using a 95% quantile threshold for downstream analyses, and 3) range polygons derived by dissolving the binary presence rasters using the R package terra v.1.7-3 (42) followed by spline interpolation of vertices in the R package smoothr v.1.0.0 (43) to smooth the jagged edges and sharp corners of the polygon maps. The smoothing of the polygon boundaries was done to achieve a more visually appealing representation for these standardized range maps consistent with IUCN’s mapping standard (44). However, it is important to emphasize that this smoothing does not alter the underlying binary presence data, which maintains all the details from the model, in downstream analyses (e.g., see Fig. S8). All calculations were processed using R v.4.2.0 "Vigorous Calisthenics" (45) on the Sherlock computing clusters of Stanford University.

Data repository. The database contains range maps for 201,681 species and covers the basal Eudicots (4,770 species), Monocots (42,370 species), Superasteridae (77,443 species), Superrosidae (63,372 species), Magnoliidae (7,243 species), Chloranthales (204 species), Gymnosperms (854 species), and Ferns (5,425 species). The geographic coverage of the data is global spanning all continents except for Antarctica and focused on only the terrestrial part of a species’ native range, and representing contemporary times. Each species’ spatial data is projected in the WGS84 coordinate reference system (EPSG:4326 code) and stored as vector polygons in GEOPACKAGE format for maximum compression, allowing easy integration with other spatial datasets and data manipulation using a standard computer. I additionally stored each species’ binary spatial raster as well as the raw suitability scores as raster files to explore alternative approaches of thresholding the predictions. The dataset can be downloaded from Dryad repository (https://doi.org/10.5061/dryad.5x69p8d9w). Future versions of the data will be made available on GreenMaps (greenmaps.net).

Technical Validation. I explored the validity of the modeled maps using three different distribution datasets: 1) traditional SDM of using raw input occurrence records, 2) extent-of-occurrence range polygons from the IUCN, and 3) range maps for trees of southern Africa.

Traditional SDM of directly using raw input occurrence points. I tested the sensitivity of my results by comparing my polygon approach to a traditional species distribution model of directly using the raw and biased point records. This was achieved using range filling analysis (46) by calculating the proportion of each traditional SDM occupied in the realized ranges of the polygon SDM approach for 97,026 species with sufficient raw point data. I found that while models based on raw point records tend to underestimate species range, their spatial congruence with the range polygons is generally high (median range filling = 60%). In addition, range filling is negatively correlated with geographic range size meaning that large-ranged species based on the traditional SDM approach are underfilling their potential ranges (Fig. S9). These finding indicates that the impact of my method on overall patterns is likely minimal given the high congruence of 60%. The underestimation observed in the raw point record especially for large-ranged species highlights the inherent challenges of using raw point records to infer complete species distributions.

Extent-of-occurrence maps from IUCN. I contrasted my modelled estimations with extent-of-occurrence polygon maps from the International Union for the Conservation of Nature (IUCN) for 9,081 plant species. The IUCN maps were derived as hand-drawn polygons based on expert knowledge of species ecology and distribution, local inventories, atlases, and relevant literature (26, 27). I contrasted the boundaries of my modeled maps with the range maps of 9,081 plant species from the IUCN spatial database. To achieved this, I projected each map in WGS84 coordinate reference system (EPSG:4326 code) and assessed congruence between the maps using the V-measure statistic in the R package sabre v.0.4.3 (47). The V-measure was initially developed for comparing spatial congruence between regionalizations (47) and can be used to compare the boundaries of two categorical maps. Spatial association was computed using the function vmeasure_calc and setting the option B > 1 so that completeness is weighted more than homogeneity as my goal was to generate reliable range maps for macroecology and biogeography and not necessarily expecting a one-to-one match between my modeled maps and the hand-drawn maps from expert knowledge. V-measure scores range from 0 (incongruence) to 1 (indicating perfect similarity). I found relatively high degree of similarity between my modeled maps and range polygons from the IUCN with most of the pairwise comparisons, represented by the 25^th and 75^th percentiles of the V-measure scores, falling within the range of 0.03 and 0.6. This level of congruence lends support to the accuracy and reliability of my modeled maps, affirming their ability to faithfully represent the expected plant distributions as predicted by expert knowledge.

Range maps for trees of southern Africa. I also compared my modeled maps with range polygons for trees of southern Africa. The dataset came from a study that mapped tree diversity hotspots in southern Africa (48) using polygon range maps that represent the geographic area encompassing the known extent of over 1400 tree species in southern Africa. The maps were derived from Coates Palgrave (49) and Van Wyk et al. (50) as hand-drawn polygons based on expert knowledge of the flora of southern Africa. I also contrasted the boundaries of my modeled maps with the range maps of 886 tree species of southern Africa from Coates Palgrave (49) and Van Wyk et al. (50) using the V-measure statistic. I found some degree of similarity between my modelled estimations and that of trees of southern Africa. In fact, most of the pairwise comparisons, represented by the 25^th and 75^th percentiles of the V-measure scores, fell within the range of 0.07 to 0.36, indicating ability of representing expected plant distributions based on expert knowledge.

Quality control and data limitation. A general shortcoming of this study is the assumption of even distribution of occurrences across a species range. In reality, species distributions are not evenly distributed – some regions have more occurrences than others. This assumption of evenness may not fully capture fine-scale community processes. While I anticipate that analyses of greater sampling coverage within a plant range might refine my results and capture finer-scale community processes such as species interactions, currently, the data are not available to explore this further. The hidden hotspots uncovered in this study may eventually guide future plant collecting to realize the heterogenous distributions expected in nature. Users are encouraged to be aware of these limitations and carefully check for possible effects in their own analyses and results. Despite these limitations, the strong alignment of my modelled estimates with three different distribution datasets (traditional SDM of using raw input occurrence records, extent-of-occurrence range polygons from the IUCN, and range maps for trees of southern Africa), suggests potential predictions into unsampled regions to overcome inherent biases in vascular plant sampling. Another limitation is that I generated maps for a substantial number of plant species, 201,681 species, which accounts for ~60% of the total described plant species (1, 51, 52). This means that geographic information for approximately 40% of vascular plant species is still lacking. As this data is continuously updated on DRYAD, my intention is to enhance and revise the maps as new distribution data becomes available. However, these maps still hold significant value for addressing questions at regional to global scales where macroecological and biogeographical processes such as niche conservatism, speciation and dispersal operate. As additional data is acquired, I intend to refine the maps to make them more relevant to local and community-scale research questions.

Data analysis

The data analysis is divided into four steps.

Calculations of diversity patterns. To create consistent spatial representation, I projected each plant species modeled range polygon to a Wagner IV projection and subsequently rasterized using a masked raster grid of 20 km ´ 20 km at Wagner IV projection. To contrast my modelled estimates to those of terrestrial tetrapods (amphibians, birds, reptiles and mammals), I downloaded expert range maps for each tetrapod species from the International Union of Conservation of Nature’s (IUCN) Red List of Threatened Species spatial database (44) for amphibians, reptiles and mammals, and BirdLife International (53) for birds. These range polygons were also rasterized at 20 km resolution and reprojected to the Wagner IV projection. All species rasters were converted to a community matrix using the rast2comm function in the R package phyloregion (32) resulting in a sparse community matrix of 201,681 species ´ 350,488 pixels for plants, 6,738 species ´ 293,582 pixels for amphibians, 9,469 species ´ 349,579 pixels for birds, 9,617 ´ 305,244 pixels for reptiles, and 5,394 species ´ 348,959 pixels for mammals, that are essential for subsequent biodiversity calculations. To account for shared phylogenetic relationships in my diversity calculations, I paired the plant community matrix with a phylogenetic tree generated using the expanded megaphylogeny (GBOTB.extended.TPL) of Smith and Brown (15) as a backbone and the build.nodes.1 function in the R package V.PhyloMaker2 (14). Missing taxa from the megaphylogeny were added using scenario "S2” which allows the generation of random trees. The phylogenetic tree for amphibians was obtained from ref. (54), birds (55), reptiles (56), and mammals (57). Using the community matrix, I calculated species richness as the counts of species per pixel, phylogenetic diversity as the sum of branch lengths connecting species from the tip to the root of a phylogeny (58) using the function PD in the R package phyloregion v.1.0.9 (32), weighted endemism as species richness weighted by range size (32) using weighted_endemism function in phyloregion, and phylogenetic endemism as range-weighted phylogenetic diversity (59) using phylo_endemism function in the R package phyloregion v.1.0.9 (32). I also estimated spatial clustering or dispersion of biodiversity patterns using Moran’s I spatial autocorrelation measure (Monte Carlo test, 999 randomizations), with values of 1 indicating clustered patterns and 0 corresponding to even diversity patterns. To identify areas of exceptional diversity, I mapped hotspots by selecting the top 10% richest pixels for each of the calculated diversity metrics using the hotspots function in the R package phyloregion v.1.0.9 (32).

Validation using machine learning random forest extrapolations. I tested my modelled estimates for the effects of sampling biases common in plant occurrence records (Fig. S6) by training a random forest model to extrapolate my modelled estimates under a globally unbiased plant sampling. I used the R package ranger (60) to build models predicting the key diversity metrics based on 1,000 random trees, using the following variables as predictor variables: topography (29), summed tetrapod diversity, plant sampling bias grid (see step 5), floristic realms of the world, and each of the 19 bioclimatic variables (29). I utilized tetrapod diversity derived from range polygons as a variable in training the random forest machine learning models because their geographic sampling is less biased and they represent the most reliable and readily available data on the expected distributions of species at a global scale.

I resampled each predictor variable to a grain resolution of 10 km and reprojected to Wagner IV projection. I then computed the variance inflation factor for the predictor variables to remove highly correlated variables in a stepwise manner using the vifstep function in the R package usdm version 2.1-6 (30) by setting the VIF threshold = 5. My estimate of sampling density (i.e., sampling biases) was calculated using a spatial kernel density estimate (20) through the R package spatialEco v.2.0-1 (21) to generate sampling density grid standardized to 0-1 (22). I assessed model accuracy using 5-fold spatial block cross-validation (61) and by systematically dividing the data into spatial blocks using predefined floristic realms of the world (16). Spatial cross-validation allowed the identification of optimal combination of hyperparameters that minimize the root-mean-square error (RMSE) or "mtry", by tuning from 1 to 10, resulting in mtry = 4 for all diversity metrics except phylogenetic diversity with mtry = 5 (Fig. S7). These settings were used to train the final models and to generate extrapolated values per pixel for each metric under a scenario of globally unbiased sampling. Based on this scenario of globally unbiased sampling, I set all cells in the sampling bias grid to 1 under the assumption that all places across the globe have equal plant sampling, although I recognize that biodiversity is not randomly distributed. I then contrasted overlap of the extrapolated maps from the globally unbiased sampling to the modelled estimation maps.

Spatial correlations. I conducted spatially corrected correlations between plant diversity patterns and those of the different tetrapod classes for each metric. The spatial correlations were conducted using a corrected Pearson’s correlation for spatial autocorrelation using the package SpatialPack v.0.4 (62).

Protection levels by protected areas. To determine the level of protection of each diversity hotspot, I masked the hotspots based on modelled estimates onto existing protected area networks. I used version 1.6 of the protected areas of the IUCN categories I–IV compiled from the World Database on Protected Areas (63). I then calculated the proportion of the area covered by each hotspot inside and outside protected areas using the functions mask and crop in the R package terra v.1.7-3 (42).

Last, I explored the impact of protected areas on common species-level attributes including functional traits (such as plant height and seed size), evolutionary history (evolutionary distinctiveness), and extinction risk (defined as degree of threat facing a species), for each diversity hotspot inside and outside protected areas. Majority of the trait data were compiled from the Missouri Botanical Garden’s Tropicos database (https://www.tropicos.org/), supplemented by online regional databases. A detailed list of all consulted online regional databases is included in the supplementary materials (Dataset S2) while Table S2 lists the number of species with data for each trait category. It is important to note that the trait analysis focuses only on hotspots (top 10% of highest-ranking pixels for each metric), so the traits do not represent the entire global range. The extinction risk data in this study came from published dataset of machine learning predictions of conservation status for over 150,000 land plant species (64). The model considers factors known to influence extinction risk, including geographic range, environmental variables, and morphological traits (64). It estimates the likelihood of a species facing some level of threat. The predicted probability is then scaled to match the IUCN Red List categories using the expected probability of extinction over 100 years of each taxon (65) as follows: Least Concern = 0.001, Near Threatened & Conservation Dependent = 0.01, Vulnerable = 0.1, Endangered = 0.67, and Critically Endangered = 0.999. I used these machine-learning predictions because they provide broader coverage (for over 150,000 species) whereas formal IUCN Red List assessments are only available for a fraction of plant species (~34,000 as of March 15, 2023). This allows me to analyze a much larger dataset and identify potential coverage within protected and unprotected hotspots. Evolutionary distinctiveness was calculated with the evol_distinct function in the R package phyloregion v.1.0.9 (32) and is defined as the partitioning of phylogenetic branch lengths by the total number of species spanned by the tree weighted by the amount of unique evolutionary history they represent (66). In this way, evolutionary distinctiveness is considered as a species-level attribute, whereas diversity metrics such as phylogenetic endemism quantify biodiversity across geographic sites. Thus, the analysis aims to assess whether protected hotspots harbor a higher concentration of evolutionarily distinct lineages compared to unprotected hotspots. This means that within hotspots of phylogenetic endemism, for example, one can easily assess species that overlap these protected or unprotected hotspots on the basis of their evolutionary distinctiveness. To achieve this, I first identify grid cells unique to either protected or unprotected hotspots. Then, I identified all species that intersected with each type of hotspot. For each identified species within a hotspot, I extracted its relevant trait. I compared the functional traits between protected and unprotected hotspots using a t-test followed by Cohen's d with 1000 bootstrap replicates to estimate the effect size.

References

1. R. Govaerts, E. Nic Lughadha, N. Black, R. Turner, A. Paton, The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Sci. Dat. 8, 215 (2021).

2. A. D. Zizka et al., CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. Methods Ecol. Evol. 10, 744–751 (2019).

3. R. K. Brummitt, World geographical scheme for recording plant distributions, 2nd ed (Biodiversity Information Standards (TDWG) http://www.tdwg.org/ standards/109, 2001).

4. R. J. Hijmans, S. Phillips, J. Leathwick, J. Elith, dismo: Species Distribution Modeling (R package version 1.3-14, <https://CRAN.R-project.org/package=dismo>, 2023).

5. A. R. D. Rabosky et al., Coral snakes predict the evolution of mimicry across New World snakes. Nat. Commun. 7, 11484 (2016).

6. V. H. Heywood, Flowering plants of the world (Batsford, 1993).

7. APG IV, An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants. Bot. J. Linn. Soc. 181, 1–20 (2016).

8. B. L. Bateman, H. T. Murphy, A. E. Reside, K. Mokany, J. VanDerWal, Appropriateness of full-, partial- and no-dispersal scenarios in climate change impact modelling. Divers. Distrib. 19, 1224–1234 (2013).

9. N. V. Barve et al., The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecol. Model. 222, 1810–1819 (2011).

10. D. R. Brillinger, A particle migrating randomly on a sphere. In: Selected works of David Brillinger. Springer. p. 73–87 (2012).

11. S. Louca, Phylogeographic Estimation and Simulation of Global Diffusive Dispersal. Syst. Biol. 70, 340–359 (2021).

12. F. Perrin, Étude mathématique du mouvement brownien de rotation. In Annales scientifiques de l’École Normale Supérieure 45, 1–51 (1928).

13. S. Louca, M. Doebeli, Efficient comparative phylogenetics on large trees. Bioinformatics 34, 1053–1055 (2018).

14. Y. L. Jin, H. Qian, V.PhyloMaker: An R package that can generate very large phylogenies for vascular plants. Ecography 42, 1353–1359 (2019).

15. S. A. Smith, J. W. Brown, Constructing a broadly inclusive seed plant phylogeny. Am. J. Bot. 105, 302–314 (2018).

16. D. M. Olson et al., Terrestrial ecoregions of the world: A new map of life on Earth. Bioscience 51, 933–938 (2001).

17. Folk, R. A. C. J. Visger, P. S. Soltis, D. E. Soltis, R. P. Guralnick, Geographic range dynamics drove ancient hybridization in a lineage of Angiosperms. Am. Nat. 192, 171–187 (2018).

18. C. Meyer, P. Weigelt, H. Kreft, Multidimensional biases, gaps and uncertainties in global plant occurrence information. Ecol. Lett. 19, 992–1006 (2016).

19. B. H. Daru, J. Rodriguez. Mass production of unvouchered records fails to represent global biodiversity patterns. Nat. Ecol. Evol. 7, 816-831 (2023).

20. T. Duong, M. L. Hazelton, Cross-validation bandwidth matrices for multivariate kernel density estimation. Scand. J. Stat. 32, 485-506 (2005).

21. J. S. Evans, M. A. Murphy, spatialEco (R package version 2.0-1, <https://github.com/jeffreyevans/spatialEco>, 2023).

22. L. R. Hertzog, A. Besnard, P. Jay-Robert, Field validation shows bias-corrected pseudo-absence selection is the best method for predictive species-distribution modelling. Divers. Distrib. 20, 1403–1413 (2014).

23. G. F. Ficetola et al., An evaluation of the robustness of global amphibian range maps. J. Biogeogr. 41, 211–221 (2014).

24. Y. Fourcade, Comparing species distributions modelled from occurrence data and from expert-based range maps. Implication for predicting range shifts with climate change. Ecol. Inform. 36, 8–14 (2016).

25. B. H. Alhajeri, Y. Fourcade, High correlation between species-level environmental data estimates extracted from IUCN expert range maps and from GBIF occurrence data. J. Biogeogr. 46, 1329–1341 (2019).

26. J. A. Velasco et al., Synergistic impacts of global warming and thermohaline circulation collapse on amphibians. Commun. Biol. 4, 141 (2021).

27. C. Ureta et al., Evaluation of animal and plant diversity suggests Greenland’s thaw hastens the biodiversity crisis. Commun. Biol. 5, 985 (2022).

28. C. Meyer, H. Kreft, R. Guralnick, W. Jetz, Global priorities for an effective information basis of biodiversity distributions. Nat. Commun. 6, 1–8 (2015).

29. S. E. Fick, R. J. Hijmans. WorldClim 2: new 1km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 37, 4302–4315 (2017).

30. B. Naimi et al. Where is positional uncertainty a problem for species distribution modelling. Ecography 37, 191–203 (2014).

31. S. J. Phillips, R. P. Anderson, R. E. Schapire, Maximum entropy modeling of species geographic distributions. Ecol. Model. 190, 231–259 (2006).

32. B. H. Daru, P. Karunarathne, K. Schliep, phyloregion: R package for biogeographic regionalization and macroecology. Methods Ecol. Evol. 11, 1483–1491 (2020).

33. A. Radosavljevic, R. P. Anderson, Making better MAXENT models ofspecies distributions: Complexity, overfitting and evaluation. J. Biogeogr. 41, 629–643 (2014).

34. C. Liu, P. M. Berry, T. P. Dawson, R. G. Pearson, Selecting thresholds of occurrence in the prediction of species distributions. Ecography 28, 385–393 (2005).

35. M. H. Zweig, G. Campbell, Receiver-operating characteristic (Roc) plots—a fundamental evaluation tool in clinical medicine. Clin. Chem. 39, 561–577 (1993).

36. A. H. Fielding, J. F. Bell, A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 24, 38–49 (1997).

37. O. Allouche, A. Tsoar, R. Kadmon, Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). J. Appl. Ecol. 43, 1223–1232 (2006).

38. M. S. Boyce, P. R. Vernier, S. E. Nielsen, F. K. A. Schmiegelow, Evaluating resource selection functions. Ecol. Model. 157, 281–300 (2002).

39. A. H. Hirzel, G. Le Lay, V. Helfer, C. Randin, A. Guisan, Evaluating the ability of habitat suitability models to predict species presences. Ecol. Model. 199, 142–152 (2006).

40. J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data. Biometrics 33, 159–175 (1977).

41. D. Zurell et al., A standard protocol for reporting species distribution models. Ecography 43, 1261–1277 (2020).

42. R. Hijmans, terra: Spatial Data Analysis (R package version 1.7-3, <https://CRAN.R-project.org/package=terra>, 2023).

43. M. Strimas-Mackey, smoothr: Smooth and Tidy Spatial Features (R package version 1.0.0, <https://CRAN.R-project.org/package=smoothr>, 2023).

44. The IUCN Red List of Threatened Species.Version 2022-1 (IUCN, accessed 19 October 2022); https://www.iucnredlist.org

45. R Core Team, R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/, 2022).

46. B. J. Seliger, B. J. McGill, J. C. Svenning, J. L. Gill, Widespread underfilling of the potential ranges of North American trees. J. Biogeogr. 48, 359–371 (2021).

47. J. Nowosad, T. F. Stepinski, Spatial association between regionalizations using the information-theoretical V-measure. Int. J. Geogr. Inform. Sci. 32, 12, 2386–2401 (2018)

48. B. H. Daru, M. Van der Bank, T. J. Davies, Spatial incongruence among hotspots and complementary areas of tree diversity in southern Africa. Divers. Distrib. 21, 769–780 (2015).

49. M. Coates Palgrave, Keith Coates Palgrave trees of southern Africa, 3rd edn (Struik, Cape Town, South Africa, 2002).

50. B. Van Wyk, E. van der Berg, M. Coates Palgrave, M. Jordaan, Dictionary of names for southern African trees, 1st edn (Briza, Pretoria, South Africa, 2011).

51. E. N. Lughadha et al., Counting counts: revised estimates of numbers of accepted species of flowering plants, seed plants, vascular plants and land plants with a review of other recent estimates. Phytotaxa 272, 82–88 (2016).

52. K. J. Willis, The state of the world’s plants 2017 Report (R. B. Gardens Ed) (Kew, UK: Royal Botanic Gardens, 2017).

53. Bird Species Distribution Maps of the World. Version 2020.1 (BirdLife International, 2020); http://datazone.birdlife.org/species/requestdis

54. W. Jetz, R. A. Pyron, The interplay of past diversification and evolutionary isolation with present imperilment across the amphibian tree of life. Nat. Ecol. Evol. 2, 850–858 (2018).

55. Jetz, W., Thomas, G. H., Joy, J. B., Hartmann, K. & Mooers, A. O. The global diversity of birds in space and time. Nature 491, 444–448 (2012).

56. J. R. R. Tonini, K. H. Beard, R. B. Ferreira, W. Jetz, R. A. Pyron, Fully-sampled phylogenies of squamates reveal evolutionary patterns in threat status. Biol. Conserv. 204, 23–31 (2016).

57. O. R. Bininda‐Emonds et al., The delayed rise of present‐day mammals. Nature 446, 507–512 (2007).

58. D. P. Faith, Conservation evaluation and phylogenetic diversity. Biol. Conserv. 61, 1–10 (1992).

59. D. Rosauer, S. W. Laffan, M. D. Crisp, S. C. Donnellan L. G. Cook, Phylogenetic endemism: a new approach for identifying geographical concentrations of evolutionary history. Mol. Ecol. 18, 4061–4072 (2009).

60. M. N. Wright, A. Ziegler, ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 77, 1–17 (2017).

61. R. Valavi, G. Guillera-Arroita, J. J. Lahoz-Monfort, J. Elith, Predictive performance of presence-only species distribution models: a benchmark study with reproducible code. Ecol. Monogr. 92, e01486 (2022).

62. R. Vallejos, F. Osorio, M. Bevilacqua, Spatial Relationships Between Two Georeferenced Variables: with Applications in R (Springer, New York. ISBN 978-3-030-56680-7, 2020).

63. UNEP-WCMC, User Manual for the World Database on Protected Areas and world database on other effective area-based conservation measures (1.6. UNEP-WCMC: Cambridge, UK. Available at: http://wcmc.io/WDPA_Manual, 2019).

64. T. A. Pelletier et al., Predicting plant conservation priorities on a global scale. Proc. Natl Acad. Sci. USA 115, 13027–13032 (2018).

65. D. W. Redding, A. Ø. Mooers, Incorporating evolutionary measures into conservation prioritization. Conserv. Biol. 20, 1670–1678 (2006).

66. M. W. Cadotte, T. J. Davies, Rarest of the rare: Advances in combining evolutionary distinctiveness and scarcity to inform conservation at biogeographical scales. Divers. Distrib. 16, 376–385 (2010).

Data from: Predicting undetected native vascular plant diversity at a global scale

Data files

Abstract

Methods

Works referencing this dataset