Data and code from: Historical temperature stability and environmental drivers shape patterns of phylogenetic diversity in Cactaceae
Data files
Feb 24, 2026 version files (15.18 MB)
- 2Coord_cac2_epsg_4326.txt (13.21 MB)
- cact_occ_drivers.txt (1.85 MB)
- cacttree.treefile (47.57 KB)
- MPD___MNTD.R (4.97 KB)
- R_code_GWR.R (43.37 KB)
- R_code_pSEM.R (6.47 KB)
- README.md (5.75 KB)
- references.txt (1.60 KB)
Abstract
Reliable biodiversity data are fundamental for assessing species distributions, detecting knowledge gaps, and informing conservation strategies. We present a comprehensive dataset containing 435,694 georeferenced occurrence records for 1,892 species of Cactaceae across the Americas. This large-scale compilation provides an overview of the spatial distribution of cactus species, revealing geographic and taxonomic biases in sampling efforts. The dataset offers a valuable resource for advancing ecological and biogeographic research and supports future analyses aimed at understanding the diversity, distribution, and conservation needs of this emblematic plant family. This database is an expansion of the database compiled by Amaral et al. (2022).
1. DESCRIPTION
This dataset includes 435,694 georeferenced occurrence records for 1,892 species of Cactaceae across the Americas. The data were compiled from multiple sources and curated to support macroecological and biogeographic studies. This compilation expands on the database published by Amaral et al. (2022).
The repository contains:
- A cleaned occurrence dataset.
- A phylogenetic tree for the family.
- An integrated, analysis-ready data table (`cact_occ_drivers.txt`).
- R scripts for calculating phylogenetic diversity metrics (`ses.MPD` and `ses.MNTD`).
- Scripts for running piecewise SEM and Geographically Weighted Path Models (GWpath).
2. FILE LIST AND DESCRIPTIONS
| Filename | Description |
|---|---|
| `2Coord_cac2_epsg_4326.txt` | Raw dataset with georeferenced occurrence records for Cactaceae species. |
| `cact_occ_drivers.txt` | Main input file for analyses, combining occurrences and environmental variables. |
| `cacttree.treefile` | Phylogenetic tree for Cactaceae (Amaral et al., 2022), in Newick format. |
| `MPD___MNTD.R` | R script to compute ses.MPD and ses.MNTD using null models. |
| `R_code_pSEM.R` | R script for Piecewise Structural Equation Modeling (pSEM). |
| `R_code_GWR.R` | R script implementing Geographically Weighted Regression Path Modeling (GWpath). |
| `references.txt` | Main references used for building the models and database. |
| `README.md` | This file. |
3. NAMES AND ABBREVIATIONS USED
- longitude = Long
- latitude = Lat
Soil variables (SoilGrids) used to calculate the PCA of soil nutrients (SNUT)
- nitrogen (N) (in cg/kg)
- cation_exchange = cation exchange capacity (in mmol(c)/kg)
- org_carbon = soil organic carbon content (in dg/kg)
- stock_carbon = soil organic carbon stock (in t/ha)
- carbon_density = soil organic carbon density (in hg/m³)
- ph_water = soil pH in water (in pH × 10)
Soil variables (SoilGrids) used to calculate the PCA of soil composition (SCOM)
- sand = sand content (in g/kg)
- clay = clay content (in g/kg)
- fragments = coarse fragments (in cm³/dm³)
- silt = silt content (in g/kg)
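As a rough illustration of how a composite axis such as SCOM could be derived from the four variables above, the sketch below runs a PCA in base R. The soil values are simulated, and both the scaling of variables and the use of the first principal component as the score are assumptions made for illustration; the README does not state which axes or settings were used.

```r
# Hypothetical data: simulated soil values, NOT the values in this dataset.
set.seed(1)
n <- 50
soil <- data.frame(
  sand      = runif(n, 100, 800),  # g/kg
  clay      = runif(n, 50, 500),   # g/kg
  silt      = runif(n, 50, 400),   # g/kg
  fragments = runif(n, 0, 300)     # cm3/dm3
)

# Centre and scale so that variables in different units contribute equally.
pca <- prcomp(soil, center = TRUE, scale. = TRUE)

# Assumption: the first principal component is taken as the composite score.
scom <- pca$x[, 1]

# Proportion of variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Checking `var_explained[1]` before adopting a single axis is a sensible step, since a weak first component would summarise the soil variables poorly.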
Variables used
- PCA soil composition (SCOM)
- PCA soil nutrient (SNUT)
- elevation (m) (ELV)
- slope (°) (SLP)
- radiation (kJ m⁻² day⁻¹) (RAD)
- temperature (°C) (TEMP)
- precipitation (mm) (PREC)
- deltaprec = delta of precipitation (DPREC)
- deltatemp = delta of temperature (DTEMP)
- mntd or mntd2 = standardized effect size of mean nearest taxon distance (MNTD)
- mpd or mpd2 = standardized effect size of mean pairwise distance (MPD)
Variables unused
- lat_distance = latitudinal distance (LAT)
- conservation = conservation area (CSV)
- completeness = sample completeness
- PD = phylogenetic diversity (based on the sum of the branches)
- richness = species richness
- rbiog = biogeographical regions (NEO = Neotropic; NEA = Nearctic)
4. WHY DON'T WE USE THE CONSERVATION STATUS OF SPECIES?
This study does not require conservation status information for any species, as all analytical procedures were conducted exclusively at the community level. Phylogenetic structure metrics (MPD and MNTD) were calculated using species presence within communities, and the subsequent analyses—pSEM at the continental scale and GWR at the regional scale—depend solely on these community-level patterns.
No species-specific ecological or conservation-related inferences were made from the locality data, and the study does not examine species distributions, threats, or range-level spatial patterns. Therefore, the conservation status of individual taxa does not influence the results or their interpretation and was not included in the analyses.
While conservation status information is essential for studies involving sensitive species data or fine-scale distribution analyses, it is not applicable to the analytical framework used here.
5. REFERENCES
Amaral, D. T., Bonatelli, I. A. S., Romeiro-Brito, M., Moraes, E. M., & Franco, F. F. (2022). Spatial patterns of evolutionary diversity in Cactaceae show low ecological representation within protected areas. Biological Conservation, 273, 109677.
Barreto, E., Graham, C. H., & Rangel, T. F. (2019). Environmental factors explain the spatial mismatches between species richness and phylogenetic diversity of terrestrial mammals. Global Ecology and Biogeography, 28, 1855–1865.
Bivand, R., & Yu, D. (2017). spgwr: Geographically weighted regression. R package version 0.6-32. Retrieved from CRAN (package spgwr).
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2009). Geographically weighted regression. In The Sage Handbook of Spatial Analysis (pp. 243–254).
Fox, J., et al. (2007). The CAR package. R Foundation for Statistical Computing, 1109, 1431.
Kempner, A., Castro-Souza, R., Moi, D. A., et al. (2026). Historical temperature stability and environmental drivers shape patterns of phylogenetic diversity in Cactaceae. Journal of Biogeography, 53(1), e70134. https://doi.org/10.1111/jbi.70134
Lefcheck, J. S. (2016). piecewiseSEM: Piecewise structural equation modelling in R for ecology, evolution, and systematics. Methods in Ecology and Evolution, 7(5), 573–579.
Methods
Our database expands on the foundational dataset compiled by Amaral et al. (2022), incorporating additional occurrence records of Cactaceae species sourced from publicly available digital biodiversity platforms. We retrieved data from three main repositories: (i) GBIF (Global Biodiversity Information Facility), (ii) iNaturalist (iNat), and (iii) iDigBio (Integrated Digitized Biocollections), using the “spocc” package in R version 4.4.1 (Chamberlain, Ram & Hart, 2021; R Core Team, 2024), with data indexed up to October 9, 2023. Taxonomic validation was performed using the Caryophyllales.org checklist (Korotkova et al., 2021), allowing the removal of misidentified taxa and entries with spelling inconsistencies. Records with spatial inaccuracies or vague locality information were excluded using the CoordinateCleaner package (Zizka et al., 2019). The final dataset comprises 435,694 georeferenced occurrences representing 1,892 Cactaceae species.
Phylogenetic metrics Mean Pairwise Distance (MPD) and Mean Nearest Taxon Distance (MNTD) were calculated from a phylogenetic matrix including all Cactaceae species with valid occurrence data. MPD and MNTD values were standardized using a null model approach, resulting in the metrics ses.MPD and ses.MNTD (standardized effect sizes), to control for species richness effects.
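The standardization described above can be sketched in base R as follows. This is a minimal illustration, not the code in `MPD___MNTD.R`: the distance matrix is simulated, the helper names (`mpd`, `mntd`, `ses`) are hypothetical, and the null model (drawing random species sets of the same richness from the pool) is an assumption, since the text does not say which null model was used.

```r
set.seed(42)

# Hypothetical pairwise distances among 8 species, standing in for
# cophenetic distances taken from a phylogeny.
n_sp <- 8
coords <- matrix(rnorm(n_sp * 2), ncol = 2)
rownames(coords) <- paste0("sp", seq_len(n_sp))
phy_dist <- as.matrix(dist(coords))

# Mean pairwise distance among the species present in a community.
mpd <- function(d, spp) {
  sub <- d[spp, spp]
  mean(sub[lower.tri(sub)])
}

# Mean distance from each species to its nearest co-occurring relative.
mntd <- function(d, spp) {
  sub <- d[spp, spp]
  diag(sub) <- NA
  mean(apply(sub, 1, min, na.rm = TRUE))
}

# Standardized effect size: (observed - mean of null) / sd of null.
# Assumed null model: random species sets with the same richness.
ses <- function(metric, d, spp, runs = 999) {
  obs  <- metric(d, spp)
  null <- replicate(runs, metric(d, sample(rownames(d), length(spp))))
  (obs - mean(null)) / sd(null)
}

community <- c("sp1", "sp2", "sp3", "sp4")
ses_mpd  <- ses(mpd,  phy_dist, community)
ses_mntd <- ses(mntd, phy_dist, community)
```

Because the null communities keep the observed richness, the resulting values are comparable across cells with different numbers of species. Negative values indicate that co-occurring species are more closely related than expected under the null model, and positive values the opposite.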
To assess the direct effects of abiotic variables on ses.MPD and ses.MNTD, we used structural equation models (SEM) implemented through the piecewiseSEM package (Lefcheck, 2016) in R version 4.4.1. Causal paths were defined based on an initial hypothetical model. Multicollinearity among predictors was evaluated using the Variance Inflation Factor (VIF), calculated with the car package (Fox & Weisberg, 2007), and variables with VIF ≥ 3 were excluded. Path significance was estimated using maximum likelihood, and model fit was assessed with Shipley’s d-separation test using Fisher’s C statistic (p > 0.05 indicating a good fit).
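The two diagnostics mentioned above can be written out in base R to show what they compute. This is a hedged sketch on simulated predictors, not the `car` or `piecewiseSEM` implementation: the helper names `vif` and `fisher_c` are hypothetical, and the p-values passed to `fisher_c` are made up for illustration.

```r
set.seed(7)
n <- 200

# Hypothetical predictors; ELV is built to be collinear with TEMP.
env <- data.frame(TEMP = rnorm(n), PREC = rnorm(n))
env$ELV <- -0.9 * env$TEMP + rnorm(n, sd = 0.3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif(env)
flagged <- names(vif_values)[vif_values >= 3]  # threshold used in the text

# Fisher's C combines the p-values of the k independence claims from the
# d-separation test: C = -2 * sum(log(p)), compared against a chi-square
# distribution with 2k degrees of freedom.
fisher_c <- function(p) {
  C  <- -2 * sum(log(p))
  df <- 2 * length(p)
  c(C = C, df = df, p.value = pchisq(C, df, lower.tail = FALSE))
}

fit <- fisher_c(c(0.40, 0.25, 0.70))  # made-up p-values
```

When several predictors are flagged together, as TEMP and ELV are here, a common practice is to drop one at a time and recompute, because removing a single variable usually lowers the VIF of its partner.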
We also employed GWpath models (Geographically Weighted Path Models) (Barreto et al., 2019) using the spgwr package (Bivand & Yu, 2017) in R. This approach applies Geographically Weighted Regression (GWR) to estimate local models for each grid cell, based on a Gaussian kernel weighting function defined by distance (Fotheringham et al., 2009). The GWR model was built using the causal paths from the adjusted piecewiseSEM model. A bandwidth of 800 km was used. We then mapped the strongest environmental predictors of phylogenetic diversity for each grid cell, using categorical maps following the method of Barreto et al. (2019).
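The local fitting step can be illustrated with a weighted regression per location in base R. This sketch does not call `spgwr` and is not the code in `R_code_GWR.R`: the coordinates, predictor and response are simulated, and the weight form `exp(-0.5 * (d / h)^2)` is one common Gaussian kernel, assumed here for illustration. Only the 800 km bandwidth comes from the text.

```r
set.seed(3)
n <- 150

# Hypothetical cell centroids on a 4000 x 4000 km plane.
x_km <- runif(n, 0, 4000)
y_km <- runif(n, 0, 4000)

# Simulated predictor whose effect increases from west to east.
temp       <- rnorm(n)
true_slope <- x_km / 4000
ses_mpd    <- true_slope * temp + rnorm(n, sd = 0.2)
dat        <- data.frame(ses_mpd, temp)

bandwidth <- 800  # km, as stated in the text
d <- as.matrix(dist(cbind(x_km, y_km)))

# For each cell, weight every observation by its distance to that cell
# and fit a weighted least-squares model; keep the local slope.
local_slope <- sapply(seq_len(n), function(i) {
  w <- exp(-0.5 * (d[i, ] / bandwidth)^2)
  coef(lm(ses_mpd ~ temp, data = dat, weights = w))[["temp"]]
})
```

In a path-model setting, the same loop would be repeated for each path retained in the piecewise SEM. The predictor with the largest absolute standardized local coefficient in a cell could then be recorded to build the categorical maps.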
