Supporting data for: Spatial biodiversity rasters, modeling datasets, and occurrence records for California butterflies
Data files
Mar 20, 2026 version files, 21.86 MB total
- CONSENSUS_hot10_SR_ED_PL.tif (322.12 KB)
- EDexp_EPSG3310_1km.tif (1.87 MB)
- EDexp_hot10.tif (324.89 KB)
- FINAL_25km4pt_unfixed_predictor_correlations_subsampled.csv (1.98 KB)
- FINAL_25km4pt_unfixed_subsampled_modeling_dataset_CORE.csv (1.13 MB)
- FINAL_25km4pt_unfixed_subsampled_modeling_dataset.csv (1.13 MB)
- matern_glmm_final_fulldatapoints.R (12.24 KB)
- NRI_NTI.R (2.69 KB)
- pendant_length.R (7.54 KB)
- PLexp_EPSG3310_1km.tif (1.88 MB)
- PLexp_hot10.tif (323.64 KB)
- Raw_species_locality_data.zip (12.64 MB)
- README.md (7.52 KB)
- SRexp_EPSG3310_1km.tif (1.88 MB)
- SRexp_hot10.tif (331.30 KB)
Abstract
This dataset supports the manuscript “Spatial integration of taxonomic and evolutionary diversity refines butterfly conservation in California.” It includes cleaned, georeferenced occurrence records for 211 butterfly species across California; derived spatial rasters of expected species richness (SRexp), probability-weighted evolutionary distinctiveness (EDexp), and probability-weighted pendant lineage length (PLexp); and binary top-decile hotspot layers for each metric, plus their spatial consensus.
The repository also contains a spatially subsampled modeling dataset (n = 2,948 grid cells) used in spatial generalized linear mixed models (GLMMs), including raw and standardized environmental predictors and response variables. A predictor correlation matrix derived from this subsampled dataset is provided.
Analytical R scripts are included for evolutionary metric calculation, spatial GLMM fitting with Matérn correlation structures, and phylogenetic community structure analyses (Net Relatedness Index and Nearest Taxon Index). Raster outputs are provided in GeoTIFF format (NAD83 California Albers projection, ~1 km resolution). These materials enable independent replication of spatial biodiversity mapping and modeling analyses presented in the associated study.
Creator
Dr. Khuram Zaman
Bakersfield College
Associated Manuscript
Zaman, K. 2026. Spatial integration of taxonomic and evolutionary diversity refines butterfly conservation in California. Insect Conservation and Diversity.
Dataset Overview
This repository contains occurrence records, derived spatial raster outputs, subsampled modeling datasets, and analytical scripts used to quantify taxonomic and evolutionary diversity of 211 butterfly species across California.
The dataset includes:
- Species occurrence records (cleaned and georeferenced)
- Expected species richness raster (SRexp)
- Probability-weighted evolutionary distinctiveness raster (EDexp)
- Probability-weighted pendant lineage length raster (PLexp)
- Top-decile hotspot rasters
- Consensus hotspot raster
- Spatially subsampled GLMM dataset (n = 2,948 grid cells)
- Predictor correlation matrix
- R scripts for phylogenetic, spatial, and modeling analyses
All diversity rasters are projected in NAD83 California Albers (EPSG:3310) at approximately 1 km resolution.
File Descriptions
1. Occurrence Data
Raw_species_locality_data.zip
Cleaned georeferenced butterfly occurrence records compiled by species.
Format: CSV (UTF-8 encoding)
Coordinate reference system: WGS84 (decimal degrees)
2. Diversity Raster Layers (GeoTIFF)
SRexp_EPSG3310_1km.tif
Expected species richness derived from stacked SDM probability surfaces.
EDexp_EPSG3310_1km.tif
Probability-weighted evolutionary distinctiveness surface.
PLexp_EPSG3310_1km.tif
Probability-weighted pendant lineage length (terminal branch length) surface.
Projection: NAD83 California Albers (EPSG:3310)
Resolution: ~1 km
Format: GeoTIFF
3. Top-Decile Hotspot Rasters
SRexp_hot10.tif
EDexp_hot10.tif
PLexp_hot10.tif
Binary rasters identifying grid cells within the top 10% of statewide values for each diversity metric.
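As a rough illustration of how a top-decile layer can be derived, the sketch below flags cells whose value exceeds the 90th percentile of a list of cell values. The function name `top_decile_mask`, the nearest-rank percentile rule, and the exclusion of cells tied with the threshold are all assumptions made for this example; the deposited rasters may have been produced with a different quantile definition, and NoData handling is omitted.

```python
def top_decile_mask(values):
    """Flag cells whose value exceeds the 90th percentile of all cells."""
    if not values:
        return []
    ranked = sorted(values)
    # Nearest-rank 90th percentile via integer arithmetic: ceil(0.9 * n).
    rank = (9 * len(ranked) + 9) // 10
    threshold = ranked[rank - 1]
    # Cells tied with the threshold are not flagged (an assumed convention).
    return [int(v > threshold) for v in values]


# Made-up SRexp values for ten cells, for illustration only.
cells = [12.1, 30.4, 18.9, 45.2, 22.0, 27.5, 9.8, 33.3, 40.1, 15.6]
mask = top_decile_mask(cells)
```

With ten cells, exactly one (the cell with value 45.2) is flagged.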
4. Consensus Hotspot Raster
CONSENSUS_hot10_SR_ED_PL.tif
Binary raster identifying grid cells simultaneously in the top decile for SRexp, EDexp, and PLexp.
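The consensus layer described above amounts to a cell-wise logical AND of the three binary hotspot layers. The sketch below shows that operation on flattened lists of 0/1 values; the function name `consensus_mask` and the toy inputs are hypothetical, and real use would first read the GeoTIFFs into aligned arrays.

```python
def consensus_mask(*layers):
    """Cell-wise AND of equally sized binary (0/1) layers."""
    if len({len(layer) for layer in layers}) != 1:
        raise ValueError("all layers must have the same number of cells")
    return [int(all(cell == 1 for cell in cells)) for cells in zip(*layers)]


# Toy hotspot layers for five cells, for illustration only.
sr_hot = [1, 1, 0, 0, 1]
ed_hot = [1, 0, 0, 1, 1]
pl_hot = [1, 1, 0, 1, 0]
consensus = consensus_mask(sr_hot, ed_hot, pl_hot)
```

Only the first cell is a hotspot in all three layers, so only it is flagged in `consensus`.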
5. Spatial GLMM Modeling Datasets
FINAL_25km4pt_unfixed_subsampled_modeling_dataset.csv
Complete spatially subsampled dataset used for all spaMM models. Includes raw and standardized predictor variables.
FINAL_25km4pt_unfixed_subsampled_modeling_dataset_CORE.csv
Reduced version containing core response and predictor variables.
FINAL_25km4pt_unfixed_predictor_correlations_subsampled.csv
Pearson correlation matrix for predictors calculated from the subsampled modeling dataset.
Subsampling procedure:
- 25 km × 25 km spatial bins
- Maximum 4 grid cells per bin
- Random sampling with set.seed(1)
- Final sample size: n = 2,948
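The subsampling procedure listed above can be sketched as follows. This is a minimal illustration, not the code in matern_glmm_final_fulldatapoints.R: the function name `subsample_cells`, the bin-origin convention (bins aligned to coordinate 0), and the toy grid are assumptions. Python's random generator will not reproduce the draws made by R's `set.seed(1)`, so the selected cells will differ from those in the deposited dataset.

```python
import random
from collections import defaultdict

BIN_SIZE_M = 25_000  # 25 km bins, in metres (EPSG:3310 units)
MAX_PER_BIN = 4      # at most 4 grid cells kept per bin


def subsample_cells(cells, seed=1):
    """Keep up to MAX_PER_BIN randomly chosen cells per 25 km x 25 km bin.

    cells: iterable of (cell_id, x, y) with projected coordinates in metres.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for cell_id, x, y in cells:
        # Floor division assigns each cell to a bin (assumed origin at 0, 0).
        key = (int(x // BIN_SIZE_M), int(y // BIN_SIZE_M))
        bins[key].append((cell_id, x, y))
    selected = []
    for key in sorted(bins):  # fixed bin order keeps the result reproducible
        members = bins[key]
        selected.extend(rng.sample(members, min(MAX_PER_BIN, len(members))))
    return selected


# Toy 50 km x 50 km grid of 1 km cells (cell centres), for illustration only.
grid = [(i * 50 + j, i * 1000.0 + 500.0, j * 1000.0 + 500.0)
        for i in range(50) for j in range(50)]
sample = subsample_cells(grid)
```

The toy grid spans four bins, so 16 of its 2,500 cells are retained.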
Detailed Variable Descriptions for Tabular Data Files:
FINAL_25km4pt_unfixed_subsampled_modeling_dataset.csv
FINAL_25km4pt_unfixed_subsampled_modeling_dataset_CORE.csv
These files contain the spatially subsampled dataset used for spatial GLMM analyses. Each row represents a ~1 km grid cell selected using stratified sampling (25 km × 25 km bins, up to 4 grid cells per bin).
Spatial identifiers:
- cell — Unique identifier for each grid cell
- x — Projected X coordinate (meters; NAD83 California Albers projection)
- y — Projected Y coordinate (meters; NAD83 California Albers projection)
Response variables:
- SRexp — Expected species richness (sum of species distribution model probabilities across species)
- EDexp — Expected evolutionary distinctiveness (probability-weighted evolutionary distinctiveness values)
- PLexp — Expected pendant lineage length (probability-weighted terminal branch lengths)
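For a single grid cell, the three response variables can be read as sums over species. SRexp is the sum of SDM probabilities, as stated above. For EDexp and PLexp, the sketch below assumes that "probability-weighted" means the sum of each species' probability multiplied by its ED or pendant length value; that reading, the function name `expected_metrics`, and the species values are illustrative assumptions rather than details confirmed by the source.

```python
def expected_metrics(probabilities, ed, pl):
    """Per-cell expected metrics from species-level inputs.

    probabilities: species -> SDM probability of occurrence in this cell (0-1)
    ed:            species -> evolutionary distinctiveness
    pl:            species -> pendant (terminal branch) length
    """
    sr_exp = sum(probabilities.values())
    # Assumed weighting: each species contributes p * value.
    ed_exp = sum(p * ed[sp] for sp, p in probabilities.items())
    pl_exp = sum(p * pl[sp] for sp, p in probabilities.items())
    return sr_exp, ed_exp, pl_exp


# Made-up values for three hypothetical species in one cell.
probs = {"sp_a": 0.5, "sp_b": 0.25, "sp_c": 1.0}
ed_values = {"sp_a": 10.0, "sp_b": 20.0, "sp_c": 4.0}
pl_values = {"sp_a": 2.0, "sp_b": 8.0, "sp_c": 1.0}
sr_exp, ed_exp, pl_exp = expected_metrics(probs, ed_values, pl_values)
```

Here `sr_exp` is 1.75, `ed_exp` is 14.0, and `pl_exp` is 4.0.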
Climatic predictor variables (WorldClim Bioclim variables):
- b1 — Annual Mean Temperature (°C)
- b2 — Mean Diurnal Range (°C)
- b3 — Isothermality (unitless; ratio of diurnal range to annual range × 100)
- b4 — Temperature Seasonality (standard deviation × 100)
- b7 — Annual Temperature Range (°C)
- b12 — Annual Precipitation (mm)
- b15 — Precipitation Seasonality (coefficient of variation)
- b18 — Precipitation of Warmest Quarter (mm)
Land-use and disturbance variables:
- Burn — Annual wildfire burn probability (unitless; range 0–1)
- CF — CropFrac; proportion of grid cell classified as cultivated crops (range 0–1)
- DF — DevFrac; proportion of grid cell classified as developed land (range 0–1)
- NOF — NatOpenFrac; proportion of grid cell classified as natural open habitats (range 0–1)
Standardized variables:
Variables with the suffix “_z” indicate standardized values (z-scores), calculated as:
(value − mean) / standard deviation
These standardized variables were used in spatial GLMM analyses to allow comparison of effect sizes across predictors.
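The standardization formula above can be sketched as follows. The source does not say whether the sample or the population standard deviation was used; this example assumes the sample standard deviation (n − 1 denominator), which is what R's `sd()` computes. The function name `zscore` and the input values are made up.

```python
import statistics


def zscore(values):
    """Standardize values as (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1), an assumed choice
    return [(v - mean) / sd for v in values]


# Made-up annual precipitation values (mm) for five cells.
b12 = [200.0, 400.0, 600.0, 800.0, 1000.0]
b12_z = zscore(b12)
```

The standardized column has mean 0 and standard deviation 1, so coefficients fitted on "_z" predictors are expressed per one standard deviation of each predictor.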
FINAL_25km4pt_unfixed_predictor_correlations_subsampled.csv
This file contains pairwise Pearson correlation coefficients among environmental predictor variables calculated from the spatially subsampled modeling dataset.
Column descriptions:
- Variable1 — Name of the first predictor variable
- Variable2 — Name of the second predictor variable
- Correlation — Pearson correlation coefficient (r), ranging from −1 to 1
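Because the correlation file is in long format (one row per predictor pair), screening for collinear predictors is a simple filter. The sketch below uses the three column names documented above; the 0.7 cutoff is a commonly used rule of thumb rather than a threshold stated in the source, and the sample rows are invented for illustration. To use the real file, pass its contents in place of `sample_csv`.

```python
import csv
import io


def high_correlation_pairs(csv_text, threshold=0.7):
    """Return (Variable1, Variable2, r) for pairs with |r| >= threshold."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        r = float(row["Correlation"])
        # Skip self-correlations on the diagonal, if the file includes them.
        if row["Variable1"] != row["Variable2"] and abs(r) >= threshold:
            pairs.append((row["Variable1"], row["Variable2"], r))
    return pairs


# Invented rows that mimic the documented column layout.
sample_csv = """Variable1,Variable2,Correlation
b1,b12,-0.45
b4,b7,0.91
b1,b1,1.0
CF,NOF,-0.72
"""
flagged = high_correlation_pairs(sample_csv)
```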
6. R Scripts
matern_glmm_final_fulldatapoints.R
Constructs spatial bins, generates subsampled modeling dataset, fits spaMM GLMMs, and extracts model metrics.
pendant_length.R
Implements Monte Carlo perturbation of phylogenetic branch lengths and calculates evolutionary distinctiveness and pendant length.
NRI_NTI.R
Calculates Net Relatedness Index (NRI) and Nearest Taxon Index (NTI) using richness-constrained null models.
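For orientation, NRI is conventionally defined as the negated standardized effect size of mean pairwise phylogenetic distance (MPD): NRI = −(MPD_obs − mean(MPD_null)) / sd(MPD_null), with NTI defined analogously from mean nearest taxon distance. The sketch below illustrates that definition with a null model that draws random communities of the same richness from the species pool. It is not the code in NRI_NTI.R; the function names, the number of null draws, and the toy distance matrix are assumptions.

```python
import itertools
import random
import statistics


def mpd(species, dist):
    """Mean pairwise phylogenetic distance (requires at least two species)."""
    pairs = list(itertools.combinations(sorted(species), 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def nri(community, pool, dist, n_null=999, seed=1):
    """NRI from null communities with the same richness as the observed one."""
    rng = random.Random(seed)
    observed = mpd(community, dist)
    null = [mpd(rng.sample(pool, len(community)), dist) for _ in range(n_null)]
    return -(observed - statistics.mean(null)) / statistics.stdev(null)


# Toy pool: two clades; distance 2 within a clade and 10 between clades.
clade = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
pool = sorted(clade)
dist = {s: {t: 0.0 if s == t else (2.0 if clade[s] == clade[t] else 10.0)
            for t in pool} for s in pool}

nri_clustered = nri(["a", "b", "c"], pool, dist)  # all from one clade
nri_dispersed = nri(["a", "d"], pool, dist)       # one from each clade
```

Positive values indicate phylogenetic clustering and negative values indicate overdispersion, so `nri_clustered` is positive and `nri_dispersed` is negative in this toy case.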
Environmental Predictors
Environmental predictor rasters (WorldClim v2.0, NLCD 2024, SoilGrids, Wildfire Risk to Communities) are publicly available from their respective repositories and are not redistributed here due to size and licensing considerations. All sources are cited in the associated manuscript.
Data Processing Summary
- Occurrence records filtered for coordinate validity and study region
- Spatial thinning applied prior to SDM construction
- Species distribution models constructed using Maxent
- Probability surfaces stacked to generate SRexp
- ED and pendant length calculated from a time-calibrated phylogeny
- Probability-weighted evolutionary rasters generated
- Spatial GLMMs fitted using spaMM with a Matérn correlation structure
Limitations
- Presence-only occurrence data
- Uneven spatial sampling effort
- Coarse (~1 km) environmental resolution
- Derived metrics represent modeled expectations rather than direct abundance estimates
Recommended Dataset Citation
Zaman, K. 2026. Supporting data for: Spatial integration of taxonomic and evolutionary diversity refines butterfly conservation in California. Dryad Digital Repository. https://doi.org/10.5061/dryad.9p8cz8wzh
License
These data are released under the Creative Commons Zero (CC0 1.0) Public Domain Dedication.
This waiver allows unrestricted use, distribution, and reuse of the data without any conditions.
Contact
Dr. Khuram Zaman
Bakersfield College
khuram.zaman@bakersfieldcollege.edu
