A global biogeographic regionalization for butterflies
Data files
Jun 19, 2025 version files 4.48 GB
-
Butterfly_Phyloregionalization.zip
4.48 GB
-
README.md
19.54 KB
Abstract
The partitioning of global biodiversity into biogeographic regions is critical for understanding the impacts of global-scale ecological and evolutionary processes on species assemblages as well as prioritizing areas for conservation. However, the lack of globally comprehensive data on species distributions preclude fine-scale estimation of biogeographical regionalization for numerous taxa of ecological, economic, and conservation interest. Using a recently-published phylogeny and novel curated native range maps for over 10,000 species of butterflies around the world, we delineated biogeographic regions for the world’s butterflies using phylogenetic dissimilarity. We uncovered 19 distinct phylogenetically delimited regions (phyloregions) nested within 6 realms. Regional boundaries were predicted by spatial turnover in modern-day temperature and precipitation seasonality, but historical climate change also left a significant fingerprint on deeper- (realm-) level boundaries. We use a culturally and ecologically important group of insects to expand our understanding of how historical and contemporary factors drive the distribution of organismal lineages on Earth. As insects and global biodiversity more generally face unprecedented challenges from anthropogenic factors, our research provides the groundwork for prioritizing regions and taxa for conservation, especially with the goal of preserving the legacies of our biosphere’s evolutionary history.
Description of the data and file structure
The dataset consists of two folders (input and output) whose contents are described hierarchically as follows:
1. Input Data
1.1. determinants
This data in this folder is derived from .tif files of climate change since the last glacial maximum, elevation, mean annual precipitation and temperature, and precipitation and temperature seasonality at a resolution of 5 minutes, available from the WorldClim database (worldclim.org) . In our code these files are analyzed in the terra
package, but can also be viewed and analyzed with QGIS, ImageJ, or other image analysis software.
1.1.1. boundary_predictors
.csv files including values of predictors for each 100-km^2 grid cell present within 200 km of any border between butterfly phyloregions, as generated from 5_boundary_determinants.R. These files can be viewed in Microsoft Excel or R. We include 6 .csv files for the following predictor variables. Here “heterogeneity” refers to the coefficient of variation between a focal cell and the 8 cells surrounding it and is therefore unitless, but we provide the original units of each raw climate predictor for clarity.
elevation = Terrain Ruggedness Index (m);
LGM_velocity = velocity of climate change since the last glacial maximum (m/year);
MAP = heterogeneity of mean annual precipitation (mm);
MAT = heterogeneity of mean annual temperature (ºC);
prec_Seasonality = heterogeneity of precipitation seasonality (annual coefficient of variation from mm);
temp_Seasonality = heterogeneity of temperature seasonality (annual coefficient of variation from ºC).
- response: Is the grid cell on a phyloregion or not?
- grids: The name of the grid cell
- column 3: The variable as described above.
1.1.2. realm_predictors
Same as above, but restricted to cells buffering only deep (realm)
boundaries between phyloregions.
- response: Is the grid cell on a phyloregion or not?
- grids: the name of the grid cell
- column 3: the variable as described above.
1.1.3. region_predictors
Same as above, but restricted to cells buffering only shallow (region)
boundaries between phyloregions.
- response: Is the grid cell on a phyloregion or not?
- grids: the name of the grid cell
- column 3: the variable as described above.
1.2. distance_matrices
Beta diversity values calculated between all 100-km^2 grid cells for species beta diversity, species-level phylogenetic beta diversity (averaged across 100 trees), and genus-level phylogenetic beta diversity
(averaged across 100 trees). Distance matrices are stored as sparse matrices (.rds files), and can be opened, viewed, and analyzed in R using the Matrix
or phyloregion
packages.
1.3. full_phylogeny_100trees.tre
100 species-level phylogenetic trees generated using the Kawahara et al. (2023) backbone phylogeny and the additional species in our dataset using Jin and Qian’s (2022) Scenario 2 - new tips are bound to randomly selected nodes at and below the genus- or family-level basal node. The phylogeny can be viewed and analyzed in R using the ape
package or specialized phylogenetic tree manipulation software such as FigTree.
1.4. grids_100km
Shapefile (.shp) for 100-km^2 grids covering the globe including the main file .shp and companion files: .dbf, .prj, .sbx, .shx. Description of these file extensions is given as follows:
.shp: The main geospatial data file that contains feature geometry.
.dbf: The dBASE that contains the attributes of features.
.prj: The file that contains the coordinate system and map projection information.
.shx: The file containing the index of feature geometry.
The main .shp file can be opened and analyzed in R and many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
1.5. Kawahara_tree_AA154_secondary_fossils_strategyB.tre
Original time-calibrated phylogeny of butterflies from Kawahara et al.
2023. The phylogeny can be viewed and analyzed in R using the ape
package or specialized phylogenetic tree manipulation software such as FigTree.
1.6. other_regions
Geopackages of phyloregionalization schemes developed for birds, amphibians, mammals, and all three combined (“zooregions’; Holt et al. 2013); reptiles (Falaschi et al. 2022); and plants (Carta et al. 2022). Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
1.7. polygons
Geopackages of range polygons for all butterflies included in our analyses, created by Daru (2024). Polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
1.8. wrld_simpl
Shapefile of national borders circa 2010 including the main file .shp and companion files: .dbf, .prj, .sbx, .shx as described above. The shapefile (.shp) can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
2. Output Data
2.1. distance_t_tests
.csv files detailing the results of distance-based t-tests (4_distance_t.R) used to assess the proximity of butterfly phyloregion borders to borders of phyloregions for other taxa (results_distance_t_taxa_regions.csv). Also includes results of t-tests examining the proximity of borders for butterflies between others those generated using 7 clustering algorithms (including UPGMA) for realm-level borders (results_distance_t_clusters_realms.csv) and region-level borders (results_distance_t_clusters_regions.csv). .csv files can be viewed and analyzed in Microsoft Excel or R.
- corr: Correlation coefficient between distances from borders under
each scheme - Fstat: The F-value from the model
- p.value: The p-value from the model
- dof: Degrees of freedom
2.2. HGLM
.csv files detailing the results of heirarchical generalized linear models (5_hglm.R) used to examine the contribution of climatic and other variables to butterfly phyloregion borders. We ran HGLMs for all
borders, only deep (realm-level) borders, and only shallow (region-level) borders. .csv files can be viewed and analyzed in Microsoft Excel or R.
- variable: Border predictor as described in 1.1.1.
- fisher_z: Effect size calculated from HGLM t-value.
- upper: Upper bound of 95% confidence interval.
- lower: Lower bound of 95% confidence interval.
- p_value: p-value from HGLM.
2.3. node_based
2.3.1. node_model_results.csv
Results of 1,365 one-way ANOVAs conducted across 91 candidate “indicator nodes” in 15 realm-realm comparisons, conducted by 6_node_results.R. ANOVAs compared the mean grid-level specific overrepresentation score (SOS) for each of the two daughter lineages of the candidate node. Candidate nodes that did not occur in both of the realms under comparison were excluded from further analyses, as were comparisons between mean SOS scores of the same sign. This file can be viewed and analyzed in Microsoft Excel or R. Species in each node are detailed in node_species.rds.
- Realm pair: Pair of realms across which the focal node is being
compared - Node: Numbered candidate “indicator node”
- Rsquared: R^2 value from model
- P value: p-value from model
- DF: Degrees of freedom (grid cells)
- SOS1: Mean SOS for the grid cells in which one daughter lineage
occurs - SOS2: Mean SOS for the grid cells in which the other daughter
lineage occurs - Relevance: Are the two mean SOS values different signs (relevant) or
the same?
2.3.2. node_age_result.csv
Ages (in millions of years) of each relevant “indicator node” compared in node_model_results.csv. This file can be viewed and analyzed in Microsoft Excel or R.
- node: Numbered candidate “indicator node”
- node_age: Age of node in millions of years
- Descendents: Identity of the node’s daughter lineages, according to taxonomy presented by Kawahara et al. (2023).
- Clade: Which family (or not) does the node fall within?
2.3.3. node_species.rds
List of candidate “indicator nodes” including all of the species in their paired daughter clades. The list can be viewed and analyzed in R.
2.3.4. nodiv_data.rds
nodiv_data object created in 6_get_node_data.R to be used in futher node-based analyses (6_node_results.R). This object can be viewed and analyzed in the R package nodiv
.
2.3.5. nodiv_result.rds
nodiv_result object created in 6_get_node_data.R from nodiv_data.rds to be used in further node-based analyses (6_node_results.R). This object can be viewed and analyzed in the R package nodiv
.
2.3.6. SOS.csv
Specific overrepresentation scores (SOS) values for each candidate “indicator node” in all grid cells. This file can be viewed and analyzed in Microsoft Excel or R.
- column 1: grid cell name
- columns 2-92: SOS values for all candidate “indicator nodes”
2.4. PRESAB_100km_coastal_crop.csv
Long-format community matrix indicating species presence and absence for all grid cells. Created with 1_phyloregionalization.R. This file can be viewed and analyzed in Microsoft Excel or R.
2.5. region_stats.csv
Mean species richness (number of species), phylogenetic diversity (Faith’s PD in millions of years), weighted species endemism (species richness, with each species inversely weighted by the number of grid cells it occupies), and weighted phylogenetic endemism (PD, with branch lengths inversely weighted by the range size of their descendent clades) for each butterfly phyloregion. Created with 2_region_stats.R. This file can be viewed and analyzed in Microsoft Excel or R.
2.6. Regionalizations
2.6.1. genera_upgma.gpkg
Geopackage of genus-level phyloregions generated using the UPGMA clustering algorithm. Created with 3_genus_phyloregions.R. Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
2.6.2. kawahara_upgma_species.gpkg
Geopackage of species-level phyloregions generated using the species in the original time-calibrated phylogeny from Kawahara et al. 2023. Created with 3_kawahara_phyloregions.R. Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
2.6.3. other_clustering_algorithms
Files created with 3_phyloregionalization_clusters.R.
2.6.3.1. gpkg
Geopackage files of species-level phyloregions generated using 6 additional clustering algorithms. Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
2.6.3.2. rds
Phyloregion objects of species-level phyloregions generated using 6 additional clustering algorithms. Objects can be viewed and analyzed in R using the phyloregion
package.
2.6.3.3. realm_gpkg
Geopackage files of species-level phyloregions generated using 6 additional clustering algorithms, showing only deep borders between realms. Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.
2.6.4. UPGMA_species
Geopackages and phyloregion objects created with 1_phyloregionalization.R from species beta diversity (bioregions) and phylogenetic beta diversity (phyloregions). Regionalizations are presented as polygons can be viewed and analyzed in R using the terra
package, many other programming languages, and open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, GeoDa, etc.; phyloregions can be viewed and analyzed in R using the phyloregion
package.
2.7. v_measures
.csv files of analyses comparing V-measure scores between butterfly phyloregions and those of other terrestrial organisms (4_V_measures.R), as well as between species-level UPGMA clustering for butterfly
phyloregions and other clustering algorithms for realms and regions (4_V_measures_clusters_regions.R, 4_V_measures_clusters_realms.R). .csv files can be viewed and analyzed in Microsoft Excel or R.
Sharing/Access information
Data was derived from the following sources:
- Climate, elevation, and climate change .tif files are derived from Fick & Hijmans (2017), and Sandel et al. (2011)
- The initial butterfly phylogeny backbone is derived from Kawahara et al. (2023)
- Phyloregions for birds, mammals, amphibians, and the combination of the three are derived from Holt et al. (2013); reptile phyloregions are derived from Falaschi et al. (2023); and plant phyloregions are derived from Carta et al. (2022)
- Butterfly range polygons were created by Daru (2024)
- The
wrld_simpl
shapefile is available from the R packagemaptools
(Bivand & Lewin-Koh 2023)
Code/Software
Code is written in R v.4.4.1. R can be downloaded from https://www.r-project.org/ by selecting a local CRAN mirror and clicking the download link for your operating system. Install time is typically under 10 minutes on a personal computer.
The code and output plots can be displayed and manipulated in the RStudio GUI, which is available for download at https://posit.co/download/rstudio-desktop/. Installation time is typically under 10 minutes on a personal computer.
1. 1_phyloregionalization.R
This script constructs optimal regionalizations for butterflies based on phylogenetic and taxonomic beta diversity using the “elbow” method (Salvador & Chan 2004). It plots the resulting regions on maps, as dendrograms, and as NMDS ordinations. The optimal clustering algorithm determined here and used throughout the text and additional analyses was the unweighted pair-group method with arithmetic averages (UPGMA) method.
2. 2_Region_stats.R
This script calculates species diversity, phylogenetic diversity (Faith’s PD), and weighted species and phylogenetic endemism (Crisp et al. 2001, Rosauer et al. 2009) across phyloregions. This is calculated by computing individual grid-cell-level values within each region and then averaging across all the grid cells in the region.
3. Additional phyloregionalization schemes
These scripts use alternative methods (different tree topologies, species lists, and clustering algorithms) for building phyloregions of butterflies.
3.1. 3_genus_phyloregions.R
Here we use the UPGMA algorithm to build a phyloregionalization scheme using an average phylogeny where tips represent butterfly genera.
3.2. 3_kawahara_phyloregions.R
Here we use the UPGMA algorithm to build a phyloregionalization scheme using the original Kawahara et al. (2023) tree topology and set of 1,691 butterfly species in the original tree for which we have range polygons.
3.3. 3_phyloregionalization_clusters.R
Here we build phyloregionalization schemes using the full species dataset with 6 additional clustering algorithms: (1) centroid linkage clustering (UPGMC), (2) median linkage clustering (WPGMC), (3) weighted average linkage clustering (WPGMA), Ward clustering with (4) and without (5) Ward’s (1963) clustering criterion, and (6) maximum (complete) lineage clustering.
4. Analyses of congruence between regionalization schemes
These scripts compare our UPGMA butterfly phyloregionalization scheme with other regionalization schemes.
4.1. 4_distance_t.R
This script assesses the pairwise correlations of boundaries across different butterfly phyloregionalization schemes created by different clustering algorithms by first calculating the distance of each grid cell from a boundary in each scheme. We then use a modified version of t test to test whether the distance of each cell from a boundary is similar among the schemes. If two phyloregionalization schemes are similar, boundaries cross the same areas of the planet and a given cell shows a similar distance from the boundary in the two schemes.
4.2. 4_V_measures_clusters_realms.R
This script uses V-measure calculations (Nowosad & Stepinski 2018) to quantify the congruency between deep-level clusters (realms) across butterfly phyloregionalization schemes created by different clustering algorithms. To test whether V-measures were significantly different from a random expectation, we compared the observed measures to a random distribution of measures created from 999 random regionalization schemes with the same number of regions as the scheme of comparison.
4.3. 4_V_measures_clusters_regions.R
This script is the same as described in 4.2 but compares only shallow-level clusters (regions) across butterfly phyloregionalization schemes.
4.4. 4_V_measures.R
This script is the same as described in 4.2 but compares all clusters within our UPGMA butterfly phyloregionalization scheme to clusters within phyloregionalization schemes for mammals, birds, amphibians, and all tetrapods (Holt et al. 2013), reptiles (Falaschi et al. 2023), and plants (Carta et al. 2022).
5. Regional boundary predictors
These scripts compile and analyze the correlations between regional boundaries and physical or climatic conditions across those boundaries.
5.1. 5_boundary_determinants.R
We first calculate climate heterogeneity (as coefficient of variation), terrain ruggedness index, and the velocity of climate change since the last glacial maximum for cells within 200 km of a regional boundary. We start with all boundaries, and then create subsets of the data with only deep boundaries (between realms) and shallow boundaries (between regions).
5.2. 5_hglm.R
We then constructed hierarchical generalized linear models (HGLMs) to model boundaries (as a binary variable - border cell or not?) as a function of each of the boundary determinants compiled in 5.1 using a binomial error distribution.
6. Node-based analyses
These scripts use the node-based framework developed by Borregaard et al. (2014) to identify “indicator clades” that disproportionately contribute to the phylogenetic distinctiveness of butterfly phylorealms.
6.1. 6_get_node_data.R
In this script we first calculate specific overrepresentation scores (SOS) across all grid cells for the descendant lineages of clades in the butterfly phylogeny with minimal geographic overlap. We then create a nodiv_data object to be used in subsequent analyses (6.2), and a list of nodes and their daughter species.
6.2. 6_node_results.R
This script models SOS values as a function of cell realm membership for each of 15 realm pairs. We identify which of the nodes with minimal geographic overlap contributed to differences in the butterfly faunas of paired realms, focusing only on comparisons between positive and negative mean SOS values across realm boundaries. We modeled the contribution of each node to divisions between realms as R^2 values from linear models of SOS as a function of realm.