Data and code from: Biogeographic processes underlying global patterns of plant diversity
Data files
May 08, 2026 version files 182.35 MB
- DRYAD_DATA.zip (182.33 MB)
- README.md (26.88 KB)
Abstract
The uneven global distribution of plant diversity remains a fundamental question in biogeography. Using dated phylogenies of over 300,000 plant species and ancestral biogeographical stochastic mapping, we show that in-situ speciation is the predominant process underlying extant plant diversity and accounts for 78 % of biogeographic events across realms. The Neotropic contributed 37 % of in-situ speciation, likely due to its role as a center of species diversification. Dispersal between realms was less frequent (16 % of events) but facilitated floristic exchanges, especially in the Eastern Hemisphere. Extinction was least frequent but more pronounced in East Asia. These findings support the tropical conservatism hypothesis, in which many clades originated in the tropics and only recently expanded into temperate zones, where limited time and biome conservatism have restricted speciation and diversity.
Barnabas H. Daru, Cornelius O. Nichodemus and L. Francisco Henao-Diaz
Overview
This repository contains the data, analysis scripts, and visualization tools used in our study of the biogeographic processes underlying global plant diversity. The study quantifies the roles of in-situ speciation, dispersal, and extinction since the Cretaceous-Paleogene boundary (~66 Ma). Key findings:
- In-situ speciation: the predominant process, accounting for 77.5 % of biogeographic events.
- Dispersal: less frequent (15.8 % of events), but it connected regions and facilitated floristic exchanges among realms, especially in the Eastern Hemisphere.
- Extinction: the least frequent process shaping floristic assembly, but more pronounced in East Asia.
- Tropical conservatism: the results support the tropical conservatism hypothesis.
The archive DRYAD_DATA.zip contains all resources necessary to replicate the analyses and generate the figures.
The repository includes datasets, analysis code, and scripts for visualizing the results.
DRYAD_DATA README
This Dryad deposit contains all data and code underlying the manuscript “Biogeographic processes underlying global patterns of plant diversity” (B. H. Daru et al.; doi: https://doi.org/10.1126/science.adv6172).
The top-level folder DRYAD_DATA is organized into two main subfolders:
- DATA/ — input datasets and analysis outputs
- CODES/ — scripts used for intermediate steps, core analyses, and figure generation
1) DATA subfolder
External data sources and licensing (CC0 compatibility)
This submission incorporates or was created using materials from external resources, including phylogenetic trees, geographic base maps (e.g., the World Geographical Scheme for Recording Plant Distributions; WGSRPD), GreenMaps.gpkg, and specaccum_iNEXT.RData.
We confirm that we have the right to redistribute the contents of this repository/archive under the CC0 waiver. For each external resource, one of the following applies:
- the resource is in the public domain; or
- it is distributed under an open license that permits redistribution in a manner compatible with CC0; or
- the external resource was used only as an input to generate derived outputs, and any redistributed files do not contain proprietary or restricted content from the original source.
All external resources are cited in the associated manuscript and/or documentation.
1.1 INPUT
Key input files and folders used throughout the analyses:
- biogeographical_events.csv
1.1.1 Data dictionary: biogeographical_events.csv
Tabular summary of inferred biogeographic events per clade. Each row corresponds to one clade (see clade).
Columns
- n_taxa
  Number of taxa (tips/species) in the focal clade used for the biogeographic inference.
  Units: count (integer).
- in_situ_speciation_mean
  Mean estimated number of within-area (“in situ”) speciation events for the clade.
  Units: expected event count (can be non-integer because it is a mean across replicates/posterior samples).
- founder_event_speciation_mean
  Mean estimated number of founder-event speciation events (i.e., speciation associated with a jump dispersal/founder event) for the clade.
  Units: expected event count (mean; non-integer possible).
- allopatry_mean
  Mean estimated number of allopatric speciation events.
  Units: expected event count (mean; non-integer possible).
- subset_sympatry_mean
  Mean estimated number of subset sympatry speciation events.
  Units: expected event count (mean; non-integer possible).
- dispersal_mean
  Mean estimated number of dispersal events (range expansions/movements among areas) for the clade.
  Units: expected event count (mean; non-integer possible).
- extinction_mean
  Mean estimated number of extinction (range contraction/local extinction) events for the clade.
  Units: expected event count (mean; non-integer possible).
- clade
  Clade identifier used to match rows to downstream analyses/other tables.
  Type: categorical string (e.g., clade_1, clade_10, ...).
  Units: none.
Notes
- Columns ending in _mean are means (not raw counts) and therefore may be fractional.
- clades/: example clade phylogenies provided to demonstrate how to run the workflow (four clades):
  - plant_clade_1.tre
  - plant_clade_2.tre
  - plant_clade_3.tre
  - plant_clade_4.tre
- CSVs/: supporting tabular inputs used by multiple scripts
- geodata/: geographic datasets used for mapping and realm-level summaries, including:
  - GreenMaps.gpkg
  - realm_names_circleplots.csv
  - realm_names.csv
  - richness_realms.csv
  - richness_realms_updated.csv
  - specaccum_iNEXT.RData
- global/: core reference inputs used in the BioGeoBEARS analyses. This folder includes four example clade-specific input pairs (phylogeny + geography file):
  - angiosperms_clade_1_phylogeny, angiosperms_clade_1_geography
  - angiosperms_clade_2_phylogeny, angiosperms_clade_2_geography
  - angiosperms_clade_3_phylogeny, angiosperms_clade_3_geography
  - angiosperms_clade_4_phylogeny, angiosperms_clade_4_geography
  and the time-stratification inputs used across models:
  - timeslices/: timeperiods_5.txt, timeperiods_25.txt, timeperiods_145.txt
  - dispersal_multipliers/: dispersal_multipliers_5.txt, dispersal_multipliers_25.txt, dispersal_multipliers_145.txt
- phylogeny/: phylogenetic inputs and associated files used in the core analyses
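As an illustration of how the columns of biogeographical_events.csv can be combined, the sketch below sums each `*_mean` column across clades and reports its share of all events. It is a minimal example in Python (standard library only), not the authors' R workflow; the two sample rows are made up, and the code assumes the file has no missing values in these columns.

```python
import csv
import io

# Made-up rows for illustration only; real values live in biogeographical_events.csv.
SAMPLE = """n_taxa,in_situ_speciation_mean,founder_event_speciation_mean,allopatry_mean,subset_sympatry_mean,dispersal_mean,extinction_mean,clade
12,8.0,1.0,0.5,0.5,2.0,0.0,clade_1
30,20.0,2.0,1.5,0.5,3.0,1.0,clade_2
"""

# Event columns as named in the data dictionary above.
EVENT_COLUMNS = [
    "in_situ_speciation_mean",
    "founder_event_speciation_mean",
    "allopatry_mean",
    "subset_sympatry_mean",
    "dispersal_mean",
    "extinction_mean",
]


def event_shares(handle):
    """Sum each event column over clades and return its share of all events."""
    totals = dict.fromkeys(EVENT_COLUMNS, 0.0)
    for row in csv.DictReader(handle):
        for col in EVENT_COLUMNS:
            totals[col] += float(row[col])
    grand_total = sum(totals.values())
    return {col: totals[col] / grand_total for col in EVENT_COLUMNS}


shares = event_shares(io.StringIO(SAMPLE))
for col, share in shares.items():
    print(f"{col}: {share:.1%}")
```

To run it on the deposited file, pass `open("biogeographical_events.csv", newline="")` instead of the in-memory sample. How the manuscript groups these columns into its headline percentages is not specified here, so the shares are reported per column.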
Data dictionary: realm_names_circleplots.csv
Lookup table defining biogeographic realm labels and plotting metadata (e.g., colors) for circle plot figures. Each row corresponds to one realm/region.
Columns
- cluster
  Numeric identifier for a realm/region grouping used internally for plotting or ordering.
  Units: none (integer code).
- region
  Human-readable region name.
  Type: categorical string.
  Examples: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Palearctic.
- REALM
  Abbreviated realm code.
  Type: categorical string (short code).
  Examples: AFR, AUS, IDM, NEA, NEO, PAL.
- areanames
  Single-letter plotting label used to annotate areas/realms in figures.
  Type: categorical string (typically one character).
  Examples: a, b, c, ...
- CLR
  Color used for plotting the realm/region.
  Type/format: hex color code.
  Example: #516A78.
Data dictionary: realm_names.csv
Lookup table of biogeographic realms/regions and associated metadata used for plotting and for region-level predictor variables. Each row corresponds to one realm/region.
Columns
- region
  Human-readable region/realm name.
  Type: categorical string (e.g., Afrotropic, Australasia, East_Asia).
- cluster
  Numeric identifier for the realm/region grouping (used for ordering/grouping in analyses/figures).
  Units: none (integer code).
- REALM
  Abbreviated realm code.
  Type: categorical string (short code; e.g., AFR, AUS, EAS, IDM, NEA, NEO).
- areanames
  Short plotting label for the region (used in some figures).
  Type: categorical string (typically a single character; e.g., a, b, c, ...).
- CLR
  Color assigned to the realm/region for plotting.
  Type/format: hex color code (e.g., #7A8E42).
- realm_richness
  Total species richness (number of species) associated with the realm/region, used as a region-level richness summary.
  Units: count.
- area
  Total area of the realm/region.
  Units: area (as provided in the source dataset; see Notes).
- peri
  Perimeter of the realm/region polygon.
  Units: length (as provided in the source dataset; see Notes).
- isol
  Isolation metric for the realm/region.
  Units: distance (as provided in the source dataset; see Notes).
Notes
- Units for area, peri, and isol depend on the original geospatial data source and projection used to compute these metrics. In this workflow they are treated as continuous numeric predictors (often log-transformed or standardized prior to modeling, depending on the analysis script).
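The log-transformation and standardization mentioned in the note can be sketched as follows. This is an illustrative Python example, not code from the repository: the function name `log_zscore` and the input values are hypothetical, and whether a given analysis script applies the log, the z-score, or both is an assumption to check against that script.

```python
import math
import statistics


def log_zscore(values):
    """Natural-log transform positive values, then centre and scale to unit SD."""
    logged = [math.log(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample SD (n - 1 denominator), as in R's scale()
    return [(x - mean) / sd for x in logged]


# Made-up areas in arbitrary units; these are not values from realm_names.csv.
areas = [1.0e6, 4.0e6, 1.6e7]
z = log_zscore(areas)
print([round(x, 3) for x in z])
```

Because the transformed values are unitless, the unknown units of area, peri, and isol do not affect models fitted to the standardized predictors.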
Data dictionary: richness_realms_updated.csv
Realm-level richness and sampling-completeness summaries produced with iNEXT (interpolation/extrapolation of diversity based on Hill numbers). Each row corresponds to one biogeographic realm.
Columns
- realm
  Realm name (region identifier).
  Type: categorical string (e.g., Afrotropic, Australasia).
- T
  Number of sampling units (assemblages) used for the realm in the incidence-based iNEXT analysis.
  Units: count (integer).
- U
  Total incidence frequency across all sampling units (i.e., the sum of incidences over species; often interpreted as the total number of “detections/occurrences” across sampling units).
  Units: count (integer).
- S.obs
  Observed species richness in the realm.
  Units: count.
- SC
  Sample coverage (estimated completeness of the sample), as defined in iNEXT.
  Scale: proportion in [0, 1].
- Q1
  Number of “uniques”: species that occur in exactly one sampling unit.
  Units: count.
- Q2
  Number of species that occur in exactly two sampling units.
  Units: count.
- Q3…Q10
  Frequency counts: number of species occurring in exactly 3, 4, …, 10 sampling units, respectively.
  Units: count.
Notes
- This table summarizes incidence-based diversity inputs/diagnostics used by iNEXT; definitions of T, U, SC, and Qk follow iNEXT conventions for incidence frequency data (Chao et al.; iNEXT package documentation).
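To show how SC relates to the other columns, the sketch below computes sample coverage from T, U, Q1, and Q2 using the estimator commonly given for incidence data, C = 1 - (Q1/U) · [(T-1)Q1 / ((T-1)Q1 + 2Q2)]. This is a Python illustration with made-up counts; it is assumed, not confirmed, to match the values iNEXT reports, and iNEXT's handling of edge cases (for example Q2 = 0) may differ.

```python
def sample_coverage(T, U, Q1, Q2):
    """Estimated sample coverage for incidence-frequency data.

    C = 1 - (Q1 / U) * ((T - 1) * Q1) / ((T - 1) * Q1 + 2 * Q2)
    """
    if Q1 == 0:
        return 1.0  # no uniques: the sample is treated as complete
    return 1.0 - (Q1 / U) * ((T - 1) * Q1) / ((T - 1) * Q1 + 2 * Q2)


# Made-up counts for illustration; not taken from richness_realms_updated.csv.
sc = sample_coverage(T=50, U=1000, Q1=20, Q2=10)
print(round(sc, 4))
```

A value close to 1 indicates that few additional species would be expected from further sampling units in that realm.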
Data dictionary: richness_realms.csv
Grid-cell–level species richness data with each grid cell assigned to a biogeographic realm. Each row corresponds to one spatial grid cell.
Columns
- realms
  Biogeographic realm assignment for the grid cell.
  Type: categorical string (e.g., Nearctic, Palearctic, Neotropic).
- grids
  Grid-cell identifier (a unique code naming each spatial grid cell).
  Type: categorical string (e.g., ABT, AFG).
- richness
  Species richness in the grid cell.
  Units: count (number of species).
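A table with this layout can be summarized per realm as sketched below. The example is in Python with made-up rows and hypothetical grid codes (G1, G2, G3); it illustrates the grouping only and is not the authors' code.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with hypothetical grid codes; see richness_realms.csv for real data.
SAMPLE = """realms,grids,richness
Nearctic,G1,120
Nearctic,G2,80
Neotropic,G3,900
"""


def richness_by_realm(handle):
    """Group grid-cell richness by realm and return simple summaries."""
    per_realm = defaultdict(list)
    for row in csv.DictReader(handle):
        per_realm[row["realms"]].append(int(row["richness"]))
    return {
        realm: {
            "n_grids": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
        for realm, values in per_realm.items()
    }


summary = richness_by_realm(io.StringIO(SAMPLE))
print(summary)
```

The summary deliberately reports the mean and maximum rather than a sum: species are shared among grid cells, so adding grid-cell richness values would overstate realm-level richness.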
1.2 OUTPUT
Derived datasets and summary tables produced by the workflow:
1.2.1 Output folders (contents)
- 02_Summary_dispersals/
  Clade-level summary tables/figures related to dispersal events (derived from event reconstructions).
- 02_Summary_extinction/
  Clade-level summary tables/figures related to extinction events.
- 02_Summary_in_situ_speciation/
  Clade-level summary tables/figures related to within-area (“in situ”) speciation events.
- 04_Events_over_time/
  Time-binned summaries of inferred events (e.g., counts/rates through time), used for temporal analyses and plotting.
These folders contain similar file types, each summarizing a different class of biogeographic event, and their column definitions are largely shared. As an example, the data dictionary for 02_Summary_dispersals/1_Summary_dispersalbetweenRegion_Paleogene.csv is as follows:
Directed, realm-to-realm dispersal summary for the Paleogene time interval. Each row is a directed edge from a source realm (source) to a target realm (target), with dispersal magnitude aggregated within this interval.
Columns
- source
  Source realm label (single-letter code). Matches areanames in INPUT/geodata/realm_names.csv.
  Type: categorical string (e.g., a, b, ...).
- target
  Target realm label (single-letter code). Matches areanames in INPUT/geodata/realm_names.csv.
  Type: categorical string (e.g., f, b, ...).
- weight
  Total dispersal magnitude from source to target during the Paleogene interval (aggregated across clades/replicates as implemented in the workflow).
  Units: event weight / expected event count (continuous; may be non-integer).
- region
  Human-readable name of the source region/realm.
  Type: categorical string (e.g., Afrotropic).
- cluster
  Numeric identifier for the source realm/region (used for ordering/grouping).
  Units: none (integer code).
- REALM
  Abbreviated code for the source realm.
  Type: categorical string (e.g., AFR).
- CLR
  Plotting color assigned to the source realm.
  Type/format: hex color code.
- realm_richness
  Total species richness of the source realm.
  Units: count.
- area
  Area of the source realm.
  Units: area (as provided in the geodata source).
- peri
  Perimeter of the source realm polygon.
  Units: length (as provided in the geodata source).
- isol
  Isolation metric of the source realm.
  Units: distance (as provided in the geodata source).
- weight_per_species
  Dispersal weight standardized by source-realm richness: weight / realm_richness.
  Units: dispersal weight per species (continuous).
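The weight_per_species definition above amounts to a single division per row. A minimal Python sketch, using made-up edge values and a hypothetical helper name `add_weight_per_species`:

```python
def add_weight_per_species(rows):
    """Return copies of the rows with weight_per_species = weight / realm_richness."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["weight_per_species"] = float(row["weight"]) / float(row["realm_richness"])
        out.append(new_row)
    return out


# Made-up directed edges; values are strings, as they would be when read from a CSV.
edges = [
    {"source": "a", "target": "b", "weight": "12.5", "realm_richness": "25000"},
    {"source": "b", "target": "a", "weight": "3.0", "realm_richness": "6000"},
]
standardized = add_weight_per_species(edges)
print([row["weight_per_species"] for row in standardized])
```

Dividing by the richness of the source realm makes dispersal out of species-rich and species-poor realms comparable on a per-species basis.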
Notes
- This file is one of several interval-specific dispersal summaries in 02_Summary_dispersals/. Equivalent files for other intervals share the same column definitions.

- 05_predictor_variables/
  Derived predictor datasets used in statistical models (see files listed below).
1.2.2 Datasets of predictor variables (files)
These CSV files contain processed predictor variables assembled for downstream modeling/visualization:
- diversity_metrics_variables.csv
- extinction_variables.csv
- dispersal_variables.csv
- region_abiotic_variables.csv
- plate_tectonics.csv
- insitu_speciation_variables.csv
1.2.2.1 Data dictionary: 05_predictor_variables/dispersal_variables.csv
Derived, clade-by-region dispersal predictor variables used in downstream models. Each row corresponds to one clade × region combination.
Columns
- reg_abbr
  Region abbreviation used for compact labels. Matches areanames in INPUT/geodata/realm_names.csv.
  Type: categorical string (typically a single letter, e.g., a, b, ...).
- Source
  Dispersal metric summarizing dispersal originating from the focal region (i.e., outgoing dispersal from the region), for the given clade.
  Units: event weight / expected event count (continuous; may be non-integer).
- Recipient
  Dispersal metric summarizing dispersal into the focal region (i.e., incoming dispersal to the region), for the given clade.
  Units: event weight / expected event count (continuous; may be non-integer).
- clade
  Clade identifier/name used in analyses (e.g., an order-level clade such as Chloranthales).
  Type: categorical string.
- region
  Human-readable region/realm name corresponding to reg_abbr.
  Type: categorical string (e.g., Afrotropic, Australasia).
1.2.2.2 Data dictionary: 05_predictor_variables/diversity_metrics_variables.csv
Derived diversity metrics calculated for each clade × region combination and used as predictors/response summaries in downstream analyses. Each row corresponds to one clade in one region/realm.
Columns
- region
  Human-readable region/realm name.
  Type: categorical string (e.g., Afrotropic, Australasia).
- sr
  Species richness of the focal clade in the focal region.
  Units: count.
- pd
  Phylogenetic diversity (PD) of the focal clade in the focal region (Faith’s PD; sum of branch lengths connecting the taxa present).
  Units: branch-length units of the underlying phylogeny (typically time, e.g., Myr, if the tree is time-calibrated).
- we
  Weighted endemism (WE) of the focal clade in the focal region.
  Units: unitless index (continuous).
- pe
  Phylogenetic endemism (PE) of the focal clade in the focal region.
  Units: branch-length units weighted by range restriction (continuous; depends on phylogeny branch-length units).
- ed
  Evolutionary distinctiveness (ED) summary for the focal clade in the focal region (as implemented in the workflow).
  Units: branch-length units of the underlying phylogeny (continuous).
- clade
  Clade identifier/name used in analyses (e.g., Chloranthales).
  Type: categorical string.
1.2.2.3 Data dictionary: 05_predictor_variables/extinction_variables.csv
Derived, clade-by-region extinction predictor variable used in downstream models. Each row corresponds to one clade × region combination.
Columns
- reg_abbr
  Region abbreviation used for compact labels. Matches areanames in INPUT/geodata/realm_names.csv.
  Type: categorical string (typically a single letter, e.g., a, b, ...).
- extinction
  Estimated extinction metric for the focal clade in the focal region (as summarized/aggregated by the workflow).
  Units: event weight / expected event count (continuous; may be non-integer).
- clade
  Clade identifier/name used in analyses (e.g., Chloranthales).
  Type: categorical string.
- region
  Human-readable region/realm name corresponding to reg_abbr.
  Type: categorical string (e.g., Afrotropic, Australasia).
1.2.2.4 Data dictionary: 05_predictor_variables/insitu_speciation_variables.csv
Derived, clade-by-region in situ speciation predictor variable used in downstream models. Each row corresponds to one clade × region combination.
Columns
- reg_abbr
  Region abbreviation used for compact labels. Matches areanames in INPUT/geodata/realm_names.csv.
  Type: categorical string (typically a single letter, e.g., a, b, ...).
- insitu_speciation
  Estimated within-region (“in situ”) speciation metric for the focal clade in the focal region (as summarized/aggregated by the workflow).
  Units: event weight / expected event count (continuous; may be non-integer).
- clade
  Clade identifier/name used in analyses (e.g., Chloranthales).
  Type: categorical string.
- region
  Human-readable region/realm name corresponding to reg_abbr.
  Type: categorical string (e.g., Afrotropic, Australasia).
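Because the dispersal, extinction, and in situ speciation tables share the clade and region columns, they can be combined into one clade × region table. The Python sketch below joins them on the (clade, region) pair; the rows are made up (including the pairing of letters a and b with region names), and the helper names are hypothetical rather than taken from the repository.

```python
import csv
import io

# Made-up rows for illustration; the real files are in 05_predictor_variables/.
DISPERSAL = """reg_abbr,Source,Recipient,clade,region
a,1.5,0.5,Chloranthales,Afrotropic
b,0.0,2.0,Chloranthales,Australasia
"""
EXTINCTION = """reg_abbr,extinction,clade,region
a,0.25,Chloranthales,Afrotropic
b,0.0,Chloranthales,Australasia
"""
INSITU = """reg_abbr,insitu_speciation,clade,region
a,3.0,Chloranthales,Afrotropic
b,1.0,Chloranthales,Australasia
"""


def index_by_key(text, value_columns):
    """Read CSV text into {(clade, region): {column: float}}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["clade"], row["region"])
        table[key] = {col: float(row[col]) for col in value_columns}
    return table


def merge_predictors(*tables):
    """Inner-join the tables on their shared (clade, region) keys."""
    shared_keys = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for key in sorted(shared_keys):
        merged[key] = {}
        for table in tables:
            merged[key].update(table[key])
    return merged


merged = merge_predictors(
    index_by_key(DISPERSAL, ["Source", "Recipient"]),
    index_by_key(EXTINCTION, ["extinction"]),
    index_by_key(INSITU, ["insitu_speciation"]),
)
for key, values in merged.items():
    print(key, values)
```

An inner join keeps only clade × region combinations present in all three tables; combinations missing from any one table are dropped rather than filled in.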
1.2.2.5 Data dictionary: 05_predictor_variables/plate_tectonics.csv
Region-level plate tectonics predictor used in downstream models. Each row corresponds to one biogeographic region/realm.
Columns
- region
  Human-readable region/realm name.
  Type: categorical string (e.g., Afrotropic, Australasia).
- tectonics
  Plate-tectonics metric for the region (as derived/aggregated by the workflow; e.g., a summary of tectonic activity or plate-boundary-related measure over the region).
  Units: as provided by the source/derivation in the workflow (continuous numeric predictor).
1.2.2.6 Data dictionary: 05_predictor_variables/region_abiotic_variables.csv
Region-level abiotic predictor variables used in downstream models. Each row corresponds to one biogeographic region/realm.
Columns
- region
  Human-readable region/realm name.
  Type: categorical string.
- mean_temp
  Mean temperature for the region (spatial average).
  Units: as in source climate dataset (typically °C).
- temp_seasonality
  Temperature seasonality for the region (spatial average).
  Units: as in source climate dataset (often standard deviation or coefficient of variation; dataset-dependent).
- mean_precip
  Mean precipitation for the region (spatial average).
  Units: as in source climate dataset (typically mm/year).
- precip_seasonality
  Precipitation seasonality for the region (spatial average).
  Units: as in source climate dataset (often coefficient of variation; dataset-dependent).
- solar_rad
  Mean solar radiation for the region (spatial average).
  Units: as in source dataset (dataset-dependent).
- clim_velocity
  Climate velocity metric for the region.
  Units: distance per time (as provided in source; dataset-dependent).
- elevation
  Mean elevation of the region.
  Units: meters (typically).
- sd_elev
  Standard deviation of elevation within the region (topographic heterogeneity).
  Units: meters.
- surface_relief_ratio
  Surface relief ratio for the region (topographic roughness/relief index).
  Units: unitless index (continuous).
- terrain_ruggedness
  Terrain ruggedness metric for the region (topographic heterogeneity).
  Units: as defined by the ruggedness index used (continuous).
- area
  Total area of the region.
  Units: area (as provided in the geodata source).
- peri
  Perimeter length of the region polygon.
  Units: length (as provided in the geodata source).
- isol
  Isolation metric for the region.
  Units: distance (as provided in the geodata source).
- core/ — example outputs from BioGeoBEARS runs for three representative clades, generated under each of the six models used in the manuscript. Subfolders correspond to models:
- BAYAREALIKE/ — outputs for the three example clades under BAYAREALIKE
- BAYAREALIKEj/ — outputs for the three example clades under BAYAREALIKE+j
- DEC/ — outputs for the three example clades under DEC
- DECj/ — outputs for the three example clades under DEC+j
- DIVALIKE/ — outputs for the three example clades under DIVALIKE
- DIVALIKEj/ — outputs for the three example clades under DIVALIKE+j
- historic_maps/ — reconstructed/processed historical map layers used in downstream summaries and visualization
- region_classification/ — files defining region/realm assignments used in analyses
- replication_parameters/ — parameter settings and metadata used to control replication and resampling steps
- response_predictor_variables/ — compiled response and predictor variables used in statistical models
- z-score/ — standardized (z-scored) versions of selected variables used in downstream analyses
1.3 Missing values and “NA” / “n/a” entries
Some columns in the data tables (csv files) contain missing or unavailable values that appear as NA (and, where applicable, n/a). These entries are intentionally used to document cases where information could not be provided.
- NA indicates that the value is missing/unknown at the time of data preparation (i.e., the information was not available from the source records and could not be reliably inferred).
- n/a indicates that the field is not applicable for that record (i.e., the variable does not apply in that context, so no value should exist).
Missing values occur for reasons such as:
- the source did not report the information,
- the information was withheld/redacted,
- the record predates collection of that field, or
- the field is not applicable to that observation.
No missing values were imputed unless explicitly stated elsewhere in this README. Future users should treat NA as unknown/missing and n/a as structurally not applicable when filtering, summarizing, or modeling.
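One way to keep the two markers distinct when loading a table is sketched below in Python. The sentinel `NOT_APPLICABLE` and the function `parse_cell` are hypothetical names introduced for illustration; the repository's own scripts may handle these markers differently.

```python
# Sentinel object for "structurally not applicable" cells, kept distinct from None.
NOT_APPLICABLE = object()


def parse_cell(text):
    """Map a raw CSV cell to None (NA), NOT_APPLICABLE (n/a), a float, or a string."""
    value = text.strip()
    if value == "NA":
        return None  # unknown/missing
    if value == "n/a":
        return NOT_APPLICABLE  # no value should exist for this record
    try:
        return float(value)
    except ValueError:
        return value  # non-numeric content such as a region name


cells = ["3.5", "NA", "n/a", "Afrotropic"]
parsed = [parse_cell(cell) for cell in cells]
known_numbers = [p for p in parsed if isinstance(p, float)]
print(known_numbers)
```

Many CSV readers convert both markers to a single missing-value type by default, so the distinction has to be preserved explicitly if it matters for filtering or modeling.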
2) CODES subfolder
2.1 Intermediate_analysis/
R scripts used to generate intermediate datasets and summaries used in the manuscript.
- 1_makephylo.R
  Builds the phylogeny used in downstream analyses.
- 2_classify_species_to_region_WCVP.R
  Classifies species to biogeographic regions/realms using the World Checklist of Vascular Plants (WCVP).
- 3_cladeByTrait_WCVP.R
  Samples clades for downstream analyses by identifying clades with 9–100 species, using a trait-based clade delineation implemented in CladeByTrait() (sourced from functions.R).
- 4_prepare_BGB_runfiles.R
  Prepares BioGeoBEARS run inputs (geography file + clade-specific phylogeny) for each predefined clade. The script:
  - reads a species-by-region table and converts it to a sparse community matrix (species × regions),
  - for each clade tree file (*.tre), subsets the community matrix and global phylogeny to the clade’s species,
  - writes (i) a BioGeoBEARS geography file (PHYLIP-like encoded format) and (ii) a pruned, fully dichotomous phylogeny (multi-tree if input contains multiple trees).
- 5_specaccum_iNEXT.R
  Quantifies and compares resident-species accumulation across the major biogeographic realms while accounting for unequal area/sampling among realms. Uses sample-based incidence rarefaction/extrapolation (iNEXT) with 1,000 randomizations (package default), adding sampling units (“grids”) in random order.
- 6a_DispersalBetweenRegion_all.R
  Summarizes dispersal-related events from BioGeoBEARS BSM output (all time).
- 6b_DispersalBetweenRegion_Paleogene.R
  Same as above, restricted to the Paleogene.
- 6c_DispersalBetweenRegion_Neogene.R
  Same as above, restricted to the Neogene.
- 6d_DispersalBetweenRegion_Quaternary.R
  Same as above, restricted to the Quaternary.
- 6e_Insitu_Speciation_all.R
  Summarizes in situ speciation events from BioGeoBEARS BSM output.
- 6f_Extinction_all.R
  Summarizes extinction events from BioGeoBEARS BSM output.
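The PHYLIP-like geography file written by 4_prepare_BGB_runfiles.R can be illustrated with a short sketch. The Python example below assumes the usual BioGeoBEARS layout (a header with the number of taxa, the number of areas, and the area labels in parentheses, followed by one tab-separated presence/absence string per taxon); the species names and the function `format_geography` are hypothetical, and the exact output of the R script should be checked against the example files in INPUT/global/.

```python
def format_geography(presence, areas):
    """Encode species presences as a PHYLIP-like BioGeoBEARS geography file.

    presence: dict mapping species name -> set of area labels it occupies
    areas:    ordered list of area labels (one column per area)
    """
    header = f"{len(presence)}\t{len(areas)} ({' '.join(areas)})"
    lines = [header]
    for species in sorted(presence):
        code = "".join("1" if area in presence[species] else "0" for area in areas)
        lines.append(f"{species}\t{code}")
    return "\n".join(lines) + "\n"


# Hypothetical two-species clade occupying three areas labelled a, b, c.
presence = {"Genus_alpha": {"a", "c"}, "Genus_beta": {"b"}}
geography_text = format_geography(presence, ["a", "b", "c"])
print(geography_text, end="")
```

Species names in the geography file need to match the tip labels of the pruned clade phylogeny, which is why the script subsets both from the same species list.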
2.2 Core_analysis/
Scripts implementing the core BioGeoBEARS model fitting used in the manuscript:
- DEC.R
- DECj.R
- DIVALIKE.R
- DIVALIKEj.R
- BAYAREALIKE.R
- BAYAREALIKEj.R
2.3 Figures/
Scripts to reproduce the main figures:
- Figure1.R
- Figure2.R
- Figure3.R
- Figure4.R
Additional supporting code:
- stats_events_over_time.R
  Code used for Figure 2 analyses (within-realm change in dispersal, extinction, and in situ speciation events over the past ~70 Ma).
- functions.R
  Shared functions used across scripts in Intermediate_analysis/, Core_analysis/, and Figures/.
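The period-specific summaries (scripts 6b to 6d) and the time-binned analyses depend on assigning each inferred event to a geological interval. A minimal Python sketch of that binning follows; the boundary ages (2.58, 23.03, and 66.0 Ma) follow the International Chronostratigraphic Chart and are an assumption here, since the exact cut-offs used in the R scripts are not stated in this README. The event ages are made up.

```python
from collections import Counter

# (name, younger bound, older bound) in Ma; boundaries assumed from the ICS chart.
PERIODS = [
    ("Quaternary", 0.0, 2.58),
    ("Neogene", 2.58, 23.03),
    ("Paleogene", 23.03, 66.0),
]


def assign_period(age_ma):
    """Return the period containing an event age in Ma, or None if older than 66 Ma."""
    for name, younger, older in PERIODS:
        if younger <= age_ma < older:
            return name
    return None


def count_by_period(ages):
    """Count events per period, ignoring ages outside the defined intervals."""
    periods = (assign_period(age) for age in ages)
    return Counter(p for p in periods if p is not None)


# Made-up event ages in Ma.
event_ages = [0.5, 1.2, 10.0, 30.0, 65.0, 70.0]
counts = count_by_period(event_ages)
print(dict(counts))
```

Intervals are half-open (younger bound included, older bound excluded), so an event dated exactly at a boundary is counted once, in the older period.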
