Diversification and evolution of Hawaiian Megalagrion damselflies (Pinapinao, Odonata: Coenagrionidae)

Published Oct 07, 2025 on Dryad. https://doi.org/10.5061/dryad.18931zd80

Data files

Oct 07, 2025 version files 857.11 MB

megalagrion_dryad_repository.zip

857.07 MB
README.md

35.81 KB

Abstract

Hawaiʻi’s pinapinao (Megalagrion McLachlan) comprises a radiation of 23 endemic damselfly species within Coenagrionidae. Despite being a unique study system for understanding geology’s impacts on evolutionary processes among Odonata, the understanding of these damselflies’ temporal, geographic, and phylogenetic origins remains incomplete. Testing macroevolutionary hypotheses has been hampered by conflicting topologies. To resolve these uncertainties, we performed phylogenetic analyses including divergence time estimation with 90 nuclear loci (>50 kbp) and 2 mitochondrial loci (>1 kbp), sampling representatives from 37 genera within core Coenagrionidae and 90% of Megalagrion species, including multiple island populations. We used ancestral range estimations, diversification analyses, agent-based simulation modeling, and ancestral state reconstruction to infer the group’s origin and biogeography and assess traits’ roles in diversification. Our findings indicate Megalagrion’s ancestor diverged from core Coenagrionidae early Eocene (~51 MA) and diversified early Miocene (~19 MA), suggesting Megalagrion’s MRCA predates Kauaʻi’s emergence by 7–21 MY. Diversification analyses suggest a low rate after Megalagrion diverged from Coenagrionidae, followed by a sudden increase around 19 MA, and simulation modeling supports extinction playing a significant role. Extant Megalagrion diversity is largely explained by ecological diversification into at least five clades with distinct breeding habitats that likely evolved on the Northwestern Hawaiian Islands, which are now sunken seamounts. Speciation continued as descendants dispersed to the current Hawaiian Islands as islands emerged. Species breeding in seeps further diversified within the island of Kauaʻi. Our results highlight including geologic changes over time in evolutionary studies and increase understanding of diversification patterns, biogeography, and adaptive radiation on islands. This dataset contains scripts, data, and files used to perform the study described above and published in the article "Diversification and Evolution of Hawaiian Megalagrion Damselflies (Pinapinao, Odonata: Coenagrionidae)" in Systematic Entomology.

Study Overview

This repository contains data and analysis code for a phylogenomic and biogeographic study of Hawaiian endemic damselflies (Megalagrion), investigating evolutionary relationships, inter-island colonization patterns, diversification rates, and trait evolution using target capture sequencing.

Key Findings

Phylogenomic analysis of Megalagrion species across all main Hawaiian islands
Biogeographic reconstruction of colonization patterns across the archipelago
Analysis of diversification rate variation through time
Ancestral state reconstruction of breeding habitat and gill morphology

Description of the Data and File Structure

Missing Data Codes

NA - Not applicable; indicates that a parameter is not part of the specified model (e.g., the Alpha parameter is not estimated in constant-rate models)

Abbreviations Used

AIC - Akaike Information Criterion (model selection statistic)
AICc - Corrected Akaike Information Criterion (for small sample sizes)
HPD - Highest Posterior Density (credible interval in Bayesian analysis)
RF - Robinson-Foulds distance (phylogenetic tree distance metric)
ML - Maximum Likelihood
MCMC - Markov Chain Monte Carlo
DEC - Dispersal-Extinction-Cladogenesis (biogeographic model)
ASR - Ancestral State Reconstruction
Ma - Million years ago

Repository Structure

This summarizes the structure of "megalagrion_dryad_repository.zip"

├── README.md                    # This file
├── software_requirements.md     # Required software and versions
│
├── data/                        # All input data files
│   ├── ancestral_state_reconstruction/  # Trait data for ASR analyses
│   ├── astral/                  # Gene trees and species trees
│   ├── beast/                   # BEAST analysis input files
│   ├── biogeobears/             # Geographic data and constraints
│   │   ├── four_areas/          # Four-area biogeographic model
│   │   ├── five_areas/          # Five-area biogeographic model
│   │   ├── six_areas/           # Six-area biogeographic model
│   │   └── seven_areas/         # Seven-area biogeographic model
│   ├── calibration/             # Time-calibrated trees
│   ├── figures/                 # Geographic vector files and data
│   ├── iqtree/                  # Maximum likelihood trees
│   ├── pruned_trees/            # Subset trees for specific analyses
│   ├── rpanda/                  # Diversification analysis data
│   └── treepl/                  # Tree dating input/output
│
├── src/                         # Analysis scripts
│   ├── ancestral_state_reconstruction/
│   ├── astral/                  # Species tree reconstruction
│   ├── biogeobears/             # Biogeographic analyses
│   ├── preliminary_calibration/ # Preliminary tree dating
│   ├── prune_trees/             # Tree pruning scripts
│   ├── rpanda/                  # Diversification rate analyses
│   └── treepl/                  # Tree dating scripts
│
├── results/                     # Analysis outputs
│   ├── ancestral_state_reconstruction/
│   ├── astral/
│   │   ├── full/                # Full taxon sampling results
│   │   ├── target/              # Target taxon sampling results
│   │   └── gene_trees/          # Gene tree results
│   ├── beast/
│   ├── biogeobears/
│   │   ├── four_areas/          # Four-area biogeographic results
│   │   ├── five_areas/          # Five-area biogeographic results
│   │   ├── six_areas/           # Six-area biogeographic results
│   │   └── seven_areas/         # Seven-area biogeographic results
│   ├── rpanda/
│   └── treepl/
│
├── simulation/                  # Agent-based model data
│
├── trees_from_cluster/          # Raw phylogenetic pipeline output (unmodified)
│   ├── core_coenagrionidae_full_sampling/    # Full taxon sampling analysis
│   └── core_coenagrionidae_limited_sampling/ # Limited taxon sampling analysis
│
└── analysis_descriptions/       # Detailed method descriptions
    ├── phylogenetics_methods.md
    ├── biogeography_methods.md
    ├── diversification_methods.md
    └── trait_evolution_methods.md

Data Description

Ancestral State Reconstruction (`data/ancestral_state_reconstruction/`)

breeding_habitat.csv - Species breeding habitat classifications (lentic/lotic)
- breeding_habitat - Habitat type where larvae develop (lentic = still water, lotic = flowing streams, seep = seepage areas, phytotelmata = plant-held water, terrestrial = terrestrial substrates)
breeding_habitats_full_megalagrion.csv - List of specimen IDs for all Megalagrion taxa included in ancestral state reconstruction (one specimen ID per line, no headers)
gill_type.csv - Larval gill morphology classifications
- gill_type - Morphology of larval gills (foliate = leaf-like, lanceolate = lance-shaped, saccate = sac-like)

Phylogenetic Trees

Time-Calibrated Trees (data/calibration/):

calibrated.tree / calibrated.nwk - Primary time-calibrated phylogeny
ultrametric_core_coenagrionidae.tree / .nwk - Ultrametric tree for core Coenagrionidae
calibrated_full_core_coenagrionidae.tree / .nwk - Full taxon sampling calibrated tree
calibrated_full_core_coenagrionidae_species.tree / .nwk - Species-level calibrated tree
calibrated_pseudagrion_and_relatives.tree / .nwk - Pseudagrion clade calibrated tree
calibrated_teinobasis_and_relatives.tree / .nwk - Teinobasis clade calibrated tree

Maximum Likelihood Trees (data/iqtree/):

full_sampling_target_alignment.tree - ML tree from full taxon sampling
full_sampling_target_alignment_rooted.tree - Rooted ML tree
limited_sampling_full_alignment.tree - ML tree from limited sampling with full alignment
limited_sampling_target_alignment.tree - ML tree from limited sampling
supermatrix_limited_sampling_target_alignment.fasta - Concatenated alignment
supermatrix_limited_sampling_target_without_teinobasis.fasta - Alignment excluding Teinobasis

Pruned Trees for Analyses (data/pruned_trees/):

megalagrion_tree.tree - Megalagrion-only phylogeny
megalagrion_species_tree.tree / .nex - Megalagrion species tree
limited_sampling_full_alignment_megalagrion.tree - Megalagrion from limited sampling
limited_sampling_target_alignment_megalagrion.tree - Megalagrion target-only tree
core_coenagrionidae_genus.tree - Genus-level Coenagrionidae tree
core_coenagrionidae_genus_multiple_megalagrion.tree - Genus tree with multiple Megalagrion
core_coenagrionidae_genus_all_megalagrion.tree - Genus tree with all Megalagrion

Tree Dating Files (data/treepl/):

input_tree.nwk - Input tree for dating
dated_tree.nwk - TreePL dated tree output
dated_tree.nwk.r8s - r8s format dated tree
calibration_nodes.csv - Node calibration data
- node - Node number in phylogenetic tree
- min_age - Minimum age constraint for calibration (million years ago)
- max_age - Maximum age constraint for calibration (million years ago)
treepl.config - TreePL configuration file

Species Trees (data/astral/):

alignments/full/ - Per-locus alignments including flanking regions (92 loci)
alignments/target/ - Per-locus target-only alignments (92 loci)
Each alignment file corresponds to one target capture locus used for phylogenetic discordance analyses

BEAST Analysis Files (data/beast/):

BD_geo.xml - Birth-death with geography model specification
BD.xml - Birth-death model specification
BD_geo_prior_only_v2.xml - Prior-only analysis for birth-death with geography
BD_geo_prior_only.log - BEAST log file from prior-only analysis
BD_geo_prior_only.trees - Tree sample from prior-only analysis
BD_geo_prior_only_v2.xml.state - BEAST state file

Biogeographic Data (`data/biogeobears/`)

Contains geographic range data and dispersal constraints for biogeographic analyses. Each area model directory contains:

geography.txt - Geographic range data for each taxon
- Line 1: Number of taxa, number of areas, area codes in parentheses
- Following lines: Taxon name followed by binary presence/absence codes (1=present, 0=absent) for each area
areas_allowed.txt - Dispersal constraints between areas (time-stratified format)
- Contains multiple matrices separated by blank lines, one for each time period
- Each matrix has a header row with area codes, followed by a binary matrix (1=allowed, 0=not allowed)
- File ends with "END"
distance_matrix.txt - Geographic distances between areas (tab-delimited)
- Header row with area codes
- Following rows contain distances (kilometers)
times.txt - Time boundaries for dispersal model stratification (million years ago, one per line)

Area Models:

four_areas/ - Four-area biogeographic model (main islands only)
five_areas/ - Five-area model including extinct seamounts
six_areas/ - Six-area model with additional geographic regions
seven_areas/ - Seven-area model with full geographic detail

Diversification Analysis (`data/rpanda/`)

taxonomy_one_per_species.txt - Taxonomic data for diversification rate analyses (tab-delimited)
- Species - Species name with ID prefix
- Genus - Genus name
- Subtribe - Subtribe classification
- Tribe - Tribe classification
- Subfamily - Subfamily classification
- Group - Higher taxonomic group
- Family - Family name
- intree - Binary indicator of whether taxon is included in phylogeny (1=yes, 0=no)
Time-calibrated trees for RPANDA analyses (see data/calibration/)

Figure Data (`data/figures/`)

hawaii_bathymetry.tif - Bathymetric raster data for the Hawaiian archipelago
island_ages.csv - Ages of Hawaiian islands for temporal analyses
- island - Island name (English)
- hawaiian_name - Island name (Hawaiian)
- emerge - Age of island emergence above sea level (million years ago)
- submerge - Age when the island fully submerged (million years ago, 0 = still above water)
- height - Maximum elevation of the island (meters above sea level)
- height_label_x - X-coordinate for height label placement in figures (million years ago)
seamount_colors.csv - Color scheme for seamount visualization
- island - Island or seamount name
- color - Hex color code for visualization
six_areas_tip_states.txt - Geographic state assignments for phylogeny tips (tab-delimited)
- Row 1: Area names as column headers (Kauaʻi, ʻOahu, Molokaʻi, Lānaʻi, Maui, Hawaiʻi)
- Following rows: Taxon names with presence/absence status for each area ("absent" or area name)
world.svg - World map vector graphic
ne_10m_land/ - Natural Earth 1:10m land polygon shapefiles
hawaiian_island_vector_images/ - SVG vector outlines for each Hawaiian island
hawaiian_island_seamount_vector_images/ - SVG vector outlines including submerged seamounts

Simulation Data (`simulation/`)

NetLogo Model Files:

frogger_05.09.2024_no_aesthetics.nlogo - Agent-based model used for BehaviorSpace experiments (no visual elements, faster performance)
frogger_05.09.2024.nlogo - Full model with visualization aesthetics

Experiment Details:

50 replicate runs conducted May 31, 2024
Each run simulates 20 million years of evolution in a dynamic oceanic hotspot island system
Data collected at timestep intervals throughout each simulation run

Experiment Results:

frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv - Time-series data from all 50 runs (30 MB, 2,108 rows)
- Contains all parameters and output metrics recorded at multiple time points during each of the 50 runs
Model parameters (held constant across all runs):
- selection_intensity - Value representing how intense selection pressure is (dimensionless, value = 0.3)
- eruption_size_stdev - Proportion of eruption size equal to 1 standard deviation (cubic meters, value = 0.5)
- volcano_slope - Slope of volcanoes that are created (degrees, value = 5)
- sea_level - Elevation of sea level (meters, value = 5000)
- patch_width - Width of a patch (meters, value = 10000)
- extinction_rate - Value representing how likely populations are to go extinct (dimensionless, value = 0.2)
- competition_intensity - Intensity of competition (dimensionless, value = 0.3)
- tectonic_speed - Speed of movement of the tectonic plate (meters per year, value = 0.07)
- mutation_rate - Value representing how many additional mutations enter the population (dimensionless, value = 0.2)
- speciation_threshold - Threshold beyond which organisms form a new species (dimensionless, value = 1.75)
- subsidence_rate - Rate of subsidence at sea level (meters per year, value = 1.5E-4)
- time_scale - Time per tick (years, value = 10000)
- deposition_rate - Rate of volcanic deposition (cubic meters per year, value = 50000000)
- eruption_size_max - Maximum eruption size allowed (cubic meters, value = 50000000000000)
- eruption_freq_mean - Average frequency of eruptions (years, value = 500000)
- gene_flow_magnitude - Magnitude of gene flow (dimensionless, value = 0.3)
- eruption_freq_stdev - Standard deviation of the frequency of eruptions (years, value = 100000)
- eruption_xcor_stdev - Standard deviation of the x-coordinates of the eruptions (dimensionless, value = 4)
- erosion_rate - Rate of erosion for patches above sea level (meters per year, value = 2.5E-4)
- dispersal_rate - Value representing how often dispersal events occur (dimensionless, value = 0.3)
Output metrics (recorded at each time point for each run):
- [step] - Simulation timestep number
- time - Simulated time
- average-island-size - Mean island area
- number-of-islands - Count of islands present
- length-of-island-chain - Length of island chain
- species-count - Number of extant species
- average-species-per-area - Species density
- niche-breadth-all-species - Average ecological niche breadth across all species
- extinction-rate - Extinction rate
- speciation-rate - Speciation rate
- diversification-rate - Net diversification rate
- length species_database - Cumulative number of species ever created
- Note: Units for output metrics are not specified in raw NetLogo output
frogger_05.09.2024_no_aesthetics May_31_2024-stats.csv - Additional detailed statistics from all 50 runs (1.5 GB, 3,662,903 rows)
frogger_05.09.2024_no_aesthetics_May_31_2024-lists.csv - List-format output from BehaviorSpace (633 bytes)
frogger_05.09.2024_no_aesthetics May_31_2024-table.csv - Table-format output (empty file)

Summary Statistics:

summary_05_09_2024_behaviorspace_experiment.csv - Mean and standard deviation of final values across all 50 runs
- Contains summary statistics with units specified in column headers
- Columns (as labeled in file, with mean ± SD values):
  - island size (km^2) - 2878.29 ± 311.19
  - number of islands - 8.36 ± 0.45
  - length of island chain (km) - 821.02 ± 36.88
  - total species - 558.34 ± 764.98
  - extant species - 16.65 ± 17.17
  - number of species per 100km^2 - 0.17 ± 0.16
  - extinction rate (species/MY) - -27.08 ± 37.39
  - speciation rate (species/MY) - 27.92 ± 38.25
  - diversification rate (species/MY) - 0.83 ± 0.86
summary_05_09_2024_behaviorspace_experiment.txt - Text version of summary statistics

Analysis Scripts:

netlogo_parse_behavior_space.r - Parses BehaviorSpace results and calculates summary statistics
- Inputs: frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv
- Outputs: summary_05_09_2024_behaviorspace_experiment.csv
- Required R packages: tidyverse, readr, dplyr, ggplot2, bit, janitor

Analysis Scripts (`src/`)

The src/ directory contains R and shell scripts used to perform all analyses:

ancestral_state_reconstruction/ - corHMM scripts for breeding habitat and gill type evolution
astral/ - Species tree estimation and gene tree discordance analysis scripts
biogeobears/ - BioGeoBEARS scripts for ancestral range reconstruction (biogeobears.r, make_dispersal_matrix.r)
preliminary_calibration/ - Initial tree dating and calibration exploration
prune_trees/ - Scripts to extract taxon subsets from larger phylogenies
rpanda/ - Diversification rate analysis scripts (rpanda.r, create_publication_materials.r)
treepl/ - TreePL configuration and tree dating scripts

Analysis Method Documentation (`analysis_descriptions/`)

Detailed methodological descriptions for each analysis type:

phylogenetics_methods.md - Maximum likelihood and species tree inference methods
biogeography_methods.md - BioGeoBEARS model specifications and area definitions
diversification_methods.md - RPANDA diversification rate model descriptions
trait_evolution_methods.md - Ancestral state reconstruction methodology

Raw Phylogenetic Pipeline Output (`trees_from_cluster/`)

This directory contains unmodified output from the original phylogenetic analysis pipeline run on a computing cluster. It includes all intermediate files, scripts, and raw outputs from the phylogenomic analysis workflow.

core_coenagrionidae_full_sampling/ (298 MB):

Raw IQ-TREE phylogenetic analyses with full taxon sampling
taxa.csv - List of taxon names included in analysis (no headers, one taxon per line)
loci_counts.csv - Number of loci sequenced per taxon
- taxon - Specimen/taxon identifier
- num_loci - Number of target capture loci successfully sequenced
probe_alignments/ - 92 per-locus FASTA alignments
probe_trees/ - Gene trees for each locus
probe_orthologs/ - Ortholog sequences identified from target capture data
Concatenation scripts and supermatrix construction
ALICUT alignment trimming outputs (ALICUT_info.txt)
Processing scripts for ortholog filtering and taxon selection

core_coenagrionidae_limited_sampling/ (833 MB):

Raw IQ-TREE phylogenetic analyses with limited taxon sampling
taxa.csv - List of taxon names included in analysis (no headers, one taxon per line)
loci_counts.csv - Number of loci sequenced per taxon
- taxon - Specimen/taxon identifier
- num_loci - Number of target capture loci successfully sequenced
full_orthologs/ - Full-length sequence alignments (probe regions + flanking sequences, 92 loci with multiple files per locus)
probe_orthologs/ - Target-only alignments (probe regions only, 92 loci with aligned and unaligned versions)
full_trees/ - Per-locus gene trees from full-length alignments
probe_trees/ - Per-locus gene trees from target-only alignments
probe_alignments/ - Intermediate alignment files
concatenated_probe_alignments/ - Concatenated supermatrix files for ASTRAL analysis
partitioned_alignments/ - Partitioned alignment schemes
- full/ - Full-length partitioned alignments
- probe/ - Probe-only partitioned alignments
- flank/ - Flanking region partitioned alignments
- probe_nuclear/ - Nuclear probe partitions
reference_probes_20kb/ - Reference probe sequences
ALICUT and ALISCORE alignment filtering outputs (scripts and log files)
Concatenated supermatrix alignments and IQ-TREE best tree/model files

Note: These directories preserve the complete computational workflow as originally executed. Curated and analysis-ready versions of key outputs (trees, alignments) are provided in data/iqtree/, data/astral/, and data/pruned_trees/.

Analysis Results (`results/`)

The results/ directory contains outputs from all analyses described in the manuscript.

Phylogenetic Discordance (results/astral/):
Contains ASTRAL species tree analyses and gene tree discordance metrics organized in three subdirectories:

full/ - Analyses using full-length alignments (probe + flanking sequences)
- all/ - Full taxon set results
  - species/ - Species tree files (cladogram, rooted, ultrametric)
  - gene/ - Gene tree files for all loci
  - densitree/ - Densitree visualization files
- megalagrion/ - Megalagrion-only subset (same species/gene/densitree structure plus discordance analysis files)
  - Table1_summary_statistics.csv - Summary statistics for species tree analysis
    - Statistic - Name of summary statistic
    - Value - Value of statistic
  - Table2_concordant_genes.csv - Genes most concordant/discordant with species tree
    - Type - Category (Most concordant or Most discordant)
    - Rank - Ranking within category
    - Locus - Locus identifier
    - Locus_index - Numeric index of locus
    - RF_distance - Robinson-Foulds distance (normalized, dimensionless, range 0-1)
  - gene_tree_rf_distances.csv - Robinson-Foulds distances for all gene trees
    - gene_index - Numeric index of gene tree
    - locus - Locus identifier
    - rf_distance - Robinson-Foulds distance from species tree (normalized, dimensionless, range 0-1)
target/ - Analyses using target-only alignments (probe regions only)
- Same structure as full/ (all/ and megalagrion/ subdirectories, each with species/gene/densitree/)
gene_trees/ - Individual gene tree files organized by alignment type (full/ and target/)

Ancestral State Reconstruction (results/ancestral_state_reconstruction/):

breeding_habitat_model_selection.txt - Model comparison for breeding habitat evolution
gill_type_model_selection.txt - Model comparison for gill morphology evolution
reconstruction_breeding_habitat_*.pdf - Phylogenies with ancestral habitat states
reconstruction_gill_type_*.pdf - Phylogenies with ancestral gill type states
*.rdata - R workspace files with full reconstruction results
Transition rate matrices for character evolution models

Biogeographic Analyses (results/biogeobears/):
Four subdirectories (four_areas/, five_areas/, six_areas/, seven_areas/) each containing 12 model results:

result_*.RData - BioGeoBEARS model fit results (12 files per area model)
- 3 base models: DEC, DIVALIKE, BAYAREALIKE
- Each with 4 variants: base, +J (founder-event speciation), +x (distance-dependent dispersal), +J+x
result_table.csv - Summary table of model fit statistics
- LnL - Log-likelihood of model
- numparams - Number of parameters in model
- d - Dispersal rate (range expansion events per lineage per million years)
- e - Extinction rate (local range loss events per lineage per million years)
- j - Founder-event speciation weight (relative per-event weight at cladogenesis, dimensionless)
- x - Distance-dependent dispersal exponent (modifies dispersal probability as distance^x, dimensionless)
- AIC - Akaike Information Criterion value
- AIC_wt - Akaike weight (model probability)
stats_table_*.csv - Pairwise model comparison statistics (likelihood ratio tests)
plot_*.svg - Geographic range evolution visualizations (12 SVG files per area model, one per model variant)
Each area model tests different biogeographic scenarios (main islands only vs. including extinct seamounts)

Diversification Rate Analyses (results/rpanda/):

Table1_BestModels.csv - Best-fitting diversification models
- Taxonomic Group - Clade or taxonomic group analyzed
- Model - Best-fitting model abbreviation (e.g., BCST = constant birth rate, BVAR = time-varying birth rate)
- Type - Model description
- Param. - Number of parameters in model
- AICc - Corrected Akaike Information Criterion value
- λ (initial) - Initial speciation rate (events per lineage per million years)
- α (change) - Rate of change in speciation rate through time (per million years)
Table2_ParameterComparison.csv - Model parameter comparisons
- Taxonomic Group - Clade or taxonomic group analyzed
- Model - Model abbreviation
- λ - Estimated speciation rate (events per lineage per million years)
- Δλ - Change in speciation rate (events per lineage per million years)
- α - Rate of exponential change in speciation (per million years)
- Δα - Change in rate parameter (per million years)
Table3_ModelSelection.csv - Model selection statistics
- Combination - Node numbers where rate shifts occur
- Type - Whether shifts are at the crown or the stem nodes
- Parameters - Total number of model parameters
- AICc - Corrected Akaike Information Criterion value
- Delta_AICc - Difference in AICc from best model
estimated_shifts_subclades.csv - Rate shift estimates for clades
- Node - Node number in phylogeny where shift occurs
- Clade - Name of clade or taxonomic group
- Models - Model abbreviation
- shift location - Whether the shift is at the crown or the stem node
- Shift? - Binary indicator of rate shift presence (1 = shift present, 0 = no shift)
- Parameters - Number of model parameters
- logL - Log-likelihood of model
- AICc - Corrected Akaike Information Criterion value
- delta_AICc - Difference in AICc from best model
- Lambda - Speciation rate (events per lineage per million years)
- delta Lambda from backbone - Change in speciation rate relative to backbone (events per lineage per million years)
- Alpha - Rate of exponential change in speciation (per million years)
- delta alpha from backbone - Change in alpha parameter relative to backbone (per million years)
- Mu - Extinction rate (events per lineage per million years)
- delta mu from backbone - Change in extinction rate relative to backbone (events per lineage per million years)
- Beta - Rate of exponential change in extinction (per million years)
- delta beta from backbone - Change in beta parameter relative to backbone (per million years)
estimated_shifts_subclades_crown.csv - Rate shift estimates for crown node shifts (same variable definitions as above)
estimated_shifts_subclades_stem.csv - Rate shift estimates for stem node shifts (same variable definitions as above)
SummaryStatistics.txt - Summary of diversification analyses
PDF figures of phylogenies with rate shifts and parameter comparisons

BEAST Analyses (results/beast/):

BD_geo.tree / BD.tree - Birth-death phylogenetic MCMC samples
BD_geo.log / BD.log - BEAST parameter log files
BD_geo_20burnin.tree, BD_geo_25burnin.tree, BD_geo_40burnin.tree - Tree files with different burn-in thresholds
prior_posterior_comparison.csv - Prior vs. posterior distribution comparisons
- Parameter - BEAST model parameter name
- Mean_Prior - Mean value from prior distribution
- HPD_lower_Prior - Lower bound of 95% highest posterior density interval for prior
- HPD_upper_Prior - Upper bound of 95% highest posterior density interval for prior
- Mean_Posterior - Mean value from posterior distribution
- HPD_lower_Posterior - Lower bound of 95% highest posterior density interval for posterior
- HPD_upper_Posterior - Upper bound of 95% highest posterior density interval for posterior
- HPD_width_Prior - Width of prior HPD interval
- HPD_width_Posterior - Width of posterior HPD interval
- Width_Ratio - Ratio of posterior to prior HPD width
- Note: Birth and death rate parameters are in events per lineage per million years
prior_posterior_summary.csv - Summary statistics for prior and posterior distributions
- Parameter - BEAST model parameter name
- Type - Distribution type (Prior or Posterior)
- Mean - Mean value of parameter
- Median - Median value of parameter
- HPD_lower - Lower bound of 95% highest posterior density interval
- HPD_upper - Upper bound of 95% highest posterior density interval
- Note: Age parameters (mrca.age) are in million years; rate parameters are in events per lineage per million years
BD_geo.nwk - Maximum clade credibility tree in Newick format

Tree Dating (results/treepl/):

calibration_nodes_plot.pdf - Visualization of calibration points on phylogeny
dated_tree_summary.pdf - Summary of tree dating results

Sharing/Access Information

Links to other publicly accessible locations of the data:

Raw sequence data: NCBI Sequence Read Archive (BioProject PRJNA1265769)
- https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1265769

Data was derived from the following sources:

Fossil calibration ages derived from published literature (see data/treepl/calibration_nodes.csv)
Island emergence and subsidence ages from published geological studies

Code/Software

All analysis scripts are written in R and shell scripts. Detailed methodological descriptions for each analysis type are provided in the analysis_descriptions/ directory. Software requirements and versions are specified in software_requirements.md.

Analysis Workflow

Analyses should be run in the following order:

1. Initial Phylogenetic Pipeline (`trees_from_cluster/`)

Master script: robbie_method.sh - Orchestrates the complete phylogenetic workflow
Key steps:
- Sequence processing and quality control
- Per-locus alignment with MAFFT
- Alignment filtering with ALICUT/ALISCORE
- Gene tree estimation with IQ-TREE
- Concatenated supermatrix analysis
Inputs: Raw sequence data (not included in repository; available from NCBI BioProject PRJNA1265769)
Outputs:
- Per-locus alignments in probe_alignments/ and full_orthologs/
- Gene trees in probe_trees/ and full_trees/
- Supermatrix alignments and trees
- taxa.csv and loci_counts.csv
Note: This directory contains raw pipeline output and should not be modified

2. ASTRAL Species Tree Analysis (`src/astral/`)

Scripts:
- 01_gene_trees.sh - Generate individual gene trees
- 02_astral.sh - Run ASTRAL coalescent analysis
- 03_root_trees.r - Root trees for downstream analyses
Inputs: Gene trees from step 1, alignments from data/astral/alignments/
Outputs:
- Species trees in results/astral/
- Gene tree discordance metrics (gene_tree_rf_distances.csv, Table1_summary_statistics.csv, Table2_concordant_genes.csv)

3. Preliminary Tree Dating (`src/preliminary_calibration/`)

Purpose: Generate starting tree for BEAST analysis
Method: Fossil calibration using the chronos function in R (ape package)
Inputs: ML tree from step 1, fossil calibrations from data/treepl/calibration_nodes.csv
Outputs: Preliminary time-calibrated tree for BEAST

4. BEAST Divergence Time Estimation

XML files: data/beast/BD.xml, data/beast/BD_geo.xml
Method: Bayesian birth-death analysis with fossil and geographic calibrations
Software: Run externally using BEAST v2.6+
Inputs:
- Starting tree from step 3
- Fossil calibrations (Table 2 in manuscript)
- Geographic calibrations (island emergence ages)
Outputs:
- MCMC samples in results/beast/*.log
- Tree samples in results/beast/*.tree
- Maximum clade credibility tree in results/beast/BD_geo.nwk
- Prior/posterior comparisons (prior_posterior_comparison.csv, prior_posterior_summary.csv)
Note: This produces the main time-calibrated tree used for BioGeoBEARS and ancestral state reconstruction

5. TreePL Dating for Extended Sampling (`src/treepl/`)

Purpose: Date tree with extended taxon sampling for RPANDA
Method: Penalized likelihood dating
Inputs: ML tree with extended sampling, data/treepl/calibration_nodes.csv
Outputs:
- Time-calibrated trees in data/calibration/
- Visualizations in results/treepl/

6. Biogeographic Analysis (`src/biogeobears/`)

Scripts:
- biogeobears.r - Main BioGeoBEARS analysis
- make_dispersal_matrix.r - Generate distance-based dispersal matrices
Inputs:
- Time-calibrated tree from BEAST (step 4)
- Geographic data in data/biogeobears/ (geography.txt, distance_matrix.txt, areas_allowed.txt, times.txt)
Outputs:
- Model results in results/biogeobears/*/result_*.RData
- Model comparison tables (result_table.csv, stats_table_*.csv)
- Ancestral range visualizations (plot_*.svg)

7. Diversification Analysis (`src/rpanda/`)

Scripts:
- rpanda.r - Main RPANDA analysis
- create_publication_materials.r - Generate figures and tables
Inputs:
- Time-calibrated tree with extended sampling from TreePL (step 5)
- Taxonomy file data/rpanda/taxonomy_one_per_species.txt
Outputs:
- Model selection results (Table1_BestModels.csv, Table2_ParameterComparison.csv, Table3_ModelSelection.csv)
- Rate shift estimates (estimated_shifts_subclades.csv, estimated_shifts_subclades_crown.csv, estimated_shifts_subclades_stem.csv)
- Summary statistics and visualizations

8. Ancestral State Reconstruction (`src/ancestral_state_reconstruction/`)

Script: ancestral_state_reconstruction.r
Method: corHMM hidden Markov models
Inputs:
- ML tree from step 1
- Trait data in data/ancestral_state_reconstruction/ (breeding_habitat.csv, gill_type.csv)
Outputs:
- Model selection results (breeding_habitat_model_selection.txt, gill_type_model_selection.txt)
- Reconstructions (reconstruction_*.pdf)
- R data objects (*.rdata)

9. Agent-Based Simulation (Standalone)

Software: NetLogo v6.4.0
Model files:
- simulation/frogger_05.09.2024_no_aesthetics.nlogo - Used for BehaviorSpace experiments
- simulation/frogger_05.09.2024.nlogo - Full version with visualization
Purpose: Simulate diversification in dynamic oceanic hotspot island systems
Analysis: 50 replicate runs conducted May 31, 2024
Outputs:
- Raw BehaviorSpace results (frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv)
- Detailed statistics (frogger_05.09.2024_no_aesthetics May_31_2024-stats.csv)
- Summary statistics (summary_05_09_2024_behaviorspace_experiment.csv)
Post-processing: netlogo_parse_behavior_space.r calculates means and standard deviations across all runs

Key software and packages used:

R (version 4.0+) - Statistical computing and graphics
IQ-TREE - Maximum likelihood phylogenetic inference
ASTRAL - Species tree estimation from gene trees
BioGeoBEARS - Biogeographic analysis (R package)
RPANDA - Diversification rate analysis (R package)
BEAST - Bayesian phylogenetic inference
TreePL - Phylogenetic tree dating
corHMM - Ancestral state reconstruction (R package)
Densitree - Tree visualization (available under GPL license at https://www.cs.auckland.ac.nz/~remco/DensiTree/)

See software_requirements.md for complete version information and installation instructions.

Contact

For questions about the data or analyses:
Robert Hadfield
Email: robert7hadfield@gmail.com

Acknowledgments

We thank the collectors who provided specimens for this study and the curators at FSCA and BYU for access to specimens.

Taxon Sampling

We gathered a total of 56 specimens from all 21 extant species of Megalagrion, plus one undescribed species (Table 1). Specimens were sourced from the personal collection of Steve Jordan, the Florida State Collection of Arthropods, Brigham Young University, and fieldwork following local and state permitting requirements in Hawaiʻi. In addition to having complete sampling across the genus, we wanted to have representatives of each taxon from the different habitats in which they are found. With few exceptions, we obtained one representative of each species of Megalagrion from every major volcano, including multiple volcanoes on the same island, except the island of Hawaiʻi, where we obtained at least one specimen of each species on the island. We acknowledge the subspecies division between M. nigrohamatum nigrohamatum (Blackburn) and M. nigrohamatum nigrolineatum (Perkins) and include specimens from each subspecies. We did not obtain specimens from the two extinct species (M. jugorum (Perkins) and M. molokaiense (Perkins)) or from several extirpated populations of extant species. Historical dry specimens from both extinct species and several of these extirpated populations are present at the Bishop Museum in Honolulu, Hawaiʻi. However, DNA was not extracted from these specimens due to their taxonomic importance as irreplaceable types.

In addition to our sampling of Megalagrion, we sampled representatives from 37 of 58 valid genera across core Coenagrionidae (sensu Dijkstra et al., 2014), which included all genera previously hypothesized to be closely related to Megalagrion (Table S1) (Dumont et al., 2010; E. W. O’Grady & May, 2003). We attempted to sample all genera that may be closely related to Megalagrion. Of the 21 genera from core Coenagrionidae not included in our sampling scheme, 10 are monotypic, and the other 11 genera account for a total of 55 species. Our purpose in including this level of outgroup sampling was to determine the placement of Megalagrion within Coenagrionidae, and to help estimate molecular divergence times and ancient biogeography.

For diversification analyses with RPANDA, an extended taxon sampling set was used with additional species sampled from genera across core Coenagrionidae, totaling 411 samples (Table S2).

DNA Extraction & Sequencing

For each specimen, thoracic muscle tissue was extracted when possible; otherwise, a single leg was used. Specimens were vouchered at the Bean Life Science Museum, Brigham Young University in Provo, Utah. DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Cat. No. 69504). For at least one specimen from each genus, sequences of 1,306 nuclear loci were obtained through anchored hybrid enrichment (AHE) sequencing using a probe set designed to capture single-copy orthologs of exons across odonate genomes (Bybee et al., 2021). For the remaining samples of Megalagrion, we captured a reduced set of 90 nuclear loci (>50 kbp) as well as 2 mitochondrial loci (~1,000 bp). AHE sequencing was performed by Rapid Genomics (Gainesville, FL, USA) following their standard protocols. This genome-wide approach with multiple loci helps capture phylogenetic signals that represent the evolution of species as a whole rather than specific genes (Degnan & Rosenberg, 2006).

Sequence Processing & Quality Control

Anchored hybrid enrichment data were processed with a pipeline modeled after Breinholt et al. (2018). Raw sequence reads were cleaned and adapters were removed using fastp v0.23.2 (Chen et al., 2018). The 90 nuclear and 2 mitochondrial loci were assembled following Goodman (2023) using iterative baited assembly following Breinholt et al. (2018) with SPAdes genome assembler v3.15.4 (Bankevich et al., 2012) and the Tanypteryx hageni (Selys) reference genome (Tolman et al., 2023). A more closely related zygopteran or coenagrionid genome may have improved locus recovery and alignment accuracy. However, the anisopteran Tanypteryx hageni genome was used because probes were designed to be single copy with this genome and to be compatible with extended datasets across Odonata for further studies with broader taxon sampling. Sequences from the expanded probe set with 1,306 nuclear loci were filtered to include only the 90 nuclear loci also present in the smaller probe set. Contamination filtering was performed by excluding identical sequences from different genera using the -ublast function of USEARCH v11.0.667 (Edgar, 2010). For each locus, only one ortholog was chosen for each taxon based on bit score, length, and coverage. We excluded taxa for which fewer than 30% of the 92 loci (< 27 loci) were recovered. In our main sampling scheme, 1 of 104 samples (<1%) recovered fewer than 27 loci. In the extended sampling scheme used for RPANDA analyses, 28 of 411 samples (~7%) had fewer than 27 of the 92 loci. Some samples from both sampling schemes were sequenced with the expanded probe set of 1,306 nuclear loci, all of which were filtered to include only the 90 nuclear loci present in the smaller probe set. Anchored hybrid enrichment sequencing recovered not only the targeted loci, but also some flanking regions on either side of each locus.

To align sequences for each locus, we used the L-INS-I algorithm from MAFFT v7.520 (Katoh & Standley, 2013). To ensure that the sequences aligned properly, we aligned the target region first, including the sequence used as a reference during assembly. The reference sequence was included to identify the position of the target region of the alignment. Then, we aligned the full-length sequences, including the flanking regions, to the alignment containing only the target region using the --addlong function of MAFFT. The target-region sequences (except for the target-region reference sequence) were then removed from the alignment, leaving an alignment of the full-length sequences, plus the target-region reference sequence. Aligned sequences were filtered using Aliscore 22.ii.2012 (Kück et al., 2010; Misof & Misof, 2009) and AliCUT v2.31 (Kück, 2009), which aim to remove alignment, ambiguous, or randomly similar sites. We trimmed the ends of alignments until there was at least 50% taxon coverage.

This trimmed alignment was then split into three separate alignments, one for the target region (based on the position of the target-region reference sequence), and two for the flanking regions on each side. The two alignments for the flanking regions on either side of the target region were then concatenated. Loci with less than 50% taxon coverage were removed from further analysis. This process yielded two sets of alignments: one including only the target region, the other including only the flanking regions. The average pairwise distance between sequences in a concatenation of the flanking region alignments was more than three times the average pairwise distance between sequences in a concatenation of the target region alignments, indicating that the target and flanking regions have significantly different rates of evolution. Therefore, we chose to partition the target and flanking regions separately in phylogenetic analyses.

Phylogenetic Analysis

We performed our main phylogenetic analyses on two data sets. The first included a full alignment (both the target and flanking regions), and the second included a target-region-only alignment. We also performed phylogenetic analyses with the extended taxon sampling scheme and target-region-only alignment for RPANDA following the same procedure. For each analysis, we performed a partitioned maximum likelihood phylogenetic inference using IQ-TREE v2.2.0 (Minh et al., 2020). Our preliminary full partition scheme contained 92 or 184 partitions reflecting our alignments [92 loci × (1 target) or 92 loci × (1 target + 1 flanking)]. We used ModelFinder Plus (Kalyaanamoorthy et al., 2017) to select the optimal partition scheme by merging partitions until model fit ceased to improve, and to find the best-fitting model of evolution for each partition according to the Bayesian Information Criterion (-m MFP+MERGE option). We employed the identified partition scheme and evolutionary models to infer a maximum likelihood tree. To assess branch support, we conducted 1,000 ultrafast bootstrap replicates with nearest neighbor interchange optimization (-bnni 1000 option).

Coalescent Species Tree Analysis

We inferred species relationships using a multispecies coalescent approach implemented in ASTRAL (Mirarab & Warnow, 2015; Zhang et al., 2018) via the ASTER package v1.16.3.4 (Zhang et al., 2023). Individual gene trees were first reconstructed for each of the 92 loci from the full alignments using IQ-TREE with ModelFinder Plus to select the best-fitting substitution model for each locus. ASTRAL estimates the species tree by finding the tree that maximizes the quartet score given the input gene trees, accounting for incomplete lineage sorting under the multispecies coalescent model. Branch support was assessed using local posterior probabilities, which represent the probability of a quartet given the gene trees. To quantify gene tree discordance, we calculated Robinson-Foulds (RF) distances between each gene tree and the species tree using the phangorn package v2.11.1 (Schliep, K. P., 2011) in R. To visualize gene tree discordance in Megalagrion, we pruned trees to include only Megalagrion, then used the DensiTree function from the phangorn package to display all gene trees simultaneously as a cloudogram and compared them to the species tree generated by ASTRAL.

Divergence Time Estimation

To generate a starting tree for Bayesian divergence time estimation in BEAST, we used our maximum likelihood tree generated from the target-region-only alignment. From this tree, we reconstructed a maximum likelihood fossil calibrated tree using the chronos function from ape v5.7.1 in R (Paradis et al., 2004). We used four fossils to calibrate the tree (Agriocnemis lacteola Selys, Enallagma florissantella Cockerell, Diceratobasis worki Poinar, and Ischnura velteni Bechly, Table 2). These fossils were diagnosed using apomorphies in their respective descriptions, and none of them have been used as taxa in phylogenetic datasets previously, though I. velteni has been used for divergence time estimation (Blow et al., 2021). Each was placed at the node of the most recent common ancestor of their respective genera. Because we only had one sample from Diceratobasis Kennedy, its most recent common ancestor node was the node connecting Diceratobasis and its sister. We constrained the age of the root using a wide range around previous estimates of the age of the core Coenagrionidae (70 ± 20 MA) (Suvorov et al., 2021). We used a strict clock substitution model (discrete substitution model with lambda smoothing parameter = 1).

We used the preliminary fossil-calibration maximum likelihood tree from above as a starting tree for an unpartitioned Bayesian divergence time estimation using BEAST v2.7.6 (Bouckaert et al., 2014, 2019; Drummond & Rambaut, 2007). We used the maximum likelihood tree generated from an unpartitioned full alignment including both the target and flanking regions to constrain the topology using a multiple monophyletic constraint prior. We performed analyses with both fossil and geographic calibration (see below) priors, and with only fossil calibration priors (Table 3).

For both analyses, we used the birth-death diversification model, an uncorrelated relaxed clock substitution model, and a GTR+G+I model of evolution with all parameters estimated. We used a root age prior with a normal distribution around 70 MA with a standard deviation of 10 MA based on previous estimates of the age of the core Coenagrionidae (Suvorov et al., 2021). We used four fossil calibration points with log-normal prior distributions (Table 2, same as fossil calibrations for preliminary calibration). Log-normal fossil calibration prior distributions were offset so that the minimum was equal to the minimum age of the fossil, and μ was adjusted until the 95% quantile was near the previously estimated crown age for core Coenagrionidae (Suvorov et al., 2021). We ran the MCMC chain for at least 10,000,000 generations, and saved a tree every 5,000 generations.

For analyses with a geographic calibration, seven points were placed within Megalagrion clades with overwhelming phylogenetic evidence of following the progression rule, similar to previous studies (Haines et al., 2014; Johns et al., 2018). The emergence times of Oʻahu and Maui Nui were used for these calibration points. Within each clade, calibration points were placed at the most recent common ancestor of the Maui Nui/Hawaiʻi clade based on the emergence of Maui Nui and the most recent common ancestor of the Oʻahu + Maui Nui/Hawaiʻi clade based on the emergence of Oʻahu. The emergence time of Kauaʻi was not used for any geographic calibration points because taxa from Kauaʻi may have had ancestral ranges that once included islands that are now submerged. Geographic calibrations had a uniform prior distribution with a maximum equal to the emergence time of the island and a minimum of zero. Our geographic calibration strategy assumes that if a clade endemic to a younger range has a sister endemic to the next-oldest island, then the clade with the younger range does not predate the formation of its current range. Clades that do not have an extant sister on Kauaʻi are not included in this assumption, allowing for the possibility that the clade could have originated on Kauaʻi or older islands that are now submerged. For example, there is the possibility that the closely related terrestrially-breeding species M. oahuense and M. nesiotes have an extinct sister that once inhabited Kauaʻi or an older island; therefore, this clade is not used to aid in geographic calibration.

The MCMC results were analyzed for convergence and to determine the appropriate amount of burn-in using Tracer v1.7.2 (Rambaut et al., 2018). Maximum clade credibility trees were generated from the posterior trees using TreeAnnotator v2.7.6 with 20% burn-in. Node heights were set to common ancestor heights. To validate our calibration strategy and address potential concerns about interactions between multiple node age constraints, we performed prior-only analyses sampling from calibration priors without molecular data. We analyzed all 16 key parameters, including 12 calibration nodes (5 fossil-based, 7 geographic-based) and 4 evolutionary model parameters (birth rate, death rate, clock rate, tree height).

Ancestral Range Estimation

We performed ancestral range estimation using BioGeoBEARS v1.1.3 in R (Massana et al., 2015; N. Matzke, 2013; N. J. Matzke, 2013, 2014; Van Dam & Matzke, 2016). We use the dated tree from our BEAST analysis with both fossil and geographic calibration priors. Previous biogeographic analyses of Coenagrionidae with a higher-level taxonomic sampling, including Megalagrion and various Ischnurines, were not able to determine the geographic origin of Megalagrion. This is likely because Megalagrion is sister to the globally distributed diverse clade Ischnurinae, making it difficult to determine the origin of Megalagrion using only phylogenetic relationships and distribution data. We chose to focus our analysis on biogeographic patterns within Megalagrion rather than repeat an attempt to determine the origin of Megalagrion. We pruned the tree to include only Megalagrion. Operational taxonomic units in BioGeoBEARS analyses should be phylogenetically distinct, genetically isolated populations; therefore, we chose to preserve island-endemic populations of species present on multiple islands as separate operational taxonomic units. We collapsed multiple representatives of monophyletic island-endemic populations.

We coded islands into six areas (Fig. 1). The first area (K) included Kauaʻi, Niʻihau, and any unknown ancestral range, including islands that have since submerged. The second area (O) included Oʻahu. The third area (M) included Maui. The fourth area (L) included Lānaʻi. The fifth area (X) included Molokaʻi. The sixth area (H) included the island of Hawaiʻi. Kahoʻolawe was not included in the analysis because no Megalagrion damselflies have ever been recorded from the island. Area ‘K’ includes islands that have since submerged because some nodes in the tree are older than any of the existing main Hawaiian islands (S. Jordan et al., 2003). The islands of Maui Nui have been connected during the last glacial maximum 20 – 21 KA (Price & Elliott-Fisk, 2004). Oʻahu and Maui Nui have also been connected through Penguin Bank during periods of low sea level. Gene flow between Megalagrion populations on different islands of Maui Nui has been shown to be greater than between populations on Maui Nui and populations on the island of Hawaiʻi (Jones & Jordan, 2015; S. Jordan et al., 2005). We chose to code the islands of Maui Nui as separate areas because populations on different islands are still often phylogenetically distinct and have significant genetic differences, and marine channels represent a strong barrier to gene flow. By coding these islands separately, we also aimed to understand patterns of dispersal between the islands of Maui Nui and Oʻahu at a higher resolution.

We performed a time-stratified analysis, where areas and ranges were not allowed if none of the islands from that area had yet emerged (Table S3). The earliest estimated emergence times of islands were used to define available areas (D. Clague, 1996). Thus, area K was always available since it included submerged islands and Kauaʻi, which still exists. Area O was only available after the emergence of Oʻahu (3.7 MA). Areas M, L, and X were only available after the emergence of the earliest land of Maui Nui (Penguin Bank, 2.2 MA). Area H was only available after the emergence of the first volcano of the island of Hawaiʻi (Kohala, 0.6 MA).

To estimate the distance between islands, we drew contours at sea level around each area using the 2023 GEBCO Grid and terra v1.7.71 in R (GEBCO Compilation Group, 2023; Hijmans, 2024). We used the “distance” function from the terra R package to calculate the minimum distance between areas. These distances were used to create a distance matrix for DEC models with distance-dependent dispersal (+x). Dispersal multipliers were not applied, and the null range was not included (Massana et al., 2015).

Twelve separate models were run in BioGeoBEARS. These 12 models included the DEC, DIVALIKE, and BAYAREALIKE base models, base models with jump dispersal (+J), base models with distance-dependent dispersal (+x), and base models with both jump dispersal and distance-dependent dispersal (+J+x). All 12 models were compared using the Akaike information criterion (AIC) (Table S3), and the model with the lowest AIC value was selected as our final result. We included the +J model in our analysis because jump dispersal leading to founder event speciation is expected to have occurred in Megalagrion as individuals colonize new islands.

Diversification and Extinction

RPANDA

We performed a diversification analysis using RPANDA v2.3 (Morlon et al., 2016) in R following a recent methodological development (Mazet et al., 2023). We used a maximum-likelihood tree with expanded taxon sampling, which included approximately half of the described species for the included genera. This tree was generated using a full-length alignment including both target and flanking regions. We used ape v5.7.1 in R to collapse monophyletic species with multiple samples. We dated the tree using TreePL v1.0 (Smith & O’Meara 2012) with penalized likelihood. We applied the same four fossil calibration constraints as in our Bayesian divergence time estimation (Table 2). We ran TreePL first with the “prime” option to identify optimal smoothing and rate parameters, then ran TreePL with these parameters and the “thorough” option to ensure convergence. TreePL provides a computationally efficient alternative to Bayesian methods like BEAST, making it well-suited for this dataset with extended taxon sampling. We listed all species of core Coenagrionidae according to the World Odonata List (Paulson et al., 2024). Sampling fractions were generated using RPANDA. A backbone tree was chosen to include 10 clades of interest across the tree, all containing at least 10 species (Table 4). Note that names for these clades are not meant to be valid taxonomic suggestions but are purely for convenience. All possible shift combinations were tested using five diversification models (BCST, BVAR, BCST_DCST, BVAR_DCST, BCST_DVAR). We did not test the BVAR_DVAR model, which allows both birth and death rates to vary. The BVAR_DVAR model has been criticized for being problematic and producing unrealistic estimates, and it is not currently recommended for use with RPANDA (Burin et al., 2019; Mazet et al., 2023). We tested backbone.option with “crown.shift” and “stem.shift” and set multi.backbone to “all”. Model selection used AICc values, with the combination yielding the lowest AICc across all models considered optimal. We ran the shift.estimates function with multi.backbone = “all” to explore parameter space. For the clades with estimated shifts, we calculated the difference between their diversification parameter values from the backbone (family-wide) diversification parameters to identify the magnitude and direction of rate changes in each shifted clade.

Agent-based simulation modeling

Diversification analyses such as BAMM and RPANDA are useful tools for identifying potential patterns of speciation and extinction rates across clades and over time. However, neither of these analyses incorporates the dynamic geology of the Hawaiian Islands, which is an important driver of diversification patterns in Megalagrion, and has likely been a driver of diversification in Megalagrion for millions of years prior to the emergence of Kauaʻi. Although we have no extant populations from these sunken islands, we simulated an island system like the Hawaiian Islands and allowed a simulated lineage to evolve in this island system for tens of millions of years. Using this simulation model, we explored potential diversification patterns in a lineage like Megalagrion, and further tested our hypothesis that extinction played a significant role in the evolution of Megalagrion due to geologic change over time.

We created an agent-based simulation model in NetLogo v6.4.0 (Wilensky, 1999) following standard principles for designing agent-based models (ABMs) (Grimm & Railsback, 2012; Railsback & Grimm, 2012). The model is described following the Overview, Design concepts and Details (ODD) protocol (Supplementary methods) (Grimm, 2020; Grimm et al., 2006, 2010). The model is a generalized simulation of the evolution of a single lineage across an oceanic hotspot island system at a large time scale (10s of millions of years). The model can be divided into a geologic submodel and an evolutionary submodel. The geologic submodel models the creation, movement, and disappearance of islands in an oceanic-hotspot system. The simulated island system is modeled after patterns seen in the Hawaiian Islands, but is stochastic and generalized to simulate any oceanic hotspot system by adjusting geologic parameters. The evolutionary submodel models the evolution of a single lineage of organisms. This simulated lineage is modeled after Megalagrion, but is highly generalized and could be used to represent any lineage of organisms by adjusting biological and evolutionary parameters. Most biological and evolutionary parameters included in the model are not empirically known for Megalagrion, and/or do not correspond to easily measurable biological variables. Geologic, biological, and evolutionary parameters of the model are set manually and held constant throughout each run. However, variables such as island size, population size, number of species, extinction events, island age, endemism, the preferred niche of a species, genetic diversity, and others are not set, but are emergent from the output of the model.

To validate the geologic submodel, geologic parameters were set to values similar to those empirically found in the Hawaiian Islands. We ran 54 simulations of the model, each for 20 million years. For each run, we recorded the average length of the island chain, the average island size, the average island age, and the average number of islands throughout the simulation. The average and standard deviation of these data across all simulations were used to compare simulated island chains to the Hawaiian Islands.

At the end of each of these 54 simulations, we recorded the total number of species ever created and the number of extant species. The average and standard deviation of these data across all simulations were used to calculate the average speciation rate, extinction rate, and net diversification rate across all simulations.

Ancestral State Reconstruction

We performed stochastic character mapping on our maximum likelihood tree generated from a full alignment including both target and flanking regions using the make.simmap function of the R package phytools v2.1.1 (Revell, 2024) with the empirical Bayes method. Prior to ancestral state reconstruction, we selected the best evolutionary transition rate model by fitting three models to the data using the fitMk function (equal rates, symmetric rates, and all rates different) and evaluating model support based on AIC scores. We mapped breeding habitat and gill type separately. Characters were coded based on previous literature (Table 5) (S. Jordan et al., 2003; Polhemus, 1997; Polhemus & Asquith, 1996). Gills were coded as being foliate, lanceolate, or saccate. Breeding habitats were coded as lentic, lotic, seep, phytotelmata, or terrestrial. The breeding habitat and gill type of M. sp. (Jordan, et al. 2003) and M. mauka Daigle are not known. Therefore, two separate analyses were performed. For the first analysis, the character states of these species were given a flat prior distribution. The character states for these species were then estimated in the same way as character states were estimated for internal nodes. For the second analysis, these taxa were removed. We performed 10,000 simulations, then summarized all simulations by measuring the proportion of simulations that recovered each state at each node.