Diversification and evolution of Hawaiian Megalagrion damselflies (Pinapinao, Odonata: Coenagrionidae)
Data files
Oct 07, 2025 version files 857.11 MB
-
megalagrion_dryad_repository.zip
857.07 MB
-
README.md
35.81 KB
Abstract
Hawaiʻi’s pinapinao (Megalagrion McLachlan) comprises a radiation of 23 endemic damselfly species within Coenagrionidae. Despite being a unique study system for understanding geology’s impacts on evolutionary processes among Odonata, the understanding of these damselflies’ temporal, geographic, and phylogenetic origins remains incomplete. Testing macroevolutionary hypotheses has been hampered by conflicting topologies. To resolve these uncertainties, we performed phylogenetic analyses including divergence time estimation with 90 nuclear loci (>50 kbp) and 2 mitochondrial loci (>1 kbp), sampling representatives from 37 genera within core Coenagrionidae and 90% of Megalagrion species, including multiple island populations. We used ancestral range estimations, diversification analyses, agent-based simulation modeling, and ancestral state reconstruction to infer the group’s origin and biogeography and assess traits’ roles in diversification. Our findings indicate Megalagrion’s ancestor diverged from core Coenagrionidae early Eocene (~51 MA) and diversified early Miocene (~19 MA), suggesting Megalagrion’s MRCA predates Kauaʻi’s emergence by 7–21 MY. Diversification analyses suggest a low rate after Megalagrion diverged from Coenagrionidae, followed by a sudden increase around 19 MA, and simulation modeling supports extinction playing a significant role. Extant Megalagrion diversity is largely explained by ecological diversification into at least five clades with distinct breeding habitats that likely evolved on the Northwestern Hawaiian Islands, which are now sunken seamounts. Speciation continued as descendants dispersed to the current Hawaiian Islands as islands emerged. Species breeding in seeps further diversified within the island of Kauaʻi. Our results highlight including geologic changes over time in evolutionary studies and increase understanding of diversification patterns, biogeography, and adaptive radiation on islands. This dataset contains scripts, data, and files used to perform the study described above and published in the article "Diversification and Evolution of Hawaiian Megalagrion Damselflies (Pinapinao, Odonata: Coenagrionidae)" in Systematic Entomology.
Study Overview
This repository contains data and analysis code for a phylogenomic and biogeographic study of Hawaiian endemic damselflies (Megalagrion), investigating evolutionary relationships, inter-island colonization patterns, diversification rates, and trait evolution using target capture sequencing.
Key Findings
- Phylogenomic analysis of Megalagrion species across all main Hawaiian islands
- Biogeographic reconstruction of colonization patterns across the archipelago
- Analysis of diversification rate variation through time
- Ancestral state reconstruction of breeding habitat and gill morphology
Description of the Data and File Structure
Missing Data Codes
- NA - Not applicable; indicates that a parameter is not part of the specified model (e.g., the Alpha parameter is not estimated in constant-rate models)
Abbreviations Used
- AIC - Akaike Information Criterion (model selection statistic)
- AICc - Corrected Akaike Information Criterion (for small sample sizes)
- HPD - Highest Posterior Density (credible interval in Bayesian analysis)
- RF - Robinson-Foulds distance (phylogenetic tree distance metric)
- ML - Maximum Likelihood
- MCMC - Markov Chain Monte Carlo
- DEC - Dispersal-Extinction-Cladogenesis (biogeographic model)
- ASR - Ancestral State Reconstruction
- Ma - Million years ago
Repository Structure
This summarizes the structure of "megalagrion_dryad_repository.zip"
├── README.md # This file
├── software_requirements.md # Required software and versions
│
├── data/ # All input data files
│ ├── ancestral_state_reconstruction/ # Trait data for ASR analyses
│ ├── astral/ # Gene trees and species trees
│ ├── beast/ # BEAST analysis input files
│ ├── biogeobears/ # Geographic data and constraints
│ │ ├── four_areas/ # Four-area biogeographic model
│ │ ├── five_areas/ # Five-area biogeographic model
│ │ ├── six_areas/ # Six-area biogeographic model
│ │ └── seven_areas/ # Seven-area biogeographic model
│ ├── calibration/ # Time-calibrated trees
│ ├── figures/ # Geographic vector files and data
│ ├── iqtree/ # Maximum likelihood trees
│ ├── pruned_trees/ # Subset trees for specific analyses
│ ├── rpanda/ # Diversification analysis data
│ └── treepl/ # Tree dating input/output
│
├── src/ # Analysis scripts
│ ├── ancestral_state_reconstruction/
│ ├── astral/ # Species tree reconstruction
│ ├── biogeobears/ # Biogeographic analyses
│ ├── preliminary_calibration/ # Preliminary tree dating
│ ├── prune_trees/ # Tree pruning scripts
│ ├── rpanda/ # Diversification rate analyses
│ └── treepl/ # Tree dating scripts
│
├── results/ # Analysis outputs
│ ├── ancestral_state_reconstruction/
│ ├── astral/
│ │ ├── full/ # Full taxon sampling results
│ │ ├── target/ # Target taxon sampling results
│ │ └── gene_trees/ # Gene tree results
│ ├── beast/
│ ├── biogeobears/
│ │ ├── four_areas/ # Four-area biogeographic results
│ │ ├── five_areas/ # Five-area biogeographic results
│ │ ├── six_areas/ # Six-area biogeographic results
│ │ └── seven_areas/ # Seven-area biogeographic results
│ ├── rpanda/
│ └── treepl/
│
├── simulation/ # Agent-based model data
│
├── trees_from_cluster/ # Raw phylogenetic pipeline output (unmodified)
│ ├── core_coenagrionidae_full_sampling/ # Full taxon sampling analysis
│ └── core_coenagrionidae_limited_sampling/ # Limited taxon sampling analysis
│
└── analysis_descriptions/ # Detailed method descriptions
├── phylogenetics_methods.md
├── biogeography_methods.md
├── diversification_methods.md
└── trait_evolution_methods.md
Data Description
Ancestral State Reconstruction (data/ancestral_state_reconstruction/)
breeding_habitat.csv- Species breeding habitat classifications (lentic/lotic)breeding_habitat- Habitat type where larvae develop (lentic = still water, lotic = flowing streams, seep = seepage areas, phytotelmata = plant-held water, terrestrial = terrestrial substrates)
breeding_habitats_full_megalagrion.csv- List of specimen IDs for all Megalagrion taxa included in ancestral state reconstruction (one specimen ID per line, no headers)gill_type.csv- Larval gill morphology classificationsgill_type- Morphology of larval gills (foliate = leaf-like, lanceolate = lance-shaped, saccate = sac-like)
Phylogenetic Trees
Time-Calibrated Trees (data/calibration/):
calibrated.tree/calibrated.nwk- Primary time-calibrated phylogenyultrametric_core_coenagrionidae.tree/.nwk- Ultrametric tree for core Coenagrionidaecalibrated_full_core_coenagrionidae.tree/.nwk- Full taxon sampling calibrated treecalibrated_full_core_coenagrionidae_species.tree/.nwk- Species-level calibrated treecalibrated_pseudagrion_and_relatives.tree/.nwk- Pseudagrion clade calibrated treecalibrated_teinobasis_and_relatives.tree/.nwk- Teinobasis clade calibrated tree
Maximum Likelihood Trees (data/iqtree/):
full_sampling_target_alignment.tree- ML tree from full taxon samplingfull_sampling_target_alignment_rooted.tree- Rooted ML treelimited_sampling_full_alignment.tree- ML tree from limited sampling with full alignmentlimited_sampling_target_alignment.tree- ML tree from limited samplingsupermatrix_limited_sampling_target_alignment.fasta- Concatenated alignmentsupermatrix_limited_sampling_target_without_teinobasis.fasta- Alignment excluding Teinobasis
Pruned Trees for Analyses (data/pruned_trees/):
megalagrion_tree.tree- Megalagrion-only phylogenymegalagrion_species_tree.tree/.nex- Megalagrion species treelimited_sampling_full_alignment_megalagrion.tree- Megalagrion from limited samplinglimited_sampling_target_alignment_megalagrion.tree- Megalagrion target-only treecore_coenagrionidae_genus.tree- Genus-level Coenagrionidae treecore_coenagrionidae_genus_multiple_megalagrion.tree- Genus tree with multiple Megalagrioncore_coenagrionidae_genus_all_megalagrion.tree- Genus tree with all Megalagrion
Tree Dating Files (data/treepl/):
input_tree.nwk- Input tree for datingdated_tree.nwk- TreePL dated tree outputdated_tree.nwk.r8s- r8s format dated treecalibration_nodes.csv- Node calibration datanode- Node number in phylogenetic treemin_age- Minimum age constraint for calibration (million years ago)max_age- Maximum age constraint for calibration (million years ago)
treepl.config- TreePL configuration file
Species Trees (data/astral/):
alignments/full/- Per-locus alignments including flanking regions (92 loci)alignments/target/- Per-locus target-only alignments (92 loci)- Each alignment file corresponds to one target capture locus used for phylogenetic discordance analyses
BEAST Analysis Files (data/beast/):
BD_geo.xml- Birth-death with geography model specificationBD.xml- Birth-death model specificationBD_geo_prior_only_v2.xml- Prior-only analysis for birth-death with geographyBD_geo_prior_only.log- BEAST log file from prior-only analysisBD_geo_prior_only.trees- Tree sample from prior-only analysisBD_geo_prior_only_v2.xml.state- BEAST state file
Biogeographic Data (data/biogeobears/)
Contains geographic range data and dispersal constraints for biogeographic analyses. Each area model directory contains:
geography.txt- Geographic range data for each taxon- Line 1: Number of taxa, number of areas, area codes in parentheses
- Following lines: Taxon name followed by binary presence/absence codes (1=present, 0=absent) for each area
areas_allowed.txt- Dispersal constraints between areas (time-stratified format)- Contains multiple matrices separated by blank lines, one for each time period
- Each matrix has a header row with area codes, followed by a binary matrix (1=allowed, 0=not allowed)
- File ends with "END"
distance_matrix.txt- Geographic distances between areas (tab-delimited)- Header row with area codes
- Following rows contain distances (kilometers)
times.txt- Time boundaries for dispersal model stratification (million years ago, one per line)
Area Models:
four_areas/- Four-area biogeographic model (main islands only)five_areas/- Five-area model including extinct seamountssix_areas/- Six-area model with additional geographic regionsseven_areas/- Seven-area model with full geographic detail
Diversification Analysis (data/rpanda/)
taxonomy_one_per_species.txt- Taxonomic data for diversification rate analyses (tab-delimited)Species- Species name with ID prefixGenus- Genus nameSubtribe- Subtribe classificationTribe- Tribe classificationSubfamily- Subfamily classificationGroup- Higher taxonomic groupFamily- Family nameintree- Binary indicator of whether taxon is included in phylogeny (1=yes, 0=no)
- Time-calibrated trees for RPANDA analyses (see
data/calibration/)
Figure Data (data/figures/)
hawaii_bathymetry.tif- Bathymetric raster data for the Hawaiian archipelagoisland_ages.csv- Ages of Hawaiian islands for temporal analysesisland- Island name (English)hawaiian_name- Island name (Hawaiian)emerge- Age of island emergence above sea level (million years ago)submerge- Age when the island fully submerged (million years ago, 0 = still above water)height- Maximum elevation of the island (meters above sea level)height_label_x- X-coordinate for height label placement in figures (million years ago)
seamount_colors.csv- Color scheme for seamount visualizationisland- Island or seamount namecolor- Hex color code for visualization
six_areas_tip_states.txt- Geographic state assignments for phylogeny tips (tab-delimited)- Row 1: Area names as column headers (Kauaʻi, ʻOahu, Molokaʻi, Lānaʻi, Maui, Hawaiʻi)
- Following rows: Taxon names with presence/absence status for each area ("absent" or area name)
world.svg- World map vector graphicne_10m_land/- Natural Earth 1:10m land polygon shapefileshawaiian_island_vector_images/- SVG vector outlines for each Hawaiian islandhawaiian_island_seamount_vector_images/- SVG vector outlines including submerged seamounts
Simulation Data (simulation/)
NetLogo Model Files:
frogger_05.09.2024_no_aesthetics.nlogo- Agent-based model used for BehaviorSpace experiments (no visual elements, faster performance)frogger_05.09.2024.nlogo- Full model with visualization aesthetics
Experiment Details:
- 50 replicate runs conducted May 31, 2024
- Each run simulates 20 million years of evolution in a dynamic oceanic hotspot island system
- Data collected at timestep intervals throughout each simulation run
Experiment Results:
-
frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv- Time-series data from all 50 runs (30 MB, 2,108 rows)- Contains all parameters and output metrics recorded at multiple time points during each of the 50 runs
Model parameters (held constant across all runs):
selection_intensity- Value representing how intense selection pressure is (dimensionless, value = 0.3)eruption_size_stdev- Proportion of eruption size equal to 1 standard deviation (cubic meters, value = 0.5)volcano_slope- Slope of volcanoes that are created (degrees, value = 5)sea_level- Elevation of sea level (meters, value = 5000)patch_width- Width of a patch (meters, value = 10000)extinction_rate- Value representing how likely populations are to go extinct (dimensionless, value = 0.2)competition_intensity- Intensity of competition (dimensionless, value = 0.3)tectonic_speed- Speed of movement of the tectonic plate (meters per year, value = 0.07)mutation_rate- Value representing how many additional mutations enter the population (dimensionless, value = 0.2)speciation_threshold- Threshold beyond which organisms form a new species (dimensionless, value = 1.75)subsidence_rate- Rate of subsidence at sea level (meters per year, value = 1.5E-4)time_scale- Time per tick (years, value = 10000)deposition_rate- Rate of volcanic deposition (cubic meters per year, value = 50000000)eruption_size_max- Maximum eruption size allowed (cubic meters, value = 50000000000000)eruption_freq_mean- Average frequency of eruptions (years, value = 500000)gene_flow_magnitude- Magnitude of gene flow (dimensionless, value = 0.3)eruption_freq_stdev- Standard deviation of the frequency of eruptions (years, value = 100000)eruption_xcor_stdev- Standard deviation of the x-coordinates of the eruptions (dimensionless, value = 4)erosion_rate- Rate of erosion for patches above sea level (meters per year, value = 2.5E-4)dispersal_rate- Value representing how often dispersal events occur (dimensionless, value = 0.3)
Output metrics (recorded at each time point for each run):
[step]- Simulation timestep numbertime- Simulated timeaverage-island-size- Mean island areanumber-of-islands- Count of islands presentlength-of-island-chain- Length of island chainspecies-count- Number of extant speciesaverage-species-per-area- Species densityniche-breadth-all-species- Average ecological niche breadth across all speciesextinction-rate- Extinction ratespeciation-rate- Speciation ratediversification-rate- Net diversification ratelength species_database- Cumulative number of species ever created- Note: Units for output metrics are not specified in raw NetLogo output
-
frogger_05.09.2024_no_aesthetics May_31_2024-stats.csv- Additional detailed statistics from all 50 runs (1.5 GB, 3,662,903 rows) -
frogger_05.09.2024_no_aesthetics_May_31_2024-lists.csv- List-format output from BehaviorSpace (633 bytes) -
frogger_05.09.2024_no_aesthetics May_31_2024-table.csv- Table-format output (empty file)
Summary Statistics:
summary_05_09_2024_behaviorspace_experiment.csv- Mean and standard deviation of final values across all 50 runs- Contains summary statistics with units specified in column headers
- Columns (as labeled in file, with mean ± SD values):
island size (km^2)- 2878.29 ± 311.19number of islands- 8.36 ± 0.45length of island chain (km)- 821.02 ± 36.88total species- 558.34 ± 764.98extant species- 16.65 ± 17.17number of species per 100km^2- 0.17 ± 0.16extinction rate (species/MY)- -27.08 ± 37.39speciation rate (species/MY)- 27.92 ± 38.25diversification rate (species/MY)- 0.83 ± 0.86
summary_05_09_2024_behaviorspace_experiment.txt- Text version of summary statistics
Analysis Scripts:
netlogo_parse_behavior_space.r- Parses BehaviorSpace results and calculates summary statistics- Inputs:
frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv - Outputs:
summary_05_09_2024_behaviorspace_experiment.csv - Required R packages: tidyverse, readr, dplyr, ggplot2, bit, janitor
- Inputs:
Analysis Scripts (src/)
The src/ directory contains R and shell scripts used to perform all analyses:
ancestral_state_reconstruction/- corHMM scripts for breeding habitat and gill type evolutionastral/- Species tree estimation and gene tree discordance analysis scriptsbiogeobears/- BioGeoBEARS scripts for ancestral range reconstruction (biogeobears.r,make_dispersal_matrix.r)preliminary_calibration/- Initial tree dating and calibration explorationprune_trees/- Scripts to extract taxon subsets from larger phylogeniesrpanda/- Diversification rate analysis scripts (rpanda.r,create_publication_materials.r)treepl/- TreePL configuration and tree dating scripts
Analysis Method Documentation (analysis_descriptions/)
Detailed methodological descriptions for each analysis type:
phylogenetics_methods.md- Maximum likelihood and species tree inference methodsbiogeography_methods.md- BioGeoBEARS model specifications and area definitionsdiversification_methods.md- RPANDA diversification rate model descriptionstrait_evolution_methods.md- Ancestral state reconstruction methodology
Raw Phylogenetic Pipeline Output (trees_from_cluster/)
This directory contains unmodified output from the original phylogenetic analysis pipeline run on a computing cluster. It includes all intermediate files, scripts, and raw outputs from the phylogenomic analysis workflow.
core_coenagrionidae_full_sampling/ (298 MB):
- Raw IQ-TREE phylogenetic analyses with full taxon sampling
taxa.csv- List of taxon names included in analysis (no headers, one taxon per line)loci_counts.csv- Number of loci sequenced per taxontaxon- Specimen/taxon identifiernum_loci- Number of target capture loci successfully sequenced
probe_alignments/- 92 per-locus FASTA alignmentsprobe_trees/- Gene trees for each locusprobe_orthologs/- Ortholog sequences identified from target capture data- Concatenation scripts and supermatrix construction
- ALICUT alignment trimming outputs (ALICUT_info.txt)
- Processing scripts for ortholog filtering and taxon selection
core_coenagrionidae_limited_sampling/ (833 MB):
- Raw IQ-TREE phylogenetic analyses with limited taxon sampling
taxa.csv- List of taxon names included in analysis (no headers, one taxon per line)loci_counts.csv- Number of loci sequenced per taxontaxon- Specimen/taxon identifiernum_loci- Number of target capture loci successfully sequenced
full_orthologs/- Full-length sequence alignments (probe regions + flanking sequences, 92 loci with multiple files per locus)probe_orthologs/- Target-only alignments (probe regions only, 92 loci with aligned and unaligned versions)full_trees/- Per-locus gene trees from full-length alignmentsprobe_trees/- Per-locus gene trees from target-only alignmentsprobe_alignments/- Intermediate alignment filesconcatenated_probe_alignments/- Concatenated supermatrix files for ASTRAL analysispartitioned_alignments/- Partitioned alignment schemesfull/- Full-length partitioned alignmentsprobe/- Probe-only partitioned alignmentsflank/- Flanking region partitioned alignmentsprobe_nuclear/- Nuclear probe partitions
reference_probes_20kb/- Reference probe sequences- ALICUT and ALISCORE alignment filtering outputs (scripts and log files)
- Concatenated supermatrix alignments and IQ-TREE best tree/model files
Note: These directories preserve the complete computational workflow as originally executed. Curated and analysis-ready versions of key outputs (trees, alignments) are provided in data/iqtree/, data/astral/, and data/pruned_trees/.
Analysis Results (results/)
The results/ directory contains outputs from all analyses described in the manuscript.
Phylogenetic Discordance (results/astral/):
Contains ASTRAL species tree analyses and gene tree discordance metrics organized in three subdirectories:
full/- Analyses using full-length alignments (probe + flanking sequences)all/- Full taxon set resultsspecies/- Species tree files (cladogram, rooted, ultrametric)gene/- Gene tree files for all locidensitree/- Densitree visualization files
megalagrion/- Megalagrion-only subset (same species/gene/densitree structure plus discordance analysis files)Table1_summary_statistics.csv- Summary statistics for species tree analysisStatistic- Name of summary statisticValue- Value of statistic
Table2_concordant_genes.csv- Genes most concordant/discordant with species treeType- Category (Most concordant or Most discordant)Rank- Ranking within categoryLocus- Locus identifierLocus_index- Numeric index of locusRF_distance- Robinson-Foulds distance (normalized, dimensionless, range 0-1)
gene_tree_rf_distances.csv- Robinson-Foulds distances for all gene treesgene_index- Numeric index of gene treelocus- Locus identifierrf_distance- Robinson-Foulds distance from species tree (normalized, dimensionless, range 0-1)
target/- Analyses using target-only alignments (probe regions only)- Same structure as
full/(all/ and megalagrion/ subdirectories, each with species/gene/densitree/)
- Same structure as
gene_trees/- Individual gene tree files organized by alignment type (full/ and target/)
Ancestral State Reconstruction (results/ancestral_state_reconstruction/):
breeding_habitat_model_selection.txt- Model comparison for breeding habitat evolutiongill_type_model_selection.txt- Model comparison for gill morphology evolutionreconstruction_breeding_habitat_*.pdf- Phylogenies with ancestral habitat statesreconstruction_gill_type_*.pdf- Phylogenies with ancestral gill type states*.rdata- R workspace files with full reconstruction results- Transition rate matrices for character evolution models
Biogeographic Analyses (results/biogeobears/):
Four subdirectories (four_areas/, five_areas/, six_areas/, seven_areas/) each containing 12 model results:
result_*.RData- BioGeoBEARS model fit results (12 files per area model)- 3 base models: DEC, DIVALIKE, BAYAREALIKE
- Each with 4 variants: base, +J (founder-event speciation), +x (distance-dependent dispersal), +J+x
result_table.csv- Summary table of model fit statisticsLnL- Log-likelihood of modelnumparams- Number of parameters in modeld- Dispersal rate (range expansion events per lineage per million years)e- Extinction rate (local range loss events per lineage per million years)j- Founder-event speciation weight (relative per-event weight at cladogenesis, dimensionless)x- Distance-dependent dispersal exponent (modifies dispersal probability as distance^x, dimensionless)AIC- Akaike Information Criterion valueAIC_wt- Akaike weight (model probability)
stats_table_*.csv- Pairwise model comparison statistics (likelihood ratio tests)plot_*.svg- Geographic range evolution visualizations (12 SVG files per area model, one per model variant)- Each area model tests different biogeographic scenarios (main islands only vs. including extinct seamounts)
Diversification Rate Analyses (results/rpanda/):
Table1_BestModels.csv- Best-fitting diversification modelsTaxonomic Group- Clade or taxonomic group analyzedModel- Best-fitting model abbreviation (e.g., BCST = constant birth rate, BVAR = time-varying birth rate)Type- Model descriptionParam.- Number of parameters in modelAICc- Corrected Akaike Information Criterion valueλ (initial)- Initial speciation rate (events per lineage per million years)α (change)- Rate of change in speciation rate through time (per million years)
Table2_ParameterComparison.csv- Model parameter comparisonsTaxonomic Group- Clade or taxonomic group analyzedModel- Model abbreviationλ- Estimated speciation rate (events per lineage per million years)Δλ- Change in speciation rate (events per lineage per million years)α- Rate of exponential change in speciation (per million years)Δα- Change in rate parameter (per million years)
Table3_ModelSelection.csv- Model selection statisticsCombination- Node numbers where rate shifts occurType- Whether shifts are at the crown or the stem nodesParameters- Total number of model parametersAICc- Corrected Akaike Information Criterion valueDelta_AICc- Difference in AICc from best model
estimated_shifts_subclades.csv- Rate shift estimates for cladesNode- Node number in phylogeny where shift occursClade- Name of clade or taxonomic groupModels- Model abbreviationshift location- Whether the shift is at the crown or the stem nodeShift?- Binary indicator of rate shift presence (1 = shift present, 0 = no shift)Parameters- Number of model parameterslogL- Log-likelihood of modelAICc- Corrected Akaike Information Criterion valuedelta_AICc- Difference in AICc from best modelLambda- Speciation rate (events per lineage per million years)delta Lambda from backbone- Change in speciation rate relative to backbone (events per lineage per million years)Alpha- Rate of exponential change in speciation (per million years)delta alpha from backbone- Change in alpha parameter relative to backbone (per million years)Mu- Extinction rate (events per lineage per million years)delta mu from backbone- Change in extinction rate relative to backbone (events per lineage per million years)Beta- Rate of exponential change in extinction (per million years)delta beta from backbone- Change in beta parameter relative to backbone (per million years)
estimated_shifts_subclades_crown.csv- Rate shift estimates for crown node shifts (same variable definitions as above)estimated_shifts_subclades_stem.csv- Rate shift estimates for stem node shifts (same variable definitions as above)SummaryStatistics.txt- Summary of diversification analyses- PDF figures of phylogenies with rate shifts and parameter comparisons
BEAST Analyses (results/beast/):
BD_geo.tree/BD.tree- Birth-death phylogenetic MCMC samplesBD_geo.log/BD.log- BEAST parameter log filesBD_geo_20burnin.tree,BD_geo_25burnin.tree,BD_geo_40burnin.tree- Tree files with different burn-in thresholdsprior_posterior_comparison.csv- Prior vs. posterior distribution comparisonsParameter- BEAST model parameter nameMean_Prior- Mean value from prior distributionHPD_lower_Prior- Lower bound of 95% highest posterior density interval for priorHPD_upper_Prior- Upper bound of 95% highest posterior density interval for priorMean_Posterior- Mean value from posterior distributionHPD_lower_Posterior- Lower bound of 95% highest posterior density interval for posteriorHPD_upper_Posterior- Upper bound of 95% highest posterior density interval for posteriorHPD_width_Prior- Width of prior HPD intervalHPD_width_Posterior- Width of posterior HPD intervalWidth_Ratio- Ratio of posterior to prior HPD width- Note: Birth and death rate parameters are in events per lineage per million years
prior_posterior_summary.csv- Summary statistics for prior and posterior distributionsParameter- BEAST model parameter nameType- Distribution type (Prior or Posterior)Mean- Mean value of parameterMedian- Median value of parameterHPD_lower- Lower bound of 95% highest posterior density intervalHPD_upper- Upper bound of 95% highest posterior density interval- Note: Age parameters (mrca.age) are in million years; rate parameters are in events per lineage per million years
BD_geo.nwk- Maximum clade credibility tree in Newick format
Tree Dating (results/treepl/):
calibration_nodes_plot.pdf- Visualization of calibration points on phylogenydated_tree_summary.pdf- Summary of tree dating results
Sharing/Access Information
Links to other publicly accessible locations of the data:
- Raw sequence data: NCBI Sequence Read Archive (BioProject PRJNA1265769)
Data was derived from the following sources:
- Fossil calibration ages derived from published literature (see
data/treepl/calibration_nodes.csv) - Island emergence and subsidence ages from published geological studies
Code/Software
All analysis scripts are written in R and shell scripts. Detailed methodological descriptions for each analysis type are provided in the analysis_descriptions/ directory. Software requirements and versions are specified in software_requirements.md.
Analysis Workflow
Analyses should be run in the following order:
1. Initial Phylogenetic Pipeline (trees_from_cluster/)
- Master script:
robbie_method.sh- Orchestrates the complete phylogenetic workflow - Key steps:
- Sequence processing and quality control
- Per-locus alignment with MAFFT
- Alignment filtering with ALICUT/ALISCORE
- Gene tree estimation with IQ-TREE
- Concatenated supermatrix analysis
- Inputs: Raw sequence data (not included in repository; available from NCBI BioProject PRJNA1265769)
- Outputs:
- Per-locus alignments in
probe_alignments/andfull_orthologs/ - Gene trees in
probe_trees/andfull_trees/ - Supermatrix alignments and trees
taxa.csvandloci_counts.csv
- Per-locus alignments in
- Note: This directory contains raw pipeline output and should not be modified
2. ASTRAL Species Tree Analysis (src/astral/)
- Scripts:
01_gene_trees.sh- Generate individual gene trees02_astral.sh- Run ASTRAL coalescent analysis03_root_trees.r- Root trees for downstream analyses
- Inputs: Gene trees from step 1, alignments from
data/astral/alignments/ - Outputs:
- Species trees in
results/astral/ - Gene tree discordance metrics (
gene_tree_rf_distances.csv,Table1_summary_statistics.csv,Table2_concordant_genes.csv)
- Species trees in
3. Preliminary Tree Dating (src/preliminary_calibration/)
- Purpose: Generate starting tree for BEAST analysis
- Method: Fossil calibration using the chronos function in R (ape package)
- Inputs: ML tree from step 1, fossil calibrations from
data/treepl/calibration_nodes.csv - Outputs: Preliminary time-calibrated tree for BEAST
4. BEAST Divergence Time Estimation
- XML files:
data/beast/BD.xml,data/beast/BD_geo.xml - Method: Bayesian birth-death analysis with fossil and geographic calibrations
- Software: Run externally using BEAST v2.6+
- Inputs:
- Starting tree from step 3
- Fossil calibrations (Table 2 in manuscript)
- Geographic calibrations (island emergence ages)
- Outputs:
- MCMC samples in
results/beast/*.log - Tree samples in
results/beast/*.tree - Maximum clade credibility tree in
results/beast/BD_geo.nwk - Prior/posterior comparisons (
prior_posterior_comparison.csv,prior_posterior_summary.csv)
- MCMC samples in
- Note: This produces the main time-calibrated tree used for BioGeoBEARS and ancestral state reconstruction
5. TreePL Dating for Extended Sampling (src/treepl/)
- Purpose: Date tree with extended taxon sampling for RPANDA
- Method: Penalized likelihood dating
- Inputs: ML tree with extended sampling,
data/treepl/calibration_nodes.csv - Outputs:
- Time-calibrated trees in
data/calibration/ - Visualizations in
results/treepl/
- Time-calibrated trees in
6. Biogeographic Analysis (src/biogeobears/)
- Scripts:
biogeobears.r- Main BioGeoBEARS analysismake_dispersal_matrix.r- Generate distance-based dispersal matrices
- Inputs:
- Time-calibrated tree from BEAST (step 4)
- Geographic data in
data/biogeobears/(geography.txt, distance_matrix.txt, areas_allowed.txt, times.txt)
- Outputs:
- Model results in
results/biogeobears/*/result_*.RData - Model comparison tables (
result_table.csv,stats_table_*.csv) - Ancestral range visualizations (
plot_*.svg)
- Model results in
7. Diversification Analysis (src/rpanda/)
- Scripts:
rpanda.r- Main RPANDA analysiscreate_publication_materials.r- Generate figures and tables
- Inputs:
- Time-calibrated tree with extended sampling from TreePL (step 5)
- Taxonomy file
data/rpanda/taxonomy_one_per_species.txt
- Outputs:
- Model selection results (
Table1_BestModels.csv,Table2_ParameterComparison.csv,Table3_ModelSelection.csv) - Rate shift estimates (
estimated_shifts_subclades.csv,estimated_shifts_subclades_crown.csv,estimated_shifts_subclades_stem.csv) - Summary statistics and visualizations
- Model selection results (
8. Ancestral State Reconstruction (src/ancestral_state_reconstruction/)
- Script:
ancestral_state_reconstruction.r - Method: corHMM hidden Markov models
- Inputs:
- ML tree from step 1
- Trait data in
data/ancestral_state_reconstruction/(breeding_habitat.csv, gill_type.csv)
- Outputs:
- Model selection results (
breeding_habitat_model_selection.txt,gill_type_model_selection.txt) - Reconstructions (
reconstruction_*.pdf) - R data objects (
*.rdata)
- Model selection results (
9. Agent-Based Simulation (Standalone)
- Software: NetLogo v6.4.0
- Model files:
simulation/frogger_05.09.2024_no_aesthetics.nlogo- Used for BehaviorSpace experimentssimulation/frogger_05.09.2024.nlogo- Full version with visualization
- Purpose: Simulate diversification in dynamic oceanic hotspot island systems
- Analysis: 50 replicate runs conducted May 31, 2024
- Outputs:
- Raw BehaviorSpace results (
frogger_05.09.2024_no_aesthetics_May_31_2024-spreadsheet.csv) - Detailed statistics (
frogger_05.09.2024_no_aesthetics May_31_2024-stats.csv) - Summary statistics (
summary_05_09_2024_behaviorspace_experiment.csv)
- Raw BehaviorSpace results (
- Post-processing:
netlogo_parse_behavior_space.rcalculates means and standard deviations across all runs
Key software and packages used:
- R (version 4.0+) - Statistical computing and graphics
- IQ-TREE - Maximum likelihood phylogenetic inference
- ASTRAL - Species tree estimation from gene trees
- BioGeoBEARS - Biogeographic analysis (R package)
- RPANDA - Diversification rate analysis (R package)
- BEAST - Bayesian phylogenetic inference
- TreePL - Phylogenetic tree dating
- corHMM - Ancestral state reconstruction (R package)
- Densitree - Tree visualization (available under GPL license at https://www.cs.auckland.ac.nz/~remco/DensiTree/)
See software_requirements.md for complete version information and installation instructions.
Contact
For questions about the data or analyses:
Robert Hadfield
Email: robert7hadfield@gmail.com
Acknowledgments
We thank the collectors who provided specimens for this study and the curators at FSCA and BYU for access to specimens.
Taxon Sampling
We gathered a total of 56 specimens from all 21 extant species of Megalagrion, plus one undescribed species (Table 1). Specimens were sourced from the personal collection of Steve Jordan, the Florida State Collection of Arthropods, Brigham Young University, and fieldwork following local and state permitting requirements in Hawaiʻi. In addition to having complete sampling across the genus, we wanted to have representatives of each taxon from the different habitats in which they are found. With few exceptions, we obtained one representative of each species of Megalagrion from every major volcano, including multiple volcanoes on the same island, except the island of Hawaiʻi, where we obtained at least one specimen of each species on the island. We acknowledge the subspecies division between M. nigrohamatum nigrohamatum (Blackburn) and M. nigrohamatum nigrolineatum (Perkins) and include specimens from each subspecies. We did not obtain specimens from the two extinct species (M. jugorum (Perkins) and M. molokaiense (Perkins)) or from several extirpated populations of extant species. Historical dry specimens from both extinct species and several of these extirpated populations are present at the Bishop Museum in Honolulu, Hawaiʻi. However, DNA was not extracted from these specimens due to their taxonomic importance as irreplaceable types.
In addition to our sampling of Megalagrion, we sampled representatives from 37 of 58 valid genera across core Coenagrionidae (sensu Dijkstra et al., 2014), which included all genera previously hypothesized to be closely related to Megalagrion (Table S1) (Dumont et al., 2010; E. W. O’Grady & May, 2003). We attempted to sample all genera that may be closely related to Megalagrion. Of the 21 genera from core Coenagrionidae not included in our sampling scheme, 10 are monotypic, and the other 11 genera account for a total of 55 species. Our purpose in including this level of outgroup sampling was to determine the placement of Megalagrion within Coenagrionidae, and to help estimate molecular divergence times and ancient biogeography.
For diversification analyses with RPANDA, an extended taxon sampling set was used with additional species sampled from genera across core Coenagrionidae, totaling 411 samples (Table S2).
DNA Extraction & Sequencing
For each specimen, thoracic muscle tissue was extracted when possible; otherwise, a single leg was used. Specimens were vouchered at the Bean Life Science Museum, Brigham Young University in Provo, Utah. DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Cat. No. 69504). For at least one specimen from each genus, sequences of 1,306 nuclear loci were obtained through anchored hybrid enrichment (AHE) sequencing using a probe set designed to capture single-copy orthologs of exons across odonate genomes (Bybee et al., 2021). For the remaining samples of Megalagrion, we captured a reduced set of 90 nuclear loci (>50 kbp) as well as 2 mitochondrial loci (~1,000 bp). AHE sequencing was performed by Rapid Genomics (Gainesville, FL, USA) following their standard protocols. This genome-wide approach with multiple loci helps capture phylogenetic signals that represent the evolution of species as a whole rather than specific genes (Degnan & Rosenberg, 2006).
Sequence Processing & Quality Control
Anchored hybrid enrichment data were processed with a pipeline modeled after Breinholt et al. (2018). Raw sequence reads were cleaned and adapters were removed using fastp v0.23.2 (Chen et al., 2018). The 90 nuclear and 2 mitochondrial loci were assembled following Goodman (2023) using iterative baited assembly following Breinholt et al. (2018) with SPAdes genome assembler v3.15.4 (Bankevich et al., 2012) and the Tanypteryx hageni (Selys) reference genome (Tolman et al., 2023). A more closely related zygopteran or coenagrionid genome may have improved locus recovery and alignment accuracy. However, the anisopteran Tanypteryx hageni genome was used because probes were designed to be single copy with this genome and to be compatible with extended datasets across Odonata for further studies with broader taxon sampling. Sequences from the expanded probe set with 1,306 nuclear loci were filtered to include only the 90 nuclear loci also present in the smaller probe set. Contamination filtering was performed by excluding identical sequences from different genera using the -ublast function of USEARCH v11.0.667 (Edgar, 2010). For each locus, only one ortholog was chosen for each taxon based on bit score, length, and coverage. We excluded taxa for which fewer than 30% of the 92 loci (< 27 loci) were recovered. In our main sampling scheme, 1 of 104 samples (<1%) recovered fewer than 27 loci. In the extended sampling scheme used for RPANDA analyses, 28 of 411 samples (~7%) had fewer than 27 of the 92 loci. Some samples from both sampling schemes were sequenced with the expanded probe set of 1,306 nuclear loci, all of which were filtered to include only the 90 nuclear loci present in the smaller probe set. Anchored hybrid enrichment sequencing recovered not only the targeted loci, but also some flanking regions on either side of each locus.
To align sequences for each locus, we used the L-INS-I algorithm from MAFFT v7.520 (Katoh & Standley, 2013). To ensure that the sequences aligned properly, we aligned the target region first, including the sequence used as a reference during assembly. The reference sequence was included to identify the position of the target region of the alignment. Then, we aligned the full-length sequences, including the flanking regions, to the alignment containing only the target region using the --addlong function of MAFFT. The target-region sequences (except for the target-region reference sequence) were then removed from the alignment, leaving an alignment of the full-length sequences, plus the target-region reference sequence. Aligned sequences were filtered using Aliscore 22.ii.2012 (Kück et al., 2010; Misof & Misof, 2009) and AliCUT v2.31 (Kück, 2009), which aim to remove alignment, ambiguous, or randomly similar sites. We trimmed the ends of alignments until there was at least 50% taxon coverage.
This trimmed alignment was then split into three separate alignments, one for the target region (based on the position of the target-region reference sequence), and two for the flanking regions on each side. The two alignments for the flanking regions on either side of the target region were then concatenated. Loci with less than 50% taxon coverage were removed from further analysis. This process yielded two sets of alignments: one including only the target region, the other including only the flanking regions. The average pairwise distance between sequences in a concatenation of the flanking region alignments was more than three times the average pairwise distance between sequences in a concatenation of the target region alignments, indicating that the target and flanking regions have significantly different rates of evolution. Therefore, we chose to partition the target and flanking regions separately in phylogenetic analyses.
Phylogenetic Analysis
We performed our main phylogenetic analyses on two data sets. The first included a full alignment (both the target and flanking regions), and the second included a target-region-only alignment. We also performed phylogenetic analyses with the extended taxon sampling scheme and target-region-only alignment for RPANDA following the same procedure. For each analysis, we performed a partitioned maximum likelihood phylogenetic inference using IQ-TREE v2.2.0 (Minh et al., 2020). Our preliminary full partition scheme contained 92 or 184 partitions reflecting our alignments [92 loci × (1 target) or 92 loci × (1 target + 1 flanking)]. We used ModelFinder Plus (Kalyaanamoorthy et al., 2017) to select the optimal partition scheme by merging partitions until model fit ceased to improve, and to find the best-fitting model of evolution for each partition according to the Bayesian Information Criterion (-m MFP+MERGE option). We employed the identified partition scheme and evolutionary models to infer a maximum likelihood tree. To assess branch support, we conducted 1,000 ultrafast bootstrap replicates with nearest neighbor interchange optimization (-bnni 1000 option).
Coalescent Species Tree Analysis
We inferred species relationships using a multispecies coalescent approach implemented in ASTRAL (Mirarab & Warnow, 2015; Zhang et al., 2018) via the ASTER package v1.16.3.4 (Zhang et al., 2023). Individual gene trees were first reconstructed for each of the 92 loci from the full alignments using IQ-TREE with ModelFinder Plus to select the best-fitting substitution model for each locus. ASTRAL estimates the species tree by finding the tree that maximizes the quartet score given the input gene trees, accounting for incomplete lineage sorting under the multispecies coalescent model. Branch support was assessed using local posterior probabilities, which represent the probability of a quartet given the gene trees. To quantify gene tree discordance, we calculated Robinson-Foulds (RF) distances between each gene tree and the species tree using the phangorn package v2.11.1 (Schliep, K. P., 2011) in R. To visualize gene tree discordance in Megalagrion, we pruned trees to include only Megalagrion, then used the DensiTree function from the phangorn package to display all gene trees simultaneously as a cloudogram and compared them to the species tree generated by ASTRAL.
Divergence Time Estimation
To generate a starting tree for Bayesian divergence time estimation in BEAST, we used our maximum likelihood tree generated from the target-region-only alignment. From this tree, we reconstructed a maximum likelihood fossil calibrated tree using the chronos function from ape v5.7.1 in R (Paradis et al., 2004). We used four fossils to calibrate the tree (Agriocnemis lacteola Selys, Enallagma florissantella Cockerell, Diceratobasis worki Poinar, and Ischnura velteni Bechly, Table 2). These fossils were diagnosed using apomorphies in their respective descriptions, and none of them have been used as taxa in phylogenetic datasets previously, though I. velteni has been used for divergence time estimation (Blow et al., 2021). Each was placed at the node of the most recent common ancestor of their respective genera. Because we only had one sample from Diceratobasis Kennedy, its most recent common ancestor node was the node connecting Diceratobasis and its sister. We constrained the age of the root using a wide range around previous estimates of the age of the core Coenagrionidae (70 ± 20 MA) (Suvorov et al., 2021). We used a strict clock substitution model (discrete substitution model with lambda smoothing parameter = 1).
We used the preliminary fossil-calibration maximum likelihood tree from above as a starting tree for an unpartitioned Bayesian divergence time estimation using BEAST v2.7.6 (Bouckaert et al., 2014, 2019; Drummond & Rambaut, 2007). We used the maximum likelihood tree generated from an unpartitioned full alignment including both the target and flanking regions to constrain the topology using a multiple monophyletic constraint prior. We performed analyses with both fossil and geographic calibration (see below) priors, and with only fossil calibration priors (Table 3).
For both analyses, we used the birth-death diversification model, an uncorrelated relaxed clock substitution model, and a GTR+G+I model of evolution with all parameters estimated. We used a root age prior with a normal distribution around 70 MA with a standard deviation of 10 MA based on previous estimates of the age of the core Coenagrionidae (Suvorov et al., 2021). We used four fossil calibration points with log-normal prior distributions (Table 2, same as fossil calibrations for preliminary calibration). Log-normal fossil calibration prior distributions were offset so that the minimum was equal to the minimum age of the fossil, and μ was adjusted until the 95% quantile was near the previously estimated crown age for core Coenagrionidae (Suvorov et al., 2021). We ran the MCMC chain for at least 10,000,000 generations, and saved a tree every 5,000 generations.
For analyses with a geographic calibration, seven points were placed within Megalagrion clades with overwhelming phylogenetic evidence of following the progression rule, similar to previous studies (Haines et al., 2014; Johns et al., 2018). The emergence times of Oʻahu and Maui Nui were used for these calibration points. Within each clade, calibration points were placed at the most recent common ancestor of the Maui Nui/Hawaiʻi clade based on the emergence of Maui Nui and the most recent common ancestor of the Oʻahu + Maui Nui/Hawaiʻi clade based on the emergence of Oʻahu. The emergence time of Kauaʻi was not used for any geographic calibration points because taxa from Kauaʻi may have had ancestral ranges that once included islands that are now submerged. Geographic calibrations had a uniform prior distribution with a maximum equal to the emergence time of the island and a minimum of zero. Our geographic calibration strategy assumes that if a clade endemic to a younger range has a sister endemic to the next-oldest island, then the clade with the younger range does not predate the formation of its current range. Clades that do not have an extant sister on Kauaʻi are not included in this assumption, allowing for the possibility that the clade could have originated on Kauaʻi or older islands that are now submerged. For example, there is the possibility that the closely related terrestrially-breeding species M. oahuense and M. nesiotes have an extinct sister that once inhabited Kauaʻi or an older island; therefore, this clade is not used to aid in geographic calibration.
The MCMC results were analyzed for convergence and to determine the appropriate amount of burn-in using Tracer v1.7.2 (Rambaut et al., 2018). Maximum clade credibility trees were generated from the posterior trees using TreeAnnotator v2.7.6 with 20% burn-in. Node heights were set to common ancestor heights. To validate our calibration strategy and address potential concerns about interactions between multiple node age constraints, we performed prior-only analyses sampling from calibration priors without molecular data. We analyzed all 16 key parameters, including 12 calibration nodes (5 fossil-based, 7 geographic-based) and 4 evolutionary model parameters (birth rate, death rate, clock rate, tree height).
Ancestral Range Estimation
We performed ancestral range estimation using BioGeoBEARS v1.1.3 in R (Massana et al., 2015; N. Matzke, 2013; N. J. Matzke, 2013, 2014; Van Dam & Matzke, 2016). We use the dated tree from our BEAST analysis with both fossil and geographic calibration priors. Previous biogeographic analyses of Coenagrionidae with a higher-level taxonomic sampling, including Megalagrion and various Ischnurines, were not able to determine the geographic origin of Megalagrion. This is likely because Megalagrion is sister to the globally distributed diverse clade Ischnurinae, making it difficult to determine the origin of Megalagrion using only phylogenetic relationships and distribution data. We chose to focus our analysis on biogeographic patterns within Megalagrion rather than repeat an attempt to determine the origin of Megalagrion. We pruned the tree to include only Megalagrion. Operational taxonomic units in BioGeoBEARS analyses should be phylogenetically distinct, genetically isolated populations; therefore, we chose to preserve island-endemic populations of species present on multiple islands as separate operational taxonomic units. We collapsed multiple representatives of monophyletic island-endemic populations.
We coded islands into six areas (Fig. 1). The first area (K) included Kauaʻi, Niʻihau, and any unknown ancestral range, including islands that have since submerged. The second area (O) included Oʻahu. The third area (M) included Maui. The fourth area (L) included Lānaʻi. The fifth area (X) included Molokaʻi. The sixth area (H) included the island of Hawaiʻi. Kahoʻolawe was not included in the analysis because no Megalagrion damselflies have ever been recorded from the island. Area ‘K’ includes islands that have since submerged because some nodes in the tree are older than any of the existing main Hawaiian islands (S. Jordan et al., 2003). The islands of Maui Nui have been connected during the last glacial maximum 20 – 21 KA (Price & Elliott-Fisk, 2004). Oʻahu and Maui Nui have also been connected through Penguin Bank during periods of low sea level. Gene flow between Megalagrion populations on different islands of Maui Nui has been shown to be greater than between populations on Maui Nui and populations on the island of Hawaiʻi (Jones & Jordan, 2015; S. Jordan et al., 2005). We chose to code the islands of Maui Nui as separate areas because populations on different islands are still often phylogenetically distinct and have significant genetic differences, and marine channels represent a strong barrier to gene flow. By coding these islands separately, we also aimed to understand patterns of dispersal between the islands of Maui Nui and Oʻahu at a higher resolution.
We performed a time-stratified analysis, where areas and ranges were not allowed if none of the islands from that area had yet emerged (Table S3). The earliest estimated emergence times of islands were used to define available areas (D. Clague, 1996). Thus, area K was always available since it included submerged islands and Kauaʻi, which still exists. Area O was only available after the emergence of Oʻahu (3.7 MA). Areas M, L, and X were only available after the emergence of the earliest land of Maui Nui (Penguin Bank, 2.2 MA). Area H was only available after the emergence of the first volcano of the island of Hawaiʻi (Kohala, 0.6 MA).
To estimate the distance between islands, we drew contours at sea level around each area using the 2023 GEBCO Grid and terra v1.7.71 in R (GEBCO Compilation Group, 2023; Hijmans, 2024). We used the “distance” function from the terra R package to calculate the minimum distance between areas. These distances were used to create a distance matrix for DEC models with distance-dependent dispersal (+x). Dispersal multipliers were not applied, and the null range was not included (Massana et al., 2015).
Twelve separate models were run in BioGeoBEARS. These 12 models included the DEC, DIVALIKE, and BAYAREALIKE base models, base models with jump dispersal (+J), base models with distance-dependent dispersal (+x), and base models with both jump dispersal and distance-dependent dispersal (+J+x). All 12 models were compared using the Akaike information criterion (AIC) (Table S3), and the model with the lowest AIC value was selected as our final result. We included the +J model in our analysis because jump dispersal leading to founder event speciation is expected to have occurred in Megalagrion as individuals colonize new islands.
Diversification and Extinction
RPANDAWe performed a diversification analysis using RPANDA v2.3 (Morlon et al., 2016) in R following a recent methodological development (Mazet et al., 2023). We used a maximum-likelihood tree with expanded taxon sampling, which included approximately half of the described species for the included genera. This tree was generated using a full-length alignment including both target and flanking regions. We used ape v5.7.1 in R to collapse monophyletic species with multiple samples. We dated the tree using TreePL v1.0 (Smith & O’Meara 2012) with penalized likelihood. We applied the same four fossil calibration constraints as in our Bayesian divergence time estimation (Table 2). We ran TreePL first with the “prime” option to identify optimal smoothing and rate parameters, then ran TreePL with these parameters and the “thorough” option to ensure convergence. TreePL provides a computationally efficient alternative to Bayesian methods like BEAST, making it well-suited for this dataset with extended taxon sampling. We listed all species of core Coenagrionidae according to the World Odonata List (Paulson et al., 2024). Sampling fractions were generated using RPANDA. A backbone tree was chosen to include 10 clades of interest across the tree, all containing at least 10 species (Table 4). Note that names for these clades are not meant to be valid taxonomic suggestions but are purely for convenience. All possible shift combinations were tested using five diversification models (BCST, BVAR, BCST_DCST, BVAR_DCST, BCST_DVAR). We did not test the BVAR_DVAR model, which allows both birth and death rates to vary. The BVAR_DVAR model has been criticized for being problematic and producing unrealistic estimates, and it is not currently recommended for use with RPANDA (Burin et al., 2019; Mazet et al., 2023). We tested backbone.option with “crown.shift” and “stem.shift” and set multi.backbone to “all”. Model selection used AICc values, with the combination yielding the lowest AICc across all models considered optimal. We ran the shift.estimates function with multi.backbone = “all” to explore parameter space. For the clades with estimated shifts, we calculated the difference between their diversification parameter values from the backbone (family-wide) diversification parameters to identify the magnitude and direction of rate changes in each shifted clade.
Agent-based simulation modelingDiversification analyses such as BAMM and RPANDA are useful tools for identifying potential patterns of speciation and extinction rates across clades and over time. However, neither of these analyses incorporates the dynamic geology of the Hawaiian Islands, which is an important driver of diversification patterns in Megalagrion, and has likely been a driver of diversification in Megalagrion for millions of years prior to the emergence of Kauaʻi. Although we have no extant populations from these sunken islands, we simulated an island system like the Hawaiian Islands and allowed a simulated lineage to evolve in this island system for tens of millions of years. Using this simulation model, we explored potential diversification patterns in a lineage like Megalagrion, and further tested our hypothesis that extinction played a significant role in the evolution of Megalagrion due to geologic change over time.
We created an agent-based simulation model in NetLogo v6.4.0 (Wilensky, 1999) following standard principles for designing agent-based models (ABMs) (Grimm & Railsback, 2012; Railsback & Grimm, 2012). The model is described following the Overview, Design concepts and Details (ODD) protocol (Supplementary methods) (Grimm, 2020; Grimm et al., 2006, 2010). The model is a generalized simulation of the evolution of a single lineage across an oceanic hotspot island system at a large time scale (10s of millions of years). The model can be divided into a geologic submodel and an evolutionary submodel. The geologic submodel models the creation, movement, and disappearance of islands in an oceanic-hotspot system. The simulated island system is modeled after patterns seen in the Hawaiian Islands, but is stochastic and generalized to simulate any oceanic hotspot system by adjusting geologic parameters. The evolutionary submodel models the evolution of a single lineage of organisms. This simulated lineage is modeled after Megalagrion, but is highly generalized and could be used to represent any lineage of organisms by adjusting biological and evolutionary parameters. Most biological and evolutionary parameters included in the model are not empirically known for Megalagrion, and/or do not correspond to easily measurable biological variables. Geologic, biological, and evolutionary parameters of the model are set manually and held constant throughout each run. However, variables such as island size, population size, number of species, extinction events, island age, endemism, the preferred niche of a species, genetic diversity, and others are not set, but are emergent from the output of the model.
To validate the geologic submodel, geologic parameters were set to values similar to those empirically found in the Hawaiian Islands. We ran 54 simulations of the model, each for 20 million years. For each run, we recorded the average length of the island chain, the average island size, the average island age, and the average number of islands throughout the simulation. The average and standard deviation of these data across all simulations were used to compare simulated island chains to the Hawaiian Islands.
At the end of each of these 54 simulations, we recorded the total number of species ever created and the number of extant species. The average and standard deviation of these data across all simulations were used to calculate the average speciation rate, extinction rate, and net diversification rate across all simulations.
Ancestral State Reconstruction
We performed stochastic character mapping on our maximum likelihood tree generated from a full alignment including both target and flanking regions using the make.simmap function of the R package phytools v2.1.1 (Revell, 2024) with the empirical Bayes method. Prior to ancestral state reconstruction, we selected the best evolutionary transition rate model by fitting three models to the data using the fitMk function (equal rates, symmetric rates, and all rates different) and evaluating model support based on AIC scores. We mapped breeding habitat and gill type separately. Characters were coded based on previous literature (Table 5) (S. Jordan et al., 2003; Polhemus, 1997; Polhemus & Asquith, 1996). Gills were coded as being foliate, lanceolate, or saccate. Breeding habitats were coded as lentic, lotic, seep, phytotelmata, or terrestrial. The breeding habitat and gill type of M. sp. (Jordan, et al. 2003) and M. mauka Daigle are not known. Therefore, two separate analyses were performed. For the first analysis, the character states of these species were given a flat prior distribution. The character states for these species were then estimated in the same way as character states were estimated for internal nodes. For the second analysis, these taxa were removed. We performed 10,000 simulations, then summarized all simulations by measuring the proportion of simulations that recovered each state at each node.
