Global integration of phylogenomic data and fine-scale partitioning strategies refine the evolutionary tree of Adephaga beetles (Insecta: Coleoptera)

Cardenas, Cody Raul 1 ; Gustafson, Grey2 ; Toussaint, Emmanuel1

Published Nov 24, 2025 on Dryad. https://doi.org/10.5061/dryad.w9ghx3fzf

Data files

Nov 24, 2025 version files 39.58 GB

GeaSub_2.9kv1.fasta (4.95 MB)
README.md (16.07 KB)
Supplemental_1.pdf (7.06 MB)
Supplemental_2.pdf (1.33 MB)
Supplemental_3.pdf (474.96 KB)
Supplemental_4.pdf (82.10 KB)
Supplemental_data2.zip (39.57 GB)
Supplemental_Tables_dryadcorrections.xlsx (2.49 MB)

Abstract

Over the past decade, genomic-scale data have revolutionized insect phylogenomics by allowing the generation of increasingly comprehensive genomic and taxonomic datasets. However, because different approaches have been used, it is often difficult to understand to what extent these data can be integrated to reconstruct evolutionary trees. In this study, we focus on the beetle suborder Adephaga to explore whether genomic data produced in the past decade can be combined to reconstruct the largest phylogenomic tree of this clade to date. To that end, we collected publicly available transcriptomes, genomes, and target sequence capture data of Adephaga beetles to generate a global dataset. Taking advantage of a newly developed bioinformatic pipeline, we demonstrate the overall compatibility of data types, especially of ultraconserved elements and exon-capture data. We also examined the impact of factors such as the treatment of off-target flanking genomic regions, data trimming regimes and partitioning, as well as varying levels of taxonomic and genomic sampling on phylogenomic inference. Using a matrix of 2,471 loci, we inferred the most comprehensive fossil-based evolutionary tree of Adephaga beetles. Our results confirm the independent colonization of aquatic ecosystems by two lineages. We also reconstruct Hygrobiidae as sister to Amphizoidae and a paraphyletic Aspidytidae, supporting the evolutionary convergence of prothoracic glands in both Hygrobiidae and Dytiscidae. Our results suggest an origin of Adephaga in the Carboniferous, with subsequent diversification of major lineages in the mid-Permian. Future efforts should focus on expanding the taxonomic sampling in Geadephaga, this clade of terrestrial beetles being the most diverse lineage in Adephaga and paradoxically one of the least sampled. 
To that end, we introduce a new ultraconserved element probe set tailored for Geadephaga beetles that will help generate compatible genomic data to further refine the Adephaga tree of life.

Dryad DOI: https://doi.org/10.5061/dryad.w9ghx3fzf

Supplemental_Tables_dryadcorrections.xlsx

Each tab in the xlsx sheet is described here:

SuppFile1_Taxa

Taxa accessions, Locus Recovery, and BOLD IDs

Table of all taxa used for analyses within this manuscript. For each taxon, the table gives the associated ID (Other Code), accession code (Code), references, species identity, locality, data source (type of data), locus recovery by probe source, assembly statistics (length, min, max, average length), and the BOLD identification results: queryID, best genus guess, best species guess, database search, percentage match, and lowest percentage match for the BOLD ID. NAs in this sheet mark data that are unavailable or not applicable.

SuppFile1_MLTreeStats

Four tables of maximum likelihood and information criterion values of trees generated in IQ-TREE2

Four tables presenting the results of phylogenomic tree inferences for the all and subset taxon sets, split by partitioning scheme and locus coverage: locus partitioning with A) 30% locus coverage and B) 40% locus coverage, followed by gene partitioning with C) 30% locus coverage and D) 40% locus coverage. Each subtable contains LL: Log-likelihood of the tree; unconstrained LL: Unconstrained log-likelihood (without tree); Free parameters: Number of free parameters (# branches + # model parameters); AIC: Akaike information criterion (AIC) score; AICc: Corrected Akaike information criterion (AICc) score; BIC: Bayesian information criterion (BIC) score; Consensus LL: Log-likelihood of consensus tree; RF dist: Robinson-Foulds distance between ML tree and consensus tree. Lastly, any subheader marked with (*) indicates that the consensus tree has a higher likelihood than the ML tree found, and that the consensus tree should be used instead of the ML tree.
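As a rough illustration of how the information criteria in these tables relate to the log-likelihood and the number of free parameters, the sketch below applies the standard textbook formulas. The function name and the example values are hypothetical, and taking the sample size to be the number of alignment sites is an assumption, not something stated in this README.

```python
import math

def information_criteria(log_likelihood, n_free_params, n_sites):
    """Return (AIC, AICc, BIC) from the standard formulas.

    n_sites is ASSUMED to be the sample size (alignment length);
    this README does not state which sample size was used.
    """
    k = n_free_params
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction added on top of the AIC.
    aicc = aic + (2 * k * (k + 1)) / (n_sites - k - 1)
    bic = k * math.log(n_sites) - 2 * log_likelihood
    return aic, aicc, bic

# Made-up values for illustration only, not taken from the data set.
aic, aicc, bic = information_criteria(-1000.0, 10, 500)
print(aic, aicc, bic)
```

Lower scores indicate a better trade-off between fit and model complexity, which is how the subtables can be compared within a given taxon set.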

SuppFile1_prelim_notrim

Preliminary aligned locus statistics, no trimming. Columns are standard output from AMAS summary.

All loci recovered from all taxa using the joined probeset without any trimming, with a number of summary statistics.

SuppFile1_prelim_trimalauto

Preliminary aligned locus statistics, trimAl auto trimming. Columns are standard output from AMAS summary.

All loci recovered from all taxa using the joined probeset with trimAl auto trimming, including a number of summary statistics.

SuppFile1_prelim_gblocks

Preliminary aligned locus statistics, Gblocks trimming. Columns are standard output from AMAS summary.

All loci recovered from AllTaxa using the joined probeset with Gblocks trimming set to default phyluce parameters, including a number of summary statistics if available.

SuppFile1_AllTaxa-core

AllTaxa core data: joined probe set locus statistics

Sequence count and length by locus before and after trimming with Gblocks and trimAl for the core curation dataset. Columns are the locus ID, the number of sequences, and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.

Any NA means the locus was not present or recovered in that probe set.

SuppFile1_AllTaxa-core+flanking

AllTaxa core+flanking data: joined probe set locus statistics

Sequence count and length by locus before and after trimming with Gblocks and trimAl for the core+flanking curation dataset. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.

Any NA means the locus was not present or recovered in that probe set.

SuppFile1_AllTaxa-flanking

AllTaxa flanking data: joined probe set locus statistics

Sequence count and length by locus before and after trimming with Gblocks and trimAl for the flanking curation dataset. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.

Any NA means the locus was not present or recovered in that probe set.

SuppFile1_curation_notrim

Two tables of alignment sequence counts and length pre-trimming

A table of the alignment statistics before and after trimming for sequence curation methods and by probe origin in the joined probe set. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.

Any NA means the locus was not present or recovered in that probe set.

SuppFile1_curation_stats

Four tables of alignment sequence count and length differences post-trimming

A table of the alignment differences after trimming, by sequence curation method and probe origin in the joined probe set. The difference is expressed as the number of sequences in the untrimmed data minus the number of sequences in the data for the respective trimming method.
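The subtraction described above can be sketched as follows. The locus names and counts are invented for illustration, and treating a locus missing from the trimmed data as `None` (NA) is an assumption borrowed from the neighbouring sheets, not a documented rule for this tab.

```python
def count_differences(untrimmed, trimmed):
    """Per-locus difference: untrimmed sequence count minus trimmed count.

    A locus absent from the trimmed data is reported as None (ASSUMED
    to correspond to NA in the spreadsheet).
    """
    diffs = {}
    for locus, n_untrimmed in untrimmed.items():
        n_trimmed = trimmed.get(locus)
        diffs[locus] = None if n_trimmed is None else n_untrimmed - n_trimmed
    return diffs

# Hypothetical loci and counts, for illustration only.
untrimmed = {"locus-1": 120, "locus-2": 95}
gblocks = {"locus-1": 118}
diffs = count_differences(untrimmed, gblocks)
print(diffs)
```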

SuppFile1_gCFsCF

Gene and site concordance factor analysis results as output by the IQ-TREE2 concordance analysis.

The resulting output of gene and site concordance factors with IQ-TREE2 using the 30% trimAl gene SubTaxa dataset. IQ-TREE2 concordance factor column information (from IQ-TREE2):

ID: Branch ID
gCF: Gene concordance factor (=gCF_N/gN %)
gCF_N: Number of trees concordant with the branch
gDF1: Gene discordance factor for NNI-1 branch (=gDF1_N/gN %)
gDF1_N: Number of trees concordant with NNI-1 branch
gDF2: Gene discordance factor for NNI-2 branch (=gDF2_N/gN %)
gDF2_N: Number of trees concordant with NNI-2 branch
gDFP: Gene discordance factor due to polyphyly (=gDFP_N/gN %)
gDFP_N: Number of trees decisive but discordant due to polyphyly
gN: Number of trees decisive for the branch
sCF: Site concordance factor averaged over 1000 quartets (=sCF_N/sN %)
sCF_N: sCF in absolute number of sites
sDF1: Site discordance factor for alternative quartet 1 (=sDF1_N/sN %)
sDF1_N: sDF1 in absolute number of sites
sDF2: Site discordance factor for alternative quartet 2 (=sDF2_N/sN %)
sDF2_N: sDF2 in absolute number of sites
sN: Number of informative sites averaged over 1000 quartets
Label: Existing branch label
Length: Branch length
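The percentage columns follow directly from the count columns (for example, gCF = gCF_N/gN %). A minimal sketch of that arithmetic is shown here; the row values are invented and the helper name is hypothetical.

```python
def concordance_percent(n_supporting, n_decisive):
    """Percentage of decisive trees (or sites) supporting a resolution."""
    if n_decisive == 0:
        # No decisive trees for this branch: the factor is undefined.
        return float("nan")
    return 100.0 * n_supporting / n_decisive

# Invented counts for a single branch, using the column names above.
row = {"gCF_N": 30, "gDF1_N": 10, "gDF2_N": 5, "gDFP_N": 5, "gN": 50}

gcf = concordance_percent(row["gCF_N"], row["gN"])
total = sum(
    concordance_percent(row[key], row["gN"])
    for key in ("gCF_N", "gDF1_N", "gDF2_N", "gDFP_N")
)
print(gcf, total)
```

Because every decisive gene tree falls into exactly one of the four categories, gCF, gDF1, gDF2, and gDFP sum to 100% for a branch, which is a quick sanity check when reading the sheet.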

SuppFile1_Fossils

Table of expanded fossil taxa.

An expanded fossil table including the fossil name, number, suborder, placement, minimum age, family, subfamily, fossil code, deposit, reference, and exponential and lognormal prior settings for BEAUti and BEAST analyses.

Column information: fossil: fossil name, Number: fossil ID on tree, Suborder: suborder, Placement: placement, Min. Age: minimum age (My), Family: family, Subfamily: subfamily, Fossil Code: fossil code/ID, if available, deposit: geological deposit in which the fossil was identified, References: journal reference for the fossil, BEAUti Exponential Prior Settings: exponential prior settings, and BEAUti Lognormal Prior Settings: lognormal prior settings. N/A = not applicable.

SuppFile1_ExpTime

Exponential divergence time analysis time estimates

Estimated dates of the fossils used to calibrate the Bayesian timetree in the exponential analysis. Columns include: lognormal_taxon: fossils, tree number: node in tree, mean (Ma), stderr of mean: standard error of mean, stdev: standard deviation, variance, median (Ma), value range: range of divergence estimates, geometric mean, 95% HPD interval: highest posterior density interval, auto-correlation time (ACT), effective sample size (ESS), number of samples.

SuppFile1_LogNormTime

Log normal divergence time analysis time estimates

Estimated dates of the fossils used to calibrate the Bayesian timetree in the log-normal analysis. Columns include: lognormal_taxon: fossils, tree number: node in tree, mean (Ma), stderr of mean: standard error of mean, stdev: standard deviation, variance, median (Ma), value range: range of divergence estimates, geometric mean, 95% HPD interval: highest posterior density interval, auto-correlation time (ACT), effective sample size (ESS), number of samples.

SuppFile1_cogenic

Cogenic loci found in the Pterostichus madidus genome.

Counts for annotated genes in the P. madidus genome to which more than one probe from the joined probe set maps. Includes the partition ID and count of cogenic loci.

SuppFile1_LMM

Table of linear mixed effects model of the conservedness of loci

The resulting summaries of the most likely linear mixed effect model (LMM) fit by maximum likelihood are presented for the (A) informative sites and (B) invariant sites. Estimated variance of the LMM for the (C) informative and (D) invariant sites. Significance testing of the LMM for the (E) informative and (F) invariant sites.

Supplemental_pdf's

Supplemental_1.pdf

Supplemental analyses, figures, and tables.

Supplemental_2.pdf

Phylogenetic trees from the all data set analyses. Trees are named in the top right with the data set, partitioning scheme, curation method, and locus occupancy; e.g., all_gene_core_trimal_30p.

Supplemental_3.pdf

Phylogenetic trees from the subset data set analyses. Trees are named in the top right with the data set, partitioning scheme, curation method, and locus occupancy; e.g., subset_gene_core_trimal_30p.
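The tree labels used in Supplemental_2.pdf and Supplemental_3.pdf can be split into their components programmatically. The sketch below is based only on the two example labels given above; the field names, and the reading of `core` as the curation data set and `trimal` as the trimming method, are assumptions.

```python
from typing import NamedTuple

class TreeLabel(NamedTuple):
    taxon_set: str          # e.g. "all" or "subset"
    partitioning: str       # e.g. "gene"
    curation: str           # e.g. "core" (ASSUMED meaning of this field)
    trimming: str           # e.g. "trimal" (ASSUMED meaning of this field)
    occupancy_percent: int  # e.g. 30 for "30p"

def parse_tree_label(label):
    """Split a label such as 'all_gene_core_trimal_30p' into its fields."""
    parts = label.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {label!r}")
    taxon_set, partitioning, curation, trimming, occupancy = parts
    return TreeLabel(taxon_set, partitioning, curation, trimming,
                     int(occupancy.rstrip("p")))

parsed = parse_tree_label("subset_gene_core_trimal_30p")
print(parsed)
```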

Supplemental_4.pdf

Phylogenetic analyses of the subset data set: the wASTRAL analysis and the gene and site concordance factor analyses. Trees are named in the top right with the data set of the preferred tree and the analysis; e.g., subset-gene-core-trimal-30p_wASTRAL.

Supplemental_data2.zip

In this zip folder, in the directory _data, there is one file and several subdirectories described here.

The joined probe set used in all analyses is found here

Note
In an early draft of this manuscript, ExC data were referred to as AHE. In the majority of these files, AHE was changed to ExC. However, in some cases it was not reasonable to convert AHE to ExC; e.g., the “BOLD_characterization” and “Matrices” folders.

BEAST

Within this directory, there are:

Combined log files
Combined time and TreeAnnotator tree files
XML files generated by BEAUti for the exponential and lognormal analyses
Chrono tree used in the analyses

BOLD_Taxonomy_ID

Within this directory, there are:

Multi-fasta of COI genes
BOLD trees of select taxa
BOLD Search report as a PDF & XLS
BLAST search results of select taxa
MitoFinder logs for all samples
GenBank mitochondrial genome references for MitoFinder search

As a note, the XLS files generated by BOLD are output from the BOLD database search and contain empty cells.

characterization

Within this directory are:

1genefeatures

output from processing of the genomic fasta files to generate introns and intergenic gene features

2mapping

output of the mapping of probes to the genome

3intersect

output of the identification of intersecting probes

4probe-reduction

output of the identification and subset of probes that do intersect the preferred probe set (Adephaga, UCE-Ade)

5join-probes

output of the probes that remain and are used to create the joined probe set

6group-by

output of remapping the probes to the preferred genome and creation of the modified partition file based on probes that are found to be on the same gene

Characterization_analysis

R script to process and describe the joined and Adephaga probe sets

cogenic-loci

R script to describe the number of cogenic loci found (from the 6group-by output)

mapping_analysis

R script to describe the number of loci recovered by probe set for all taxa

Matrices

General directory structure:

.
├── all-gene
│ └── core+flanking
│     └── gblocks_30p
├── subset-loci
│ └── core+flanking
│     └── gblocks_30p

All data here are loci recovered using the joined probe set designed for in-silico capture/recovery of sequence data with Phyluce for alignment and trimming.

This directory holds sub-directories of IQ-TREE2 analyses for the all and subset taxon data sets, for both the loci and gene partitioning. Each contains the CORE, CORE+FLANKING, and FLANKING data sets at 30% and 40% taxa per locus. The all taxon data sets have both trimal and gblocks data. The subset taxon set has only trimal.

For example:

all-gene
├── core
│ ├── gblocks_30p
│ ├── gblocks_40p
│ ├── trimal_30p
│ └── trimal_40p
├── core+flanking
│ ├── gblocks_30p
│ ├── gblocks_40p
│ ├── trimal_30p
│ └── trimal_40p
└── flanking
  ├── gblocks_30p
  ├── gblocks_40p
  ├── trimal_30p
  └── trimal_40p
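The layout shown above is the full combination of curation data set, trimming method, and locus occupancy. As a convenience when scripting over the archive, the expected sub-directory names can be generated as in this sketch. It assumes the names follow the example tree exactly, and it applies to the all taxon sets only, since the subset taxon set has only trimal.

```python
from itertools import product

# Values taken from the example tree above.
CURATION = ["core", "core+flanking", "flanking"]
TRIMMING = ["gblocks", "trimal"]
OCCUPANCY = ["30p", "40p"]

def expected_subdirs(top_level):
    """List every <top_level>/<curation>/<trimming>_<occupancy> path."""
    return [
        f"{top_level}/{curation}/{trimming}_{occupancy}"
        for curation, trimming, occupancy in product(CURATION, TRIMMING, OCCUPANCY)
    ]

paths = expected_subdirs("all-gene")
for path in paths:
    print(path)
```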

Within each taxon set, "all" directories contain output files from IQ-TREE2:

the associated nexus matrix and partition files used as input for partition analyses
the resulting partition analyses partitions
IQ-TREE2 log files
IQ-TREE2 iqtree file
IQ-TREE2 newick tree files

Taxon subset data has a slightly different data prefix due to additional analyses that are described in the next section.

subset data

Within the Subset taxon data sets in the same directory containing the IQ-TREE2 files are the associated wASTRAL and concordance analyses files (subset-gene-core-trimal_30p).

concatenated locus/gene IQ-TREE2 log file (gene-tree.*)
concatenated locus/gene iqtree file (gene-tree.iqtree)
concatenated locus/gene newick tree files generated by IQ-TREE2 (gene-tree.treefile)
concordance analysis log file (concordance_*)
concordance analysis tree file
wASTRAL tree file (wASTRAL_*)

Because there is a slightly different workflow for concordance and wASTRAL analyses, there is a different prefix. Loci with similar partitioning models are not merged to create locus/gene trees.
Files prefixed with nomerege_* indicate that during the search for the optimal partition, the loci were not merged afterwards.
Files prefixed with UFB_* indicate that the partition search was performed with IQ-TREE2's MF+MERGE option.

SortaDate

Contains subdirectories

IQ-TREE2

output of IQ-TREE2 ModelFinder run on the PartitionFinder2 output

partitionfinder2

output of the PartitionFinder2 partitioning scheme search

SortaDate

output of the SortaDate analysis

These are the files and outputs used for the SortaDate analysis.

core_parti-gene_30p.rr.tree

The tree file used in the SortaDate analysis

`core_SORTADATE.*`

Files were generated using AMAS after identifying the best partitioning scheme from PartitionFinder2 and IQ-TREE2

GeaSub_2.9kv1.fasta

The new ultra-conserved element probe set designed for Geadephaga
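For a quick look at a probe set file such as this one, a plain FASTA reader is enough. The sketch below is a generic parser demonstrated on a made-up two-record example; the header names are hypothetical, and nothing here is specific to the actual header format of GeaSub_2.9kv1.fasta.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into {record_name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    Records sharing a name overwrite one another.
    """
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

# Made-up records for illustration; replace with open("GeaSub_2.9kv1.fasta").
example = io.StringIO(">probe_a\nACGT\nACGT\n>probe_b extra text\nGGCC\n")
records = read_fasta(example)
print(len(records), {name: len(seq) for name, seq in records.items()})
```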
