Global integration of phylogenomic data and fine-scale partitioning strategies refine the evolutionary tree of Adephaga beetles (Insecta: Coleoptera)
Data files
Nov 24, 2025 version files 39.58 GB
- GeaSub_2.9kv1.fasta (4.95 MB)
- README.md (16.07 KB)
- Supplemental_1.pdf (7.06 MB)
- Supplemental_2.pdf (1.33 MB)
- Supplemental_3.pdf (474.96 KB)
- Supplemental_4.pdf (82.10 KB)
- Supplemental_data2.zip (39.57 GB)
- Supplemental_Tables_dryadcorrections.xlsx (2.49 MB)
Abstract
Over the past decade, genomic-scale data have revolutionized insect phylogenomics by allowing the generation of increasingly comprehensive genomic and taxonomic datasets. However, because different approaches have been used, it is often difficult to understand to what extent these data can be integrated to reconstruct evolutionary trees. In this study, we focus on the beetle suborder Adephaga to explore whether genomic data produced in the past decade can be combined to reconstruct the largest phylogenomic tree of this clade to date. To that end, we collected publicly available transcriptomes, genomes, and target sequence capture data of Adephaga beetles to generate a global dataset. Taking advantage of a newly developed bioinformatic pipeline, we demonstrate the overall compatibility of data types, especially of ultraconserved elements and exon-capture data. We also examined the impact of factors such as the treatment of off-target flanking genomic regions, data trimming regimes and partitioning, as well as varying levels of taxonomic and genomic sampling on phylogenomic inference. Using a matrix of 2,471 loci, we inferred the most comprehensive fossil-based evolutionary tree of Adephaga beetles. Our results confirm the independent colonization of aquatic ecosystems by two lineages. We also reconstruct Hygrobiidae as sister to Amphizoidae and a paraphyletic Aspidytidae, supporting the evolutionary convergence of prothoracic glands in both Hygrobiidae and Dytiscidae. Our results suggest an origin of Adephaga in the Carboniferous, with subsequent diversification of major lineages in the mid-Permian. Future efforts should focus on expanding the taxonomic sampling in Geadephaga, this clade of terrestrial beetles being the most diverse lineage in Adephaga and paradoxically one of the least sampled. 
To that end, we introduce a new ultraconserved element probe set tailored for Geadephaga beetles that will help generate compatible genomic data to further refine the Adephaga tree of life.
Dryad DOI: https://doi.org/10.5061/dryad.w9ghx3fzf
Supplemental_Tables_dryadcorrections.xlsx
Each tab in the xlsx sheet is described here:
SuppFile1_Taxa
Taxa accessions, Locus Recovery, and BOLD IDs
Table of all taxa used for analyses in this manuscript. For each taxon, the table lists the associated ID (Other Code), accession code (Code), references, species identity, locality, data source (type of data), locus recovery by probe source, assembly statistics (length, min, max, average length), and BOLD identification results: queryID, best genus guess, best species guess, database search, percentage match, and lowest percentage match for the BOLD ID. NAs in this sheet are data that are unavailable or not applicable.
SuppFile1_MLTreeStats
Four tables of maximum likelihood and information criterion values of trees generated in IQ-TREE2
Four tables presenting the results of phylogenomic tree inferences for the all and subset taxon sets, split by partitioning scheme and locus coverage: locus partitioning at (A) 30% and (B) 40% locus coverage, followed by gene partitioning at (C) 30% and (D) 40% locus coverage. Each subtable contains LL: log-likelihood of the tree; unconstrained LL: unconstrained log-likelihood (without tree); Free parameters: number of free parameters (# branches + # model parameters); AIC: Akaike information criterion score; AICc: corrected Akaike information criterion score; BIC: Bayesian information criterion score; Consensus LL: log-likelihood of the consensus tree; and RF dist: Robinson-Foulds distance between the ML tree and the consensus tree. Any subheader marked with (*) indicates that the consensus tree has a higher likelihood than the ML tree found, so the consensus tree should be used instead of the ML tree.
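As a reading aid for these columns, the following minimal Python sketch shows how the three information criteria follow from a log-likelihood and a free-parameter count, using their standard textbook definitions. Treating the alignment length as the sample size is an assumption here, and the numbers are made up for illustration; they are not taken from the tables.

```python
import math


def information_criteria(log_likelihood: float, free_params: int, n_sites: int) -> dict:
    """Return AIC, AICc and BIC computed from a log-likelihood.

    n_sites is the sample size; using the alignment length is an assumption.
    """
    k = free_params
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction added on top of AIC.
    aicc = aic + (2 * k * (k + 1)) / (n_sites - k - 1)
    bic = k * math.log(n_sites) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Hypothetical values, for illustration only.
scores = information_criteria(log_likelihood=-1000.0, free_params=10, n_sites=500)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Lower scores indicate a preferred model under each criterion, so the columns can be compared row by row within a subtable.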
SuppFile1_prelim_notrim
Preliminary aligned locus statistics, no trimming. Columns are standard output from AMAS summary.
All loci recovered from all taxa using the joined probeset without any trimming, with a number of summary statistics.
SuppFile1_prelim_trimalauto
Preliminary aligned locus statistics, trimAl auto trimming. Columns are standard output from AMAS summary.
All loci recovered from all taxa using the joined probeset with trimAl auto trimming, including a number of summary statistics.
SuppFile1_prelim_gblocks
Preliminary aligned locus statistics, Gblocks trimming. Columns are standard output from AMAS summary.
All loci recovered from AllTaxa using the joined probe set with Gblocks trimming set to default phyluce parameters, including a number of summary statistics where available.
SuppFile1_AllTaxa-core
AllTaxa core data: joined probe set locus statistics
Sequence count and length by locus before and after trimming with Gblocks and trimAl for the core curation dataset. Columns are the locus ID, the number of sequences, and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.
Any NA means the locus was not present or recovered in that probe set.
SuppFile1_AllTaxa-core+flanking
AllTaxa core+flanking data: joined probe set locus statistics
Sequence count and length by locus before and after trimming with Gblocks and trimAl for the core+flanking curation dataset. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.
Any NA means the locus was not present or recovered in that probe set.
SuppFile1_AllTaxa-flanking
AllTaxa flanking data: joined probe set locus statistics
Sequence count and length by locus before and after trimming with Gblocks and trimAl for the flanking curation dataset. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.
Any NA means the locus was not present or recovered in that probe set.
SuppFile1_curation_notrim
Two tables of alignment sequence counts and length pre-trimming
A table of the alignment statistics before and after trimming for sequence curation methods and by probe origin in the joined probe set. Columns indicate the locus ID, the number of sequences and length for the notrim, the number of sequences and length for the gblocks, the number of sequences and length for the trimal, the difference in sequences from the untrimmed and gblocks, the difference in sequence length of the untrimmed and gblocks, the difference in sequences from the untrimmed and trimal, the difference in sequence length of the untrimmed and trimal, and the probe source.
Any NA means the locus was not present or recovered in that probe set.
SuppFile1_curation_stats
Four tables of alignment sequence count and length differences post-trimming
A table of the differences in alignment statistics after trimming, by sequence curation method and probe origin in the joined probe set. Each difference is the number of sequences in the untrimmed data minus the number of sequences remaining after the respective trimming method.
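The subtraction described above can be sketched in a few lines of Python. The locus names and counts below are hypothetical, and representing NA as `None` is an assumption made for this illustration.

```python
def count_difference(untrimmed, trimmed):
    """Untrimmed count minus trimmed count; None stands in for NA."""
    if untrimmed is None or trimmed is None:
        return None
    return untrimmed - trimmed


# Hypothetical rows: sequence counts per locus for each trimming regime.
rows = [
    {"locus": "locus_1", "notrim": 120, "gblocks": 118, "trimal": 120},
    {"locus": "locus_2", "notrim": 95, "gblocks": None, "trimal": 90},
]

# Map each locus to (difference vs. Gblocks, difference vs. trimAl).
diffs = {
    row["locus"]: (
        count_difference(row["notrim"], row["gblocks"]),
        count_difference(row["notrim"], row["trimal"]),
    )
    for row in rows
}
print(diffs)
```

A positive value means sequences were lost during trimming; a value of 0 means the trimmed alignment kept every sequence.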
SuppFile1_gCFsCF
Gene and site concordance factor analysis results as output by the IQ-TREE2 concordance analysis.
The output of the gene and site concordance factor analysis with IQ-TREE2 using the 30% trimAl gene SubTaxa dataset. IQ-TREE2 concordance factor column information (from IQ-TREE2):
- ID: Branch ID
- gCF: Gene concordance factor (=gCF_N/gN %)
- gCF_N: Number of trees concordant with the branch
- gDF1: Gene discordance factor for NNI-1 branch (=gDF1_N/gN %)
- gDF1_N: Number of trees concordant with NNI-1 branch
- gDF2: Gene discordance factor for NNI-2 branch (=gDF2_N/gN %)
- gDF2_N: Number of trees concordant with NNI-2 branch
- gDFP: Gene discordance factor due to polyphyly (=gDFP_N/gN %)
- gDFP_N: Number of trees decisive but discordant due to polyphyly
- gN: Number of trees decisive for the branch
- sCF: Site concordance factor averaged over 1000 quartets (=sCF_N/sN %)
- sCF_N: sCF in absolute number of sites
- sDF1: Site discordance factor for alternative quartet 1 (=sDF1_N/sN %)
- sDF1_N: sDF1 in absolute number of sites
- sDF2: Site discordance factor for alternative quartet 2 (=sDF2_N/sN %)
- sDF2_N: sDF2 in absolute number of sites
- sN: Number of informative sites averaged over 1000 quartets
- Label: Existing branch label
- Length: Branch length
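The percentage columns are derived from the count columns as stated in the definitions (for example, gCF = gCF_N/gN %). The following Python sketch recomputes the gene-level percentages for a single branch; the row values are hypothetical and the dictionary layout is an assumption, not the format of the supplied file.

```python
def percent(count: float, total: float) -> float:
    """Return count/total as a percentage, or NaN when no tree is decisive."""
    return 100.0 * count / total if total else float("nan")


def gene_factors(row: dict) -> dict:
    """Recompute gCF, gDF1, gDF2 and gDFP from their *_N counts and gN."""
    return {
        name: percent(row[f"{name}_N"], row["gN"])
        for name in ("gCF", "gDF1", "gDF2", "gDFP")
    }


# Hypothetical branch: 100 decisive gene trees, 60 of them concordant.
branch = {"gCF_N": 60, "gDF1_N": 20, "gDF2_N": 10, "gDFP_N": 10, "gN": 100}
factors = gene_factors(branch)
print(factors)
```

The same `percent` helper applies to the site-level columns by dividing sCF_N, sDF1_N, or sDF2_N by sN.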
SuppFile1_Fossils
Table of expanded fossil taxa.
An expanded fossil table including the fossil name, number, suborder, placement, minimum age, family, subfamily, fossil code, deposit, reference, and exponential and lognormal prior settings for BEAUti and BEAST analyses.
Column information: fossil: fossil name; Number: fossil ID on tree; Suborder: suborder; Placement: fossil placement; Min. Age: minimum age (My); Family: family; Subfamily: subfamily; Fossil Code: fossil code/ID, if available; deposit: geological deposit in which the fossil was identified; References: journal reference for the fossil; BEAUti Exponential Prior Settings: exponential prior settings; and BEAUti Lognormal Prior Settings: lognormal prior settings. N/A = not applicable.
SuppFile1_ExpTime
Divergence time estimates from the exponential analysis
Estimated dates of the fossils used to calibrate the Bayesian timetree in the exponential analysis. Columns include: lognormal_taxon: fossils; tree number: node in tree; mean (Ma); stderr of mean: standard error of the mean; stdev: standard deviation; variance; median (Ma); value range: range of divergence estimates; geometric mean; 95% HPD interval: highest posterior density interval; auto-correlation time (ACT); effective sample size (ESS); and number of samples.
SuppFile1_LogNormTime
Divergence time estimates from the lognormal analysis
Estimated dates of the fossils used to calibrate the Bayesian timetree in the lognormal analysis. Columns include: lognormal_taxon: fossils; tree number: node in tree; mean (Ma); stderr of mean: standard error of the mean; stdev: standard deviation; variance; median (Ma); value range: range of divergence estimates; geometric mean; 95% HPD interval: highest posterior density interval; auto-correlation time (ACT); effective sample size (ESS); and number of samples.
SuppFile1_cogenic
Cogenic loci found in the Pterostichus madidus genome.
Counts for annotated genes in the P. madidus genome to which more than one probe from the joined probe set maps. Includes the partition ID and the count of cogenic loci.
SuppFile1_LMM
Table of linear mixed effects model of the conservedness of loci
The resulting summaries of the most likely linear mixed effect model (LMM) fit by maximum likelihood are presented for the (A) informative sites and (B) invariant sites. Estimated variance of the LMM for the (C) informative and (D) invariant sites. Significance testing of the LMM for the (E) informative and (F) invariant sites.
Supplemental_pdf's
Supplemental_1.pdf
Supplemental analyses, figures, and tables.
Supplemental_2.pdf
Phylogenetic trees from the all data set analyses. Trees are named in the top right with the data set, partitioning scheme, curation method, and locus occupancy; e.g., all_gene_core_trimal_30p.
Supplemental_3.pdf
Phylogenetic trees from the subset data set analyses. Trees are named in the top right with the data set, partitioning scheme, curation method, and locus occupancy; e.g., subset_gene_core_trimal_30p.
Supplemental_4.pdf
Phylogenetic trees from the additional analyses of the subset data set: the wASTRAL analysis and the gene and site concordance factor analyses. Trees are named in the top right with the data set of the preferred tree and the analysis; e.g., subset-gene-core-trimal-30p_wASTRAL.
Supplemental_data2.zip
In this zip folder, in the directory _data, there is one file and several subdirectories, described here.
The joined probe set used in all analyses is found here.
- Note
In an early draft of this manuscript, ExC data were referred to as AHE. In the majority of these files, AHE was changed to ExC. However, in some cases it was not reasonable to convert AHE to ExC, e.g., in the “BOLD_characterization” and “Matrices” folders.
BEAST
Within this directory, there are:
- Combined log files
- Combined time and TreeAnnotator tree files
- XML files generated by BEAUti for the exponential and lognormal analyses
- Chrono tree used in the analyses
BOLD_Taxonomy_ID
Within this directory, there are:
- Multi-fasta of COI genes
- BOLD trees of select taxa
- BOLD Search report as a PDF & XLS
- BLAST search results of select taxa
- MitoFinder logs for all samples
- GenBank mitochondrial genome references for MitoFinder search
As a note, the XLS files generated by BOLD are output from the BOLD database search and contain empty cells.
characterization
Within this directory are:
1genefeatures
output from processing of the genomic fasta files to generate introns and intergenic gene features
2mapping
output of the mapping of probes to the genome
3intersect
output of the identification of intersecting probes
4probe-reduction
output of the identification and subset of probes that do intersect the preferred probe set (Adephaga, UCE-Ade)
5join-probes
output of the probes that remain and are used to create the joined probe set
6group-by
output of remapping the probes to the preferred genome and creation of the modified partition file based on probes that are found to be on the same gene
Characterization_analysis
R script to process and describe the joined and Adephaga probe sets
cogenic-loci
R script to describe the number of cogenic loci found (from the 6group-by output)
mapping_analysis
R script to describe the number of loci recovered by probe set for all taxa
Matrices
General directory structure:
.
├── all-gene
│   └── core+flanking
│       └── gblocks_30p
├── subset-loci
│   └── core+flanking
│       └── gblocks_30p
All data here are loci recovered using the joined probe set, which was designed for in-silico capture/recovery of sequence data, with Phyluce used for alignment and trimming.
The sub-directories hold the IQ-TREE2 analyses for the all and subset taxon data sets, under both locus and gene partitioning. Each contains the core, core+flanking, and flanking data sets at 30% and 40% taxa per locus. The all-taxon data sets have both trimAl and Gblocks data; the subset taxon set has only trimAl.
For example:
all-gene
├── core
│   ├── gblocks_30p
│   ├── gblocks_40p
│   ├── trimal_30p
│   └── trimal_40p
├── core+flanking
│   ├── gblocks_30p
│   ├── gblocks_40p
│   ├── trimal_30p
│   └── trimal_40p
└── flanking
    ├── gblocks_30p
    ├── gblocks_40p
    ├── trimal_30p
    └── trimal_40p
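The naming pattern in the example tree (curation method, then trimming method and locus occupancy) can be enumerated programmatically, which may help when looping over the matrices. This is a minimal Python sketch; the function name `leaf_dirs` is hypothetical, and the paths are built only from the names shown in the listing above, so the layout of the actual archive should be checked before relying on them.

```python
from itertools import product

CURATIONS = ("core", "core+flanking", "flanking")
OCCUPANCIES = ("30p", "40p")


def leaf_dirs(top_level: str, trimmers=("gblocks", "trimal")) -> list:
    """List the leaf directories expected under one top-level directory."""
    return [
        f"{top_level}/{curation}/{trimmer}_{occupancy}"
        for curation, trimmer, occupancy in product(CURATIONS, trimmers, OCCUPANCIES)
    ]


# Reproduces the twelve leaves of the all-gene example tree.
all_gene = leaf_dirs("all-gene")
for path in all_gene:
    print(path)
```

Passing `trimmers=("trimal",)` would restrict the list to trimAl-only directories, matching the description of the subset taxon set.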
Within each taxon set, the "all" directories contain output files from IQ-TREE2:
- the associated nexus matrix and partition files used as input for partition analyses
- the partitions resulting from the partition analyses
- IQ-TREE2 log files
- IQ-TREE2 iqtree file
- IQ-TREE2 newick tree files
Taxon subset data has a slightly different data prefix due to additional analyses that are described in the next section.
subset data
Within the Subset taxon data sets in the same directory containing the IQ-TREE2 files are the associated wASTRAL and concordance analyses files (subset-gene-core-trimal_30p).
- concatenated locus/gene IQ-TREE2 log file (gene-tree.*)
- concatenated locus/gene iqtree file (gene-tree.iqtree)
- concatenated locus/gene newick tree files generated by IQ-TREE2 (gene-tree.treefile)
- concordance analysis log file (concordance_*)
- concordance analysis tree file
- wASTRAL tree file (wASTRAL_*)
Because there is a slightly different workflow for concordance and wASTRAL analyses, there is a different prefix. Loci with similar partitioning models are not merged to create locus/gene trees.
Files prefixed with nomerege_* indicate that the loci were not merged after the search for the optimal partition.
Files prefixed with UFB_* indicate that the partition search was performed with IQ-TREE2's MF+MERGE option.
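When browsing the subset directories, the prefixes and patterns listed above can be used to sort files by analysis. The Python sketch below matches file names against those patterns; the labels are paraphrased from this description, and the example file names other than gene-tree.treefile are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns as written in this README, paired with a short label.
PATTERNS = [
    ("gene-tree.*", "concatenated locus/gene IQ-TREE2 output"),
    ("concordance_*", "concordance analysis output"),
    ("wASTRAL_*", "wASTRAL tree file"),
    ("nomerege_*", "partition search without merging"),
    ("UFB_*", "partition search with MF+MERGE"),
]


def describe(filename: str) -> str:
    """Return the label of the first matching pattern (case-sensitive)."""
    for pattern, label in PATTERNS:
        if fnmatchcase(filename, pattern):
            return label
    return "unclassified"


# Hypothetical file names, except gene-tree.treefile.
for name in ("gene-tree.treefile", "UFB_example.log", "wASTRAL_example.tre"):
    print(name, "->", describe(name))
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every operating system.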
SortaDate
Contains subdirectories
IQ-TREE2
output of IQ-TREE2 ModelFinder run on the PartitionFinder2 output
partitionfinder2
output of the PartitionFinder2 partitioning scheme
SortaDate
output of the SortaDate analysis
These are the files and outputs used for the SortaDate analysis.
core_parti-gene_30p.rr.tree
The tree file used in the SortaDate analysis
core_SORTADATE.*
Files were generated using AMAS after identifying the best partitioning scheme from PartitionFinder2 and IQ-TREE2
GeaSub_2.9kv1.fasta
The new ultra-conserved element probe set designed for Geadephaga
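Because the probe set is distributed as a plain FASTA file, a quick sanity check after download is to count the records and their lengths. The Python sketch below uses only the standard library; the demo headers and sequences are hypothetical and do not reflect the naming scheme of the actual probes.

```python
import io


def fasta_lengths(handle) -> dict:
    """Map each FASTA header (without '>') to its total sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may wrap, so lengths are accumulated.
            lengths[name] += len(line)
    return lengths


# Hypothetical two-record FASTA used in place of the real file.
demo = io.StringIO(">probe_a\nACGT\nAC\n>probe_b\nGGG\n")
lengths = fasta_lengths(demo)
print(len(lengths), "records;", sum(lengths.values()), "bases in total")
```

To run it on the downloaded file, replace the `io.StringIO` demo with `open("GeaSub_2.9kv1.fasta")` inside a `with` block.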
We combined several Adephaga genomic-level datasets available from the National Center for Biotechnology Information (NCBI) database https://www.ncbi.nlm.nih.gov/, the Sequence Read Archive of the DNA Data Bank of Japan (DDBJ) https://www.ddbj.nig.ac.jp/, and the Dryad Digital Repository https://datadryad.org. The sequence data comprise anchored hybrid enrichment data (ExC), genomic assemblies (GEN), transcriptomes (TRA), and ultraconserved elements (UCE). In an early draft of this manuscript, ExC data were referred to as AHE; in the majority of these files, AHE was changed to ExC. However, in some cases it was not reasonable to convert AHE to ExC, e.g., in the “BOLD_characterization” and “Matrices” folders.
Here we provide the output and scripts used to generate the characterization of loci, the locus partitioning, and the resulting joined probe set. We also provide the Geadephaga UCE fasta file. Also included are the IQ-TREE2 output for the phylogenetic inferences, PDFs of the trees generated by IQ-TREE2, the output of the SortaDate "gene shopping" approach, the BEAST analyses and their output files, the conserved locus analysis, the MitoFinder logs, and the Barcode of Life Database (BOLD) input (COI bycatch) and resulting BOLD output.
