Data from: Phylogenomics, biogeography, and description of a new subfamily and genus of African characiform fishes (Teleostei: Alestidae)
Data files
Feb 25, 2026 version files 299.35 MB
alestidae-100.xml (1.75 MB)
alestidae-MCC.tre (52.53 KB)
alestidae-mtdna.phy (481.71 KB)
clavocharax-distribution.kml (8.10 KB)
clavocharax-RASP.txt (3.68 KB)
mafft-nexus-edge-trimmed-clean-100p-BEAST.nex (1.71 MB)
mafft-nexus-edge-trimmed-clean-65p.phylip (83.29 MB)
mafft-nexus-edge-trimmed-clean-75p.nexus (77.72 MB)
mafft-nexus-edge-trimmed-clean-75p.phylip (77.61 MB)
mafft-nexus-edge-trimmed-clean-85p.phylip (56.37 MB)
partition_finder.cfg (174.02 KB)
partition.nex (173.86 KB)
README.md (2.70 KB)
Abstract
The Congo River, with the highest diversity of riverine fishes in Africa, only recently established its contemporary outlet into the Atlantic around the Miocene-Pliocene transition (~5 million years ago; Ma). Yet, its role in shaping ichthyofaunal diversification across central Africa through interactions with adjacent Atlantic coastal rivers remains unexplored at both regional and local scales. The African characiform family Alestidae, with lineages distributed across the entire region, offers an ideal system to investigate inland-coastal biogeographic connections. However, phylogenetic relationships within Alestidae remain unresolved, particularly with respect to two key genera, Brachypetersius and Nannopetersius, which inhabit both regions of interest. Applying likelihood and species-tree inferences using 1,759 nuclear ultraconserved elements (UCEs) and 13 protein-coding genes of mitochondrial genomes from 42 alestid taxa, we resolve both Brachypetersius and Nannopetersius as polyphyletic and identify a distinct clade warranting recognition as a new genus: Clavocharax. External morphological and osteological data from museum specimens corroborate this finding and support the revalidation of Clupeocharacinae as an inclusive subfamily, encompassing the new genus and seven other West and Central African genera, marking the first phylogenetically supported subfamily within Alestidae. Divergence time estimates suggest that Clavocharax originated in the Early Miocene (23.2–15.0 Ma), coinciding with climatic shifts and potential river capture events across the region of the Congo River outflow and Lower Guinean coastal systems. Ancestral range estimation implicates Miocene climatic and geological events, including the formation of the Congo's current Atlantic outlet, in driving repeated geodispersal and diversification across inland and coastal drainages.
This study highlights the influence of historical hydrological connectivity on African freshwater fish diversity and resolves previous gaps in our understanding of regional ichthyofaunal evolution and biogeography.
Description of the data and file structure
Data were collected from museum specimens. Ultraconserved elements (UCEs) from 40 taxa were sequenced on an Illumina platform. Mitogenomes were extracted from 37 of these, and 5 additional taxa were sequenced specifically for mitogenomes. Data were processed with PHYLUCE, Geneious, MitoFish, and MitoAnnotator; downstream analyses used IQ-TREE, SWSC-EN, PartitionFinder2, RAxML, ASTRAL, BEAST, Tracer, LogCombiner, and TreeAnnotator. Geocoordinates of Clavocharax were mapped in QGIS to delimit bioregions for ancestral range estimation in RASP using BioGeoBEARS model testing.
Files and variables
File: alestidae-100.xml
Description: Data file used for BEAST time-calibrated analysis.
File: alestidae-MCC.tre
Description: Maximum clade credibility (MCC) time tree of Alestidae.
File: alestidae-mtdna.phy
Description: Data matrix of 13 mitochondrial protein-coding genes of Alestidae.
File: clavocharax-distribution.kml
Description: Geocoordinates of Clavocharax museum records.
File: clavocharax-RASP.txt
Description: RASP results file comparing BioGeoBEARS models for Clavocharax ancestral range estimation.
File: mafft-nexus-edge-trimmed-clean-100p-BEAST.nex
Description: 100 % complete ultraconserved elements data matrix of Alestidae used for BEAST analysis.
File: mafft-nexus-edge-trimmed-clean-65p.phylip
Description: 65 % complete ultraconserved elements data matrix of Alestidae (n = 40).
File: mafft-nexus-edge-trimmed-clean-75p.nexus
Description: Concatenated nexus alignment of 75 % complete UCE matrix of Alestidae, with UCE markers and charsets for SWSC-EN input.
File: mafft-nexus-edge-trimmed-clean-75p.phylip
Description: Revised 75 % complete ultraconserved elements data matrix of Alestidae (n = 38).
File: mafft-nexus-edge-trimmed-clean-85p.phylip
Description: 85 % complete ultraconserved elements data matrix of Alestidae (n = 40).
File: partition.nex
Description: Partition file for ML analysis of 75 % complete UCE matrix of Alestidae in IQ-TREE 2, based on PartitionFinder2 output.
File: partition_finder.cfg
Description: Configuration file used for partitioning scheme optimization in PartitionFinder2, based on SWSC-EN output.
Code/software
PHYLUCE v.1.7.3
IQ-TREE 2
RAxML v.8.2.11
SWSC-EN
PartitionFinder2
ASTRAL v.5.7.8
Geneious v.6.0.3
MitoFish v.2025.06
MitoAnnotator
BEAST v.2.7.7
Tracer v.1.7.2
LogCombiner v.2.7.7
TreeAnnotator v.1.8.2
QGIS v.3.32.3-Lima
RASP v.4.4
Ultraconserved elements and phylogenomic analyses
For UCE sequencing of 40 taxa of Alestidae, total genomic DNA (gDNA) was extracted using the DNeasy tissue kit. Libraries were quantified and enriched with the myBaits Ostariophysan 2.7Kv1 probeset (overnight hybridization, 65 °C washes) and quantified with a spectrofluorimetric assay. Sequencing was performed on the Illumina NovaSeq 6000 platform (partial S4 PE150 lane), yielding ~14 Gbp in total.
We used PHYLUCE v.1.7.3 to prepare sequences for all UCE-based downstream analyses. We removed adapters and low-quality bases using Illumiprocessor v.2.10 and Trimmomatic v.0.39 and assembled contigs with SPAdes v.3.14.1 and Velvet v.1.2.10. We then identified orthologous UCE loci using the myBaits Ostariophysan 2.7Kv1 probeset and excluded putative paralogous regions in PHYLUCE. We extracted and aligned loci using an edge-trimming approach in MAFFT v.7.475 and constructed three matrices: a 65 % complete matrix (i.e., loci present in ≥ 26 terminals), a 75 % complete matrix (≥ 30 terminals), and an 85 % complete matrix (≥ 34 terminals).
Maximum likelihood (ML) phylogenetic inferences were performed in IQ-TREE 2 under the TVM+F+R7 model, as selected by ModelFinder, with 1,000 ultrafast bootstrap replicates. We excluded Brachypetersius altus (AMNH 274763, AMCC 258398) and 'Nannopetersius' mutambuei (AMNH 246602, AMCC 264226) from the 75 % matrix due to low coverage and long branches. We used this updated 75 % matrix of 38 terminals (i.e., loci present in ≥ 28 terminals) for all downstream analyses.
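The terminal thresholds quoted for each matrix follow from simple arithmetic. The sketch below assumes that the minimum count is the completeness percentage multiplied by the number of terminals and rounded down. That rounding rule is inferred from the reported numbers rather than stated in the methods, and the function name is hypothetical.

```python
def min_terminals(percent: int, n_terminals: int) -> int:
    """Smallest number of terminals a locus needs to enter the matrix.

    Assumes floor rounding (inferred from the reported thresholds).
    Integer arithmetic avoids floating-point rounding surprises.
    """
    return percent * n_terminals // 100


# Original dataset of 40 terminals
for pct in (65, 75, 85):
    print(f"{pct} % of 40 terminals -> >= {min_terminals(pct, 40)}")

# Updated 75 % matrix after excluding two terminals (n = 38)
print(f"75 % of 38 terminals -> >= {min_terminals(75, 38)}")
```

Under this assumption the function returns 26, 30, and 34 for the 40-terminal matrices and 28 for the revised 38-terminal matrix, matching the values given in the text.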
We applied the Sliding-Window Site Characteristics based on Entropy method (SWSC-EN) to the 75 % matrix and used PartitionFinder2 with rclusterf in RAxML v.8 (AICc, GTR+G, rcluster-max = 1000, rcluster-percent = 10) to generate the best partitioning scheme. We implemented the resulting partitioning scheme and the 75 % matrix in IQ-TREE 2 for ML inference, using ModelFinder (BIC) to select the best-fit models for each partition and merge similar subsets iteratively (MFP+MERGE), with 1,000 ultrafast bootstrap replicates.
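The partitioning workflow passes per-locus character sets between programs as NEXUS `charset` statements. As a rough illustration of that structure, the sketch below reads simple contiguous `charset name = start-end;` statements from a NEXUS sets block. The example block and locus names are made up and are not taken from partition.nex or the 75 % nexus alignment; real files may use features (for example codon steps or multi-range sets) that this minimal parser does not handle.

```python
import re

# Matches: charset <name> = <start>-<end>;  (name may be single-quoted)
CHARSET_RE = re.compile(
    r"charset\s+'?([^'\s=]+)'?\s*=\s*(\d+)\s*-\s*(\d+)\s*;",
    re.IGNORECASE,
)


def parse_charsets(nexus_text: str) -> dict[str, tuple[int, int]]:
    """Return {name: (start, end)} using 1-based, inclusive coordinates."""
    return {
        name: (int(start), int(end))
        for name, start, end in CHARSET_RE.findall(nexus_text)
    }


# Made-up example block (hypothetical locus names and coordinates)
example = """
#NEXUS
begin sets;
    charset 'uce-1.nexus' = 1-300;
    charset 'uce-2.nexus' = 301-750;
end;
"""

charsets = parse_charsets(example)
lengths = {name: end - start + 1 for name, (start, end) in charsets.items()}
print(lengths)
```

A check like this can help confirm that charset boundaries are contiguous and sum to the alignment length before a partition file is passed to a downstream program.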
For species tree reconstruction, ML gene trees for each UCE locus were estimated with RAxML v.8.2.11 (GTRGAMMA model), and a consensus species tree was inferred using ASTRAL v.5.7.8.
Mitochondrial genomic phylogenetic analysis
The mitogenomic dataset comprised 42 terminal taxa: mitogenomes were extracted from 37 UCE-sequenced samples, and 5 were sequenced specifically for mitogenome recovery as follows. The gDNA was extracted using the DNeasy kit, quality-checked, and quantified via spectrofluorimetry. Up to 80 % of gDNA was bead-purified for dual-indexed, Illumina-compatible library prep (~300 bp inserts). Libraries were quantified with spectrofluorimetry, visualized on an Agilent TapeStation 4200, pooled equimolarly, and sequenced on an Illumina NovaSeq 6000 (partial S4 PE150 lane), yielding ~1 Gbp in total.
We assembled, annotated, and aligned the mitogenomes for downstream phylogenetic analyses. Mitogenomes were assembled in Geneious v.6.0.3 by mapping raw reads to the Phenacogrammus interruptus reference (GenBank: AB054129) using medium-sensitivity parameters with up to five iterations. Exceptions were Rhabdalestes tangensis (CUMV 93797, T056-5587) and Tricuspidalestes caeruleus (AMNH 252193, AMCC 252193), which were mapped to the mitogenomes of sister taxa, Micralestes occidentalis (AMNH 256979–AMCC 222994) and Brachyalestes nurse (AMNH 254138–AMCC 226340), as determined from the 75 % ML tree. Assemblies were annotated with MitoFish v.2025.06 and MitoAnnotator. All 13 protein-coding genes were extracted in Geneious, aligned across taxa using MUSCLE v.5 (PPP algorithm), and concatenated into a single matrix. ML phylogenetic inference was performed in IQ-TREE 2 with 1,000 ultrafast bootstrap replicates and gene-partitioned best-fit models selected by ModelFinder.
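The concatenation step can be pictured as joining per-gene alignments end to end while recording where each gene starts and stops, which is what a gene-partitioned analysis needs. The following is a minimal sketch of that idea, not the Geneious procedure used here; the taxon names, gene sequences, and the choice to fill a missing gene with gap characters are all assumptions for illustration.

```python
def concatenate(genes: dict[str, dict[str, str]]):
    """Join per-gene alignments and return (matrix, partitions).

    genes: {gene_name: {taxon: aligned_sequence}}
    partitions: {gene_name: (start, end)}, 1-based and inclusive.
    """
    taxa = sorted({taxon for aln in genes.values() for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    partitions = {}
    offset = 0
    for gene, aln in genes.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not of equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon lacking this gene is padded with gaps
            matrix[taxon] += aln.get(taxon, "-" * length)
        partitions[gene] = (offset + 1, offset + length)
        offset += length
    return matrix, partitions


# Toy input with hypothetical taxa and very short "genes"
toy = {
    "ND1": {"taxon_a": "ATGCCC", "taxon_b": "ATGCCT"},
    "COI": {"taxon_a": "ATGA"},
}
matrix, partitions = concatenate(toy)
print(matrix)
print(partitions)
```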
Time-calibrated analysis
The time-calibrated phylogenetic analysis was set up in BEAUti and run in BEAST v.2.7.7 using the 100 % complete UCE matrix under a GTR+G+I substitution model, an uncorrelated relaxed molecular clock, a birth-death prior, and an 80 % complete ML tree as a starting topology. Time constraints included one root constraint and three fossil calibrations.
The BEAST analysis was run in three independent MCMC simulations of 100 million generations each, sampling every 10,000 generations. We used Tracer v.1.7.2 to verify all parameters had ESS > 200. Trees were merged with a 10 % burn-in using LogCombiner v.2.7.7, and an MCC tree was generated with TreeAnnotator v.1.8.2.
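As a quick consistency check on these settings, the number of trees retained after burn-in can be worked out by hand. The sketch below assumes that each run also logs the initial state (generation 0) and that the burn-in count is rounded down; both are assumptions about the software's behavior, and the function name is hypothetical.

```python
def retained_trees(generations: int, sample_every: int,
                   burnin_percent: int, runs: int) -> int:
    """Total trees kept across runs after discarding burn-in."""
    per_run = generations // sample_every + 1    # +1 for generation 0
    discarded = per_run * burnin_percent // 100  # burn-in rounded down
    return (per_run - discarded) * runs


total = retained_trees(100_000_000, 10_000, 10, 3)
print(total)  # 27003
```

Under these assumptions each run contributes 10,001 sampled trees, of which 1,000 are discarded, leaving 9,001 per run and 27,003 in total, which agrees with the number of time-calibrated trees reported for the historical biogeographical analysis.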
Historical biogeographical analysis
The five bioregions of Clavocharax were determined using geographic coordinates compiled from AMNH specimens and additional museum records identified using the FishNet2 Portal. The complete dataset was visualized in QGIS v.3.32.3 (Lima).
To estimate geographic range evolution, we used BioGeoBEARS in RASP v.4.4 to compare DEC, DIVALIKE, and BAYAREALIKE models, allowing ancestral ranges to include a maximum of five bioregions. All bioregions in a range were required to be contiguous, except that a connection was permitted between the “northern rivers and lower Ogooué” (region A in the RASP results file clavocharax-RASP.txt) and the Kouilou (C), excluding the upper Ogooué (B), based on elevational similarity and potential paleo-connections. From 27,003 time-calibrated trees, 100 were randomly selected for analysis, with the MCC tree used as the consensus topology. Model selection was based on AICc weights.
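For readers unfamiliar with AICc weights, the sketch below shows the standard calculation: AICc is computed from each model's log-likelihood, and the weights are the normalized relative likelihoods of the models. The log-likelihoods, parameter counts, and sample size used here are placeholders invented for illustration; they are not values from clavocharax-RASP.txt.

```python
import math


def aicc(log_lik: float, k: int, n: int) -> float:
    """Small-sample corrected AIC for k free parameters and sample size n."""
    aic = 2 * k - 2 * log_lik
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores: dict[str, float]) -> dict[str, float]:
    """Convert AICc scores into weights that sum to 1."""
    best = min(scores.values())
    rel = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


# Placeholder inputs (hypothetical): model -> (log-likelihood, free parameters)
models = {"DEC": (-20.0, 2), "DIVALIKE": (-21.5, 2), "BAYAREALIKE": (-25.0, 2)}
n = 10  # hypothetical sample size

scores = {m: aicc(lnl, k, n) for m, (lnl, k) in models.items()}
weights = akaike_weights(scores)
for model, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{model}: AICc = {scores[model]:.2f}, weight = {w:.3f}")
```

The model with the largest weight is the one best supported by the data among the candidates compared, given the penalty for parameter count.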
