Data from: Ultraconserved elements, DNA barcoding and morphology unravel the diversity and evolution of the Hypoponera pruinosa group (Hymenoptera: Formicidae) in Indochina and the Indo-Australian region
Data files
May 08, 2026 version, 14.04 GB in total
- HPG2025_DryadUpdated.zip (14.04 GB)
- README.md (13.96 KB)
Abstract
The genus Hypoponera Santschi, 1938 is arguably one of the most neglected ant genera in the world, owing to its nondescript appearance and subtle inter-species differences. Despite its apparent ubiquity in the tropics, little is known about Hypoponera diversity in Southeast Asia and its neighbours, where most species remain undescribed. In this unprecedented integrative study, we used a combination of evidence from nuclear ultra-conserved elements (UCEs) and mitochondrial cytochrome oxidase subunit I (COI), alongside comparisons of morphology, to build species hypotheses and elucidate the evolution of the Hypoponera pruinosa (Emery, 1900) group in Indochina and the Indo-Australian region. Based on phylogenetic analyses of 2,014 UCE loci (unphased and phased) and morphological assessments of putative molecular species units generated via four methods (SODA, bPTP, BPP, ≤4% COI clusters), we recognised a total of 26 nominal species in the pruinosa group, 23 of which are new to science. Reciprocal monophyly for most species was strongly supported. Divergence dating and biogeographic analyses revealed that the crown pruinosa group most likely originated in Indochina-Borneo ca. 14.9-13.8 Ma, and diversified largely eastwards to neighbouring areas from the Late Miocene. We identified two cases of species paraphyly, involving a novel species and the extant H. sabronae (Donisthorpe, 1941); both species were found nested within the geographically widespread H. pruinosa clade in UCE-based phylogenies. This study lays the groundwork for a follow-up taxonomic monograph on the pruinosa group, where new species will be formally treated and rigorously described.
Keywords: Formicidae, Systematics, Morphology & Evolution
Data inputs for and outputs of all genetic/genomic analyses organised by analysis type (designated folders). Main folder: HPG2025_DryadUpdated.zip
Description of the data and file structure (organised by folder)
ASTRAL analyses
- Astral_output subfolder: Contains all input and output files generated for multispecies coalescent (MSC) model-based analyses on ASTRAL III v5.7.8
- red-corr-hmhp-80p: Input and output for ASTRAL analyses on corrected unphased UCE dataset (80% taxon occupancy)
- red-corr-phased-hmhp-80p: Input and output for ASTRAL analyses on corrected phased UCE dataset (80% taxon occupancy)
- IQTree_GeneTrees subfolder: gene tree sets generated for ASTRAL III analyses, using IQ-TREE 2
- iqtree2_corr-reduced_unphased_hmhp_GENEtrees_output_80p: Output of gene tree generation by IQTree2 based on corrected unphased partitioned UCE loci
- iqtree2_corr-reduced-phased-hmhp_GENEtrees_output_80p: Output of gene tree generation by IQTree2 based on corrected phased partitioned UCE loci
BEAST2
Phased subfolder: contains output files generated from five independent runs of Markov Chain Monte Carlo (MCMC) analysis on BEAST2, using unpartitioned phased UCE loci at 100% taxon occupancy (110 loci) and a constraint tree constructed from partitioned phased UCEs at 80% taxon occupancy.
- Relevant files with prefix 'hmhp_ph100p_80pTreeBDM5'
- 'hmhp_ph100p_TreeBDM5_combi.trees': Tree files from five MCMC runs combined with LogCombiner.
- 'hmhp_ph100p_TreeBDM5_summary2-90p': Maximum clade credibility tree for phased unpartitioned UCEs, generated with TreeAnnotator and 10% burn-in.
Unphased subfolder: contains output files generated from five independent runs of Markov Chain Monte Carlo (MCMC) analysis on BEAST2, using unpartitioned unphased UCE loci at 100% taxon occupancy (110 loci) and a constraint tree constructed from partitioned unphased UCEs at 80% taxon occupancy.
- Relevant files with prefix 'hmhp_unp100p_80pTreeBDM5'
- 'hmhp_unp100p_TreeBDM5_combi1-5.trees': Tree files from five MCMC runs combined with LogCombiner.
- 'hmhp_unp100p_TreeBDM5_summary2-90p': Maximum clade credibility tree for unphased unpartitioned UCEs, generated with TreeAnnotator and 10% burn-in.
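The 10% burn-in mentioned above can also be applied to a BEAST2 trace log outside Tracer. The following is a minimal sketch, assuming the log is tab-delimited with a header row and optional '#' comment lines; the column name 'posterior' and the demo values are illustrative and are not taken from the deposited files.

```python
import csv
import io
from statistics import mean


def read_trace(handle, burnin_fraction=0.10):
    """Read a tab-delimited trace log and drop the initial burn-in rows."""
    # Skip blank lines and '#' comment lines that may precede the header.
    lines = [ln for ln in handle if ln.strip() and not ln.startswith("#")]
    rows = list(csv.DictReader(lines, delimiter="\t"))
    n_burn = int(len(rows) * burnin_fraction)  # number of rows discarded
    return rows[n_burn:]


def column_mean(rows, column):
    """Mean of one numeric column over the retained samples."""
    return mean(float(r[column]) for r in rows)


# Made-up demo log: 10 samples, so a 10% burn-in discards the first row.
demo_text = "# demo header comment\nSample\tposterior\n" + "".join(
    f"{i * 1000}\t{-100 + i}\n" for i in range(10)
)
kept = read_trace(io.StringIO(demo_text))
print(len(kept), column_mean(kept, "posterior"))
```

For real files, pass an open file handle instead of the `io.StringIO` demo object.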
BPP sp delim_BPP1-4_Inputs & Outputs
- Contains input (.ctl and lmap.txt) for and output files of species delimitation analyses (A10 method) by the Bayesian Phylogenetics and Phylogeography (BPP) program.
- Taxa divided into four subgroups - bpp1-4 - with corresponding datasets and files named accordingly.
- Each subset of UCE loci was first filtered for 100% taxon occupancy, then the top 300 most parsimony-informative loci were identified and selected per bpp group using the R package phyloch v1.5.3.
- Phased subfolder: Phased UCE loci sets per bpp group filtered and used for generation of final alignments (using the software phyluce) for BPP analyses.
- 'bppX-phased-100p-top300-raxml' subfolder: contains charsets and alignment in phylip (.phy) format.
- 'bppX-phased-100p-top300-raxml2' subfolder: contains alignment in nexus format.
- Unphased subfolder: Unphased UCE loci sets per bpp group filtered and used for generation of final alignments (using the software phyluce) for BPP analyses.
- 'bppX-unphased-100p-top300-raxml' subfolder: contains charsets and alignment in phylip (.phy) format.
- 'bppX-unphased-100p-top300-raxml2' subfolder: contains alignment in nexus format.
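The locus-selection step described in this section (filter for complete taxon occupancy, then rank by parsimony-informative sites) was carried out by the authors with phyluce and the R package phyloch. The Python sketch below only illustrates the idea and is not their code: locus names, taxon names and sequences are hypothetical, and a site is counted as parsimony-informative when at least two character states each occur in at least two sequences.

```python
from collections import Counter

# Characters treated as missing data (an assumption of this sketch).
GAP_OR_MISSING = set("-?NX")


def taxon_occupancy(alignment, all_taxa):
    """Fraction of taxa that have at least one real base in this locus."""
    present = sum(
        1
        for taxon in all_taxa
        if any(ch not in GAP_OR_MISSING for ch in alignment.get(taxon, "").upper())
    )
    return present / len(all_taxa)


def parsimony_informative_sites(alignment):
    """Count columns where >= 2 states each occur in >= 2 sequences."""
    seqs = [s.upper() for s in alignment.values()]
    count = 0
    for column in zip(*seqs):
        states = Counter(ch for ch in column if ch not in GAP_OR_MISSING)
        # Note: IUPAC ambiguity codes are counted as ordinary states here.
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count


def select_top_loci(loci, all_taxa, min_occupancy=1.0, top_n=300):
    """Keep loci meeting the occupancy cut-off, ranked by informative sites."""
    complete = {
        name: aln
        for name, aln in loci.items()
        if taxon_occupancy(aln, all_taxa) >= min_occupancy
    }
    ranked = sorted(
        complete,
        key=lambda name: parsimony_informative_sites(complete[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical toy data: three loci, four taxa.
taxa = ["a", "b", "c", "d"]
loci = {
    "locus-1": {"a": "AAGT", "b": "AAGT", "c": "ACTT", "d": "ACTT"},
    "locus-2": {"a": "AAAA", "b": "AAAA", "c": "ACAA", "d": "ACAA"},
    "locus-3": {"a": "ACGT", "b": "ACGT", "c": "TGCA"},  # taxon d missing
}
print(select_top_loci(loci, taxa, top_n=2))
```

Lowering `min_occupancy` to 0.8 would mimic the 80% occupancy matrices used elsewhere in this dataset.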
bPTP output
reduced-phased-hmhp-80p_rooted_500k-gens subfolder: Output files from species delimitation analyses by the Bayesian implementation of the Poisson Tree Processes (bPTP) model. Input was a single tree constructed from phased SWSC-EN partitioned UCEs. Analyses were conducted on the free bPTP server (https://species.h-its.org), running 5 x 10^5 (500,000) MCMC generations per analysis, with outgroups removed and all other input values left at their defaults.
COI_ObjectiveClustering_Dendrogram
Input and output files for objective clustering of 193 COI sequences, using the customized software obj_clust v0.1.2 (A. Srivathsan, unpublished; an implementation of objective clustering as described by Meier et al. 2006)
- 'hmhp_mod6_GBF_mega-mafft.fasta': Fasta file of COI barcodes of the final dataset aligned using MAFFT v7 with default parameters
- 'hmhp_mod6_GBFinal_mega2.fas': Fasta file of the preceding MAFFT-aligned fasta, checked and corrected on MEGA v11.0.13. Final file used to generate cluster dendrogram.
- 'hmhp_mod6_GBFinal_mega2.fas_pmatrix': Text file containing uncorrected pairwise distances (i.e., pairwise distance matrix) between all pairs of taxa
- 'hmhp_mod6_GBFinal_mega2.fas_pmatrix_clusterlist': Text file listing sequence clusters present at different percentage clustering thresholds (based on uncorrected pairwise distance). Each row lists clusters at a specific percentage threshold point indicated by numerals at the start of each row, followed by ':' and the number of clusters at the corresponding threshold.
- 'hmhp_mod6_GBFinal_mega2.fas_pmatrix_dendro.html': Cluster dendrogram depicting the divergence or merging of individual sequences and/or clusters at different uncorrected pairwise distance percentage thresholds (node values). Can be viewed in any web browser.
- hmhp_mod6_GBFinal_mega2.fas_pmatrix_clusterfastaouts subfolder: Fasta files of identical COI sequences (haplotypes) per respective fasta.
- hmhp_mod6_GBFinal_mega2.fas_pmatrix_threshfastaouts subfolder: Fasta files of COI sequences at different percentage distance thresholds 0.0-14.3%
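To make the pairwise distance matrix and the threshold clusters easier to interpret, here is a minimal sketch of uncorrected p-distances and threshold-based clustering. It is not the obj_clust program: it treats objective clustering as single-linkage grouping (a sequence joins a cluster if it is within the threshold of at least one member), and it assumes that sites with gaps or 'N' are skipped. Sequence names and sequences are invented.

```python
from itertools import combinations

MISSING = set("-?N")  # assumption: such sites are excluded from comparison


def p_distance(seq_a, seq_b):
    """Uncorrected distance: differing sites / sites compared."""
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no overlapping sites to compare")
    return differences / compared


def objective_clusters(sequences, threshold):
    """Single-linkage clusters of sequences at a p-distance threshold."""
    parent = {name: name for name in sequences}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sequences, 2):
        if p_distance(sequences[a], sequences[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented 10-bp toy "barcodes".
toy = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAAAC",
    "s3": "AAAAAAAACC",
    "s4": "CCCCCCCCCC",
}
print(objective_clusters(toy, threshold=0.10))
```

In the toy data, s1 and s3 differ by 20% but still share a cluster at a 10% threshold because both are within 10% of s2; this chaining is the defining property of single-linkage clustering.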
Contig Correction output
All output generated from running the Phyluce 'correction' workflow on assembled contigs with the phyluce_workflow program.
- consensus subfolder: contains filtered or corrected 'consensus' contigs where variant bases have been hard-masked
- filtered_norm_pileups subfolder: contains processed and normalized binary variant call files (.bcf) derived from realigning/mapping raw reads back to assembled UCE contigs, as part of the phyluce correction workflow. '.bcf.csi' files are index files corresponding to each BCF file.
- BCF files represent SNPs and indels found within the assemblies; variants in BCF files in the filtered_norm_pileups subfolder have been filtered to remove low quality base calls or those with low read depth or other issues.
- BCF files may be viewed on command-line using bcftools (bcftools view file.bcf), indexed or converted to plain text VCF using the same program. Other tools for importing and visualizing BCF files include: BaseSpace Variant Interpreter, Integrative Genomics Viewer.
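For batch conversion of the BCF files to plain-text VCF, the bcftools call can be scripted. This is a small sketch that assumes bcftools is installed and on the PATH; the file names are placeholders. It uses `bcftools view` with `-O v` (uncompressed VCF output) and `-o` (output file).

```python
import shutil
import subprocess


def bcf_to_vcf_command(bcf_path, vcf_path):
    """Build the bcftools command that writes a BCF file as plain-text VCF."""
    return ["bcftools", "view", "-O", "v", "-o", str(vcf_path), str(bcf_path)]


def run(command):
    """Run a command, failing early if the executable is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    subprocess.run(command, check=True)


# Placeholder file names; replace with a real .bcf from filtered_norm_pileups.
command = bcf_to_vcf_command("sample.bcf", "sample.vcf")
print(" ".join(command))
# run(command)  # uncomment once bcftools is available
```

Because the archive already ships '.bcf.csi' index files, re-indexing should not normally be needed before viewing or converting.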
IQTree2_ML analyses
- best_scheme.nex.symtest.csv: Results of three matched-pairs tests of symmetry on the final unphased partitioned UCE dataset, performed using IQTree2. These are meant to test the two phylogenetic assumptions of stationarity and homogeneity, and detect potential model violations.
- iqtree2_corr-reduced_unphased-hmhp_output_100p subfolder: Output of IQTree2 analyses on unphased SWSC-EN partitioned UCE dataset at 100% taxon occupancy.
- iqtree2_corr-reduced_phased-hmhp_output_80p subfolder: Output of IQTree2 analyses on phased SWSC-EN partitioned UCE dataset at 80% taxon occupancy.
- iqtree2_corr-reduced_phased-hmhp_oupput_100p subfolder: Output of IQTree2 analyses on phased SWSC-EN partitioned UCE dataset at 100% taxon occupancy.
- iqtree2_corr-reduced-unphased-hmhp_output_80p subfolder: Output of IQTree2 analyses on unphased SWSC-EN partitioned UCE dataset at 80% taxon occupancy.
RASP biogeographic analyses
Input (.csv, .tre) for and output (.txt, .bak) of biogeographic analyses of BEAST2-generated trees using RASP v4.4
- 'hpg_phased_states.csv' and 'hpg_states.csv': Distribution files indicating different distributions (i.e., 'State' column) per taxon (i.e., 'Name' column) on the input tree, for phased and unphased UCE trees respectively. Only one state per taxon given.
- 'reduced3b_dis.csv': Final distribution file indicating different distributions/states per species - unlike previous distribution files, each species is only represented once (under 'Name') and multiple states are applied where necessary.
- hmhp_ph100p_TreeBDM5_MCC-90p_RELABEL.tre: Initial input maximum clade credibility phased BEAST tree pruned to select taxa representing different geographic states.
- hmhp_unp100p_TreeBDM5_MCC-90p_RELABEL.tre: Initial input maximum clade credibility unphased BEAST tree pruned to select taxa representing different geographic states.
- Phased subfolder: Input (phased_red3_Final_tree.tre, .treeset.trees) and output files for final biogeographic analyses based on the Bayesian-based BayArea ('BAYAREALIKE') model, for phased UCE data.
- Unphased subfolder: Input (unphased_reduced3_tree.tre, .treeset.trees) and output files for final biogeographic analyses based on the Bayesian-based BayArea ('BAYAREALIKE') model, for unphased UCE data.
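The difference between the per-taxon distribution files and the per-species file can be illustrated by merging states. The sketch below is hypothetical: it assumes each state is coded as one or more single-letter area codes, and the taxon-to-species mapping, taxon names and states are invented rather than read from the deposited CSV files. Only the 'Name' and 'State' column names come from the description above.

```python
import csv
import io


def merge_states(rows, species_of):
    """Combine the area letters of all taxa belonging to the same species."""
    merged = {}
    for row in rows:
        species = species_of[row["Name"]]
        # Each character of the state string is treated as one area code.
        merged.setdefault(species, set()).update(row["State"].strip())
    return {sp: "".join(sorted(areas)) for sp, areas in merged.items()}


# Invented per-taxon table with one state per taxon.
demo_csv = "Name,State\ntaxon1,A\ntaxon2,B\ntaxon3,A\n"
rows = list(csv.DictReader(io.StringIO(demo_csv)))

# Invented mapping from tip label to species.
species_of = {"taxon1": "sp1", "taxon2": "sp1", "taxon3": "sp2"}

per_species = merge_states(rows, species_of)
print(per_species)
```

Here 'sp1' ends up with a two-area state because its two taxa occur in different areas, which mirrors the "multiple states where necessary" layout of the final distribution file.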
SODA sp delimitation
- SODA-output_reduced-phased-hmhp-100ptrees_80pRootedGuideTree: Input (.tre, .txt) and output (.out, .cl) files for species delimitation analyses using the program SODA.
- SODA-output_reduced-unphased-hmhp-100ptrees_80pRootedGuideTree: Input (.tre, .txt) and output (.out, .cl) files for species delimitation analyses using the program SODA.
All input gene trees were filtered for 100% taxon occupancy and low-support branches (UFBoot<50%) collapsed before running each round of SODA analysis, using the 80p rooted SWSC-EN partitioned tree (either phased or unphased) as a guide each time.
UCE assembly alignment
Relevant input for and output files from processing of UCEs from contig assembly, correction, to identification of UCE loci and alignment for both phased and unphased datasets. All processes were performed on Phyluce.
- all_barcodes_hmhp_COI_filtered_fromUCEs subfolder: Output from running 'phyluce_assembly_match_contigs_to_barcodes' on assembled phased contigs, comprising COI sequences (full or partial) found and 'sliced' from corresponding contigs, in .fasta and .lastz formats.
- corrected-contigs subfolder: Corrected unphased assembled (using SPADES) contigs from the phyluce correction workflow, used for downstream alignment and further analyses.
- phased-uce-search-results subfolder: Output from running 'phyluce_assembly_match_contigs_to_probes' on assembled corrected and phased contigs, in .lastz format.
- uce-search-results subfolder: Output from running 'phyluce_assembly_match_contigs_to_probes' on assembled corrected contigs, in .lastz format.
- RED-phased-taxon-sets folder >> all subfolder: Output directory for all phyluce analysis steps performed on phased UCE data from the final 104 taxa (208 for phased data), starting from 'phyluce_assembly_get_match_counts' to MAFFT alignment and trimming [internal and GBlocks (gb)], to concatenation and data matrix generation at 80% and 100% taxon occupancies [.phy and .nex file outputs].
- RED-taxon-sets folder >> 'all' subfolder: Output directory for all phyluce analysis steps performed on unphased UCE data from the final 104 taxa, starting from 'phyluce_assembly_get_match_counts' to MAFFT alignment and trimming [internal and GBlocks (gb)], to concatenation and data matrix generation at 80% and 100% taxon occupancies [.phy and .nex file outputs].
- hymenoptera-v2-ANT-SPECIFIC-uce-baits.fasta: Probe set used to find relevant UCE loci from assembled contigs.
- phyluce_assembly_get_match_counts.txt: Log file generated from running 'phyluce_assembly_get_match_counts' on 'uce-search-results'.
- phyluce_assembly_match_contigs_to_barcodes.txt: Log file generated from running 'phyluce_assembly_match_contigs_to_barcodes' on corrected phased contigs.
- phyluce_assembly_match_contigs_to_probes.txt: Log file generated from running 'phyluce_assembly_match_contigs_to_probes' on corrected unphased contigs.
- RED-taxon-set-hmhp.conf (text): Configuration (.conf) file comprising list of selected taxa (104 taxa) whose unphased UCE data would be extracted and stored for further analysis under 'RED-taxon-sets'.
- RED-taxon-set-phased-hmhp.conf (text): Configuration (.conf) file comprising list of selected taxa (208 taxa, phased alleles) whose UCE data would be extracted and stored for further analysis under 'RED-phased-taxon-sets'.
- UCE_assemblies_summary_stats_per_taxon (text): Summary statistics of the initial SPADES assembly (107 taxa, before filtering to the final 104 taxa). The fields on each line are arranged as follows:
- Sample, contigs, total bp, mean length, 95 CI length, min length, max length, median length, contigs >1kb
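The field order listed above can be used to load the summary statistics into named records. This is a minimal sketch that assumes the fields are comma-separated with one sample per line; the delimiter is an assumption, and the demo line contains made-up numbers, not values from the dataset.

```python
import csv
import io

# Field names follow the order given in this README.
FIELDS = [
    "sample", "contigs", "total_bp", "mean_length", "ci95_length",
    "min_length", "max_length", "median_length", "contigs_over_1kb",
]


def parse_stats(handle):
    """Parse per-taxon assembly statistics into a list of dictionaries."""
    records = []
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        try:
            for key in FIELDS[1:]:
                record[key] = float(record[key])
        except ValueError:
            continue  # skip a header line or any non-numeric row
        records.append(record)
    return records


# Made-up demo content: one header-like line and one sample line.
demo = io.StringIO(
    "Sample,contigs,total bp,mean length,95 CI length,min length,"
    "max length,median length,contigs >1kb\n"
    "sampleA,1000,500000,500.0,10.5,100,5000,450.0,50\n"
)
stats = parse_stats(demo)
print(stats[0]["sample"], stats[0]["mean_length"])
```

If the file turns out to be tab- or space-delimited, passing `delimiter="\t"` (or splitting on whitespace) in place of the default comma would be the only change needed.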
YAML input files for Contig Correction & Phasing
YAML format files required as input for the mapping, correction and phasing workflows (Phyluce) applied on assembled contigs prior to actual phylogenetic analyses downstream.
Sharing/Access information
Link to open-access publication and supplementary datasets (online only):
- https://doi.org/10.1093/isd/ixag013
Further Remarks
- We recommend viewing all text or text-based files (including .nex, .fasta) in the freeware Notepad++.
- Tree files (.tre, .tree) may be visualized on FigTree or other phylogenetic tree viewers available online.
- YAML files can be opened and edited on https://codebeautify.org/yaml-beautifier.
- BEAST2 trace files can be viewed on Tracer v1.7.1 (or any later version). Trace files are listed as TEXT files in Windows; you may need to add the '.log' extension before viewing them on Tracer.
