Cryptic species can be phylogenetically old despite strong sex-biased dispersal

Dietz, Lars 1; Eberle, Jonas 2; Kukowka, Sandra 1; Podsiadlowski, Lars 1; Bazzato, Erika 3; Stange, Madlen 1; Warnock, Rachel 4; Niehuis, Oliver 5; Mayer, Christoph 1; Ahrens, Dirk 1

Research facility: USYB-2024-168

Published Oct 24, 2025 on Dryad. https://doi.org/10.5061/dryad.931zcrjtw

Abstract

The impact of strongly differentiated populations on species delimitation due to limited or sex-biased dispersal remains challenging and under-explored in the framework of integrative taxonomy. The Mediterranean chafer beetle genus Pachypus is remarkable for its extreme female philopatry, with entirely wingless and subterranean females. This makes Pachypus an interesting case study. Based on a dataset of over 900 protein-coding genes (metazoan universal single-copy orthologs; mzl-USCOs), we investigated phylogeny, species delimitation, gene flow, and population differentiation to provide an integrative assessment of species boundaries. Integrative consideration of all results led to the recognition of 14 mostly morphologically cryptic species, including several new taxa. Most inferred speciation events occurred between the end of the Messinian salinity crisis (about 5.3 million years ago) and the early Pleistocene. Phylogenetically old species and a lack of recent speciation were unexpected because of the extreme philopatry, the morphological similarity of the species, and the high degree of differentiation observed among populations of the same species. Speciation was partly associated with the disruption of previously more connected ranges after the Messinian salinity crisis (MSC). This also helps clarify the extent to which the Mediterranean dried out during the MSC, since land connections in the circum-Tyrrhenian region must have persisted long enough for flightless Pachypus females to disperse across drifting land areas connecting the Apennine Peninsula and Africa. We found evidence for historical gene flow between species, while more recent gene flow between populations is low, which is potentially the cause of considerable over-splitting found in the Bayesian Phylogenetics & Phylogeography (BPP) species delimitation analysis.
We showed that integrating the outcome of the BPP species delimitation with genealogical divergence index (gdi) values proved to be helpful in some cases but was inconclusive in many others. Generalized Mixed Yule Coalescent (GMYC) and Poisson Tree Processes (PTP) analyses were less prone to over-splitting. This illustrates how species delimitation analyses of cases with restricted or sex-biased dispersal and highly differentiated populations can serve as empirical tests of the utility and robustness of delimitation approaches.

Supplementary files for study "Cryptic species or metapopulations? Post-Messinian speciation of ancient Tyrrhenian philopatric Pachypus chafer beetles despite repeated hybridization", L. Dietz et al., submitted

For questions, contact Dirk Ahrens (d.ahrens@leibniz-lib.de) or Lars Dietz (l.dietz@leibniz-lib.de)

Each ZIP file contains a folder with several files containing alignments, phylogenetic trees, or other data. For more information on these files, please consult the accompanying description files.

All alignments, including SNP datasets, are in FASTA format and can be opened with standard alignment viewers. Except in the alignments from the Scarabaeoidea dataset, ambiguity codes (R, Y, W, S, M, K) stand for positions inferred to be heterozygous.

Phylogenetic trees are in NEWICK format and can be opened in a standard phylogenetic tree viewer such as FigTree. All trees are unrooted. Partition files for concatenated alignments are in NEXUS format for use in IQ-TREE.

SNP datasets recoded for NMDS analysis are tab-delimited text files in which the numbers have the following meaning: 0 = homozygous for the more common allele, 1 = heterozygous, 2 = homozygous for the less common allele. Unknown positions are represented by empty cells.

Figures are in SVG format and can be opened e.g. with Inkscape. All other files can be opened with a standard text editor.

Supplementary Tables are in XLSX format and can be opened e.g. with Excel.

Description of the data and file structure

gdi_imap_files.zip: Files with population assignment of Pachypus individuals used in BPP analyses for calculation of the genealogical divergence index (gdi). File names are in the format pachypus_{group}_gdi_Imap.txt, where "group" stands for the analyzed group of Pachypus.

gdi_tables.zip: Tables containing parameters estimated by BPP analyses for calculation of the genealogical divergence index (gdi), with five repeats of the analysis for each clade. File names are in the format {group}_gdi{num}.txt, where "group" stands for the analyzed group of Pachypus and "num" stands for the number of the analysis from 0 to 4.
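As an illustration of how gdi is usually derived from BPP estimates, the sketch below uses the standard definition gdi = 1 - exp(-2*tau/theta), where tau is the divergence time of a population from its sister and theta is its population size parameter. The thresholds of 0.2 and 0.7 are the commonly cited heuristic values; whether exactly these were applied in this study is an assumption, and the column layout of the tables is not assumed here (the functions take plain numbers).

```python
import math

def gdi(tau, theta):
    """Genealogical divergence index from BPP-style tau and theta estimates."""
    return 1.0 - math.exp(-2.0 * tau / theta)

def classify_gdi(value, lower=0.2, upper=0.7):
    """Heuristic interpretation; the thresholds are assumed defaults, not taken from the dataset."""
    if value < lower:
        return "single species"
    if value > upper:
        return "distinct species"
    return "ambiguous"

# Hypothetical values for illustration only.
example = gdi(tau=0.001, theta=0.002)
print(round(example, 3), classify_gdi(example))
```

Values between the two thresholds are left as ambiguous rather than forced into either category.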

gene_alignments.zip: DNA alignments of individual USCO loci of 171 Pachypus individuals and two outgroups.

gene_trees.zip: Phylogenetic trees based on individual USCO loci of 171 Pachypus individuals and two outgroups.

pachypus_admixture_10.zip: Input and results of ADMIXTURE analysis of complete Pachypus SNP dataset.

(see pachypus_admixture_descriptions.txt for descriptions)

pachypus_a_dsuite.zip: Input and results of analysis of inter-population gene flow within Pachypus clade A with Dsuite.

(see pachypus_a_dsuite_descriptions.txt for descriptions)

pachypus_b_dsuite.zip: Input and results of analysis of inter-population gene flow within Pachypus clade B with Dsuite.

(see pachypus_b_dsuite_descriptions.txt for descriptions)

pachypus_bpp_trees.zip: Trees from BPP species delimitation analysis of major Pachypus groups. Numbers on branches are posterior probabilities for species-level splits.

(see pachypus_bpp_trees_descriptions.txt for descriptions)

scarab_astral_trees.zip: Coalescent-based phylogenetic trees from ASTRAL analysis of the Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_astral_trees_descriptions.txt for descriptions)

scarab_concat_alignments.zip: Concatenated alignments of USCO genes from Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_concat_alignments_descriptions.txt for descriptions)

scarab_concat_trees.zip: Maximum-likelihood phylogenetic trees created with IQ-TREE from concatenated Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_concat_trees_descriptions.txt for descriptions)

scarab_gene_alignments.zip: Alignments of individual USCO genes from Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_gene_alignments_descriptions.txt for descriptions)

scarab_gene_trees.zip: Maximum-likelihood phylogenetic trees based on individual USCO genes from Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_gene_trees_descriptions.txt for descriptions)

scarab_partition_files.zip: Partition files in NEXUS format for concatenated alignment of USCO genes from Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals.

(see scarab_partition_files_descriptions.txt for descriptions)

snp_alignments.zip: Alignments of informative SNPs from USCO loci of Pachypus and its major clades.

(see snp_alignments_descriptions.txt for details)

snp_nmds.zip: Tables of biallelic SNPs for NMDS analysis from USCO loci of Pachypus and its major clades.

(see snp_nmds_descriptions.txt for details)

usco_admixture_10.zip: Input and results of ADMIXTURE analyses of the four major Pachypus clades and other taxa analyzed for comparison.

(see usco_admixture_descriptions.txt for descriptions)

co1_cons.fas: Aligned consensus sequences of cytochrome oxidase 1 of Pachypus individuals extracted with MitoGeneExtractor.

pachypus_dstat.txt: D-statistics for inter-population gene flow within Pachypus based on ADMIXTOOLS analysis. Columns are, from left to right, population 1 (outgroup), populations 2, 3, and 4 (the latter two are sister groups), estimated D-statistic, standard error, z-score for significance testing, and p-value. A negative D-statistic indicates gene flow between populations 2 and 3; a positive D-statistic indicates gene flow between populations 2 and 4.
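The sign convention described for pachypus_dstat.txt can be sketched as follows. This is a minimal reading aid, assuming whitespace-separated columns in the order given above; the population names in the example and the |z| >= 3 cutoff are illustrative assumptions, not values taken from the file.

```python
def parse_dstat_row(line):
    """Split one row into named fields (assumes whitespace-separated columns)."""
    p1, p2, p3, p4, d, se, z, p = line.split()
    return {"pop1": p1, "pop2": p2, "pop3": p3, "pop4": p4,
            "D": float(d), "se": float(se), "z": float(z), "p": float(p)}

def interpret_d(d, z, z_crit=3.0):
    """Apply the sign convention; z_crit is an assumed significance cutoff."""
    if abs(z) < z_crit:
        return "no significant signal"
    if d < 0:
        return "gene flow between populations 2 and 3"
    return "gene flow between populations 2 and 4"

# Hypothetical row for illustration only.
row = parse_dstat_row("outgroup popA popB popC -0.05 0.01 -5.0 0.0000006")
print(interpret_d(row["D"], row["z"]))
```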

pachypus_f3.txt: f3 statistics for admixed origin of populations within Pachypus based on ADMIXTOOLS analysis. Columns are, from left to right, populations 1, 2, 3, estimated f3 statistic, standard error, z-score for significance testing, and p-value. If the f3 statistic is significantly negative, population 1 is likely admixed between populations 2 and 3.

pachypus_fst.txt: FST statistics for differentiation between populations within Pachypus based on ADMIXTOOLS analysis. Columns are, from left to right, population 1, population 2, estimated FST statistic, and standard error.

pachypus_mcmctree_ahrens.tre: Calibrated tree of Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals calculated with MCMCTree according to calibration points from Ahrens et al. (2014).

pachypus_mcmctree_mckenna.tre: Calibrated tree of Scarabaeoidea transcriptomic dataset including 12 Pachypus individuals calculated with MCMCTree according to calibration points from McKenna et al. (2019).

pachypus_outgroups_astral.tre: Coalescent-based phylogenetic tree from ASTRAL analysis of USCO dataset of 171 Pachypus individuals and two outgroups, with local posterior probabilities as support values.

pachypus_outgroups_astral_q.tre: Coalescent-based phylogenetic tree from ASTRAL analysis of USCO dataset of 171 Pachypus individuals and two outgroups, with quartet scores as support values.

pachypus_outgroups_concat.treefile: Maximum-likelihood phylogenetic tree created with IQ-TREE from concatenated USCO dataset of 171 Pachypus individuals and two outgroups.

pachypus_outgroups_nt.charset.nex: Partition file in NEXUS format for concatenated alignment of USCO genes of 171 Pachypus individuals and two outgroups.

pachypus_outgroups_nt_concat3.fas: Concatenated alignment in FASTA format of USCO genes of 171 Pachypus individuals and two outgroups.

pachypus_snapp_root.xml: XML file for SNAPP species tree analysis of Pachypus calibrated with the age of the root.

pachypus_snapp_msc.xml: XML file for SNAPP species tree analysis of Pachypus calibrated with the Messinian Salinity Crisis.

pachypus_snapp_both.xml: XML file for SNAPP species tree analysis of Pachypus calibrated with both calibration points.

pachypus_snapp_root.log: Log file containing parameters of SNAPP species tree analysis of Pachypus calibrated with the age of the root.

pachypus_snapp_msc.log: Log file containing parameters of SNAPP species tree analysis of Pachypus calibrated with the Messinian Salinity Crisis.

pachypus_snapp_both.log: Log file containing parameters of SNAPP species tree analysis of Pachypus calibrated with both calibration points.

pachypus_snapp_root.trees: NEXUS file containing trees from SNAPP species tree analysis of Pachypus calibrated with the age of the root.

pachypus_snapp_msc.trees: NEXUS file containing trees from SNAPP species tree analysis of Pachypus calibrated with the Messinian Salinity Crisis.

pachypus_snapp_both.trees: NEXUS file containing trees from SNAPP species tree analysis of Pachypus calibrated with both calibration points.

pachypus_snapp_root_cons.tre: Consensus tree from SNAPP species tree analysis of Pachypus calibrated with the age of the root.

pachypus_snapp_msc_cons.tre: Consensus tree from SNAPP species tree analysis of Pachypus calibrated with the Messinian Salinity Crisis.

pachypus_snapp_both_cons.tre: Consensus tree from SNAPP species tree analysis of Pachypus calibrated with both calibration points.

Sharing/Access information

Links to other publicly accessible locations of the data:

none

Data was derived from the following sources:

Raw reads from hybrid enrichment of Pachypus spp.: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1050384/
Transcriptome assemblies of scarabaeoid beetles from Dietz et al. (2023b): https://www.ncbi.nlm.nih.gov/bioproject/PRJNA906571/
SNP data of different arthropod and vertebrate taxa from Dietz et al. (2023a): https://datadryad.org/stash/dataset/doi:10.5061/dryad.hhmgqnkg5
SNP data of Anopheles, Drosophila, Heliconius and Darwin's finches extracted by Dietz et al. (in press): https://datadryad.org/stash/dataset/doi:10.5061/dryad.kprr4xhb3

Code/Software

trinity_longest_d.pl: This script creates filtered versions of Trinity assembly results, containing only the longest variant of each contig. Requires, in that order, the input folder containing assemblies in FASTA format, and an output folder for filtered assemblies. Names of assembly files must end in .fasta. Example: trinity_longest_d.pl input_folder/ output_folder/
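The filtering idea can be illustrated with a short Python sketch. It is not the Perl script itself: it assumes that "longest variant of each contig" means the longest isoform per Trinity gene, and that sequence IDs follow the usual Trinity pattern ending in "_i<number>".

```python
import re

def longest_isoform_per_gene(records):
    """records: iterable of (header, sequence) pairs.
    Returns a dict mapping gene ID to the (header, sequence) of its longest isoform."""
    best = {}
    for header, seq in records:
        seq_id = header.split()[0]
        # Assumed Trinity naming: the gene ID is the sequence ID without "_i<number>".
        gene_id = re.sub(r"_i\d+$", "", seq_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best

# Hypothetical records for illustration only.
records = [
    ("TRINITY_DN1_c0_g1_i1 len=4", "ACGT"),
    ("TRINITY_DN1_c0_g1_i2 len=6", "ACGTAC"),
    ("TRINITY_DN2_c0_g1_i1 len=3", "AAA"),
]
for gene, (header, seq) in longest_isoform_per_gene(records).items():
    print(gene, len(seq))
```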

hmmalign_cut2_d.pl: This script removes all positions not covered by the HMM from hmmalign protein alignments (STOCKHOLM format), and the nucleotide alignments (FASTA format) based on them. As part of this process, the STOCKHOLM format of the protein alignments is converted to FASTA. Requires, in that order, the paths to the folder containing input protein alignments, the folder containing input nucleotide alignments, an output folder for protein alignments, and an output folder for nucleotide alignments. Names of protein and nucleotide alignments must be identical, with the former ending in .sth and the latter in .fas. Example: hmmalign_cut2_d.pl input_prot/ input_nuc/ output_prot/ output_nuc/

extract_codpos_d.pl: This script removes the third codon position from all nucleotide alignments (FASTA format) in a folder. Requires, in that order, the paths to the folder containing input alignments and an output folder for modified alignments. Names of alignment files must end in .fas. Example: extract_codpos_d.pl input_folder/ output_folder/
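The operation itself is simple enough to sketch in Python. This is an illustrative equivalent, not the Perl script, and it assumes that each alignment is in frame and starts at codon position 1.

```python
def drop_third_codon_positions(seq):
    """Keep first and second codon positions (0-based indices 0 and 1 of each triplet)."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)

def strip_alignment(seqs):
    """seqs: dict mapping taxon name to aligned, in-frame nucleotide sequence."""
    return {name: drop_third_codon_positions(s) for name, s in seqs.items()}

# Hypothetical sequences for illustration only.
print(strip_alignment({"taxon1": "ATGGCC", "taxon2": "ATG---"}))
```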

removegaps_snp_inf_d.pl: This script removes non-parsimony-informative SNPs from SNP datasets in FASTA format. Requires, in that order, the path to the folder containing input data and a folder where output alignments are put. Input file names must end in .fas. Example: removegaps_snp_inf_d.pl input_folder/ output_folder/

removegaps_snp_d.pl: This script removes all positions from SNP datasets in FASTA format that are missing in more than a specified number of taxa. It is recommended to initially set this to 0 or some other small number, then delete all empty files from your output folder, and increase by one and repeat until no more empty files are produced. Input file names must end in .fas. Requires, in that order, the path to the folder containing input alignments, the maximum number of taxa containing a gap, and a folder where output alignments are put. Example: removegaps_snp_d.pl input_folder/ 0 output_folder/
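The filtering rule can be sketched in Python as follows. Which characters the Perl script counts as missing is not stated, so the set used here ("-", "N", "?") is an assumption.

```python
MISSING = {"-", "N", "?"}  # assumed missing-data symbols

def drop_sparse_sites(seqs, max_missing):
    """Keep only positions that are missing in at most max_missing taxa.
    seqs: dict mapping taxon name to aligned sequence (all of equal length)."""
    names = list(seqs)
    length = len(seqs[names[0]])
    keep = [i for i in range(length)
            if sum(seqs[n][i].upper() in MISSING for n in names) <= max_missing]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}

# Hypothetical alignment for illustration only.
alignment = {"a": "A-C", "b": "ANC", "c": "AGC"}
print(drop_sparse_sites(alignment, 0))
print(drop_sparse_sites(alignment, 2))
```

Raising max_missing step by step, as recommended above, trades completeness of the matrix against the number of retained sites.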

snp-remove-mt2_d.pl: This script removes sites with more than two alleles from an SNP dataset in FASTA format. Requires, in that order, name of the input file, a STRUCTURE output file of the dataset, and name of the output file. Example: snp-remove-mt2_d.pl input.fas structure_file output.fas
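The Perl script relies on a STRUCTURE output file to identify such sites. As an alternative illustration only, the Python sketch below counts alleles directly from the alignment, expanding the heterozygous ambiguity codes (R, Y, W, S, M, K) into their two bases and ignoring all other symbols. It is a different route to the same kind of filter, not a reimplementation of the script.

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT"}

def allele_count(column):
    """Number of distinct bases at a site; unknown symbols contribute nothing."""
    alleles = set()
    for base in column.upper():
        alleles.update(IUPAC.get(base, ""))
    return len(alleles)

def keep_max_biallelic(seqs):
    """Drop sites with more than two alleles.
    seqs: dict mapping taxon name to aligned sequence (all of equal length)."""
    names = list(seqs)
    length = len(seqs[names[0]])
    keep = [i for i in range(length)
            if allele_count("".join(seqs[n][i] for n in names)) <= 2]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}

# Hypothetical alignment for illustration only.
print(keep_max_biallelic({"a": "AA", "b": "RC", "c": "GG"}))
```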

snp-pca_d.pl: This script converts an SNP dataset in FASTA format, including only SNPs with at most two alleles, to a format usable for PCA or NMDS. A majority consensus sequence of the input alignment, as a file containing only the sequence, must be created beforehand. Requires, in that order, name of the input, consensus, and output files. Input alignment file names must end in .fas. Example: snp-pca_d.pl input.fas cons.txt output.txt
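The recoding can be illustrated with a minimal Python sketch that follows the 0/1/2 scheme used for the NMDS tables: 0 for homozygous for the more common (consensus) allele, 1 for heterozygous, 2 for homozygous for the less common allele, and an empty cell for unknown positions. The exact output layout of the Perl script is not reproduced here; the function names are hypothetical.

```python
HET_CODES = {"R", "Y", "W", "S", "M", "K"}
BASES = {"A", "C", "G", "T"}

def recode_site(base, consensus_base):
    """Recode one genotype relative to the majority-consensus base."""
    base = base.upper()
    if base in HET_CODES:
        return "1"
    if base in BASES:
        return "0" if base == consensus_base.upper() else "2"
    return ""  # unknown position -> empty cell

def recode_sample(seq, consensus):
    """Recode a whole sequence; returns one cell per site."""
    return [recode_site(b, c) for b, c in zip(seq, consensus)]

# Hypothetical sequences for illustration only.
print("\t".join(recode_sample("ARG-", "AAAA")))
```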

concat_eogs_part_d.pl: This script creates a concatenated alignment FASTA file from all alignments (FASTA format) in a folder. It also creates a partition file in NEXUS format listing each alignment as a partition. Requires, in that order, the path to the folder containing input alignments, a name for the concatenated output alignment, and the partition file. Names of alignment files must end in .fas. Example: concat_eogs_part_d.pl input_folder/ concat.fas partition.nex
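A minimal Python sketch of the same idea is given below. It assumes that taxa absent from a locus are filled with gaps and that the partition file uses a NEXUS sets block with one charset per locus, which IQ-TREE can read. Whether the Perl script handles absent taxa in the same way is an assumption.

```python
def concatenate(alignments):
    """alignments: dict mapping locus name to a dict of taxon -> aligned sequence.
    Returns (concatenated alignment as dict, NEXUS partition text)."""
    taxa = sorted({t for aln in alignments.values() for t in aln})
    parts = {t: [] for t in taxa}
    charsets = []
    start = 1
    for locus, aln in alignments.items():
        length = len(next(iter(aln.values())))
        for t in taxa:
            # Assumption: taxa missing from a locus are padded with gaps.
            parts[t].append(aln.get(t, "-" * length))
        charsets.append(f"  charset {locus} = {start}-{start + length - 1};")
        start += length
    nexus = "#nexus\nbegin sets;\n" + "\n".join(charsets) + "\nend;\n"
    return {t: "".join(p) for t, p in parts.items()}, nexus

# Hypothetical loci for illustration only.
concat, partition = concatenate({"gene1": {"a": "ACG", "b": "A-G"},
                                 "gene2": {"a": "TT"}})
print(concat)
print(partition)
```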

fas2geno_d.pl: This script converts an SNP dataset in FASTA format, including only SNPs with at most two alleles, to a .geno format usable for ADMIXTOOLS. A majority consensus sequence of the input alignment, as a file containing only the sequence, must be created beforehand. Requires, in that order, name of the input, consensus, and output files. Input alignment file names must end in .fas. Example: fas2geno_d.pl input.fas cons.txt output.geno

Supplementary Text

Pachypus_Supplementary_information.docx: This Word document contains additional information on various methods and results mentioned in the main text.

Supplementary Tables

Table S1. Collection data, species assignments, and NCBI accession numbers of examined Pachypus specimens.

Table S2. Calibration points for the MCMCTree analysis based on the dating results of Ahrens et al. (2014; calibration scheme 6) and McKenna et al. (2019; 4818-gene dataset), respectively.

Table S3. Results of the admixture analysis based on the f3 statistics.

Table S4. Results of MANOVA analysis.

Table S5. Results of IBD test for selected pairs of candidate species entities. A hypothesis is rejected at p < 0.05 (indicated in bold). P values are shown for the following null hypotheses: H01 states that the regressions of genetic on geographical distances within two primary candidate species agree. If this hypothesis cannot be rejected, the hypothesis that the regression pattern between primary candidate species is compatible with the regression based on the combined within-group data is tested (H02). If H01 is rejected, the hypothesis that the regression pattern between groups is at least compatible with the regression within one of the primary species hypotheses is tested (H03). A lack of IBD can be interpreted as support for the two tested population entities being separate species. For many pairs, no IBD test was possible because at least one population had too few specimens or localities (NA).

Table S6. Logarithms of marginal likelihoods and Bayes factors in comparison to the best model resulting from the model testing with BPP using the multi-species coalescent model with migration vs models without gene flow for the four main clades of Pachypus. The best result is highlighted in bold. The following models were tested: “full”: a model assuming gene flow between all pairs of non-sister clades for which this was indicated by the f-branch test, as well as all pairs of sister clades; “reduced”: a model assuming gene flow between the same non-sister clades, but not between sister clades; “no migration”: a model assuming no gene flow between clades at all.

Supplementary Figures

All supplementary figures are in a single PDF file (pachypus-Supplement_figures.pdf).

Supplementary Figures and Tables:

Figure S1. Collection localities of specimens analyzed in this study.

Figure S2a. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of 171 Pachypus individuals and outgroup taxa. Numbers above branches are local posterior probabilities.

Figure S2b. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of 171 Pachypus individuals and outgroup taxa. Numbers above branches are quartet scores.

Figure S3. Maximum-likelihood tree based on concatenated USCO data of all 171 Pachypus individuals and outgroup taxa. Numbers above branches are support values from approximate likelihood ratio tests and ultrafast bootstrapping.

Figure S4. Phylogenetic NeighborNet network of 171 Pachypus individuals based on SNP data.

Figure S5. Maximum-likelihood tree based on concatenated alignment of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with hmmalign and including all nucleotide positions. Numbers above branches are ultrafast bootstrap values.

Figure S6. Maximum-likelihood tree based on concatenated alignment of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with hmmalign and including only first and second nucleotide positions. Numbers above branches are ultrafast bootstrap values.

Figure S7. Maximum-likelihood tree based on concatenated alignment of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with MAFFT and including all nucleotide positions. Numbers above branches are ultrafast bootstrap values.

Figure S8. Maximum-likelihood tree based on concatenated alignment of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with MAFFT and including only first and second nucleotide positions. Numbers above branches are ultrafast bootstrap values.

Figure S9. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with hmmalign and including all nucleotide positions. Numbers above branches are local posterior probabilities.

Figure S10. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with hmmalign and including only first and second nucleotide positions. Numbers above branches are local posterior probabilities.

Figure S11. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with MAFFT and including all nucleotide positions. Numbers above branches are local posterior probabilities.

Figure S12. Coalescent-based tree calculated with ASTRAL based on individual USCO gene trees of selected Pachypus individuals and a larger dataset of other scarabaeoid beetles, aligned with MAFFT and including only first and second nucleotide positions. Numbers above branches are local posterior probabilities.

Figure S13. Bayesian species tree estimated with SNAPP based on SNPs. Numbers above branches show posterior probabilities for internal nodes.

Figure S14. Tree calibrated with MCMCTree based on fixed ML tree topology using the calibration scheme from Ahrens et al. (2014).

Figure S15. Tree calibrated with MCMCTree based on fixed ML tree topology using the calibration scheme from McKenna et al. (2019).

Figure S16. Results of BPP species delimitation for subclades with the full dataset (left side) and using a dataset where all positions containing missing data and/or gaps were removed (right side) with different theta and tau values. Boxes at tree nodes refer to different combinations of theta and tau, and colors refer to posterior probabilities of species splits at the node (see inset).

Figure S17. Results of various species delimitation analyses mapped on ASTRAL tree. Colored bars (bars 3-10) show entities inferred to be distinct species separated by horizontal lines. Curved lines connect specimens assigned to the same species-level entity. From left to right these are: 1) species based on Eberle et al. (2018); 2) groupings (i.e., subclades) for BPP analysis; results of 3) BPP with full dataset; 4) BPP with dataset where all positions containing missing data and/or gaps were removed; 5) BPP with full dataset, where lineages not supported as full species by gdi were lumped (grey lines are ambiguous cases; numbers are gdi values separating a given lineage from its sisters); 6) bPTP result with maximum likelihood; 7) bPTP result with highest Bayesian support; 8) GMYC; 9) tr2; 10) SODA.

Figure S18. Values of the genealogical divergence index (gdi) mapped onto the respective nodes of the ASTRAL tree of all Pachypus specimens (for BPP priors: beta = 0.04 for theta, beta = 0.02 for tau). Critical gdi values above merging threshold and below splitting threshold are indicated by green and red squares, respectively. Nodes indicated by an empty circle were not subject to gdi analysis due to dataset subdivision (see text). Inset shows the frequency distribution of gdi values.

Figure S19. Partitioned NMDS analyses using SNP data of the major Pachypus clades, A - A1, B - A2, C - B1, D - B2.

Figure S20. Results of ADMIXTURE analysis based on SNP data including all individuals for a number of ancestral populations (K) of 12.

Figure S21. Partitioned ADMIXTURE analyses based on SNP data for the major Pachypus clades, A - A1, B - A2, C - B1, D - B2.

Figure S22. Comparison of cross-validation with ADMIXTURE in Pachypus and its major subclades, compared to other case studies using mzl-USCOs, with the x-axis representing K (Kmax = 20 for all cases except the complete dataset of Pachypus, in which we chose Kmax = 50), and the y-axis representing the cross-validation error.

Figure S23. Plots of IBD testing for cases in which all three null hypotheses could be tested (see also Table S5). IBD was rejected in all cases shown here except one (sp4-1 vs. sp4-2+sp4-3). Dotted lines are regression lines for within-group distances (black: first tested group, red: second tested group, green: both groups together). Solid lines are regression lines fitted to both within- and between-group distances (black: within-group distances for first tested group only, red: for second tested group only, green: for all distances). The blue line shows the center of the between-group geographic distances.