Data from: A genomic perspective on cryptic species reveals complex evolutionary dynamics in the gray zone of the speciation continuum
Data files
Jan 28, 2026 version files, 152.30 MB
data.zip: 152.28 MB
README.md: 2.94 KB
scripts_others.zip: 10.03 KB
scripts_R.zip: 11.47 KB
Abstract
The evolutionary dynamics of cryptic species remain poorly understood, and their detection relies primarily on methods that quantify divergence, assuming that gene flow is absent. Here, we examine how gene flow shapes the evolutionary trajectories and species boundaries in Bornean Fanged Frogs, a renowned example of cryptic diversity where a single species has been split into 18 genetically divergent yet morphologically indistinguishable species. We employed target-capture data from over 13,000 loci to assess lineage independence of 14 nominal species distributed across Malaysian Borneo by evaluating both divergence and cohesion using network multispecies coalescent (NMSC) and MSC + migration approaches. Under the Unified Species Concept, only six of the 14 nominal species unambiguously form independently evolving lineages; the remainder represent cohesive metapopulation lineages nested within those six species. While mitochondrial p-distances varied substantially (up to 10%), genome-wide net divergences (Da) were more consistent, ranging from 0.5–2%, placing all the hypothesized “cryptic species” within the empirical gray zone of the speciation continuum. We show that diversification in the gray zone is unpredictable and heavily impacted by gene flow, leading to two key phenomena that confound species delimitation: (1) the artifactual branch effect, where admixed lineages are inferred as long, early-diverging branches, creating an illusion of deep divergence; and (2) the species-definition anomaly zone, where intraspecific pairwise sequence distances exceed interspecific ones. We further demonstrate that divergence in the gray zone varies among metrics and genomic regions, reflecting heterogeneity in evolutionary dynamics across the genome. Different genomic markers also vary considerably in phylogenetic discordance and their ability to retain signatures of gene flow.
Loci from anchored hybrid enrichment (AHE) and ultraconserved elements (UCE) produced less phylogenetic discordance and retained signals of older introgression but failed to detect recent migration, making them suitable for phylogenetic reconstruction and inferring ancient introgression, but not ongoing gene flow. Recognizing the central role of gene flow reframes our understanding of cryptic species; rather than being considered as genetically distinct units that failed to evolve morphological differentiation, they are manifestations of continuous diversification in the gray zone. This shift in perspective offers a new and dynamic evolutionary framework for identifying and interpreting cryptic biodiversity across the Tree of Life.
Dataset DOI: 10.5061/dryad.c866t1gk6
Description of the data and file structure
All data relevant to the reproducibility of this study are made available, including sequence alignments (16S and genomic), pairwise p-distance calculations, raw results from the pixy analysis (Dxy, Fst, and pi), SNP dataset in STRUCTURE format, and phylogenetic trees (species, gene, and consensus trees).
Files and variables
File: scripts_others.zip
Description: Various helper bash and Python scripts:
filter_alignments.sh: subsets alignment files based on a predefined list of loci. It reads locus IDs from a text file and copies the corresponding .phy alignment files from a source directory into a new directory for downstream phylogenetic analyses.
filter_genetrees.sh: subsets a collection of gene tree files based on a predefined list of IDs.
phylip_to_bpp.py: converts a PHYLIP alignment to BPP format.
random_columns_from_third.sh: randomly selects a specified number of columns from an input table, starting from the 3rd column onward, and prints them together with the first two columns.
random_select_100_loci.sh: randomly subsets 100 loci.
rename_population.sh: remaps population identifiers in FASTA sequence headers using a tab-delimited mapping file
resample-genetrees.sh: randomly subsets a user-specified number of gene trees.
split_vcf_by_marker.sh: partitions a VCF file into region-specific subsets (UCE, AHE, BUSCO, etc.) based on genomic coordinates defined in a BED file.
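The subsetting step described for filter_alignments.sh can be illustrated with a minimal Python sketch. This is not the script shipped in scripts_others.zip; the function name filter_alignments, its arguments, and the assumption that each alignment is named <locus ID>.phy are hypothetical and chosen only for illustration.

```python
import shutil
from pathlib import Path


def filter_alignments(locus_list, source_dir, dest_dir, suffix=".phy"):
    """Copy the alignment file of every listed locus into dest_dir.

    Assumes (hypothetically) one locus ID per line in locus_list and
    alignment files named <locus ID><suffix> inside source_dir.
    Returns two lists: locus IDs that were copied and IDs with no file.
    """
    source_dir = Path(source_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied, missing = [], []
    for line in Path(locus_list).read_text().splitlines():
        locus = line.strip()
        if not locus:
            continue  # skip blank lines in the ID list
        src = source_dir / f"{locus}{suffix}"
        if src.is_file():
            shutil.copy2(src, dest_dir / src.name)
            copied.append(locus)
        else:
            missing.append(locus)
    return copied, missing
```

Returning the list of missing IDs makes it easy to check that every requested locus was found before starting downstream phylogenetic analyses.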
File: scripts_R.zip
Description: R scripts for specific analyses
snmf_tsne.R: R script for sNMF and t-SNE analyses
divergence_landscape.R: R script for the divergence landscape analyses
nanuq.R: R script for the NANUQ analyses
p_distances.R: R script for the p-distance comparison analyses
File: data.zip
Description: sequence alignments, p-distance calculations, pixy results, SNP dataset, and phylogenetic trees
alignments: Sequence alignments for 16S and all genomic loci
p_distances: Pairwise p-distance calculations that are used for the p-distance analyses
pixy: Fst, Dxy, and pi calculations from the pixy analysis. NA indicates no segregating sites between populations in that window, meaning the populations are identical at that locus.
snmf_tsne: SNP dataset (STRUCTURE format) for the sNMF and t-SNE analyses
Trees: ASTRAL species trees, individual gene trees by marker type, and IQ-TREE consensus trees
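For readers unfamiliar with the metric stored in p_distances, an uncorrected p-distance is the proportion of compared sites at which two aligned sequences differ. The sketch below is a generic illustration, not the authors' calculation; in particular, skipping sites where either sequence has a gap or an ambiguity code is an assumption, and the function name p_distance is hypothetical.

```python
def p_distance(seq_a, seq_b, valid="ACGT"):
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence carries a character outside `valid`
    (gaps, N, ambiguity codes) are skipped; this handling is an
    assumption and may differ from the dataset's own calculations.
    Returns NaN when no site can be compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1

    if compared == 0:
        return float("nan")
    return differences / compared
```

For example, two sequences of four unambiguous sites that differ at one site give a p-distance of 0.25.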
Code/software
Tree files can be viewed using FigTree. All other files can be viewed using a text editor.
Access information
Other publicly accessible locations of the data:
