Allopatric speciation and interspecific gene flow driven by niche conservatism of Diploderma tree lizards in Taiwan
Data files
Mar 11, 2025 version files 104.95 MB
-
01_ND2_1136bp.fas
284.61 KB
-
02_BDNF_480bp.fas
125.36 KB
-
03_CMOS_327bp.fas
83.98 KB
-
04_R35_561bp.fas
144.87 KB
-
05_Diploderma.p.snps.vcf
94.02 MB
-
06_Diploder_ddRAD_seq.fasta
9.77 MB
-
07_Diploderma_SNAP_tree.newick
5.17 KB
-
08_BPP.zip
390.05 KB
-
09_Diploderma_Occurrence.csv
124.02 KB
-
README.md
1.35 KB
Abstract
Allopatric speciation is a widely accepted hypothesis for species distributed across geographic barriers. Meanwhile, niche conservatism, the tendency of species to retain their ancestral ecological traits, helps reinforce genetic differentiation by stabilizing species distributions over time and reducing the role of competition in shaping range boundaries. In contrast, hybridization can occur at the edges of distribution after secondary contact following climatic or geological events, leading to a reduction in genetic divergence between divergent lineages. In this study, we investigated the role of geographic barriers, niche conservatism, and gene flow in the speciation history of Diploderma species in Taiwan, where geographically distinct taxa share similar environmental preferences. By using ddRAD-seq data, seven distinct genetic clusters were identified with two putatively new cryptic species in D. brevipes and D. polygonatum. Most sister species pairs share similar climatic niches based on niche equivalency and similarity tests. We further detected significant historical gene flow between lineages of D. brevipes and D. polygonatum, where secondary contact might have occurred because of paleo-climate changes and historical demographic expansion. Our results demonstrate that niche conservatism does not always act in concert to strengthen the result of allopatric speciation; instead, it may also lead to gene flow between divergent lineages following secondary contact. On the other hand, post-divergence gene flow may be a creating force generating phenotypic diversity in sexually selected traits in our study system. The underestimated species diversity of Diploderma in Taiwan requires further taxonomic works in the future.
This reposity contains information for data used in the article:
Authors: Tzong-Han Lin, Zong-Yu Shen, Ming-Hsun Chou, Pei-Wei Sun, Chin-Chia Shan, Jen-Pan Huang, Si-Min Lin
Title: Allopatric speciation and interspecific gene flow driven by niche conservatism of Diploderma tree lizards in Taiwan
01 ND2_1136b: Sequence file for mitochondrial ND2 sequences
02 BDNF_480bp: Sequence file for nuclear BDNF sequences
03 CMOS_327bp: Sequence file for nuclear CMOS sequences
04 R35_561bp: Sequence file for nuclear R35 sequences
05 Diploderma.p.snps: SNPs file for ddRAD sequences from nuclear genome
06 Diploder_ddRAD_seq.fasta: fasta file for ddRAD sequences from nuclear genome
07 Diploderma_SNAP_tree.newick: Tree file using ddRAD-seq
08 BPP.zip: Input files for BPP analysis, including population file 'bpp.Imap.txt', sequence file 'BPP_new.phy', and control file 'msci_single.ctl'
09 Diploderma_Occurrence: Occurrence data for all Diploderma species in Taiwan, the abbreviation for species were listed as followed: "DB_S": central to southern D. brevipes, "DB_N": northern D. brevipes, "DM": D. makii, "DL": D. luei, "DP_W": western D. polygonatum, "DP_E": eastern D. polygonatum, and "DS": D. swinhonis.
Sampling and molecular data preparation
The collection surveys were conducted across distribution range of each species (n = 227) and combined with samples from an earlier study (Yang et al., 2018, n = 19), with a total of 246 Diploderma samples implemented in the phylogenetic analysis (Fig. 1a). Genomic DNA was extracted using muscle tissues following the protocol of EasyPure Genomic DNA mini Kit (Bioman, New Taipei). To preliminarily investigate the phylogeny by multi-locus data, four pairs of primers were selected for following polymerase chain amplification process (PCR), including one mitochondrial (NADH dehydrogenase subunit 2, ND2) and three nuclear genes (brain-derived neurotrophic factor gene, BDNF; oocyte maturation factor Mos gene, CMOS; RNA fingerprint protein 35 gene, R35) (Table 1). The PCR cycles consisted of an initial cycle at 94 °C for 3 minutes, followed by 35 cycles at 94 °C for 30 seconds, suitable annealing temperature for 45 seconds, and 72 °C for 120 seconds, and ended with a final cycle at 72 °C for 10 minutes. The PCR products were checked using 1.2% agarose gel in 0.5× TBE buffer to visualize the results, and sequencing was performed using an ABI 3730XL autosequencer (Genomics BioSci & Tech Corp., Taipei, Taiwan). The raw sequencing results were first edited and compiled in Sequencher 5.4.6 (Gene Codes Corp., Ann Arbor, MI), while the alignments were carried out using MUSCLE algorithm (Edgar, 2004) in MEGAX (Kumar et al., 2018).
To access more in-depth genetic data through next-generation sequencing techniques, a double digest restriction-site associated sequencing (ddRAD-seq) was implemented. The preparation of ddRAD libraries was conducted following the steps and recommendations of the published protocol (Peterson et al., 2012). The DNA samples were initially digested by two restriction enzymes, SbfI-HF and MspI. The amplified products were size-selected from 350 to 450 base pairs and attached with the library barcodes and individual index for the follow-up demultiplexing and analysis. A total of 10 libraries were generated, each including 18 to 23 indexed individuals. The genomic DNA libraries were then pooled as the final product and sent to Tri-I Biotech, New Taipei City for sequencing on NovaSeq 6000 system (Illumina, Inc., San Diego, USA). The raw reads from the ddRADseq were first demultiplexed by Stacks 2 (Rochette et al., 2019), and the reads with an average Phred score smaller than 10 were removed. Due to the lack of reference genomes for Diploderma, the de novo approach was applied to assemble the single nucleotide polymorphism (SNP) data. Various sets of parameters were examined to determine the most optimal combination following the recommendation from the previous study (Paris et al., 2017). Although the ‘r80 rule’ was generally applied in the parameter evaluation process, the missing rate parameter (r) was set to 0.5 considering the multiple species included in the dataset. This threshold was chosen to retain the most polymorphic sites across species while excluding potential loci caused by sequencing errors. Individuals with high missing rates or insufficient mean depths were excluded after the assembly using the most optimal combination (Cerca et al., 2021). The raw SNPs were filtered by VCFTools (Danecek et al., 2011) based on the following standards: including only bi-allelic sites, the minimum average mean depth larger than 5, allowing only 50% of missing data for each locus, and minor allele frequency (MAF) larger than 0.05.
Phylogenetic analyses
Phylogeny of Diploderma in Taiwan was investigated by Sanger’s sequence and SNP data. The multi-locus sequencing data (BDNF, CMOS, ND2, and R35) were partitioned by codon positions using ‘Split Codons’ (Stothard, 2000) and concatenated for the following analysis, where outgroup sequences of ten other agamid lizards (Table 2) were retrieved from GenBank. (Wang et al., 2019; Wang et al., 2021c). For the ddRAD-seq dataset, D. swinhonis was included with other mid-elevation species but designated as the outgroup based on a previous study. (Wang et al., 2019). The phylogenetic trees for both datasets were constructed by IQTREE 2 (Minh et al., 2020), in which the best-fit substitution model was chosen using ModelFinder Plus option (Kalyaanamoorthy et al., 2017). The ultrafast bootstrap method (Hoang et al., 2018) was utilized with 10,000 replicates to estimate the branch support, and the phylogenetic trees were visualized in Figtree 1.4.4 (Rambaut, 2018).
Population structure analyses
To examine the differentiation of mtDNA across geographic regions, the fast hierarchical Bayesian analysis was implemented to inspect population clustering by ND2 sequences by R package ‘fastbaps’ (Tonkin-Hill et al., 2019).
Principal component analysis (PCA) in PLINK 1.9 (Purcell et al., 2007) was applied for population stratification using ddRAD-seq dataset and visualized by ‘ggplot2’ (Wickham, 2011). To fulfill the assumption of data independency for PCA, the SNPs were filtered with a window size of 1,000 variant counts to detect and retain sites without strong linkage disequilibrium (LD) based on the variance inflation factor (VIF) with a threshold set at 1.1.
Population structure was further analyzed by sparse nonnegative matrix factorization (sNMF) and Structure (Pritchard et al., 2000; Popescu et al., 2014). The sNMF analysis was conducted using R package ‘LEA’ (Frichot & François, 2015) to explore the potential cluster numbers from 1 to 15, with 10 repetitions for each K for cross-entropy criterion. For Structure analysis, the parallel computation pipelines from StrAuto were utilized to improve the efficiency and output compilation (Chhatre & Emerson, 2017). The potential numbers of populations (K) were set to be examined from 1 to 15, and 10 iterations were repeated for each K. The Markov Chain Monte Carlo (MCMC) chains for each iteration were set with lengths of 1,000,000 and 25% as burn-in, while the best K was identified through the combined evaluation of the likelihood approach (Ln Pr(X|K)) and the Evanno method in Structure Harvester (Evanno et al., 2005; Earl & VonHoldt, 2012). Moreover, the optimal run of the best K was accessed by CLUMPP and visualized by CLUMPAK (Jakobsson & Rosenberg, 2007; Kopelman et al., 2015).
To gain insights into the shared genetic ancestry among focal species, the pipeline from the package ‘fineRADstructure’ was applied to quantify ancestry sources and investigate relationships from SNP data in ‘RADpainter’ format (Lawson et al., 2012; Malinsky et al., 2018). To avoid the overestimation caused by LD in the de novo loci, the input dataset was first reordered to ensure a conservative upper bound on the numbers of statistically identifiable clusters. The co-ancestry matrix was calculated for 100,000 MCMC iterations with a 100,000 burin-in period and sampled every 1,000 generations.
Species delimitation
Species delimitation was conducted using Bayesian Phylogenetics and Phylogeography (BPP) for analyzing multi-locus alignments under the multispecies coalescent (MSC) model (Yang, 2002; Rannala & Yang, 2003). After processing the SNP data (See Supplementary information X), the A10 analysis in BPP was first implemented with 1,000,000 MCMC steps and 200,000 burn-in to reach the convergence. The ddRAD-seq dataset was first filtered to exclusively contain loci with at least four variable sites which provide greater power in species delimitation, and exported by Vcf2phylip 2.0 (Huang, 2018; Ortiz, 2018). The guide tree for the simulation was obtained by SVDquartets (Chifman & Kubatko, 2014; Chifman & Kubatko, 2015) with 100 replicates of nonparametric bootstrap procedure. The fundamental parameters for BPP, the population size (theta, θ) and divergence times (tau, 𝜏), were set to be drawn from the inverse-gamma distribution of θ~ (3, 0.002) and 𝜏 ~ (3, 0.02). For A10 analysis (species delimitation without estimation of species tree) in BPP, the parameter ε for the rjMCMC algorithm was set at 5 to ensure well-mixing for Bayesian calculation, and the estimation was repeated twice to check if any incongruence appears (Yang & Rannala, 2010). A heuristic species delimitation approach was then implemented on posterior density of the genealogical diversity index (gdi) (Leaché et al., 2019). The fundamental rules were followed as described below: gdi smaller than 0.2 represents intraspecific differentiation, larger than 0.7 indicates well-differentiated species, and between 0.2 to 0.7 suggests weakly divergent lineages (Jackson et al., 2017).
Gene flow detection
Hybridization events were evaluated by Treemix (Pickrell & Pritchard, 2012), PhyloNetworks (Solís-Lemus & Ané 2016; Solís-Lemus et al., 2017), and Dsuite (Durand et al., 2011; Malinsky et al., 2018; Malinsky et al., 2021) to access the potential gene flow among divergent lineages based on population structure and species delimitation analyses. Treemix (Pickrell & Pritchard, 2012) was initially applied to detect the presence of admixture using the allele frequency. Considering the requirement of no missing data, the genotype imputation was performed by Beagle 5.4 (Browning & Browning, 2007; Browning et al., 2018). In order to reduce the risk of erroneous estimation, high missing rates sites (> 50%) and the rare alleles (MAF<0.05) were screened out before the imputation process. The appropriate number of migration events was examined from 1 to 15 and determined by the second-order rate of changes in likelihood via R package ‘OptM’ (Fitak, 2021). Various block sizes (500, 750, and 1000) were applied and repeated 10 times for each value to increase the variance. The node support for the maximum likelihood tree with inferred introgression events was built with 100 bootstraps, and the consensus topology was obtained through the ‘consense’ program using the extended majority rule in PHYLIP (Felsenstein, 2004).
For the network analysis, the concordance factors (CF) were calculated using ‘SNPs2CF’ with 100 sampled alleles per species quartet (Olave & Meyer, 2020), where the species tree constructed by SVDquartets was taken as the starting tree. The reticulated phylogeny was then estimated by SNaQ in PhyloNetworks with 50 independent runs. Number of hybridization events was surveyed from 0 and increased by 1 until reaching saturation in model likelihood.
The evidence of gene flow events was later confirmed by ABBA-BABA statistic. Patterson’s D and the f4-ratio statistic were calculated and interpreted using Dtrios and Fbranch program in Dsuite for all possible trios. Considering the influence of substitution variation on ABBA-BABA test, the recently developed method (--ABBAclustering) was incorporated in the estimation to determine the informative sites with introgression signals (Koppetsch et al., 2024).
Niche conservatism tests
To examine potential effect of niche conservatism on divergence and secondary contact, the bioclimatic niches of focal taxa was scrutinized via the following procedures. The occurrence data were obtained from the citizen science platforms, including iNaturalist, GBIF, Taiwan Biodiversity Network (TBN), and collection coordinates from the surveys in this study. To avoid autocorrelation that might mask modeling results, the dataset was thinned to remove duplicated or neighboring points with a one-kilometer filter in ‘spThin’ (Aiello‐Lammens et al., 2015). The bioclimatic variables and solar radiation data (Table S1) were downloaded at 30 arc-second resolutions from WorldClim (Karger et al., 2017). Two monthly vegetation indexes were also incorporated into the dataset, sourcing from the Moderate Resolution Imaging Spectroradiometer (MODIS) data over the period of 2000 to 2010 (Didan, 2015; Busetto & Ranghetti, 2016). Highly correlated factors were then eliminated based on Pearson’s correlation coefficient (r) and variance inflation factor (VIF) (Naimi et al., 2014). Principal component analysis (PCA) was conducted using obtained environmental variables, followed by between-class analysis (BCA), and then transformed into density-form data to visualize the overlap pattern (Bates et al., 2020). Niche overlaps between divergent clusters was first quantified by Schoener’s D and Warren’s I with ‘ecospat’ in R (Schoener, 1968; Warren et al., 2008; Di Cola et al., 2017). Followed by statistical examination using niche equivalency and similarity test with 1,000 permutations, the former compares the niches according to the exact location and the latter compensates the results by accounting the differences of surrounding environments (Warren et al., 2010; Aguirre‐Gutiérrez et al., 2015). For the niche equivalency test, the null hypothesis was that the two compared niches are identical. The niche similarity test was implemented to evaluate whether the two niches are more (or less) similar to each other than a randomly selected pairing, where the null hypothesis is that two niches are not more similar than random niches.
Gene flow and demographic history
To infer the timeline of gene flow during speciation, the multispecies coalescent with introgression (MSCi) in BPP was utilized to simulate potential scenarios (Degnan, 2018; Flouri et al., 2020). The bi-directional introgression model (BDI) was utilized, and introgression probability (φ) was sampled from a beta distribution of φ ~ (1,1). To ensure convergence for the Bayesian analysis, two runs were performed for 1,000,000 generations with 200,000 as burn-in and sampled every two iterations. The simulation was then adjusted based on the divergence time of D. polygonatum and D. swinhonis (10.60 MYA) according to previous estimation with fossil calibration (Angelis & Dos Reis, 2015; Yang et al., 2018).
To evaluate recent or historical contact among clusters with potential gene flow, the species distribution modeling (SDM) was performed to identify overlapping regions across various time periods. The available paleo-climate variables were downloaded at 2.5 arc-minute resolution via “rpaleoclim” for the following projection (Brown et al., 2018). The MaxEnt 3.4.1 (Phillips & Dudík, 2006) was applied due to its robustness across different types of occurrence data and capability to produce reliable results (Phillips et al., 2006; Wisz et al., 2008; Ahmadi et al., 2023). To avoid potential risk associated with directly using default configurations for MaxEnt, the parameter selection and tuning analysis were conducted using R package ‘ENMeval’ (Anderson & Gonzalez, 2011; Morales et al., 2017; Kass et al., 2021). The optimal combination of parameters was then applied to 10 cross-validation of MaxEnt calculations, and the projected layers were intersected for the estimation of overlapping area sizes.
To investigate the influence of historical demographic changes on gene flow events, the effective population size of each lineage involved in historical introgression was estimated by Stairway Plot 2 (Liu & Fu, 2020). The folded SNP frequency spectra (SFS) was initially transformed through easySFS (https://github.com/isaacovercast/easySFS; Gutenkunst et al., 2009) and the maximum number of segregating sites per population was selected from projection values. Considering that the inclusion of singleton alleles could lead to potential upward bias of the estimation, the alleles with single copy (singletons) were masked from the following analysis (Waples & Do, 2010). A generation time of three years and an average mutation rate per site per generation of 1.74 × 10-9 (5.80 × 10-4 per million years for Pogona vitticeps, Perry et al., 2018) were utilized. The estimation for each lineage was conducted with 0.67 training percentage and 200 input files for bootstrap-like replicates.
