Skip to main content

The emergence of a cryptic lineage and cytonuclear discordance through past hybridization in the Japanese fire-bellied newt, Cynops pyrrhogaster (Amphibia: Urodela)

Cite this dataset

Tominaga, Atsushi (2022). The emergence of a cryptic lineage and cytonuclear discordance through past hybridization in the Japanese fire-bellied newt, Cynops pyrrhogaster (Amphibia: Urodela) [Dataset]. Dryad.


 Discrepancies in geographic variation patterns between nuclear DNA and mitochondrial DNA (mtDNA) are the result of the complicated differentiation processes in organisms and the key to understanding their true evolutionary process. The genetic differentiation of the northern and southern Izu lineages of the Japanese newt Cynops pyrrhogaster was investigated through their single nucleotide polymorphism variations by multiplexed ISSR genotyping by sequencing (MIG-seq). We found three genetic groups (Tohoku, N-Kanto, and S-Kanto) those not detected by mtDNA in the northern lineage. N-Kanto has intermediate genetic characteristics between Tohoku and S-Kanto. The western populations of N-Kanto are close to S-Kanto, whereas the eastern populations of N-Kanto are close to Tohoku. Tohoku, N-Kanto, and S-Kanto are now moderately isolated from each other and have unique genetic characteristics. An estimation of the evolutionary history by the Approximate Bayesian Computation approach suggested that Tohoku diverged from the common ancestor of S-Kanto and S-Izu. Then, S-Kanto and S-Izu split and the recent hybridization between Tohoku and S-Kanto gave rise to N-Kanto. The origin of N-Kanto through the hybridization is relatively young and seems to be related to changes in the distributions of Tohoku and S-Kanto as a result of the climatic oscillation in the Pleistocene. We concluded that the mitochondrial genome of S-Kanto was captured into Tohoku and the past original mitochondrial genome of Tohoku was entirely swept out from Tohoku through the hybridization. 


To obtain de novo genome-wide SNP data for C. pyrrhogaster, multiplexed inter-simple sequence repeat (ISSR) genotyping by sequencing (MIG-seq) analysis (Suyama & Matsuki, 2015) was conducted. MIG-seq method is one of the PCR-based reduced-representation sequencings that has been conducted in various species (e.g., Matsui et al., 2019; Sato et al., 2021).

All experimental procedures for MIG-seq analysis are the same as those described by Suyama & Matsuki (2015). The procedure is briefly described here. The first PCR was conducted to amplify ISSR regions from genomic DNA with MIG-seq primer set-1 (Suyama & Matsuki, 2015) with the Multiplex PCR Assay Kit Ver. 2 (TaKaRa). The first PCR conditions were as follows: initial heating at 94°C (1 min), 27 cycles of 94°C (30 sec), 48°C (1 min), and 72°C (1 min), and finally 10 min extension at 72°C.

The second PCR was done using the PrimeSTAR GXL DNA Polymerase (TaKaRa) and its conditions were as follows: initial heating at 94°C (1 min), 15 cycles of 98°C (10 sec), 54°C (15 sec), and 68°C (30 sec). Then, 3 μl of each second PCR product was collected in one tube to form a mixed library. The purification and size selection (300–800 bp) of the mixed library were conducted using BluePippin DNA size selection system (Sage Science, Beverly, MA, USA). DNA concentration of the size-selected library was measured using the Agilent 4200 TapeStation (Agilent Technologies, CA, USA). The sequencing was conducted on the Illumina MiSeq System (Illumina, San Diego, CA, USA) using a mixed library at a final concentration of 10 pM using a MiSeq Reagent Kit v3 (150 cycle, Illumina). The sequencing of the first 17 bases of R1 read and 3 bases (anchor region) of R2 read were skipped by the 'DarkCycle' option of the MiSeq system. Both ends of the fragment (R1 and R2) were read by paired-end sequencing to determine 80 and 94 bp, respectively.

Obtained raw sequence data were preprocessed and quality-filtered, mainly following the procedures of Sato et al., (2021). The sequences derived from adapter primers that were occasionally left in or repeated at the opposite 3'  -ends of the read were cut out twice by Cutadapt 1.14 (Martin, 2011) with   ≥ 8 bp accordance and  ≤ 10% base mismatch. Then, we trimmed the low quality (Phred score < 10) 3-tails of each read using Cutadapt and excluded the shorter sequences (in length R1 reads < 70 bp and R2 reads <80 bp) by a custom Perl script.

 All quality-filtered R2 reads, which were longer on average and confirmed to have sufficient quality relative to R1 reads, were analyzed for de novo SNP calling by the Stacks 1.45 (Catchen et al., 2013). First, all R2 reads from each sample were clustered into putative loci within the sample based on sequence resemblance using the Ustacks program. From a preliminary analysis of 20 randomly selected samples from nearly all sampling locations, we set the thresholds for minimum depth of coverage and maximum number of mismatched bases between alleles at each locus to 4 and 4, respectively. Shared loci among samples were detected using the Cstacks program, with a threshold value of 2 for the maximum number of mismatch bases among alleles from other samples, as determined by a preliminary analysis of the same samples as above. Finally, SNP calling was made using the Stacks and the Populations programs to create a SNP matrix table with less than 50% missing values for each locus in all samples. To avoid biased output of missing values due to subpopulation structure, all samples were tentatively treated as a single population (predefinition of subpopulations was not adopted).

We further selected SNPs that were found in more than 70% of the samples (more than 126 samples) and less than 0.5 heterozygosity to avoid using the loci that have excess heterozygosity and then selected 177 ingroup samples of 28 populations from 181 ingroup samples excluding the outgroup samples of the central and western lineages with missing values of less than 40% (dataset 1) using TASSEL 5.0 software (Bradbury et al. 2007).

The ancestry inference and genetic structure for the entire sample based on the SNP loci obtained by MIG-seq were then assessed using Admixture 1.23 (Alexander and Lange, 2011). The most probable number of genetic clusters (K) was presumed in Admixture software, by computing maximum-likelihood estimates of the parameters. Ten independent simulations with cross-validation were run for each value of K from 1 to 10 to investigate the convergence of samples. The minimization of cross-validation error among all runs was used to estimate the most likely value of K. To compare with the results from Admixture analysis, principal component analysis (PCA) was conducted using the same dataset by TASSEL 5.0.

To validate the effect of isolation by distance, we calculated FST /(1-FST) among 28 ingroup populations as genetic distances using the Populations program of Stacks. The Mantel test based on Pearson’s product moment correlation ® was applied to assay the significance of the correlation between FST /(1-FST) and geographic distances using the vegan R package (Dixon, 2003)

We also constructed a Neighbor-Net network based on concatenated SNPs sequences using estimated uncorrected p-distance including the outgroup samples by Splits Tree 4 software (Huson and Bryant, 2006). For this analysis, we selected SNPs that were found in more than 50% of the samples (more than 95 of 189 samples) and less than 0.5 heterozygosity to avoid using the loci that have excess heterozygosity and then selected 186 samples, including those of the central and western lineages, with missing values of less than 40% in each sample using TASSEL 5.0 (dataset 2).

To infer the genetic relationships among recognized genetic groups in Admixture analysis and PCA, we conducted Approximate Bayesian Computation (ABC) to test for different diversification process scenarios (Beaumont et al. 2002; Beaumont 2010). The analysis was conducted with DIYABC 2.1.0 (Cornuet et al. 2014), following the detailed manual of the software. For this step, we divided the ingroup samples (dataset 1) into four populations recognized in Admixture analysis and PCA and excluded the SNPs that were missing for all samples in each population according to DIYABC 2.1.0 software settings. Each scenario involved four genetic groups of samples, corresponding to Group 1 (Tohoku: pops. 1–7), Group 2 (N-Kanto: pops. 8–21), Group 3 (S-Kanto: pops. 22–27), and Group 4 (S-Izu: pop. 28). The following eight scenarios were tested for the four populations (Fig. 2). 


Japan Society for the Promotion of Science, Award: 17K07540

Japan Society for the Promotion of Science, Award: 21K06321