Data from: Homoploid hybrid speciation in a marine pelagic fish
Data files
Sep 17, 2025 version files 583.94 MB
data_from_Muto_et_al.zip: 583.93 MB
README.md: 11.23 KB
Abstract
Homoploid hybrid speciation (HHS) is an enigmatic evolutionary process where new species arise through hybridization of divergent lineages without changes in chromosome number. Although increasingly documented in various taxa and ecosystems, convincing cases of HHS in marine fishes have been lacking. This study presents a possible case of HHS in a pelagic marine fish based on comprehensive genomic, morphological, and ecological analyses. Population genomics, species tree estimation, and tests of introgression and admixture identified three sympatric clusters in Megalaspis cordyla in the western Pacific and the admixed nature of one cluster between the others. Moreover, model-based demographic inference favored a hybrid speciation scenario over introgression for the origin of the admixed cluster. While contemporary gene flow suggested partial reproductive isolation, examination of occurrence data and ecologically relevant morphological characters suggested ecological differences between the clusters, potentially contributing to the reproductive isolation and niche partitioning in sympatry. The clusters are also morphologically distinguishable and thus can be taxonomically recognized as separate species. The hybrid cluster is restricted to the coasts of Taiwan and Japan, where all three clusters coexist. The parental clusters are additionally found in lower latitudes, where they display non-overlapping distributions. Given the geographical distributions, estimated times of the species formation, and patterns of historical demographic changes, we propose that the Pleistocene glacial cycles were the primary driver of HHS in this system. We also develop an ecogeographic model of HHS in marine coastal ecosystems, including a novel hypothesis to explain the initial stages of HHS.
https://doi.org/10.5061/dryad.kwh70rzdn
Description of the data and file structure
This dataset consists of genetic, morphological, and ecological data and the associated code needed to reproduce the results (Data_from_Muto_etal.zip), together with supplementary materials (Supplement_Figs.pdf; Supplement_Tables.xlsx; Supplement_text.docx). Data and code for each analysis are provided in separate directories, which are compressed into a single zip file. The data were obtained as outlined in Methods. The required software and packages are listed in the scripts.
Contents of each directory
PCA
This directory contains the files required to reproduce the result of the Principal Components Analysis based on the genome-wide SNPs data.
data1.vcf.gz: a gzipped vcf file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).
pca.R: An R script for performing PCA using the adegenet package.
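Before running pca.R, it can be useful to confirm that the VCF has the dimensions stated above. The following is a minimal sketch using only the Python standard library, not part of the deposited scripts; the function names vcf_dimensions and vcf_dimensions_from_file are hypothetical, and the sketch assumes a standard VCF layout with nine fixed columns before the sample columns.

```python
import gzip


def vcf_dimensions(lines):
    """Return (n_variants, n_samples) from an iterable of VCF text lines."""
    n_variants = 0
    n_samples = 0
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed VCF columns.
            n_samples = len(line.rstrip("\n").split("\t")) - 9
        elif line.strip():
            n_variants += 1  # one record per variant site
    return n_variants, n_samples


def vcf_dimensions_from_file(path):
    """Same as vcf_dimensions, reading a gzipped VCF from disk."""
    with gzip.open(path, "rt") as handle:
        return vcf_dimensions(handle)
```

If data1.vcf.gz matches the description above, vcf_dimensions_from_file("data1.vcf.gz") should report 2926 variants and 105 samples.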
admixtools
This directory contains the files required to reproduce D-statistics and f3 statistics using ADMIXTOOLS based on the genome-wide SNPs data.
data2.geno/ind/ped/snp: input files for ADMIXTOOLS/admixr based on 6789 SNPs across 24 samples of M. cordyla with C. ignobilis as an outgroup (2nd dataset as referred to in the manuscript).
admixtools.R: An R script to run ADMIXTOOLS via admixr, a wrapper R package for ADMIXTOOLS.
arlequin
This directory contains the files required to reproduce the genome-wide Fst value using Arlequin v3.5.2.2. To get the result, put the two files listed below in the directory where the console version of Arlequin (arlecore) is installed, and execute LaunchArlecore.sh.
data1_pure.arp: SNPs data to be fed to Arlequin. 2962 SNPs across 102 samples (i.e., 1st dataset as referred to in the manuscript minus 3 individuals of putative early-generation hybrids).
data1_pure.ars: setting file for Arlequin.
bayesass
This directory contains the files required to reproduce the result of Bayesass.
data1.vcf.gz: a gzipped vcf file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).
popmap.tsv: population information for the samples.
bayesass.sh: procedure to reproduce the result.
rep_summary.sh: script to summarize the result (used in bayesass.sh)
rep_summary.R: script to summarize the result (used in bayesass.sh)
ugnix: contains ugnix scripts (used in bayesass.sh). See https://github.com/brannala/ugnix
rep*: contains the input files prepared by the author. bayesass.sh explains both how to run Bayesass using these files and how to prepare these files by subsampling data1.vcf.gz.
BPP
This directory contains the files required to reproduce the result of BPP.
BPP.sh: procedure to reproduce the result.
filtered.fa.gz: A fasta file filtered as described in Materials and Methods, containing 1197 loci of 24 samples.
Imap.txt: population information for the samples to be used in BPP run.
msci_mC_A.ctl, msci_mC_BEast.ctl, msci_mC_BWest.ctl, msci_mC_C.ctl: control files for BPP.
popmap: population information for the samples to be used for data conversion.
easySFS
This directory contains the files required to reproduce SFS based on the genome-wide SNPs data. easySFS developed by Isaac Overcast (https://github.com/isaacovercast/easySFS) is used.
data4.vcf.gz: A gzipped vcf file of 6,341 SNPs across 24 samples, polarized by two outgroups C. ignobilis and C. melampygus, used for creating the SFS for demographic model selection (4th dataset as referred to in the manuscript).
data5.vcf.gz: A gzipped vcf file of 22,326 SNPs plus 920,184 invariant sites across 24 samples, polarized by two outgroups C. ignobilis and C. melampygus, used for creating the SFS for demographic parameter estimation and Stairway Plot analysis (5th dataset as referred to in the manuscript).
pop.tsv: population information for the samples.
easySFS.sh: procedure to reproduce SFS.
ecology
This directory contains the files required to reproduce the result of the Random Forest classification, hierarchical clustering, and Factor Analysis of Mixed Data of the ecology data.
ecology_data.csv: a comma-separated file of ecology data for M. cordyla in Taiwan, reproduced from Su et al. (2020).
ecology.R: An R script for performing the analyses.
fastsimcoal
This directory contains scripts and other files required to reproduce the results of demographic model selection and parameter estimation using fastsimcoal2. Results of the fastsimcoal runs are also provided because the runs are computationally intensive and take a long time to reproduce.
The files related to demographic model selection and parameter estimation are provided in separate directories ("modelselectoin/" and "param/"). For more details, please see the readme files in these directories.
fst_locus_by_locus
This directory contains the files required to reproduce locus-by-locus Fst between genetic clusters using populations implemented in Stacks based on the genome-wide SNPs data.
data1_pure.arp: SNPs data to be fed to populations. 2962 SNPs across 102 samples (i.e., 1st dataset as referred to in the manuscript minus 3 individuals of putative early-generation hybrids).
popmap.tsv: population information for the samples.
fst_locus_by_locus.sh: procedure to reproduce the result.
hiest
This directory contains the files required to reproduce the results of the HIest analysis based on the genome-wide SNPs data.
The files with the prefix "2pop" are for the analysis assuming the East and West clusters to be potential parents, whereas those with "3pop" are for the analysis assuming all three clusters to be potential parents.
2pop_G_2pop.tsv: genotype matrix of 108 loci whose allele frequencies were differentiated by > 0.5 between the East and West clusters. First row is locus name. A, T, G, C are coded as 1, 2, 3, 4.
2pop_G_rownames.txt: a list of sample IDs to be combined with the genotype matrix.
2pop_P.tsv: allele frequency matrix of the 108 loci.
3pop_G_2pop.tsv: genotype matrix of 130 loci whose allele frequencies differed by > 0.5 between at least one pair of clusters. First row is locus name. A, T, G, C are coded as 1, 2, 3, 4.
3pop_G_rownames.txt: a list of sample IDs to be combined with the genotype matrix.
3pop_P.tsv: allele frequency matrix of the 130 loci.
hiest.R: R script to perform HIest analysis.
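The nucleotide coding and the locus selection rule described above can be illustrated with a short Python sketch. This is not the author's code (the analysis itself is run by hiest.R); the names encode_alleles and differentiated_loci are hypothetical, and the sketch assumes that each frequency dictionary maps a locus name to the frequency of the same allele in one cluster.

```python
# Coding stated in this README: A, T, G, C are coded as 1, 2, 3, 4.
NUCLEOTIDE_CODES = {"A": 1, "T": 2, "G": 3, "C": 4}


def encode_alleles(alleles):
    """Convert a sequence of nucleotide letters to the integer coding."""
    return [NUCLEOTIDE_CODES[allele] for allele in alleles]


def differentiated_loci(freq_a, freq_b, threshold=0.5):
    """Return loci whose allele frequency differs by more than threshold.

    freq_a and freq_b map locus name -> frequency of the same allele
    in two clusters (an assumption made for this illustration).
    """
    return [
        locus
        for locus in freq_a
        if abs(freq_a[locus] - freq_b[locus]) > threshold
    ]
```

For the "3pop" files, the same comparison would be applied to each pair of clusters, keeping a locus if any pair passes the threshold.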
hyde
This directory contains the files required to reproduce the results of HyDe analysis based on the genome-wide SNPs data.
data2.phylip: a phylip format data of 6,789 SNPs across 24 samples of M. cordyla plus C. ignobilis as an outgroup (2nd dataset as referred to in the manuscript).
data2.map: population information of the samples.
data2.trio: a file to specify triples to be tested.
hyde.sh: procedure to reproduce the results.
morphology
This directory contains the files required to reproduce the result of the Principal Components Analysis of morphological characters.
morphology.txt: a tab-separated file of raw morphology data for 37 samples. 54 characters including 37 linear measurements and 17 countable characters are listed, of which 24 are used for the PCA. All linear measurements are in mm.
pca.R: An R script for performing the PCA.
mtDNA_alignment
This directory contains the following mitochondrial DNA sequence alignments:
CR.fas: 760 bp alignment of control region sequences of 70 samples.
Cytb.fas: 873 bp alignment of Cytb sequences of 161 samples.
coding.fas: 5124 bp alignment of 7 protein-coding genes of 20 samples. Partial sequences of the following genes are concatenated in this order: COI (606 bp); COII (657 bp); Cytb (873 bp); ND1 (828 bp); ND2 (531 bp); ND4 (828 bp); ND5 (801 bp).
all.fas: 5884 bp alignment (including gaps) of the 7 protein-coding genes (same as above) plus control region sequences of 20 samples.
all_ignobilis.fas: 5919 bp alignment (including gaps) of the 7 protein-coding genes (same as above) plus control region sequences of Caranx ignobilis as an outgroup and 11 samples of M. cordyla. The difference in the alignment length between "all.fas" and "all_ignobilis.fas" is due to gaps inserted in the control region sequence of the latter.
Sequence titles consist of voucher number, assignment to genetic cluster (east, west, north, or their hybrid), country and region of collection.
"all_ignobilis.fas" was used for divergence time estimation, whereas others were used for haplotype network construction.
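Because coding.fas is a concatenation of seven genes in a fixed order, the coordinates of each gene within the alignment follow directly from the lengths listed above. The sketch below derives them; it is an illustration only, and the function name partition_ranges is hypothetical.

```python
# Gene order and lengths (bp) as listed for coding.fas in this README.
GENES = [
    ("COI", 606),
    ("COII", 657),
    ("Cytb", 873),
    ("ND1", 828),
    ("ND2", 531),
    ("ND4", 828),
    ("ND5", 801),
]


def partition_ranges(genes):
    """Return (name, start, end) tuples with 1-based inclusive coordinates."""
    ranges = []
    start = 1
    for name, length in genes:
        end = start + length - 1
        ranges.append((name, start, end))
        start = end + 1
    return ranges


for name, start, end in partition_ranges(GENES):
    print(f"{name}: {start}-{end}")
```

The last coordinate equals the stated alignment length of 5124 bp, which gives a quick consistency check on the listed gene lengths.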
mtDNA_divergencetime
This directory contains the files required to reproduce the result of divergence time estimation based on mtDNA sequences.
mtDNA_alignment.nex: a nexus file containing the aligned mtDNA sequences.
run1/, run2/: directories containing .xml files to be used in BEAST runs.
mtDNA_divergence.sh: procedure to reproduce the result.
nQuire
This directory contains the files required to reproduce the result of nQuire.
popmap.tsv: population information for the samples.
nQuire.sh: procedure to reproduce the result.
bis: contains the input files (filtered, denoised, and converted to the "BIN" format required by nQuire) prepared by the author. nQuire.sh explains both how to run nQuire using these files and how to prepare these files from BAM files.
snapp
This directory contains the files required to reproduce the result of divergence time estimation based on the genome-wide SNPs data using SNAPP.
data3.vcf: a vcf file of 3179 SNPs across 9 M. cordyla samples plus a C. ignobilis as an outgroup (3rd dataset as referred to in the manuscript).
samples.txt: population information for the samples.
constraints.txt: a file specifying age constraints.
snapp.sh: procedure to reproduce the result.
splitstree
This directory contains the file required to reproduce the neighbor-net network using SplitsTree v5.3.0.
splitstree.nex: a nexus file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).
stairwayplot
This directory contains the files required to reproduce the result of historical population size estimations using Stairway Plot 2 based on the genome-wide SNPs data.
0_west / 1_north / 2_east_DAFpop0.obs: SFS for each cluster calculated using easySFS based on 22,326 SNPs plus 920,184 invariant sites across 24 samples (8 per cluster), polarized by two outgroups C. ignobilis and C. melampygus (5th dataset as referred to in the manuscript).
west / east / north.blueprint: input file for Stairway Plot 2 containing the SFS for each cluster and other settings, manually prepared following https://github.com/xiaoming-liu/stairway-plot-v2/blob/master/READMEv2.1.pdf
stairwayplot.sh: procedure to reproduce the results.
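As a rough illustration of what the per-cluster SFS files represent, the sketch below tallies an unfolded (derived-allele) site frequency spectrum from per-site derived allele counts. It is not the easySFS implementation; the function name unfolded_sfs is hypothetical, and the sketch assumes diploid samples with no missing data, so that 8 samples per cluster give 16 chromosomes and 17 frequency classes (0 to 16 derived copies).

```python
def unfolded_sfs(derived_counts, n_chromosomes):
    """Tally sites by the number of derived allele copies observed.

    derived_counts: iterable of integers in the range 0..n_chromosomes.
    Returns a list of length n_chromosomes + 1; entry i is the number
    of sites carrying i derived copies.
    """
    sfs = [0] * (n_chromosomes + 1)
    for count in derived_counts:
        if not 0 <= count <= n_chromosomes:
            raise ValueError(f"count {count} outside 0..{n_chromosomes}")
        sfs[count] += 1
    return sfs


# Toy example: 8 diploid samples -> 16 chromosomes.
toy_counts = [0, 0, 1, 3, 16, 1]
toy_sfs = unfolded_sfs(toy_counts, n_chromosomes=16)
```

Under these assumptions, sites with no derived copies in a cluster fall into the first class and sites fixed for the derived allele fall into the last class.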
structure
This directory contains the files required to reproduce the result of STRUCTURE analysis based on the genome-wide SNPs data.
data1.str: 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript) formatted for STRUCTURE.
structure.sh: procedure to reproduce the results.
We analyzed a total of 160 specimens of M. cordyla collected from various localities in the western Pacific.
We obtained partial sequences of the mitochondrial cytochrome b gene (CYTB: 873 bp) from all specimens. Additionally, we obtained partial sequences of the following seven loci from a subset of the specimens: cytochrome c oxidase subunits 1 (COI: 612 bp) and 2 (COII: 657 bp), NADH dehydrogenase subunits 1 (ND1: 828 bp), 2 (ND2: 531 bp), 4 (ND4: 828 bp), and 5 (ND5: 801 bp), and the control region (CR: 760 bp). These sequences were used for divergence time estimation, haplotype network construction, and Neighbor-Joining tree estimation.
We genotyped 115 specimens using MIG-seq, a reduced representation sequencing method targeting SNPs in the inter-simple sequence repeat region. We processed the raw MIG-seq reads using the single-end mode of fastp v0.20.1, merging the processed reads 1 and 2 into a single FASTQ file, and mapped them to the Caranx ignobilis reference genome (GenBank JAFHLA000000000.1) using BWA-mem v0.7.17 with default settings. Unmapped or low-quality reads were removed using Samtools v1.12. The resulting BAM files were processed using Stacks v2.5.4, bcftools v1.8, VCFtools v0.1.16, and R package ‘SNPfiltR’ v1.0.0 to prepare suitable datasets for downstream analyses.
We examined 37 linear measurements and 17 countable characters in 37 specimens.
The ecological data are based on the catch data of M. cordyla in Taiwan during 2000–2001, originally reported in Su et al. (2020; Scientific Reports 10: 16829).
More details of data collection, processing, and analyses are described in the Supplementary Material (Supplement_text.docx).
