Data from: Homoploid hybrid speciation in a marine pelagic fish

Muto, Nozomu; Su, Yong-Chao; Hata, Harutaka; Nguyen, Van Quan; Vilasri, Veera; Ghaffar, Mazlan Abd.; Babaran, Ricardo P.

Published Sep 17, 2025 on Dryad. https://doi.org/10.5061/dryad.kwh70rzdn

Data files

Sep 17, 2025 version files (583.94 MB in total)

data_from_Muto_et_al.zip: 583.93 MB
README.md: 11.23 KB

Abstract

Homoploid hybrid speciation (HHS) is an enigmatic evolutionary process where new species arise through hybridization of divergent lineages without changes in chromosome number. Although increasingly documented in various taxa and ecosystems, convincing cases of HHS in marine fishes have been lacking. This study presents a possible case of HHS in a pelagic marine fish based on comprehensive genomic, morphological, and ecological analyses. Population genomics, species tree estimation, and tests of introgression and admixture identified three sympatric clusters in Megalaspis cordyla in the western Pacific and the admixed nature of one cluster between the others. Moreover, model-based demographic inference favored a hybrid speciation scenario over introgression for the origin of the admixed cluster. While contemporary gene flow suggested partial reproductive isolation, examination of occurrence data and ecologically relevant morphological characters suggested ecological differences between the clusters, potentially contributing to the reproductive isolation and niche partitioning in sympatry. The clusters are also morphologically distinguishable and thus can be taxonomically recognized as separate species. The hybrid cluster is restricted to the coasts of Taiwan and Japan, where all three clusters coexist. The parental clusters are additionally found in lower latitudes, where they display non-overlapping distributions. Given the geographical distributions, estimated times of the species formation, and patterns of historical demographic changes, we propose that the Pleistocene glacial cycles were the primary driver of HHS in this system. We also develop an ecogeographic model of HHS in marine coastal ecosystems, including a novel hypothesis to explain the initial stages of HHS.

Description of the data and file structure

This dataset consists of genetic, morphological, and ecological data and the associated code to reproduce the results (data_from_Muto_et_al.zip), plus supplementary materials (Supplement_Figs.pdf; Supplement_Tables.xlsx; Supplement_text.docx). Data and code for each analysis are provided in separate directories, which are compressed into a single zip file. The data were obtained as outlined in Methods. The software and packages required are described in the code.

Contents of each directory

PCA

This directory contains the files required to reproduce the result of the Principal Components Analysis based on the genome-wide SNPs data.
data1.vcf.gz: a gzipped vcf file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).
pca.R: An R script for performing PCA using the adegenet package.
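
As a rough illustration of what a PCA on SNP genotypes computes, the sketch below centers a samples-by-SNPs matrix of alternate-allele counts and takes its singular value decomposition. It is not the adegenet procedure used in pca.R: the function name `pca_scores` and the toy matrix are hypothetical, the genotypes are assumed to be already converted to 0/1/2 counts, and missing data are not handled.

```python
import numpy as np

def pca_scores(dosage):
    """Return (scores, explained_ratio) for a samples x SNPs dosage matrix."""
    x = np.asarray(dosage, dtype=float)
    x = x - x.mean(axis=0)              # center each SNP column
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u * s                      # sample coordinates on each PC
    explained = s**2 / np.sum(s**2)     # share of total variance per PC
    return scores, explained

# Toy data (hypothetical): 4 samples x 3 SNPs, alternate-allele counts 0/1/2.
toy = [[0, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 2]]
scores, explained = pca_scores(toy)
print(np.round(scores[:, :2], 3))
print(np.round(explained, 3))
```

In the toy matrix the first two samples and the last two samples fall on opposite sides of the first axis, which is the kind of separation a cluster-level PCA plot displays.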

admixtools

This directory contains the files required to reproduce D-statistics and f3 statistics using ADMIXTOOLS based on the genome-wide SNPs data.
data2.geno/ind/ped/snp: input files for ADMIXTOOLS/admixr based on 6789 SNPs across 24 samples of M. cordyla with C. ignobilis as an outgroup (2nd dataset as referred to in the manuscript).
admixtools.R: An R script to run ADMIXTOOLS via admixr, a wrapper R package for ADMIXTOOLS.
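
For readers unfamiliar with the two statistics, the sketch below computes point estimates from per-SNP allele frequencies: D(W, X; Y, Z) as the ratio of summed numerators and denominators across SNPs, and an uncorrected f3(C; A, B) as the mean of (c - a)(c - b). This is only an illustration under stated assumptions. The function names and the toy frequencies are hypothetical, and the sketch omits the finite-sample bias correction, normalization, and block-jackknife standard errors that ADMIXTOOLS applies.

```python
import numpy as np

def d_statistic(p_w, p_x, p_y, p_z):
    """Point estimate of D(W, X; Y, Z) from per-SNP allele frequencies."""
    w, x, y, z = (np.asarray(a, dtype=float) for a in (p_w, p_x, p_y, p_z))
    numerator = np.sum((w - x) * (y - z))
    denominator = np.sum((w + x - 2 * w * x) * (y + z - 2 * y * z))
    return float(numerator / denominator)

def f3_statistic(p_c, p_a, p_b):
    """Uncorrected f3(C; A, B): mean of (c - a)(c - b) across SNPs."""
    c, a, b = (np.asarray(v, dtype=float) for v in (p_c, p_a, p_b))
    return float(np.mean((c - a) * (c - b)))

# Toy frequencies (hypothetical): C is the midpoint of A and B at every SNP.
pop_a = [0.0, 1.0, 0.2]
pop_b = [1.0, 0.0, 0.8]
pop_c = [0.5, 0.5, 0.5]
outgroup = [0.0, 0.0, 0.0]

f3_value = f3_statistic(pop_c, pop_a, pop_b)        # negative for an admixed target
d_value = d_statistic(pop_a, pop_b, pop_a, outgroup)
print(round(f3_value, 4), round(d_value, 4))
```

A negative f3 with the tested cluster as the target is the signal usually read as evidence of admixture; whether it is significant depends on the standard error, which this sketch does not estimate.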

arlequin

This directory contains the files required to reproduce the genome-wide Fst value using Arlequin v3.5.2.2. To get the result, put the two files listed below in the directory where the console version of Arlequin (arlecore) is installed, and execute LaunchArlecore.sh.
data1_pure.arp: SNPs data to be fed to Arlequin. 2962 SNPs across 102 samples (i.e., 1st dataset as referred to in the manuscript minus 3 individuals of putative early-generation hybrids).
data1_pure.ars: setting file for Arlequin.

bayesass

This directory contains the files required to reproduce the result of Bayesass.
data1.vcf.gz: a gzipped vcf file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).
popmap.tsv: population information for the samples.
bayesass.sh: procedure to reproduce the result.
rep_summary.sh: script to summarize the result (used in bayesass.sh)
rep_summary.R: script to summarize the result (used in bayesass.sh)
ugnix: contains ugnix scripts (used in bayesass.sh). See https://github.com/brannala/ugnix
rep*: contains the input files prepared by the author. bayesass.sh explains both how to run Bayesass using these files and how to prepare these files by subsampling data1.vcf.gz

BPP

This directory contains the files required to reproduce the result of BPP.
BPP.sh: procedure to reproduce the result.
filtered.fa.gz: A fasta file filtered as described in Materials and Methods, containing 1197 loci of 24 samples.
Imap.txt: population information for the samples to be used in BPP run.
msci_mC_A.ctl, msci_mC_BEast.ctl, msci_mC_BWest.ctl, msci_mC_C.ctl: control files for BPP.
popmap: population information for the samples to be used for data conversion.

easySFS

This directory contains the files required to reproduce SFS based on the genome-wide SNPs data. easySFS developed by Isaac Overcast (https://github.com/isaacovercast/easySFS) is used.
data4.vcf.gz: A gzipped vcf file of 6,341 SNPs across 24 samples, polarized by two outgroups C. ignobilis and C. melampygus, used for creating the SFS for demographic model selection (4th dataset as referred to in the manuscript).
data5.vcf.gz: A gzipped vcf file of 22,326 SNPs plus 920,184 invariant sites across 24 samples, polarized by two outgroups C. ignobilis and C. melampygus, used for creating the SFS for demographic parameter estimation and Stairway Plot analysis (5th dataset as referred to in the manuscript).
pop.tsv: population information for the samples.
easySFS.sh: procedure to reproduce SFS.
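
To make the SFS concrete, the sketch below tallies an unfolded (derived-allele) site frequency spectrum for one population from per-site derived-allele counts. It is a minimal illustration, not what easySFS does internally (easySFS also handles projection and missing data). The function name and the toy counts are hypothetical, and the sample size of 8 diploid individuals per cluster is taken from the description of the Stairway Plot input.

```python
def unfolded_sfs(derived_counts, n_chromosomes):
    """Count sites by number of derived alleles (bins 0..n_chromosomes)."""
    sfs = [0] * (n_chromosomes + 1)
    for k in derived_counts:
        if not 0 <= k <= n_chromosomes:
            raise ValueError(f"derived count {k} outside 0..{n_chromosomes}")
        sfs[k] += 1
    return sfs

# 8 diploid individuals per cluster give 16 sampled chromosomes.
N_CHROMOSOMES = 2 * 8

# Toy derived-allele counts (hypothetical), one value per site.
toy_counts = [1, 1, 2, 5, 16, 0, 0]
sfs = unfolded_sfs(toy_counts, N_CHROMOSOMES)
print(sfs)
```

The first and last bins hold sites that are monomorphic within the population, which is where invariant sites end up when they are included in the input.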

ecology

This directory contains the files required to reproduce the result of the Random Forest classification, hierarchical clustering, and Factor Analysis of Mixed Data of the ecology data.
ecology_data.csv: a comma-separated file of ecology data for M. cordyla in Taiwan, reproduced from Su et al. (2020).
ecology.R: An R script for performing the analyses.

fastsimcoal

This directory contains scripts and other files required to reproduce the results of demographic model selection and parameter estimation using fastsimcoal2. Results of the fastsimcoal runs are also provided since they are computationally intensive and take a long time to reproduce.
The files related to demographic model selection and parameter estimation are provided in separate directories ("modelselectoin/" and "param/"). For more details, please see the readme files in these directories.

fst_locus_by_locus

This directory contains the files required to reproduce locus-by-locus Fst between genetic clusters using populations implemented in Stacks based on the genome-wide SNPs data.
data1_pure.arp: SNPs data to be fed to populations. 2962 SNPs across 102 samples (i.e., 1st dataset as referred to in the manuscript minus 3 individuals of putative early-generation hybrids).
popmap.tsv: population information for the samples.
fst_locus_by_locus.sh: procedure to reproduce the result.

hiest

This directory contains the files required to reproduce the results of the HIest analysis based on the genome-wide SNPs data.
The files with the prefix "2pop" are for the analysis assuming the East and West clusters to be potential parents, whereas those with "3pop" are for the one assuming the three clusters to be potential parents.
2pop_G_2pop.tsv: genotype matrix of 108 loci whose allele frequencies were differentiated by > 0.5 between the East and West clusters. First row is locus name. A, T, G, C are coded as 1, 2, 3, 4.
2pop_G_rownames.txt: a list of sample IDs to be combined with the genotype matrix.
2pop_P.tsv: allele frequency matrix of the 108 loci.
3pop_G_2pop.tsv: genotype matrix of 130 loci whose allele frequencies differed by > 0.5 between at least one pair of clusters. First row is locus name. A, T, G, C are coded as 1, 2, 3, 4.
3pop_G_rownames.txt: a list of sample IDs to be combined with the genotype matrix.
3pop_P.tsv: allele frequency matrix of the 130 loci.
hiest.R: R script to perform HIest analysis.
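
The numeric allele coding used in the genotype matrices (A, T, G, C as 1, 2, 3, 4) can be translated back to nucleotides as sketched below. This is a small helper for inspection only. The function name is hypothetical, and because the README does not describe how the two alleles of a genotype or missing values are laid out in the files, the sketch works on a flat sequence of allele codes and returns None for any code outside 1 to 4.

```python
# Coding stated in the file descriptions: A, T, G, C are coded as 1, 2, 3, 4.
CODE_TO_BASE = {1: "A", 2: "T", 3: "G", 4: "C"}

def decode_alleles(codes):
    """Map integer allele codes to nucleotides; unknown codes become None."""
    decoded = []
    for code in codes:
        try:
            decoded.append(CODE_TO_BASE.get(int(code)))
        except (TypeError, ValueError):
            decoded.append(None)   # non-numeric entry, e.g. a missing-data marker
    return decoded

# Toy input (hypothetical): codes as they might appear after splitting a row.
example = decode_alleles(["1", "2", "3", "4", "NA", 0])
print(example)
```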

hyde

This directory contains the files required to reproduce the results of HyDe analysis based on the genome-wide SNPs data.
data2.phylip: a phylip format data of 6,789 SNPs across 24 samples of M. cordyla plus C. ignobilis as an outgroup (2nd dataset as referred to in the manuscript).
data2.map: population information of the samples.
data2.trio: a file to specify triples to be tested.
hyde.sh: procedure to reproduce the results.

morphology

This directory contains the files required to reproduce the result of the Principal Components Analysis of morphological characters.
morphology.txt: a tab-separated file of raw morphology data for 37 samples. 54 characters including 37 linear measurements and 17 countable characters are listed, of which 24 are used for the PCA. All linear measurements are in mm.
pca.R: An R script for performing the PCA.

mtDNA_alignment

This directory contains the following mitochondrial DNA sequence alignments:
CR.fas: 760 bp alignment of control region sequences of 70 samples.
Cytb.fas: 873 bp alignment of Cytb sequences of 161 samples.
coding.fas: 5124 bp alignment of 7 protein-coding genes of 20 samples. Partial sequences of the following genes are concatenated in this order: COI (606 bp); COII (657 bp); Cytb (873 bp); ND1 (828 bp); ND2 (531 bp); ND4 (828 bp); ND5 (801 bp).
all.fas: 5884 bp alignment (including gaps) of the 7 protein-coding genes (same as above) plus control region sequences of 20 samples.
all_ignobilis.fas: 5919 bp alignment (including gaps) of the 7 protein-coding genes (same as above) plus control region sequences of Caranx ignobilis as an outgroup and 11 samples of M. cordyla. The difference in the alignment length between "all.fas" and "all_ignobilis.fas" is due to gaps inserted in the control region sequence of the latter.
Sequence titles consist of voucher number, assignment to genetic cluster (east, west, north, or their hybrid), country and region of collection.
"all_ignobilis.fas" was used for divergence time estimation, whereas others were used for haplotype network construction.
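
Because coding.fas concatenates the seven genes in a fixed order with stated lengths, the column range of each gene can be derived by cumulative addition, for example when defining partitions. The sketch below does this with 1-based inclusive coordinates. The function name is hypothetical, and the ranges apply to coding.fas only, since the position of the control region within all.fas and all_ignobilis.fas is not stated here.

```python
# Gene order and lengths (bp) as listed for coding.fas.
GENES = [
    ("COI", 606), ("COII", 657), ("Cytb", 873), ("ND1", 828),
    ("ND2", 531), ("ND4", 828), ("ND5", 801),
]

def partition_ranges(genes):
    """Return (name, start, end) with 1-based inclusive alignment columns."""
    ranges = []
    start = 1
    for name, length in genes:
        end = start + length - 1
        ranges.append((name, start, end))
        start = end + 1
    return ranges

ranges = partition_ranges(GENES)
for name, start, end in ranges:
    print(f"{name}: {start}-{end}")
```

The last range ends at column 5124, matching the stated length of coding.fas, and adding the 760 bp control region alignment gives the 5884 bp reported for all.fas.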

mtDNA_divergencetime

This directory contains the files required to reproduce the result of divergence time estimation based on mtDNA sequences.
mtDNA_alignment.nex: a nexus file containing the aligned mtDNA sequences.
run1/, run2/: directories containing .xml files to be used in BEAST runs.
mtDNA_divergence.sh: procedure to reproduce the result.

nQuire

This directory contains the files required to reproduce the result of nQuire.
popmap.tsv: population information for the samples.
nQuire.sh: procedure to reproduce the result.
bis: contains the input files (filtered, denoised, and converted to "BIN" format required by nQuire) prepared by the author. nQuire.sh explains both how to run nQuire using these files and how to prepare these files from BAM files.

snapp

This directory contains the files required to reproduce the result of divergence time estimation based on the genome-wide SNPs data using SNAPP.
data3.vcf: a vcf file of 3179 SNPs across 9 M. cordyla samples plus a C. ignobilis as an outgroup (3rd dataset as referred to in the manuscript).
samples.txt: population information for the samples.
constraints.txt: a file specifying age constraints.
snapp.sh: procedure to reproduce the result.

splitstree

This directory contains the file required to reproduce the neighbor-net network using SplitsTree v5.3.0.
splitstree.nex: a nexus file of 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript).

stairwayplot

This directory contains the files required to reproduce the result of historical population size estimations using Stairway Plot 2 based on the genome-wide SNPs data.
0_west / 1_north / 2_east_DAFpop0.obs: SFS for each cluster calculated using easySFS based on 22,326 SNPs plus 920,184 invariant sites across 24 samples (8 per cluster), polarized by two outgroups C. ignobilis and C. melampygus (5th dataset as referred to in the manuscript).
west / east / north.blueprint: input file for Stairway Plot 2 containing the SFS for each cluster and other settings, manually prepared following https://github.com/xiaoming-liu/stairway-plot-v2/blob/master/READMEv2.1.pdf
stairwayplot.sh: procedure to reproduce the results.

structure

This directory contains the files required to reproduce the result of STRUCTURE analysis based on the genome-wide SNPs data.
data1.str: 2926 SNPs across 105 samples (1st dataset as referred to in the manuscript) formatted for STRUCTURE.
structure.sh: procedure to reproduce the results.