Species delimitation in the Eviota sigillata species complex, a widely distributed group of cryptobenthic coral reef fishes
Data files
Jul 11, 2025 version files 244.85 GB
-
GBS-Plate-1_S195_L004_R1_001.fastq.gz
37.78 GB
-
GBS-Plate-1_S195_L004_R2_001.fastq.gz
39.16 GB
-
GBS-Plate-1_S81_L002_R1_001.fastq.gz
21.85 GB
-
GBS-Plate-1_S81_L002_R2_001.fastq.gz
19.05 GB
-
GBS-Plate-2_S196_L004_R1_001.fastq.gz
40.79 GB
-
GBS-Plate-2_S196_L004_R2_001.fastq.gz
42.48 GB
-
GBS-Plate-2_S82_L002_R1_001.fastq.gz
23.26 GB
-
GBS-Plate-2_S82_L002_R2_001.fastq.gz
20.41 GB
-
README.md
5.30 KB
-
sigillata_104_660013.phy
68.64 MB
-
tfs_27x1637_struct.txt
194.38 KB
-
tfs_29x1640.min4.phy
48.21 KB
Abstract
Cryptobenthic reef fishes (CRF) are the smallest vertebrates on coral reefs but represent about 40% of the fish species and about 50% of fish abundance in coral reef ecosystems. Their diversity can be explained by their extremely limited dispersal abilities and short generation times (promoting allopatric speciation) coupled with their ability to partition microhabitats at a very fine scale. Importantly, for some groups of CRF, their small size, cryptic nature, and conserved morphology have resulted in many undetected cryptic species, which may require a genome-wide species delimitation approach to discern how many species are present. One of the most species-rich groups of CRF, the genus Eviota, has 132 species described to date, is widely distributed from the Red Sea to Hawaii and French Polynesia, and is known to comprise numerous cryptic species. We focused on the Eviota sigillata complex which is represented by two nominal species described by morphological characters, yet preliminary genetic data suggest the presence of multiple cryptic lineages. Here we use molecular data from mitochondrial DNA and genome-wide SNP data generated via double digest restriction site associated sequencing (ddRADseq), in combination with morphological data to infer the number of species in the E. sigillata complex. Specifically, we constructed phylogenetic trees and conducted several types of single-locus and multilocus species delimitation analyses and compared these to groupings based on morphology, as well as their geographic distribution. Overall, we recovered evidence for the presence of 9-13 lineages within the E. sigillata species complex, with genetic lineages corresponding well with the biogeography of the group. We further confirmed that the original morphological diagnostic characters used for the separation of the two nominal species were not useful for distinguishing each of the nine clades in the complex but may be helpful in diagnosing groups of species. Overall, our study sheds light onto the patterns of speciation within CRF and provides a glimpse of the tremendous hidden diversity that still remains in coral reef fishes.
Dataset DOI: https://doi.org/10.5061/dryad.69p8cz9cn
Description of the data and file structure
To estimate the number of species in the E. sigillata complex we took an integrative approach that combined multiple data sources and analyses. Using a combination of freshly collected samples and museum specimens, we holistically evaluated the number of species estimated from: (i) species delimitation analysis of mitochondrial COI data using both tree-based and non-tree-based approaches; (ii) genome-wide phylogenetic analyses of thousands of loci from ddRADseq data; for several clades, we performed a (iii) multispecies coalescent species-tree analysis coupled with Bayes Factor species delimitation using SNAPP + BFD*, as well as (iv) a genome wide analysis of population genetic variation using STRUCTURE; and (v) an analysis of morphology to identify putatively diagnostic characters for different clades. Information from each of these analyses was holistically assessed to conclude a plausible number of species likely present in the complex.
Files and variables
File: GBS-Plate-1_S81_L002_R1_001.fastq.gz
Description: Plate 1 with raw forward sequences from ddRADseq processing (run 1)
File: GBS-Plate-2_S82_L002_R1_001.fastq.gz
Description: Plate 2 with raw forward sequences from ddRADseq processing (run 1)
File: GBS-Plate-1_S81_L002_R2_001.fastq.gz
Description: Plate 1 with raw reverse sequences from ddRADseq processing (run 1)
File: GBS-Plate-2_S82_L002_R2_001.fastq.gz
Description: Plate 2 with raw reverse sequences from ddRADseq processing (run 1)
File: GBS-Plate-1_S195_L004_R1_001.fastq.gz
Description: Plate 1 with raw forward sequences from ddRADseq processing (run 2)
File: GBS-Plate-1_S195_L004_R2_001.fastq.gz
Description: Plate 1 with raw reverse sequences from ddRADseq processing (run 2)
File: GBS-Plate-2_S196_L004_R1_001.fastq.gz
Description: Plate 2 with raw forward sequences from ddRADseq processing (run 2)
File: GBS-Plate-2_S196_L004_R2_001.fastq.gz
Description: Plate 2 with raw reverse sequences from ddRADseq processing (run 2)
File: sigillata_104_660013.phy
Description: This file contains a concatenated alignment in PHYLIP format derived from double digest Restriction-site Associated DNA sequencing (ddRADseq) data. The dataset includes 660,013 loci obtained from 104 Eviota sigillata specimens, processed using the Stacks pipeline. This alignment was used for phylogenetic tree reconstruction using RAxML of the E. sigillata samples.
File: tfs_29x1640.min4.phy
Description: This file contains a concatenated alignment in PHYLIP format derived from double digest Restriction-site Associated DNA sequencing (ddRADseq) data. The dataset includes 1640 loci obtained from 29 Eviota sigillata specimens restricted to Tonga, Fiji, and Samoa locations processed using the Stacks pipeline. This alignment was used for phylogenetic tree reconstruction using RAxML of the E. sigillata samples, and for the coalescence based species delimitation analysis using Bayes fsctor delimitation (BFD*).
File: tfs_27x1637_struct.txt
Description:
This file contains genotype data formatted for use with the program STRUCTURE. It includes data from 27 Eviota specimens and 1,637 SNP loci, obtained through ddRADseq and processed using the Stacks pipeline. The dataset was filtered to include only one SNP per locus to minimize linkage, making it suitable for inferring population structure and admixture patterns.
Usage notes:
This file is ready for direct input into STRUCTURE. Each row represents an individual, and each column represents a locus. Missing data are coded as -9
.
Code/software
To view and analyze the data provided in this dataset, the following free and open-source software packages are recommended:
- Stacks (https://catchenlab.life.illinois.edu/stacks/) – used for demultiplexing, aligning, and processing ddRADseq data to generate SNP loci. The raw sequence files can be processed through Stacks to reproduce the datasets provided.
- STRUCTURE (https://web.stanford.edu/group/pritchardlab/structure.html) – used for analyzing population structure based on the SNP data. The
tfs_27x1637_struct.txt
file is formatted specifically for this program. - Phylogenetic software RAxML (https://cme.h-its.org/exelixis/web/software/raxml/) was used for constructing phylogenetic trees from the PHYLIP-formatted alignment file
sigillata_104_660013.phy
. - FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) – optional but useful for quality control checks on the raw sequencing files.
All of the above tools are freely available for academic use and are compatible with Windows, macOS, and Linux systems. Scripts used for all the data analyses can be made available upon request to the repository author.
Sample Collection
A total of 123 specimens in the E. sigillata complex were collected from locations across the Indo-Pacific region (Figure 1; see Table S1 for specific location of 102 samples passing the bioinformatic filtering). Specimens were determined to belong to the E. sigillata complex if they keyed out to E. sigillata, E. shimadai, or an ambiguous combination of the two using the key from Greenfield and Winterbottom (2016) or very closely resembled live photographs of these two species from the taxonomic literature. All specimens were captured after being anesthetized using 5% quinaldine sulphate dissolved in seawater or a solution of 5% clove oil dissolved in ethanol dispersed from a squirt bottle (see Tornabene et al., 2013). Specimens were photographed in the field and then preserved in 95% ethanol. We also used two tissue samples of E. sigillata from two specimens from the South African Institute for Aquatic Biodiversity (SAIAB) that were available for genomic analysis and that were collected from the type locality of the described E. sigillata in the Seychelles (SAIAB #77969-406 and #77969-407). We also extracted DNA from a tissue sample of an E. shimadai specimen collected from the type locality and donated by the National Museum of Nature and Science in Tokyo.
Molecular Protocols
DNA was extracted from pectoral fin clips using Qiagen DNeasy Blood and tissue kits (Qiagen, Inc.) following the manufacturer’s protocols. A 53-sample subset from the 123 total samples was selected for initial sequencing for a segment of the mitochondrial gene cytochrome-c oxidase subunit I (COI), using the primers GobyL6468 and GobyH7696 (Thacker 2003) and FishF1 and FishR1for a smaller subset of the samples (Ward et al. 2005). PCR protocols were the same as those used by Tornabene et al. (2016). Sequences were aligned in Geneious v.6.0.6 (Biomatters; www.geneious.com). The final alignment consisted of 1,142 bp of COI, with some samples having a shorter ~650 bp fragment (Table 1).
To capture SNP data, we used ddRADseq. Including two enzymes to ‘double digest’ the DNA provides a more precise selection of fragments throughout the genome that are close to the target size (Peterson et al. 2012). It also allows for reproducibility of results if additional ddRADseq data from the studied group is added later to the existing dataset. DNA extractions were sent to the NGS labs at the University of Wisconsin for sequencing and library preparation. Prior to sequencing, an optimization process conducted at the same lab facilities was used to select the restriction enzymes that would work best for our set of samples. For the enzyme optimization, we selected eight representative samples from across the genus Eviota. We sent the DNA extractions from these specimens to create ddRADseq libraries, employing various combinations of enzymes including ApeKI, PstI/MspI, PstI/BfaI, NsiI/MspI, and NsiI/BfaI. Subsequently, the libraries were analyzed using an Agilent Tapestation 4200 to assess their fragment sizes and profiles. The combination of the NsiI and MspI enzymes was ultimately chosen due to its favorable characteristics, including a smooth profile and a fragment size >300 bp. Libraries were sequenced on Illumina NovaSeq6000 using paired-end 150 bp reads.
ddRADseq libraries were prepared following the methodology outlined in Elshire et al. (2011), with only minor adjustments. To summarize, 100 ng of DNA underwent digestion using NsiI and MspI enzymes from New England Biolabs in Ipswich, MA. Following digestion, barcoded adapters suitable for Illumina sequencing were incorporated through ligation using T4 ligase from New England Biolabs. Subsequently, 123 samples with adapter-ligated DNA were combined and amplified to generate libraries suitable for sequencing, with the removal of adapter dimers achieved through SPRI bead purification. The final libraries' quality and quantity were evaluated using the Agilent Tapestation from Agilent Technologies in Santa Clara, CA, and the Qubit® dsDNA HS Assay Kit from Life Technologies in Grand Island, NY, respectively.
To assemble the ddRADseq raw data, we used the Stacks 2.4.1 pipeline (Catchen et al. 2013) on the University of Washington’s High Performance Computer Cluster. First, the samples were demultiplexed using the process_radtags program for the paired-end reads, including the -c (clean data, remove any read with an uncalled base), -q (discard reads with low quality scores), and -r (rescue barcodes and Rad-Tag cut sites) flags. We used the R packages RADstackshelpR 0.1.0 (https://github.com/DevonDeRaad/RADstackshelpR ) to calculate summary statistics from Stacks output vcf files, and to determine the optimal parameters for assembling putative RAD loci de novo (Mastretta et al. 2015, Paris et al. 2017). These assembly parameters selected were ‘-m=4’ (minimum number of reads required to form a stack), ‘-M=2’ (number of mismatches allowed between stacks within individuals), and ‘-n=3’ (number of mismatches allowed between loc catalog building). To reduce computational burden and as instructed in DeRaad’s RADstackshelpR, we used a subsample of representative individuals from the 10 samples to run this parameter optimization pipeline. With the resulting optimal parameters, we ran the denovo_map.pl program for the 123 samples. Each of the analysis methods required specific levels of filtering and data curation (Table 1). For instance, the --write-single-snp flag was run for all the species delimitation and the population structure analysis in order to select only the first SNP at each locus, whereas we used all variant sites for phylogenetic analysis of ddRAD data. The R package SNPfiltR 1.0.1 (DeRaad 2021) aided the visualization of the proportion of missing data per sample, and in setting an acceptable completeness cutoff per sample that would retain enough SNPs to still allow for robust inferences from each of the performed analyses.
Phylogenetic analyses
From the COI sequence data, we generated a maximum likelihood phylogenetic tree using the GTR+G RAxML-NG v1.1 0 software (Kozlov et al. 2019) on CIPRES Science Gateway online high-performance computer resource (Miller et al. 2010). Support for nodes in the ML tree was assessed with 100 bootstrap replicates. The tree was visualized using FigTree v1.4.4 (Rambaut 2012). Downstream COI species-delimitation analyses require a time-calibrated phylogeny. We generated an ultrametric COI tree with the program BEAST 2 (Bouckaert et al. 2019). We used the ‘GTR+G’ site model and the ‘Optimized Relaxed Clock’ model to estimate the mean clock rate during the analysis. The calibration prior on the age of the root of the tree was set to 1.5 million years using a normal distribution with a mean of 1.5 and standard deviation of 0.5, based on the stem age of the E. sigillata complex estimated by Tornabene et al. (2016).
For the ddRADseq data, we selected a filtered dataset containing 104 individuals (101 sigillata complex and three non-sigillata outgroup samples) with a combined 4,591 loci (RAD sequence) and 660,013 bp (Table 1). This dataset resulted after setting an initial filter specifying a minimum of 20% of individuals required to process a locus. The program SNPfiltR would be run for the remaining data until only 40% of missing data by samples and SNPs was accomplished for the remaining 104 taxa and 4,591 loci. With this dataset, we generated a maximum likelihood tree (from here on referred to as ddRADseq tree), using RAxML-NG v1.1 (Kozlov 2018) under the GTR+G model. We used the CIPRES (Miller et al. 2010) supercomputer portal to run the analysis. Support for nodes was assessed with 100 bootstrap replicates. The trees were visualized using FigTree v1.4.4 (Rambaut 2012).
Single-locus species delimitation analyses
For the COI data, we used the tree-based Generalized Mixed Yule Coalescent (GMYC, Pons 2006) and the distance-based Assemble Species by Automatic Partitioning (ASAP, Puillandre et al. 2020) methods to explore if species delimitation results from the single locus dataset would agree with the results from the species delimitation analyses with the genome-wide dataset. The GMYC is a likelihood method that delimits species by fitting within and between-species branching models to reconstruct gene trees. Using a time-calibrated single-locus tree, GMYC fits a model containing a single species and compares this to the multiple species models using a likelihood ratio (LR) test. For each model, GMYC reports the estimated number of clusters (i.e. species containing more than one terminal taxon in the dataset) and number of entities (i.e. the estimated total number of species including those made up of a single taxon), along with their respective confidence intervals. To perform the GMYC, we used the ultrametric COI tree with the function ‘gmyc’ in the R package ‘splits’ (Ezard et al. 2009; Fujisawa and Barraclough 2013). The ASAP method (Pulliandre et al. 2021) is based on pairwise genetic distances and uses an ascending hierarchical clustering algorithm that merges sequences into ‘groups’. After each merging step, the assignment of all sequences into ‘groups’ is named a ‘partition’. Through a recursive splitting process an asap-score (combination of probability and barcode-gap width metrics) is generated and used to rank the partitions. ASAP does not use a phylogenetic tree and requires a FASTA file of all sample sequences as input into their graphical web interface (https://bioinfo.mnhn.fr/abi/public/asap/#). We selected the ‘simple distance’ (p-distance) as the method to calculate pairwise distance. The lowest asap-generated score is the one that determines the best species model (Puillandre et al. 2021).
Coalescent species delimitation
For our genome wide SNP data, we employed Bayes Factor Delimitation with BFD* program (Leaché et al. 2014) to assess and rank various species delimitation models within the context of a multispecies coalescent framework. In essence, BFD* entails conducting SNAPP analyses with different species counts and individual species assignments, estimating the marginal likelihood for each model, and then comparing Bayes factors based on runs that rank each model tested. This approach provides a quantitative way to delimit species across multiple unlinked loci and is in alignment with the genealogical species concept (Baum and Shaw 1995) which assumes no gene flow as criteria for defining species.
The BFD* analysis was conducted in SNAPP v. 1.3.0 (part of BEAST2, Bouckaert et al. 2019) on a subset of individuals from Tonga, Fiji, and Samoa. A more stringent filtering was necessary for this dataset, since the analysis was focused on a smaller set of data that clustered together in a closely related group. An initial filter when running populations was set at -R 0.5, meaning that a minimum of 50% of individuals across populations was required to process a locus. Further filtering was accomplished by using VCFtools (Danecek et al. 2011) with a minor allele frequency of – maf 0.05, meaning that the minor allele is present in at least 5% of the individuals. The program SNPfiltR would be run for the remaining data, until there were only 18% of missing data by sample and by SNP, which left 29 individuals (two of these were from the nearest outgroup to the Tonga, Fiji, and Samoa clade) and 2,476 loci (Table 1). We further filtered the dataset using the program TASSEL (Bradbury et al. 2007) to remove taxa and loci, such that only loci present in at least one individual per putative species in our most species-rich model (Tonga, Fiji, Samoa, and outgroup) were included (see details in supplementary section). The final filtered dataset consisted of 29 individuals and 1,640 loci. We compared Bayes Factors for models with two, three, and four (three + outgroup) species. We followed the guidelines specified in Leaché et al. (2014) for parameter setting. For Model Parameters, we used Mutation Rates for U (Instantaneous rate of mutation from the 0 allele to the 1 allele) and V (Instantaneous rate of mutation from the 1 allele to the 0 allele) = 1.0 (as recommended in Leaché and Bouckaert 2018), with a Coalescence Rate = 0.01 (checking the Sample and Use Log Likelihood Correction boxes). For Priors, we used the Gamma model, with Alpha and Beta values = 2.0 and 100.0 respectively. For the MCMC, we used a Chain Length = 10,000,000. To determine the marginal likelihood values for each species model, we ran multiple SNAPP analyses with the software BEAST2 (Bouckaert et al. 2019) in the CIPRES supercomputing portal (Miller et al. 2010). SNAPP requires users to specify a priori assignments of taxa as proxy for potential species, thus we used geography to divide our taxa. The three models we ran in BEAST2 were as follows: Run A = 4 species (Samoa, Fiji, Tonga, and outgroup), Run B = 3 species (Fiji+Tonga, Samoa, and outgroup), and Run C = 2 species (Samoa+ Fiji+Tonga, and outgroup). Visualization of results and confirmation of mixing and convergence was done through the program Tracer 1.7.1 (Rambaut et al. 2018). The final step in this process was to compare the marginal likelihood estimates for each species model using the Bayes Factors (BF, Kass and Raftery 1995), calculated with the equation: BF = 2 x (MLE1 – MLE0), where MLE0 is the marginal likelihood of model 0 and MLE1 is the marginal likelihood of model 1. The scale for the BF values identifies what models should be consider as follows: 0 < BF < 2 not supported, 2 < BF < 6 positive evidence, 6 < BF < 10 strong support, and BF > 10 decisive.
Population structure analysis
To investigate the fine-scale genetic structure among a closely-related complex of lineages that could potentially represent new species, or perhaps early-diverging lineages in the ‘anomaly zone’ (i.e. clades of samples from Tonga, Fiji, and Samoa – see above), we took a population genetic approach. For this analysis, we used the same taxa as for the SNAPP/BFD* analysis. The only difference is that the outgroup taxa were removed to avoid introducing confounding genetic diversity not representative of the studied group. The final filtered subset consisted of 27 individuals from the three above-mentioned populations genotyped at 4,235 loci (Table 1). For this subset, we used the program Structure 2.3.4 (Pritchard et al. 2000). This program uses a Bayesian clustering algorithm to assign individuals to genetic clusters based on their genotype data. This is done by examining the rate of change in likelihood values when different numbers of clusters (k) are present without prior knowledge of cluster membership of individual samples. The goal is to identify the k value that maximizes the rate of change of the likelihood value (Δk, Evanno et al. 2005). The peak of the Δk on the resulting graph (of k values vs. Δk) indicates this most likely number of genetic clusters or populations in the dataset. An advantage of using Structure for our data is that SNPs have lower variability than multiple allele loci genomic data (i.e., microsatellite data), thus requiring much smaller sample sizes per putative population (Porras-Hurtado et al. 2013). For the number of genetic clusters (k), we tested a range of values from 1-6 to identify the optimal number of clusters that explained the genetic variation of our dataset. This range was selected based on the phylogenetic analyses showing three distinct clades each from Tonga, Fiji, and Samoa. For the burn-in and MCMC replicates, we conducted 10 runs for each k value with a burn-in period of 12,500 iterations followed by 50,000 MCMC replicates to ensure convergence and consistency of results. We assumed the “admixture” model, which allows individuals to have mixed ancestry from multiple genetic clusters. We employed the “independent allele frequencies” model, which assumes that the allele frequencies within clusters are uncorrelated. After running Structure, the program’s output files were processed using Structure Harvester web server (Earl and vonHoldt 2012).