Data from: Inversions dominate evolution in the European Sardine (Sardina pilchardus) amid strong gene flow
Data files
Jul 29, 2025 version files 8.84 GB
-
README.md
6.03 KB
-
sard_ncbi_all_Qual30_biSnps_STB_HAPs_IND.vcf.gz
7.36 GB
-
sard_NOZ_VCF_DP_MAF.05_NMM.allfreq_pwc.gz
1.16 GB
-
sardine_AEG_A_polarizedAF.fq.gz
145.25 MB
-
sardine_hard_filtered.sync.gz
5.49 MB
-
sardine_just_outliers.sync.gz
6.65 MB
-
sardine_SMO_J_polarizedAF.fq.gz
145.27 MB
-
sardine_soft_filtered.sync.gz
8.24 MB
Abstract
This dataset contains genomic resources associated with a study of chromosomal inversions and population structure in the European sardine (Sardina pilchardus). It includes whole-genome resequencing data, variant call files (VCFs), inferred genotypes for structural variants, and associated metadata across multiple populations spanning the species’ range. These data were used to identify large chromosomal inversions and assess their role in maintaining genetic differentiation amid high gene flow. The resources support future analyses of structural variation, local adaptation, and demographic history in marine fishes.
Dryad: https://doi.org/10.5061/dryad.vmcvdnd3f
Name: Stephen J. Sabatino
Institution: CIBIO - BIOPOLIS
Email: sabatino@gmail.com
Name: Ana Veríssimo
Institution: CIBIO - BIOPOLIS
Email: averissimo@cibio.up.pt
Dataset Overview
This dataset contains the data and code required to replicate analyses in Sabatino et al., using whole-genome sequencing to investigate the influence of inversions on the population genomics of the r-selected European sardine (Sardina pilchardus). Samples include genomic DNA extracted from 25 individuals (pooled males and females) from each of 33 locations in the Northeastern Atlantic and Mediterranean Sea covering most of the species distribution range. Genetic data include the following datasets of single nucleotide polymorphisms (SNPs):
Dates of Data Collection
A. Fish tissue collections: 2019 - 2021
B. Genetic sequencing data: 2021-2022
C. Genetic analysis: 2022 -2025
Data Spatial Scope
European sardines (n = 1,027) were collected from 33 locations across the species' range, including 26 locations in the northeastern Atlantic from the Bristol and English Channels southward to southern Morocco, and 8 locations in the Mediterranean Sea from the Aegean Sea to Mediterranean Morocco.
Funding
Funding was obtained by the Project SARDINOMICS: Development of molecular tools to improve knowledge and management of the European sardine stocks/Desenvolvimento de ferramentas moleculares para o melhoramento do conhecimento e da gestão dos stocks de Sardinha (PO Mar2020, MAR-01.04.02-FEAMP-0024) financed by the European Maritime and Fisheries Fund (EMFF), through the MAR2020 Operational Programme.
Ethics Approval
Tissue collections were obtained during annual research surveys conducted in the North and Central Atlantic using small pelagic trawls, except for samples from the Aegean, Ionian, and Adriatic Seas, which were obtained from commercial fisheries landings of local fishermen.
Sharing/Access information
This work is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.
All scripts and code here are copyright under a GNU GPLv3 license.
Datasets
The following datasets are included in the repository:
- Filtered VCF file for bi-allelic SNP loci
- File:
sard_ncbi_all_Qual30_biSnps_STB_HAPs_IND.vcf.gz
- This filtered VCF file contains information on the bi-allelic SNP loci in European sardine used for this study.
- File expands to ~80Gb
- File:
- PWC file for SNP allele frequency differences
- File:
sard_NOZ_VCF_DP_MAF.05_NMM.allfreq_pwc.gz
This PWC file, generated using Popoolations2, was used to identify SNPs with exceptionally high allele frequency differences between populations within and outside putative inversions. - File expands to ~4.6Gb
- File:
- Polarized allele frequency data
- Files:
sardine_SMO_J_polarizedAF.fq.gz
sardine_AEG_A_polarizedAF.fq.gz
These two .fq data files contain polarized allele frequencies for two populations with high frequencies of inversion alleles.
- Files:
- SYNC files for pairwise FST calculations
- Files:
sardine_hard_filtered.sync.gz
sardine_just_outliers.sync.gz
sardine_soft_filtered.sync.gz
These SYNC files were used to study population differentiation in the European sardine through pairwise FST values.
- Files:
Supplemental Tables
The following supplemental tables are described in the main manuscript. Briefly, the “soft-filtered” dataset include all SNPs identified with FreeBayes except those filtered after running PCAdapt on Atlantic and Mediterranean samples, separately, followed by Bayscan; the "hard-filtered" dataset including all SNPs in b) except those identified in regions of elevated FST (several-fold higher) relative to the rest of the genome; and the "outliers-only" including SNPs identified in outlier regions using PCAdapt (as in a) ).
Table_S1_FST.xlsx
FST values are color-coded, ranging from yellow (lower values) to blue (higher values). Sample labels follow the abbreviationslisted in Table 1 of the main manuscript. This table provides pairwise FST values for the three SNP datasets: (Generated using:sardine_hard_filtered.sync.g, sardine_just_outliers.sync.gz, sardine_soft_filtered.sync.gz
)Table_S4_GO_Terms.xlsx
This table presents the results of the Gene Ontology (GO) enrichment analysis, based on the European sardine reference genome annotation from NCBI. The analysis was conducted using topGO v2.38.1. Results are organized into the following categories:- Each of the nine inversions identified in European sardines.
- All nine inversions combined.
- Outlier regions identified using PCAdapt.
Custom Scripts
The following seven scripts are provided in the compressed directory: Code_Methods.zip.
Additional information on the code and its functions can be found in the script annotations.
1. flt_vcf_IND.py
- This script processes a VCF file of SNPs and another VCF file of INDELs, filtering the SNP VCF for any SNPs flanking INDELs. Unlike typical INDEL filters, it accounts for the length of the INDEL as estimated by FreeBayes.
2. flt_sync_ALT.py
- Filters non-biallelic SNPs from a SYNC file.
3. flt_sync_DP_MAF.py
- Filters a SYNC file based on read depth and minimum minor allele frequency (MAF).
4. sync_to_Pcadapt.py
- Converts a SYNC file into the input format required for PCAdapt. This script uses a sliding window approach, defining windows based on a specified number of SNPs and step size.
5. polarize_sync_AF.py
- This script and 6 and 7 below were used for estimating allele frequencies for putative inversions. These three scripts are part of a short pipeline named pipeline.sh
.
6. calc_INV_AF.py
7. map_INV_AF.r
2.1 | Sampling, DNA Extraction, Library Preparation and Sequencing
European sardines were collected from across the Atlantic and Mediterranean species' range between 2019 and 2021 (Table 1). Samples were obtained during annual research surveys conducted in the eastern North and Central Atlantic using small pelagic trawls, except for samples from the Aegean, Ionian, and Adriatic Seas obtained from local fishermen. Wherever possible, two samples were obtained per location: one sample of 25 juveniles and another of 25 adult individuals. Given the variability in size at maturity throughout the species ́range and the marked differences between Atlantic and Mediterranean sardines (e.g., Dimarchopoulou and Tsikliras 2022; Silva et al. 2006), individuals were categorised as juveniles or adults using location-specific cutoff values of size at 50% maturity (L50): 18cm for Atlantic samples, and 16cm for Mediterranean samples. The final sample set comprised 13 juvenile and 21 adult samples (Table 1).
All fish were immediately frozen upon collection and transported to the laboratory. Upon thawing, muscle tissue for DNA extraction was collected from each individual and stored in 96% ethanol at −20°C until DNA extraction. Genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen), following the manufacturer's instructions, including the RNase A treatment. The DNA quality was assessed by agarose gel electrophoresis and spectrophotometry (Nanodrop), and concentrations were measured using the Quant-iT Picogreen Kit (Invitrogen). For each sample collection (n = 34; Table 1), the DNA from each of the 25 individuals per sample was pooled in equal amounts and quantified. Pool-seq library construction and whole-genome shotgun sequencing (NovaSeq6000, Illumina) were performed by Macrogen (Seoul, Republic of Korea), generating 2×150bp paired-end reads from 350-bp insert libraries (TrueSeq DNA Nano, Illumina) with a minimum coverage of 60× based on a genome size of 869.4 Mb. Samples were sequenced in two runs (one lane per run); one run included samples from 2019 and 2020, and a second run included samples from 2021.
2.2 | Quality Control of Raw Reads, Mapping, and Filtering
The quality of raw reads for each pool was evaluated using FastQC v0.11.8 (Andrews 2010) and summarised with MultiQC v1.12 (Ewels et al. 2016). Reads were trimmed using Trimmomatic v0.39 (Bolger et al. 2014) with adapter removal and stringent quality filtering (leading/trailing base quality ≥30, sliding-window trimming with a 5-base window and minimum average quality of 20, minimum read length of 100bp, and average read quality ≥30), and the resulting filtered reads reassessed with FastQC v0.11.8 and MultiQC v1.12. Quality-trimmed reads were mapped to haplotype one of the phased S. pilchardus reference genome (GCA_963854175.1) using BWA MEM v0.7.17-r1188 (Li 2013). Alignments were sorted using SAMtools v1.14 (Li et al. 2009), and read groups were added with Picard v2.212 AddOrReplaceReadGroups (Broad Institute, 2019). PCR duplicates were identified and removed with Picard v2.21.2 MarkDuplicates. The read mapping quality for each pool was obtained with QualiMap v2.2.1 (Okonechnikov et al. 2016).
2.3 | SNP Calling and Allele Frequency Estimates
To identify SNP loci, the bam files for each poolseq data set were analysed with FreeBayes v0.9.21 (Garrison and Marth 2012). The FreeBayes parameters used included: p 50 (number of chromosomes per pool), pooled-discrete, min-coverage 10, min-repeat-entropy 1, min-mapping-quality 60, min-base-quality 20, use-mapping-quality and max-coverage 100 per pool. The FreeBayes analysis was run using all 34 pools simultaneously to maximize the detection of rare alleles, utilizing the use-best-n-alleles 2 approach to reduce computational costs. Loci with more than two alleles were filtered using custom Python scripts. The resulting VCF file was filtered for loci with quality scores below 30, minor allele frequency (MAF)<0.05, or those showing strand bias (‘SAF>0 & SAR>0 & RPR>1 & RPL>1’) using bcftools (Danecek et al. 2021). SNPs in haplotypes were then left-aligned using the norm function. Loci within 10 base pairs of insertions or deletions identified in the FreeBayes analysis were excluded using custom scripts. A sync file of allele frequencies was generated with Popoolation2 (Kofler et al. 2011) using default parameters for diploid organisms (50 chromosomes for n = 25 individuals per pool), retaining only high-quality SNP loci in the final, filtered, VCF file.
2.4 | Genome Scans, Population Structure, and Genetic Diversity (FST and HE)
A key objective was to estimate population structure and genetic diversity in European sardines using neutral, unlinked loci. Initially, we used PCAdapt (Luu et al. 2017) and BayPass2.1 (Gautier 2015) to detect and filter non-neutral SNPs and SNPs linked to non-neutral variants. PCAdapt identifies outlier loci by performing principal component analysis (PCA) on genotype data to detect genetic variation associated with population structure. Loci are marked as outliers if their allele frequencies deviate significantly from neutral expectations. Selection often affects SNPs linked to targeted sites through genetic hitchhiking (Lewontin and Krakauer 1973). We therefore conducted PCAdapt on genomic regions rather than individual SNPs, using a sliding-window approach with 20 consecutive SNP loci and an (overlapping) 10 SNP step size. The most likely population cluster number (k) was chosen based on the lowest value that explained the most variance before plateauing. For the full 34 pools, PCAdapt identified k=2 (Figure S1), splitting Atlantic and Mediterranean populations. However, when working with many samples (n = 34), k = 2 can be biased towards PC1 (Janes et al. 2017). Thus, we performed separate PCAdapt runs for the 26 Atlantic and 8 Mediterranean samples. The best k value was
2 for the Atlantic samples, while k=3 was used for those from the Mediterranean after determining that k = 2 and k = 4 yielded similar results. Significant outlier windows (FDR-corrected p value <0.05) from all three PCAdapt runs were then pruned for redundancy and merged if within 50kb. We chose to merge windows using a relatively large gap threshold of 50kb based on several considerations: (1) our outlier detection methods,
like all such approaches, may miss or fragment true signals; (2) many detected outliers clustered within linked regions, such as chromosomal inversions; and (3) given our broader data set of hundreds of thousands of SNPs, we aimed to derive biologically meaningful and unbiased estimates of population structure and gene flow by accounting for outliers potentially distributed across linked genomic regions. All SNP loci in the set of merged outlier windows were filtered.
Next, we used BayPass to identify individual outlier SNPs from the PCAdapt-filtered data set. BayPass measures covariance in allele frequencies across populations and simulates statistical cutoffs to detect those that deviate from neutrality. In this case and throughout our study, BayPass was always run using a subset of ~50k SNPs randomly chosen from across the genome to ensure each is at least 10 kilobases apart. BayPass was executed with default parameters, including the recommended initial read count (d0yij) set to 10 (1/5 the minimum pool size in chromosomes). The test statistic used in BayPass is xTx, an analogue of FST that uses the covariance in allele frequencies among populations to help mitigate the confounding effects of hierarchical population structure when detecting outliers (Günther and Coop 2013). A null distribution of xTx values based on neutral expectations was simulated using the empirical data. The simulated xTx values were then compared to those based on the empirical data, and loci exceeding the top 99th percentile were filtered from the data set (hereafter referred to as the ‘soft-filtered’ data set). The subset of SNPs within outlier regions identified by PCAdapt was filtered for linkage as described above and used for downstream analysis (hereafter referred to as the ‘outlier-only’ data set).
Estimates of pairwise FST and expected heterozygosity (HE) were calculated using the soft-filtered and outlier-only data sets with computing.fstats in the Poolfstat R package (Gautier et al. 2022). The HE estimates are correlated to the true HE, but given they are not based on allele counts, they should be interpreted in a relative sense. Genetic relationships among the 34 populations were visualised using heatmaps and dendrograms. Block-jackknife estimation of the 95% confidence intervals for pairwise FST estimates was conducted assuming 1 SNP per 1000 bp block, as implemented in poolfstat.
An initial comparison of pairwise FST using the soft-filtered and outlier-only data sets showed differences in magnitude but strikingly similar spatial patterns (Figure S2; Table S1). In addition, elevated FST values in these two data sets varied nonrandomly among genomic regions (i.e., were clumped; Figures S3 and S4; see methods below) and overlapped with known inversions (Table S2), suggesting that our soft-filtered data set was not free from bias. To address this, we tested a second filtering approach, aiming to hard mask regions showing evidence of being outliers. This was achieved by analysing population pairs (see Figure S5) from the soft-filtered data set that exhibited atypically high FST overall, especially between neighbouring populations (Table S1), and had highly heterogeneous FST across entire chromosomes. Based on these criteria, of all the pairwise FST comparisons made for our 34 samples, four stood out among the rest and were further studied (CAN_A/NSB_J; CAD1_A/TOR_A; CAN_A/ SMO_J; ION_A/NSB_J; Figure S5). For each of the four, we generated Manhattan plots of pairwise FST estimates for 20-SNP windows with a 10-SNP step size using Poolfstat v.2.2 (Gautier et al. 2022). For comparative purposes, FST-Manhattan plots were also generated for an expanded set of pairs of populations, including all adult and juvenile samples from the same location using the same approach (Figures S3 and S4). We then visually inspected the Manhattan plots for each chromosome to identify regions of elevated FST (several-fold higher) relative to the rest of the genome that were continuous and at least 500kb long (Figures S4 and S5). The resulting outlier regions were then fully masked from the SNP dataset. As with the ‘soft-filtered’ data set, the resulting SNP data set was pruned for linkage to obtain a set of 50 k SNPs, which were then filtered using BayPass. Like in the pairwise FST filter, the Manhattan plot of xTx statistics from the BayPass analysis was also visually inspected for regions of linked outlier loci, and the one identified (Figure S6) was similarly masked. The resulting set of SNPs (hereafter referred to as the ‘hard-filtered’ data set) was then used to measure pairwise FST and HE using Poolfstat as above. Differences in HE between adult and juvenile population pairs were then tested using a paired t-test.
Hierarchical F-statistics were estimated assuming distinct scenarios of population structure using read count data as implemented in Poolfsat v3 (Gautier et al. 2024), with 95% confidence intervals calculated using the block-jackknife method described above. The groups tested were (1) Atlantic and Mediterranean populations with and without including the Canary Islands as part of the Mediterranean, (2) the genetic clusters found using the outlier-only SNP data set, and (3) environmental-phenotypic groupings of sardines outlined earlier (i.e., British Isles—Biscay Bay—Cantabria, W. Iberia—Morocco, S. Morocco, and the Mediterranean; see details in Introduction). This analysis was conducted separately on the hard-filtered and outliers-only data sets to enable a comparative assessment.
2.5 | Isolation by Distance
To investigate spatial patterns of genetic differentiation and the relevance of lower (<0.10) FST estimates (Palumbi 2003), we tested for isolation-by-distance (Yang 2004) using pairwise FST estimates and least-cost distances among sample collections. Least-cost distances were constrained by the upper 1200m of water depth, which separates the Canary Islands from the mainland, and were estimated using the Marmap v1.0.10 package (Pante and Simon-Bouhet 2013). We used a maximum likelihood population effects model gls function in the package nlme and using the corMLPE v1 package (Clarke et al. 2002) to directly model and account for nonindependence of residuals due to the pairwise relations (Jha 2015). Confidence intervals (95%) for model predictions were obtained through 1000 replicate simulations of normal distributions to account for autocorrelation in the data. The above computations were run in R 4.42 (R Core Team 2021).
2.6 | Identification and Population Genetic Analysis of Inversions
Five large inversions were identified during the sardine reference genome assembly based on a single individual, suggesting that additional inversions might be segregating within the species. This hypothesis was supported by our outlier analysis using PCAdapt, which revealed several large outlier regions spanning nearly whole chromosomes that did not overlap the inversions identified in the reference genome. To identify candidate inversions, we again visually inspected Manhattan plots of pairwise FST values from four population pairs in our genetic analysis (see above). Candidate inversions were defined as genomic regions larger than 500kb with continuously elevated FST several-fold above the genome-wide average. This analysis identified five additional candidate inversions (Figure S5; Table S2), which were subsequently examined in detail.
To help verify the existence of candidate inversions, we sequenced the genome of an additional individual using Nanopore long-read technology, which is well-suited for detecting structural variants. The specimen sequenced was collected by local fishermen off Porto, Portugal, in 2018, euthanised, flash-frozen in liquid nitrogen and stored at −80°C. DNA was extracted from muscle tissue using a phenol-chloroform protocol to minimise shearing. The extracted DNA was cleaned using magnetic beads and rinsed with ethanol and a cleaning buffer. Sequencing was performed using the SQ-KLSK109 ligration sequencing kit and run on the Oxford Nanopore MinION (FLO-MIN106 flow cell), following the manufacturer's instructions, across six runs to maximise yield. Adapter sequences were removed using Porechop (https://github.com/rrwick/Porechop), and the reads were mapped to the sardine reference genome using minimap2 (Li 2018) with the ‘map-ont’ option. Structural variants were analysed using npInv (Shao et al. 2018), which detects inversions based on main alignments and subalignments. The npInv analysis was run using default parameters, except the minimum inversion size was set to 100kb to reduce false positives. Only inversions flagged as ‘PASS’ in the VCF file and supported by more than one read were retained, with overlapping intervals merged into a final set of candidate inversions.
The npInv analysis identified 57 candidate inversions spanning 482,327,974 bases. Although the total length of these predicted inversions is substantial, it is consistent with findings from the NCBI reference genome. This consistency may reflect high heterozygosity for inversions in the Atlantic populations from which the reference genome and the individual sequenced here were sampled (see Results). However, identifying the breakpoints of inversions using npInv with an alignment of moderate coverage (23×) is not expected to be highly accurate. Consequently, we used these results primarily to assess the five candidate regions identified through FST analysis rather than to discover novel inversions. Each of the five candidate regions (FST) was supported by overlapping inversions in the npInv results (Table S2), with an average overlap of 71.7%. Although only four of the five inversions had overlaps of 80% or higher, and additional research is needed to confirm their existence and true breakpoints, we refer to them as inversions and studied them along with those identified in the NCBI reference genome.
We estimated allele frequencies for nine candidate inversions. First, differences in allele frequencies (ΔAF) were calculated for all loci in the SNP data set before outlier filtering (~752,000 SNPs) using Popoolations2 (Kofler et al. 2011). For each of the nine inversions analysed, SNPs in the top 25th percentile of ΔAF were considered diagnostic of divergent inversion haplotypes. Allele frequencies were polarised across all populations to associate the reference allele with putative alternative haplotypes using custom Python scripts. Pie charts visualising inversion haplotype frequencies were generated using custom R scripts. Linear models were then used to test correlations between allele frequencies and latitude in the Atlantic and longitude in the Mediterranean. For comparison, a similar number of putatively neutral SNPs from the hard-filtered data set were also analysed with a linear model, and an ANCOVA was used to test for differences between the slopes (rate of Δ in allele frequency with distance) based on inversion versus neutral loci. To avoid biasing the analysis, neutral loci were chosen using the same approach to identify loci diagnostic of inversion haplotypes, including sampling from those with the highest ΔAF across space. We performed 100 bootstrap replicates of ANCOVA per inversion to estimate uncertainty by resampling SNPs from neutral regions outside large inversions. Differences in clinal slopes were considered significant if p was less than 0.05 more than 95% of the time.
2.7 | Demographic Inference and Genetic Diversity of Inversions
We used Diffusion Approximations for Demographic Inference (dadi; Gutenkunst et al. 2009) to explore the evolutionary significance of the identified inversions. Specifically, we aimed to investigate the sampled populations' overall demographic history and test the hypothesis that gene flow and effective population size are reduced in inverted genomic regions compared to the collinear areas. For this analysis, we selected 10 representative population pairs to explore the following contrasts: (1) neighbouring populations with highly divergent inversions (e.g., Southern and Central Morocco); (2) populations from different oceanographic basins (Atlantic vs. Mediterranean); (3) populations within the same basin; and (4) adult versus juvenile populations from the Aegean Sea. The dadi pipeline was run twice for each pair, once using SNPs from collinear regions and again from inverted areas. Because we were interested in studying the nine large inversions found here, we used the 1.7 M SNP data set that was not filtered for outlier regions. Most outliers were associated with large inversions. Nevertheless, this data set likely violates the neutrality assumptions required for accurate demographic parameter estimation and linkage for inverted regions. We therefore interpret this analysis cautiously, using it primarily to explore the evolutionary dynamics and potential effects of inversions.
To prepare the dadi input file, we used Samtools v1.11 to generate mpileup files for each population, including only the SNPs from the 1.7 M SNP data set described above. These mpileup files were then processed with SNAPE-pooled (Raineri et al. 2012) to calculate the folded site frequency spectrum (SFS), using the following parameters: the number of chromosomes set to twice the number of individuals per pool (i.e., 50), θ=0.001, and a minimum allele frequency threshold of 0.1. Custom Python scripts were used to reformat the data for compatibility with dadi. Data sets were stratified by genomic context (collinear vs. inverted) and pruned for background linkage disequilibrium by requiring a minimum distance of 10kb between SNPs. All demographic analyses were performed using the folded SFS, and projections were made from 25 to 15 chromosomes to account for missing data.
Next, we fit eight two-population demographic models to the joint allele frequency spectrum (2D-AFS), including: no divergence (no_divergence), isolation without migration (no_mig), symmetric and asymmetric migration (sym_mig, asym_mig), and secondary contact models (anc_sym_mig, anc_asym_mig, sec_contact_sym_mig, sec_contact_asym_mig). Model functions were implemented in a custom module and optimised using the strategies from the dadi_pipeline v3.1.5 (Portik et al. 2017), which included four rounds of parameter perturbation and optimisation to reduce the risk of convergence on local optima. Confidence intervals for parameter estimates were computed using the Godambe Information Matrix (GIM), with finite-difference step sizes of 0.1, 0.01, and 0.001. Confidence intervals were generally stable across step sizes, supporting the robustness of the parameter estimates.
Finally, to test whether inversions harbour elevated genetic diversity, consistent with balancing selection (Mérot et al. 2020; Berdan et al. 2023), we estimated Tajima's π across all 34 populations using the Variance-sliding.pl script from PoPoolation2, applying a 100kb sliding window with a 50kb step. We then compared π between inversion and collinear regions within each population using Mann–Whitney U tests, implemented via custom Python scripts using the stats package.