Data from: Genome divergence between European anchovy ecotypes fuelled by structural variants originating from trans-equatorial admixture

Meyer, Laura 1 ; Barry, Pierre 2 ; Le Moan, Alan 3 ; Arbiol, Christine 1 ; Castilho, Rita 4 ; Van der Lingen, Carl 5 ; Chlaïda, Malika 6 ; McKeown, Niall J. 7 ; Ernande, Bruno 8 9 10 11 12 ; Bonhomme, François 1 ; Gagnaire, Pierre-Alexandre 1 ; Guinand, Bruno 1

Published Sep 11, 2025 on Dryad. https://doi.org/10.5061/dryad.rxwdbrvnc

Data files

Sep 11, 2025 version files 1.46 GB

Anchovy_ecotypes_2025_Supp_Appendix.html (24.75 MB)
Eencr_RAD_dataset.vcf.gz (14.35 MB)
Eencr_WGS_dataset.vcf.gz (1.42 GB)
README.md (3.60 KB)
Sample_table.txt (48.81 KB)

Abstract

The European anchovy (Engraulis encrasicolus) is known to be subdivided into marine and coastal ecotypes, and their divergence shows patterns that are consistent with structural variants (SVs). Here, we present the first genome-scale study investigating genetic structure in the E. encrasicolus species complex. We generated a reference genome and produced whole-genome resequencing data for anchovies from the North-East Atlantic and Mediterranean Sea, as well as from South Africa. We complemented this approach with the analysis of RAD-seq data in order to study ecotypic structure across the entire distribution range. We found that genetic diversity is characterised not only by the presence of two genetic clusters, namely the marine and coastal ecotypes, but also by a third ancestry that corresponds to a southern Atlantic lineage. Genomic landscapes of differentiation showed evidence for large regions of high linkage disequilibrium, likely representing SVs that differentiate the three anchovy lineages.

Dataset DOI: 10.5061/dryad.rxwdbrvnc

Description of the data and file structure

The European anchovy (Engraulis encrasicolus) presents distinct marine and coastal forms, but the genetic basis behind this split has remained unclear. We used whole-genome and RAD sequencing to study populations across the species’ range.

Files and variables

File: Anchovy_ecotypes_2025_Supp_Appendix.html

Description: This report presents alignment results for the RAD and WGS data, PCAs conducted on different genomic regions, and the SVs that were genotyped. It contains the following tabs and variables:

1) Samples: The first row contains the sample table (also provided in ".txt" format), where background colours indicate ancestry proportions as RGB values (see Fig. 1 in the main text). Below the table, there is an interactive map with sample locations.

2) WGS mapping tab: Statistics from alignment with BWA, with 'Presentation' describing all stats in detail. Plots shown: Distribution, Percentage, Reads mapped, Reads mapped & paired, Reads unmapped, Reads properly paired, Average length, Average quality, Insert size average, Paired different chromosomes, Paired different chromosomes (%), Percent paired other orientation, Percentage properly paired reads, Insert size, Coverage plot, GC coverage, Insertion data, Deletion data.

3) RAD mapping tab: Statistics from alignment with BWA, with 'Presentation' describing all stats in detail. Plots shown: Distribution, Percentage, Reads mapped, Reads unmapped, Duplication rate, Average length, Average quality, Coverage plot, Insertion data, Deletion data.

4) WGS PCA tab: PCA performed on the WGS dataset, using markers from across the entire genome ('Genome-wide PCA') or markers from single chromosomes only (24 plots). Sample colours reflect their ancestry proportions as in the sample table.

5) RAD PCA tab: PCA performed on the RAD dataset, using markers from across the entire genome ('Genome-wide PCA') or markers from single chromosomes only (24 plots). Sample colours reflect their ancestry proportions as in the sample table.
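The ancestry-based colour coding used in the sample table and PCA tabs can be illustrated with a minimal sketch. It assumes three ancestry proportions that sum to 1 and maps each one linearly to one 8-bit colour channel. The function name `ancestry_to_hex` and the assignment of ancestries to channels are hypothetical and may differ from the scheme used in Fig. 1 of the main text.

```python
def ancestry_to_hex(p1, p2, p3):
    """Map three ancestry proportions to an RGB hex colour.

    Assumption: each proportion scales one channel linearly (0 -> 0, 1 -> 255).
    Which ancestry maps to which channel is not specified in this README.
    """
    props = (p1, p2, p3)
    if any(p < 0 or p > 1 for p in props):
        raise ValueError("proportions must lie in [0, 1]")
    if abs(sum(props) - 1.0) > 1e-6:
        raise ValueError("proportions must sum to 1")
    r, g, b = (round(p * 255) for p in props)
    return f"#{r:02x}{g:02x}{b:02x}"


# A sample with a single ancestry gets a pure channel colour.
print(ancestry_to_hex(1.0, 0.0, 0.0))
# An admixed sample gets an intermediate colour.
print(ancestry_to_hex(0.5, 0.5, 0.0))
```

Under this assumed mapping, samples with mixed ancestry appear as blends of the three base colours, which is what makes admixed individuals visually distinct in the plots.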

File: Sample_table.txt

Description: Sample table with the following columns: Sample, WGS, RAD, Basin, Latitude, Longitude, Location, Admixture, Habitat, Genetic class
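The sample table could be loaded along the following lines. This is a sketch only: it assumes the file is tab-delimited text with a header row matching the column names listed above, which this README does not state explicitly. The function name `read_sample_table` is hypothetical, and the demo rows are placeholders, not values from the dataset.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = [
    "Sample", "WGS", "RAD", "Basin", "Latitude", "Longitude",
    "Location", "Admixture", "Habitat", "Genetic class",
]


def read_sample_table(handle, delimiter="\t"):
    """Parse the sample table from an open text handle into a list of dicts.

    Assumption: delimited text with a header row; the tab delimiter is a
    guess and may need to be changed for the real file.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Placeholder rows for illustration only; these are NOT values from the dataset.
demo = io.StringIO(
    "\t".join(EXPECTED_COLUMNS) + "\n"
    + "\t".join(["S1", "1", "0", "BasinA", "0.0", "0.0", "SiteA", "x", "HabitatA", "ClassA"]) + "\n"
    + "\t".join(["S2", "0", "1", "BasinB", "0.0", "0.0", "SiteB", "x", "HabitatA", "ClassB"]) + "\n"
)
rows = read_sample_table(demo)
# Count samples per habitat category.
print(Counter(row["Habitat"] for row in rows))
```

With the real file, the same function would be called on `open("Sample_table.txt", newline="")` instead of the in-memory demo handle.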

File: Eencr_WGS_dataset.vcf.gz

Description: Whole-genome resequencing dataset, containing 5.9M SNPs in 39 individuals.

File: Eencr_RAD_dataset.vcf.gz

Description: RAD-sequencing dataset, containing 3906 SNPs in 385 individuals.
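As a quick sanity check, the sample and record counts of either VCF file can be compared with the figures given above. The sketch below uses only the Python standard library and assumes a standard VCF layout (a `#CHROM` header line with nine fixed columns followed by one column per sample). The function name `summarise_vcf` is hypothetical. It counts all variant records, so the result matches the SNP counts only if the files contain SNPs exclusively.

```python
import gzip
import sys


def summarise_vcf(path):
    """Return (number of samples, number of variant records) for a gzipped VCF."""
    n_samples = None
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Standard header: 9 fixed columns, then one column per sample.
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
                continue
            if line.strip():
                n_records += 1
    if n_samples is None:
        raise ValueError("no #CHROM header line found")
    return n_samples, n_records


if __name__ == "__main__":
    # Usage: python summarise_vcf.py Eencr_RAD_dataset.vcf.gz Eencr_WGS_dataset.vcf.gz
    for vcf_path in sys.argv[1:]:
        samples, records = summarise_vcf(vcf_path)
        print(f"{vcf_path}: {samples} samples, {records} records")
```

Streaming the file line by line keeps memory use low, which matters for the 1.42 GB WGS file. For anything beyond counting, a dedicated tool or library for VCF handling is likely a better choice.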

Code/software

The HTML report can be opened in any web browser.

Access information

Data was derived from the following sources:

Sequence reads for the RAD-sequencing and whole-genome resequencing data have been deposited in the NCBI Sequence Read Archive under BioProject ID PRJNA311981 (RAD samples) and BioProject ID PRJNA777424 (WGS samples).

Accessions of RAD samples: SAMN48731832 to SAMN48732074:

https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=311981

Accessions of WGS samples: SAMN48800104 to SAMN48800125, SAMN48746249 to SAMN48746268:

https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=777424
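For scripted downloads, the accession ranges above can be expanded into explicit lists. The sketch below assumes the ranges are contiguous and that every accession within them belongs to this dataset, which should be checked against the BioProject pages. The function name `expand_accessions` is hypothetical.

```python
def expand_accessions(first, last, prefix="SAMN"):
    """List every accession from first to last, inclusive."""
    if not (first.startswith(prefix) and last.startswith(prefix)):
        raise ValueError(f"accessions must start with {prefix!r}")
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    if end < start:
        raise ValueError("last accession precedes first")
    width = len(first) - len(prefix)  # preserve zero-padding, if any
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


# Ranges as listed in this section.
rad = expand_accessions("SAMN48731832", "SAMN48732074")
wgs = (expand_accessions("SAMN48800104", "SAMN48800125")
       + expand_accessions("SAMN48746249", "SAMN48746268"))
print(len(rad), len(wgs))
```

The resulting lists can be written to a text file, one accession per line, and passed to whichever download tool is preferred.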