Data from: The genomic architecture of local adaptation in two connected populations of three-spined stickleback
Data files
Apr 06, 2026 version files 12.74 GB
- README.md (4.10 KB)
- stickleback_StLawrence_SV.vcf.gz (337.01 MB)
- sticklebacks_vcf.zip (12.41 GB)
Abstract
Populations often harbor extensive genetic variation shaped by both selection and connectivity, yet the genomic basis of this variation remains incompletely understood. We generated two complementary datasets in three-spined sticklebacks (Gasterosteus aculeatus) to characterize single-nucleotide polymorphisms (SNPs) and structural variants (SVs) across the genome. For SNP discovery, we used Illumina whole-genome sequencing data aligned to the most recent reference genome, including the Y chromosome for males, while excluding the pseudo-autosomal region to minimize alignment errors. Reads were processed to remove duplicates, clipped for overlaps, and locally realigned around indels, achieving an average coverage of 11.2×. After standardizing coverage and removing related individuals, we called SNPs chromosome by chromosome and filtered for biallelic sites with a minor allele frequency > 0.05, coverage between 4× and 35×, and a genotyping success rate exceeding 50%.
To complement this, we generated a structural variant dataset by combining long-read (Nanopore) and short-read (Illumina) sequencing. Nanopore reads >1 kb were mapped and filtered, and SVs were called independently with multiple algorithms for each dataset to maximize confidence. Only SVs detected by at least two callers per dataset were retained, and long- and short-read datasets were then merged. Structural variants were genotyped across all samples using a genome graph approach, with insertions primarily resolved from long-read data and inversions largely from short-read data.
Together, these datasets provide a high-resolution view of genomic variation, capturing both fine-scale SNPs and larger structural variants. They enable the study of recombination landscapes, patterns of differentiation, and the potential role of structural variation in local adaptation across connected populations of three-spined sticklebacks.
Dataset DOI: 10.5061/dryad.cc2fqz6md
Description of the data and file structure
We have submitted two Variant Call Format (VCF) files that constitute the primary genomic datasets used in our paper "The Genomic Architecture of Local Adaptation in Two Connected Populations of Three-Spined Stickleback."
Both files can be found in the sticklebacks_vcf.zip archive.
The first file, Sticklebacks_StLawrence_SNPs.vcf.gz, contains single-nucleotide polymorphisms identified from Illumina whole-genome sequencing data. Reads were aligned to the most recent stickleback reference genome, processed to standardize coverage and remove related individuals, and filtered to retain high-quality biallelic SNPs. This SNP dataset was used for analyses of population structure, recombination landscapes, and patterns of genomic differentiation.
The second file, stickleback_StLawrence_SV.vcf.gz, contains polymorphic structural variants, including insertions, deletions, inversions, and duplications, identified by combining long-read (Nanopore) and short-read (Illumina) whole-genome sequencing data. Structural variants were called using multiple algorithms for each sequencing technology, filtered to retain high-confidence variants detected by at least two callers per dataset, and merged into a single comprehensive dataset. Structural variants were subsequently genotyped across all samples using a genome graph-based approach. This dataset was used to examine the relationship between structural variation, recombination suppression, and adaptive genomic regions.
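As a quick sanity check after download, the sample names in each compressed VCF can be read without unpacking the archive. The following is a minimal sketch using only the Python standard library. The function name `vcf_samples_in_zip` is hypothetical, and the internal layout of sticklebacks_vcf.zip is not documented here, so the code simply scans for any member ending in `.vcf.gz`.

```python
import gzip
import zipfile


def vcf_samples_in_zip(zip_source):
    """Map each .vcf.gz member of a zip archive to its list of sample names.

    zip_source may be a path or a binary file-like object.
    """
    samples = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.endswith(".vcf.gz"):
                continue
            # Stream the member so a multi-gigabyte file is never fully loaded.
            with archive.open(member) as raw, gzip.open(raw, "rt") as vcf:
                for line in vcf:
                    if line.startswith("#CHROM"):
                        # The first nine columns are fixed VCF fields;
                        # sample names follow.
                        samples[member] = line.rstrip("\n").split("\t")[9:]
                        break
                    if not line.startswith("#"):
                        break  # reached records without a column header
    return samples
```

Calling `vcf_samples_in_zip("sticklebacks_vcf.zip")` would return a dictionary keyed by member path, which makes it easy to confirm that the SNP and SV files cover the expected individuals.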
Files and variables
Sticklebacks_StLawrence_SNPs.vcf.gz
Description:
This file contains high-quality single-nucleotide polymorphisms (SNPs) identified from Illumina whole-genome sequencing data of three-spined sticklebacks (Gasterosteus aculeatus) from the St. Lawrence Estuary. SNPs were called after read alignment to the reference genome, coverage standardization, removal of related individuals, and stringent variant filtering. This dataset was used for analyses of population structure, recombination landscapes, and genomic differentiation.
File format:
Compressed VCF (VCF v4.x, gzip-compressed)
Filtering criteria:
- Biallelic SNPs only
- Minor allele frequency (MAF) > 0.05
- Total coverage between 4× and 35×
- Genotyping success rate > 50%
- Sex chromosomes and unplaced contigs excluded
Missing values:
- Missing genotypes are encoded as ./. following VCF conventions
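To make the filtering criteria and the missing-genotype encoding concrete, here is a minimal Python sketch that recomputes the genotyping rate and minor allele frequency for a single site from its GT strings. It is an illustration, not the pipeline used to build the dataset: the function names `site_stats` and `passes_filters` are hypothetical, diploid-style GT strings are assumed, and the coverage criterion is left out because this README does not specify how depth was measured.

```python
BASES = {"A", "C", "G", "T"}


def site_stats(genotypes):
    """Return (call_rate, maf) for one biallelic site.

    genotypes: list of GT strings such as "0/1", "1|1", or "./." (missing).
    """
    called = 0
    alt_alleles = 0
    total_alleles = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing or partially missing genotype
        called += 1
        total_alleles += len(alleles)
        alt_alleles += sum(1 for allele in alleles if allele != "0")
    call_rate = called / len(genotypes) if genotypes else 0.0
    if total_alleles == 0:
        return call_rate, 0.0
    alt_freq = alt_alleles / total_alleles
    return call_rate, min(alt_freq, 1.0 - alt_freq)


def passes_filters(ref, alt, genotypes, min_maf=0.05, min_call_rate=0.5):
    """Check the biallelic-SNP, MAF, and genotyping-rate criteria."""
    is_biallelic_snp = ref in BASES and alt in BASES and ref != alt
    call_rate, maf = site_stats(genotypes)
    return is_biallelic_snp and maf > min_maf and call_rate > min_call_rate
```

For example, `passes_filters("A", "T", ["0/1", "0/0", "./.", "1/1"])` evaluates a site where three of four individuals are genotyped; the thresholds default to the values listed above and can be overridden.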
stickleback_StLawrence_SV.vcf.gz
Description:
This file contains polymorphic structural variants (SVs) identified by combining long-read (Nanopore) and short-read (Illumina) whole-genome sequencing data. Structural variants include insertions, inversions, duplications, and deletions. SVs were detected using multiple callers for each sequencing technology, filtered to retain variants detected by at least two callers per dataset, and merged into a single high-confidence dataset. Variants were genotyped across all samples using a genome graph–based approach.
File format:
Compressed VCF (VCF v4.x, gzip-compressed)
Variables (VCF fields):
- CHROM: Chromosome identifier
- POS: Genomic start position (bp)
- ID: Structural variant identifier
- REF: Reference allele
- ALT: Alternate allele (SV representation)
- QUAL: Variant quality score
- FILTER: Filter status
- INFO: Structural variant annotations
  - SVTYPE: Type of structural variant (INS, DEL, INV, DUP)
  - SVLEN: Length of the structural variant (bp)
  - END: End position of the variant
- FORMAT: Genotype field descriptors
  - GT: Genotype call for each individual
  - DP: Read depth
  - GQ: Genotype quality
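The fields above can be pulled out of a single record with plain string handling. The sketch below is a minimal illustration in Python: `parse_sv_record` is a hypothetical helper, it assumes tab-separated records with single-valued SVLEN and END entries, and it reads the per-sample keys from the FORMAT column rather than assuming a fixed order.

```python
def parse_sv_record(line, sample_names):
    """Parse one tab-separated SV record into a dictionary.

    sample_names: sample columns in the order given on the #CHROM header line.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, sv_id, ref, alt, qual, flt, info_str, fmt = fields[:9]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for entry in info_str.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True

    # Each sample column is colon-separated, in the order given by FORMAT.
    format_keys = fmt.split(":")
    genotypes = {
        name: dict(zip(format_keys, sample.split(":")))
        for name, sample in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": sv_id,
        "svtype": info.get("SVTYPE"),
        "svlen": int(info["SVLEN"]) if "SVLEN" in info else None,
        "end": int(info["END"]) if "END" in info else None,
        "genotypes": genotypes,
    }
```

Per-sample values are kept as strings (for example `"./."` for a missing GT), so callers can decide how to treat missing depth or quality values.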
Code/software
BCFtools: Command-line tool for working with VCF/BCF files, including filtering, viewing, and converting.
VCFtools: Command-line program for filtering and summarizing VCF files.
