Scripts from: Ancient stickleback genomes reveal the early stages of parallel adaptation

Laine, Jan1 ; Nickel, Jana2; Romundset, Anders3; Foote, Andrew 2

Published Oct 28, 2025 on Dryad. https://doi.org/10.5061/dryad.547d7wmmc

Data files

Oct 28, 2025 version files 71.79 KB

Laine_et_al_2025_UNIX_Code_Mapping_and_filtering_NGS_data_and_creating_haplofile_covmatrix_VCF_and_Fst.txt

7.54 KB
Laine_et_al._2025_Ancient_Stickleback_Bones_R-code.R

60.67 KB
README.md

3.58 KB

Abstract

The parallel evolution of traits and their underlying genetic basis is well studied; however, studies of the parallel chronology of adaptive genetic changes remain scarce. The probability of parallel genetic change should be increased by the clustering of adaptive alleles in regions of suppressed recombination, particularly for genes that have large fitness or phenotypic effects. The threespine stickleback is a model system for studying parallel evolution, here we present genomic data from nine subfossil stickleback bones dated to 14.8-0.7 KYR BP in age. Comparing the four highest coverage genomes, which represent different stages along the marine–freshwater divergence continuum, we find that the accumulation of freshwater ancestry is clustered rather than randomly distributed throughout the marine–freshwater divergent regions of the genome. We consistently find freshwater ancestry on chromosome IV at the early stages of freshwater adaptation. Regions of chromosome IV contain the greatest genetic differentiation between marine and freshwater ecotypes and among the highest density of quantitative trait loci. These include Ectodysplasin (EDA), a large-effect pleiotropic locus associated with defensive armor and variation in neurosensory and behavioral traits. Freshwater ancestry in the subfossils is also consistently found at inversions and X chromosome early in the adaptive process. Our findings add to emerging evidence that freshwater adaptation in threespine stickleback could have a staggered but predictable temporal dynamic.

Dataset DOI: 10.5061/dryad.547d7wmmc

Description of the data and file structure

Scripts for mapping ancient DNA sequence data and for generating plots:

Laine_et_al_2025_UNIX_Code_Mapping_and_filtering_NGS_data_and_creating_haplofile_covmatrix_VCF_and_Fst.txt
Laine_et_al._2025_Ancient_Stickleback_Bones_R-code.R

Code/software

BWA, Samtools, R

The data processing, analysis and visualisation of this study was done by using Bash and R programming languages (R version 4.4.1 (2024-06-14 ucrt)). Mapping and filtering of the next-generation sequencing data, variant calling, Fst etimations, haploid genotype calls and covariance matrix generation was done using suitable tools and programs in Bash as specified in the manusript and in the script provided as separate file. Shortly, the following software is required: AdapterRemoval ≥ v2, BWA (supports aln/sampe/samse), samtools ≥ 1.10, bedtools ≥ 2.29, ANGSD, bcftools ≥ 1.10, vcftools ≥ 0.1.16. Further data analysis and visualisation was done in R with methods similarly as specified in the manuscript and in the R script provided as a separate file.

The associated bash script only requires the raw sequencing data as the input for replication. Input files used in the R script analysis and data visualisation are specified below:

Files that contain metadata of the samples and are published as part of this manuscript:

1. "aBones_Locations_Split_Coordinates_v2.csv" <- This file contains the coordinates from the Supplementary Table 1 where the latitudes and longitudes are splitted in separate columns. This table also retained the sample names and added "Rank" to indicate the focal ancient samples.

2. "Supplementary_Table_2_v6_June_resub_Loberg_omitted.csv" <- This file is the Supplementary Table 2 of this study.

Files that were generated from the raw data using Bash commands and the demultiplexed raw sequencing data as the input:

1. "CoVMatrix_Modern_ref_data_aBones_trim5_transitions_included_Loberg_20_omitted.covMat" <- ANGSD generated covariance matrix file used as the starting point for the PCA of the dataset.

2. "Haplofile_Marine_10_FW_10_aSB_Bone_9_trim5_transitions_WG.haplo" <- ANGSD generated haplo file used as the haploid genotype information for the ancestry estimations.

3. "10_marine_10_fw_subset_divergent_region_filtered_VCF_Fst.weir.fst" <- Fst file generated from variant calls with bcftools mpileup in the Bash script.

The associated R script has working directories specified as they were durig the actual data analysis of this manuscript. For replications, these need to be adjusted appropiatly. Ancestry estimation pipeline contains multiple steps where intermediate files are generated and used as the input for the next step.

The following R packages are required: ggplot2, sf, rnaturalearth, rnaturalearthdata, ggspatial, ggrepel, cowplot, magrittr, missMDA, dplyr, tidyr, patchwork, tidyverse

The repeats were masked and the global marine-freshwater divergen regions were done using bed files provided in the Jones et al. (2012) study. The liftover to the gasAcu1-4 reference genome from the gasAcu1 reference was done following the instructions, script and associated files provided by Roberts Kingman et al. (2020).

Access information

Other publicly accessible locations of the data:

NCBI BioProject ‘stickle-back-in-time’ (PRJNA693136)

Data was derived from the following sources:

This study