Data from: A species interaction kick-starts ecological speciation
Data files (version of Oct 13, 2025; 1.92 GB in total)
- Data.and.Code_Roesti.et.al.2025.PNAS.zip (1.92 GB)
- README.md (19.92 KB)
Abstract
This dataset includes whole-genome sequencing (WGS), genotyping-by-sequencing (GBS), and morphological body-shape data from stickleback populations, together with the scripts used for data processing, statistical analyses, and visualization. The dataset is organized into code, data, and reference subfolders, each containing clearly described files as detailed in the README. These resources enable replication of the analyses presented in the paper. All data were collected in compliance with ethical standards for animal research and genetic data sharing.
For questions regarding this repository or its contents, please contact Marius Roesti at marius.roesti@unibe.ch.
Repository Overview
This repository contains the data and code accompanying the study by Roesti et al. (2025, PNAS), titled “A species interaction kick‑starts ecological speciation in allopatry.”
The study investigates whether the interaction with prickly sculpin fish has triggered adaptive divergence and, as a consequence, reproductive isolation in allopatric populations of stickleback fish. It analyzes reduced-representation sequencing data of hybrid F1 individuals from matings between divergent allopatric populations within experimental ponds, as well as whole-genome sequencing (WGS) and morphological body-shape data from wild populations. This repository provides all code, relevant raw data files, and reference materials required to reproduce the analyses and figures presented in the study.
Repository Structure
The dataset (Data.and.Code_Roesti.et.al.2025.PNAS.zip) is organized into three main subfolders:
- codes/ — R and Unix scripts used for data processing, filtering, analysis, and visualization.
- data.files/ — Raw data generated in the study and key intermediate data files produced by the provided scripts.
- reference.files/ — Reference data and auxiliary files used by several scripts.
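
After downloading, it can be useful to confirm that the archive contains the three subfolders listed above. The following is a minimal Python sketch and is not part of the repository. The helper names `find_subfolders` and `check_archive` are hypothetical, and the sketch assumes only that the subfolder names appear somewhere in the archive's member paths (with or without a wrapping root folder).

```python
import zipfile

# Subfolder names as listed in this README.
EXPECTED = {"codes", "data.files", "reference.files"}


def find_subfolders(member_names):
    """Return the expected subfolder names that occur in any member path."""
    found = set()
    for name in member_names:
        # Zip member paths always use "/" as the separator.
        found.update(part for part in name.split("/") if part in EXPECTED)
    return found


def check_archive(zip_path):
    """Return the expected subfolders that are missing from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return EXPECTED - find_subfolders(archive.namelist())
```

Calling `check_archive("Data.and.Code_Roesti.et.al.2025.PNAS.zip")` returns an empty set when all three subfolders are present.
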
Folder: codes/
This folder contains nine scripts used for data processing, filtering, analysis, and visualization. Each script is annotated and explains how to run it.
File | Script Type | Description | Main Figures / Outputs |
---|---|---|---|
GBS_initial.processing.sh | Unix (bash) | Processes reduced‑representation (GBS) data of F1 hybrids obtained from a pond experiment, as well as some reference individuals from each of the seven experimental source populations. This script aligns individual FASTQ files, generates BAMs, estimates genotype likelihoods, and infers admixture and relatedness using ANGSD, NGSadmix, and NGSremix. | Generates inputs for GBS_downstream.analyses.R. |
GBS_downstream.analyses.R | R | Reads and visualizes NGSadmix and NGSremix results, constructs admixture plots and kinship matrices, and exports summarized results. | Fig. 1D, Datasets S1–S2. |
WGS_variant.calling.sh | Unix + R (mixed) | Implements a GATK‑based workflow from raw whole-genome sequencing reads to raw variant calls, including reference indexing and quality recalibration. | Generates inputs for WGS_variant.filtering.sh. |
WGS_variant.filtering.sh | Unix + R (mixed) | Filters WGS variants by quality, depth, missingness, and repeats, and excludes sex chromosomes and masked regions. | Generates the filtered variant dataset used in Fig. 3 and related analyses, as well as in Supplementary Analysis 1. |
WGS_phylogenomic.analysis.sh | Unix + R (mixed) | Script outlining the generation of phylogenetic trees from whole-genome sequences of multiple solitary and sympatric stickleback populations using “neutral” SNP subsets, either restricted to chromosome XV or excluding SNPs in low-recombination genome regions and those showing high parallel divergence between marine and solitary or marine and sympatric populations. | Supplementary Analysis 1. |
WGS_divergence.calculation.sh | Unix + R (mixed) | Calculates pairwise genetic divergence in specific genome regions between the experimental source populations. | Fig. 3; Supplementary Analysis 2. |
Genomic_divergence.vs.assortativeMating.R | R | Tests whether genomic divergence predicts assortative mating across experimental ponds. | Fig. 3, Fig. S1B,C, Table S5, Supplementary Analysis 2. |
Phenotype_divergence.vs.assortativeMating.R | R | Tests whether phenotypic (body shape) divergence predicts assortative mating across experimental ponds. | Table 1, Fig. 2, Fig. S1A, Table S5, Supplementary Analysis 2. |
SpatioTemporal.isolation_nest.statistics.R | R | Tests for potential spatial or temporal isolation (nest depth and timing) between solitary and sympatric stickleback populations in the pond experiment. Generates figures summarizing nest characteristics. | Supplementary Analysis 2, Fig. S4. |
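
Several scripts consume the outputs of others, so the order in which they are run matters. The sketch below is not part of the repository. It encodes the dependencies stated in this README and derives one valid run order with Python's standard `graphlib` module. The two edges from `WGS_variant.filtering.sh` to the phylogenomic and divergence scripts are an assumption, inferred from the statement that the filtered VCF is used in all downstream whole-genome analyses. Scripts with no stated dependencies are omitted, and the name `run_order` is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each script maps to the scripts whose outputs it needs.
DEPENDS_ON = {
    "GBS_downstream.analyses.R": {"GBS_initial.processing.sh"},
    "WGS_variant.filtering.sh": {"WGS_variant.calling.sh"},
    # Assumed: both scripts start from the filtered VCF.
    "WGS_phylogenomic.analysis.sh": {"WGS_variant.filtering.sh"},
    "WGS_divergence.calculation.sh": {"WGS_variant.filtering.sh"},
    "Genomic_divergence.vs.assortativeMating.R": {"WGS_divergence.calculation.sh"},
}


def run_order(depends_on):
    """Return the scripts in an order where every prerequisite comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for script in run_order(DEPENDS_ON):
        print(script)
```

The printed order is only one of several valid orders; the GBS and WGS branches are independent of each other in this graph.
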
Folder: data.files/
This folder contains raw data generated in the study (Roesti et al. 2025, PNAS) as well as key intermediate files produced by the provided scripts. File formats and dimensions (rows x columns) are indicated below.
File | Format | Dimensions | Description | Used in / Generated by |
---|---|---|---|---|
offspring.assignments.data.csv | CSV | 438 (incl. header) x 9 | GBS-based offspring assignments to parental populations. Columns: offspring.id, pond, type, nest.id, mother.pop, father.pop, mother.type, father.type, exclude.full.sib.offspring. | Generated by GBS_downstream.analyses.R and used in several downstream analyses. |
nest.data.csv | CSV | 154 (incl. header) x 6 | Information on recovered nests from experimental ponds, including nest ID, water depth, date found, and whether the nest was empty or contained eggs. Columns: pond, nest.nr, nest.id, nest.date.found, nest.depth_m, empty.nest. | Used in SpatioTemporal.isolation_nest.statistics.R. |
shape.landmark.data.tps | TPS file | 335 individuals x 42 coords (21 landmarks) | Landmark-based body-shape data from Roesti et al. 2023 (Ecol Lett). Contains x,y coordinates of 21 fixed landmarks from lateral photos of 16–21 adult stickleback from each of 18 solitary and sympatric populations (total N of individuals = 335). | Used in Phenotype_divergence.vs.assortativeMating.R. |
allPondGenomes_filtered.step4.vcf.gz | VCF (gzipped) | 2,900,846 SNPs x 110 individuals (86 freshwater, 24 marine) | Contains 2,900,846 quality-filtered SNPs from all 110 WGS stickleback genomes used in Roesti et al. 2025 (PNAS). | Generated by WGS_variant.calling.sh and WGS_variant.filtering.sh and used in all downstream analyses using whole-genome data. |
allPondGenomes_filtered.step4_AFD_nonsource.sympatric.vs.nonsource.solitary_20kbWindows.csv | CSV | 2,803 (incl. header) x 5 | Genetic divergence (absolute allele-frequency difference, AFD) in non-overlapping 20 kb genome windows from comparing 20 sympatric and 20 solitary individual genomes from non-experimental populations. Columns: CHROM, BIN_START (window start), BIN_END (window end), N.RAW.AFD.VALUES (number of individual AFD values averaged per window), MEAN.AFD. | Generated by WGS_divergence.calculation.sh; used in Genomic_divergence.vs.assortativeMating.R. |
allPondGenomes_filtered.step4_AFD_nonsource.sympatric.vs.nonsource.solitary_20kbWindows_top.5percent.bed | BED | 1,027 (incl. header) x 3 | The top 5 % highest-divergence (absolute allele-frequency difference, AFD) 20 kb genome windows used as candidate adaptive regions for comparisons between solitary and sympatric source populations. Columns: CHROM, START, END. | Generated and used in WGS_divergence.calculation.sh to obtain inputs for Genomic_divergence.vs.assortativeMating.R. |
exclude.lowRecRate.GenomeRegions.bed | BED | 611 (incl. header) x 3 | Genome regions with low recombination rates (i.e., below 1.5 cM/Mb) to be excluded from phylogenetic analyses to mitigate effects of linked selection. | Generated and used in WGS_phylogenomic.analysis.sh. |
high.divergence.regions_marine.vs.sympatric.and.marine.vs.solitary.bed | BED | 5,188 (incl. header) x 3 | The top 20 % of 20 kb AFD windows showing the highest divergence in comparisons of (i) all marine (N = 24) vs. solitary (N = 48) and (ii) all marine vs. sympatric (N = 38) stickleback. | Generated and used in WGS_phylogenomic.analysis.sh. |
genomic.divergence_specific.genome.regions.csv | CSV | 64 (incl. header) x 5 | Mean pairwise Fst between experimental source populations within specific genome regions: (1) autosome-wide, (2) inside adaptive regions, (3) outside adaptive regions. Columns: genome.region, popA, popB, N_Fst.values (the number of individual SNP variants across which mean Fst was calculated), mean.Fst. | Generated by WGS_divergence.calculation.sh and used as input in Genomic_divergence.vs.assortativeMating.R. |
Roesti.recRate.from.lifted.linkageMap.csv | CSV | 1,818 (incl. header) x 9 | Recombination rate estimates, both raw and smoothed, along the stickleback genome derived from the linkage map of Roesti et al. 2013 (Mol Ecol) lifted to the stickleback genome v5. Columns: lifted.chromosome, lifted.pos, Roesti.marker, cM, phys_diff, phys_midpoint, genetic_diff, recRate, smoothed.recRate. | Generated by and used in WGS_phylogenomic.analysis.sh. |
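
To illustrate how a set of highest-divergence windows relates to the windowed AFD table, the sketch below ranks windows by `MEAN.AFD` and keeps the top fraction. It is a minimal Python illustration, not the authors' code: the actual selection is done in WGS_divergence.calculation.sh and may apply further filters (for example on `N.RAW.AFD.VALUES`) or a different rounding rule, so the result need not match the provided BED file exactly. The function names are hypothetical; the column names are taken from the table above.

```python
import csv
import math


def read_windows(csv_path):
    """Read the windowed AFD table into a list of dicts keyed by column name."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def top_fraction_windows(rows, fraction=0.05):
    """Return the highest-MEAN.AFD windows as (CHROM, start, end) tuples."""
    # Skip windows without a numeric mean (e.g. "NA" as written by R).
    usable = [r for r in rows if r["MEAN.AFD"] not in ("", "NA")]
    ranked = sorted(usable, key=lambda r: float(r["MEAN.AFD"]), reverse=True)
    # Assumption: round the number of windows to keep upwards.
    n_keep = math.ceil(len(ranked) * fraction)
    # int(float(...)) also accepts positions written as e.g. "1e+05".
    return [
        (r["CHROM"], int(float(r["BIN_START"])), int(float(r["BIN_END"])))
        for r in ranked[:n_keep]
    ]
```

A call such as `top_fraction_windows(read_windows(path_to_csv))` returns the candidate windows sorted from highest to lowest mean AFD.
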
Folder: reference.files/
This folder contains reference and auxiliary files used in variant filtering, recombination rate estimation, and population assignment. File formats and dimensions (rows x columns) are indicated below.
File | Format | Dimensions | Description | Used in / Purpose |
---|---|---|---|---|
AutosomesOnly.txt | TXT | 21 x 1 | List of autosomal chromosomes used to exclude sex chromosomes in genomic analyses. | Used in WGS_variant.filtering.sh. |
stickleback_v5_repeatMasker_for.excluding.repeats.bed | BED | 525,813 x 3 | RepeatMasker annotation of the stickleback genome v5 used to exclude repetitive regions during variant filtering. | Used in WGS_variant.filtering.sh. |
mec12322-sup-0004-appendixS4.csv | CSV | 1,873 (incl. header) x 5 | Linkage map from Roesti et al. 2013 (Mol Ecol), downloaded directly from the supplement. This map was lifted to the stickleback genome v5 for recombination rate inference. | Used in WGS_phylogenomic.analysis.sh. |
pond06_bamfileslist.txt – pond18_bamfileslist.txt | TXT (batch of 8) | variable x 1 | Lists of BAM files (sequenced offspring plus reference individuals) per experimental pond used for GBS-based offspring assignment to parental populations. | Used in GBS_initial.processing.sh. |
ref_ambrose.txt – ref_trout.txt | TXT (batch of 7) | 4-7 x 1 | IDs of individual whole-genome sequences for each of the seven experimental source populations. | Used in WGS_divergence.calculation.sh. |
all.individuals.txt | TXT | 532 x 1 | Master list of IDs of all individuals that were genotyped by reduced-representation sequencing, including experimental pond offspring and wild reference samples. | Used in GBS_initial.processing.sh. |
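
The dimensions listed in the tables above can serve as a quick integrity check after unpacking. The following minimal Python sketch, which is not part of the repository, counts the non-empty lines and the fields of the first line in a delimited text file. The helper name `table_shape` is hypothetical. The choice of delimiter is an assumption that must be matched to the file type: a comma for CSV files and a tab for BED files.

```python
import csv


def table_shape(path, delimiter=","):
    """Return (number of non-empty lines, number of fields in the first line)."""
    with open(path, newline="") as handle:
        # csv.reader yields an empty list for blank lines; drop those.
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols
```

For example, `table_shape("reference.files/AutosomesOnly.txt")` should return `(21, 1)` if the file matches the dimensions listed above; the relative path is an assumption about where the archive was unpacked.
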