Data from: Evolutionary relationships between landlocked and anadromous Atlantic salmon populations in the North Shore region of the Gulf of St. Lawrence

Data files

May 19, 2026 version (files: 1.53 MB total)

AnalysisResults.zip (1.32 KB)
Data.zip (1.47 MB)
Functions.zip (20.23 KB)
README.md (16.97 KB)
Scripts.zip (22.64 KB)

Abstract

Although Atlantic salmon (Salmo salar) are typically anadromous, some complete their life cycle in freshwater. A widely documented scenario suggests that this resident tactic arose independently in each river via isolation from anadromous Atlantic salmon populations after the last ice age. Yet, the origin of residency remains poorly studied in the North Shore region of the Gulf of St. Lawrence (Canada). To address this, we genotyped 189 resident and 196 anadromous individuals from five watersheds at 43 microsatellite markers. We found marked genetic differences between tactics within rivers, likely resulting from different levels of gene flow associated with geographic isolation, suggesting that residents may not have always been isolated as expected from this tactic. Moreover, we observed greater genetic differentiation between tactics within rivers than between residents from different rivers, supporting a shared ancestral source, likely shaped by two colonization events or ancestral intracontinental gene flow. These findings bring nuances and complexity to the views on the origin of residency in Atlantic salmon, and are valuable for guiding conservation practices.

Dataset DOI: 10.5061/dryad.v6wwpzh7v

Description of the data and file structure

This repository contains:

Microsatellite genotype data for Atlantic salmon (Salmo salar) sampled from 5 rivers (10 sampling sites) in the North Shore region of the Gulf of St. Lawrence, Québec, Canada.

Data/AllDataCombined_NotFiltered.txt

Full unfiltered dataset combining all rivers, all sampling years, and all life stages. Contains 1,612 individuals and 51 microsatellite loci plus the 20 strata columns. This is the starting point of the filtering pipeline.

Data/PUY_Ind_Cluster_With_ROM_To_Remove.txt

List of 9 individual IDs from the Puyjalon River identified via STRUCTURE analysis as clustering with a Rimouski-type genetic group. These individuals were excluded from the main filtered datasets. The file contains a single column of individual IDs with no header row.

Data/All_Rivers_Dataset_Filtered_Balanced.txt
Data/All_Rivers_Dataset_Filtered_Imputed_Balanced.txt

Filtered and balanced datasets containing 885 individuals across 9 sampling sites and 43 loci, plus the 20 strata columns. The Puyjalon River is represented by a random subsample of 50 individuals to match sample sizes across sites. The _Imputed version has missing genotypes replaced by alleles sampled in proportion to observed allele frequencies; the non-imputed version retains NA for missing genotypes.

Data/All_Rivers_All_PUY_Dataset_Filtered_Balanced.txt
Data/All_Rivers_All_PUY_Dataset_Filtered_Imputed_Balanced.txt

Same structure as the main filtered and balanced datasets but retaining all available Puyjalon individuals (987 individuals total, 43 loci, plus the 20 strata columns) rather than the balanced subsample of 50. The _Imputed version has missing genotypes replaced by alleles sampled in proportion to observed allele frequencies; the non-imputed version retains NA for missing genotypes. Used for analyses where the full Puyjalon sample size is required without population reassignment.

Data/All_Rivers_Dataset_Filtered_Balanced_Reassigned.txt
Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Reassigned.txt

Same as the main filtered and balanced datasets (All_Rivers_Dataset_Filtered_Balanced.txt and its _Imputed version), but with individual population assignments corrected based on PCA clustering: individuals whose genotype clustered with a population other than their sampling site of origin were reassigned to the population they clustered with. Structure is otherwise identical (885 individuals, 43 loci).

Data/All_Rivers_Dataset_Filtered_Balanced_Rimouski.txt
Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Rimouski.txt

Same structure as the main filtered datasets but including 50 additional individuals from the Rimouski River (936 individuals total, 40 loci). The Rimouski population is used as an outgroup in phylogenetic tree analyses.

Data/All_Rivers_Dataset_Filtered_Balanced_Reassigned_Rimouski.txt
Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Reassigned_Rimouski.txt

Combination of the Rimouski-inclusive dataset and the population reassignment procedure. Contains 936 individuals and 40 loci.

Data/All_Rivers_All_PUY_Dataset_Filtered_Balanced_Reassigned.txt
Data/All_Rivers_All_PUY_Dataset_Filtered_Imputed_Balanced_Reassigned.txt

Same as the reassigned datasets but retaining all available Puyjalon individuals (987 individuals total, 43 loci) rather than the balanced subsample of 50. Used for analyses where the full Puyjalon sample size is required.

Data/ALLELE_Filtered_Individuals885_Loci43_REASSIGNED_IND_FINAL.txt
Data/ALLELE_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.txt

Allele-format files corresponding to the 885-individual reassigned dataset, with (_IMPUTED) and without missing data imputation. These are the primary input files for the custom R functions in the Functions/ folder.

Data/ALLELE_Filtered_Individuals936_Loci40_REASSIGNED_IND_FINAL_Rimouski.txt

Allele-format file for the Rimouski-inclusive reassigned dataset (936 individuals, 40 loci). Non-imputed version.

Data/ALLELE_Filtered_Individuals987_Loci43_ALL_PUY_REASSIGNED_IND_IMPUTED_FINAL.txt

Allele-format file for the full-Puyjalon reassigned imputed dataset (987 individuals, 43 loci).

Data/GENOTYPE_Filtered_Individuals885_Loci43_IMPUTED_FINAL.txt
Data/GENOTYPE_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.txt

Genotype-format versions of the 885-individual imputed datasets, with and without population reassignment. Each locus is represented by a single column containing both alleles separated by an underscore (e.g., 45_49). Generated by allele2genotype.R and used as input for create_genind.R.

Data/GENOTYPE_Filtered_Individuals936_Loci40_REASSIGNED_IND_IMPUTED_FINAL_Rimouski.txt
Data/GENOTYPE_Filtered_Individuals936_Loci41_IMPUTED_FINAL.txt

Genotype-format versions of the Rimouski-inclusive imputed datasets, with and without population reassignment. The reassigned version contains 40 loci; the non-reassigned version contains 41 loci.

Data/GENIND_Filtered_Individuals885_Loci43_IMPUTED_FINAL.RDS
Data/GENIND_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.RDS

R objects of class genind (package adegenet) for the 885-individual imputed datasets, with and without population reassignment. Each object contains 43 loci and 9 populations (Aguanish, Downstream, Etamamiou, Lake, Perugia, Puyjalon, River, Upstream, Victor). Generated by create_genind.R.

Data/GENIND_Filtered_Individuals936_Loci40_REASSIGNED_IND_IMPUTED_FINAL_Rimouski.RDS
Data/GENIND_Filtered_Individuals936_Loci41_IMPUTED_FINAL.RDS

R objects of class genind for the Rimouski-inclusive imputed datasets. Both contain 10 populations (adding Rimouski to the 9 above). The reassigned version has 40 loci; the non-reassigned version has 41 loci.

Data/GENEPOP_Individuals385_Loci43_River_Site_IMPUTED_MUS_POOLED.txt
Data/GENEPOP_Individuals385_Loci43_River_Site_IMPUTED_REASS_MUS_POOLED.txt

GENEPOP-format files containing 385 individuals genotyped at 43 loci, with populations defined by river-site combination and individuals from the Musquaro River pooled across sites. The _REASS version incorporates population reassignment. Generated by genepop_creator.R and used as input for STRUCTURE or similar software.

NA values: In the locus columns, NA indicates missing genotype data, which may result from PCR amplification failure, low DNA quality, or insufficient signal during genotype scoring. In the columns Month, Day, Length_F_mm, Length_T_mm, Weigth_g, Sex, Age, and Age_Ext, NA indicates data that were not collected for that individual, which was common for certain life stages and sampling rivers; these columns were not used in the genetic analyses.

Column descriptions:

The following columns appear in all tabular data files (AllDataCombined_NotFiltered.txt, All_Rivers_*, ALLELE_*, and GENOTYPE_* files). The strata are also stored in the @strata slot of all genind (.RDS) objects.

ID = Unique individual identifier, formatted as EcotypeCode_IndividualNumber_RiverCode_Year (e.g., AN_0047_AGU_2004)
Ecotype = Migratory tactic (Landlocked, Anadromous, or unknown)
Ecotype_Num = Numeric code for ecotype
River = River where the individual was sampled
River_Num = Numeric code for river
Site = Sampling site on the river
Year = Sampling year
Month = Sampling month; NA when not recorded
Day = Sampling day; NA when not recorded
Num_Unique = Unique individual number
Life_Stage = Life stage of the individual (e.g., Adult, Parr, Smolt)
Life_Stage_Num = Numeric code for life stage
Length_F_mm = Fork length in millimeters; NA when not measured
Length_T_mm = Total length in millimeters; NA when not measured
Weigth_g = Weight in grams; NA when not measured
Sex = Sex of the individual; NA when not determined
Age = Age of the individual; NA when not determined
Age_Ext = Indicates whether the age estimate is exact or approximate ("+" indicates the fish is at least the recorded age but possibly older); NA when not determined
River_Site = Code combining river name and site (e.g., AGU_A)
River_Site_Num = Numeric code for River_Site
Microsatellite loci columns in ALLELE_* and All_Rivers_* files: Each locus is represented by two columns (e.g., NGS_SSsp2210 and NGS_SSsp2210_b), one per allele. The suffix "_b" denotes the second allele. Values represent allele sizes in base pairs. NA indicates missing genotype data (see NA values section above).
Microsatellite loci columns in GENOTYPE_* files: Each locus is represented by a single column. Values are two allele sizes in base pairs separated by an underscore (e.g., 45_49, where 45 is the first allele and 49 is the second allele). NA indicates missing genotype data.
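
As a small illustration of the ID convention listed above, the following Python sketch splits an ID into its four components. The repository's own code is written in R; the names `SampleID` and `parse_id` are hypothetical, and the sketch assumes that no component contains an underscore.

```python
from typing import NamedTuple


class SampleID(NamedTuple):
    ecotype_code: str
    individual_number: str  # kept as text so leading zeros survive
    river_code: str
    year: int


def parse_id(sample_id: str) -> SampleID:
    """Split an ID of the form EcotypeCode_IndividualNumber_RiverCode_Year."""
    parts = sample_id.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected ID format: {sample_id!r}")
    ecotype, number, river, year = parts
    return SampleID(ecotype, number, river, int(year))


parsed = parse_id("AN_0047_AGU_2004")
print(parsed.river_code, parsed.year)
```

Keeping the individual number as a string preserves the zero padding used in the IDs.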

R scripts used to filter the datasets and create the required file formats.

Scripts/01_FilteringData.R

Script used to filter data for all rivers except the Rimouski River, which is used as an outgroup for phylogenetic trees.

Scripts/02_FilteringData_Rimouski.R

Script used to filter data for all rivers, including the Rimouski River, which is used as an outgroup for phylogenetic trees.

Scripts/03_Create_Format_Data.R

Script used to create the required file formats (allele format, genotype format, genind, and GENEPOP).

R scripts used for population structure analysis, genetic differentiation, and figure generation.

Scripts/04_PCA_MS.R

Script used to generate the PCA figure that appears in the manuscript.

Scripts/05_PCA_MatSupp.R

Script used to generate the PCA figure that appears in the supplementary materials.

Scripts/06_Tree_MS.R

Script used to generate phylogenetic tree figures.

Scripts/07_MigrationDirectionality_MS.R

Script used to compute migration directionality and generate figures.

Scripts/08_FST_MS.R

Script used to calculate FST and generate related figures.

Custom functions used within the scripts above.

Functions/allele_freq.R

Calculates allele frequencies for each locus from an allele-format dataset (two allele columns per locus). Returns a named list of dataframes, one per locus, containing allele counts and relative frequencies. Output is used as input for impute.R.
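
The counting logic can be sketched in a few lines of Python. This is an illustration only: the actual function is written in R and returns dataframes, and the function name, the list-of-dicts input, and the use of `None` for NA are assumptions made for this example.

```python
from collections import Counter


def allele_frequencies(rows, loci):
    """Count alleles per locus across both allele columns.

    rows : list of dicts, one per individual; each locus has the columns
           `locus` and `locus + "_b"`, with None for missing alleles
    loci : list of locus names
    Returns {locus: {allele: (count, relative_frequency)}}.
    """
    result = {}
    for locus in loci:
        counts = Counter()
        for row in rows:
            for column in (locus, locus + "_b"):
                allele = row.get(column)
                if allele is not None:
                    counts[allele] += 1
        total = sum(counts.values())
        result[locus] = {a: (n, n / total) for a, n in counts.items()} if total else {}
    return result


# Toy data, not taken from the dataset.
toy_rows = [
    {"NGS_SSsp2210": 45, "NGS_SSsp2210_b": 49},
    {"NGS_SSsp2210": 45, "NGS_SSsp2210_b": None},
]
toy_freqs = allele_frequencies(toy_rows, ["NGS_SSsp2210"])
```

Missing alleles are simply skipped, so frequencies are relative to the number of observed allele copies at each locus.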

Functions/allele_rich.R

Estimates allelic richness per locus and population using rarefaction (via the pegas package). Returns a long-format dataframe with allelic richness values per locus-population combination.
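
For readers unfamiliar with rarefaction, the Python sketch below computes the expected number of distinct alleles in a random subsample of n gene copies, using a standard rarefaction estimator. Whether allele_rich.R uses exactly this estimator is not stated here, so treat it as an assumption; the function name and inputs are hypothetical.

```python
from math import comb


def rarefied_allelic_richness(allele_counts, n):
    """Expected number of alleles in a subsample of n gene copies.

    allele_counts : counts of each allele at one locus in one population
    n             : subsample size in gene copies (2 x number of individuals)
    """
    total = sum(allele_counts)
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the total number of gene copies")
    # For each allele: 1 minus the probability that it is absent from the subsample.
    return sum(1 - comb(total - c, n) / comb(total, n) for c in allele_counts)


# Toy counts: two alleles observed 4 times each.
richness = rarefied_allelic_richness([4, 4], 2)
```

Rarefying every population to the same n makes richness values comparable across sites with different sample sizes.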

Functions/fst_calculation.R

Estimates pairwise FST values between populations using either Nei (1987) or Weir and Cockerham (1984) estimators via the hierfstat package, along with bootstrap confidence intervals. Returns a long-format dataframe with point estimates and lower/upper confidence bounds for each population pair.

Functions/fst_marker.R

Calculates per-locus pairwise FST values between all population pairs using the pegas package. Returns a long-format dataframe with FST per locus and per population pair, useful for identifying loci under selection.

Functions/het.R

Estimates expected (Hs) and observed (Hi) heterozygosity per locus and population using the pegas package. Returns a long-format dataframe with heterozygosity values per locus-population combination.
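
The two quantities can be illustrated with a minimal Python sketch for a single locus in a single population. It uses the plain definitions (proportion of heterozygotes, and one minus the sum of squared allele frequencies) without any small-sample correction, so values may differ slightly from those returned by the R function; the function name and input format are assumptions.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one locus.

    genotypes : list of (allele1, allele2) tuples; None, or a tuple
                containing None, marks a missing genotype and is skipped
    """
    complete = [g for g in genotypes if g is not None and None not in g]
    if not complete:
        return None, None
    observed = sum(a != b for a, b in complete) / len(complete)
    counts = Counter(allele for g in complete for allele in g)
    n_copies = 2 * len(complete)
    expected = 1.0 - sum((c / n_copies) ** 2 for c in counts.values())
    return observed, expected


# Toy genotypes, not taken from the dataset.
h_obs, h_exp = heterozygosity([(45, 49), (45, 45), (49, 49), (45, 49), None])
```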

Functions/kp_ind.R

Subsets a genind object by retaining only individuals matching one or more specified values in a given strata column. Intended for filtering individuals by ecotype, life stage, sampling site, or any other stratification variable.

Functions/pca.R

Performs a principal component analysis on a genind object using the adegenet package. Returns a list containing the PCA object (scores, loadings) and the percentage of inertia explained by each principal component.

Functions/allele2genotype.R

Merges paired allele columns (one per allele) into a single genotype column per locus. The resulting genotype-formatted dataframe is exported as a .txt file. Required preprocessing step before running create_genind.R.
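
The merge step can be pictured with the following Python sketch (the real function is in R). It assumes the column naming described in this README, with the second allele in `<locus>_b`, and it assumes that a genotype is treated as missing when either allele is missing; the function name is hypothetical.

```python
def alleles_to_genotypes(row, loci, sep="_"):
    """Merge the two allele columns of each locus into one genotype value.

    row  : dict for one individual (strata columns plus allele columns)
    loci : list of locus names; alleles are in `locus` and `locus + "_b"`
    """
    allele_columns = {c for locus in loci for c in (locus, locus + "_b")}
    # Keep strata columns unchanged.
    merged = {k: v for k, v in row.items() if k not in allele_columns}
    for locus in loci:
        a, b = row.get(locus), row.get(locus + "_b")
        merged[locus] = None if a is None or b is None else f"{a}{sep}{b}"
    return merged


toy_row = {"ID": "AN_0047_AGU_2004", "NGS_SSsp2210": 45, "NGS_SSsp2210_b": 49}
toy_genotype_row = alleles_to_genotypes(toy_row, ["NGS_SSsp2210"])
```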

Functions/calculate_marker_missing_percentage.R

Computes the proportion of missing data per locus, both overall and within each sampling river. Returns a list containing a summary dataframe of missing data proportions and a vector of loci that exceed a user-defined missingness threshold.
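
A rough Python equivalent of this calculation is sketched below; the repository function is in R. The sketch assumes that a genotype counts as missing when either allele is missing and that loci are flagged on their overall (not per-river) missingness; both points, like the function name, are assumptions.

```python
from collections import defaultdict


def missing_by_locus(rows, loci, threshold, group_col="River"):
    """Proportion of missing genotypes per locus, overall and per group.

    Returns (summary, flagged), where summary[locus] maps "overall" and each
    group value to a proportion, and flagged lists the loci whose overall
    proportion exceeds `threshold`.
    """
    def proportion(subset, locus):
        missing = sum(r[locus] is None or r[locus + "_b"] is None for r in subset)
        return missing / len(subset)

    groups = defaultdict(list)
    for r in rows:
        groups[r[group_col]].append(r)

    summary = {}
    for locus in loci:
        summary[locus] = {"overall": proportion(rows, locus)}
        for group, subset in groups.items():
            summary[locus][group] = proportion(subset, locus)
    flagged = [locus for locus in loci if summary[locus]["overall"] > threshold]
    return summary, flagged


# Toy data, not taken from the dataset.
toy = [
    {"River": "AGU", "L1": 45, "L1_b": 49, "L2": None, "L2_b": None},
    {"River": "AGU", "L1": 45, "L1_b": 45, "L2": 10, "L2_b": 12},
    {"River": "PUY", "L1": None, "L1_b": None, "L2": None, "L2_b": None},
    {"River": "PUY", "L1": 49, "L1_b": 49, "L2": 10, "L2_b": 10},
]
summary, flagged = missing_by_locus(toy, ["L1", "L2"], threshold=0.3)
```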

Functions/create_genind.R

Converts a genotype-formatted dataframe (produced by allele2genotype.R) into a genind object from the adegenet package. The resulting object is exported as an .RDS file and used as input for genepop_creator.R and pca.R.

Functions/genepop_creator.R

Converts a genind object into a GENEPOP-formatted .txt file. Handles edge cases where fewer than three populations are present by temporarily adding dummy individuals. Missing genotypes encoded as 000000 are automatically recoded to 0000 for compatibility with downstream software.

Functions/impute.R

Replaces missing allele values (NA) with alleles sampled randomly in proportion to their observed frequency, as estimated by allele_freq.R. Imputation is performed independently for each allele column at each locus.
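
The sampling step can be sketched in Python as follows (the repository function is in R). The function name, the dictionary of frequencies, and the explicit random generator are assumptions made for this illustration.

```python
import random


def impute_column(values, frequencies, rng):
    """Fill missing entries (None) in one allele column.

    values      : allele sizes for one allele column, None where missing
    frequencies : dict mapping allele -> observed relative frequency at the locus
    rng         : random.Random instance, passed in so results are reproducible
    """
    alleles = list(frequencies)
    weights = [frequencies[a] for a in alleles]
    return [
        v if v is not None else rng.choices(alleles, weights=weights, k=1)[0]
        for v in values
    ]


rng = random.Random(42)  # arbitrary seed, chosen only for this illustration
column = [45, None, 49, 45, None]
freqs = {45: 2 / 3, 49: 1 / 3}  # toy frequencies from the observed values
imputed = impute_column(column, freqs, rng)
```

Because each allele column is filled separately, observed alleles are never altered and only the NA entries are drawn at random.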

Functions/ld_test_locus.R

Tests for linkage disequilibrium between all pairs of loci using the pegas package. Returns a long-format dataframe with pairwise LD (Delta) values, allele information, chromosome assignments, and associated p-values for each locus pair.

Functions/na_heatmap.R

Reformats a genotyping dataframe into a long-format dataframe suitable for plotting a heatmap of missing data, with one row per individual-locus combination and a binary indicator of missingness. Population labels are included to allow visualization by sampling group.

Functions/p100_missing.R

Calculates the proportion of missing genotype data either per individual (rows) or per locus (columns), depending on the margin argument. Returns a dataframe (locus-level) or prints a numeric vector (individual-level).

Functions/export_allele_format.R

Exports a wide-format allele dataframe (one column per allele) to a .txt file with a standardized file name encoding the number of individuals and loci. This format is the main input format for most other functions in this workflow.

Functions/delta_K.R

Computes the delta K statistic (Evanno et al. 2005) from multiple STRUCTURE runs to identify the most likely number of genetic clusters (K). Takes the mean log-likelihood values across runs as input and returns delta K values per tested K.
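
For reference, the delta K statistic is the absolute second-order difference of the log-likelihood, |L(K+1) - 2L(K) + L(K-1)|, divided by the standard deviation of L(K) across runs. The Python sketch below takes the per-run values and computes the second difference on the run means; the input format and function name are assumptions, and the R function in this repository may be organized differently.

```python
from statistics import mean, stdev


def delta_k(ln_prob_by_k):
    """Delta K for each K that has both neighbours K-1 and K+1.

    ln_prob_by_k : dict mapping K -> list of Ln P(D) values, one per run
                   (at least two runs per K, so the SD is defined)
    """
    ks = sorted(ln_prob_by_k)
    means = {k: mean(ln_prob_by_k[k]) for k in ks}
    sds = {k: stdev(ln_prob_by_k[k]) for k in ks}
    result = {}
    for k in ks:
        # Skip the smallest and largest K, and any K with zero variance.
        if k - 1 in means and k + 1 in means and sds[k] > 0:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / sds[k]
    return result


# Toy log-likelihoods, not results from this study.
toy_runs = {1: [-100, -102], 2: [-80, -82], 3: [-79, -81], 4: [-78, -80]}
toy_delta = delta_k(toy_runs)
best_k = max(toy_delta, key=toy_delta.get)
```

Delta K is undefined for the smallest and largest K tested, which is why those values are absent from the output.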

Functions/mean_membership.R

Calculates mean individual membership coefficients across multiple STRUCTURE runs for a given K. Used to summarize and visualize population assignment probabilities.
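
Averaging membership coefficients is straightforward once runs are comparable. The Python sketch below assumes that all runs have the same individuals in the same order and that cluster labels have already been aligned across runs (label switching resolved); these are assumptions, as is the function name, and the repository function itself is in R.

```python
def mean_membership(runs):
    """Average Q matrices across runs.

    runs : list of Q matrices, each a list of rows (one per individual)
           with K membership coefficients per row
    """
    n_runs = len(runs)
    n_individuals = len(runs[0])
    k = len(runs[0][0])
    return [
        [sum(run[i][c] for run in runs) / n_runs for c in range(k)]
        for i in range(n_individuals)
    ]


# Toy Q matrices for two individuals, K = 2, two runs.
toy_q = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
toy_mean_q = mean_membership(toy_q)
```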

Files and variables

File: AnalysisResults.zip

Description: Contains one RData file, DirMig_EcoSure_MusPooled_Reass_GST_10000_Boots_Forward.RData. This file contains the R object mig_all_reass, which is the output of the migration directionality analysis performed with the divMigrate() function from the R package diveRsity. The object contains two matrices whose rows and columns both correspond to the 9 sampling sites (AGU_A, COR_A, COR_R, ETA_A, MUS_A, MUS_R, PER_R, PUY_A, VIC_R). Each cell [i, j] represents the migration value from population i to population j:

$gRelMig: 9x9 matrix of relative migration values between all pairs of sampling sites
$gRelMigSig: 9x9 matrix of the same values filtered for statistical significance (non-significant values set to 0), based on 10,000 bootstraps using the GST statistic. Diagonal values are NA (no self-migration).
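
Once exported from R, a matrix laid out this way can be turned into a list of directed edges. The Python sketch below shows one way to do so; it assumes the matrix has been converted to nested lists with `None` for NA, and the function name and toy values are hypothetical.

```python
def significant_edges(sites, sig_matrix):
    """List (source, target, value) for every non-zero off-diagonal cell.

    sites      : site names, in the row/column order of the matrix
    sig_matrix : square nested list; cell [i][j] is migration from site i
                 to site j, with 0 for non-significant values and None on
                 the diagonal
    """
    edges = []
    for i, source in enumerate(sites):
        for j, target in enumerate(sites):
            value = sig_matrix[i][j]
            if i != j and value is not None and value > 0:
                edges.append((source, target, value))
    # Strongest relative migration first.
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Toy 3-site example with made-up values, not results from this study.
toy_sites = ["A", "B", "C"]
toy_matrix = [
    [None, 0.0, 1.0],
    [0.4, None, 0.0],
    [0.0, 0.0, None],
]
toy_edges = significant_edges(toy_sites, toy_matrix)
```

Reading rows as sources and columns as targets follows the [i, j] convention stated above.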

File: Functions.zip

Description: Custom R functions used in all analyses.

File: Scripts.zip

Description: R scripts for data processing, population structure analysis, differentiation, migration directionality, and figure generation. These scripts rely on the functions provided in Functions.zip.

File: Data.zip

Description: Microsatellite genotype datasets for Atlantic salmon from rivers in the North Shore region of the Gulf of St. Lawrence. Includes raw, filtered and imputed datasets in both text and genind/GENEPOP formats.

GENEPOP file structure: Each row represents one individual. Loci names are listed in the header, separated by commas. Populations are delimited by "Pop" rows. Genotypes are encoded as pairs of allele sizes (in base pairs), concatenated into a single value per locus (e.g., "2545" = allele 1 of 25 bp, allele 2 of 45 bp).
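
The genotype encoding can be illustrated with a short Python sketch. It assumes two digits per allele, consistent with the "2545" example and with missing genotypes coded as 0000; the function names are hypothetical and the repository's own conversion is done in R by genepop_creator.R.

```python
def encode_genotype(genotype, digits=2, sep="_"):
    """Convert a genotype such as "45_49" to a GENEPOP code such as "4549".

    A missing genotype (None) becomes a string of zeros.
    """
    if genotype is None:
        return "0" * (2 * digits)
    alleles = [int(a) for a in genotype.split(sep)]
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, got {genotype!r}")
    if any(a <= 0 or len(str(a)) > digits for a in alleles):
        raise ValueError(f"allele does not fit in {digits} digits: {genotype!r}")
    return "".join(f"{a:0{digits}d}" for a in alleles)


def decode_genotype(code):
    """Convert a GENEPOP code back to (allele1, allele2), or None if missing."""
    half = len(code) // 2
    a, b = int(code[:half]), int(code[half:])
    return None if a == 0 and b == 0 else (a, b)


code = encode_genotype("25_45")
alleles = decode_genotype(code)
```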

Code/software

Scripts were written using R version 4.2.1.

Scripts should be executed in the following order:

01_FilteringData.R → 02_FilteringData_Rimouski.R → 03_Create_Format_Data.R → 04_PCA_MS.R → 05_PCA_MatSupp.R → 06_Tree_MS.R → 07_MigrationDirectionality_MS.R → 08_FST_MS.R

Required R packages (also listed in each script):

adegenet

pegas

reshape2

readxl

dplyr

tidyr

hierfstat

stringr

tidyverse

ggplot2

ggpubr

ggtext

cowplot

ggnewscale

poppr

ggtree

ape

tidytree

stats

diveRsity

tibble

igraph

ggraph