Data from: Evolutionary relationships between landlocked and anadromous Atlantic salmon populations in the North Shore region of the Gulf of St. Lawrence
Data files
May 19, 2026 version files (1.53 MB total)
- AnalysisResults.zip (1.32 KB)
- Data.zip (1.47 MB)
- Functions.zip (20.23 KB)
- README.md (16.97 KB)
- Scripts.zip (22.64 KB)
Abstract
Although Atlantic salmon (Salmo salar) are typically anadromous, some complete their life cycle in freshwater. A widely documented scenario suggests that this resident tactic arose independently in each river via isolation from anadromous Atlantic salmon populations after the last ice age. Yet, the origin of residency remains poorly studied in the North Shore region of the Gulf of St. Lawrence (Canada). To address this, we genotyped 189 resident and 196 anadromous individuals from five watersheds at 43 microsatellite markers. We found marked genetic differences between tactics within rivers, likely resulting from different levels of gene flow associated with geographic isolation, suggesting that residents may not have always been isolated as expected from this tactic. Moreover, we observed greater genetic differentiation between tactics within rivers than between residents from different rivers, supporting a shared ancestral source, likely shaped by two colonization events or ancestral intracontinental gene flow. These findings bring nuances and complexity to the views on the origin of residency in Atlantic salmon, and are valuable for guiding conservation practices.
Dataset DOI: 10.5061/dryad.v6wwpzh7v
Description of the data and file structure
This repository contains:
- Microsatellite genotype data for Atlantic salmon (Salmo salar) sampled from 5 rivers (10 sampling sites) in the North Shore region of the Gulf of St. Lawrence, Québec, Canada.
- Data/AllDataCombined_NotFiltered.txt
Full unfiltered dataset combining all rivers, all sampling years, and all life stages. Contains 1,612 individuals and 51 microsatellite loci plus the 20 strata columns. This is the starting point of the filtering pipeline.
- Data/PUY_Ind_Cluster_With_ROM_To_Remove.txt
List of 9 individual IDs from the Puyjalon River identified via STRUCTURE analysis as clustering with a Rimouski-type genetic group. These individuals were excluded from the main filtered datasets. The file contains a single column of individual IDs with no header row.
- Data/All_Rivers_Dataset_Filtered_Balanced.txt
- Data/All_Rivers_Dataset_Filtered_Imputed_Balanced.txt
Filtered and balanced datasets containing 885 individuals across 9 sampling sites and 43 loci, plus the 20 strata columns. The Puyjalon River is represented by a random subsample of 50 individuals to match sample sizes across sites. The _Imputed version has missing genotypes replaced by alleles sampled in proportion to observed allele frequencies; the non-imputed version retains NA for missing genotypes.
- Data/All_Rivers_All_PUY_Dataset_Filtered_Balanced.txt
- Data/All_Rivers_All_PUY_Dataset_Filtered_Imputed_Balanced.txt
Same structure as the main filtered and balanced datasets but retaining all available Puyjalon individuals (987 individuals total, 43 loci, plus the 20 strata columns) rather than the balanced subsample of 50. The _Imputed version has missing genotypes replaced by alleles sampled in proportion to observed allele frequencies; the non-imputed version retains NA for missing genotypes. Used for analyses where the full Puyjalon sample size is required without population reassignment.
- Data/All_Rivers_Dataset_Filtered_Balanced_Reassigned.txt
- Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Reassigned.txt
Same as above, but with individual population assignments corrected based on PCA clustering: individuals whose genotype clustered with a population other than their sampling site of origin were reassigned to the population they clustered with. Structure is otherwise identical (885 individuals, 43 loci).
- Data/All_Rivers_Dataset_Filtered_Balanced_Rimouski.txt
- Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Rimouski.txt
Same structure as the main filtered datasets but including 50 additional individuals from the Rimouski River (936 individuals total, 40 loci). The Rimouski population is used as an outgroup in phylogenetic tree analyses.
- Data/All_Rivers_Dataset_Filtered_Balanced_Reassigned_Rimouski.txt
- Data/All_Rivers_Dataset_Filtered_Imputed_Balanced_Reassigned_Rimouski.txt
Combination of the Rimouski-inclusive dataset and the population reassignment procedure. Contains 936 individuals and 40 loci.
- Data/All_Rivers_All_PUY_Dataset_Filtered_Balanced_Reassigned.txt
- Data/All_Rivers_All_PUY_Dataset_Filtered_Imputed_Balanced_Reassigned.txt
Same as the reassigned datasets but retaining all available Puyjalon individuals (987 individuals total, 43 loci) rather than the balanced subsample of 50. Used for analyses where the full Puyjalon sample size is required.
- Data/ALLELE_Filtered_Individuals885_Loci43_REASSIGNED_IND_FINAL.txt
- Data/ALLELE_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.txt
Allele-format files corresponding to the 885-individual reassigned dataset, with (_IMPUTED) and without missing data imputation. These are the primary input files for the custom R functions in the Functions/ folder.
- Data/ALLELE_Filtered_Individuals936_Loci40_REASSIGNED_IND_FINAL_Rimouski.txt
Allele-format file for the Rimouski-inclusive reassigned dataset (936 individuals, 40 loci). Non-imputed version.
- Data/ALLELE_Filtered_Individuals987_Loci43_ALL_PUY_REASSIGNED_IND_IMPUTED_FINAL.txt
Allele-format file for the full-Puyjalon reassigned imputed dataset (987 individuals, 43 loci).
- Data/GENOTYPE_Filtered_Individuals885_Loci43_IMPUTED_FINAL.txt
- Data/GENOTYPE_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.txt
Genotype-format versions of the 885-individual imputed datasets, with and without population reassignment. Each locus is represented by a single column containing both alleles separated by an underscore (e.g., 45_49). Generated by allele2genotype.R and used as input for create_genind.R.
- Data/GENOTYPE_Filtered_Individuals936_Loci40_REASSIGNED_IND_IMPUTED_FINAL_Rimouski.txt
- Data/GENOTYPE_Filtered_Individuals936_Loci41_IMPUTED_FINAL.txt
Genotype-format versions of the Rimouski-inclusive imputed datasets, with and without population reassignment. The reassigned version contains 40 loci; the non-reassigned version contains 41 loci.
- Data/GENIND_Filtered_Individuals885_Loci43_IMPUTED_FINAL.RDS
- Data/GENIND_Filtered_Individuals885_Loci43_REASSIGNED_IND_IMPUTED_FINAL.RDS
R objects of class genind (package adegenet) for the 885-individual imputed datasets, with and without population reassignment. Each object contains 43 loci and 9 populations (Aguanish, Downstream, Etamamiou, Lake, Perugia, Puyjalon, River, Upstream, Victor). Generated by create_genind.R.
- Data/GENIND_Filtered_Individuals936_Loci40_REASSIGNED_IND_IMPUTED_FINAL_Rimouski.RDS
- Data/GENIND_Filtered_Individuals936_Loci41_IMPUTED_FINAL.RDS
R objects of class genind for the Rimouski-inclusive imputed datasets. Both contain 10 populations (adding Rimouski to the 9 above). The reassigned version has 40 loci; the non-reassigned version has 41 loci.
- Data/GENEPOP_Individuals385_Loci43_River_Site_IMPUTED_MUS_POOLED.txt
- Data/GENEPOP_Individuals385_Loci43_River_Site_IMPUTED_REASS_MUS_POOLED.txt
GENEPOP-format files containing 385 individuals genotyped at 43 loci, with populations defined by river-site combination and individuals from the Musquaro River pooled across sites. The _REASS version incorporates population reassignment. Generated by genepop_creator.R and used as input for STRUCTURE or similar software.
NA values: In the locus columns, NA indicates missing genotype data, which may result from PCR amplification failure, low DNA quality, or insufficient signal during genotype scoring. In the columns Month, Day, Length_F_mm, Length_T_mm, Weigth_g, Sex, Age, and Age_Ext, NA indicates data that were not collected for that individual, which was common for certain life stages and sampling rivers; these columns were not used in the genetic analyses.
Column descriptions:
The following columns appear in all tabular data files (AllDataCombined_NotFiltered.txt, All_Rivers_*, ALLELE_*, and GENOTYPE_* files). The strata are also stored in the @strata slot of all genind (.RDS) objects.
- ID = Unique individual identifier, formatted as EcotypeCode_IndividualNumber_RiverCode_Year (e.g., AN_0047_AGU_2004)
- Ecotype = Migratory tactic (Landlocked, Anadromous, or unknown)
- Ecotype_Num = Numeric code for ecotype
- River = River where the individual was sampled
- River_Num = Numeric code for river
- Site = Sampling site on the river
- Year = Sampling year
- Month = Sampling month; NA when not recorded
- Day = Sampling day; NA when not recorded
- Num_Unique = Unique individual number
- Life_Stage = Life stage of the individual (e.g., Adult, Parr, Smolt)
- Life_Stage_Num = Numeric code for life stage
- Length_F_mm = Fork length in millimeters; NA when not measured
- Length_T_mm = Total length in millimeters; NA when not measured
- Weigth_g = Weight in grams; NA when not measured
- Sex = Sex of the individual; NA when not determined
- Age = Age of the individual; NA when not determined
- Age_Ext = Indicates whether the age estimate is exact or approximate ("+" indicates the fish is at least the recorded age but possibly older); NA when not determined
- River_Site = Code combining river name and site (e.g., AGU_A)
- River_Site_Num = Numeric code for River_Site
- Microsatellite loci columns in ALLELE_* and All_Rivers_* files: Each locus is represented by two columns (e.g., NGS_SSsp2210 and NGS_SSsp2210_b), one per allele. The suffix "_b" denotes the second allele. Values represent allele sizes in base pairs. NA indicates missing genotype data (see NA values section above).
- Microsatellite loci columns in GENOTYPE_* files: Each locus is represented by a single column. Values are two allele sizes in base pairs separated by an underscore (e.g., 45_49, where 45 is the first allele and 49 is the second allele). NA indicates missing genotype data.
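The relationship between the two-column allele format and the single-column genotype format can be sketched in base R. This is a minimal illustration only: `merge_alleles()` is a hypothetical helper, not the repository's allele2genotype.R, and the IDs, locus names, and allele sizes are made up.

```r
# Toy allele-format table: one column per allele, "_b" marks the second allele.
# IDs, locus names, and allele sizes are illustrative, not taken from the data.
alleles <- data.frame(
  ID       = c("AN_0001_AGU_2004", "AN_0002_AGU_2004"),
  LocusA   = c(45, NA),
  LocusA_b = c(49, NA),
  LocusB   = c(120, 118),
  LocusB_b = c(120, 124)
)

# Hypothetical helper: paste each locus column with its "_b" partner
# to obtain one "allele1_allele2" column per locus.
merge_alleles <- function(df, loci) {
  out <- df["ID"]
  for (locus in loci) {
    a1 <- df[[locus]]
    a2 <- df[[paste0(locus, "_b")]]
    # The genotype is missing if either allele is missing.
    out[[locus]] <- ifelse(is.na(a1) | is.na(a2),
                           NA_character_,
                           paste(a1, a2, sep = "_"))
  }
  out
}

genotypes <- merge_alleles(alleles, c("LocusA", "LocusB"))
print(genotypes)
```

The resulting table has one row per individual and one character column per locus, which matches the layout described for the GENOTYPE_* files.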
- R scripts used to filter the datasets and create the required file formats.
- Scripts/01_FilteringData.R
Script used to filter data for all rivers except the Rimouski River (the Rimouski River is used as an outgroup for phylogenetic trees).
- Scripts/02_FilteringData_Rimouski.R
Script used to filter data for all rivers and the Rimouski River (the Rimouski River is used as an outgroup for phylogenetic trees).
- Scripts/03_Create_Format_Data.R
Script used to create the required file formats (organized by alleles or genotypes, genind, genepop).
- R scripts used for population structure analysis, genetic differentiation, and figure generation.
- Scripts/04_PCA_MS.R
Script used to generate the PCA figure that appears in the manuscript.
- Scripts/05_PCA_MatSupp.R
Script used to generate the PCA figure that appears in the supplementary materials.
- Scripts/06_Tree_MS.R
Script used to generate phylogenetic tree figures.
- Scripts/07_MigrationDirectionality_MS.R
Script used to compute migration directionality and generate figures.
- Scripts/08_FST_MS.R
Script used to calculate FST and generate related figures.
- Custom functions used within the scripts above.
- Functions/allele_freq.R
Calculates allele frequencies for each locus from an allele-format dataset (two allele columns per locus). Returns a named list of dataframes, one per locus, containing allele counts and relative frequencies. Output is used as input for impute.R.
- Functions/allele_rich.R
Estimates allelic richness per locus and population using rarefaction (via the pegas package). Returns a long-format dataframe with allelic richness values per locus-population combination.
- Functions/fst_calculation.R
Estimates pairwise FST values between populations using either Nei (1987) or Weir and Cockerham (1984) estimators via the hierfstat package, along with bootstrap confidence intervals. Returns a long-format dataframe with point estimates and lower/upper confidence bounds for each population pair.
- Functions/fst_marker.R
Calculates per-locus pairwise FST values between all population pairs using the pegas package. Returns a long-format dataframe with FST per locus and per population pair, useful for identifying loci under selection.
- Functions/het.R
Estimates expected (Hs) and observed (Hi) heterozygosity per locus and population using the pegas package. Returns a long-format dataframe with heterozygosity values per locus-population combination.
- Functions/kp_ind.R
Subsets a genind object by retaining only individuals matching one or more specified values in a given strata column. Intended for filtering individuals by ecotype, life stage, sampling site, or any other stratification variable.
- Functions/pca.R
Performs a principal component analysis on a genind object using the adegenet package. Returns a list containing the PCA object (scores, loadings) and the percentage of inertia explained by each principal component.
- Functions/allele2genotype.R
Merges paired allele columns (one per allele) into a single genotype column per locus. The resulting genotype-formatted dataframe is exported as a .txt file. Required preprocessing step before running create_genind.R.
- Functions/calculate_marker_missing_percentage.R
Computes the proportion of missing data per locus, both overall and within each sampling river. Returns a list containing a summary dataframe of missing data proportions and a vector of loci that exceed a user-defined missingness threshold.
- Functions/create_genind.R
Converts a genotype-formatted dataframe (produced by allele2genotype.R) into a genind object from the adegenet package. The resulting object is exported as an .RDS file and used as input for genepop_creator.R and pca.R.
- Functions/genepop_creator.R
Converts a genind object into a GENEPOP-formatted .txt file. Handles edge cases where fewer than three populations are present by temporarily adding dummy individuals. Missing genotypes encoded as 000000 are automatically recoded to 0000 for compatibility with downstream software.
- Functions/impute.R
Replaces missing allele values (NA) with alleles sampled randomly in proportion to their observed frequency, as estimated by allele_freq.R. Imputation is performed independently for each allele column at each locus.
- Functions/ld_test_locus.R
Tests for linkage disequilibrium between all pairs of loci using the pegas package. Returns a long-format dataframe with pairwise LD (Delta) values, allele information, chromosome assignments, and associated p-values for each locus pair.
- Functions/na_heatmap.R
Reformats a genotyping dataframe into a long-format dataframe suitable for plotting a heatmap of missing data, with one row per individual-locus combination and a binary indicator of missingness. Population labels are included to allow visualization by sampling group.
- Functions/p100_missing.R
Calculates the proportion of missing genotype data either per individual (rows) or per locus (columns), depending on the margin argument. Returns a dataframe (locus-level) or prints a numeric vector (individual-level).
- Functions/export_allele_format.R
Exports a wide-format allele dataframe (one column per allele) to a .txt file with a standardized file name encoding the number of individuals and loci. This format is the main input format for most other functions in this workflow.
- Functions/delta_K.R
Computes the delta K statistic (Evanno et al. 2005) from multiple STRUCTURE runs to identify the most likely number of genetic clusters (K). Takes the mean log-likelihood values across runs as input and returns delta K values per tested K.
- Functions/mean_membership.R
Calculates mean individual membership coefficients across multiple STRUCTURE runs for a given K. Used to summarize and visualize population assignment probabilities.
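The frequency-proportional imputation described for allele_freq.R and impute.R can be sketched as follows. This is a base-R illustration under stated assumptions, not the repository's implementation: `impute_column()` is a hypothetical helper, the allele sizes are made up, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy allele column with missing values (sizes in bp; illustrative only).
allele_col <- c(45, 45, 49, NA, 45, 53, NA, 49)

# Hypothetical helper: fill NA values in one allele column by sampling
# alleles in proportion to their observed relative frequency.
impute_column <- function(x) {
  observed <- x[!is.na(x)]
  n_missing <- sum(is.na(x))
  # Nothing to do if there are no missing values or no observed alleles.
  if (n_missing == 0 || length(observed) == 0) return(x)
  freq <- table(observed) / length(observed)  # relative allele frequencies
  alleles <- as.numeric(names(freq))
  # Sample indices rather than values to avoid sample()'s special
  # behaviour when given a single number.
  idx <- sample.int(length(alleles), n_missing,
                    replace = TRUE, prob = as.numeric(freq))
  x[is.na(x)] <- alleles[idx]
  x
}

imputed <- impute_column(allele_col)
print(imputed)
```

Applying such a helper to each allele column separately mirrors the statement that imputation is performed independently for each allele column at each locus. Because the draw is random, imputed values change between runs unless a seed is set.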
Files and variables
File: AnalysisResults.zip
Description: Contains one RData file: DirMig_EcoSure_MusPooled_Reass_GST_10000_Boots_Forward.RData. This file contains the R object mig_all_reass, which is the output of the migration directionality analysis performed with the divMigrate() function from the R package diveRsity. The object contains two matrices whose rows and columns both correspond to the 9 sampling sites (AGU_A, COR_A, COR_R, ETA_A, MUS_A, MUS_R, PER_R, PUY_A, VIC_R). Each cell [i, j] represents the migration value from population i to population j:
- $gRelMig: 9x9 matrix of relative migration values between all pairs of sampling sites
- $gRelMigSig: 9x9 matrix of the same values filtered for statistical significance (non-significant values set to 0), based on 10,000 bootstraps using the GST statistic. Diagonal values are NA (no self-migration).
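The [i, j] convention of these matrices can be illustrated with a toy example. The sketch below uses a made-up 3x3 matrix (three of the site codes, invented values); with the real file, the same indexing would apply to `mig_all_reass$gRelMig` after loading the RData file.

```r
sites <- c("AGU_A", "COR_A", "COR_R")

# Made-up values for illustration only; these are not results from the study.
rel_mig <- matrix(
  c(NA,  0.8, 0.1,
    1.0, NA,  0.3,
    0.2, 0.4, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(from = sites, to = sites)
)

# Cell [i, j] is the migration value from row population i to column population j.
rel_mig["AGU_A", "COR_A"]  # AGU_A -> COR_A
rel_mig["COR_A", "AGU_A"]  # COR_A -> AGU_A

# Pairwise asymmetry: positive values mean the row -> column direction is larger.
asym <- rel_mig - t(rel_mig)

# Long format (one row per directed pair), dropping the NA diagonal.
long <- as.data.frame(as.table(rel_mig), responseName = "rel_mig")
long <- long[!is.na(long$rel_mig), ]
print(long)
```

Reading the matrix row-wise gives emigration from a site, and column-wise gives immigration into it.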
File: Functions.zip
Description: Custom R functions used in all analyses.
File: Scripts.zip
Description: R scripts for data processing, population structure analysis, differentiation, migration directionality, and figure generation. These scripts rely on the functions provided in Functions.zip.
File: Data.zip
Description: Microsatellite genotype datasets for Atlantic salmon from rivers in the North Shore region of the Gulf of St. Lawrence. Includes raw, filtered and imputed datasets in both text and genind/GENEPOP formats.
GENEPOP file structure: Each row represents one individual. Loci names are listed in the header, separated by commas. Populations are delimited by "Pop" rows. Genotypes are encoded as pairs of allele sizes (in base pairs), concatenated into a single value per locus (e.g., "2545" = allele 1 of 25 bp, allele 2 of 45 bp).
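The concatenated genotype encoding can be decoded with a few lines of base R. This is a sketch only: `decode_genepop()` is a hypothetical helper that assumes both alleles use the same number of digits (two each, as in "2545") and that a zero code marks a missing allele.

```r
# Hypothetical helper: split concatenated GENEPOP genotype codes
# into two allele columns.
decode_genepop <- function(code) {
  half <- nchar(code) / 2
  a1 <- as.integer(substr(code, 1, half))
  a2 <- as.integer(substr(code, half + 1, nchar(code)))
  # An all-zero allele code (e.g., "00") denotes missing data.
  a1[a1 == 0] <- NA
  a2[a2 == 0] <- NA
  data.frame(code = code, allele1 = a1, allele2 = a2)
}

# "2545" -> alleles 25 and 45; "0000" -> missing genotype.
decoded <- decode_genepop(c("2545", "4545", "0000"))
print(decoded)
```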
Code/software
Scripts were written using R version 4.2.1.
Scripts should be executed in the following order:
01_FilteringData.R → 02_FilteringData_Rimouski.R →
03_Create_Format_Data.R → 04_PCA_MS.R → 05_PCA_MatSupp.R →
06_Tree_MS.R → 07_MigrationDirectionality_MS.R → 08_FST_MS.R
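One way to run the scripts in this order is a simple loop over the file names. This is a sketch that assumes Scripts.zip was extracted into a folder named Scripts in the working directory; the scripts themselves may expect other paths (for example to the Data and Functions folders), so the layout may need adjusting.

```r
# Script names in the documented execution order.
scripts <- c(
  "01_FilteringData.R",
  "02_FilteringData_Rimouski.R",
  "03_Create_Format_Data.R",
  "04_PCA_MS.R",
  "05_PCA_MatSupp.R",
  "06_Tree_MS.R",
  "07_MigrationDirectionality_MS.R",
  "08_FST_MS.R"
)

# Assumed local layout: ./Scripts/<script name>
paths <- file.path("Scripts", scripts)

for (p in paths) {
  if (!file.exists(p)) {
    message("Skipping (not found): ", p)
    next
  }
  message("Running: ", p)
  source(p, echo = FALSE)
}
```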
Required R packages (also listed in each script):
adegenet
pegas
reshape2
readxl
dplyr
tidyr
hierfstat
stringr
tidyverse
ggplot2
ggpubr
ggtext
cowplot
ggnewscale
poppr
ggtree
ape
tidytree
stats
diveRsity
tibble
igraph
ggraph
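Before running the scripts, it can help to check which of these packages are already installed. The sketch below only reports missing packages and installs nothing; stats ships with R, and ggtree is distributed through Bioconductor rather than CRAN, so it may need to be installed separately.

```r
# Packages listed above.
pkgs <- c(
  "adegenet", "pegas", "reshape2", "readxl", "dplyr", "tidyr",
  "hierfstat", "stringr", "tidyverse", "ggplot2", "ggpubr", "ggtext",
  "cowplot", "ggnewscale", "poppr", "ggtree", "ape", "tidytree",
  "stats", "diveRsity", "tibble", "igraph", "ggraph"
)

# TRUE for each package whose namespace can be loaded.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```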
