Data and code: Effects of different SNP calling and sequence mapping choices on the inference of genetic architecture underlying migration tendency

Name: Data and code: Effects of different SNP calling and sequence mapping choices on the inference of genetic architecture underlying migration tendency
Creator: Giovanna Mottola


Published May 27, 2026 on Dryad. https://doi.org/10.5061/dryad.v6wwpzh9f

Data files

May 27, 2026 version files (66.92 MB total):

data_and_codes.zip (66.89 MB)
README.md (23.33 KB)

Abstract

Genome-wide association studies with identification of biologically relevant genes rely on correct mapping of sequence variation. Here, we re-analysed RAD-Seq data from two life-history types of brown trout (Salmo trutta) from River Koutajoki and River Oulujoki by replacing the originally applied Atlantic salmon (Salmo salar) reference genome with the later published brown trout reference genome, and by testing two alternative bioinformatic pipelines for identifying single nucleotide polymorphisms (SNPs). As expected, the results from population genomics and outlier analyses largely confirmed the original patterns of population structure and divergence, although SNP detection success varied between the used bioinformatic pipelines and reference genomes. While only two SNP outliers were found by all the alternative methods, several other outlier candidate SNPs related to migration, growth, and other biologically significant traits were identified. These findings confirm that the choice of the reference genome is not critical for basic population genomics but can improve the ability to detect functionally relevant genomic variation explaining migratory patterns in brown trout.


Principal Investigator Contact Information

Name: Anssi Vainikka
Institution: University of Eastern Finland
Email: anssi.vainikka@uef.fi

Alternate Contact Information 1

Name: Tuomas Leinonen
Institution: Natural Resources Institute Finland
Email: tuomas.leinonen@luke.fi

Alternate Contact Information 2

Name: Giovanna Mottola
Institution: University of Eastern Finland
Email: giovanna.mottola@uef.fi and giovannadeanna@gmail.com

Dataset overview

This repository (data_and_codes.zip) contains the bioinformatic re-analysis of RADseq data from resident and migratory populations of brown trout sampled from the Koutajoki and Oulujoki watersheds in north-eastern Finland. The study investigates how different sequence mapping strategies and SNP-calling pipelines influence the inference of population structure and the identification of genomic regions associated with migratory tendency.

The original analyses relied on the reference genome of Atlantic salmon and older SNP-calling approaches. In this project, the publicly available RADseq datasets were reprocessed using the later-published conspecific brown trout reference genome together with multiple alternative variant-calling workflows, including dDocent and Stacks v2. To disentangle the effects of alignment and SNP detection, both bwa and bowtie aligners were tested within the different pipelines.

The workflow includes:

preprocessing and demultiplexing of RADseq reads,
alignment against the brown trout reference genome,
SNP calling using alternative bioinformatic pipelines,
Hardy–Weinberg equilibrium filtering,
comparison of SNP overlap among pipelines,
population genomic analyses including STRUCTURE and FST estimation,
genome scans for loci under selection using PCAdapt, BayeScan, and BayeScEnv,
functional annotation of candidate outlier loci using VEP.
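
As an illustration of the Hardy-Weinberg equilibrium filtering step listed above, the following is a minimal sketch of a per-SNP chi-square test on genotype counts. It is not the authors' implementation (the filtering actually used is documented in the RMarkdown PDFs of this repository). The function names, the example counts and the 3.841 cutoff (the chi-square critical value for 1 degree of freedom at a significance level of 0.05) are assumptions made for this example.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
# The threshold is an assumption; the README does not state the one used.
CRITICAL_1DF_005 = 3.841


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic for deviation from HWE at one biallelic SNP."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped individuals")
    # Allele frequencies estimated from the observed genotype counts.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    # Classes with an expected count of zero are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_hwe(n_hom_ref, n_het, n_hom_alt, critical=CRITICAL_1DF_005):
    """True when the SNP does not deviate significantly from HWE."""
    return hwe_chi_square(n_hom_ref, n_het, n_hom_alt) <= critical
```

For example, `passes_hwe(25, 50, 25)` keeps the SNP, whereas a locus with no heterozygotes such as `passes_hwe(50, 0, 50)` is rejected.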

The re-analysis largely confirmed the original population genetic patterns, with clear differentiation between resident and migratory populations and strong consistency across analytical methods. However, the number and identity of detected SNPs differed substantially among pipelines and mapping approaches, highlighting the impact of technical choices on downstream genomic inference. Surprisingly, the conspecific brown trout reference genome yielded fewer filtered SNPs than the previously used Atlantic salmon reference.

Although overlap among outlier detection methods was limited, several candidate loci associated with migratory behaviour, osmoregulation, immune response, and environmental adaptation were identified. Multiple outlier SNPs were located near genes previously implicated in migration-related processes in salmonids, including KCNJ4, CLDN20, ST6GALNAC2, and znf76. Two highly consistent outliers were also detected within the PRDM9-like gene region across all SNP datasets.

Overall, this project demonstrates that broad population structure inference in brown trout is robust to alternative mapping and SNP-calling strategies, whereas candidate gene discovery is highly sensitive to bioinformatic choices. The analyses further support the hypothesis that migratory tendency in brown trout has a complex, polygenic genetic architecture.

Data Sources

Raw data were retrieved from a public genomic repository (GenBank): Koutajoki accession number PRJNA431174; Oulujoki accession number SRP125540. The raw data were previously produced by our research group and are described in:

Lemopoulos, A., Uusi-Heikkilä, S., Huusko, A., Vasemägi, A., Vainikka, A., 2018a. Comparison of Migratory and Resident Populations of Brown Trout Reveals Candidate Genes for Migration Tendency. Genome Biol. Evol. 10, 1493–1503. https://doi.org/10.1093/gbe/evy102
Lemopoulos, A., Uusi-Heikkilä, S., Vasemägi, A., Huusko, A., Kokko, H., Vainikka, A., 2018b. Genome-wide divergence patterns support fine-scaled genetic structuring associated with migration tendency in brown trout. Can. J. Fish. Aquat. Sci. 75, 1680–1692. https://doi.org/10.1139/cjfas-2017-0014

Funding

This work was funded by Research Council of Finland grant #347367.

Description of the data and file structure

Several files included in this repository are generated through Linux-based workflows. To ensure full reproducibility, all necessary scripts are provided for users who wish to rerun the analyses. At the same time, the corresponding output files are also included, meaning that access to a Linux operating system is not required to reproduce or explore the results presented here. Missing data throughout the datasets are represented as either -9 or NA.

Files and Folders

data and codes/*.sh

These files contain the Linux shell scripts used to run the alignments with bwa and bowtie and the SNP calling with Stacks (Koutajoki_bowtieStacks; Koutajoki_bwaStacks; Oulujoki_bowtieStacks; Oulujoki_bwaStacks), plus the script used for demultiplexing the Koutajoki library (process_radtags).
All workflows were run on the Puhti supercomputer, an Atos BullSequana X400 cluster based on Intel CPUs, launched on September 2, 2019 by the Finnish IT Center for Science (CSC).
To run this workflow you need to install bowtie v.2, samtools and Stacks v.2.65. These programs are already available on Puhti under the "biokit" module.
Further information on each step is given in the .sh files.

data and codes/*.txt

These files are the barcodes used to demultiplex the Koutajoki library (barcodes1, barcodes2, barcodes3, barcodes4) and the outliers found using PCAdapt (outliersVENN3). The latter has been used to generate Fig. 4 in the main manuscript and is used in the RMarkdown called "script for outlier overlapping".

The barcode tables have 2 columns. The first column is the barcode and the second is the number of the individual.
The outliersVENN3 file consists of 3 columns. The first column gives the watershed (Koutajoki/Oulujoki) where the samples were collected. The second column gives the pipeline used to identify the SNPs (bwa+dDocent, bwa+Stacks, bowtie+Stacks). The third column lists the outliers in the format CHROM_POS_REF/ALT.
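
The three-column layout described above lends itself to a simple overlap computation. The sketch below is not the RMarkdown code used for Fig. 4; it only shows, under stated assumptions, how outliers shared by all pipelines within a watershed could be found. The function names and the example identifiers are hypothetical, and the rows are assumed to be already parsed into (watershed, pipeline, outlier) tuples, because the delimiter of the file is not described here.

```python
from collections import defaultdict


def shared_outliers(rows):
    """Return {watershed: outliers reported by every pipeline in that watershed}.

    rows: iterable of (watershed, pipeline, outlier_id) tuples.
    Only pipelines that report at least one outlier in a watershed take
    part in the intersection for that watershed.
    """
    by_watershed = defaultdict(lambda: defaultdict(set))
    for watershed, pipeline, outlier in rows:
        by_watershed[watershed][pipeline].add(outlier)
    return {
        watershed: set.intersection(*pipelines.values())
        for watershed, pipelines in by_watershed.items()
    }


def split_outlier_id(outlier_id):
    """Split 'CHROM_POS_REF/ALT' into (chrom, pos, ref, alt).

    Splitting from the right keeps chromosome names that themselves
    contain underscores intact.
    """
    locus, alleles = outlier_id.rsplit("_", 1)
    chrom, pos = locus.rsplit("_", 1)
    ref, alt = alleles.split("/")
    return chrom, int(pos), ref, alt


# Made-up rows for illustration only; they are not taken from outliersVENN3.
example_rows = [
    ("Koutajoki", "bwa+dDocent", "1_100_A/G"),
    ("Koutajoki", "bwa+Stacks", "1_100_A/G"),
    ("Koutajoki", "bowtie+Stacks", "1_100_A/G"),
    ("Koutajoki", "bwa+Stacks", "2_50_C/T"),
    ("Oulujoki", "bwa+dDocent", "3_10_G/T"),
    ("Oulujoki", "bwa+Stacks", "4_20_A/C"),
]
example_shared = shared_outliers(example_rows)
```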

data and codes/*.pdf

This file, called "script for outlier overlappings", is an RMarkdown document (knitted as PDF) showing the code used to generate Fig. 4 in the main manuscript. The analyses were run in RStudio with R version 4.3.1 (R Core Team, 2023. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/).

data and codes/bowtie+Stacks/*.pdf

These files are RMarkdown showing the codes and workflow used for all the downstream analyses present in the manuscript.

Diversity and Genome scan.pdf contains the full workflow for Hardy-Weinberg Equilibrium filtering, identification of outliers through genome scans, and calculation of population divergence through Fst values.
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Koutajoki watershed (further instructions are present in the pdf).
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Oulujoki watershed (further instructions are present in the pdf).

data and codes/bowtie+Stacks/*.txt

These files contain the output from Bayescan and BayeScEnv used in RStudio to identify outlier SNPs, the CLUMPAK output to estimate Evanno's bestK and the population map for both the watersheds.

bayescan_koutajoki_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
bayescenv_koutajoki_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
bayescan_oulujoki_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
bayescenv_oulujoki_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
evanno_koutajoki.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
evanno_oulujoki.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
popmap2_koutajoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).
popmap_oulujoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).
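
To show how the Bayescan columns described above can be used, here is a minimal sketch that flags SNPs whose q-value falls below a threshold. It is an illustration rather than the code in the RMarkdown PDFs: the function name is hypothetical, the 0.05 threshold is an assumption, and the rows are assumed to be whitespace-separated in the column order given above.

```python
def bayescan_outliers(lines, q_threshold=0.05):
    """Return [(snp_number, qval, alpha, fst)] for rows with qval < q_threshold.

    Assumed row layout (whitespace-separated):
    SNP number, prob, log10(PO), qval, alpha, fst.
    Lines that do not have six fields or whose first field is not an
    integer (for example a header line) are skipped.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        snp = int(fields[0])
        qval, alpha, fst = float(fields[3]), float(fields[4]), float(fields[5])
        if qval < q_threshold:
            hits.append((snp, qval, alpha, fst))
    return hits


# Made-up rows for illustration only.
example_lines = [
    "prob log10(PO) qval alpha fst",
    "1 0.05 -1.28 0.90 0.00 0.10",
    "2 0.99 2.00 0.01 1.50 0.35",
]
example_hits = bayescan_outliers(example_lines)
```

With a real file, the same function could be applied to an open file handle, for instance `bayescan_outliers(open("bayescan_koutajoki_fst.txt"))`.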

data and codes/bowtie+Stacks/*.vcf

These are the vcf files obtained by running the Stacks pipeline on bowtie alignments for both the Koutajoki and Oulujoki watersheds. The columns of the files, excluding the metadata, are in the following order:

Variable list

CHROM: (numeric) The chromosome for the locus.
POS: (numeric) The position of the locus.
ID: (alphanumeric) A unique identifier for the locus.
REF: (character, A, C, G, T) The reference allele(s) for the locus.
ALT: (character, A, C, G, T) Alternate allele(s) for the locus.
QUAL: no quality specified.
FILTER: (character) PASS if the locus passed filtration.
INFO: no info specified
FORMAT: (character) A character string specifying the format of the calls (GT (genotype call):DP (depth):AD (allele depth):GQ (genotype quality):GL (genotype likelihood))
Variables 10-178 (Koutajoki) or 10-80 (Oulujoki) are the calls for each individual.
Data type: genotype call (0/0 homozygous reference, 0/1 heterozygous, 1/1 homozygous alternate):depth:allele depth:genotype quality:genotype likelihood
Missing data value: ./.:.:.,.

Koutajoki_4881_178.recode.vcf contains all the SNPs found using this workflow on Koutajoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 178
Number of rows: 4881

Koutajoki_3055_178.recode.vcf contains all the SNPs found using this workflow after Hardy-Weinberg Equilibrium filtering on Koutajoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 178
Number of rows: 3055

Oulujoki_5164_80.vcf contains all the SNPs found using this workflow on Oulujoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 80
Number of rows: 5164

Oulujoki_4500_80.recode.vcf contains all the SNPs found using this workflow on Oulujoki after Hardy-Weinberg Equilibrium filtering.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 80
Number of rows: 4500
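
The counts listed for each vcf file can be checked with a short script. The sketch below assumes a standard tab-delimited VCF in which metadata lines start with `##` and the single header line starts with `#CHROM`; the function name is hypothetical. "Variables" is taken to mean the total number of columns, that is, the nine fixed VCF columns plus one column per individual.

```python
def vcf_summary(lines):
    """Count metadata rows, header rows, columns and SNP rows in a VCF."""
    metadata_rows = header_rows = rows = 0
    variables = None
    for line in lines:
        if line.startswith("##"):
            metadata_rows += 1
        elif line.startswith("#CHROM"):
            header_rows += 1
            variables = len(line.rstrip("\n").split("\t"))
        elif line.strip():
            rows += 1
    return {
        "metadata_rows": metadata_rows,
        "header_rows": header_rows,
        "variables": variables,
        "rows": rows,
    }


# Tiny made-up VCF for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "##source=example\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n",
    "1\t100\t1_100\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n",
    "1\t200\t1_200\tC\tT\t.\tPASS\t.\tGT\t1/1\t./.\n",
]
example_summary = vcf_summary(example_vcf)
```

Applied to a real file, for instance `vcf_summary(open("Koutajoki_3055_178.recode.vcf"))`, the result can be compared against the numbers given above.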

data and codes/bwa+Stacks/*.pdf

These files are RMarkdown showing the codes and workflow used for all the downstream analyses present in the manuscript.

Diversity and Genome scan.pdf contains the full workflow for Hardy-Weinberg Equilibrium filtering, identification of outliers through genome scans, and calculation of population divergence through Fst values.
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Koutajoki watershed (further instructions are present in the pdf).
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Oulujoki watershed (further instructions are present in the pdf).

data and codes/bwa+Stacks/*.txt

These files contain the output from Bayescan and BayeScEnv used in RStudio to identify outlier SNPs, the CLUMPAK output to estimate Evanno's bestK and the population map for both the watersheds.

bayescan_koutajoki_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
bayescenv_koutajoki_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
bayescan_oulujoki_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
bayescenv_oulujoki_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
evanno_koutajoki.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
evanno_oulujoki.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
popmap_Koutajoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).
popmap_Oulujoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).
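
Evanno's method selects the K with the largest Delta K. The sketch below illustrates that selection on a table laid out as described above. It assumes whitespace-separated rows with K in the first column and Delta_K in the last one; the function name and the example values (including the header of the made-up table) are assumptions, and the evanno PDFs remain the reference for the procedure actually followed.

```python
import math


def best_k(lines):
    """Return (K, Delta_K) for the row with the largest Delta_K, or None.

    Rows whose first field is not an integer (such as a header) and rows
    whose Delta_K is not a finite number (such as NA) are skipped.
    """
    best = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        try:
            delta_k = float(fields[-1])
        except ValueError:
            continue
        if math.isnan(delta_k) or math.isinf(delta_k):
            continue
        if best is None or delta_k > best[1]:
            best = (int(fields[0]), delta_k)
    return best


# Made-up table for illustration only.
example_table = [
    "K Reps Mean_LnP(K) Stdev_LnP(K) Ln'(K) |Ln''(K)| Delta_K",
    "1 10 -5000.0 1.0 NA NA NA",
    "2 10 -4500.0 2.0 500.0 400.0 200.0",
    "3 10 -4400.0 5.0 100.0 50.0 10.0",
    "4 10 -4350.0 8.0 50.0 NA NA",
]
example_best = best_k(example_table)
```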

data and codes/bwa+Stacks/*.vcf

These are the vcf files obtained by running the Stacks pipeline on bwa alignments for both the Koutajoki and Oulujoki watersheds. The columns of the files, excluding the metadata, are in the following order:

Variable list is similar to the one described in bowtie+Stacks

Koutajoki_179_4441.recode.vcf contains all the SNPs found using this workflow on Koutajoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 179
Number of rows: 4441

Koutajoki_179_2781.recode.vcf contains all the SNPs found using this workflow after Hardy-Weinberg Equilibrium filtering on Koutajoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 179
Number of rows: 2781

Oulujoki_80_4890.vcf contains all the SNPs found using this workflow on Oulujoki.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 80
Number of rows: 4890

Oulujoki_80_4244.recode.vcf contains all the SNPs found using this workflow on Oulujoki after Hardy-Weinberg Equilibrium filtering.

Number of metadata rows: 1455
Number of header rows: 1
Number of variables: 80
Number of rows: 4244

data and codes/bwa+dDocent/*.pdf

These files are RMarkdown showing the codes and workflow used for all the downstream analyses present in the manuscript.

Diversity and Genome scan.pdf contains the full workflow for Hardy-Weinberg Equilibrium filtering, identification of outliers through genome scans, and calculation of population divergence through Fst values.
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Koutajoki watershed (further instructions are present in the pdf).
evanno_koutajoki_test.pdf contains the steps to estimate the best K using Evanno's method for Oulujoki watershed (further instructions are present in the pdf).

data and codes/bwa+dDocent/*.txt

These files contain the output from Bayescan and BayeScEnv used in RStudio to identify outlier SNPs, the CLUMPAK output to estimate Evanno's bestK and the population map for both the watersheds.

Koutajoki_bayescan_file_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
Koutajoki_bayescenv_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
Oulujoki_bayescan_file_fst.txt contains 6 columns: the number of SNP, the posterior probability that selection is acting (prob), posterior odds for selection (log10(PO)), FDR-corrected significance value (qval), direction/strength of selection (alpha) and average locus Fst (fst)
Oulujoki_bayescenv_fst.txt contains 8 columns: the number of SNP, the significance for environmental association (PEP_g/q_val), environmental effect size (g), residual locus-specific effect (PEP_alpha/qval_alpha/alpha) and average locus fst (fst).
evanno_koutajoki.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
evanno.txt contains 7 columns: the number of inferred genetic clusters tested (K), number of replicate STRUCTURE runs for that K (Reps), mean log probability of the data under K (Mean_LnP(K)), Standard deviation among replicate runs (Stdev_LnP(K)), first derivative of likelihood Ln'(K), Evanno statistic used to identify optimal K (Delta_K)
popmap2_koutajoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).
popmap_oulujoki.txt contains 3 columns: the number of each individual (Individuals), the population to which each individual belongs (population) and whether it showed migratory or resident behaviour (behavior).

data and codes/bwa+dDocent/*.vcf

These are the vcf files obtained by running the dDocent pipeline on bwa alignments for both the Koutajoki and Oulujoki watersheds. The columns of the files, excluding the metadata, are in the following order:

Variable list:

CHROM: (numeric) The chromosome for the locus.
POS: (numeric) The position of the locus.
ID: (alphanumeric) A unique identifier for the locus, composed of CHROM_POS.
REF: (character, A, C, G, T) The reference allele(s) for the locus.
ALT: (character, A, C, G, T) Alternate allele(s) for the locus.
QUAL: numerical
FILTER: no info specified
INFO: no info specified
FORMAT: (character) A character string specifying the format of the calls
GT Genotype
DP Total read depth
AD Allele depths
RO Reference allele observation count
QR Sum of quality scores for reference observations
AO Alternate allele observation count
QA Sum of quality scores for alternate observations
GL Genotype likelihoods
Variables 10-180 (Koutajoki) or 10-80 (Oulujoki) are the calls for each individual.
Data type: genotype (0/0 homozygous reference, 0/1 heterozygous, 1/1 homozygous alternate):total read depth:allele depths:reference allele observation count:sum of quality scores for reference observations:alternate allele observation count:sum of quality scores for alternate observations:genotype likelihoods
Missing data value: ./.:.:.,.
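
Because the per-individual fields follow the order given in the FORMAT column, they can be unpacked generically. The following sketch pairs FORMAT keys with the values of one sample column and classifies the genotype; the function names and the example sample string are hypothetical, and only biallelic diploid calls are handled.

```python
def parse_sample(format_string, sample_string):
    """Pair FORMAT keys with the colon-separated values of one sample column.

    If the sample has fewer values than the FORMAT string has keys
    (as in abbreviated missing calls), the unmatched keys are left out.
    """
    keys = format_string.split(":")
    values = sample_string.split(":")
    return dict(zip(keys, values))


def genotype_class(gt):
    """Classify a diploid GT string such as '0/1' or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


# Made-up sample column for illustration only.
example_call = parse_sample(
    "GT:DP:AD:RO:QR:AO:QA:GL",
    "0/1:12:7,5:7:250:5:180:-10,0,-15",
)
example_class = genotype_class(example_call["GT"])
```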

Koutajoki_5136_180ID.recode.vcf contains all the SNPs found using this workflow on Koutajoki.

Number of metadata rows: 60
Number of header rows: 1
Number of variables: 180
Number of rows: 5136

Koutajoki_2268_179.recode.vcf contains all the SNPs found using this workflow after Hardy-Weinberg Equilibrium filtering on Koutajoki.

Number of metadata rows: 60
Number of header rows: 1
Number of variables: 179
Number of rows: 2268

Oulujoki_6827_80ID.recode.vcf contains all the SNPs found using this workflow on Oulujoki.

Number of metadata rows: 60
Number of header rows: 1
Number of variables: 80
Number of rows: 6827

Oulujoki_4632_80.recode.vcf contains all the SNPs found using this workflow on Oulujoki after Hardy-Weinberg Equilibrium filtering.

Number of metadata rows: 60
Number of header rows: 1
Number of variables: 80
Number of rows: 4632

data and codes/Bayescan and Bayescenv/bowtie+Stacks/*.sh

All the scripts for the Bayescan and BayeScEnv genome scans for both Koutajoki and Oulujoki.

data and codes/Bayescan and Bayescenv/bowtie+Stacks/*.geste

Geste files (population-structured genotype input file) obtained running PGDSpider on the vcf files and used to run Bayescan and Bayescenv codes.

data and codes/Bayescan and Bayescenv/bowtie+Stacks/*.txt

Migratory or resident behavior for each population/watershed (0.5 migratory, -0.5 residency)
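
The 0.5/-0.5 coding can be derived from a population map with a few lines of code. The sketch below only illustrates the encoding; it does not reproduce the exact layout of the .txt files in this folder, which is not described here. The function name, the behaviour labels ("migratory", "resident") and the population names are assumptions.

```python
def population_env_values(popmap_rows, codes=None):
    """Return {population: code} from (individual, population, behaviour) rows.

    Raises ValueError if individuals of the same population carry
    different behaviour labels.
    """
    if codes is None:
        codes = {"migratory": 0.5, "resident": -0.5}
    env = {}
    for _individual, population, behaviour in popmap_rows:
        value = codes[behaviour]
        if env.setdefault(population, value) != value:
            raise ValueError(f"population {population!r} has mixed behaviours")
    return env


# Made-up population map for illustration only.
example_popmap = [
    ("1", "popA", "migratory"),
    ("2", "popA", "migratory"),
    ("3", "popB", "resident"),
]
example_env = population_env_values(example_popmap)
```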

data and codes/Bayescan and Bayescenv/bwa+Stacks/*.sh

All the scripts for the Bayescan and BayeScEnv genome scans for both Koutajoki and Oulujoki.

data and codes/Bayescan and Bayescenv/bwa+Stacks/*.geste

Geste files (population-structured genotype input file) obtained running PGDSpider on the vcf files and used to run Bayescan and Bayescenv codes.

data and codes/Bayescan and Bayescenv/bwa+Stacks/*.txt

Migratory or residency behavior per each population/watershed (0.5 migratory, -0.5 residency)

data and codes/Bayescan and Bayescenv/bwa+dDocent/*.sh

All the scripts for the Bayescan and BayeScEnv genome scans for both Koutajoki and Oulujoki.

data and codes/Bayescan and Bayescenv/bwa+dDocent/*.geste

Geste files (population-structured genotype input file) obtained running PGDSpider on the vcf files and used to run Bayescan and Bayescenv codes.

data and codes/Bayescan and Bayescenv/bwa+dDocent/*.txt

Migratory or residency behavior per each population/watershed (0.5 migratory, -0.5 residency)
