This dataset provides the genomic data and computational framework supporting the development and validation of AmpliRAD, a novel sequencing method that integrates targeted amplicon sequencing with a genome-wide reduced-representation (RAD-seq) approach. The data was generated to demonstrate a streamlined workflow that enables simultaneous analysis of specific adaptive markers and broad population structure in non-model organisms. A primary focus of the dataset is the technical evaluation of the AmpliRAD method, including a direct comparison of sequencing yields and target enrichment success against samples processed using a traditional RAD-seq protocol. These comparative metrics include read counts mapping to target regions and the assessment of coverage evenness across 39 target loci, providing a benchmark for the method's efficiency. Additionally, the dataset includes genomic information from 96 Chinook salmon (Oncorhynchus tshawytscha) collected from the Dean River, British Columbia, including 55 adults and 41 juveniles. These data include targeted genotypes for SNPs on chromosome 28 associated with adult migration timing—specifically within the GREB1L region—as well as a panel of over 31,000 RAD-seq SNPs used to assess fine-scale population structure. Researchers can reuse this dataset as a template for implementing analysis of AmpliRAD data in other species or to explore the genetic basis of migration timing in northern salmon populations. While many different approaches could be used to analyze AmpliRAD data, the included bioinformatics pipeline offers a complete path from raw reads to population structure analysis, designed to be accessible to researchers seeking cost-effective genomic tools.

Sample Collection and DNA Extraction

Samples were collected from both adult and juvenile Chinook salmon from the Dean River, British Columbia. The adult samples were caught in the lower river recreational fishery by anglers from early June to mid-August in 2022 and 2023. Juveniles were sampled by electrofishing in August of 2022. A small piece of caudal fin tissue was removed from each individual prior to live release back into the river. The tissue was dried on whatman filter paper and stored in a coin envelope with information designating the date and location of sampling.

To extract DNA, a small piece (~1mm square) of each sample was placed into a well of a 96-well PCR plate. The tissue was digested with Proteinase K. DNA was extracted from the digested material according to an SPRI bead-based protocol (Ali et al., 2016) and eluted into low-TE buffer ((10mM Tris-HCl [pH 8.0] + 0.1mM EDTA).

Study Design

DNA from 96 samples (55 adult Chinook salmon and 41 juveniles) was arrayed in a PCR plate for ampliRAD sequencing (Supplemental Materials). To facilitate comparison, an additional 13 juvenile samples were sequenced as part of another full plate using a traditional RAD protocol without the targeted (i.e., amplicon) component (Ali et al., 2016). The shared steps of the ampliRAD and traditional RAD protocols were performed simultaneously on both sets of samples. The two datasets will be referred to below as the ampliRAD and the traditional RAD sample/datasets.

AmpliRAD Protocol

Target primer design

The first step in ampliRAD is the design of primers to target specific loci of interest. Here, we designed primers to amplify 39 target loci within the Chinook salmon genome (Supplemental Materials). The loci are situated on chromosome 28 within a region strongly associated with adult migration timing, a critical adaptive trait (Waples et al., 2022). Each target locus encompasses a single nucleotide polymorphism (SNP) of interest. The SNPs were identified in an analysis of Chinook salmon from across the species’ range that will be presented elsewhere. Nineteen of the SNPs had also been identified in previously published analyses (Koch & Narum, 2020; Prince et al., 2017; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019) (Supplemental Materials). Two of the SNPs are adjacent to naturally occurring SbfI restriction sites and were included to facilitate comparison with samples prepared using the traditional RAD protocol (Prince et al., 2017). The other 37 SNPs are not located near natural restriction sites.

Candidate primer pairs were identified with NCBI’s Primer BLAST software (Ye et al., 2012). Approximately 500 bp of sequence centered around each target SNP was used as input for Primer-BLAST. Minimum and maximum PCR product sizes were set to 250 and 320 bp respectively, but the range was sometimes expanded to improve primer search results (the range of PCR product sizes for our final primer panel was 180-325 bp). Minimum and maximum primer melting temperatures (T_m) were set to 57°C and 61°C respectively, with an optimum of 59°C. Primer pair specificity was checked against the Chinook salmon reference genome (Otsh_v2.0) from the Refseq representative genomes database (Christensen et al., 2018). Default settings were used for all other Primer-BLAST parameters.

Final primer pairs for each target locus were selected based on three criteria: 1) strict avoidance of known polymorphic sites (based on available whole-genome data) within the primer sequences; 2) positioning of the target SNP within the first 150 bp of the amplicon to avoid potential decreases in coverage due to the downstream fragmentation step (this criterion was considered flexible); and 3) minimization of the potential for off-target amplification. This last point was prioritized by selecting primers with multiple mismatches to off-target regions, ideally near their 3' end. However, certain sections of the target region on chromosome 28 pose challenges to highly specific primer design (e.g., due to limited differences from off-target regions). Therefore, for targets where primer specificity was expected to be low or moderate, we focused on selecting primer pairs whose most likely off-target products (based on Primer-BLAST results) would clearly align better to their own region of the genome than to our target region (e.g., because the ~300 bp of sequencing data from off-target products would contain multiple mismatches to the target region).

After primer sequences were designed, a sequence containing a four base-pair leader followed by the SbfI/PstI restriction enzyme recognition sequence was added onto the 5’ end of each forward primer (5’-ACGTCCTGCAGG–forward_primer_sequence–3’; Figure 1; Supplemental Materials). The finalized primers were ordered from Integrated DNA Technologies (IDT) at 100uM concentration in IDTE Buffer pH 8.0 (10 mM Tris-HCl/0.1 mM EDTA). Primers were not prepooled.

Multiplex PCR to amplify target loci

To amplify target loci and append restriction sites in the ampliRAD sample set, a working primer stock was prepared by combining all target primers and diluting them in low-TE buffer (10 mM Tris-HCl [pH 8.0], 0.1 mM EDTA) to a final concentration of 0.5 μM for each primer. For each PCR reaction, the following components were combined: 5 μL Platinum^TM Multiplex PCR Master Mix (Applied Biosystems 4464268), 2 μL of the combined primer working stock, and 3 μL of genomic DNA eluted in low-TE buffer (variable concentrations). PCR was performed with a heated lid using the following cycling parameters:

PCR program:

95°C 5 min

-------------------------------

95°C 30 s

57°C 90 s (35 cycles)

72°C 30 s

--------------------------------

72°C 10 min

4°C Hold

After the PCR was complete, reaction cleanup was performed with AMPure XP beads (Beckman Coulter A63880) and two 80% EtOH wash steps (see Supplemental Materials for detailed protocol), then the PCR product was eluted into 30 μL low-TE buffer (10mM Tris-HCl [pH 8.0] + 0.1mM EDTA). An aliquot of the PCR product was then diluted further 1:100 in low-TE buffer.

RAD Library Prep

To generate RAD libraries for both the ampliRAD and traditional RAD sample sets, we performed the following protocol, modified from Ali, et al. (2016), simultaneously for both sample sets. This protocol is described with additional details and suggestions in the Supplemental Materials.

Initial digestion: 9 μL of genomic DNA (variable concentrations) and 1 μL of diluted PCR product were combined with 2 μL of digestion master mix (see protocol in Supplemental Materials). This mixture was incubated at 37°C for 30 minutes followed by heat inactivation at 80°C for 20 minutes in a thermocycler.

Adapter ligation: Small aliquots of annealed biotinylated BestRAD SbfI/PstI adapters (50 nM; Ali, et al. [2016]) were thawed at 4°C in an open thermocycler. While the adapters continued to be held at 4°C, 2 μL of the adapters and 2 μL of ligation mastermix (see protocol in Supplemental Materials) were added to each well of digested DNA and mixed thoroughly. The adapter aliquots were immediately returned to the freezer. Proper handling of annealed adapters is important because adapter degradation due to many freeze-thaw cycles and extended times at room temperature is a leading cause of poor RAD library generation. The library preparation plate was then incubated at room temperature for ten minutes, placed in a refrigerator overnight, then moved to room temperature for 20 minutes the following morning before the reaction was inactivated with the addition of 2 μL of 0.5M EDTA to each well. Ali, et al. (2016) stopped the ligation reaction via heat inactivation, but we have found stoppage through addition of EDTA to be more efficient for our workflow. Note, the amount of EDTA added here maintains an inhibitory concentration through the addition of other reagents in the following step.

Pooling: After inactivation of the ligase, 8 μL from each well of the library preparation plate were combined into a single 1.5 mL Eppendorf tube. The remaining ligation reactions were frozen and stored for future use. To cleanup the DNA, an equal volume of Ampure XP beads was added to the tube and two 80% EtOH washes were performed. The pooled and cleaned DNA was eluted from the beads into 82ul of low-TE buffer and placed on a magnet to collect the bead. Finally, 80 μL were transferred to a new tube.

DNA shearing: The original RAD library prep protocol (Ali et al., 2016) mechanically sheared DNA using a Bioruptor NGS sonicator (Diagenode). Here, DNA was enzymatically sheared using NEBNext dsDNA Fragmentase (NEB M0348). To shear, 10 μL of 10X NEBNext dsDNA Fragmentase^® Reaction Buffer and 10 μL of dsDNA Fragmentase were added to the tube containing 80 μL of pooled DNA. The mixture was immediately vortexed for 3 seconds, quickly spun down, and incubated at 37°C for exactly 20 minutes in a thermoblock before the fragmentation reaction was stopped by adding 50 μL of 0.5M EDTA. To check that the fragmentation reaction had produced sufficient fragments in the desired size range, 1 μL of the reaction mixture was run on an agarose gel and compared to a DNA ladder. The size of fragments generated depends on the amount of time the reaction is allowed to proceed. In our experience, DNA obtained from samples collected and dried in field-based settings achieves desired fragment sizes on the shorter end of manufacturer recommendations for fragmentation times.

RAD tag physical isolation: Next, the halted fragmentation reaction proceeded directly into the RAD tag physical isolation step without cleanup. Physical isolation of RAD tags from the rest of the genomic DNA was performed as described in Ali et al. (2016). In brief, 20μL Dynabead M-280 streptavidin magnetic beads (Life Technologies, 11205D) were washed twice with 100 μL 2X Binding and Wash (B+W) buffer (10 mM Tris-HCl [pH 8.0], 1 mM EDTA [pH 8.0], 2 M NaCl), then resuspended in a volume of 2X B+W buffer equivalent to the final volume of sheared DNA from above (in this case, ~150 μL). The resuspended streptavadin beads were added to the tube of fragmented DNA in order to bind the biotin incorporated onto RAD tags via the biotinylated adapters. The mixture was incubated at room temperature for 20 minutes with vigorous mixing every 2 minutes. Next, the tube was placed on a magnet to immobilize the beads and the supernatant was removed. Five wash steps were performed to remove DNA not bound to the streptavidin beads (i.e., non-RAD tags). For each wash, the beads were resuspended by pipetting in 150 μL 1X B+W buffer, then the tube was placed back on the magnet to immobilize the beads and the supernatant was removed. This was performed three times with room temperature 1X B+W buffer, then twice with 1X B+W buffer heated to 56°C.

To liberate the RAD tags from the streptavidin beads, the beads were washed twice with 100 μL 1X rSmartCut buffer (NEB B6004), then resuspended in 40 μL 1X rSmartCut buffer. 2 μL of SbfI-HF (NEB R3642) was added, mixed thoroughly, and incubated at 37°C for 60 minutes with occasional gentle mixing. The streptavidin beads were collected on a magnet and the supernatant containing the liberated RAD tags was moved to a new tube. The liberated DNA was cleaned by addition of an equal volume of Ampure XP beads and two 80% EtOH washes, then eluted in 53 μL of low-TE buffer. The beads were collected on a magnet and 52 μL of DNA were transferred to a new tube.

Final library preparation: Sequencing libraries were prepared from the isolated RAD tags with the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB E7645) according to the manufacturer’s protocol. Plate barcodes were supplied with NEBNext Multiplex Oligos for Illumina (96 Unique Dual Index Primer Pairs) (NEB E6440). Adapters were diluted 1:10 as per manufacturer recommendations for low-input libraries, and size selection was conducted. PCR amplification was performed using 12 cycles. The final library was sequenced on approximately 10% of one NovaSeqX 25B lane using 150bp paired-end reads.

Bioinformatics and analysis

Alignment and filtering

Custom scripts were used to split raw sequencing reads into fastq files for individual samples based on well barcodes by requiring a perfect match to the barcodes. Residual adapter sequences were trimmed with Cutadapt (Martin, 2011), and the reads were aligned to the Chinook salmon reference genome (Otsh_v2.0) (Christensen et al., 2018) with bwa mem (Li, 2013) using default parameters.

At this point, PCR duplicates had not yet been removed from the bam files. The initial amplification of target loci in ampliRAD results in PCR products for a given target locus sharing identical start and end positions, and consequently nearly all ampliRAD target data would be eliminated with duplicate removal. Therefore, we created two sets of sorted and indexed BAM files using Samtools (Danecek et al., 2021; Li et al., 2009). One set, retaining all reads, was used for target loci analyses. The second set, processed with Samtools duplicate removal (markdup), was generated for all other analyses. A set of bam files without duplicates removed was also created for the traditional RAD sampleset for downstream comparison.

Evaluation of sequencing and targeting success

To evaluate overall sequencing success and the effectiveness of target enrichment, we calculated the following for all samples:

Total number of mapped reads (samtools view -c -F 0x4 sample.bam); (Danecek et al., 2021)).
Total number of reads mapped to the region containing target loci (samtools view -c -F 0x4 sample.bam NC_056456.1:13402786-13547912).
Percentage of all mapped reads aligned to the target region.

These calculations were performed on the bams prior to duplicate removal (see above). The calculations were repeated for the traditional RAD sampleset for comparison.

Calling genotypes at target loci

To evaluate the efficacy of AmpliRAD for generating genotypes at target loci, we used ANGSD (Korneliussen et al., 2014) to call genotypes at the target SNPs (-doGeno 4; -rf). Genotype likelihoods were calculated under the Samtools model (-GL 1; (Li, 2011a)), and were used to infer the major and minor alleles (-doMajorMinor 1) and their frequencies (-doMaf 3, (Kim et al., 2011). A uniform prior was used to estimate posterior genotype probabilities (-doPost 2). Filtering criteria included a minimum mapping quality score of 30 (-minMapQ 30), a minimum base quality score of 30 (-minQ 30), and a posterior genotype probability cutoff of 0.95. These filters were applied to bam files generated prior to duplicate removal. Due to the high read depth at target loci (see Results), ANGSD genotype calling proved memory-intensive. To mitigate this, we implemented a batch processing approach, calling genotypes for all samples in batches of six to eight loci and subsequently concatenating the results. This batch approach was taken in order to thoroughly evaluate all target loci data for the purposes of method validation, but subsampling the data would be a simpler approach for many analyses. The percentage of samples with called genotypes was calculated for each locus. For comparison, we also performed genotype calling on the traditional BestRAD samples using the same ANGSD parameters and batch processing approach. Results were visualized in R (R Core Team, 2024) with ggplot2 (Wickham, 2016).

Evaluating evenness of coverage across target loci

To evaluate read distribution across target loci and assess the evenness of sequencing coverage between target SNPs, we used samtools depth (Danecek et al., 2021) to calculate read depth at each amplicon position. These calculations incorporated base and mapping quality filters of 30 (-q 30; -Q 30). Reads containing deletions were included in the counts (-J), and only the first read of overlapping read pairs was counted (-s). For each sample, we calculated the total number of reads covering all target SNPs and subsequently determined the proportion of reads derived from each individual SNP. These proportions were visualized in R (R Core Team, 2024; Wickham, 2016).

The initial amplification of target loci produces numerous PCR products of identical length for each locus. However, fragmentation during library preparation can influence coverage depth, potentially reducing it near the 3’ amplicon end. This has implications for optimal target SNP placement within amplicons. To assess whether sequencing depth varied along amplicon lengths, we used the depth data generated above. For each target amplicon, we calculated the ratio of read depth at each position to the read depth at the first position. These ratios were then visualized in R (R Core Team, 2024; Wickham, 2016).

Identifying a panel of genome-wide RAD SNPs

Beyond targeted sequencing, the AmpliRAD method aims to generate genome-wide RAD data suitable for population genetic analyses. To assess its efficacy in this context, we identified polymorphic sites across the genome using ANGSD (Korneliussen et al., 2014). This analysis included all 96 AmpliRAD samples but excluded the traditional RAD dataset. The analysis was performed on bam files after removal of PCR duplicates (see above). Filtering criteria were as follows: minimum base and mapping quality scores of 30 (-minQ 30, -minMapQ 30); sites present in at least 50% of individuals (-minInd 48); a SNP p-value threshold of 1x10⁻⁶ (-SNP_pval 0.000001; (Kim et al., 2011)); base quality scores were recalibrated around indels using the SAMtools model (-baq 1; (Li, 2011b)); per-site allele frequencies were calculated (-doMaf 3; (Kim et al., 2011)) using a uniform prior (-doPost 2); major and minor alleles were inferred from genotype likelihoods (-doMajorMinor 1) under the SAMtools model (-GL 1) (Li, 2011a); and a minimum minor allele frequency of 0.05 (-minMaf 0.05) was applied.

Sequencing depth at each identified SNP was evaluated with samtools depth (Danecek et al., 2021). These calculations incorporated base and mapping quality filters of 30 (-q 30; -Q 30). Reads containing deletions were included in the counts (-J), and only the first read of overlapping read pairs was counted (-s).

Genetic analysis of Dean River Chinook salmon

To demonstrate the utility of combining targeted and reduced representation sequencing in a single protocol, we employed our ampliRAD dataset to analyze the association of the target loci with a critical adaptive trait and explore population structure in Dean River Chinook salmon.

The target loci in our AmpliRAD dataset reside within and upstream of a gene called GREB1L on chromosome 28, a region found to be strongly associated with adult migration timing across numerous populations in the southern portion of the species’ range (Hugentobler et al., 2024; Meek et al., 2020; Narum et al., 2018; Prince et al., 2017; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019; Waples et al., 2022; Willis et al., 2021). However, the association of GREB1L with migration timing has not yet been investigated in populations north of the contiguous United States, where migration tends to be more constrained and peak in the summer, as opposed to southern populations where migrations tend be be more extended with peaks in the spring and fall (Waples et al., 2004).

To investigate the association between migration timing and GREB1L variation in Dean River Chinook salmon, we analyzed the target SNP genotypes generated above. Because the discovery process and validation of some of the markers has not yet been published, we limited our analyses to three target SNPs previously shown to be associated with migration timing in multiple populations (Supplemental Materials; (Koch & Narum, 2020; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019)). Samples were assigned a consensus GREB1L genotype (homozygous early, heterozygous, or homozygous late) only if consistent genotype calls were obtained for all three SNPs. These consensus genotypes were used for downstream analyses for both adult and juvenile samples.

Adult samples had been caught near the Dean River mouth, allowing their collection dates to serve as reasonable proxies for the start of their upriver migration. Adult sample collection spanned 70 days, from June 4 to August 12 (Supplemental Materials). Samples from both collection years (2022 and 2023) were well represented across this span and were analyzed together. To test for an association between GREB1L variation and adult migration timing, we divided the sampling period into two 35-day windows (early season and late season) and tallied the number of early- and late-migrating alleles observed in each window based on the consensus GREB1L genotypes. A two-tailed Fisher's exact test was used to determine if GREB1L allele frequencies differed significantly between the early and late season groups. The temporal distributions of each genotype among the adult samples were visualized in R (R Core Team, 2024; Wickham, 2016).

To explore the relationship between spatial habitat usage and variation at GREB1L, we also calculated and compared GREB1L allele frequencies in groups of juveniles sampled from different locations in the Dean River basin. The same methods were employed for calling genotypes, evaluating allele frequencies, and testing for significance as described above for the adult samples.

To evaluate population structure in Dean River Chinook salmon, we analyzed the ampliRAD dataset with the genome-wide SNP panel identified above. Individuals underwent an initial screen for missing data, and were excluded if they were missing data at more than 50% of the sites in the panel. Next ANGSD (Korneliussen et al., 2014) was used to perform principle component analysis (PCA). A single read sampling approach was employed to mediate differences in coverage between samples (-doIBS 1; doCounts 1), quality scores were recalculated around indels (-baq 1; -ref path/to/Otsh_v2.0_genome; (Li, 2011b)); sites with base and mapping qualities below 30 were excluded (-minQ 30; -minMapQ 30); the major and minor alleles were inferred with genotype likelihoods under the samtools model (-doMajorMinor 1; -GL 1; (Li, 2011a)); a covariance matrix was generated with -doCov1; and a file was provided to restrict the analysis to loci in the SNP panel (-rf SNPs.txt). SNPs within the region containing target loci (NC_056456.1:13402786-13547912) were excluded. The results were visualized in R (R Core Team, 2024; Wickham, 2016).

AmpliRAD: A new method combining amplicon and RAD sequencing

Data files

Abstract

Description of the data and file structure

Workflow, Scripts, and Files

Software

Access information