AmpliRAD: A new method combining amplicon and RAD sequencing
Data files
Feb 10, 2026 version files 29.80 MB
-
adults_greb1l_genos_dates.csv
1.51 KB
-
align.keep_dups.sh
1.05 KB
-
align.rm_dups.sh
1.81 KB
-
all.keep_dups.bamlist
10.36 KB
-
all.keep_dups.snp_only_firsts.post95.targets.geno
13.61 KB
-
all.keep_dups.snp_only_firsts.post95.targets.geno.mod.T
13.57 KB
-
all.keep_dups.targamps.depth
5.37 MB
-
all.keep_dups.targamps.depth.mod.T
5.37 MB
-
ampliRAD_figures_code.R
15.59 KB
-
ampSamples.nodup.wbaq.minInd50pct.minmaf05.get_snps.mafs
2.36 MB
-
ampsamps.genomewide.snps.depth
7.24 MB
-
ampSamps.nodup.bamlist
8.16 KB
-
ampSamps.nodup.forPCA.annot.csv
4.17 KB
-
ampSamps.nodup.forPCA.bamlist
6.63 KB
-
ampSamps.nodup.forPCA.covMat
60.58 KB
-
ampSamps.nodup.missingcheck.ibs
7.69 MB
-
genomewide.snp.panel
666.96 KB
-
genomewide.snp.panel.bed
937.66 KB
-
get_alignment_stats.sh
1.70 KB
-
get_depth.sh
277 B
-
get_genos_targets.sh
422 B
-
get_PCA.sh
351 B
-
get_snps.sh
456 B
-
ids.txt
1.74 KB
-
read_counts.tsv
8.79 KB
-
README.md
12.46 KB
-
remove_adapters.sh
863 B
-
target_snps.reg
819 B
-
target.amplicons.bed
1.23 KB
Abstract
This dataset provides the genomic data and computational framework supporting the development and validation of AmpliRAD, a novel sequencing method that integrates targeted amplicon sequencing with a genome-wide reduced-representation (RAD-seq) approach. The data was generated to demonstrate a streamlined workflow that enables simultaneous analysis of specific adaptive markers and broad population structure in non-model organisms. A primary focus of the dataset is the technical evaluation of the AmpliRAD method, including a direct comparison of sequencing yields and target enrichment success against samples processed using a traditional RAD-seq protocol. These comparative metrics include read counts mapping to target regions and the assessment of coverage evenness across 39 target loci, providing a benchmark for the method's efficiency. Additionally, the dataset includes genomic information from 96 Chinook salmon (Oncorhynchus tshawytscha) collected from the Dean River, British Columbia, including 55 adults and 41 juveniles. These data include targeted genotypes for SNPs on chromosome 28 associated with adult migration timing—specifically within the GREB1L region—as well as a panel of over 31,000 RAD-seq SNPs used to assess fine-scale population structure. Researchers can reuse this dataset as a template for implementing analysis of AmpliRAD data in other species or to explore the genetic basis of migration timing in northern salmon populations. While many different approaches could be used to analyze AmpliRAD data, the included bioinformatics pipeline offers a complete path from raw reads to population structure analysis, designed to be accessible to researchers seeking cost-effective genomic tools.
Dataset DOI: 10.5061/dryad.d51c5b0fb
Description of the data and file structure
This README file provides a comprehensive overview of the bioinformatics workflow used for processing and analyzing sequencing data as described in AmpliRAD: A new method combining amplicon and RAD sequencing. This document is organized into sequential processing steps, with each section describing the scripts used, input files required, output files generated, and any relevant post-output processing. Files listed in bold have been uploaded to this DRYAD repository and are available for download. Other files used as input, but not uploaded here, are large sequencing data files. The raw fastq versions of these are available on NCBI with BioProject accension number: PRJNA1262529.
Workflow, Scripts, and Files
ADAPTER REMOVAL
This step removes residual adapter sequences from the raw fastq data prior to alignment using the program Cutadapt.
- Script/s used:
- remove_adapters.sh
- Bash script that loops through a list of sample IDs in order to run Cutadapt on each pair of fastq files for each sample.
- remove_adapters.sh
- Input file/s:
- ids.txt
- Text file containing an individual sample ID on each line.
- Raw fastq files (called within script)
- Read 1 and Read 2 fastq files for each sample named with the following naming convention: [SAMPLE_ID]_RA.fastq.gz and [SAMPLE_ID]_RB.fastq.gz
- ids.txt
- Output file/s:
- Two fastq files for each sample (containing Reads 1 and Reads 2 respectively)
- The files are named with the following convention: cutadapt_[SAMPLE_ID]RA.fq.gz and cutadapt[SAMPLE_ID]_RB.fq.gz
- Two fastq files for each sample (containing Reads 1 and Reads 2 respectively)
ALIGNMENT
Alignment and processing was performed twice, once with a script that removed duplicate read pairs and once with a script that retained them. Both scripts are provided here because this was the methodology used in our study. However, we recommend that researches looking to adapt our methods to their own work use a more streamlined process that performs alignment only once and then processes the aligments to generate two sets of bams (one with duplicates retained and one with removed), ideally with a single script.
- Script/s used:
- align.rm_dups.sh
- Bash script that aligns, removes duplicates, sorts, and generates alignment statistics for fastq files that have been through adapter removal. The script utilizes BWA mem for alignment and Samtools for post-alignment processing.
- align.keep_dups.sh
- Bash script that is identical to the above script (002_align.sh) except that the steps required for duplicate removal are excluded in order to generate bam files that retain duplicates.
- align.rm_dups.sh
- Input file/s:
- ids.txt
- Text file containing an individual sample ID on each line. Same file used as input for adapter removal
- cutadapt_[SAMPLE_ID]_RA.fq.gz
- output file from Adapter Removal step
- cutadapt_[SAMPLE_ID]_RB.fq.gz
- output file from Adapter Removal step
- Reference genome fasta files
- version used: Otsh_v2.0
- ids.txt
- Output file/s:
- [SAMPLE_ID].final.bam
- Final bam file with duplicates removed
- [SAMPLE_ID].stats
- Output from samtools flagstat with alignment statistics
- [SAMPLE_ID].keep_dups.final.bam
- Final bam file that retains duplicates
- [SAMPLE_ID].keep_dups.stats
- Output from samtools flagstat with alignment statistics
- [SAMPLE_ID].final.bam
EVALUATION OF SEQUENCING AND TARGETING SUCCESS
This step counts the number of mapped reads both inside and outside the target region to understand whether enrichment of target regions was successful.
- Script/s used:
- get_alignment_stats.sh
- Bash script that uses samtools view to count all mapped reads and reads mapped to the target region for each bam file. Uses bams that have not undergone duplicate removal. Loops through files. Assumes the script is being run in the same directory as the bam files.
- get_alignment_stats.sh
- Input file/s:
- None, but script calls bam files from within the script. Assumes the script is being run in the same directory as the bam files.
- Output file/s:
- read_counts.tsv
- File with the following columns:
- Filename: name of the bam file being evaluated
- all_mapped: the total number of mapped reads
- target_mapped: the total number of reads mapped to the target region
- no_target_mapped: the total number of reads mapped outside the target region
- all_no_dup: the total number of mapped reads present after duplicates have been excluded
- target_no_dup: total number of reads mapped to target region after exclusion of duplicates
- no_target_no_dup: total number of reads mapped outside target region after exclusion of duplicates
- File with the following columns:
- read_counts.tsv
CALL GENOTYPES AT TARGET SNPS
This step calls genotypes at target SNPs in both the ampliRAD and traditional RAD datasets in order to evaluate and compare the ability to successfully genotype ampliRAD targets.
- Script/s used:
- get_genos_targets.sh
- Bash script that uses Angsd to call genotypes at target snps for each sample.
- get_genos_targets.sh
- Input file/s:
- all.keep_dups.bamlist
- List of bamfiles. The bam files retain duplicates. All samples from both the ampliRAD and traditional RAD datasets are included.
- target_snps.reg
- region file containing CHR:POS location of each target SNP.
- all.keep_dups.bamlist
- Output file/s:
- all.keep_dups.snp_only_firsts.post95.targets.geno
- File containing genotype calls for each sample and each target SNP with the following columns:
- Chromosome
- Position
- First sample in bamlist
- Second sample in bamlist
- ...Nth sample in bamlist
- File containing genotype calls for each sample and each target SNP with the following columns:
- all.keep_dups.snp_only_firsts.post95.targets.geno
CALCULATE DEPTH IN TARGET REGION
This step calculates read depth at each position in each target amplicon in each individual sample in order to understand how coverage differs between target SNPs and across the length of amplicons.
- Script/s used:
- get_depth.sh
- Bash script that uses samtools depth to calculate the number of reads covering each position in each target amplicon
- get_depth.sh
- Input file/s:
- all.keep_dups.bamlist
- List of bamfiles (including the path to them). The bam files retain duplicates. All samples from both the ampliRAD and traditional RAD datasets are included. Note: data from the traditional RAD samples are excluded the downstream R script used for visualization.
- target.amplicons.bed
- BED file containing position information for each amplicon with the following columns:
- CHR
- Start POS
- Stop POS
- BED file containing position information for each amplicon with the following columns:
- all.keep_dups.bamlist
- Output file/s:
- all.keep_dups.targamps.depth
- File containing depth information for each position in each target amplicon in each sample. It contains the following columns:
- Chromosome
- Position
- Depth at Sample 1
- Depth at Sample 2
- ...Depth at Sample N
- File containing depth information for each position in each target amplicon in each sample. It contains the following columns:
- all.keep_dups.targamps.depth
IDENTIFY PANEL OF GENOME-WIDE SNPS
This step uses Angsd to identify a genome-wide panel of SNPs in the RAD data.
- Script/s used:
- get_snps.sh
- Bash script that uses ANGSD to call SNPs
- get_snps.sh
- Input file/s:
- ampSamps.nodup.bamlist
- File containing the names of bams (with duplicates removed) for all samples in the ampliRAD dataset
- Reference genome fasta
- version used: Otsh_v2.0
- ampSamps.nodup.bamlist
- Output file/s:
- ampSamples.nodup.wbaq.minInd50pct.minmaf05.get_snps.mafs
- File containing position and allele frequency information for each identified SNP. Contains the following columns:
- chromosome
- position
- major: the major allele
- minor: the minor allele
- ref: the reference allele
- knownEM: minor allele frequency under the "known minor allele" estimator
- unknownEM: minor allele frequency under the "unknown minor allele" estimator
- pK-EM: p-value under the known estimator
- pu-EM: pv-alue under the unknown estimator
- nIND: the number of individuals data at this position was present in
- File containing position and allele frequency information for each identified SNP. Contains the following columns:
- ampSamples.nodup.wbaq.minInd50pct.minmaf05.get_snps.mafs
- Processing of output files:
- The first two columns of ampSamples.nodup.wbaq.minInd50pct.minmaf05.get_snps.mafs** were copied to make a file called genomewide.snp.panel, and a colon (:) was used as a seprator between the columns. SNPs within the target region were removed. This file was used downstream as a region file to provide snp positions. It was also converted into a BED file called genomewide.snp.panel.bed for analyzing depth at each genome-wide snp position downstream.
CALCULATE DEPTH FOR GENOME-WIDE PANEL (EXCLUDING TARGET REGION)
This step uses samtools depth to calculate sequencing depth at each position in the genomewide SNP panel for each sample in the ampliRAD dataset.
- Script/s used:
- get_depth.sh
- Bash script that uses samtools depth to calculate the number of reads covering each position in each target amplicon.
- get_depth.sh
- Input file/s:
- ampSamps.nodup.bamlist
- List of bam files to include in analyses. It contains all the samples within the ampliRAD sample set, and referes to bam files that have undergone duplicate removal.
- genomewide.snp.panel.bed
- BED file containing position information for all SNPs in the genomewide panel (exludes target region SNPs).
- ampSamps.nodup.bamlist
- Output file/s:
- ampsamps.genomewide.snps.depth
- File containing read depth information for each ampliRAD dataset sample and each SNP in ghe genomewide. Contains the following columns:
- Chromosome
- Position
- Sample 1 depth
- Sample 2 depth
- ...Sample N depth
- File containing read depth information for each ampliRAD dataset sample and each SNP in ghe genomewide. Contains the following columns:
- ampsamps.genomewide.snps.depth
CALCULATE A COVARIANCE MATRIX FOR PRINCIPLE COMPONENT ANALYSIS
This step uses ANGSD to calculate a covariance matrix for the ampliRAD samples set using the sites in the genomewide SNP panel.
- Script/s used:
- get_PCA.sh
- Bash script that uses ANGSD to calculate a covariance matrix using a single read sampling approach. The genomewide panel of SNPs is used to predifine positions to evaluate.
- get_PCA.sh
- Input file/s:
- ampSamps.nodup.forPCA.bamlist
- Bamlist containing samples from the ampliRAD dataset, refering to bams that have undergone duplicate removal. Samples in this bamlist have been screened for excessive missingness (all samples in this bamlist have data present at at least 50% of the SNPs in the genomewide panel).
- genomewide.snp.panel
- Region file containing position information for all SNPs in the genomewide panel (format is CHR:POS).
- ampSamps.nodup.forPCA.bamlist
- Output file/s**:**
- ampSamps.nodup.forPCA.covMat
- Covariance matrix for evaluating population structure among the ampliRAD sampleset.
- ampSamps.nodup.forPCA.covMat
FIGURE GENERATION
Rscipt used to generate figures for the manuscript.
- Script/s used:
- ampliRAD_figures_code.R
- R script that generates figures for the manuscript based on many of the steps and output files described above.
- ampliRAD_figures_code.R
- Input file/s:
- all.keep_dups.targamps.depth.mod.T
- Contains depth information across all positions in all target amplicons. Same data as all.keep_dups.targamps.depth (see above) except that it has been transposed and recieved minor modifications to the header for compatability purposes with R.
- all.keep_dups.snp_only_firsts.post95.targets.geno.mod.T
- Contains genotype calls (posterior probs > 0.95) for all target SNPs in all samples. Same data as *all.keep_dups.snp_only_firsts.post95.targets.geno *(see above) except transposed and with minor formating changes to the header for compatibility.
- *adults_greb1l_genos_dates.csv *
- Contains GREB1L concensus genotype calls and collection date information for run timing analysis.
- ampSamps.nodup.missingcheck.ibs
- Contains single read sampling calls for all positions in the SNP panel for all ampliRAD samples--used to identify individuals with high missingness.
- ampSamps.nodup.forPCA.covMat
- Covariance matrix for the PCA
- ampSamps.nodup.forPCA.annot.csv
- Annotation file for the PCA
- all.keep_dups.targamps.depth.mod.T
- Output file/s:
- See figures in manuscript for visualization
- amplirad_fig2C.svg
- amplirad_fig2A.svg
- amplirad_fig3B.svg
- amplirad_fig3C.svg
- See figures in manuscript for visualization
Software
- ANGSD: version 0.935
- Samtools: version 1.19.2
- BWA: version 0.7.17
- R: version 4.4.3
- ggplot2
- tidyverse
- svglite
- dplyr
Access information
Raw sequencing data is available at NCBI under BioProject accession PRJNA1262529
Sample Collection and DNA Extraction
Samples were collected from both adult and juvenile Chinook salmon from the Dean River, British Columbia. The adult samples were caught in the lower river recreational fishery by anglers from early June to mid-August in 2022 and 2023. Juveniles were sampled by electrofishing in August of 2022. A small piece of caudal fin tissue was removed from each individual prior to live release back into the river. The tissue was dried on whatman filter paper and stored in a coin envelope with information designating the date and location of sampling.
To extract DNA, a small piece (~1mm square) of each sample was placed into a well of a 96-well PCR plate. The tissue was digested with Proteinase K. DNA was extracted from the digested material according to an SPRI bead-based protocol (Ali et al., 2016) and eluted into low-TE buffer ((10mM Tris-HCl [pH 8.0] + 0.1mM EDTA).
Study Design
DNA from 96 samples (55 adult Chinook salmon and 41 juveniles) was arrayed in a PCR plate for ampliRAD sequencing (Supplemental Materials). To facilitate comparison, an additional 13 juvenile samples were sequenced as part of another full plate using a traditional RAD protocol without the targeted (i.e., amplicon) component (Ali et al., 2016). The shared steps of the ampliRAD and traditional RAD protocols were performed simultaneously on both sets of samples. The two datasets will be referred to below as the ampliRAD and the traditional RAD sample/datasets.
AmpliRAD Protocol
Target primer design
The first step in ampliRAD is the design of primers to target specific loci of interest. Here, we designed primers to amplify 39 target loci within the Chinook salmon genome (Supplemental Materials). The loci are situated on chromosome 28 within a region strongly associated with adult migration timing, a critical adaptive trait (Waples et al., 2022). Each target locus encompasses a single nucleotide polymorphism (SNP) of interest. The SNPs were identified in an analysis of Chinook salmon from across the species’ range that will be presented elsewhere. Nineteen of the SNPs had also been identified in previously published analyses (Koch & Narum, 2020; Prince et al., 2017; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019) (Supplemental Materials). Two of the SNPs are adjacent to naturally occurring SbfI restriction sites and were included to facilitate comparison with samples prepared using the traditional RAD protocol (Prince et al., 2017). The other 37 SNPs are not located near natural restriction sites.
Candidate primer pairs were identified with NCBI’s Primer BLAST software (Ye et al., 2012). Approximately 500 bp of sequence centered around each target SNP was used as input for Primer-BLAST. Minimum and maximum PCR product sizes were set to 250 and 320 bp respectively, but the range was sometimes expanded to improve primer search results (the range of PCR product sizes for our final primer panel was 180-325 bp). Minimum and maximum primer melting temperatures (Tm) were set to 57°C and 61°C respectively, with an optimum of 59°C. Primer pair specificity was checked against the Chinook salmon reference genome (Otsh_v2.0) from the Refseq representative genomes database (Christensen et al., 2018). Default settings were used for all other Primer-BLAST parameters.
Final primer pairs for each target locus were selected based on three criteria: 1) strict avoidance of known polymorphic sites (based on available whole-genome data) within the primer sequences; 2) positioning of the target SNP within the first 150 bp of the amplicon to avoid potential decreases in coverage due to the downstream fragmentation step (this criterion was considered flexible); and 3) minimization of the potential for off-target amplification. This last point was prioritized by selecting primers with multiple mismatches to off-target regions, ideally near their 3' end. However, certain sections of the target region on chromosome 28 pose challenges to highly specific primer design (e.g., due to limited differences from off-target regions). Therefore, for targets where primer specificity was expected to be low or moderate, we focused on selecting primer pairs whose most likely off-target products (based on Primer-BLAST results) would clearly align better to their own region of the genome than to our target region (e.g., because the ~300 bp of sequencing data from off-target products would contain multiple mismatches to the target region).
After primer sequences were designed, a sequence containing a four base-pair leader followed by the SbfI/PstI restriction enzyme recognition sequence was added onto the 5’ end of each forward primer (5’-ACGTCCTGCAGG–forward_primer_sequence–3’; Figure 1; Supplemental Materials). The finalized primers were ordered from Integrated DNA Technologies (IDT) at 100uM concentration in IDTE Buffer pH 8.0 (10 mM Tris-HCl/0.1 mM EDTA). Primers were not prepooled.
Multiplex PCR to amplify target loci
To amplify target loci and append restriction sites in the ampliRAD sample set, a working primer stock was prepared by combining all target primers and diluting them in low-TE buffer (10 mM Tris-HCl [pH 8.0], 0.1 mM EDTA) to a final concentration of 0.5 μM for each primer. For each PCR reaction, the following components were combined: 5 μL PlatinumTM Multiplex PCR Master Mix (Applied Biosystems 4464268), 2 μL of the combined primer working stock, and 3 μL of genomic DNA eluted in low-TE buffer (variable concentrations). PCR was performed with a heated lid using the following cycling parameters:
PCR program:
95°C 5 min
-------------------------------
95°C 30 s
57°C 90 s (35 cycles)
72°C 30 s
--------------------------------
72°C 10 min
4°C Hold
After the PCR was complete, reaction cleanup was performed with AMPure XP beads (Beckman Coulter A63880) and two 80% EtOH wash steps (see Supplemental Materials for detailed protocol), then the PCR product was eluted into 30 μL low-TE buffer (10mM Tris-HCl [pH 8.0] + 0.1mM EDTA). An aliquot of the PCR product was then diluted further 1:100 in low-TE buffer.
RAD Library Prep
To generate RAD libraries for both the ampliRAD and traditional RAD sample sets, we performed the following protocol, modified from Ali, et al. (2016), simultaneously for both sample sets. This protocol is described with additional details and suggestions in the Supplemental Materials.
Initial digestion: 9 μL of genomic DNA (variable concentrations) and 1 μL of diluted PCR product were combined with 2 μL of digestion master mix (see protocol in Supplemental Materials). This mixture was incubated at 37°C for 30 minutes followed by heat inactivation at 80°C for 20 minutes in a thermocycler.
Adapter ligation: Small aliquots of annealed biotinylated BestRAD SbfI/PstI adapters (50 nM; Ali, et al. [2016]) were thawed at 4°C in an open thermocycler. While the adapters continued to be held at 4°C, 2 μL of the adapters and 2 μL of ligation mastermix (see protocol in Supplemental Materials) were added to each well of digested DNA and mixed thoroughly. The adapter aliquots were immediately returned to the freezer. Proper handling of annealed adapters is important because adapter degradation due to many freeze-thaw cycles and extended times at room temperature is a leading cause of poor RAD library generation. The library preparation plate was then incubated at room temperature for ten minutes, placed in a refrigerator overnight, then moved to room temperature for 20 minutes the following morning before the reaction was inactivated with the addition of 2 μL of 0.5M EDTA to each well. Ali, et al. (2016) stopped the ligation reaction via heat inactivation, but we have found stoppage through addition of EDTA to be more efficient for our workflow. Note, the amount of EDTA added here maintains an inhibitory concentration through the addition of other reagents in the following step.
Pooling: After inactivation of the ligase, 8 μL from each well of the library preparation plate were combined into a single 1.5 mL Eppendorf tube. The remaining ligation reactions were frozen and stored for future use. To cleanup the DNA, an equal volume of Ampure XP beads was added to the tube and two 80% EtOH washes were performed. The pooled and cleaned DNA was eluted from the beads into 82ul of low-TE buffer and placed on a magnet to collect the bead. Finally, 80 μL were transferred to a new tube.
DNA shearing: The original RAD library prep protocol (Ali et al., 2016) mechanically sheared DNA using a Bioruptor NGS sonicator (Diagenode). Here, DNA was enzymatically sheared using NEBNext dsDNA Fragmentase (NEB M0348). To shear, 10 μL of 10X NEBNext dsDNA Fragmentase® Reaction Buffer and 10 μL of dsDNA Fragmentase were added to the tube containing 80 μL of pooled DNA. The mixture was immediately vortexed for 3 seconds, quickly spun down, and incubated at 37°C for exactly 20 minutes in a thermoblock before the fragmentation reaction was stopped by adding 50 μL of 0.5M EDTA. To check that the fragmentation reaction had produced sufficient fragments in the desired size range, 1 μL of the reaction mixture was run on an agarose gel and compared to a DNA ladder. The size of fragments generated depends on the amount of time the reaction is allowed to proceed. In our experience, DNA obtained from samples collected and dried in field-based settings achieves desired fragment sizes on the shorter end of manufacturer recommendations for fragmentation times.
RAD tag physical isolation: Next, the halted fragmentation reaction proceeded directly into the RAD tag physical isolation step without cleanup. Physical isolation of RAD tags from the rest of the genomic DNA was performed as described in Ali et al. (2016). In brief, 20μL Dynabead M-280 streptavidin magnetic beads (Life Technologies, 11205D) were washed twice with 100 μL 2X Binding and Wash (B+W) buffer (10 mM Tris-HCl [pH 8.0], 1 mM EDTA [pH 8.0], 2 M NaCl), then resuspended in a volume of 2X B+W buffer equivalent to the final volume of sheared DNA from above (in this case, ~150 μL). The resuspended streptavadin beads were added to the tube of fragmented DNA in order to bind the biotin incorporated onto RAD tags via the biotinylated adapters. The mixture was incubated at room temperature for 20 minutes with vigorous mixing every 2 minutes. Next, the tube was placed on a magnet to immobilize the beads and the supernatant was removed. Five wash steps were performed to remove DNA not bound to the streptavidin beads (i.e., non-RAD tags). For each wash, the beads were resuspended by pipetting in 150 μL 1X B+W buffer, then the tube was placed back on the magnet to immobilize the beads and the supernatant was removed. This was performed three times with room temperature 1X B+W buffer, then twice with 1X B+W buffer heated to 56°C.
To liberate the RAD tags from the streptavidin beads, the beads were washed twice with 100 μL 1X rSmartCut buffer (NEB B6004), then resuspended in 40 μL 1X rSmartCut buffer. 2 μL of SbfI-HF (NEB R3642) was added, mixed thoroughly, and incubated at 37°C for 60 minutes with occasional gentle mixing. The streptavidin beads were collected on a magnet and the supernatant containing the liberated RAD tags was moved to a new tube. The liberated DNA was cleaned by addition of an equal volume of Ampure XP beads and two 80% EtOH washes, then eluted in 53 μL of low-TE buffer. The beads were collected on a magnet and 52 μL of DNA were transferred to a new tube.
Final library preparation: Sequencing libraries were prepared from the isolated RAD tags with the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB E7645) according to the manufacturer’s protocol. Plate barcodes were supplied with NEBNext Multiplex Oligos for Illumina (96 Unique Dual Index Primer Pairs) (NEB E6440). Adapters were diluted 1:10 as per manufacturer recommendations for low-input libraries, and size selection was conducted. PCR amplification was performed using 12 cycles. The final library was sequenced on approximately 10% of one NovaSeqX 25B lane using 150bp paired-end reads.
Bioinformatics and analysis
Alignment and filtering
Custom scripts were used to split raw sequencing reads into fastq files for individual samples based on well barcodes by requiring a perfect match to the barcodes. Residual adapter sequences were trimmed with Cutadapt (Martin, 2011), and the reads were aligned to the Chinook salmon reference genome (Otsh_v2.0) (Christensen et al., 2018) with bwa mem (Li, 2013) using default parameters.
At this point, PCR duplicates had not yet been removed from the bam files. The initial amplification of target loci in ampliRAD results in PCR products for a given target locus sharing identical start and end positions, and consequently nearly all ampliRAD target data would be eliminated with duplicate removal. Therefore, we created two sets of sorted and indexed BAM files using Samtools (Danecek et al., 2021; Li et al., 2009). One set, retaining all reads, was used for target loci analyses. The second set, processed with Samtools duplicate removal (markdup), was generated for all other analyses. A set of bam files without duplicates removed was also created for the traditional RAD sampleset for downstream comparison.
Evaluation of sequencing and targeting success
To evaluate overall sequencing success and the effectiveness of target enrichment, we calculated the following for all samples:
- Total number of mapped reads (samtools view -c -F 0x4 sample.bam); (Danecek et al., 2021)).
- Total number of reads mapped to the region containing target loci (samtools view -c -F 0x4 sample.bam NC_056456.1:13402786-13547912).
- Percentage of all mapped reads aligned to the target region.
These calculations were performed on the bams prior to duplicate removal (see above). The calculations were repeated for the traditional RAD sampleset for comparison.
Calling genotypes at target loci
To evaluate the efficacy of AmpliRAD for generating genotypes at target loci, we used ANGSD (Korneliussen et al., 2014) to call genotypes at the target SNPs (-doGeno 4; -rf). Genotype likelihoods were calculated under the Samtools model (-GL 1; (Li, 2011a)), and were used to infer the major and minor alleles (-doMajorMinor 1) and their frequencies (-doMaf 3, (Kim et al., 2011). A uniform prior was used to estimate posterior genotype probabilities (-doPost 2). Filtering criteria included a minimum mapping quality score of 30 (-minMapQ 30), a minimum base quality score of 30 (-minQ 30), and a posterior genotype probability cutoff of 0.95. These filters were applied to bam files generated prior to duplicate removal. Due to the high read depth at target loci (see Results), ANGSD genotype calling proved memory-intensive. To mitigate this, we implemented a batch processing approach, calling genotypes for all samples in batches of six to eight loci and subsequently concatenating the results. This batch approach was taken in order to thoroughly evaluate all target loci data for the purposes of method validation, but subsampling the data would be a simpler approach for many analyses. The percentage of samples with called genotypes was calculated for each locus. For comparison, we also performed genotype calling on the traditional BestRAD samples using the same ANGSD parameters and batch processing approach. Results were visualized in R (R Core Team, 2024) with ggplot2 (Wickham, 2016).
Evaluating evenness of coverage across target loci
To evaluate read distribution across target loci and assess the evenness of sequencing coverage between target SNPs, we used samtools depth (Danecek et al., 2021) to calculate read depth at each amplicon position. These calculations incorporated base and mapping quality filters of 30 (-q 30; -Q 30). Reads containing deletions were included in the counts (-J), and only the first read of overlapping read pairs was counted (-s). For each sample, we calculated the total number of reads covering all target SNPs and subsequently determined the proportion of reads derived from each individual SNP. These proportions were visualized in R (R Core Team, 2024; Wickham, 2016).
The initial amplification of target loci produces numerous PCR products of identical length for each locus. However, fragmentation during library preparation can influence coverage depth, potentially reducing it near the 3’ amplicon end. This has implications for optimal target SNP placement within amplicons. To assess whether sequencing depth varied along amplicon lengths, we used the depth data generated above. For each target amplicon, we calculated the ratio of read depth at each position to the read depth at the first position. These ratios were then visualized in R (R Core Team, 2024; Wickham, 2016).
Identifying a panel of genome-wide RAD SNPs
Beyond targeted sequencing, the AmpliRAD method aims to generate genome-wide RAD data suitable for population genetic analyses. To assess its efficacy in this context, we identified polymorphic sites across the genome using ANGSD (Korneliussen et al., 2014). This analysis included all 96 AmpliRAD samples but excluded the traditional RAD dataset. The analysis was performed on bam files after removal of PCR duplicates (see above). Filtering criteria were as follows: minimum base and mapping quality scores of 30 (-minQ 30, -minMapQ 30); sites present in at least 50% of individuals (-minInd 48); a SNP p-value threshold of 1x10⁻⁶ (-SNP_pval 0.000001; (Kim et al., 2011)); base quality scores were recalibrated around indels using the SAMtools model (-baq 1; (Li, 2011b)); per-site allele frequencies were calculated (-doMaf 3; (Kim et al., 2011)) using a uniform prior (-doPost 2); major and minor alleles were inferred from genotype likelihoods (-doMajorMinor 1) under the SAMtools model (-GL 1) (Li, 2011a); and a minimum minor allele frequency of 0.05 (-minMaf 0.05) was applied.
Sequencing depth at each identified SNP was evaluated with samtools depth (Danecek et al., 2021). These calculations incorporated base and mapping quality filters of 30 (-q 30; -Q 30). Reads containing deletions were included in the counts (-J), and only the first read of overlapping read pairs was counted (-s).
Genetic analysis of Dean River Chinook salmon
To demonstrate the utility of combining targeted and reduced representation sequencing in a single protocol, we employed our ampliRAD dataset to analyze the association of the target loci with a critical adaptive trait and explore population structure in Dean River Chinook salmon.
The target loci in our AmpliRAD dataset reside within and upstream of a gene called GREB1L on chromosome 28, a region found to be strongly associated with adult migration timing across numerous populations in the southern portion of the species’ range (Hugentobler et al., 2024; Meek et al., 2020; Narum et al., 2018; Prince et al., 2017; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019; Waples et al., 2022; Willis et al., 2021). However, the association of GREB1L with migration timing has not yet been investigated in populations north of the contiguous United States, where migration tends to be more constrained and peak in the summer, as opposed to southern populations where migrations tend be be more extended with peaks in the spring and fall (Waples et al., 2004).
To investigate the association between migration timing and GREB1L variation in Dean River Chinook salmon, we analyzed the target SNP genotypes generated above. Because the discovery process and validation of some of the markers has not yet been published, we limited our analyses to three target SNPs previously shown to be associated with migration timing in multiple populations (Supplemental Materials; (Koch & Narum, 2020; N. F. Thompson et al., 2020; T. Q. Thompson et al., 2019)). Samples were assigned a consensus GREB1L genotype (homozygous early, heterozygous, or homozygous late) only if consistent genotype calls were obtained for all three SNPs. These consensus genotypes were used for downstream analyses for both adult and juvenile samples.
Adult samples had been caught near the Dean River mouth, allowing their collection dates to serve as reasonable proxies for the start of their upriver migration. Adult sample collection spanned 70 days, from June 4 to August 12 (Supplemental Materials). Samples from both collection years (2022 and 2023) were well represented across this span and were analyzed together. To test for an association between GREB1L variation and adult migration timing, we divided the sampling period into two 35-day windows (early season and late season) and tallied the number of early- and late-migrating alleles observed in each window based on the consensus GREB1L genotypes. A two-tailed Fisher's exact test was used to determine if GREB1L allele frequencies differed significantly between the early and late season groups. The temporal distributions of each genotype among the adult samples were visualized in R (R Core Team, 2024; Wickham, 2016).
To explore the relationship between spatial habitat usage and variation at GREB1L, we also calculated and compared GREB1L allele frequencies in groups of juveniles sampled from different locations in the Dean River basin. The same methods were employed for calling genotypes, evaluating allele frequencies, and testing for significance as described above for the adult samples.
To evaluate population structure in Dean River Chinook salmon, we analyzed the ampliRAD dataset with the genome-wide SNP panel identified above. Individuals underwent an initial screen for missing data, and were excluded if they were missing data at more than 50% of the sites in the panel. Next ANGSD (Korneliussen et al., 2014) was used to perform principle component analysis (PCA). A single read sampling approach was employed to mediate differences in coverage between samples (-doIBS 1; doCounts 1), quality scores were recalculated around indels (-baq 1; -ref path/to/Otsh_v2.0_genome; (Li, 2011b)); sites with base and mapping qualities below 30 were excluded (-minQ 30; -minMapQ 30); the major and minor alleles were inferred with genotype likelihoods under the samtools model (-doMajorMinor 1; -GL 1; (Li, 2011a)); a covariance matrix was generated with -doCov1; and a file was provided to restrict the analysis to loci in the SNP panel (-rf SNPs.txt). SNPs within the region containing target loci (NC_056456.1:13402786-13547912) were excluded. The results were visualized in R (R Core Team, 2024; Wickham, 2016).
