Skip to main content

The Pacific lamprey genomic divergence, association mapping, temporal Willamette Falls, spatial rangewide datasets

Cite this dataset

Hess, Jon et al. (2020). The Pacific lamprey genomic divergence, association mapping, temporal Willamette Falls, spatial rangewide datasets [Dataset]. Dryad.


High rates of dispersal can breakdown coadapted gene complexes. However, concentrated genomic architecture (i.e., genomic islands of divergence) can suppress recombination to allow evolution of local adaptations despite high gene flow.  Pacific lamprey (Entosphenus tridentatus) is a highly dispersive anadromous fish.  Observed trait diversity and evidence for genetic basis of traits suggests it may be locally adapted. We addressed whether concentrated genomic architecture could influence local adaptation for Pacific lamprey. Using two new whole genome assemblies and genotypes from 7,716 single nucleotide polymorphism (SNP) loci in 518 individuals from across the species range, we identified four genomic islands of divergence (on chromosomes 01, 02, 04, and 22). We determined robust phenotype-by-genotype relationships by testing multiple traits across geographic sites. These trait associations likely explain genomic divergence across the species’ range. We genotyped a subset of 302 broadly distributed SNPs in 2,145 individuals for association testing for adult body size, sexual maturity, migration distance and timing, adult swimming ability, and larval growth.  Body size traits were strongly associated with SNPs on chromosomes 02 and 04. Moderate associations also implicated SNPs on chromosome 01 as being associated with variation in female maturity. Finally, we used candidate SNPs to extrapolate a heterogeneous spatiotemporal distribution of these predicted phenotypes based on independent datasets of larval and adult collections. These maturity and body size results guide future elucidation of factors driving regional optimization of these traits for fitness. Pacific lamprey is culturally important and imperiled.  This research addresses biological uncertainties that challenge restoration efforts.


The population genomic divergence dataset of 7716 quality-filtered SNP loci.

Divergence mapping
For characterization of SNP densities and FST statistics, we used a set of 7,716 unique SNP loci from previously published RAD-seq datasets (Hess et al. 2013; Smith et al. 2018), which passed a set of population genetic quality control filters (Supplemental Materials).  This set of 7,716 unique SNPs was a combination of overlapping groups of SNPs from a previous dataset (Hess et al. 2013; SNPs N = 8,772) and a de novo linkage mapping dataset (Smith et al. 2018; SNPs N = 7,977).  BOWTIE2 (Langmead and Salzberg 2012) was used to align these two datasets to the male reference assembly to define homologous loci.  For the 7,716 total SNPs passing the QC filters, 4,046 loci were unique to Hess et al. 2013, 1,418 loci were unique to Smith et al. 2018, and 2,252 SNPs were shared across datasets.  Marker positions based on BOWTIE2 alignments were compared between Pacific lamprey male and female genomes and the Pacific lamprey male and sea lamprey (Petromyzon marinus) male gametic genome (GenBank assembly accession: GCA_002833325.1) to characterize synteny.
Using these 7,716 SNPs genotyped for the same individuals from Hess et al. (2013; i.e., 16 collections with >20 individuals which totaled 482 individuals; Table 1), LOSITAN (Antao et al. 2008) was run using parameter settings of 50,000 simulations, confidence interval of 0.99, false discovery rate set to 0.1, subsample size of 20, simulated FST of 0.019 and an attempted FST of 0.021. We considered loci candidates for positive selection (adaptive loci) above a probability level of 0.995, and neutral loci were defined as falling between the 10th and 90th quantiles of the FST distribution. Any remaining SNPs were conservatively considered undetermined (neither adaptive nor neutral).

The six datasets used for association mapping with 302 unique SNP loci.

Association testing
Genotyping-in-thousands by sequencing (GT-seq, Campbell et al. 2015) was employed to genotype 308 genetic markers for the association testing analyses.  The GT-seq 308 loci were a subset of markers developed from the paired end consensus reads from the Hess et al. (2013) RAD-seq dataset.  The selection of loci and steps in development are described in detail in Supplemental Materials.  Locus selection began with a group of 457 total SNP loci considered in round 1, which included 120 for which TaqMan assays had already been designed (Hess et al. 2015).  Final optimization left 308 loci that worked best in GT-seq genotyping.  For all datasets used below in the association testing we filtered out individuals missing >10% of genotypes at the 308 loci.  Excluding the four species diagnostic loci and two duplicated loci provided 302 unique loci for association tests.
There were six datasets, five comprised of adults (JDD, S_BON, T_BON, WFA♀, and WFA♂) and one comprised of larvae (GAR), with which we performed association testing (Table 1).  Adult datasets were from the following three locations: males (WFA♂, N=136) and females (WFA♀, N=133) from Willamette Falls collected in 2016 (Willamette River, Oregon City, OR; 205.6 Rkm upstream from the Columbia River mouth), two datasets (S_BON, N=295 and T_BON, N=883) from Bonneville Dam in 2014 (235.1 Rkm upstream from the Columbia River mouth), and one dataset (JDD, N=656) from John Day Dam in 2014 and 2015 (346.9 Rkm upstream from the Columbia River mouth).  The following five adult traits were measured on all adult datasets: ordinal “day” of collection (timing of migration to the sample point), girth (mm), total “length” (mm), weight (g), and distance between dorsal fins (“interdorsal”, mm).  Interdorsal measurements have been suggested to serve as an indicator of maturation status in Pacific lamprey because the distance tends to decrease with maturation (Clemens et al. 2009).  We measured an additional migration trait for three adult datasets (S_BON, T_BON, and JDD) via a combination of passive integrated transponder (PIT) and radio tagging of individual fish and observing their furthest upstream detection from the release location (“Rkm”).  Further, since the males and females collected at Willamette Falls (WFA♂ and WFA♀) were being harvested, we were able to measure gonad weight as a proxy for maturity in those datasets.  Finally, a subset of the adult dataset from Bonneville Dam (S_BON) was used in a swim trial experiment within a flume (Kirk et al. 2016), in which the following three swimming behavioral traits were measured: “approached” experiment, passed challenge (“pass”), and passed challenge without fallback (“passrep”). Details of these swimming performance experiments can be found in Kirk et al. (2016) and Supplemental Materials.
A single group of larvae were artificially propagated using adults captured at Bonneville Dam.  These larvae were reared in a common garden experiment to generate early larval growth (“GAR”) rate data (N=337).  All larvae were spawned in the spring of 2015 and allowed to rear from 30 to 163 days after hatching at which point they were measured for growth. Growth rate was measured as length / time (“growth”), and also corrected growth rate [“growth rate_b”; (length – 4 mm) / time] to correct for length at hatch (~4 mm).
Intercorrelation among all measured traits in these six datasets (i.e., JDD, S_BON, T_BON, WFA♀, WFA♂, and GAR) was examined (based on Pearson’s r) to avoid excessive redundancy of predictor variables (│r│ > 0.95), and P-values were calculated (SAS Institute, Inc. 2000).  We performed univariate analyses using a general linear model (GLM) and a mixed linear model (MLM) with TASSEL v. 5.1.0 (Bradbury et al. 2007). The GLM is a fixed effects linear model that is used in TASSEL to identify significant associations between phenotypes and genotypes.  TASSEL takes population structure into account by using genetic principal coordinate axes as covariates in the model. The MLM is similar to GLM but includes both fixed effects (e.g. population structure, and genetic marker) and random effects (i.e., relationships among individuals) and can thus account for both population structure and kinship to reduce false positive associations (Yu et al. 2006).  Details on the covariates and ways in which loci were used taking population structure and relatedness into account in the GLM and MLM tests are provided in the Supplemental Materials. To account for multiple tests, only those associations with P-values less than the critical value as determined using the false discovery rate procedure described by Benjamini and Hochberg (1995) were considered significant.  The Benjamini and Hochberg (1995) false discovery rate approach has more power to detect significant differences than sequential Bonferroni correction (Narum 2006).  Critical values were calculated using the function p.adjust within the R package stats (RDC Team 2019).

Association mapping
The 308 SNP loci in the GT-seq panel were aligned to reference genomes using BOWTIE2. There were 306 that were each assigned to a single location on the Pacific lamprey male genome (99.4%), covering 70 different chromosomes with an average of 4.4 loci per chromosome (range 1 – 22).  Marker locations were based on the alignments of marker sequences to the Pacific lamprey male and female genomes, homologous scaffolds of the sea lamprey genome, and positions on the previously published Pacific lamprey linkage map (Smith et al. 2018).
Adjusted P-values from the association testing described above were log transformed (-LOG10) and plotted by consensus genome position on the Pacific lamprey male genome. We tested correlation of association tests -LOG10(P) with FST from the rangewide divergence to understand whether trait associations may explain the high divergence observed at the rangewide scale for the subset of markers shared between datasets.  Among the 308 SNPs, there were 230 neutral SNPs, 41 adaptive markers SNPs, and a set of 31 “intermediate” SNPs that did not fit definitions of putatively neutral and putatively adaptive (divergence mapping).  Finally, four loci were species-diagnostic, and 2 loci were duplicated.  Therefore, there were 302 unique markers available for these association analyses. These markers included 38 SNPs that were mostly adaptive loci that were categorized into the following 4 groups of statistically linked loci: A (N=10), B (N=13), C (N=7), and D (N=8, Hess et al. 2013).
Two temporal datasets of Willamette Falls (2014-2015) adults with 302 unique SNP loci

For the temporal datasets, we used individuals collected from two successive spawning runs at Willamette Falls (2014 – 2015; N of 868 and 581, respectively) over which it was possible to randomly sample the majority of the annual adult migration of Pacific lamprey (typically February – August) in weekly strata. A daily abundance estimate (Whitlock et al. 2019) was used to expand candidate SNP genotypic proportions in the weekly strata.  We estimated the abundance and 95% confidence intervals of the candidate genotypes using a bootstrapping method (Steinhorst et al. 2017) that was automated in R for a broad set of fisheries applications that require stratified sampling (Thomas Delomas, PSMFC/IDFG,".  One biological complexity was that a portion of the adults encountered before May probably overwintered and experienced shrinkage in body size due to advanced maturation (Beamish 1980). Therefore, in addition to characterizing allele frequencies of candidate SNPs Etr_464 and Etr_5317, we categorized fish by body length to provide insight into the transition between overwintered fish and newly-arrived migrants.
The spatial dataset of larvae/juveniles rangewide with 85 SNP loci
For the spatial dataset, we primarily used collections of larvae and juveniles (95% of dataset of N=3,435) but included some adult collections that were distributed widely across the species’ range.  Larvae and juveniles were the ideal life stage to represent genotypic distributions of individuals that successfully spawned at discrete locations throughout the range. Adult collections were used to fill in portions of the range where larval samples were not available.  Genotyping was partially conducted with a TaqMan assay panel (Hess et al. 2015), which overlapped the GT-seq panel by 85 SNPs they had in common.  COLONY v. (Jones and Wang 2010) was used to reconstruct full-sibling families (Wang 2004) using the 85 shared SNPs on each of the 70 collections. We analyzed all collections together as one using the following parameter settings: polygamous mating for males and females without inbreeding, full-likelihood, medium length of run, no allele updating, and no sibship priors. Only 1 collection out of the 7 adult collections had full siblings (N=13, Stamp River, B.C.) which were maintained to accurately represent this small spawning segment.  We excluded duplicate genotypes, 797 full siblings, and collections with fewer than 5 individuals, resulting in a final set of 57 collections consisting of a total of 2,581 individuals each representing a unique family (Table S1). This dataset was then used to calculate allele frequencies across collections for the representative candidate SNPs Etr_464 and Etr_5317 within the adaptive regions on chromosomes 01 and 02, respectively.

Usage notes

There are 4 separate data files contained within this DOI, and the notes below are separated into four sections to accompany these datafiles.
The population genomic divergence dataset of 7716 quality-filtered SNP loci
This is an excel file that includes tabs with an input file in genepop format that was used to run the FST outlier analysis. In addition, the metadata include the approximate genomic positions of the loci used to produce Manhattan plots for divergence mapping.
The six datasets used for association mapping with 302 unique SNP loci.
The excel spreadsheet includes a README file to guide users through the tabs contained in the spreadsheet which include all the data necessary to run the association tests and the genome alignment positions to perform the association mapping.
Two temporal datasets of Willamette Falls (2014-2015) adults with 302 unique SNP loci
This Excel spreadsheet contains all the necessary data to replicate our analysis of temporal distribution of the genotypes in the Willamette Falls adult Pacific lamprey collected in 2014 and 2015.

Spread sheet contents:
WFA14_15RawData: The genotypic data collected from 295 loci and the individual trait data (length, capture date)
WFA14_15InputData: Only the subset of data used for the analysis of the study, i.e., the length category + Etr_5317 genotypes, the Etr_464 genotypes, the Etr_5317 genotypes, and the capture date and weekly strata.
R_script_CI_bootstrapping: The exact R scripts run to perform the abundance estimates and 95% Confidence Intervals using the input data and 10,000 bootstraps
.csv files: The exact content in each csv file that is called out in the R scripts; Each run year (2014 and 2015) was divided in half by approximately 50% of the total estimated daily abundance (Early vs Late) and each category was estimated within each half of the run:
2 run years x 2 halves of the run X 3 categories = 12 analyses (12 lines of R script to run total analysis).
Results: All abundance estimates and 95% CI for the categories of the 12 analyses and the charts used for the Figure S7 
The spatial dataset of larvae/juveniles rangewide with 85 SNP loci
This Excel spreadsheet contains all the necessary information to recreate the metadata used to generate the Table S1 that appeared in the Supplemental Materials.
There are separate tabs within the spreadsheet containing the input files to run COLONY can be found on tabs "MakerTypeErrorRate" and "OffspringGenotype" which can be run as text files in the program COLONY.


Bonneville Power Administration, Award: BPA Project # 2008-524-00