Data from: Genetic-environment associations explain genetic differentiation and variation between western and eastern North Pacific Rhinoceros Auklet (Cerorhinca monocerata) breeding colonies
Data files
Jul 14, 2025 version files (102.09 MB total)
- auklets.vcf (102.07 MB)
- README.md (1.67 KB)
- RHAUenvVar.csv (18.48 KB)
Abstract
Animals are strongly connected to the environments they live in and may become adapted to local environments. Examining genetic-environment associations of key indicator species, like seabirds, provides greater insight into the forces that drive evolution in marine systems. Here we examined a RADseq dataset of 19,213 SNPs for 99 Rhinoceros Auklets (Cerorhinca monocerata) from five western Pacific and ten eastern Pacific breeding colonies. We used partial-redundancy analyses to identify candidate adaptive loci and to quantify the effects of environmental variation on population genetic structure. We identified 262 candidate adaptive loci, which accounted for 3.0% of the observed genetic variation among western Pacific and eastern Pacific breeding colonies. Genetic variation was more strongly associated with pH and maximum current velocity than with maximum sea surface temperature. Genetic-environment associations explain genetic differences between western and eastern Pacific populations; however, genetic variation within the western and eastern Pacific Ocean populations appears to follow a pattern of isolation-by-distance. This study is the first to quantify the relationship between environmental and genetic variation for this widely distributed marine species and provides greater insight into the evolutionary forces that act on marine species.
This README file is associated with the article Genetic-environment associations explain genetic differentiation and variation between western and eastern North Pacific Rhinoceros Auklet (Cerorhinca monocerata) breeding colonies. It describes the RHAUenvVar.csv data file. These data were used to conduct the redundancy analyses and detect associations between loci and environmental variables.
IID = individual ID
Lat = latitude
Long = longitude
salt.max/min = maximum and minimum measured salinity value (probably in PSU or ppt)
cv.max/min = maximum and minimum current velocity (m/s)
temp.max/min = maximum and minimum sea surface temperature (°C)
ph = pH
salt.mean = mean salinity value
Res.PC1 = resistance analysis principal component 1
Res.PC2 = resistance analysis principal component 2
pg.PCA1 = principal coordinate 1 for genetic data
pg.PCA2 = principal coordinate 2 for genetic data
Coord1 A = derived coordinate
Coord2 A = second derived coordinate
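As an illustration of how the variables listed above might be read for analysis, the following is a minimal Python sketch using only the standard library. The exact column headers in RHAUenvVar.csv are not reproduced in this README, so the names in `ASSUMED_NUMERIC_COLUMNS` (for example, reading "salt.max/min" as two columns, `salt.max` and `salt.min`) and the helper `read_env_table` are assumptions, and the sample rows are made up rather than taken from the dataset.

```python
import csv
import io

# ASSUMPTION: column names inferred from the README shorthand; the real
# headers in RHAUenvVar.csv may differ and should be checked first.
ASSUMED_NUMERIC_COLUMNS = [
    "Lat", "Long", "salt.max", "salt.min", "cv.max", "cv.min",
    "temp.max", "temp.min", "ph", "salt.mean",
]


def read_env_table(handle):
    """Return one dict per individual; known numeric columns become floats.

    Columns that are absent or empty for a row are stored as None.
    """
    rows = []
    for record in csv.DictReader(handle):
        row = dict(record)
        for name in ASSUMED_NUMERIC_COLUMNS:
            value = row.get(name)
            if value is None or value.strip() == "":
                row[name] = None
            else:
                row[name] = float(value)
        rows.append(row)
    return rows


# Made-up rows for illustration only; these values are NOT from the dataset.
example = io.StringIO(
    "IID,Lat,Long,ph,temp.max\n"
    "bird01,44.4,141.3,8.1,21.5\n"
    "bird02,48.9,-125.0,8.0,\n"
)
table = read_env_table(example)
print(table[0]["IID"], table[0]["ph"])
```

With the real file, the same function could be called as `read_env_table(open("RHAUenvVar.csv", newline=""))`, after confirming the header names.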
Additionally, I have attached the VCF file with all of the SNP data: auklets.vcf.
The VCF file is a VCFv4.2 file. The first 13 lines of the VCF characterize the attributes of the file and follow the standard format. The next 101499 lines are the loci/contig names corresponding to where SNPs are located. The raw genetic data are contained on the lines that follow, starting at line 101513. These raw data are needed to complete all population genetic analyses.
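The layout described above can be checked with a short script. The sketch below is only an illustration: it assumes the usual VCF 4.2 conventions (meta-information lines starting with `##`, contig declarations starting with `##contig`, and a single `#CHROM` column-header line before the records), and the function name `summarize_vcf_layout` and the toy lines are hypothetical, not taken from auklets.vcf.

```python
def summarize_vcf_layout(lines):
    """Count VCF line types and report the 1-based line of the first record.

    ASSUMPTION: standard VCF 4.2 layout, i.e. "##" meta lines (including
    "##contig" entries), one "#CHROM" column header, then data records.
    """
    counts = {"meta": 0, "contig": 0, "column_header": 0, "records": 0}
    first_record_line = None
    for number, line in enumerate(lines, start=1):
        if line.startswith("##contig"):
            counts["contig"] += 1
        elif line.startswith("##"):
            counts["meta"] += 1
        elif line.startswith("#CHROM"):
            counts["column_header"] += 1
        elif line.strip():
            counts["records"] += 1
            if first_record_line is None:
                first_record_line = number
    return counts, first_record_line


# Toy VCF for illustration only; not real data from auklets.vcf.
toy_vcf = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "##contig=<ID=locus_1>",
    "##contig=<ID=locus_2>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tbird01\tbird02",
    "locus_1\t10\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t./.",
    "locus_2\t33\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t1/1",
]
counts, first_record_line = summarize_vcf_layout(toy_vcf)
print(counts, first_record_line)
```

Because the function accepts any iterable of lines, an open file handle on the real VCF can be passed directly, which avoids loading the 102 MB file into memory at once.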
DNA was extracted from blood samples using a salting out extraction protocol (for samples from the eastern Pacific, Miller, Dykes & Polesky, 1988) or a Qiagen DNeasy kit (for samples from the western Pacific). Genomic DNA was used to construct nextRAD genotyping-by-sequencing libraries (SNPsaurus, LLC) using the Sbf1 enzyme as described by Baird et al. (2008). Genomic DNA was first fragmented with Nextera reagent (Illumina, Inc), which also ligates short adapter sequences to the ends of the fragments. The Nextera reaction was scaled for fragmenting 15 ng of genomic DNA, although 20 ng of genomic DNA was used for input to compensate for the amount of degraded DNA in the samples and to increase fragment sizes. Fragmented DNA was then amplified for 27 cycles at an annealing temperature of 74 °C, with one of the primers matching the adapter and extending ten nucleotides into the genomic DNA with the selective sequence GTGTAGAGCC. Only those fragments starting with that sequence can be hybridized by the selective sequence of the primer and efficiently amplified. This protocol resulted in a final library fragment size of 450 bp (Etter et al. 2011). The nextRAD libraries were sequenced on an Illumina NovaSeq 6000 with one lane of single-end 150 bp reads. All genomic library preparations and sequencing were completed at the University of Oregon.
Sequences were demultiplexed and then trimmed to 122 bp by SNPsaurus using the SNPsaurus pipeline with the bbduk package (BBMap tools, http://sourceforge.net/projects/bbmap/). Next, we assembled reference loci by collecting 10 million high-quality reads, evenly from all of the samples (~70,000 reads per individual were used), and excluding loci with fewer than seven or more than 700 reads. This range of seven to 700 represents a standardized number calculated by SNPsaurus to retain as many loci as possible without compromising the quality of the data with low-quality reads. Overall mean depth of the reference genome was 65x. Loci that met the previously stated criteria were then aligned to the assembled reference genome using a custom script from SNPsaurus (SNPsaurus, LLC). For the de novo alignment, we mapped 152,204,819 of the original 289,864,865 single-end reads to the de novo reference genome with bbmap (BBMap tools) at an identity threshold of 0.95. Genotype calling was done using the callvariants tool (BBMap tools), with the following settings (multisample=t rarity=0.05 minallelefraction=0.05 usebias=f ow=t nopassdot=f minedistmax=5 minedist=5 minavgmapq=15 minreadmapq=15 minstrandratio=0.0 strandedcov=t). The genotype data were converted to a VCF file, which we filtered to remove loci with a minimum frequency of less than 3% or a Q-score below 20, and to remove all individuals with greater than 60% missing data (an additional 13 individuals did not meet this criterion and were omitted from all analyses). The average percentage of missing data was much lower than this original threshold (mean = 5.8% missing data; median = 3.3% missing data), although we included three individuals with 40% missing data because they grouped with other individuals from the same population. To ensure that relatedness did not skew our results, we calculated relatedness among individuals in Genodive 3.04 (Meirmans, 2020).
Relatedness among individuals from the same population was <0.08, with the exception of one pair that had a relatedness of 0.26, suggesting that one set of full siblings from Teuri was present in our data. Given the low level of relatedness in our data, we retained all samples in our analyses. We retained all of the 19,213 SNPs remaining after filtering for our examination of genetic-environment associations.
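The per-individual missing-data filter described above (individuals with greater than 60% missing genotypes removed, with the mean and median missingness then reported) can be sketched as follows. This is not the pipeline used for the dataset; the function names, the toy genotypes, and the rule that any genotype containing "." counts as missing are assumptions made for illustration.

```python
from statistics import mean, median


def missing_fraction_per_sample(sample_names, genotype_rows):
    """Return {sample: fraction of SNPs with a missing genotype}.

    genotype_rows holds one list of VCF sample fields per SNP, ordered like
    sample_names. ASSUMPTION: a GT containing "." (e.g. "./.") is missing.
    """
    n_snps = len(genotype_rows)
    if n_snps == 0:
        raise ValueError("no SNP rows supplied")
    missing = {name: 0 for name in sample_names}
    for row in genotype_rows:
        for name, field in zip(sample_names, row):
            genotype = field.split(":")[0]  # GT is the first FORMAT subfield
            if "." in genotype:
                missing[name] += 1
    return {name: count / n_snps for name, count in missing.items()}


def drop_high_missing(fractions, threshold=0.60):
    """Keep samples whose missing fraction does not exceed the threshold."""
    return [name for name, frac in fractions.items() if frac <= threshold]


# Toy genotypes for illustration only; not taken from auklets.vcf.
samples = ["a", "b", "c"]
rows = [
    ["0/1", "./.", "0/0"],
    ["0/0", "./.", "1/1"],
    ["./.", "./.", "0/1"],
    ["0/1", "0/1", "0/0"],
]
fractions = missing_fraction_per_sample(samples, rows)
kept = drop_high_missing(fractions)
print(kept, mean(fractions.values()), median(fractions.values()))
```

The threshold is a parameter so that a stricter cutoff (for example, 0.40) can be tried without changing the counting logic.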