Footprints of local adaptation span hundreds of linked genes in the Atlantic silverside genome
Wilder, Aryn; Palumbi, Stephen; Conover, David; Overgaard Therkildsen, Nina (2020), Footprints of local adaptation span hundreds of linked genes in the Atlantic silverside genome, Dryad, Dataset, https://doi.org/10.5061/dryad.jm63xsj7w
We provide excepts of the methods used to generate the data in this repository. Full details can be found in the associated publication and supplemental material (doi:10.1002/evl3.189).
Data generation, read mapping and SNP calling
Atlantic silverside muscle samples collected from Georgia (GA), New York (NY), Gulf of Maine (GoM) and Gulf of St. Lawrence (GSL) were sequenced by Therkildsen & Palumbi (2017) to a final, average depth of 1.5x across the reference transcriptome. See Therkildsen & Palumbi (2017) and Therkildsen et al. (2019) for further detail. We added 47 additional samples from Oregon Inlet, North Carolina (NC) near Cape Hatteras. Individually barcoded sample libraries were prepared at Cornell’s Biotechnology Resource Center using similar methods to Therkildsen & Palumbi (2017), with a few minor modifications. Reagents from the Illumina Nextera kit (96 sample Nextera DNA Library Prep Kit) were used at 1/3 the recommended concentration in 1/10 the recommended volume (5ul instead of 50ul), with 2ng of input DNA. Individual libraries were pooled, and size-selected using a Pippin Prep to remove fragments <286 bp (150 bp insert plus 136 bp of Illumina adapters). We generated paired-end 75 bp sequences on the NextSeq500. Filtered mapped reads for the NC samples had a final average depth of 1.5x.
Read adapters were trimmed from the paired-end reads using Trimmomatic v.0.3.6 (Bolger et al. 2014), and mapped to the reference transcriptome with Bowtie2 v2.2.3 (Langmead and Salzberg 2012), using the very-sensitive-local mode and retaining unpaired, orphaned and concordantly paired reads with mapping quality >20. Duplicate reads were removed with MarkDuplicates (PICARD Tools v1.139; http://broadinstitute.github.io/picard/). To avoid spurious SNP calls stemming from differences in read length in the NC sample (Leigh et al. 2018), we called SNPs across the original four population sample set (GA, NY, GoM and GSL). Downstream analyses of all samples (including NC) were performed on this set of SNPs. Because NC is located in the center of the distribution of the southern phylogeographic group (Mach et al. 2005; Lou et al. 2018), we do not expect to have missed variants private to that population by excluding it from the SNP calling. Biallelic SNPs were called in the program ANGSD v. 0.912 (Korneliussen et al. 2014). ANGSD estimates genotype likelihoods at each site based on the aligned reads and their mapping and sequencing quality scores. Using genotype likelihoods for low-coverage sequencing data allows for the incorporation of uncertainty in genotyping, providing more accurate population genetic inference (Li 2011; Nielsen et al. 2011). We called polymorphic sites across the four original populations (n = 189), at sites with a probability <1e-6 of being monomorphic, using the SAMtools model of genotype likelihoods in ANGSD. Bases with quality score <20 were excluded. We excluded sites with minor allele frequency (MAF) < 0.01, total read depth <75 and >759 (mean depth +1 standard deviation), or data from <75 individuals. ANGSD called 1,942,329 SNPs that passed the above filters. From these, we filtered out 38,210 SNPs within repetitive elements identified by RepeatMasker (Smit et al. 2017). The final dataset comprised 1,904,119 bialleleic SNPs across the ~52 MB transcriptome.
Estimating allele frequencies, identifying FST outliers and functional annotation of SNPs
At all globally variant sites, major and minor alleles were inferred based on global allele frequencies and minor allele frequencies (MAF) were estimated within populations from genotype likelihoods at sites with data for at least 10 individuals in the program ANGSD. We estimated pairwise FST at each SNP between population pairs using the 2-dimensional site frequency spectrum (2D SFS) as prior.
For each SNP, we estimated the global FST across all five populations as in the OutFLANK script FST functions.R, except that we made a slight modification to the script so that we could directly input allele frequencies and sample sizes estimated in ANGSD (which takes genotype likelihoods into account), rather than estimating allele frequencies from allele counts in OutFLANK. FST was then estimated by the procedure implemented in OutFLANK. We fit the X2 distribution to the pruned SNP set by trimming 5% from each side of the FST distribution and using q<0.05 to estimate the degrees of freedom and FSTbar of the X2 distribution in OutFLANK (Whitlock and Lotterhos 2015). We then applied the X2 fit to all SNPs across the transcriptome to identify FST outliers using q<0.05.
To understand the potential function of SNPs within the transcriptome, we used two programs, Transdecoder and GeneMarkS-T, to predict coding sequences (CDS) and untranslated regions (UTR). From these annotations, we used snpEff to predict the function (e.g., missense variant, synonymous variant, 3’ UTR variant) of each SNP.
Identifying LD blocks and ordering Atlantic silverside genes along medaka chromosomes
We estimated linkage disequilibrium between all pairs of SNPs identified as FST outliers. Because the majority of these SNPs were fixed for opposite alleles at the extremes of the geographic distribution, precluding LD inference, we limited our LD analysis to the three sampling locations in the center of the range (NC, NY and GoM). Within each of these populations, we estimated LD across pairs of SNPs with MAF>0.1 in the program ngsLD. We summarized patterns by calculating an average LD between pairs of contigs that had at least 2 outlier SNPs with MAF>0.1. For each pair of these contigs, we calculated the mean D’ across all pairs of outlier SNPs between contigs. We used a hierarchical clustering approach to determine how contigs clustered into blocks within each population by generating dendrograms from the pairwise LD matrix using the “Ward.D” method in the hclust function in R (Murtagh and Legendre 2014).
We downloaded all medaka peptide sequences and used blastx to compare silversides contigs to medaka peptides. We then compared medaka peptides to silverside contigs using tblastn with soft masking and an e-value < 10-4. 19,230 of the 20,998 contigs had a reciprocal best hit to the medaka genome. Of these, 17,724 contigs mapped to one of the 24 medaka chromosomes.
The spreadsheet SampleLocationDetails.xlsx in this repository provides locality data associated with sequence data for samples archived in NCBI's sequence read archive (SRA).
AtlanticSilversides_TranscriptomeWideSNPs_sorted.txt gives data for 1,904,119 SNPs analyzed in the study.
AtlanticSilversides_TranscriptomeWideSNPs_description.txt describes the columns of the SNP data table.
National Science Foundation, Award: OCE-1434325
National Science Foundation, Award: OCE-1756316