Selection over small and large spatial scales in the face of high gene flow

Rumberger, Camille1 ; Armstrong, Madison 2 ; Kim, Martin2 ; Ponce, Raquel3 ; Melendez, Josue4 ; DeBiasse, Melissa5 ; Caplins, Serena1 ; Bay, Rachael2

Published Dec 12, 2024 on Dryad. https://doi.org/10.5061/dryad.cvdncjtd8

Data files

Dec 12, 2024 version files (48.35 MB total)

Dryad_Urchin_CCGP.zip: 48.34 MB
README.md: 3.56 KB

Abstract

Local adaptation represents the balance of selection and gene flow. Increasingly, studies find that adaptation can occur on spatial scales much smaller than the scale of dispersal, resulting in balanced polymorphisms within populations. However, in many cases we lack information on how this microgeographic adaptation might facilitate or hinder adaptation to larger-scale environmental heterogeneity, for example across latitude. Marine systems present a special case, as many marine species have high dispersal capacity, so that dispersal ‘neighborhoods’ may encompass environmental heterogeneity over both extremely small and extremely large spatial scales. Here, we leverage fine-scale sampling across the California range of the Pacific purple urchin (Strongylocentrotus purpuratus), a species with previous evidence of both local adaptation and extremely high gene flow. We find that despite a complete absence of neutral population structure, satellite-based sea surface temperature and tidal zone drive genetic differences among populations, suggesting that balanced polymorphisms can lead to adaptation across both large (latitudinal) and small (subtidal v. intertidal) spatial scales. In fact, some of the same genetic variants differentiate populations at both spatial scales, potentially because both environmental parameters are related to temperature. Further, we find that genes that are expressed in a single tissue or at a single life history stage are more divergent than expected across both latitudinal and tidal zone comparisons, suggesting that these genes have specific functions that might generate phenotypic variation important for local adaptation. Together, these results suggest that even in populations with little population structure, genetic variation can be sorted across small spatial scales, potentially resulting in local adaptation across a complex environmental mosaic.

This dataset contains the files needed to rerun the analyses discussed in the paper.

Description of the data and file structure

Directories and their contents. Any NAs represent missing data.

data

6.filtered_goodinds.recode.vcf: vcf file for 114 individuals and 991,002 SNPs
6.filtered_goodinds_thin.recode.vcf: thinned to 19,081 SNPs
CORRECTED_SNPS_qvalues_114ind.csv: data for Outflank
* CHROM= chromosome
* POS= position
* FST_site= FST values associated with site
* FST_tidal= FST values associated with tidal zone (intertidal v. subtidal)
* FST_NS= FST values associated with northern v. southern sites
* NS_qvalues= qvalues associated with northern v. southern sites
* Tidal_qvalues= qvalues associated with tidal zone (intertidal v. subtidal)
* Site_qvalues= qvalues associated with site.
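
As a hedged illustration of how the q-value columns listed above might be used, the following Python sketch reads a table with those column names and returns the SNPs whose q-values fall below a cutoff in both the north v. south and the tidal zone comparisons. The 0.05 cutoff, the function name `shared_outliers`, the made-up example rows, and the assumption that the file is comma-delimited with a header row are illustrative choices, not details taken from the analysis scripts.

```python
import csv
import io

def _to_float(value):
    """Convert a CSV field to float; return None for missing data (empty or NA)."""
    if value is None or value.strip() in ("", "NA"):
        return None
    return float(value)

def shared_outliers(handle, cutoff=0.05):
    """Return (CHROM, POS) pairs with q-values below `cutoff` in both comparisons.

    `cutoff` is an assumed threshold used only for illustration.
    """
    hits = []
    for row in csv.DictReader(handle):
        q_ns = _to_float(row.get("NS_qvalues"))
        q_tidal = _to_float(row.get("Tidal_qvalues"))
        if q_ns is None or q_tidal is None:
            continue  # skip rows with missing data
        if q_ns < cutoff and q_tidal < cutoff:
            hits.append((row["CHROM"], int(row["POS"])))
    return hits

# Tiny made-up table that only mimics the documented column names.
example = io.StringIO(
    "CHROM,POS,FST_site,FST_tidal,FST_NS,NS_qvalues,Tidal_qvalues,Site_qvalues\n"
    "chr1,100,0.01,0.20,0.30,0.001,0.010,0.5\n"
    "chr1,200,0.00,0.02,0.25,0.002,0.900,0.8\n"
    "chr2,300,0.02,NA,0.01,0.700,NA,0.9\n"
)
print(shared_outliers(example))
```

To apply the same function to the real table, open CORRECTED_SNPS_qvalues_114ind.csv with `open(path, newline="")` and pass the file handle in place of `example`.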

metadata

sites_data.csv: site information
* site= site name
* site_code= 2-3 letter code for each site
* long, lat= longitude and latitude for each site
* tide= tidal zone
* color, color2= colors assigned to sites for figures
* site_name= how site name will be shown in figures

new_sites_data.csv: same columns as sites_data.csv, plus a new column "shapes" for differentiating between intertidal and subtidal sites

Urchin_metadata_114inds_SORTED.csv: sample information; any field with "NA" represents data that was not provided by the collector(s) of that sample
* Number= number for sample
* CCGP_code= code for each site
* M_Number, other, stuff, id= additional numbers for differentiating between samples and sample sites
* Lat, Long= lat/long coordinates for sample sites
* Individuals.per.site..erase.this.column.after.corroborating.information.= number of individuals collected
* General.notes= more site name information
* FieldNumber= acronym for site, if applicable
* Instant.ID= species ID
* Collector= who collected the samples
* DateCollected= date samples were collected
* Phylum, Class, Order, Family, Genus, Species= Echinodermata, Echinoidea, Echinoida, Strongylocentrotidae, Strongylocentrotus, purpuratus for all samples
* Depth= depth of sampling site
* Habitat= general description of habitat where samples were collected
* Habitat_tidal= categorization of sites as intertidal v. subtidal (sub)
* north_south= categorization of sites as north v. south
* site_name= concise full name of site
* site_code= 2-3 letter code for each site
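
The grouping columns in the sample metadata can be tallied to check how many samples fall into each tidal zone and region. The sketch below is a minimal example under stated assumptions: it assumes a comma-delimited file with a header row using the column names listed above, and the function name `tally_samples`, the site codes, and the category values in the example rows are made up for illustration (only "sub" for subtidal is mentioned in this README). Rows with "NA" in a grouping column are counted as missing, following the note that NAs represent missing data.

```python
import csv
import io
from collections import Counter

def tally_samples(handle, columns=("Habitat_tidal", "north_south")):
    """Count samples for each combination of values in `columns`.

    Rows where any of the columns is empty or "NA" are counted under the
    key "missing" rather than dropped, so the totals still add up.
    """
    counts = Counter()
    for row in csv.DictReader(handle):
        values = tuple((row.get(col) or "").strip() for col in columns)
        if any(v in ("", "NA") for v in values):
            counts["missing"] += 1
        else:
            counts[values] += 1
    return counts

# Made-up rows that only mimic a subset of the documented columns.
example = io.StringIO(
    "Number,site_code,Habitat_tidal,north_south\n"
    "1,AAA,intertidal,north\n"
    "2,AAA,intertidal,north\n"
    "3,BBB,sub,south\n"
    "4,CCC,NA,south\n"
)
for key, n in sorted(tally_samples(example).items(), key=str):
    print(key, n)
```

Passing a different tuple for `columns`, for example `("site_code",)`, gives per-site counts with the same function.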

Rmarkdown

Combine_Cands_Urch114.Rmd: candidate genes from RDA and LFMM (the input data for this script are created in RDA_Urch114_final.Rmd and LFMM_Urch114_final.Rmd, respectively).
ConStruct.Rmd: conStruct analyses for K = 1-5 and figure generation for K = 2
GE_comparison.R: Gene expression comparison and figure generation
LFMM_Urch114_final.Rmd: LFMM analyses
outflank_urchin.Rmd: Outflank analyses and figure generation
pca_urchin.Rmd: PCA analyses
RDA_Urch114_final.Rmd: RDA analyses
Site Map.R: Site map for figure 1
TopGO.R: Top Gene Ontology values

Shellscripts

Dups.sh: this script removes PCR duplicates
gatk1.sh: step one of GATK, GenomicsDBImport
gatk2.sh: step two of GATK, GenotypeGVCFs
gatk3.sh: step three of GATK, GatherVcfs, SelectVariants, VariantFiltration, filtering with VCFtools
Gvcf_array.sh: Slurm array script that combines gvcf files with HaplotypeCaller in GATK
mapping.sh: Slurm array script that maps each set of reads per sample to the reference genome
rg.sh: adds read groups to bam files

Sharing/Access information

Data files listed above