Local adaptation in an alpine finch, GWAS and FST output files
Data files
Mar 27, 2026 version files 12.37 GB
-
bcftools_BCRF.zip
83.27 KB
-
bslmm_output.zip
3.29 GB
-
bslmm_results.zip
49.50 MB
-
lmm_output.zip
7.13 GB
-
lmm_results.zip
22.78 KB
-
OutFLANK_fst_results_0.005.csv
1.91 GB
-
README.md
7.60 KB
Abstract
Understanding patterns and mechanisms underlying local adaptation is becoming increasingly important for species conservation amid anthropogenically driven environmental change. Alpine systems are experiencing particularly intense pressure from environmental change resulting from increased rates of warming and corresponding loss of snow and ice. We integrate morphological and genetic analyses to identify traits important for local adaptation in one of the highest elevation breeding birds in North America, the Sierra Nevada Gray-crowned Rosy Finch. We performed an in-depth analysis of how traits with known links to thermoregulation in birds, such as wing length, bill size, and feather microstructure, vary between two populations at sites with contrasting climate and environmental conditions. We identified loci underlying these traits using a genome-wide association study and further examined regions of the genome related to altitude adaptation and cold tolerance using FST outlier tests. Together, these results indicate that temperature, food availability, and alpine landscape features may impose multifaceted and potentially conflicting selective pressures on morphological traits important to adaptation in alpine birds. Overall, this work represents one of the first in-depth analyses of the genetic basis of adaptation in an alpine specialist songbird.
Dataset DOI: 10.5061/dryad.kwh70rzhs
Description of the data and file structure
Overview
This repository contains data and analysis outputs associated with the study:
The Genetic and Morphological Basis of Local Adaptation to Elevational Extremes in an Alpine Finch
The dataset stored here includes:
- Genome-wide association study (GWAS) outputs
- FST outlier analyses
- Gene annotation and candidate gene identification outputs
These data were used to investigate patterns of local adaptation in the Gray-crowned Rosy Finch (Leucosticte tephrocotis dawsonii) through the integration of genomic and phenotypic variation.
Study Design Summary
Samples were collected from alpine populations and included:
- Morphological measurements (e.g., beak traits, wing chord, body size metrics)
- Feather microstructure measurements (e.g., barbule density, node density)
- Whole-genome sequencing data
Whole-genome sequencing libraries were prepared using a modified Nextera protocol and sequenced on an Illumina NovaSeq 6000. Reads were processed through a Snakemake pipeline. GWAS analyses were conducted using GEMMA, implementing:
- Bayesian sparse linear mixed models (BSLMM)
- Univariate linear mixed models (ULMM)
FST outliers were identified using OutFLANK, and candidate genes were identified using BEDTools closest. Functional enrichment was conducted using PANTHER and STRING.
Files and variables
File: bcftools_BCRF.zip
Description: Output from bcftools closest identifying genes nearest to candidate SNPs using the Brown-capped Rosy Finch genome annotation.
Contents include:
- SNP position
- Closest gene
- Distance to gene
- Gene annotation
These files were used for candidate gene identification.
Files Inside:
bslmm_0.5_BCRF_soft_distance_new.bed and lmm_snps0.5_BCRF_soft_distance_new.bed are the raw outputs from the bcftools closest function.
genes_trait_bslmm0.5_0.01_25kb_BCRF_soft_new.csv is the output from the bslmm analysis with the associated trait and gene. The "_clean" extension means uncharacterized genes (LOC genes) have been removed.
genes_trait_lmm0.5_25kb_BCRF_soft_new.csv is the same as above.
File: lmm_results.zip
Description: Filtered results from ULMM analysis, including only significant SNPs.
Filtering threshold:
- p-value < 5 × 10⁻⁸
Files Inside:
GCRF_merged.SNP.filtered_0.5 file for each trait showing the outlier SNPs for each trait.
lmm_snps_0.5_fst_new.csv shows the fst values for each SNP.
lmm_summary_outliers_0.5_soft.csv is all the outliers for all the traits.
File: bslmm_results.zip
Description: Filtered BSLMM output.
Filtering threshold:
- Mean posterior inclusion probability (PIP, gamma) > 0.01
Files Inside:
Here is a file for each trait showing the outlier SNPs given different filtering thresholds (PIP 0.01, 0.1, 0.05).
Summary_.csv* are the summary files for each PIP filtering with all the outlier loci.
hyperparameters are the folder with all the hyperparameter outputs from BSLMM for each trait and each PIP filtering
File: bslmm_output.zip
Description: Raw output files from GEMMA Bayesian sparse linear mixed model (BSLMM).
Includes:
- Posterior inclusion probabilities (PIP) (represented as gamma)
- Effect size estimates
- Model convergence outputs
File: lmm_output.zip
Description: Raw output files from GEMMA univariate linear mixed model (ULMM) analyses.
Includes:
- Effect sizes
- Standard errors
- p-values
- SNP identifiers
File: OutFLANK_fst_results_0.005.csv
Description: Results from genome-wide FST outlier detection using OutFLANK (Whitlock & Lotterho,s 2015). OutFLANK identifies loci with unusually high FST values relative to the neutral distribution, suggesting potential selection.
Variables
- rs: SNP identifier.
- He: Expected heterozygosity at the locus across populations
- FST: Corrected FST value estimated by OutFLANK
- T1: Numerator of the FST calculation after trimming extreme values.
- T2: Denominator of the FST calculation after trimming.
- FSTNoCorr: Uncorrected FST value prior to OutFLANK trimming and correction. (Used internally by OUTFLANK)
- T1NoCorr: Uncorrected numerator component of FST calculation.
- T2NoCorr: Uncorrected denominator component of FST calculation.
- meanAlleleFreq: Mean allele frequency across populations.
- indexOrder: Original SNP order from input file.
- GoodH: Logical flag indicating whether heterozygosity falls within acceptable bounds. TRUE = SNP retained for neutral distribution fitting. FALSE = SNP excluded (often due to low heterozygosity)
- qvalues: False discovery rate (FDR) corrected p-values.
- pvalues: Two-tailed p-values for each SNP.
- pvaluesRightTail: Right-tailed p-values.
- OutlierFlag: Logical indicator of whether the SNP is classified as an outlier. TRUE = significant FST outlier. FALSE = not significant
Code/software
The files included in this dataset are outputs from genome-wide association and population differentiation analyses. These results were generated using GEMMA and OutFLANK, both freely available software tools.
Genome-Wide Association Analyses
Genome-wide association analyses were conducted using GEMMA v0.98.1 (Zhou & Stephens 2012; Zhou et al. 2013).
Two models were implemented:
Bayesian Sparse Linear Mixed Model (BSLMM)
BSLMM analyses were performed using:
- 500,000 burn-in iterations
- 5,000,000 sampling iterations
- Posterior inclusion probability (PIP) threshold: 0.01
These analyses produced the following files:
bslmm_output.zip— raw GEMMA outputbslmm_results.zip— filtered significant SNPs
Univariate Linear Mixed Model (LMM)
Univariate linear mixed models were run using the GEMMA Wald test.
Filtering threshold:
- p-value < 5 × 10⁻⁸
These analyses produced:
lmm_output.zip— raw GEMMA outputlmm_results.zip— filtered significant SNPs
GEMMA input files were generated using PLINK v1.9 (Purcell et al. 2007; Chang et al. 2015).
FST Outlier Analysis
Genome-wide FST outlier detection was conducted using OutFLANK (Whitlock & Lotterhos, 2015) implemented in R v4.2.0.
OutFLANK identifies loci under potential selection by:
- Fitting a neutral FST distribution
- Identifying SNPs with unusually high differentiation
Parameters used:
- False discovery rate (FDR): 0.005
- Default trimming parameters
This analysis produced:
OutFLANK_fst_results_0.005.csv
Reproducibility
Scripts used to run analyses and generate outputs are available at:
https://github.com/ecnrobertson/GCRF_Pub
This repository contains:
- GEMMA workflow scripts
- OutFLANK analysis scripts
- Filtering and plotting scripts
Software References
- Zhou, X. & Stephens, M. (2012). Genome-wide efficient mixed-model analysis. Nature Genetics, 44, 821–824.
- Zhou, X., Carbonetto, P., & Stephens, M. (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics, 9, e1003264.
- Whitlock, M. C., & Lotterhos, K. E. (2015). Reliable detection of loci responsible for local adaptation. Molecular Ecology, 24, 1039–1051.
- Purcell, S. et al. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.
Sample Collection
A total of 171 Gray-crowned Rosy Finches were collected using potter traps in the White Mountains (n=98) and Piute Pass (n=73) in June and July, the peak of the species’ breeding season (MacDougall-Shackleton et al., 2000). Birds showing signs of stress were released, resulting in slight variation in the types of data collected for each individual. Blood samples were collected from 150 of the captured birds using the brachial wing vein and stored in Queen’s lysis buffer at room temperature (Owen, 2011; Seutin et al., 1991). At the same time, morphological measurements - tarsus length, wing chord, beak width, beak length, beak depth, and nare length - were also collected from 154 of the birds using an electronic caliper, and 5-10 body feathers were collected from the breast of all individuals. Birds were banded to allow for identification of the individuals and to avoid recapture, then released. All handling and banding of birds was done following the guidelines and protocols of the U.S Geological Survey (USGS) Bird Banding Laboratory (BBL) and complied with permits and permissions from federal and state agencies.
Feather Microstructure Measurements
We measured the microstructure of three feathers for each individual. For each feather, three photos were taken using an Olympus BX51 Microscope: one photo each of an unbroken pennaceous and plumulaceous barb from the center of each region was taken at 4x objective, and one photo of a plumulaceous barbule was taken at 10x objective. Structures were then measured from the photos using the segmented line tool in ImageJ. For the pennaceous and plumulaceous barbule length, we measured one barbule from each region, starting from the distal tip to where the barbule meets the barb. For the pennaceous and plumulaceous barbule density, we drew a 0.5mm line along the middle of a barb, counting the number of barbules along that line. And finally, for plumulaceous node density, we drew a 0.2mm line along the center of a barbule, counting the number of nodes along that line (Fig 2A). Each replicate measure for a feather trait was averaged per individual prior to downstream analysis.
Sequencing and Genotyping
DNA was extracted from blood samples of 150 individuals using Qiagen’s DNeasy Blood and Tissue Extraction Kit protocol and quantified with the Qubit dsDNA HS Assay Kit. We used a modified version of Illumina’s Nextera Library Preparation protocol to prepare WGS libraries and pooled the libraries by equal mass before sequencing. The resulting libraries were sequenced on 2 lanes of an Illumina NovaSeq 6000 at Novogene Corporation.
To process the raw sequence reads and detect variants, we utilized Snakemake (Molder et al. 2021), a workflow management system that provides efficiency, adaptability, and reproducibility. The pipeline we adapted to our species can be found on GitHub (https://github.com/eriqande/mega-non-model-wgs-snakeflow/). Sequence data were trimmed using fastp (S. Chen et al., 2018) to remove adaptor sequences and polyG tails using a sliding window, and then aligned to a Brown-capped Rosy Finch (Leucosticte australis) reference genome (GenBank: GCA_025504685.1) using Burrows-Wheeler Aligner software (H. Li & Durbin, 2009). We marked PCR duplicates using samtools (Li et al., 2009) and read groups (sample, lane, library) were added using picard (http://broadinstitute.githut.io/picard). Individual coverage was estimated using samtools, and given the range of coverage (XX-XX), we downsampled BAM files to 10x using the samtools subsample function (Li et al., 2009). Individual gvcf files were created using GATK’s HaplotypeCaller (Poplin et al., 2018) with high base quality score filters (--min-base-quality-score 33 --minimum-mapping-quality 20) to remove batch effects (Lou and Therkildsen, 2021). We then parallelized the calling of genotypes across 3 million bp regions across the genome using the GenomicsDBImport and GenotypeGVCFs functions (Auwera & O’Connor, 2020; McKenna et al., 2010). GATK versions above 4.0 call missing data as homozygous reference, which can bias downstream analysis. To avoid this, all loci with a depth of zero were manually marked as missing. To remove systematic errors according to GATK variant quality score recalibration (VQSR) best practices, we hard filtered variants with the following parameters: StrandOddsRatio (>3.0), FisherStrand (>60.0), MappingQuality (>40.0), and Quality by Depth (>2.0). We further filtered Single Nucleotide Polymorphisms (SNPs), removing indels, filtering for missingness (<50%), allele frequencies (>0.5 and <0.95), depth (>4x), quality (>30.0) using bcftools ( Li, 2011. After filtering, the resulting high-quality SNPs were passed through BEAGLE 4.1 (Browning & Browning, 2016) to impute missing genotypes.
