Data from: Genetic, phenotypic, and environmental drivers of local adaptation and climate-change induced maladaptation in yellow warblers
Data files
Oct 24, 2025 version files 428.25 MB
- GEA.GF_overlappingSNPs.AF_171inds.txt (45.06 MB)
- GEA.samples_22pops_env.meta.csv (3.91 KB)
- GWAS_relate.cXX.txt (230.34 KB)
- museum_meta_clean.csv (25.26 KB)
- PopID.loc (27.36 KB)
- PopID.map (4.64 KB)
- PopID.ped (225.28 KB)
- README.md (12.76 KB)
- yewa.breed.fix.dbf (259 B)
- yewa.breed.fix.prj (145 B)
- yewa.breed.fix.shp (1.77 MB)
- yewa.breed.fix.shx (108 B)
- yewa.filtered.imputed.depth.PAMI.121inds.bed (92.99 MB)
- yewa.filtered.imputed.depth.PAMI.121inds.bim (97.45 MB)
- yewa.filtered.imputed.depth.PAMI.121inds.fam (4.41 KB)
- yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.bed (92.99 MB)
- yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.bim (97.45 MB)
- yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.fam (4.42 KB)
- yewa.precip.TL_residuals.csv (5.16 KB)
Abstract
Understanding processes driving local adaptation in wild species is a key goal in evolutionary biology, but linking genotype to phenotype to environmental drivers of natural selection remains challenging. This dataset contains the data needed to replicate the analyses in Rodriguez et al., which explore the connections between genotypes, phenotypes, and environment in yellow warblers across their breeding range. First, we conduct genome-wide association studies (GWAS) to identify loci related to bill shape and individual quality. We then conduct a gene-environment association (GEA) analysis on the resulting loci and find that precipitation underlies putative selection on bill shape. Finally, we test whether contemporary individuals whose bill shape deviates from historical relationships with precipitation exhibit increased stress (measured by telomere length) resulting from maladaptation. We collected samples from 121 yellow warblers from two reference populations in Michigan and Pennsylvania. At each site, birds were captured using mist-netting, bill depth measurements were taken, and blood samples were collected via brachial venipuncture and preserved in Queen's lysis buffer. Further, we collected an additional 171 genetic samples from 22 sites across the yellow warbler breeding range to validate associations between allele frequencies and environmental variables in key loci. From the 171 samples, 63 samples with bill depth measurements from 10 sites across the breeding range were also used to validate the associations between bill depth and environmental variables. In addition, 169 historical yellow warbler samples were collected from museum specimens on the breeding range to run a population structure analysis asking whether local populations have shifted their geographic ranges over the last century.
https://doi.org/10.5061/dryad.hmgqnk9tp
Description of the data and file structure
The following datasets include the input files needed to conduct genome-wide association studies (GWAS) to identify loci related to bill shape and telomere length in yellow warbler reference populations. These are the .bed files (binary biallelic genotype tables), .bim files (extended variant information files), and .fam files (sample information files) for both telomere length and bill depth, as well as the kinship matrix GWAS_relate.cXX.txt. The metadata file for the GWAS data is located in GWAS.samples_121inds_meta.csv, and includes the sample IDs, the coordinates, and the telomere length and bill depth measurements for each individual.
This dataset also includes the data needed to conduct a gene-environment association (GEA) analysis on loci from the preceding GWAS. This includes the allele frequency data for each of the overlapping non-zero single nucleotide polymorphisms (SNPs) from the GWAS on bill depth and telomere length (in GEA.GF_overlappingSNPs.AF_171inds.txt). This file includes the allele frequency for each SNP (columns) at each sample site (rows). The metadata for this analysis is in GEA.samples_22pops_env.meta.csv and includes the population location, coordinates, and Bioclim data for each location extracted from WorldClim.
Finally, this dataset also includes the data needed to test the relationship between telomere length and phenotype-climate mismatch. This file (yewa.precip.TL_residuals.csv) includes the population location (LOC1), contemporary bill depth measurement (BDEPTH), average breeding precipitation measure extracted from WorldClim (clim), scaled telomere length (TL.scale), coordinates (y, x), and the distance between contemporary and historic associations between precipitation and bill depth (resids). The distance was calculated by first finding the line of best fit between precipitation and bill depth for historic data from Wiedenfeld (1991). We then found the distance between each contemporary association between precipitation and bill depth and the historic line of best fit.
Files and variables
File: yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.fam
Description: This file is a sample information file accompanying the associated .bed binary genotype table and .bim extended variant information file. This file has no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Population ID
- Within-family ID of father
- Within-family ID of mother
- Sex code
- Scaled telomere length
File: yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.bim
Description: This file is an extended variant information file accompanying the associated .bed binary genotype table and .fam sample information file. This file has no header line, and contains one line per variant with the following six fields:
- Chromosome code or name
- Variant identifier
- Variant position in morgans or centimorgans (genetic distance)
- Base-pair coordinate
- Allele 1
- Allele 2
File: yewa.filtered.imputed.depth.PAMI.121inds.fam
Description: This file is a sample information file accompanying the associated .bed binary genotype table and .bim extended variant information file. This file has no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Population ID
- Within-family ID of father
- Within-family ID of mother
- Sex code
- Bill depth
File: GWAS_relate.cXX.txt
Description: Estimated relatedness matrix calculated from genotypes in program GEMMA. Contains an n × n matrix of estimated relatedness between all samples.
File: yewa.filtered.imputed.TL.statesex.resids.PAMI.121inds.bed
Description: Primary representation of genotype calls at biallelic variants accompanying the associated .bim extended variant information file and .fam sample information file. After the three-byte header of the standard PLINK 1 variant-major format (0x6c, 0x1b, 0x01), the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.
The low-order two bits of a block's first byte store the first sample's genotype code. ("First sample" here means the first sample listed in the accompanying .fam file.) The next two bits store the second sample's genotype code, and so on for the 3rd and 4th samples. The second byte stores genotype codes for the 5th-8th samples, the third byte stores codes for the 9th-12th, etc.
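The bit layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the authors' pipeline: it assumes the standard PLINK 1 variant-major layout (three header bytes before the blocks) and the standard PLINK 1 genotype codes, and the names `decode_block` and `read_variant` are hypothetical helpers introduced here.

```python
# PLINK 1 .bed genotype codes (2 bits per sample, low-order bits first).
GENOTYPE_CODES = {
    0b00: "hom_A1",   # homozygous for allele 1 in the .bim file
    0b01: "missing",  # missing genotype
    0b10: "het",      # heterozygous
    0b11: "hom_A2",   # homozygous for allele 2 in the .bim file
}

# Assumed header of a variant-major PLINK 1 .bed file.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def block_size(n_samples: int) -> int:
    """Bytes per variant block: N/4 rounded up."""
    return (n_samples + 3) // 4


def decode_block(block: bytes, n_samples: int) -> list[str]:
    """Decode one variant block into per-sample genotype labels."""
    genotypes = []
    for i in range(n_samples):
        byte = block[i // 4]                     # four samples per byte
        code = (byte >> (2 * (i % 4))) & 0b11    # low-order bits come first
        genotypes.append(GENOTYPE_CODES[code])
    return genotypes


def read_variant(path: str, variant_index: int, n_samples: int) -> list[str]:
    """Read one variant (0-based index into the .bim file) from a .bed file."""
    size = block_size(n_samples)
    with open(path, "rb") as handle:
        if handle.read(3) != BED_MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        handle.seek(3 + variant_index * size)
        return decode_block(handle.read(size), n_samples)
```

With 121 samples, `block_size(121)` gives 31 bytes per variant, so sample order must be taken from the accompanying .fam file and variant order from the .bim file.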
File: yewa.precip.TL_residuals.csv
Description: Data used to test the relationship between telomere length and phenotype-climate mismatch.
Variables
- LOC1: Population name
- BDEPTH: bill depth measured in mm
- clim: measure of average breeding precipitation in mm
- TL.scale: Telomere length measured using program TelSeq and scaled by age and mass
- resids: We used the ‘lm’ function in R version 3.5.3 (https://www.R-project.org) to fit linear models to test the association between bill depth and the environment for both our historic and contemporary samples. We then calculated the residuals from the contemporary association to the historical line of best fit to get a measure of the phenotype-climate mismatch.
- y: Latitude
- x: Longitude
File: GEA.samples_22pops_env.meta.csv
Description: Metadata needed to run GradientForest on yellow warbler population allele frequencies. Includes Bioclim variables extracted from WorldClim.
Variables
- Location: population ID
- Lat: Latitude
- Long: Longitude
- bio_19: Precipitation of coldest quarter (mm)
- bio_18: Precipitation of warmest quarter (mm)
- bio_17: Precipitation of driest quarter (mm)
- bio_16: Precipitation of wettest quarter (mm)
- bio_15: Precipitation seasonality (coefficient of variation)
- bio_14: Precipitation of driest month (mm)
- bio_13: Precipitation of wettest month (mm)
- bio_12: Annual precipitation (mm)
- bio_11: Mean temperature of coldest quarter (°C)
- bio_10: Mean temperature of warmest quarter (°C)
- bio_9: Mean temperature of driest quarter (°C)
- bio_8: Mean temperature of wettest quarter (°C)
- bio_7: Temperature annual range (°C)
- bio_6: Min temperature of coldest month (°C)
- bio_5: Max temperature of warmest month (°C)
- bio_4: Temperature seasonality (standard deviation × 100)
- bio_3: Isothermality (%)
- bio_2: Mean diurnal range (°C)
- bio_1: Annual mean temperature (°C)
- tree: Tree cover
- srtm: Elevation (m)
- qscat: surface moisture characteristics
- ndvistd: vegetation variation
- ndvimax: maximum vegetation cover
- hii: human impact
File: yewa.breed.fix.shp
Description: Shapefile of yellow warbler breeding range
File: yewa.breed.fix.dbf
Description: Attribute data to accompany the shapefile of yellow warbler breeding range
File: yewa.breed.fix.shx
Description: Shape index file to accompany the shapefile of yellow warbler breeding range
File: yewa.breed.fix.prj
Description: Coordinate reference system file to accompany the shapefile of yellow warbler breeding range
File: yewa.filtered.imputed.depth.PAMI.121inds.bim
Description: This file is an extended variant information file accompanying the associated .bed binary genotype table and .fam sample information file. This file has no header line, and contains one line per variant with the following six fields:
- Chromosome code or name
- Variant identifier
- Variant position in morgans or centimorgans (genetic distance)
- Base-pair coordinate
- Allele 1
- Allele 2
File: yewa.filtered.imputed.depth.PAMI.121inds.bed
Description: Primary representation of genotype calls at biallelic variants accompanying the associated .bim extended variant information file and .fam sample information file. After the three-byte header of the standard PLINK 1 variant-major format (0x6c, 0x1b, 0x01), the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.
The low-order two bits of a block's first byte store the first sample's genotype code. ("First sample" here means the first sample listed in the accompanying .fam file.) The next two bits store the second sample's genotype code, and so on for the 3rd and 4th samples. The second byte stores genotype codes for the 5th-8th samples, the third byte stores codes for the 9th-12th, etc.
File: GEA.GF_overlappingSNPs.AF_171inds.txt
Description: This file contains the allele frequencies for all of the overlapping non-zero SNPs found in the bill depth and telomere length GWAS. The SNP names are in the first row, with population allele frequencies in each row.
File: museum_meta_clean.csv
Description: This file contains the metadata for the historical specimens collected from museums across the breeding range of the yellow warbler.
Variables
- FieldID
- Museum: Three-letter code for the museum from which each historical sample was taken: Buffalo Society of Natural Sciences, California Academy of Sciences, Carnegie Museum of Natural History, Charles R. Conner Museum, Delaware Museum of Natural History, Denver Museum of Nature & Science, Field Museum of Natural History, University of Kansas Biodiversity Institute, Natural History Museum of Los Angeles County, Museum of Comparative Zoology - Harvard University, Museum of Vertebrate Zoology - UC Berkeley, James R. Slater Museum of Natural History, Royal Ontario Museum, Museum of Wildlife and Fish Biology - UC Davis, University of Michigan Museum of Zoology, Yale Peabody Museum.
- Catalog no: Identification number for each individual specimen at their host museum
- GUID: museum and catalog number for each sample
- Year
- Month
- Day
- Sex
- Date
- Lat1
- Lon1
File: PopID.ped
Description: This file contains the genotype data for historical yellow warbler samples used to ask whether local populations have shifted their geographic ranges over the last century. Samples were skin or toe pads loaned from museums. The first column is the family ID, and the second column is the individual ID. The next four columns, which in the standard PLINK .ped layout hold the paternal ID, maternal ID, sex, and phenotype, are all zeros because we did not include this information in the analysis. After the first six columns are the genotypes for each SNP in the order they appear in the companion PopID.map file. Each SNP is represented by two alleles, separated by a space.
File: PopID.map
Description: This file contains the location data for the SNPs used to find whether local populations have shifted their geographic ranges over the last century. This is a companion file to the PopID.ped file. The first column indicates the chromosome, the second column is the SNP ID, the third column is the genetic distance, and the last column indicates the base-pair position.
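The column layouts of PopID.ped and PopID.map described above can be read with a short parser. The sketch below is illustrative only: it assumes whitespace-delimited fields, and the function names and dictionary keys are hypothetical choices made for this example.

```python
def parse_map_line(line: str) -> dict:
    """Parse one .map line: chromosome, SNP ID, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, bp_position = line.split()
    return {
        "chrom": chrom,
        "snp_id": snp_id,
        "genetic_distance": float(genetic_distance),
        "bp_position": int(bp_position),
    }


def parse_ped_line(line: str) -> dict:
    """Parse one .ped line into IDs and a list of (allele1, allele2) pairs."""
    fields = line.split()
    alleles = fields[6:]  # the first six columns are IDs and placeholder zeros
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per SNP")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "genotypes": genotypes,  # same order as the lines of the .map file
    }
```

The i-th genotype pair returned by `parse_ped_line` corresponds to the i-th line of the .map file, so the two files should always be read together.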
File: PopID.loc
Description: This file contains the location data for samples used in the test of whether local populations have shifted their geographic ranges over the last century. The first column indicates the sample ID, the second column is the town where the sample was collected, the third column is the decade in which the sample was collected, and the last two columns are the sample collection coordinates.
Code/software
The allele frequency file in this repository requires Linux to generate. Users are provided with the genome-wide association scripts to do so if they wish; however, the output files are also provided, so access to a machine with a Linux operating system is not required to replicate our analyses. The remaining scripts require R to run.
Access information
Data were derived from the following sources:
- WorldClim database (Harris et al., 2020; Fick and Hijmans, 2017).
- National Landcover Database: https://www.usgs.gov/centers/eros/science/national-land-cover-database
- BYU Center for Remote Sensing: https://www.scp.byu.edu
- Carroll, M. L., DiMiceli, C. M., Sohlberg, R. A., & Townshend, J. R. G. (2004). 250m MODIS normalized difference vegetation index. University of Maryland, College Park, Maryland.
- Sexton, J. O., Song, X. P., Feng, M., Noojipady, P., Anand, A., Huang, C., ... & Townshend, J. R. (2013). Global, 30-m resolution continuous fields of tree cover: Landsat-based rescaling of MODIS vegetation continuous fields with lidar-based estimates of error. International Journal of Digital Earth, 6(5), 427-448.
- Wiedenfeld, D. A. (1991). Geographical morphology of male Yellow Warblers. The Condor, 93(3), 712-723.
We collected samples from 121 yellow warblers from two reference populations in Michigan and Pennsylvania. At each site, birds were captured using mist-netting, bill depth measurements were taken, and blood samples were collected via brachial venipuncture and preserved in Queen's lysis buffer. Further, we collected an additional 171 genetic samples from 22 sites across the yellow warbler breeding range to validate associations between allele frequencies and environmental variables in key loci. From the 171 samples, 63 samples with bill depth measurements from 10 sites across the breeding range were also used to validate the associations between bill depth and environmental variables.
Whole genome sequencing libraries were prepared following modifications of Illumina’s Nextera Library Preparation protocol with a target sequencing depth of 2X per individual. We used the program Trimmomatic 0.39 to trim the sequence data to remove Illumina adapter sequences and polyG tails using a sliding window approach (SLIDINGWINDOW:4:20). We then mapped reads to the yellow warbler reference genome (NCBI BioProject PRJNA777222) using BWA 0.7.17. After mapping, the resulting SAM files were sorted, converted to BAM files, and indexed using Samtools version 1.16. We used MarkDuplicates from Picard (http://broadinstitute.github.io/picard) to mark read duplicates and clipped overlapping reads with the clipOverlap function from bamUtil. To reduce sequencing depth variation, we used the downsample function from Picard (http://broadinstitute.github.io/picard) to downsample reads from BAM files with greater than 3X coverage, to 3X coverage. This resulted in an average read depth of 2.7X coverage.
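The downsampling step above amounts to keeping a fixed fraction of reads in each over-covered BAM file. The sketch below shows how that fraction could be derived from a sample's mean coverage. It is an assumption about the arithmetic, not the authors' script: the function name is hypothetical, and it presumes the Picard tool is given a per-read retention probability.

```python
def downsample_probability(mean_coverage: float, target: float = 3.0) -> float:
    """Fraction of reads to keep so that expected coverage does not exceed target.

    Samples at or below the target coverage are left untouched (probability 1.0).
    """
    if mean_coverage <= 0:
        raise ValueError("mean coverage must be positive")
    return min(1.0, target / mean_coverage)
```

For example, a sample sequenced to 6X would keep half of its reads, while a 2X sample would be passed through unchanged, which is consistent with the post-downsampling mean falling below 3X.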
To identify genetic markers from low-coverage WGS data, we used the program HaplotypeCaller in the Genome Analysis Toolkit (GATK version 4.1.6.0) applying a minimum base quality score of 33 and a minimum mapping quality score of 20 to reduce lane effects. To parallelize the genotype calling process, we generated genomic databases in ~3 Mb intervals across the genome and combined and indexed the genotyped VCF files with BCFtools 1.16. To remove systematic errors, we applied a hard filter to the subsequent VCF file with the following parameters, "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0", filtering the indels separately with "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0". We then used BCFtools to keep biallelic sites (-m 2 -M 2) missing in fewer than 20% of the sampled individuals ('F_MISSING < 0.20'), with minor allele frequency of at least 0.05 (--min-af 0.05, --max-af 0.95), and with a sequencing quality score greater than 30 ('QUAL > 30'). This filtering resulted in 2,999,708 variants in 298 individuals with an average of 21% missing data.
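To make the logic of these filters explicit, the sketch below restates them as plain Python predicates. It is only an illustration of the thresholds quoted above, not a replacement for GATK or BCFtools: the function names are hypothetical, and it assumes every annotation is present for every site (how missing annotations are handled is left to the tools themselves).

```python
def fails_snp_hard_filter(ann: dict) -> bool:
    """True if a SNP matches the hard-filter expression (i.e. is flagged)."""
    return (
        ann["QD"] < 2.0
        or ann["FS"] > 60.0
        or ann["MQ"] < 40.0
        or ann["MQRankSum"] < -12.5
        or ann["ReadPosRankSum"] < -8.0
    )


def fails_indel_hard_filter(ann: dict) -> bool:
    """True if an indel matches the separate indel hard-filter expression."""
    return ann["QD"] < 2.0 or ann["FS"] > 200.0 or ann["ReadPosRankSum"] < -20.0


def passes_site_filter(n_alleles: int, f_missing: float, alt_af: float, qual: float) -> bool:
    """Site-level filter: biallelic, <20% missing, allele frequency in [0.05, 0.95], QUAL > 30."""
    return (
        n_alleles == 2
        and f_missing < 0.20
        and 0.05 <= alt_af <= 0.95
        and qual > 30
    )
```

Note that the hard-filter expressions describe sites to flag, whereas the site-level predicate describes sites to keep, so the two are combined as `not fails_snp_hard_filter(...) and passes_site_filter(...)`.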
We measured telomere length from BAM files using TelSeq v0.0.2. We modified parameters in the TelSeq source code to adapt it to the yellow warbler genome, which included changing the number of chromosomal ends, read length, and total GC content (bp). The parameters TELOMERE_ENDS, READ_LENGTH and GENOME_LENGTH_AT_TEL_GC were set equal to 62, 100, and 143831148, respectively. We calculated the latter by measuring the total length of 150 base pair windows in the yellow warbler genome with a GC content between 48% and 52%.
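The GENOME_LENGTH_AT_TEL_GC calculation can be sketched as a scan over fixed-size windows. The version below is a minimal illustration under stated assumptions, not the script used to obtain 143831148: it assumes non-overlapping windows, inclusive GC bounds, that trailing partial windows are dropped, and that ambiguous bases such as N count as non-GC. The function names are hypothetical.

```python
def gc_fraction(window: str) -> float:
    """Fraction of G and C bases in a window (case-insensitive)."""
    window = window.upper()
    return (window.count("G") + window.count("C")) / len(window)


def genome_length_at_gc(
    sequences,
    window_size: int = 150,
    gc_min: float = 0.48,
    gc_max: float = 0.52,
) -> int:
    """Total length (bp) of non-overlapping windows whose GC content is in range."""
    total = 0
    for seq in sequences:
        # Step through full windows only; a trailing partial window is ignored.
        for start in range(0, len(seq) - window_size + 1, window_size):
            window = seq[start:start + window_size]
            if gc_min <= gc_fraction(window) <= gc_max:
                total += window_size
    return total
```

Here `sequences` is any iterable of chromosome or scaffold strings, for example the records of the reference FASTA loaded one at a time.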
To compare the current and pre-climate-change associations between phenotype and environment, we used data from Wiedenfeld (1991), which includes morphometric measurements from 153 yellow warblers captured between 1873 and 1987. We used wing chord as a proxy for body size to calculate body-size-corrected bill depth in historic and current samples. Using locations of capture, we extracted historical monthly climate data from WorldClim for the breeding months of May, June, and July for each sample between the years 1901 and 1950, which we then averaged. As Bioclim variables are not available for historic time periods, we used average precipitation. We then used the ‘lm’ function in R version 3.5.3 (https://www.R-project.org) to fit linear models to test the association between bill depth and the environment for both our historic and contemporary samples. We then calculated the residuals from the contemporary association to the historical line of best fit. We used those residuals as a measure of change between the historic and contemporary relationship between bill depth and climate, where a larger residual means a bigger mismatch between bill depth and the environment, relative to what we assume is the pre-climate-change optimum.
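The residual calculation can be illustrated with ordinary least squares. The analysis itself used R's `lm`; the sketch below is a Python restatement under the assumption that the mismatch is the vertical distance of each contemporary observation from the historic line of best fit. The function names are hypothetical and no values from the dataset are used.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mismatch_residuals(hist_precip, hist_depth, cont_precip, cont_depth):
    """Residuals of contemporary bill depths from the historic line of best fit."""
    intercept, slope = fit_line(hist_precip, hist_depth)
    return [
        depth - (intercept + slope * precip)
        for precip, depth in zip(cont_precip, cont_depth)
    ]
```

A positive residual means the contemporary bill is deeper than the historic relationship predicts for that precipitation level, and a negative residual means it is shallower; the magnitude is what the text above interprets as mismatch.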
We used population structure analyses to ask whether local populations have shifted their geographic ranges over the last century. We assembled a collection of 169 historic samples of yellow warblers sampled on their breeding range. Historic samples were skin or toe pads loaned from museums (Supplementary Table ##). All samples were extracted using the Qiagen DNeasy Blood and Tissue Kit and genotyped at a set of 96 SNPs, previously identified for geographic assignment (Bay et al. 2021), using a Fluidigm 96.96 IFC controller. After SNP genotyping, we discarded individuals with poor-quality data (<50% of SNPs genotyped). Genotypes from historic samples were combined with previously genotyped contemporary yellow warblers (1990-present) sampled on their breeding range (Bay et al. 2021). This left us with a final set of 551 samples (129 historical and 422 contemporary).
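The quality filter on genotyped individuals can be expressed as a per-sample call rate. The sketch below is a minimal illustration that assumes genotypes are stored as allele pairs with "0" marking a missing allele (the PLINK text-format convention); the function names are hypothetical.

```python
def call_rate(genotypes, missing: str = "0") -> float:
    """Fraction of SNPs with both alleles called for one individual."""
    called = sum(1 for a1, a2 in genotypes if a1 != missing and a2 != missing)
    return called / len(genotypes)


def keep_sample(genotypes, min_call_rate: float = 0.5) -> bool:
    """Keep individuals genotyped at 50% or more of SNPs; discard the rest."""
    return call_rate(genotypes) >= min_call_rate
```

Applied to a 96-SNP panel, an individual would need at least 48 called SNPs to be retained under this threshold.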
We performed principal components analysis (PCA) on contemporary samples only to establish the relationship between genetic variation and geography. PCA was performed using the SNPRelate package in R v4.3.2. We then predicted loadings of historical samples using the snpgdsPCASampLoading function. Historical samples were plotted alongside contemporary samples to visualize whether relationships between genetic variation and geography changed over time. We used linear models to test for effects of latitude, longitude, and time (historical v. contemporary) on PC axes.
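The idea of fitting principal axes on contemporary samples and then placing historical samples onto those axes can be sketched with NumPy. This is a conceptual illustration, not a reimplementation of SNPRelate: it assumes a complete samples-by-SNPs matrix of allele dosages with no missing data, it centers genotypes without the additional normalization that SNPRelate may apply, and the function names are hypothetical.

```python
import numpy as np


def fit_pca(reference: np.ndarray, n_components: int = 2):
    """Fit principal axes on reference samples (rows = samples, columns = SNPs)."""
    means = reference.mean(axis=0)
    centered = reference - means
    # Rows of vt are orthonormal principal axes in SNP space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return means, vt[:n_components]


def project(samples: np.ndarray, means: np.ndarray, axes: np.ndarray) -> np.ndarray:
    """Project samples onto previously fitted axes, centering with reference means."""
    return (samples - means) @ axes.T
```

In this scheme `fit_pca` would be called on the contemporary genotypes only, and `project` would then be applied to both contemporary and historical genotypes so that all samples share one coordinate system for plotting and for the linear models on PC axes.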
