Data from: Genomics-informed conservation units reveal spatial variation in climate vulnerability in a migratory bird
Data files
Oct 31, 2024 version files 24.29 GB
-
adapt.cawa.bqsr.miss0.2.qual30.1X.fst85.imputeAll.imputed.gl.noIndel.noRdaPlatform.vcf.gz
2.86 MB
-
cawa_seq_metadata.xlsx
24.42 KB
-
cawa.bqsr.miss0.2.qual30.1X.fst85.imputeAll.imputed.gl.noIndel.noRdaPlatform.vcf.gz
156.76 MB
-
cawa.bqsr.unfiltered.sort.vcf.gz
24.13 GB
-
ind.Env.xlsx
33.20 KB
-
README.md
3.79 KB
Abstract
Identifying genetic conservation units (CUs) in threatened species is critical for the preservation of adaptive capacity and evolutionary potential in the face of climate change. However, delineating CUs in highly mobile species remains a challenge due to high rates of gene flow and genetic signatures of isolation by distance. Even when CUs are delineated in highly mobile species, the CUs often lack key biological information about what populations have the most conservation need to guide management decisions. Here we implement a framework for rigorous CU identification in the Canada Warbler (Cardellina canadensis), a highly mobile migratory bird species of conservation concern, and then integrate demographic modeling and genomic offset within a CU framework to guide conservation decisions. We find that whole-genome structure in this highly mobile species is primarily driven by putative adaptive variation. Identification of CUs across the breeding range revealed that Canada Warblers fall into two Evolutionarily Significant Units (ESU), and three putative Adaptive Units (AUs) in the South, East, and Northwest. Quantification of genomic offset within each AU reveals significant spatial variation in climate vulnerability, with the Northwestern AU being identified as the most vulnerable to future climate change based on genomic offset predictions. Alternatively, quantification of past population trends within each AU revealed the steepest population declines have occurred within the Eastern AU. Overall, we illustrate that genomics-informed CUs provide a strong foundation for identifying current and potential future regional threats that can be used to manage highly mobile species in a rapidly changing world.
This dataset contains vcfs of unfiltered and filtered SNPs associated with individuals sequenced for the project titled
Description of the data and file structure
1. cawa.bqsr.sort.unfiltered.vcf.gz
No filterering, but has been recalibrated using BQSR.
2. cawa.bqsr.miss0.2.qual30.1X.fst85.imputeAll.imputed.gl.noIndel.noRdaPlatform.vcf.gz
Filtered to allow only 20% missing data, variants must be at least qual30, no indels, and no samples less than 1X coverage. Secondary filtering to remove platform effects based on 85th FST percentile and redundancy analysis, then imputed with genotype likelihoods.
3. adapt.cawa.bqsr.miss0.2.qual30.1X.fst85.imputeAll.imputed.gl.noIndel.noRdaPlatform.vcf.gz
Adaptive SNPs identified from LFMM and RDA only. Filtered from
cawa.bqsr.miss0.2.qual30.1X.fst85.imputeAll.imputed.gl.noIndel.noRdaPlatform.vcf.gz
4. cawa_seq_metadata.xlsx
Metadata for individuals in dataset. Data header is as follows
- sample - sample ID
- state - state of collection
- town - nearest town to collection
- site - state.town used to differentiate sites in the same states
- sex - sex of individual
- type - sample type, f for feather, b for blood
- lat - latitude of collection location, to protect exact locations data has been made into decimal degrees
- long - longitude of collection location, to protect exact locations data has been made into decimal degrees
- month - month of collection
- date - date of collection
- breed - estimate of breeding
- cov - average depth of coverage of sample in number of reads on average
- plate - sequencing plate ID
- merged - was the bam merged across multiple sequencing runs
- year - year of collection
- platform - sequencing platform type
- group - estimated region membership from isotope analysis
5. indEnv.xlsx
Environmental data extracted from BIOCLIM variables, tree cover, human influence index, and elevation. Data is extracted at individual sample latitude and longitude.
- lat - latitude of collection location, to protect exact locations data has been made into decimal degrees
- long - longitude of collection location, to protect exact locations data has been made into decimal degrees
BIOCLIM variables:
- BIO1 Annual Mean Temperature C*100
- BIO2 Mean Diurnal Range (Mean of monthly (max temp - min temp)) C*100
- BIO3 Isothermality (BIO2/BIO7)*100
- BIO4 Temperature Seasonality (standard deviation 100) C
- BIO5 Max Temperature of Warmest Month C*100de
- BIO6 Min Temperature of Coldest Month C100
- BIO7 Temperature Annual Range C*100
- BIO8 Mean Temperature of Wettest Quarter C*100
- BIO9 Mean Temperature of Driest Quarter C*100
- BIO10 Mean Temperature of Warmest Quarter C*100
- BIO11 Mean Temperature of Coldest Quarter C*100
- BIO12 Annual Precipitation mm
- BIO13 Precipitation of Wettest Month mm
- BIO14 Precipitation of Driest Month mm
- BIO15 Precipitation Seasonality (Coefficient of Variation) mm
- BIO16 Precipitation of Wettest Quarter mm
- BIO17 Precipitation of Driest Quarter mm
- BIO18 Precipitation of Warmest Quarter mm
- BIO19 Precipitation of Coldest Quarter mm
- SRTM Elevation, meters
- NDVI Normalized difference vegetation index
- NDVISTD Standard deviation of the normalized difference vegetation index
- TREE Percent tree cover, 0-100
- INDHII Human influence index, measure of how much human development is near the area 0-100
Sharing/Access information
Canada Warbler genome
Code/Software
All code for genomic analyses can be found at: https://github.com/mcaitlinv/cawa-breeding
Resequencing sample collection and DNA extraction
We collected samples from an additional 181 breeding adult Canada Warblers from across the breeding range in North America in collaboration with multiple university researchers, private environmental companies, and state and federal agencies (Supplemental Figure 1). For DNA extraction, we collected blood from 134 individuals (~80 µl), via brachial venipuncture, preserved it in Queen’s lysis buffer, and stored it at room temperature. Blood (50-80 µl) was extracted using Qiagen DNeasy Blood and Tissue Kits (QIAGEN) and eluted into 100 µl of provided AE buffer. For the remaining 47 individuals, we collected tail feathers by pulling 2 tail feathers from each bird and storing feathers at -20C. We cut the calamus of one feather from the shaft and extracted the calamus using the modified Qiagen DNeasy Blood and Tissue protocol (Schweizer & DeSaix, 2023). After DNA extraction, we quantified samples using Qubit dsDNA assay.
DNA resequencing
We prepared the breeding samples for low coverage whole genome sequencing using a modified Nextera prep (Schweizer & DeSaix, 2023) with normalized DNA input. We sequenced samples in two libraries, 110 individuals on an Illumina HiSeq 4000 using paired end 150bp reads and 71 individuals on an Illumina NovaSeq 6000 using paired end 150bp reads. The 71 individuals on the NovaSeq were sequenced across multiple lanes to get to the targeted sequencing depth of 2-3X coverage per sample and included replicates of 32 samples with lower than 1.5X coverage from the HiSeq 4000 run.
Bioinformatic processing
We used Conda v4.13.0 (Anaconda Documentation, 2020) environments to manage bioinformatic packages on the RMACC Summit supercomputer managed jointly by Colorado State University and University of Colorado, Boulder. To process raw fastqs from the 181 individuals that underwent low coverage whole genome sequencing, we used Trim Galore v0.6.7 (Krueger, 2012), a wrapper for cutadapt v1.18 (Martin, 2011), and FastQC v0.11.9 (Andrews, 2010) to trim any remaining Illumina adaptors in the fastqs. Next, based on recommendations for low coverage data generated with NovaSeq platforms (Lou & Therkildsen, 2022) we performed a sliding window cut of the 3-prime end of the reads to remove low quality tails, defined as 4 bases in a row with mean QUAL scores less than 20, using fastp v0.22.0 (Chen et al., 2018). We checked fastqs for quality using FastQC and MultiQC v1.0.dev0 (Ewels et al., 2016) before and after trimming reads.
After processing raw fastqs, we aligned samples to the Canada Warbler reference genome using Burrows-Wheeler Alignment software (bwa mem, bwa v0.7.17) (Li & Durbin, 2009). Then we added read group information using Picard v2.26.11 AddorReplaceReadGroups (Picard Toolkit, 2014/2019)and marked duplicate reads using samtools v1.11 markdup (Danecek et al., 2011) before merging individuals with multiple bams. After merging bams, we checked sample coverage using bedtools v2.30.0 genomecov (Quinlan & Hall, 2010), and samples with less than 1X coverage were removed, leaving 169 individuals.
We used the processed bams to call variants using GATK v4.2.5.0 HaplotypeCaller (McKenna et al., 2010) and BCFtools v1.15.1 mpileup (Danecek et al., 2021). Then, we stringently filtered the variant sets using BCFtools, allowing only biallelic sites, a minor allele frequency of greater than 5%, QUAL score of greater than 30 and less than 10% missing across the 169 individuals. We intersected the filtered variant sets from bedtools and GATK to create a high-quality variant set to use for base quality score recalibration. Using the intersected variants, we recalibrated the sample bams using GATK BaseRecalibrator and ApplyBQSR. With the recalibrated bams, we used HaplotypeCaller to call a recalibrated set of variants. Then we filtered the recalibrated variant set allowing only biallelic sites, a minor allele frequency of less than 5%, QUAL score of greater than 30 and less than 20% missing data across the 169 individuals.
Using the recalibrated, filtered variant set we performed an exploratory analysis using R (R Core Team, 2022) and the package srsStuff (Anderson, 2020) to produce single-read sampling principal components analysis (PCA) of whole genome structure. We used single-read sampling because differences in the average coverage across samples can be mistaken for population structure on PCA in low coverage data (Lou & Therkildsen, 2022). Single read sampling equalizes coverage for all samples. Despite equalizing coverage, we found significant platform effects, where samples sequenced on different platforms have inherent bias that can be mistaken for population structure (for example of platform effects on low coverage data, see Lou & Therkildsen, 2022). We removed platform-associated variants from the dataset and proceeded with the analysis once samples no longer clustered in platform groups by PCA (for full methods to remove platform effects, see Supplemental Methods).
Adaptive loci identification
To select environmental variables, we used gradient forest (Ellis et al., 2012), an extension of random forest (Liaw et al., 2002), and 23 environmental variables potentially important to Canada Warbler breeding ecology based on previous research (Supplemental Table 3, Ferrari et al., 2018; Reitsma et al., 2020). Environmental data were extracted from each of 16 sampling locations, excluding two sampling sites with fewer than 4 individuals. To identify putatively adaptive loci, we used two approaches, redundancy analysis (RDA) and Latent Factor Mixed Models (LFMM). Once loci were identified using both RDA and LFMM, the union of loci discovered by both methods was used as our set of candidate adaptive loci which was filtered from the original vcf.