Western burrowing owl genomics
Data files
Nov 20, 2023 version files 4.47 GB
- all_buows.vcf.gz
- all_migrants.glf.gz
- environmental_data_for_gradient_forest.csv
- five_res.saf.gz
- genetic_data_for_gradient_forest.txt
- Imperial.glf.gz
- Imperial.saf.gz
- LakeHavasu.glf.gz
- LakeHavasu.saf.gz
- LasVegas.glf.gz
- LasVegas.saf.gz
- mig_Colorado.saf.gz
- mig_Idaho.saf.gz
- mig_NewMexico.saf.gz
- mig_OR_Baker.saf.gz
- mig_OR_Depot.saf.gz
- mig_Utah.saf.gz
- mig_Wash.saf.gz
- migrants_norels_noMan.saf.gz
- NorCal.glf.gz
- NorCal.saf.gz
- Phoenix.glf.gz
- Phoenix.saf.gz
- README.md
- Riverside.glf.gz
- Riverside.saf.gz
- SanDiego.glf.gz
- SanDiego.saf.gz
Abstract
Migration is driven by a combination of environmental and genetic factors, but many questions remain about those drivers. Potential interactions between genetic and environmental variants associated with different migratory phenotypes are rarely the focus of study. We pair low coverage whole genome resequencing with a de novo genome assembly to examine population structure, inbreeding, and the environmental factors associated with genetic differentiation between migratory and resident breeding phenotypes in a species of conservation concern, the western burrowing owl (Athene cunicularia hypugaea). Our analyses reveal a dichotomy in gene flow depending on whether the population is resident or migratory, with the former being genetically structured and the latter exhibiting no signs of structure. Among resident populations, we observed significantly higher genetic differentiation, significant isolation‐by‐distance, and significantly elevated inbreeding. Among migratory breeding groups, on the other hand, we observed lower genetic differentiation, no isolation‐by‐distance, and substantially lower inbreeding. Using genotype–environment association analysis, we find significant evidence for relationships between migratory phenotypes (i.e., migrant versus resident) and environmental variation associated with cold temperatures during the winter and barren, open habitats. In the regions of the genome most differentiated between migrants and residents, we find significant enrichment for genes associated with the metabolism of fats. This may be linked to the increased pressure on migrants to process and store fats more efficiently in preparation for and during migration. Our results provide a significant contribution toward understanding the evolution of migratory behavior and vital insight into ongoing conservation and management efforts for the western burrowing owl.
README
The data files comprise four treatments of low-coverage whole-genome sequences collected for ~120 burrowing owls sampled throughout western North America.
Treatment 1: ANGSD was used to produce genotype likelihood files for all individuals in the BEAGLE format (-doGlf 3) and a minor allele frequency file (-doMaf 1), with restrictive filtering: a conservative minimum minor allele frequency (-minMaf 0.05), a stringent p-value threshold for calling a site polymorphic (-SNP_pval 1e-6), adjustment of mapQ scores for excessive mismatches from the reference genome (-C 50), and confirmation of variants using base alignment quality estimation (-baq 1). These files have the suffix '.glf.gz'; there is one for all migrants combined and one for each resident breeding population.
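The Treatment 1 flags can be collected into a single ANGSD invocation. The sketch below only assembles the argument list and does not run ANGSD. The file names, the likelihood model (`-GL 1`), and `-doMajorMinor 1` are assumptions added to make the command complete; they are not stated in this README.

```python
def build_angsd_glf_command(bam_list, reference, out_prefix):
    """Assemble an ANGSD argument list using the Treatment 1 filters."""
    return [
        "angsd",
        "-bam", bam_list,        # hypothetical text file listing one BAM path per line
        "-ref", reference,       # hypothetical reference FASTA, used by -baq
        "-out", out_prefix,      # hypothetical output prefix
        "-GL", "1",              # assumption: likelihood model is not given in the README
        "-doMajorMinor", "1",    # assumption: allele inference method is not given
        "-doGlf", "3",           # genotype likelihood output, as stated above
        "-doMaf", "1",           # minor allele frequency file
        "-minMaf", "0.05",       # minimum minor allele frequency
        "-SNP_pval", "1e-6",     # p-value threshold for polymorphic sites
        "-C", "50",              # adjust mapQ for excessive mismatches
        "-baq", "1",             # base alignment quality
    ]


cmd = build_angsd_glf_command("bams.txt", "reference.fa", "all_migrants")
print(" ".join(cmd))
```

Printing the command before running it makes it easy to check the flags against the description above; passing the list to `subprocess.run` would execute it where ANGSD is installed.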
Treatment 2: The HaplotypeCaller module in GATK (McKenna et al., 2010) was used to call genotypes for all sequenced individuals. Insertion/deletion variants were removed, and only biallelic variants found in 50% of the individuals were kept. The resulting file is called 'all_buows.vcf.gz'.
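The Treatment 2 filter can be illustrated on individual VCF records. This is a minimal sketch, not the pipeline actually used: it assumes "found in 50% of the individuals" means a genotype call is present in at least half of the samples, and the function name and example records are hypothetical.

```python
def keep_variant(vcf_line, min_call_rate=0.5):
    """Return True for a biallelic SNP genotyped in enough samples."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # Keep only single-base REF and a single single-base ALT (drops indels
    # and multiallelic sites); "." and "*" are not real alternate alleles.
    if "," in alt or len(ref) != 1 or len(alt) != 1 or alt in {".", "*"}:
        return False
    # The GT subfield is the first colon-separated entry of each sample column.
    genotypes = [sample.split(":")[0] for sample in fields[9:]]
    called = sum(1 for gt in genotypes if "." not in gt)
    return bool(genotypes) and called / len(genotypes) >= min_call_rate


# Made-up records with four samples each.
snp = "scaffold1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t./.\t1/1\t./."
indel = "scaffold1\t200\t.\tAT\tA\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\t0/0"
multi = "scaffold1\t300\t.\tA\tG,T\t50\tPASS\t.\tGT\t0/1\t0/2\t1/1\t0/0"
sparse = "scaffold1\t400\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t./.\t./.\t./."

for record in (snp, indel, multi, sparse):
    print(record.split("\t")[1], keep_variant(record))
```

For a real file, the same function could be applied to each non-header line read with `gzip.open(path, "rt")`.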
Treatment 3: ANGSD was used to create population-specific site frequency spectra (SFSs) from site allele frequency (SAF) files, using the reference genome to polarize allele calls (-anc) and adjusting frequencies using individual FIS (-indF). Strict filtering conditions were applied: discarding reads without unique mapping (-uniqueOnly 1), removing bad reads (-remove_bads 1), using only reads for which mates are mapped (-only_proper_pairs 1), discarding reads with low mapping quality (-minMapQ 1), keeping reads with high base quality (-minQ 20), dropping sites with low or high depth across samples (-setMinDepth 10 -setMaxDepth 500), keeping only biallelic sites (-skipTriallelic 1), and the previously described conditions (-minMaf 0.05 -C 50 -baq 1). These files have the suffix '.saf.gz'; there are files for each individual sample site, for the combined migratory sites, and for the five resident populations for which there is evidence of structure (i.e., Lake Havasu, NorCal, Las Vegas, Riverside, and San Diego).
Treatment 4: Minor allele frequency files (MAFs) were also generated for each sample site (-doMaf 1), sampling all the sites identified in the overall MAF file, and only generating minor allele frequencies for variants found in a minimum of four individuals in each population. Missing data were imputed using the R package "MICE." This file is called 'genetic_data_for_gradient_forest.txt'. The environmental data used are also included ('environmental_data_for_gradient_forest.csv').
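The four-individual threshold in Treatment 4 can be illustrated as a filter on a per-site MAF table. This is a sketch under assumptions: it supposes a tab-separated table with a header row containing an `nInd` column (the number of individuals with data at a site), and the rows shown are made up.

```python
import csv
import io


def filter_mafs(handle, min_individuals=4):
    """Keep rows whose nInd value meets the minimum number of individuals."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if int(row["nInd"]) >= min_individuals]


# Made-up example table; column names are an assumption about the MAF layout.
sample = (
    "chromo\tposition\tmajor\tminor\tknownEM\tnInd\n"
    "scaffold1\t101\tA\tG\t0.12\t6\n"
    "scaffold1\t250\tC\tT\t0.30\t3\n"
)

kept = filter_mafs(io.StringIO(sample))
print([row["position"] for row in kept])
```

A gzipped table could be passed in the same way by opening it with `gzip.open(path, "rt")`.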
Gradient Forest Analysis Header & Data:
- Status: Migratory (1) or Resident (0)
- pc1: PC1 values from SRS PCA
- pc2: PC2 values from SRS PCA
- pcnm1: pcnm1 from PCNM analysis
- pcnm2: pcnm2 from PCNM analysis
- BIO1: Annual Mean Temperature in C * 10
- BIO2: Mean Diurnal Range in C * 10
- BIO3: Isothermality (BIO2/BIO7) * 100
- BIO4: Temperature Seasonality (standard deviation * 100)
- BIO5: Max Temperature of the Warmest Month in C * 10
- BIO6: Min Temperature of the Coldest Month in C * 10
- BIO7: Temperature Annual Range in C * 10
- BIO8: Mean Temperature of the Wettest Quarter in C * 10
- BIO9: Mean Temperature of the Driest Quarter in C * 10
- BIO10: Mean Temperature of the Warmest Quarter in C * 10
- BIO11: Mean Temperature of the Coldest Quarter in C * 10
- BIO12: Annual Precipitation in mm
- BIO13: Precipitation of the Wettest Month in mm
- BIO14: Precipitation of the Driest Month in mm
- BIO15: Precipitation Seasonality (coefficient of variation)
- BIO16: Precipitation of the Wettest Quarter in mm
- BIO17: Precipitation of the Driest Quarter in mm
- BIO18: Precipitation of the Warmest Quarter in mm
- BIO19: Precipitation of the Coldest Quarter in mm
- NDVI_Mean: Mean Normalized Difference Vegetation Index
- NDVI StDev: Standard Deviation of Normalized Difference Vegetation Index
- QuickSCAT: Surface Moisture
- SRTM: Elevation (m)
- TREE: Tree Coverage
- LC11: Open Water; areas of open water, generally with less than 25% cover of vegetation or soil.
- LC12: Perennial Ice/Snow; areas characterized by a perennial cover of ice and/or snow, generally greater than 25% of total cover.
- LC21: Developed, Open Space; areas with a mixture of some constructed materials, but mostly vegetation in the form of lawn grasses. Impervious surfaces account for less than 20% of total cover. These areas most commonly include large-lot single-family housing units, parks, golf courses, and vegetation planted in developed settings for recreation, erosion control, or aesthetic purposes.
- LC22: Developed, Low Intensity; areas with a mixture of constructed materials and vegetation. Impervious surfaces account for 20% to 49% of total cover. These areas most commonly include single-family housing units.
- LC23: Developed, Medium Intensity; areas with a mixture of constructed materials and vegetation. Impervious surfaces account for 50% to 79% of the total cover. These areas most commonly include single-family housing units.
- LC31: Barren Land (Rock/Sand/Clay); areas of bedrock, desert pavement, scarps, talus, slides, volcanic material, glacial debris, sand dunes, strip mines, gravel pits and other accumulations of earthen material. Generally, vegetation accounts for less than 15% of total cover.
- LC41: Deciduous Forest; areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. More than 75% of the tree species shed foliage simultaneously in response to seasonal change.
- LC42: Evergreen Forest; areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. More than 75% of the tree species maintain their leaves all year. Canopy is never without green foliage.
- LC43: Mixed Forest; areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. Neither deciduous nor evergreen species are greater than 75% of total tree cover.
- LC51: Dwarf Scrub; Alaska-only areas dominated by shrubs less than 20 centimeters tall with shrub canopy typically greater than 20% of total vegetation. This type is often co-associated with grasses, sedges, herbs, and non-vascular vegetation.
- LC52: Shrub/Scrub; areas dominated by shrubs less than 5 meters tall with shrub canopy typically greater than 20% of total vegetation. This class includes true shrubs, young trees in an early successional stage, or trees stunted from environmental conditions.
- LC71: Grassland/Herbaceous; areas dominated by graminoid or herbaceous vegetation, generally greater than 80% of total vegetation. These areas are not subject to intensive management such as tilling, but can be utilized for grazing.
- LC72: Sedge/Herbaceous; Alaska-only areas dominated by sedges and forbs, generally greater than 80% of total vegetation. This type can occur with significant other grasses or other grass-like plants, and includes sedge tundra, and sedge tussock tundra.
- LC73: Lichens; Alaska-only areas dominated by fruticose or foliose lichens generally greater than 80% of total vegetation.
- LC74: Moss; Alaska-only areas dominated by mosses, generally greater than 80% of total vegetation.
- LC81: Pasture/Hay; areas of grasses, legumes, or grass-legume mixtures planted for livestock grazing or the production of seed or hay crops, typically on a perennial cycle. Pasture/hay vegetation accounts for greater than 20% of total vegetation.
- LC82: Cultivated Crops; areas used for the production of annual crops, such as corn, soybeans, vegetables, tobacco, and cotton, and also perennial woody crops such as orchards and vineyards. Crop vegetation accounts for greater than 20% of total vegetation. This class also includes all land being actively tilled.
- LC90: Woody Wetlands; areas where forest or shrubland vegetation accounts for greater than 20% of vegetative cover and the soil or substrate is periodically saturated with or covered with water.
- LC95: Emergent Herbaceous Wetlands; areas where perennial herbaceous vegetation accounts for greater than 80% of vegetative cover and the soil or substrate is periodically saturated with or covered with water.
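Because the temperature variables are stored as degrees C multiplied by 10, they need rescaling before being read as plain degrees. The sketch below assumes the CSV column names match the header labels listed above; the example row is made up. BIO3 and BIO4 are left out because in WorldClim they are derived indices rather than plain temperatures.

```python
import csv
import io

# Columns documented as temperatures in C * 10 (assumed to match the CSV header).
TEMPERATURE_COLUMNS = [
    "BIO1", "BIO2", "BIO5", "BIO6", "BIO7", "BIO8", "BIO9", "BIO10", "BIO11",
]


def rescale_temperatures(rows):
    """Return copies of the rows with temperature columns converted to degrees C."""
    converted_rows = []
    for row in rows:
        converted = dict(row)
        for name in TEMPERATURE_COLUMNS:
            # Skip columns that are absent or empty rather than guessing a value.
            if converted.get(name) not in (None, ""):
                converted[name] = float(converted[name]) / 10.0
        converted_rows.append(converted)
    return converted_rows


# Made-up example row; real values come from the environmental CSV.
sample = "Status,BIO1,BIO12\n1,105,320\n"
rows = rescale_temperatures(csv.DictReader(io.StringIO(sample)))
print(rows[0]["BIO1"], rows[0]["BIO12"])
```

Precipitation columns such as BIO12 are left untouched, since they are already in mm.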
Order and GPS data for sites in gradient forest analysis
| Site | Lat | Long | Status |
| --- | --- | --- | --- |
| LakeHavasu | 34.47956 | -114.31766 | Resident |
| Phoenix | 33.33345 | -112.18368 | Res-Mig |
| Colorado | 39.83 | -104.84 | Migratory |
| Idaho | 43.065 | -116.05417 | Migratory |
| Imperial | 32.65 | -115.61 | Res-Mig |
| NorCal | 37.42971 | -121.99854 | Resident |
| Nevada | 36.30138 | -115.34678 | Resident |
| ORBaker | 44.8 | -117.83 | Migratory |
| ORDepot | 45.84 | -119.43 | Migratory |
| Riverside | 33.71 | -116.18 | Resident |
| SouthDakota | 43.49 | -103.31 | Migratory |
| SanDiego | 32.55 | -116.98 | Resident |
| Utah | 40.28153 | -112.30692 | Migratory |
| Washington | 46.26 | -119.11 | Migratory |
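Isolation-by-distance comparisons need pairwise geographic distances between sites. The sketch below shows one common way to get them from the coordinates above, using the haversine great-circle formula with a mean Earth radius of 6371 km. It is an illustration, not necessarily the distance measure used in the study, and only three sites are included to keep it short.

```python
import math

# Latitude and longitude (decimal degrees) copied from the site list above.
SITES = {
    "LakeHavasu": (34.47956, -114.31766),
    "Phoenix": (33.33345, -112.18368),
    "SanDiego": (32.55, -116.98),
}


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(h))


# Print the distance for every unordered pair of sites.
names = sorted(SITES)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(haversine_km(SITES[first], SITES[second]), 1))
```

The resulting distance matrix can then be compared against a matrix of pairwise genetic differentiation.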
Methods
Details regarding sample collection, genome sequencing, and sequence processing may be found in Methods S1; but notably, we sequenced a reference genome to high coverage and 202 burrowing owl samples collected across their migratory and resident breeding range to low coverage. Because our resequencing dataset was low coverage, we used variant detection and analytical methods that largely did not require called genotypes. This included both genotype likelihoods as estimated in the program ANGSD (Korneliussen et al., 2014) and a single‐read‐sampling (SRS) method that randomly selects one read per variant to temper the bias of high variation in locus‐to‐locus depths. Using these methods, files were prepared for analyses as described below using the following four filtering and genotyping frameworks and conditions: (1) Using ANGSD to produce genotype likelihood files for all individuals in the BEAGLE format (-doGlf 3) and a minor allele frequency file (-doMaf 1) with restrictive filtering that uses a conservative minimum minor allele frequency (-minMaf 0.05), a stringent p-value threshold for calling a site polymorphic (-SNP_pval 1e-6), adjusting mapQ scores for excessive mismatches from the reference genome (-C 50), and confirming variants using a base alignment quality estimation (-baq 1). (2) For SRS analyses, we used the ‘HaplotypeCaller’ module in GATK (McKenna et al., 2010) to call genotypes for all individuals sequenced, filtered by removing insertion/deletion variants, and kept only biallelic variants found in 50% of the individuals.
(3) We used ANGSD to create population‐specific site frequency spectra (SFSs) from site allele frequency files using the reference genome to polarize allele calls (-anc), adjusting frequencies using individual FIS (-indF), and with strict filtering conditions including discarding reads without unique mapping (-uniqueOnly 1), removing bad reads (-remove_bads 1), using only reads for which mates are mapped (-only_proper_pairs 1), discarding reads with low mapping quality (-minMapQ 1), keeping reads with high base quality (-minQ 20), dropping sites with low or high depth across samples (-setMinDepth 10 -setMaxDepth 500), keeping only biallelic sites (-skipTriallelic 1), and also previously described conditions (-minMaf 0.05 -C 50 -baq 1). (4) Minor allele frequency files (MAFs) were also generated for each sample site (-doMaf 1), sampling all the sites identified in the overall MAF file, and only generating minor allele frequencies for variants found in a minimum of four individuals in each population.
Usage notes
ANGSD