Skip to main content

Data from: The American Kestrel (Falco sparverius) genoscape: implications for monitoring, management, and subspecies boundaries

Cite this dataset

Ruegg, Kristen et al. (2021). Data from: The American Kestrel (Falco sparverius) genoscape: implications for monitoring, management, and subspecies boundaries [Dataset]. Dryad.


Identifying population genetic structure is useful for inferring evolutionary process as well as defining subspecies boundaries and/or conservation units that can aid in species management. The American kestrel (Falco sparverius) is a widespread species with two described North American subspecies, (F. s. sparverius  and  F. s. paulus), the latter in the southeastern United States and the former across the remainder of its distribution.  In many parts of their range, American kestrels have been declining, but it has been difficult to interpret demographic trends without a clearer understanding of gene flow among populations.  Here we sequence the first American kestrel genome and scan the genome of 197 individuals from 12 sampling locations across the range of the two North American subspecies to identify population structure.  To validate signatures of population structure and fill in sampling gaps across the breeding range we screen 192 outlier loci in an additional 376 samples from 34 sampling locations.  Overall, our analyses support the existence of 5 genetically distinct lineages within American kestrels - eastern, western, Texas, Florida, and Alaska.  Interestingly, we find that while our genome-wide genetic data support the existence of previously described subspecies boundaries, genetic differences across the species’ range correlate more with putative migratory phenotypes (resident, long-distance, and short-distance migrants) rather than  a priori  described subspecies boundaries per se.  Based on our results, we suggest the resulting five genetic lineages serve as the foundation for American kestrel conservation and management in the face of future threats.  


SNP discovery and SNP filtering

High-density RADseq was carried out on 287 individuals from 12 sampling locations following a modified version of the bestRAD library preparation protocol (Table SI 1; Ali et al. 2016).  The program Stacks (Catchen et al. 2013) was used to demultiplex, filter and trim adapters from the data with the process_radtags function and remove duplicate read pairs using the clone_filter function.  Bowtie2 was used to map reads to the genome (Langmead and Salzberg 2012), and the HaplotypeCaller in the Genome Analysis Toolkit was used to identify SNPs (McKenna et al. 2010, Auwera et al. 2013). Vcftools (Danecek et al. 2011) was used to remove indels, non-biallelic SNPs, and low quality and rare variants (genotype quality 20; coverage depth 10; minor allele frequency 0.05). The final number of SNPs and individuals to be retained for further analyses was assessed by visualizing the tradeoff between discarding low-coverage SNPs and discarding individuals with missing genotypes using custom scripts within the R-package genoscapeRtools (Anderson 2019). Because preliminary analyses revealed outliers in the principal components analysis and we were concerned about sample contamination between individuals during the library preparation stage, we filtered out individuals with >30% heterozygosity as heterozygosity is expected to be high in cases where multiple individuals are combined into a single well.

Identification of outlier SNPs for Population Assignment

Population genomic analyses were conducted on all SNPs that passed our filters to assess genome-wide patterns of genetic divergence and identify SNPs for population assignment and assay design.  Population genetic structure was assessed by calculating pairwise population level FST (with different sampling sites representing different populations) with bootstrapped confidence intervals using the R package assigner (Gosselin et al. 2019), and principal components analysis (PCA) using SNPRelate (Zheng et al. 2012). To test for isolation by distance we compared linearized FST with pairwise geographic distance calculated from the central longitudinal and latitudinal coordinates of each location using the Vincenty ellipsoid method in the R package Geoshpere (Hijmans 2019). Because the PCA of all SNPs from the genome-wide analysis revealed five major groups, including Alaska, Texas, the west, the east and Florida (see results, Figure 1), subsequent analyses focused on developing SNPs for population assignment within and among these groups. 

To identify SNPs useful for population assignment between the 5 genetically distinct populations, we used vcftools (Danecek et al. 2011) to calculate site-wise FST between populations and identify individual SNPs with the most power for discriminating between populations (SNPs with the biggest allele frequency differences). It is important to note that population genetic summary statistics were based on the full RADseq data set (see above) rather than downstream SNP dataset in order to avoid potential biases associated with selecting SNPs with the highest discriminatory power for population assignment.  Custom R-scripts were used to evaluate which of our top-ranking SNPs would generate designable assays based on the following parameters:  1) Guanine – Cytosine content was less than 0.65, 2) there were no insertions or deletions (indels) within 30bp of the variable site, or 3) there were no ambiguous codes within 20bp of the variable site. Additionally, we used bwa-mem (Burrows-Wheeler Aligner; Li and Durbin 2009) to determine which of our designable SNPs mapped uniquely to the reference genome.  Fluidigm SNPtype assays (Fluidigm Inc.) were then developed in the 216 top ranking SNPs that passed our filters. 

Genetic screening and building the genoscape

Ninety-three samples and three non-template controls were screened on the Fluidigm Corporation EP1 Genotyping System (Fluidigm Inc.) and assays were ranked by variability and call rate to identify the most reliable 192 SNP assays of 216 that were designed.  The 192 variable SNP assays with the highest call rate were used to screen an additional 396 American Kestrel feather samples from 34 breeding locations in the United States and Canada in order to fill in sampling gaps and refine the resulting map of population genetic structure (Table 1). Following the methods described in Ruegg et al. (2014), we amplified PCR products using fluorescently labelled allele-specific primers and then used the EP1 Array Reader and Fluidigm’s automated Genotyping Analysis Software (Fluidigm Inc.) to call alleles with a confidence threshold of 90%. Each genotype was also visually inspected for potential irregularities and uncertain genotype calls were removed from the analysis. Samples with missing genotypes at >25% of SNP assays were removed from the analyses and SNP loci with >25% missing genotypes were removed, resulting in a total of 376 additional individuals at 186 SNP loci that could be used to identify genetic structure across the range (Table 1).

Usage notes

This dataset includes the filtered RADseq dataset. It is in the vcf format, which includes 72,263 filtered genotypes for all 197 American Kestrel individuals. Missing data is coded as "./.".

The dataset also includes the structure file for the SNPtype assays for all breeding individuals. This includes 186 loci, population information and genetic group information. Given the dataset does not include SNP information, we've also included the plink map and ped files used to create the structure file. The map file includes genomic location information in 4 columns. By default, each line of the MAP file describes a single marker with scaffold information, snp identifier, physical base-pair position (0 if unknown) and base-pair posiiton. The ped file includes genotype information for all positions in the .map file for each individual, along with location information (FID).

Distance matrices are also included: pairwise linearized FST values for the 10 population that had greater than 4 individuals and pairsie geographic distance matrices calculated from the Latitude and Longitude of the 10 populations (this information can be found in the published tables).