Skip to main content

Genomic variation in the Black-throated Green Warbler (Setophaga virens) suggests divergence in a disjunct Atlantic Coastal Plain population (S. v. waynei)

Cite this dataset

Carpenter, John P et al. (2022). Genomic variation in the Black-throated Green Warbler (Setophaga virens) suggests divergence in a disjunct Atlantic Coastal Plain population (S. v. waynei) [Dataset]. Dryad.


We used whole-genome resequencing to estimate genetic distinctiveness in the Black-throated Green Warbler (Setophaga virens)—including S. v. waynei—a putative subspecies that occupies a narrow disjunct breeding range along the Atlantic Coastal Plain. Despite detecting low-global differentiation (FST = 0.027) across the entire species, the principal components analysis of genome-wide differences shows the main axis of variation separates S. v. waynei from all other S. v. virens samples. We also estimated a low-migration rate for S. v. waynei, but found them to be most similar to another disjunct population from the Piedmont of North Carolina, and detected evidence of a historical north-to-south geographic dispersal among the entire species. New World wood warblers (family: Parulidae) can exhibit strong phenotypic differences among species, particularly, in song and plumage; however, within-species variation in these warblers—often designated as subspecies—is much more subtle. The existence of several isolated Black-throated Green Warbler populations across its eastern North American breeding range offers an excellent opportunity to further understand the origin, maintenance, and conservation status of subspecific populations. Our results, combined with previously documented ecological and morphological distinctiveness, support that S. v. waynei be considered a distinct and recognized subspecies worthy of targeted conservation efforts.


For all individuals, we collected ≥50 μL of blood from the brachial vein, which was stored in either Queen’s lysis buffer or ethanol (Seutin et al. 1991). In addition, two S. v. virens tissue samples from New York were donated by The Cornell University Museum of Vertebrates.

Our final dataset (n = 29) included five S. v. virens samples from New York that were sequenced in a previous study (Baiz et al. 2021). These samples were prepared with an identical method but sequenced separately from the new samples included in the current study (n = 24). For phenetic analyses we also included a single Golden-cheeked Warbler (Setophaga chrysoparia) as an outgroup, also published in Baiz et al. (2021). Genomic DNA was extracted from blood using the Qiagen DNeasy Blood and Tissue kit following the nucleated blood spin-column protocol with these minor modifications: we used 75 μL of blood and the 56°C incubation was performed overnight. All samples were normalized to 2 ng/μL and DNA was sheared with a Covaris S220 to a target insert size of 350 bp, according to the TruSeq Nano protocol. We then prepared sequencing libraries using the Illumina TruSeq Nano DNA kit and submitted them to the Pennsylvania State University Genomics Core Facility for sequencing (150 nt, paired-end) on a single NextSeq High Output lane, targeting genome-wide coverage of 4-5X per individual.

Whole-Genome Resequencing Analysis

We removed adapter sequences and quality trimmed reads using AdapterRemoval v2.1.7  (Schubert et al. 2016). We then aligned reads to a new chromosome-level assembly of a Yellow-rumped Warbler (Setophaga coronata) (Baiz et al. 2021) using Bowtie2 (Langmead and Salzberg 2012). PCR duplicates were marked with Picard tools (Broad Institute 2021) and quantified coverage statistics with qualimap (v2.2.1; Okonechnikov et al. 2016). 

We analyzed the resultant assemblies using the ANGSD bioinformatics pipeline (Korneliussen et al. 2014), the most appropriate method for low-coverage data. To measure the extent of differentiation, we calculated the global estimation of FST between the main range S. v. virens and S. v. waynei. We then generated a windowed estimate of FST in 10 Kb windows across the genome, comparing S. v. virens and S. v. waynei, thus quantifying whether different parts of the genome are more divergent than other regions of the genome between the groups. 

To identify genes within the divergent regions, we used the annotation information associated with the S. coronata genome (Baiz et al. 2021), which used SNAP gene predictions, trained on the Zebra Finch (Taeniopygia guttata) genome, within the MAKER (Holt and Yandell 2011) annotation pipeline.  We focused on coding sequences where MAKER had either transcript or protein matches with the Zebra Finch annotation. We first identified genes within the two regions (see below) that were bounded by 10Kb FST windows >0.15. We also focused on two genes intersecting the two highest FST windows.

We next performed Principal Components Analysis (PCA) to determine the genome-wide signal of clustering. We used PCAngsd (Meisner and Albrechtsen 2018) to generate a covariance matrix from genome-wide genotype likelihoods, and R 4.0.5 (R Core Team 2021) to calculate and plot eigenvalues. Lastly, we constructed a bootstrapped dendrogram to estimate relationships of the various populations. Genotype likelihoods generated by ANGSD for 31,241,801 sites were analyzed using ngsDist v1.0.8 (Vieira et al. 2016) and run with 100 bootstrap replicates to calculate distance matrices. We used these to produce distance-based trees with FastME v2.1.6.2 (Lefort et al. 2015) set to default options, and combined support values on the main tree using RAxML v8.2.12 (Stamatakis 2014).

To quantify how population structure and gene flow varies across the landscape, we used the Estimated Effective Migration Surface (EEMS; Pekova et al. 2015). As the program requires individual differences in called SNPs, we output genotype calls from ANGSD using “-doGeno 4 -postCutoff 0.99  -SNP_pval 1e-6 -minMaf 0.06”. This resulted in 12,314,343 SNPs, which we used “bed2diffs_v1” to estimate pairwise SNP differences between each individual. To estimate migration surfaces, we used “runeems_snps” set for 300 “nDemes” with a burn-in of 5,000,000 iterations and run for 20,000,000 iterations. We inspected the posterior probability of the MCMC chain to confirm that it had converged. Finally, we used the program “eems.plots” in R (R Core Team 2021) to plot the surfaces.

Usage notes

All files are text files which contain the code needed to run the analyses and should be readable by any text editor. The code refers to software which is freely assesible.


Pennsylvania State University

Arkansas State University

North Carolina Wildlife Resources Commission

Pittman-Robertson Federal Aid Wildlife Restoration

Catawba College

Eberly College of Science

Huck Institutes of the Life Sciences