Low-coverage whole genome sequencing for highly accurate population assignment: Mapping migratory connectivity in the American Redstart (Setophaga ruticilla)
Data files
Aug 24, 2023 version, 4.35 GB total
- amre_breeding_data.csv (13.36 KB)
- amre_genoscape.tif (643.28 KB)
- amre_nonbreeding_data.csv (4.93 KB)
- amre.breeding.122.equal_effective.WGSassign_IDs.txt (2.87 KB)
- amre.combined.47.148.122.beagle.IDs.txt (3.43 KB)
- amre.combined.47.148.122.ds_0.01x.beagle.gz (1.29 GB)
- amre.combined.47.148.122.ds_0.1x.beagle.gz (574.82 MB)
- amre.combined.47.148.122.ds_2.0x.beagle.gz (2.49 GB)
- amre.combined.47.148.WGSassignIDs.txt (3.60 KB)
- README.md (4.56 KB)
Abstract
Understanding the geographic linkages among populations across the annual cycle is an essential component for understanding the ecology and evolution of migratory species and for facilitating their effective conservation. While genetic markers have been widely applied to describe migratory connections, the rapid development of new sequencing methods, such as low-coverage whole genome sequencing (lcWGS), provides new opportunities for improved estimates of migratory connectivity. Here, we use lcWGS to identify fine-scale population structure in a widespread songbird, the American Redstart (Setophaga ruticilla), and accurately assign individuals to genetically distinct breeding populations. Assignment of individuals from the nonbreeding range reveals population-specific patterns of varying migratory connectivity. By combining migratory connectivity results with demographic analysis of population abundance and trends, we consider full annual cycle conservation strategies for preserving numbers of individuals and genetic diversity. Notably, we highlight the importance of the Northern Temperate-Greater Antilles migratory population as containing the largest proportion of individuals in the species. Finally, we highlight valuable considerations for other population assignment studies aimed at using lcWGS. Our results have broad implications for improving our understanding of the ecology and evolution of migratory species through conservation genomics approaches.
Genetic sampling and library preparation
Sample site locations were chosen to maximize sampling coverage across the breeding and nonbreeding ranges of the American Redstart. We used genetic samples from a total of 330 individuals: 182 individuals from 16 locations across the breeding range and 148 individuals from 15 locations in the nonbreeding range (Table S1). Samples were collected between 1993 and 2022 and consisted of either blood from brachial venipuncture or feathers. We extracted DNA from blood samples using the standard protocol for Qiagen DNeasy Blood and Tissue Kits, and from feathers using a version of the protocol modified to maximize DNA yield. Whole genome sequencing libraries were prepared following modifications of Illumina’s Nextera Library Preparation protocol (Schweizer & DeSaix, 2023). Pooled libraries were sequenced on eight HiSeq 4000 lanes at Novogene Corporation Inc. with a target sequencing depth of 2X per individual.
Bioinformatics
We trimmed the sequence data to remove potential PCR artifacts using the program TrimGalore version 0.6.5 (https://github.com/FelixKrueger/TrimGalore), a wrapper for Cutadapt (Martin, 2011). We used the Burrows-Wheeler Aligner software version 0.7.17 (Li & Durbin, 2009) to map reads to a reference genome from the closely related Yellow Warbler (Setophaga petechia; Bay et al., 2018). After mapping, the resulting SAM files were sorted, converted to BAM files, and indexed using Samtools version 1.9 (Li et al., 2009). We marked read duplicates with MarkDuplicates from GATK version 4.1.4.0 (McKenna et al., 2010) and clipped overlapping reads with the clipOverlap function from bamUtil (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap). Sequencing depth for each individual was calculated using Samtools. Initial population genetics analyses revealed that high variation in sequencing depth among individuals had a large effect on the results. To reduce this variation, we followed the recommendations of Lou & Therkildsen (2022) and used the DownsampleSam function from GATK to randomly downsample reads from BAM files with greater than 2X coverage to 2X coverage.
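The downsampling step above amounts to keeping a random fraction of reads from each over-sequenced BAM file. The following is a minimal Python sketch of that fraction, assuming it is simply the target depth divided by the observed depth; the function name and the sample depths are hypothetical and are not taken from the study's scripts.

```python
def downsample_fraction(depth: float, target: float = 2.0) -> float:
    """Fraction of reads to keep so that expected depth equals `target`.

    Assumption: reads are retained independently at random, so expected
    depth scales linearly with the retained fraction. Files already at or
    below the target are left untouched (fraction 1.0).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return min(1.0, target / depth)


# Hypothetical per-individual depths, for illustration only.
example_depths = {"sample_A": 1.2, "sample_B": 2.0, "sample_C": 5.0}
for sample_id, depth in example_depths.items():
    print(sample_id, round(downsample_fraction(depth), 3))
```

Only individuals above 2X receive a fraction below 1, which matches the description that BAM files at or below 2X coverage were not altered.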
To identify genetic markers from low-coverage WGS data, we used stringent filtering options in ANGSD version 0.9.40 (Korneliussen et al., 2014). We retained reads with a mapping quality of at least 30 and base quality of at least 33. SNPs were identified based on a p-value of less than 1e-6. We retained SNPs that had read data in at least 50% of individuals (n = 165), a minor allele frequency greater than 0.05, and minimum and maximum total depths of 231 and 924, respectively. The minimum total depth threshold was set to the minimum number of individuals required to call a variant (n = 165) multiplied by the mean sequencing depth of all individuals (1.4X), giving 165 × 1.4 = 231. The maximum total depth threshold was set to twice the total number of individuals multiplied by the mean sequencing depth, giving 2 × 330 × 1.4 = 924. The filtered variants were output as genotype likelihoods and used in subsequent analyses.
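The depth thresholds follow directly from the sample size and mean depth reported above. As a worked check, this short Python sketch reproduces the arithmetic; the function and variable names are illustrative and are not ANGSD options.

```python
def depth_thresholds(n_individuals: int, mean_depth: float,
                     min_fraction: float = 0.5) -> tuple[int, int, int]:
    """Return (min individuals, min total depth, max total depth).

    min total depth = min individuals * mean depth
    max total depth = 2 * total individuals * mean depth
    Values are rounded to whole numbers to avoid floating-point noise.
    """
    min_individuals = round(n_individuals * min_fraction)
    min_total_depth = round(min_individuals * mean_depth)
    max_total_depth = round(2 * n_individuals * mean_depth)
    return min_individuals, min_total_depth, max_total_depth


# Values reported in the text: 330 individuals, mean depth 1.4X.
print(depth_thresholds(330, 1.4))  # (165, 231, 924)
```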
The Beagle files are gzip-compressed, tab-delimited text files (see the README.md for formatting details) containing the genotype likelihood data of the sampled American Redstarts (Setophaga ruticilla). These files can therefore be viewed using any software or utility that decompresses gzip. For example, with the `zcat` utility, a user can open a Terminal window on an OSX or Linux system and enter `zcat < amre.combined.47.148.122.ds_2.0x.beagle.gz` to view the entire file. However, viewing these files in their entirety is not recommended because of their large size. Bash shell utilities are recommended as open-source tools for viewing and editing these large gzip-compressed text files. The main purpose of the Beagle files is to allow users to process them with software such as PCAngsd and WGSassign, as demonstrated in the corresponding manuscript for this data set.
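For users who prefer a scripting language to shell utilities, the following is a minimal Python sketch that reads only the header and first few sites of a gzip-compressed Beagle file instead of decompressing all of it. It assumes the usual ANGSD Beagle layout of three leading columns (marker, allele1, allele2) followed by three genotype likelihood columns per individual; the README.md remains the authority on the actual format. The helper names and the toy file are hypothetical and are used only so the example runs without the large data files.

```python
import gzip
import os
import tempfile
from itertools import islice


def peek_beagle(path, n_sites=5):
    """Return (header fields, first n_sites rows) by streaming the file."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        rows = [line.rstrip("\n").split("\t")
                for line in islice(handle, n_sites)]
    return header, rows


def count_individuals(header):
    """Assumes 3 leading columns, then 3 likelihood columns per individual."""
    return (len(header) - 3) // 3


# Build a tiny toy file in the assumed layout (values are made up).
toy_path = os.path.join(tempfile.mkdtemp(), "toy.beagle.gz")
with gzip.open(toy_path, "wt") as out:
    out.write("marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n")
    out.write("scaffold1_10\t0\t1\t0.90\t0.10\t0.00\t0.33\t0.33\t0.33\n")
    out.write("scaffold1_25\t2\t3\t0.05\t0.90\t0.05\t0.80\t0.20\t0.00\n")

header, rows = peek_beagle(toy_path, n_sites=5)
print("individuals:", count_individuals(header))
print("sites read:", len(rows))
```

To inspect one of the deposited files, `toy_path` can be replaced with a path such as `amre.combined.47.148.122.ds_2.0x.beagle.gz`; because the file is streamed, only the requested lines are decompressed.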
The remaining files are uncompressed text files.