Genotype likelihoods for low-coverage whole-genome sequencing data of yellow warblers
Data files
Jan 03, 2024 version, files total 798.76 MB
README.md (2.04 KB)
yewa.known.ind105.ds_2x.beagle.gz (798.76 MB)
yewa.known.ind105.reference.IDs.txt (1.72 KB)
Abstract
The following datasets include the required input files used to empirically test population assignment in WGSassign on Yellow Warbler data. The file "yewa.known.ind105.ds_2x.beagle.gz" includes the filtered variants of 105 Yellow Warbler individuals output as genotype likelihoods and stored in a Beagle-formatted file. The ID file, "yewa.known.ind105.reference.IDs.txt", is a tab-delimited file with 2 columns, the first being the sample ID, and the second being the known reference population. The sample order in the ID file should match that of the input beagle file. To measure the assignment accuracy of WGSassign, we used leave-one-out cross validation using the input beagle file and our ID file.
The dataset consists of two files: a Beagle file, yewa.known.ind105.ds_2x.beagle.gz, and a text file, yewa.known.ind105.reference.IDs.txt.
Description of the data and file structure
To measure the assignment accuracy of WGSassign, we used leave-one-out cross validation (the --loo specification in WGSassign) using the input beagle file (yewa.known.ind105.ds_2x.beagle.gz) and our ID file (yewa.known.ind105.reference.IDs.txt).
The file "yewa.known.ind105.ds_2x.beagle.gz" includes the filtered variants of 105 Yellow Warbler individuals output as genotype likelihoods and stored in a Beagle-formatted file. In a Beagle-formatted file, the first column is the marker (chromosome and position), the second column is the major allele, and the third column is the minor allele. The remaining columns come in groups of three per individual. For the first individual, the fourth column is the genotype likelihood for the major/major genotype, the fifth column is the genotype likelihood for the major/minor genotype, and the sixth column is the genotype likelihood for the minor/minor genotype; each subsequent individual occupies the next three columns in the same order.
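The column layout described above can be illustrated with a minimal Python sketch. It is not part of the deposited scripts. The function names, the header labels, and the toy marker and likelihood values are hypothetical and chosen only for illustration.

```python
import io


def parse_beagle_line(line):
    """Split one Beagle data line into marker, alleles, and per-individual triplets."""
    fields = line.split()
    marker, major, minor = fields[:3]
    values = [float(x) for x in fields[3:]]
    if len(values) % 3 != 0:
        raise ValueError(f"{marker}: likelihood columns are not a multiple of three")
    # Group as (major/major, major/minor, minor/minor) for each individual.
    triplets = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return marker, major, minor, triplets


def read_beagle(handle):
    """Yield parsed records from an open text handle, skipping the header line."""
    next(handle)  # the first line is assumed to be a header
    for line in handle:
        if line.strip():
            yield parse_beagle_line(line)


# Toy file with two individuals; header labels and values are made up.
toy = io.StringIO(
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "chr1_100\tA\tG\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n"
)
records = list(read_beagle(toy))
```

For the deposited file, the same generator could be fed a handle opened with `gzip.open("yewa.known.ind105.ds_2x.beagle.gz", "rt")`. Streaming line by line avoids loading the roughly 800 MB compressed file into memory at once.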
The ID file is a tab-delimited file with 2 columns, the first being the sample ID, and the second being the known reference population from which the sample was collected. The sample order in the ID file should match the sample order of the columns in the input beagle file. Leave-one-out cross validation was run using the script WGSassign.loo.sh.
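As a sanity check before running an analysis, the number of rows in the ID file can be compared with the number of individuals implied by the Beagle header. The sketch below is a hypothetical helper, not one of the authors' scripts. It assumes the header carries only generic per-individual labels, so it verifies the count but cannot verify the order. The sample IDs shown are invented.

```python
import io


def read_ids(handle):
    """Return (sample_id, population) pairs from a two-column tab-delimited file."""
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        sample_id, population = line.split("\t")
        pairs.append((sample_id, population))
    return pairs


def n_individuals_in_beagle_header(header_line):
    """Individuals implied by a Beagle header: three columns each after the first three."""
    n_cols = len(header_line.split())
    if n_cols < 3 or (n_cols - 3) % 3 != 0:
        raise ValueError("unexpected number of columns in Beagle header")
    return (n_cols - 3) // 3


# Toy inputs; the sample IDs and header labels are made up.
ids = read_ids(io.StringIO("sampleA\tNorth\nsampleB\tSouth\n"))
header = "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1"
counts_match = len(ids) == n_individuals_in_beagle_header(header)
```

For the deposited files, both counts would be expected to equal 105. Because the check cannot confirm ordering, the order still has to be guaranteed by how the Beagle file was generated.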
Sharing/Access information
The script used to run leave-one-out cross validation in WGSassign and the R script used to check assignment accuracy are available at https://github.com/mgdesaix/WGSassign-manuscript-data/tree/main/yellow-warbler.
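The accuracy check itself amounts to assigning each individual to the population with the highest leave-one-out likelihood and comparing that with its known population. The following Python sketch illustrates the idea under stated assumptions. It is not the authors' R script, and it assumes the leave-one-out results are available as one row of log-likelihoods per individual, with one value per population in a known order. All names and values are hypothetical.

```python
def assign(loglikes, populations):
    """Return the population with the highest log-likelihood for one individual."""
    best = max(range(len(populations)), key=lambda k: loglikes[k])
    return populations[best]


def assignment_accuracy(loglike_rows, known, populations):
    """Fraction of individuals whose top-likelihood population equals the known one."""
    if len(loglike_rows) != len(known):
        raise ValueError("one row of log-likelihoods is needed per known label")
    correct = sum(
        assign(row, populations) == truth
        for row, truth in zip(loglike_rows, known)
    )
    return correct / len(known)


# Toy log-likelihoods for four individuals (made-up values, columns in the
# assumed order North, Central, South).
populations = ["North", "Central", "South"]
toy_loglikes = [
    [-10.0, -12.5, -15.0],
    [-20.0, -18.0, -19.0],
    [-9.0, -8.5, -8.0],
    [-7.0, -6.0, -9.0],
]
toy_known = ["North", "Central", "South", "North"]
accuracy = assignment_accuracy(toy_loglikes, toy_known, populations)  # 0.75
```

In the toy data the fourth individual is known to be from North but has its highest likelihood for Central, so three of four assignments are correct.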
We used WGSassign on data from yellow warblers to test its accuracy when applied to individuals from a species exhibiting isolation by distance (Bay et al. 2021; Gibbs et al. 2000). Previous work on yellow warblers has found weak differentiation between populations, with pairwise FST values on the order of 0.01 or less (Gibbs et al. 2000). Blood samples from 105 individuals were collected via brachial venipuncture in 2020 and 2021. These served as reference samples from 3 populations (North, Central, and South) previously described in Bay et al. (2021) and Gibbs et al. (2000). We extracted DNA from blood using the manufacturer’s protocol for Qiagen DNeasy Blood and Tissue Kits. Whole genome sequencing libraries were prepared following modifications of Illumina’s Nextera Library Preparation protocol (Schweizer & DeSaix 2023) and sequenced on a HiSeq 4000 at Novogene Corporation Inc., with a target sequencing depth of 2X per individual.
Sequences were trimmed with TrimGalore version 0.6.5 (https://github.com/FelixKrueger/TrimGalore) and mapped to the NCBI yellow warbler reference genome (Sayers et al. 2022; accession number JANCRA010000000) using the Burrows-Wheeler Aligner software version 0.7.17 (Li & Durbin 2009). After mapping, the resulting SAM files were sorted, converted to BAM files, and indexed using Samtools version 1.9 (Li et al. 2009). We used MarkDuplicates from GATK version 4.1.4.0 (McKenna et al. 2010) to mark read duplicates and clipped overlapping reads with the clipOverlap function from bamUtil (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap). To reduce variation in sequencing depth, we used the DownsampleSam function from GATK to down-sample BAM files with greater than 2X coverage to 2X coverage. To identify genetic markers from low-coverage WGS data, we used stringent filtering options in ANGSD version 0.9.40 (Korneliussen et al. 2014). We retained reads with a mapping quality of at least 30 and base quality of at least 33. We retained SNPs that had read data in at least 50% of individuals and a minor allele frequency greater than 0.05. The filtered variants were output as genotype likelihoods and stored in a Beagle-formatted file.
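Two of the thresholds above reduce to simple arithmetic, sketched here in Python. The sketch assumes that down-sampling is driven by a per-read retention probability and that the 50% requirement is rounded up to a whole number of individuals. Neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def keep_probability(mean_depth, target_depth=2.0):
    """Fraction of reads to retain so that mean depth drops to the target.

    Samples already at or below the target are left untouched (probability 1).
    """
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    return min(1.0, target_depth / mean_depth)


def min_individuals(n_individuals, min_fraction=0.5):
    """Smallest whole number of individuals meeting the minimum data fraction.

    Rounding up is an assumption; the exact threshold used is not given.
    """
    return math.ceil(n_individuals * min_fraction)


p_high = keep_probability(4.0)   # a 4X sample keeps half of its reads
p_low = keep_probability(1.5)    # a 1.5X sample is not down-sampled
n_min = min_individuals(105)     # 50% of 105 is 52.5, rounded up to 53
```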