Geographic allele frequency variation in the 1000 Genomes hg38 NYGC dataset

Citation

Biddanda, Arjun; Rice, Daniel; Novembre, John (2020), Geographic allele frequency variation in the 1000 Genomes hg38 NYGC dataset, Dryad, Dataset, https://doi.org/10.5061/dryad.rjdfn2z7v

Abstract

A key challenge in human genetics is to describe and understand the distribution of human genetic variation. Often genetic variation is described by showing relationships among populations or individuals, in each case drawing inferences over a large number of variants. Here, we present an alternative representation of human genetic variation that reveals the relative abundance of different allele frequency patterns across populations. This approach allows viewers to easily see several features of human genetic structure: (1) most variants are rare and geographically localized, (2) variants that are common in a single geographic region are more likely to be shared across the globe than to be private to that region, and (3) where two individuals differ, it is most often due to variants that are common globally, regardless of whether the individuals are from the same region or different regions. To guide interpretation of the results, we also apply the visualization to contrasting theoretical scenarios with varying levels of divergence and gene flow. Our variant-centric visualization clarifies the major geographic patterns of human variation and can be used to help correct potential misconceptions about the extent and nature of genetic differentiation among populations.

Methods

These data were processed using a reproducible pipeline, available at https://github.com/aabiddanda/geovar_rep_paper. We downloaded VCF files from the New York Genome Center and filtered them to biallelic single-nucleotide polymorphisms marked "PASS" in the FILTER column. We then calculated allele frequencies separately for each population using PLINK v1.90.
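
To make the filtering and per-population frequency steps concrete, here is a minimal Python sketch of the same logic applied to VCF text lines. It is an illustration only and is not the pipeline in the linked repository, which uses PLINK v1.90. The function names, the sample-to-population mapping, and the handling of missing genotypes are all assumptions made for this example.

```python
from collections import defaultdict


def is_biallelic_pass_snp(ref, alt, filt):
    """Return True for PASS records with one single-base REF and one single-base ALT."""
    return (
        filt == "PASS"
        and len(ref) == 1
        and len(alt) == 1  # multiallelic ALT ("G,T") and indels are longer than 1
        and ref in "ACGT"
        and alt in "ACGT"
    )


def population_allele_frequencies(vcf_lines, sample_to_pop):
    """Compute the ALT allele frequency per population for each retained site.

    vcf_lines: iterable of VCF text lines (hypothetical input for this sketch).
    sample_to_pop: dict mapping sample ID to population label (assumed mapping).
    Returns {(chrom, pos, ref, alt): {population: alt_frequency}}.
    """
    samples = []
    results = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        if not is_biallelic_pass_snp(ref, alt, filt):
            continue
        alt_counts = defaultdict(int)
        totals = defaultdict(int)
        for sample, call in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue  # sample has no population label
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == ".":
                    continue  # assumption: missing alleles are not counted
                totals[pop] += 1
                alt_counts[pop] += allele == "1"
        results[(chrom, int(pos), ref, alt)] = {
            pop: alt_counts[pop] / totals[pop] for pop in totals
        }
    return results
```

Calling `population_allele_frequencies(lines, sample_to_pop)` on the lines of an uncompressed VCF returns one entry per retained site, keyed by chromosome, position, and alleles. For the full dataset, a dedicated tool such as PLINK is far more efficient than a pure-Python loop like this one.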

Usage Notes

Please see attached README.md for additional information. 

Funding

National Institute of General Medical Sciences, Award: GM132383