Plink files from NYGC 1KG to be used for Cancer GWAS project QBIO475
Data files
Oct 30, 2025 version files: 29.26 GB
Abstract
Cancer risk is influenced by genetic variation and environment. However, allele frequencies for many cancer-associated variants remain poorly characterized across global populations. To address this gap and provide a framework for teaching population genetics using real human genomic data, undergraduate researchers analyzed population-level allele frequency variation for a curated set of cancer-associated single nucleotide polymorphisms (SNPs). We assembled a set of variants from the GWAS Catalog databases based on reported associations with hereditary cancers, including breast, ovarian, colorectal, and lung cancer. Allele frequencies were extracted from the 1000 Genomes Project across five major continental groups. Students quantified differences in allele frequencies across these populations. The dataset includes curated PLINK files that may be used for future research or educational purposes in human genetics and bioinformatics.
Dataset DOI: 10.5061/dryad.hhmgqnkq7
Description of the data and file structure
PLINK dataset for the QBIO475 Cancer GWAS project:
- Biallelic SNPs from 1000 Genomes NYGC hg38
- Filtered to unrelated individuals
- Highly autozygous individuals also removed
- Regions with poor mappability removed
- Strict mask applied
- Individuals labeled with their super population
PLINK files are:
- allPops.allChroms.snps.QCIndivsForAuto_UnrelsOnly_superPopLabel.bed
- allPops.allChroms.snps.QCIndivsForAuto_UnrelsOnly_superPopLabel.bim
- allPops.allChroms.snps.QCIndivsForAuto_UnrelsOnly_superPopLabel.fam
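As a rough illustration of how these three files fit together, the Python sketch below reads the first few variants and computes the frequency of the first .bim allele (A1) within each group of samples. It relies only on the standard PLINK 1 binary layout (variant-major .bed, two bits per genotype). It also makes one assumption that is not confirmed by this record: that the super population label is stored in the first column (FID) of the .fam file. The function names `read_groups` and `a1_frequencies` are hypothetical and exist only for this example.

```python
import math
from collections import defaultdict

# Number of A1 copies for each 2-bit PLINK 1 genotype code.
# Code 0b01 means "missing" and is skipped below.
A1_COPIES = {0b00: 2, 0b10: 1, 0b11: 0}


def read_groups(fam_path):
    """Return one group label per sample, in .fam order.

    ASSUMPTION: the super population label is in column 1 (FID).
    Inspect the .fam file and adjust the column index if it is not.
    """
    groups = []
    with open(fam_path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                groups.append(fields[0])
    return groups


def a1_frequencies(prefix, max_variants=5):
    """Yield (variant_id, {group: A1 frequency}) for the first variants."""
    groups = read_groups(prefix + ".fam")
    bytes_per_variant = math.ceil(len(groups) / 4)  # 4 samples per byte

    with open(prefix + ".bim") as bim, open(prefix + ".bed", "rb") as bed:
        # Magic number of a variant-major PLINK 1 .bed file.
        if bed.read(3) != b"\x6c\x1b\x01":
            raise ValueError("not a variant-major PLINK 1 .bed file")

        for _, bim_line in zip(range(max_variants), bim):
            variant_id = bim_line.split()[1]  # .bim column 2 is the ID
            block = bed.read(bytes_per_variant)
            if len(block) != bytes_per_variant:
                raise ValueError("truncated .bed file")

            a1_count = defaultdict(int)
            called = defaultdict(int)
            for i, group in enumerate(groups):
                # Sample i sits in byte i // 4, lowest-order bits first.
                code = (block[i // 4] >> (2 * (i % 4))) & 0b11
                if code == 0b01:  # missing genotype
                    continue
                called[group] += 2  # two alleles per called genotype
                a1_count[group] += A1_COPIES[code]

            yield variant_id, {g: a1_count[g] / called[g] for g in called}
```

Called as `a1_frequencies("allPops.allChroms.snps.QCIndivsForAuto_UnrelsOnly_superPopLabel")`, this reads only the start of the .bed file, so it is a cheap way to check that the download is intact and that the group labels look as expected. For analyses across all variants, PLINK itself or a dedicated reader library is likely a better fit than this pure-Python loop.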
Access information
Other publicly accessible locations of the data:
- N/A
Data was derived from the following sources:
