Splice altering variant predictions in four archaic hominin genomes
Data files
Aug 08, 2022 version files (829.90 MB)
- archaic_data_with_constraint_moderns_introgression_sQTLs.txt (829.89 MB)
- README.txt (10.11 KB)
Dec 16, 2022 version files (871.96 MB)
Abstract
This file contains high-quality autosomal SNVs observed in four high-coverage archaic genomes aligned to the hg19/GRCh37 reference genome. Each entry corresponds to a single variant with a distinct GENCODE (Human Release 24) annotation per genomic position. Data per variant include the genomic position, reference and alternate alleles, archaic genotypes, gene annotation, and additional data relevant to the analysis of splicing variants:
- SpliceAI annotations
- gene constraint measured using data from gnomAD
- variant conservation measured using phyloP
- allele origin
- allele frequencies in modern humans from the Thousand Genomes Project and gnomAD
- introgression metadata
- sQTL data from GTEx
Methods
All data in this file are publicly available (see below). Archaic variants were filtered using bcftools to retain high-quality sites and high-quality genotypes. Missing data and fields that do not apply to a given variant are marked as "n/a". From the other datasets, only records matching a filtered archaic variant were included (see below). The dataframe was created using Pandas in a Python Jupyter notebook.
Data used in this notebook:
- Altai Neanderthal SNVs (http://ftp.eva.mpg.de/neandertal/Vindija/VCF/Altai/)
- Browning et al. 2018 Introgressed Variants (https://data.mendeley.com/datasets/y7hyt83vxr/1)
- Chagyrskaya Neanderthal SNVs (http://ftp.eva.mpg.de/neandertal/Chagyrskaya/VCF/)
- Denisovan SNVs (http://ftp.eva.mpg.de/neandertal/Vindija/VCF/Denisova/)
- gnomAD allele frequencies (https://gnomad.broadinstitute.org/downloads#v3-variants)
- gnomAD constraint (https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz)
- phyloP (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phyloP46way/primates.phyloP46way.bw)
- sQTLs (https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL.tar)
- Thousand Genomes Project (http://hgdownload.soe.ucsc.edu/gbdb/hg38/1000Genomes/)
- Vernot et al. 2016 Introgressed Tag SNPs (https://drive.google.com/drive/folders/0B9Pc7_zItMCVM05rUmhDc0hkWmc?resourcekey=0-zwKyJGRuooD9bWPRZ0vBzQ)
- Vindija Neanderthal SNVs (http://ftp.eva.mpg.de/neandertal/Vindija/VCF/Vindija33.19/)
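The matching step described in Methods (keeping only records from the other datasets that match a filtered archaic variant, and marking everything else "n/a") can be sketched as a left join in Pandas. This is a minimal illustration, not the notebook's actual code: the key columns (chrom, pos, ref, alt), the "af" column, and the annotate helper are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical join keys; the real file's column headers may differ.
KEYS = ["chrom", "pos", "ref", "alt"]


def annotate(archaic: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Left-join an external dataset onto the filtered archaic variants.

    Every archaic variant is kept; records in `other` that match no
    archaic variant are dropped. Cells without a value become "n/a".
    """
    merged = archaic.merge(other, on=KEYS, how="left")
    # Cast to object so a string placeholder can sit next to numeric values.
    return merged.astype(object).where(merged.notna(), "n/a")


# Toy data standing in for the filtered archaic variants.
archaic = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "pos": [100, 200, 300],
    "ref": ["A", "C", "G"],
    "alt": ["G", "T", "A"],
})

# Toy data standing in for one external dataset (e.g. allele frequencies).
freqs = pd.DataFrame({
    "chrom": ["1", "2", "3"],
    "pos": [100, 300, 400],
    "ref": ["A", "G", "T"],
    "alt": ["G", "A", "C"],
    "af": [0.12, 0.5, 0.9],
})

annotated = annotate(archaic, freqs)
```

A left join is used here because the archaic variant set defines the rows of the final table; the other datasets only contribute columns.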
Usage notes
Any text editor can open this file. Given its size, we recommend software that handles large dataframes well, such as R or Python.
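As one possible way to work with a file of this size in Python, the sketch below reads it in chunks with Pandas so the whole table never has to be held in memory at once. It assumes the file is tab-delimited, which is not stated above; adjust `sep` if the README says otherwise. The helper names iter_variants and count_rows are hypothetical.

```python
import pandas as pd


def iter_variants(path, chunksize=100_000, sep="\t"):
    """Yield the file as a sequence of DataFrame chunks.

    sep="\t" is an assumption about the file layout.
    "n/a" cells are parsed as missing values (NaN).
    """
    reader = pd.read_csv(path, sep=sep, na_values=["n/a"], chunksize=chunksize)
    yield from reader


def count_rows(path, **kwargs):
    """Count data rows without loading the whole file at once."""
    return sum(len(chunk) for chunk in iter_variants(path, **kwargs))
```

For example, `count_rows("archaic_data_with_constraint_moderns_introgression_sQTLs.txt")` would stream through the file one chunk at a time; per-chunk filters can be applied inside the loop in the same way.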