Most damaging CADD scores for hg19 human genome build (CADD scores generated with bStatistic removed)
Data files
Aug 31, 2023 version files 7.36 GB
- chr1.cadd.scores.npz
- chr10.cadd.scores.npz
- chr11.cadd.scores.npz
- chr12.cadd.scores.npz
- chr13.cadd.scores.npz
- chr14.cadd.scores.npz
- chr15.cadd.scores.npz
- chr16.cadd.scores.npz
- chr17.cadd.scores.npz
- chr18.cadd.scores.npz
- chr19.cadd.scores.npz
- chr2.cadd.scores.npz
- chr20.cadd.scores.npz
- chr21.cadd.scores.npz
- chr22.cadd.scores.npz
- chr3.cadd.scores.npz
- chr4.cadd.scores.npz
- chr5.cadd.scores.npz
- chr6.cadd.scores.npz
- chr7.cadd.scores.npz
- chr8.cadd.scores.npz
- chr9.cadd.scores.npz
- README.md
Abstract
Analyses of genetic variation in many taxa have established that neutral genetic diversity is shaped by natural selection at linked sites. Whether the mode of selection is primarily the fixation of strongly beneficial alleles (selective sweeps) or purifying selection on deleterious mutations (background selection) remains unknown, however. We address this question in humans by fitting a model of the joint effects of selective sweeps and background selection to autosomal polymorphism data from the 1000 Genomes Project. After controlling for variation in mutation rates along the genome, a model of background selection alone explains ~60% of the variance in diversity levels at the megabase scale. Adding the effects of selective sweeps driven by adaptive substitutions to the model does not improve the fit, and when both modes of selection are considered jointly, selective sweeps are estimated to have had little or no effect on linked neutral diversity. The regions under purifying selection are best predicted by phylogenetic conservation, with ~80% of the deleterious mutations affecting neutral diversity occurring in non-exonic regions. Thus, background selection is the dominant mode of linked selection in humans, with marked effects on diversity levels throughout autosomes.
README: Most damaging CADD scores for hg19 human genome build (CADD scores generated with bStatistic removed)
This dataset contains CADD scores for each human autosome, stored as zipped NumPy files (.npz format), one file per chromosome. Each site in the genome is assigned the most damaging CADD score at that site. These per-site scores were derived from a custom set of CADD scores generated by the Kircher Lab, which maintains and updates the CADD project. The Kircher Lab removed the bStatistic input (based on McVicker's B) from the set of annotations used to generate CADD scores: some of our work infers new B scores using CADD as an input, and we wanted to avoid the circularity of building new B scores from an annotation that includes the old B scores.
The original CADD scores from which these were derived can be found at: https://cadd.gs.washington.edu/
These files must be opened using NumPy. For each chromosome, the file holds the most damaging CADD score at every chromosomal site, with 0 marking missing data. Thus, for a chromosome 50 Mb long, the file is an array of 50 million scores.
Note that these are not identical to raw CADD scores: as described in the appendix, we took the most damaging of the three possible single-base substitutions at each site and kept that score (positions with no annotated score are given 0). The files are zipped NumPy arrays (.npz files). Each array has the length of the chromosome named in the filename, with one score per position (in the hg19 build). To extract the scores from a loaded file, use the key cadd.
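The per-site reduction described above can be sketched as follows. This is a minimal illustration on made-up numbers, not the pipeline used to build the files: the input layout (one row per site, one column per substitution, NaN for a missing score) and the function name most_damaging are assumptions, and "most damaging" is taken to mean the highest score, since higher CADD scores indicate more deleterious variants.

```python
import numpy as np

# Hypothetical input: one row per site, one column per possible substitution.
# NaN marks a substitution with no annotated score.
per_substitution = np.array([
    [1.2, 0.4, 3.1],
    [np.nan, 2.0, 0.5],
    [np.nan, np.nan, np.nan],  # site with no annotated scores at all
])


def most_damaging(scores):
    """Return the highest score per site, with 0 where every entry is missing."""
    missing = np.isnan(scores)
    # Treat missing entries as -inf so they can never win the max.
    filled = np.where(missing, -np.inf, scores)
    best = filled.max(axis=1)
    # Sites without any annotated substitution get 0, as in the data files.
    best[missing.all(axis=1)] = 0.0
    return best


per_site = most_damaging(per_substitution)
print(per_site)
```

The result has one value per site, which is the same shape convention the .npz arrays follow.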
Each file contains one score per site in its chromosome: the first position in the chromosome is at index 0 and the last position is at index (chromosome length)-1.
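Because the arrays are indexed from 0, a lookup by genomic coordinate needs an offset. The sketch below assumes coordinates are given 1-based (as in VCF files) and inclusive at both ends. The helper names score_at and scores_in_region are hypothetical, and the toy array stands in for a real chromosome array.

```python
import numpy as np


def score_at(scores, pos):
    """Return the score at a 1-based chromosome position."""
    if not 1 <= pos <= len(scores):
        raise IndexError(f"position {pos} is outside 1..{len(scores)}")
    # 1-based position p lives at array index p - 1.
    return scores[pos - 1]


def scores_in_region(scores, start, end):
    """Return the scores for the 1-based, inclusive interval [start, end]."""
    if not 1 <= start <= end <= len(scores):
        raise IndexError(f"region {start}-{end} is outside 1..{len(scores)}")
    # The slice end is exclusive, so `end` itself is the right stop index.
    return scores[start - 1:end]


# Toy stand-in for a chromosome array (0 marks missing data).
toy = np.array([0.0, 1.5, 2.5, 0.0, 4.0])
print(score_at(toy, 2))
print(scores_in_region(toy, 2, 4))
```

If your coordinates are already 0-based and half-open (as in BED files), plain slicing with scores[start:end] applies without the offset.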
NOTE: The CADD scores were generated using CADD v1.6 with the bStatistic annotation removed.
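How to open files:
import numpy as np
# Load the chr22 array with the key 'cadd' (replace the path with your own download location)
a = np.load('/Users/MURPHYD/Downloads/chr22.cadd.scores.npz')['cadd']
# The array length equals the length of chr22 in hg19
len(a)
51304566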
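The same pattern extends to several chromosomes. The sketch below is one possible way to loop over files and report the fraction of sites with missing data. It assumes the files sit in a single directory under their original names and that 0 only ever marks missing data. The helpers load_scores and fraction_missing are hypothetical, and small stand-in files are written to a temporary directory so the example runs without the real downloads.

```python
import os
import tempfile

import numpy as np


def load_scores(directory, chrom):
    """Load the per-site score array for one chromosome, e.g. chrom='chr22'."""
    path = os.path.join(directory, f"{chrom}.cadd.scores.npz")
    # The context manager closes the archive once the array has been read.
    with np.load(path) as archive:
        return archive["cadd"]


def fraction_missing(scores):
    """Fraction of sites whose score is 0 (the missing-data marker)."""
    return float(np.mean(scores == 0))


# Tiny stand-in files that mimic the naming scheme and the 'cadd' key.
toy_data = {"chr21": [0.0, 1.0, 2.0, 0.0], "chr22": [3.0, 0.0]}

with tempfile.TemporaryDirectory() as tmp:
    for chrom, values in toy_data.items():
        out_path = os.path.join(tmp, f"{chrom}.cadd.scores.npz")
        np.savez_compressed(out_path, cadd=np.array(values))

    summary = {
        chrom: fraction_missing(load_scores(tmp, chrom)) for chrom in toy_data
    }

print(summary)
```

With the real data, point the directory at your download location and iterate over chr1 through chr22. Each full-size array is held in memory while it is processed, so handling one chromosome at a time keeps usage bounded.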