Negative linkage disequilibrium between amino acid changing variants reveals interference among deleterious mutations in the human genome
Data files
Mar 17, 2021 version files 905.18 MB
Mar 30, 2021 version files 935.21 MB
Abstract
While there has been extensive work on patterns of linkage disequilibrium (LD) for neutral loci, the extent to which negative selection impacts LD is less clear. Forces like Hill-Robertson interference and negative epistasis are expected to lead to deleterious mutations being found on distinct haplotypes. However, the extent to which these forces depend on the selection and dominance coefficients of deleterious mutations and shape genome-wide patterns of LD in natural populations with complex demographic histories has not been tested. In this study, we first used forward-in-time simulations to generate predictions as to how selection impacts LD. Under models where deleterious mutations have additive effects on fitness, deleterious variants less than 10 kb apart tend to be carried on different haplotypes, generating an excess of negative LD relative to pairs of synonymous SNPs. In contrast, for recessive mutations, there is no consistent ordering of how selection coefficients affect r2 decay. We then examined empirical data of modern humans from the 1000 Genomes Project. LD between derived nonsynonymous SNPs is more negative compared to pairs of derived synonymous variants. This result holds when matching SNPs for frequency in the sample (allele count), physical distance, magnitude of background selection, and genetic distance between pairs of variants, suggesting that this result is not due to these potential confounding factors. Lastly, we introduce a new statistic HR(j) which allows us to detect interference using unphased genotypes. Application of this approach to high-coverage human genome sequences confirms our finding that deleterious alleles tend to be located on different haplotypes more often than are neutral alleles. Our findings suggest that either interference or negative epistasis plays a pervasive role in shaping patterns of LD between deleterious variants in the human genome, and consequently influencing genome-wide patterns of LD.
Methods
Forward Simulations
The distribution of genomic elements in our forward simulations followed the specification in the SLiM 4.2.2 manual (7.3), which is modeled after the distribution of intron and exon lengths in Deutsch and Long [57]. Within exonic regions, nonsynonymous (NS) and synonymous (S) mutations were set to occur at a ratio of 2.31:1 [58]. For simulations using a distribution of fitness effects (DFE; coded as -999 in the selection_coefficient column), the selection coefficients (s) of NS mutations were drawn from a gamma-distributed DFE with shape parameter 0.186 and expected selection coefficient E[s] = −0.01314833 [59]. All NS mutations were either additive (h = 0.5) or recessive (h = 0.0). The per base pair per generation recombination rate was constant across each simulation region and took one of the values r ∈ {10⁻⁶, 10⁻⁷, 10⁻⁸, 10⁻⁹}, while the per base pair per generation mutation rate was set to µ = 1.5 × 10⁻⁸. No simulation parameters were scaled.
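To make the DFE parameterization concrete, the following is a minimal Python sketch of drawing selection coefficients from a gamma distribution with the shape and mean given above. It is not the simulation code used to produce these files (the simulations were run in SLiM). The function name `draw_selection_coefficient`, the seed, and the sample size are illustrative assumptions. The scale parameter is derived as |E[s]| / shape, and draws are negated so that mutations are deleterious.

```python
import random

SHAPE = 0.186            # gamma shape parameter from the Methods
MEAN_S = -0.01314833     # expected selection coefficient E[s] from the Methods
SCALE = abs(MEAN_S) / SHAPE  # gamma mean = shape * scale, so scale = |E[s]| / shape


def draw_selection_coefficient(rng):
    """Draw one deleterious selection coefficient (hypothetical helper).

    random.gammavariate(alpha, beta) takes shape alpha and scale beta;
    the draw is negated because the mutations are deleterious (s <= 0).
    """
    return -rng.gammavariate(SHAPE, SCALE)


# Illustrative check: the sample mean should be close to E[s].
rng = random.Random(1)   # arbitrary seed, chosen only for reproducibility
draws = [draw_selection_coefficient(rng) for _ in range(200_000)]
sample_mean = sum(draws) / len(draws)
print(f"sample mean s = {sample_mean:.5f} (target {MEAN_S})")
```

With a shape parameter this far below 1, most draws are very close to zero and a small fraction are strongly deleterious, which is why the sample mean converges slowly.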
## constant_dfe_and_constant_selection_ld copy.csv
This .csv has 10 columns. The first is recombination_rate. It describes the recombination rate per base pair per generation. The second is selection_coefficient. It describes the selection coefficient of deleterious variation in the forward simulations. The third is dominance_coefficient. The selection_coefficient s of a given mutation defines the mutation’s fitness effect when homozygous (1+s); when heterozygous, the fitness effect is modified by the dominance_coefficient h (1+hs). The fourth column is seed and defines the seed used for the random number generator of SLiM. The fifth and sixth columns are site1 and site2. They define the positions along the genome of variant 1 and variant 2 in the pairwise LD computation. The seventh column is the computed r2 and the eighth column is the distance between site1 and site2. The ninth and tenth columns are genome_length and number_chr. These columns give the simulated genome length (5 Mb) and the number of chromosomes (1).
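As an illustration of how this layout might be read, here is a minimal Python sketch that groups rows by selection model, dominance coefficient, and distance bin, then averages r2 within each group. Several things here are assumptions rather than facts about the file: the header names `r2` and `distance` for columns seven and eight, the 1000 bp bin size, and the helper names. The rows in `TOY_CSV` are made-up values used only to exercise the code, not real data. The -999 sentinel for DFE simulations and the fitness expressions 1+s and 1+hs are taken from the descriptions above.

```python
import csv
import io
from collections import defaultdict

DFE_SENTINEL = -999.0  # selection_coefficient value marking DFE simulations

# Made-up rows for illustration only; the "r2" and "distance" header names
# are assumptions about columns seven and eight.
TOY_CSV = """recombination_rate,selection_coefficient,dominance_coefficient,seed,site1,site2,r2,distance,genome_length,number_chr
1e-08,-999,0.5,1,100,600,0.40,500,5000000,1
1e-08,-999,0.5,1,100,2600,0.20,2500,5000000,1
1e-08,-0.001,0.0,2,300,900,0.10,600,5000000,1
1e-08,-0.001,0.0,2,300,700,0.30,400,5000000,1
"""


def genotype_fitness(s, h, homozygous):
    """Fitness effect of one mutation: 1+s if homozygous, 1+hs if heterozygous."""
    return 1.0 + s if homozygous else 1.0 + h * s


def mean_r2_by_bin(handle, bin_size=1000):
    """Average r2 per (model, dominance, distance-bin start) group."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, row count]
    for row in csv.DictReader(handle):
        s = float(row["selection_coefficient"])
        model = "dfe" if s == DFE_SENTINEL else "constant"
        h = float(row["dominance_coefficient"])
        bin_start = int(float(row["distance"])) // bin_size * bin_size
        key = (model, h, bin_start)
        sums[key][0] += float(row["r2"])
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


result = mean_r2_by_bin(io.StringIO(TOY_CSV))
for key in sorted(result):
    print(key, round(result[key], 3))
```

To run this against the actual file, the `io.StringIO(TOY_CSV)` argument would be replaced by an open file handle, and the column names adjusted to match the real header.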
Usage notes
These data sets are used in the manuscript available at https://www.biorxiv.org/content/10.1101/2020.01.15.907097v1.full. Scripts for their analysis can be found at https://github.com/JesseGarcia562/garcia_and_lohmueller_2020.