
Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans

Cite this dataset

Buffalo, Vince; Kern, Andrew (2024). Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans [Dataset]. Dryad. https://doi.org/10.5061/dryad.qnk98sfnv

Abstract

Across the human genome, there are large-scale fluctuations in genetic diversity caused by the indirect effects of selection. This can be thought of as a "linked selection signal" that reflects the impact of selection varying according to the placement of functional regions and recombination rates along the genome. Previous work has shown that negative selection against the steady influx of new deleterious mutations into conserved regions is the predominant mode of selection in humans. However, the theoretic model that underpins these results, classic Background Selection theory, is only applicable when new mutations are so deleterious that they cannot fix in the population. Here, we develop a statistical method based on a quantitative genetics view of linked selection, which models the effects of weak draft created according to how polygenic additive fitness variance is distributed along the genome. Using a recent model that jointly predicts the equilibrium fitness variance and substitution rates due to both strongly and weakly deleterious mutations, we estimate the distribution of fitness effects (DFE) and mutation rate across three human populations. While our model can accommodate weaker selection, we initially find evidence across three human populations of very strong selection against deleterious mutations, consistent with previous work. However, the corresponding predicted substitution rates for conserved regions are unreasonably low and disagree with observed rates. We hypothesize this could be due to selected sites experiencing a further diminished population size due to selective interference. When we account for this in our method, we find evidence of weakly deleterious mutations in conserved regions, which brings the predicted substitution rate into agreement with observations. However, these models lead to implausibly large mutation rate estimates.
Overall, while our model of the genomic linked selection signal brings us a step towards uniting population and quantitative genetic selection models with the substitution process, our work suggests considerable uncertainty remains about the processes generating fitness variance in humans.

README: Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans

Usage Notes

All files are in standard Python file formats. To load the pickle files, install the accompanying bprime software, available on GitHub.

Note that all TSV files here were written by analyses in Jupyter notebooks that are available on the bprime GitHub page.
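
As a minimal sketch of the loading step described above, the helper below unpickles a single file with the standard library. The function name load_pickle is hypothetical and not part of bgspy. It assumes, based on the note above, that the stored objects may be instances of classes from the accompanying software, so that software must be importable before unpickling.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Unpickle one file and return the stored object.

    Assumption: the object may be an instance of a class defined in the
    accompanying software; if that package is not installed, pickle
    raises ModuleNotFoundError or AttributeError.
    """
    path = Path(path)
    with path.open("rb") as handle:
        # Only unpickle files you trust: pickle can run arbitrary code.
        return pickle.load(handle)
```

For example, load_pickle("cadd6__decode__altgrid.pkl") would return the stored object for the CADD 6% model, and calling type() on the result shows which class it is.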

Files

Model Fits

There are pickle files of model results, generated by bgspy collect.

  • cadd6__decode__altgrid.pkl: CADD 6%
  • cadd8__decode__altgrid.pkl: CADD 8%
  • CDS_genes_phastcons__decode__altgrid.pkl: Feature Priority
  • phastcons_CDS_genes__decode__altgrid.pkl: PhastCons Priority
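
The mapping between model labels and file names in the list above can be kept in one place. This is only a sketch: the names MODEL_FIT_FILES and locate_model_fits are hypothetical, and data_dir stands for whatever local directory the dataset was downloaded to.

```python
from pathlib import Path

# Labels and file names copied from the list above.
MODEL_FIT_FILES = {
    "CADD 6%": "cadd6__decode__altgrid.pkl",
    "CADD 8%": "cadd8__decode__altgrid.pkl",
    "Feature Priority": "CDS_genes_phastcons__decode__altgrid.pkl",
    "PhastCons Priority": "phastcons_CDS_genes__decode__altgrid.pkl",
}


def locate_model_fits(data_dir):
    """Return {label: Path} for the model-fit pickles present in data_dir."""
    data_dir = Path(data_dir)
    found = {}
    for label, filename in MODEL_FIT_FILES.items():
        candidate = data_dir / filename
        if candidate.is_file():
            found[label] = candidate
    return found
```

Any label missing from the returned dictionary corresponds to a file that has not been downloaded yet.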

Files Produced by Sims

  • empiricalB_chr10__expansion_false__h_0.5__results.npz: simulation B
    "empirical" B maps for fixed demography

  • empiricalB_chr10__expansion_1.004_9.3__h_0.5__results.npz: simulation B
    "empirical" B maps for expansion

  • bmap_rescaled_hg38_sims_1e-8w_1e-3t_10000step_chr10.pkl: rescaled B maps

  • bmap__conserved_cds_utrs_phastcons_merged__hapmap__fixed_empiricalB.pkl:
    theoretic B and B' maps, generated by bgspy
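
The array names stored inside the .npz files listed above are not documented here, so it can help to inspect them before loading anything. An .npz file is a zip archive holding one .npy member per array, which means the names can be listed with the standard library alone. The helper name list_npz_arrays is hypothetical.

```python
import zipfile


def list_npz_arrays(path):
    """Return the names of the arrays stored in a NumPy .npz archive.

    Each array is stored as a zip member called "<name>.npy", so the
    names can be read without loading any array data.
    """
    with zipfile.ZipFile(path) as archive:
        return [
            name[:-4] if name.endswith(".npy") else name
            for name in archive.namelist()
        ]
```

With NumPy installed, numpy.load(path) returns an object whose .files attribute gives the same names, and indexing that object by name loads the corresponding array.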

Files Produced by Notebooks

main_fits.ipynb

  • main_fits_summary.tsv: all LOCO R2, R2, etc.

diversity_data.ipynb

  • accessibilty.tsv: genome-wide accessibility statistics

region_simulations.ipynb

  • figure_1_bmap_data.pkl: Python pickle file of the B maps in Figure 1
  • region_simulation_data.tsv.gz: simulation data containing ratchet rates, VA, etc.
  • region_theory_data.pkl: theoretic predictions for segments
  • simulation_ratchet_data.tsv.gz: simulation ratchet data
  • theory_simulation_comparison.tsv: joined table for comparison
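
The TSV files, including the gzip-compressed ones above, can be read without extra dependencies. The sketch below assumes that each file starts with a header row, which is not stated in this README. The helper name read_tsv is hypothetical.

```python
import csv
import gzip


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by column name.

    Files ending in ".gz" are decompressed on the fly.
    Assumption: the first line is a header row. Values stay as strings.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, read_tsv("region_simulation_data.tsv.gz") would return one dictionary per row. Numeric columns need converting with float() or int() before use.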

substitution.ipynb

  • phylofit_by_feature.tsv: divergence data from PhyloFit. Also, all raw files are in data/phylo/pfests_by_feature.

method_evaluation.ipynb

  • mle_sim_evaluation_results.tsv: the joined MLE/simulation results for the method evaluation figure

Predicted B' maps for the CADD 6% model

  • cadd6_summary.tsv: contains predicted B' maps at Mb scale for each population, with pairwise diversity and average recombination rates.

Methods

These Python pickle files contain the model outputs from bgspy (http://github.com/vsbuffalo/bprime/) for the CADD 6%, CADD 8%, PhastCons Priority, and Feature Priority models.

Funding

National Institutes of Health, Award: R35GM148253

National Institutes of Health, Award: R01HG010774