Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans
Cite this dataset
Buffalo, Vince; Kern, Andrew (2024). Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans [Dataset]. Dryad. https://doi.org/10.5061/dryad.qnk98sfnv
Abstract
Across the human genome, there are large-scale fluctuations in genetic diversity caused by the indirect effects of selection. This can be thought of as a “linked selection signal” that reflects the impact of selection varying according to the placement of functional regions and recombination rates along the genome. Previous work has shown that negative selection against the steady influx of new deleterious mutations into conserved regions is the predominant mode of selection in humans. However, the theoretic model that underpins these results, classic Background Selection theory, is only applicable when new mutations are so deleterious that they cannot fix in the population. Here, we develop a statistical method based on a quantitative genetics view of linked selection, which models the effects of weak draft created according to how polygenic additive fitness variance is distributed along the genome. Using a recent model that jointly predicts the equilibrium fitness variance and substitution rates due to both strongly and weakly deleterious mutations, we estimate the distribution of fitness effects (DFE) and mutation rate across three human populations. While our model can accommodate weaker selection, we initially find evidence across three human populations of very strong selection against deleterious mutations, consistent with previous work. However, the corollary predicted substitution rates for conserved regions are unreasonably low and in disagreement with observed rates. We hypothesize this could be due to selected sites experiencing a further diminished population size due to selective interference. When we account for this in our method, we find evidence of weakly deleterious mutations in conserved regions, which brings the predicted substitution rate into agreement with observations. However, these models lead to implausibly large mutation rate estimates.
Overall, while our model of the genomic linked selection signal brings us a step towards uniting population and quantitative genetic selection models with the substitution process, our work suggests considerable uncertainty remains about the processes generating fitness variance in humans.
README: Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans
Usage Notes
All files are in standard Python file formats. To load the pickle files, install the accompanying bprime software, available on GitHub.
Note that all TSV files here were written by analyses in Jupyter notebooks that are available on the bprime GitHub page.
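As a minimal sketch of loading one of the pickle files, assuming the bprime software is already installed so that the pickled classes can be imported: the helper name load_fit is hypothetical and is not part of bprime. The structure of the loaded object is not documented here, so the example only prints its type.

```python
import pickle
from pathlib import Path


def load_fit(path):
    """Unpickle one file and return the stored object (hypothetical helper)."""
    # Unpickling needs the pickled classes to be importable, which is why
    # the bprime software must be installed first.
    # Only unpickle files obtained from a source you trust.
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Example usage, run only if the file has been downloaded next to this script.
fit_path = Path("cadd6__decode__altgrid.pkl")
if fit_path.exists():
    fit = load_fit(fit_path)
    # Inspect the object rather than assuming any particular attributes.
    print(type(fit))
```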
Files
Model Fits
There are pickle files of model results, generated by bgspy collect.
- cadd6__decode__altgrid.pkl: CADD 6%
- cadd8__decode__altgrid.pkl: CADD 8%
- CDS_genes_phastcons__decode__altgrid.pkl: Feature Priority
- phastcons_CDS_genes__decode__altgrid.pkl: PhastCons Priority
Files Produced by Sims
- empiricalB_chr10__expansion_false__h_0.5__results.npz: simulation "empirical" B maps for fixed demography
- empiricalB_chr10__expansion_1.004_9.3__h_0.5__results.npz: simulation "empirical" B maps for expansion
- bmap_rescaled_hg38_sims_1e-8w_1e-3t_10000step_chr10.pkl: rescaled B maps
- bmap__conserved_cds_utrs_phastcons_merged__hapmap__fixed_empiricalB.pkl: theoretic B and B' maps, generated by bgspy
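The names of the arrays stored in the .npz files are not listed in this README. A small sketch for discovering them, relying only on the fact that a .npz file is a zip archive with one .npy member per array; the helper name npz_array_names is hypothetical. With NumPy installed, numpy.load(path).files reports the same names.

```python
import zipfile
from pathlib import Path


def npz_array_names(path):
    """Return the names of the arrays stored in a .npz archive."""
    suffix = ".npy"
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as a member called "<name>.npy".
        return [
            name[: -len(suffix)]
            for name in archive.namelist()
            if name.endswith(suffix)
        ]


# Example usage, run only if the file has been downloaded next to this script.
npz_path = Path("empiricalB_chr10__expansion_false__h_0.5__results.npz")
if npz_path.exists():
    print(npz_array_names(npz_path))
```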
Files Produced by Notebooks
main_fits.ipynb
- main_fits_summary.tsv: all LOCO R2, R2, etc.
diversity_data.ipynb
- accessibilty.tsv: genome-wide accessibility stats
region_simulations.ipynb
- figure_1_bmap_data.pkl: Python pickle file of the B maps in Figure 1
- region_simulation_data.tsv.gz: simulation data containing ratchet rates, VA, etc.
- region_theory_data.pkl: theoretic predictions for segments
- simulation_ratchet_data.tsv.gz: simulation ratchet data
- theory_simulation_comparison.tsv: joined table for comparison
substitution.ipynb
- phylofit_by_feature.tsv: divergence data from PhyloFit. Also, all raw files are in data/phylo/pfests_by_feature.
method_evaluation.ipynb
- mle_sim_evaluation_results.tsv: the joined MLE/simulation results for method evaluation figure
Predicted B' maps for the CADD 6% model
- cadd6_summary.tsv: contains predicted B' maps at Mb scale for each population, with pairwise diversity and average recombination rates.
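The TSV files listed above, including the gzip-compressed ones, can be read with the Python standard library alone. The sketch below assumes the first line of each file is a header row, which this README does not state; the helper names open_text and read_tsv are hypothetical. Because the column names are not documented here, the example prints them instead of assuming any.

```python
import csv
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_tsv(path, limit=None):
    """Return (column names, rows as dicts); stop after `limit` rows if given."""
    with open_text(path) as handle:
        # Assumption: the first line is a tab-separated header row.
        reader = csv.DictReader(handle, delimiter="\t")
        rows = []
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(row)
        return reader.fieldnames, rows


# Example usage, run only if the file has been downloaded next to this script.
tsv_path = Path("cadd6_summary.tsv")
if tsv_path.exists():
    columns, first_rows = read_tsv(tsv_path, limit=5)
    print(columns)
    print(first_rows)
```

All values are returned as strings; convert numeric columns as needed once the column names are known.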
Methods
These Python pickle files contain the model outputs from bgspy (http://github.com/vsbuffalo/bprime/) for the CADD 6%, CADD 8%, PhastCons Priority, and Feature Priority Models.
Funding
National Institutes of Health, Award: R35GM148253
National Institutes of Health, Award: R01HG010774