Main model fits and substitution rate predictions for: A quantitative genetic model of background selection in humans
Data files
Jan 29, 2024 version files 42.95 GB
- bmap__conserved_cds_utrs_phastcons_merged__hapmap__fixed_empiricalB.pkl: 435.14 MB
- bmap_rescaled_hg38_sims_1e-8w_1e-3t_10000step_chr10.pkl: 733.92 MB
- cadd6__decode__altgrid.pkl: 3.89 GB
- cadd6_summary.tsv: 482.11 KB
- cadd8__decode__altgrid.pkl: 3.48 GB
- CDS_genes_phastcons__decode__altgrid.pkl: 7.38 GB
- empiricalB_chr10__expansion_1.004_9.3__h_0.5__results.npz: 3.61 GB
- empiricalB_chr10__expansion_false__h_0.5__results.npz: 3.61 GB
- figure_1_bmap_data.pkl: 443.69 MB
- main_fits_summary.tsv: 8.10 KB
- mle_sim_evaluation_results.tsv: 97.36 KB
- phastcons_CDS_genes__decode__altgrid.pkl: 19.24 GB
- phylofit_by_feature.tsv: 10.95 KB
- README.md: 2.32 KB
- region_simulation_data.tsv.gz: 29.54 MB
- region_theory_data.pkl: 96.10 MB
- simulation_ratchet_data.tsv.gz: 160.92 KB
- theory_simulation_comparison.tsv: 13.46 KB
Abstract
Across the human genome, there are large-scale fluctuations in genetic diversity caused by the indirect effects of selection. This can be thought of as a “linked selection signal” that reflects the impact of selection varying according to the placement of functional regions and recombination rates along the genome. Previous work has shown that negative selection against the steady influx of new deleterious mutations into conserved regions is the predominant mode of selection in humans. However, the theoretic model that underpins these results, classic Background Selection theory, is only applicable when new mutations are so deleterious that they cannot fix in the population. Here, we develop a statistical method based on a quantitative genetics view of linked selection, which models the effects of weak draft created according to how polygenic additive fitness variance is distributed along the genome. Using a recent model that jointly predicts the equilibrium fitness variance and substitution rates due to both strongly and weakly deleterious mutations, we estimate the distribution of fitness effects (DFE) and mutation rate across three human populations. While our model can accommodate weaker selection, we initially find evidence across three human populations of very strong selection against deleterious mutations, consistent with previous work. However, the corollary predicted substitution rates for conserved regions are unreasonably low and disagree with observed rates. We hypothesize this could be due to selected sites experiencing a further diminished population size due to selective interference. When we account for this in our method, we find evidence of weakly deleterious mutations in conserved regions, which brings the predicted substitution rate into agreement with observations. However, these models lead to implausibly large mutation rate estimates.
Overall, while our model of the genomic linked selection signal brings us a step towards uniting population and quantitative genetic selection models with the substitution process, our work suggests considerable uncertainty remains about the processes generating fitness variance in humans.
Usage Notes
All files are in standard Python file formats. To load the pickle files, install the accompanying bprime software available on GitHub.
Note that all TSV files here were written by analyses in Jupyter notebooks that are available on the bprime GitHub page.
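As a minimal sketch of how the pickle and TSV files could be opened, the helpers below use only the Python standard library. The function names `load_pickle` and `read_tsv` are hypothetical and are not part of bprime. The structure of the unpickled objects and the column names of the TSV files are not documented here, so the code makes no assumptions about them.

```python
import csv
import gzip
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a .pkl file (hypothetical helper, not part of bprime).

    Unpickling needs the classes that wrote the file to be importable,
    which is why bprime should be installed first. Only unpickle files
    from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def read_tsv(path):
    """Read a .tsv or .tsv.gz file into a list of dicts keyed by column name.

    All values are returned as strings; convert them as needed.
    """
    path = Path(path)
    # Gzip-compressed tables (*.tsv.gz) are opened in text mode directly.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_tsv("main_fits_summary.tsv")` or `read_tsv("region_simulation_data.tsv.gz")` would return one dict per row, and `load_pickle("figure_1_bmap_data.pkl")` would return whatever object was stored in that file.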
Files
Model Fits
There are pickle files of model results, generated by bgspy collect.
- cadd6__decode__altgrid.pkl: CADD 6%
- cadd8__decode__altgrid.pkl: CADD 8%
- CDS_genes_phastcons__decode__altgrid.pkl: Feature Priority
- phastcons_CDS_genes__decode__altgrid.pkl: PhastCons Priority
Files Produced by Sims
- empiricalB_chr10__expansion_false__h_0.5__results.npz: simulation B maps (“empirical” B) for fixed demography
- empiricalB_chr10__expansion_1.004_9.3__h_0.5__results.npz: simulation B maps (“empirical” B) for expansion
- bmap_rescaled_hg38_sims_1e-8w_1e-3t_10000step_chr10.pkl: rescaled B maps
- bmap__conserved_cds_utrs_phastcons_merged__hapmap__fixed_empiricalB.pkl: theoretic B and B’ maps, generated by bgspy
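The two .npz files are about 3.61 GB each, so it can help to see which arrays they hold before loading any data. An .npz file is a zip archive containing one .npy file per array, so the names can be listed with the standard library alone. The sketch below assumes nothing about the array names in these files, which are not documented here; `list_npz_arrays` is a hypothetical helper.

```python
import zipfile


def list_npz_arrays(path):
    """Return the names of the arrays stored in an .npz archive.

    Only the zip directory is read, so no array data is loaded into memory.
    """
    suffix = ".npy"
    with zipfile.ZipFile(path) as archive:
        return [name[:-len(suffix)] for name in archive.namelist()
                if name.endswith(suffix)]
```

Once the names are known, an individual array can be read with `numpy.load(path)[name]`, which loads each array from the archive only when it is accessed.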
Files Produced by Notebooks
- main_fits.ipynb
  - main_fits_summary.tsv: all LOCO R2, R2, etc.
- diversity_data.ipynb
  - accessibilty.tsv: genome-wide accessibility stats
- region_simulations.ipynb
  - figure_1_bmap_data.pkl: Python pickle file of the B maps in Figure 1
  - region_simulation_data.tsv.gz: simulation data containing ratchet rates, VA, etc.
  - region_theory_data.pkl: theoretic predictions for segments
  - simulation_ratchet_data.tsv.gz: simulation ratchet data
  - theory_simulation_comparison.tsv: joined table for comparison
- substitution.ipynb
  - phylofit_by_feature.tsv: divergence data from PhyloFit. Also, all raw files are in data/phylo/pfests_by_feature.
- method_evaluation.ipynb
  - mle_sim_evaluation_results.tsv: the joined MLE/simulation results for the method evaluation figure
Predicted B’ maps for the CADD 6% model
cadd6_summary.tsv
: contains predicted B’ maps at Mb scale for each population, with pairwise diversity and average recombination rates.
These Python pickle files contain the model outputs from bgspy (http://github.com/vsbuffalo/bprime/) for the CADD 6%, CADD 8%, PhastCons Priority, and Feature Priority Models.