Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.
Main README
This Dryad page provides the data and the source code for the figures in "Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences" by Pouyet F*, Aeschbacher S*, Thiery A and Excoffier L.
* co-first authors
It contains the following compressed files:
-- 1 -- Genotype tables and annotations from 1000G and SGDP data, with README files that describe the data and how to use them: 1000G_genot-table.zip, SGDP_genotypetable-annotation.zip
-- 2 -- the R scripts to produce the figures of INDA: figures_INDA_Rscripts.zip
-- 3 -- the confidence intervals of the estimates, as well as the source code to estimate them: confidenceIntervals.zip
-- 4 -- the scripts (R and Python) to compute the SFS, estimate confidence intervals and make the SFS figures: figures_data_SFS_Rscripts.zip
-- 5 -- the parameter settings for the FastSimCoal demographic inferences: Supplementary file - settings files for demographi ...
-- 6 -- the scripts for SLiM simulations and analyses (bash and R code), a table with the parameters used in the SLiM simulations (.csv), and the scripts and raw data used to make the figures (.txt and .R): si_files_SA.tar.gz
Additional README files describe the files and how to run the scripts to compute the statistics or to make the figures.
SGDP_genotypetable-annotation
It contains the following compressed files:
- Genotype tables and annotations from SGDP (see the file README_columns genotypeTables)
README files describe the files and how to run them.
Supplementary file - settings files for demographic inferences
confidence Intervals
Files with 95% CI
confidenceIntervals.zip
si_files_SA.tar
SLiM simulations
1000G.extraData_distanceHotspotPhastcons
Annotations of distance to hotspots and distance to conserved elements (phastcons) in centimorgans for 1000G data.
1000G_genot-table
Contains as a compressed file the genotype table with annotation for 1000G data. Please see the README file for further description and information on how to make the figures. Author: Fanny Pouyet
1000G_Nb_DerAll_byPop
Files from which the SFS are computed. Author: Fanny Pouyet
figures_data_SFS_Rscripts
Files to make the SFS figures, and scripts to compute the SFS. The SFS are computed from the files in "1000G_Nb_derAll.zip". For further details, see the README file.
regions.YRIrecomb1p5
Regions from hg19 with a recombination rate above 1.5 cM/Mb (using the YRI recombination map; see the publication). Mutations in these regions that are GC-conservative (A to T, T to A, G to C and C to G) can be considered neutral for demographic inferences.
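As an illustration of how this file could be used, the following is a minimal Python sketch that keeps only GC-conservative SNPs falling inside the listed regions. It assumes the regions are given as standard BED lines (0-based, half-open, non-overlapping intervals) and that SNP positions are 1-based; the function names and the example coordinates are hypothetical and are not taken from the scripts distributed here.

```python
from bisect import bisect_right

# GC-conservative changes named above: A<->T and C<->G.
GC_CONSERVATIVE = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def parse_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.

    Assumes standard BED coordinates (0-based start, exclusive end).
    """
    regions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position `pos` lies in a region.

    Assumes the intervals of each chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    pos0 = pos - 1  # convert to the 0-based BED coordinate
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def keep_snp(regions, chrom, pos, ancestral, derived):
    """Keep a SNP if it is GC-conservative and lies in a listed region."""
    change = (ancestral.upper(), derived.upper())
    return change in GC_CONSERVATIVE and in_regions(regions, chrom, pos)


# Hypothetical example regions, for illustration only.
example_regions = parse_bed(["chr1\t100\t200", "chr1\t500\t600"])
```

With the example regions above, `keep_snp(example_regions, "chr1", 150, "A", "T")` retains the SNP, whereas the same position with an A to G change is discarded because that change is not GC-conservative.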
regions.YRIrecomb1p5.nophastConsElements46wayPrimates100bp
Same as "regions.YRIrecomb1p5.bed", with regions close to conserved elements removed (see Fig1S9 and Fig2S4 for the impact of conserved elements on DAFi and on the SFS). In this file, the removed regions lie less than 100bp from conserved elements.
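The kind of filtering described above can be sketched as an interval subtraction: each conserved element is extended by a padding distance on both sides and the padded intervals are cut out of the high-recombination regions. This is only a rough sketch under stated assumptions, not the procedure used to build the file: intervals are half-open (start, end) pairs on a single chromosome, the function name is hypothetical, and the exact boundary convention (whether a base exactly 100 bp away is removed) is an assumption.

```python
def subtract_padded(regions, elements, pad=100):
    """Remove from `regions` every base within `pad` bp of an element.

    `regions` and `elements` are lists of half-open (start, end)
    intervals on the same chromosome. Returns the remaining intervals.
    """
    # Extend each conserved element by `pad` on both sides.
    padded = sorted((max(0, s - pad), e + pad) for s, e in elements)
    kept = []
    for start, end in sorted(regions):
        cursor = start  # first base of the region not yet processed
        for ps, pe in padded:
            if pe <= cursor:
                continue  # padded element lies entirely before the cursor
            if ps >= end:
                break  # padded element lies entirely after the region
            if ps > cursor:
                kept.append((cursor, ps))  # keep the gap before the element
            cursor = max(cursor, pe)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((cursor, end))  # keep the tail of the region
    return kept
```

For example, `subtract_padded([(0, 1000)], [(400, 500)])` splits the region into `(0, 300)` and `(600, 1000)`.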