Identifying and classifying shared selective sweeps from multilocus data
Harris, Alexandre; DeGiorgio, Michael (2020), Identifying and classifying shared selective sweeps from multilocus data, Dryad, Dataset, https://doi.org/10.5061/dryad.280gb5mm4
Positive selection causes beneficial alleles to rise to high frequency, resulting in a selective sweep of the diversity surrounding the selected sites. Accordingly, the signature of a selective sweep in an ancestral population may still remain in its descendants. Identifying signatures of selection in the ancestor that are shared among its descendants is important to contextualize the timing of a sweep, but few methods exist for this purpose. We introduce the statistic SS-H12, which can identify genomic regions under shared positive selection across populations and is based on the theory of the expected haplotype homozygosity statistic H12, which detects recent hard and soft sweeps from the presence of high-frequency haplotypes. SS-H12 is distinct from comparable statistics because it requires a minimum of only two populations, and properly identifies and differentiates between independent convergent sweeps and true ancestral sweeps, with high power and robustness to a variety of demographic models. Furthermore, we can apply SS-H12 in conjunction with the ratio of statistics we term H2Tot and H1Tot to further classify identified shared sweeps as hard or soft. Finally, we identified both previously-reported and novel shared sweep candidates from human whole-genome sequences. Previously-reported candidates include the well-characterized ancestral sweeps at
Usage Notes
This dataset contains the scripts, simulations, and summary data needed to confirm the results of our GENETICS manuscript, "Identifying and classifying shared selective sweeps from multilocus data." For each line of analysis we undertook, we provide simulated replicates, summary statistics, and the raw script files used for internal data analysis. The data are divided into four .tar.gz archives, titled *_p1*, *_p2*, *_p3*, and *_p4*. P-value results in the whole_genome_scans_singles directory of *_p4* supersede those in *_p1*. Data in the larger_simulated_sequences_1Mb directory of *_p4* were simulated following the same protocol as those in ceugihyri_sims_95 in *_p1*, but with a larger simulated chromosome.
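Because the data are spread over four archives, it can help to check which archive holds a given directory before extracting anything. The following is a minimal Python sketch of one way to do that, not part of the deposited scripts. It assumes the four archives sit in a single download folder, that their file names end in .tar.gz and contain _p1 through _p4, and that the directory names above appear as path components inside the archives. The function names find_archive_parts and archives_containing are hypothetical helpers introduced here for illustration.

```python
import tarfile
from pathlib import Path


def find_archive_parts(download_dir):
    """Return the .tar.gz files whose names contain _p1 ... _p4, sorted by name.

    Assumption: the exact archive file names are not given in the usage
    notes, so this matches on the _p1 ... _p4 substring only.
    """
    parts = []
    for path in sorted(Path(download_dir).glob("*.tar.gz")):
        if any(f"_p{i}" in path.name for i in range(1, 5)):
            parts.append(path)
    return parts


def archives_containing(download_dir, directory_name):
    """Return the names of the archive parts that hold a given directory."""
    hits = []
    for archive in find_archive_parts(download_dir):
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getnames():
                # Match the directory as a whole path component, not a substring.
                if directory_name in Path(member).parts:
                    hits.append(archive.name)
                    break
    return hits


if __name__ == "__main__":
    # Directory names taken from the usage notes; "." is a placeholder
    # for wherever the archives were downloaded.
    for name in (
        "whole_genome_scans_singles",
        "larger_simulated_sequences_1Mb",
        "ceugihyri_sims_95",
    ):
        print(name, "->", archives_containing(".", name))
```

Reading the member list this way only reads the archive index, so nothing is written to disk. If whole_genome_scans_singles is reported in both *_p1* and *_p4*, the P-values from *_p4* are the ones to use, as noted above.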