Data from: The effect of autopolyploidy on population genetic signals of hard sweeps
Monnahan, Patrick; Brandvain, Yaniv (2020), Data from: The effect of autopolyploidy on population genetic signals of hard sweeps, Dryad, Dataset, https://doi.org/10.5061/dryad.12jm63xt1
Searching for population genomic signals left behind by positive selection is a major focus of evolutionary biology, particularly as sequencing technologies develop and costs decline. The effect of the number of chromosome copies (i.e. ploidy) on the manifestation of these signals remains an outstanding question, despite a wide appreciation of ploidy being a fundamental parameter governing numerous biological processes. We clarify the principal forces governing the differential manifestation and persistence of the signal of selection by separating the effects of polyploidy on rates of fixation versus rates of diversity (i.e. mutation and recombination) with a set of coalescent simulations. We explore what the major consequences of polyploidy, such as a more localized signal, greater dependence on dominance, and longer persistence of the signal following fixation, mean for within- and across-ploidy inference on the strength and prevalence of selective sweeps. As genomic advances continue to open doors for interrogating natural systems, studies such as this aid our ability to anticipate, interpret, and compare data across ploidy levels.
Code for all simulations, analyses, and visualizations is available at https://github.com/pmonnahan/PloidySim
We simulate selective sweeps with mssel, assuming that polyploidy simply increases the number of chromosome copies, k, and thus the population mutation and recombination rates (θ=2Nkμ and ρ=2Nkr, respectively). We assume no preferential pairing between homologs, random mating (no self-fertilization/inbreeding), and no “double-reduction”. We further assume that populations are at equilibrium prior to selection and remain constant in size, and that equal numbers of individuals, n, are sampled for all ploidies (i.e. n * k haplotypes; see Supplemental Figure 1 for additional simulation details). We simulate a simple demographic scenario in which an ancestral population splits in two, at which point a beneficial mutation arises in the middle of a 1 Mb sequence (freely recombining, non-centromeric) in one population and ultimately fixes. We sample haplotypes from both populations at a specified time following fixation, using the non-selected population as a neutral baseline and for calculation of between-population measures of selection (FST & XP-EHH). Please see the manuscript for information on how allele frequency trajectories were generated for different ploidy levels.
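As a rough illustration of how ploidy enters the scaled rates above, the following Python sketch computes θ and ρ for several ploidy levels. The function name and all numeric values (N, μ, r) are hypothetical and chosen only for illustration; they are not the values used in the study. The optional L argument multiplies the per-site rates by the sequence length, matching the locus-wide form used in the mssel parameter descriptions.

```python
def scaled_rates(N, k, mu, r, L=1):
    """Return (theta, rho) for population size N and ploidy k.

    theta = 2*N*k*mu*L and rho = 2*N*k*r*L, so both rates grow
    linearly with the number of chromosome copies k.
    """
    theta = 2 * N * k * mu * L
    rho = 2 * N * k * r * L
    return theta, rho


# Hypothetical parameter values, for illustration only.
for ploidy in (2, 4, 8):
    theta, rho = scaled_rates(N=10_000, k=ploidy, mu=1e-8, r=1e-8, L=1_000_000)
    print(f"k={ploidy}: theta={theta:.1f}, rho={rho:.1f}")
```

Under these assumptions, doubling the ploidy doubles both rates, which is the only way ploidy affects diversity in this simulation scheme.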
We chose two common metrics based on nucleotide diversity (Tajima’s D and FST) and two haplotype-based metrics (iHS and XP-EHH). We calculated pairwise nucleotide diversity (π), FST, and Tajima’s D in overlapping windows (step size of ½ the full window size) using the R package PopGenome. After experimenting, we found that windows of 1 Mb / (N / 50), where N is population size, captured sufficient polymorphism for robust calculation of summary statistics across all parameters investigated. To ease comparison among metrics, we mean-standardize FST within each replicate and multiply Tajima’s D by -1 (so that all metrics are on a positive scale). We calculated iHS and XP-EHH with the R package rehh, using the parse_ms() function from msr (https://github.com/vsbuffalo/msr) for file format conversion.
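The windowing and rescaling steps above can be sketched in Python as follows. This is not the PopGenome-based code used in the study; the function names are hypothetical, and "mean-standardize" is assumed here to mean dividing each FST value by the replicate mean, which is only one possible reading of the text.

```python
from statistics import mean


def window_size(N, seq_len=1_000_000):
    """Window size of 1 Mb / (N / 50), in base pairs."""
    return seq_len / (N / 50)


def overlapping_windows(seq_len, size):
    """Return (start, end) pairs with a step of half the window size."""
    step = size / 2
    windows = []
    start = 0.0
    while start + size <= seq_len:
        windows.append((start, start + size))
        start += step
    return windows


def mean_standardize(values):
    """ASSUMPTION: standardize by dividing by the replicate mean."""
    m = mean(values)
    return [v / m for v in values]


def flip_sign(values):
    """Multiply by -1 so that a sweep produces a positive peak in Tajima's D."""
    return [-v for v in values]


size = window_size(N=1000)  # hypothetical population size
print(size, len(overlapping_windows(1_000_000, size)))
```

Because the step is half the window size, each site away from the sequence ends falls in two windows.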
We describe our results in terms of the visual manifestation of diversity or other metrics, as such visualizations are routinely utilized in modern genome scan approaches. For diversity, ‘Magnitude’ is calculated as the difference between diversity at the bottom of the dip and baseline levels in a non-selected population, ‘Breadth’ as the distance at which diversity recovers to ½ of baseline levels (divided by 10 kb), and ‘Area’ as ‘Magnitude’ * ‘Breadth’ / 2. For the remaining metrics, we calculate the area under the peak (± 100 kb from the selected site) using the auc() function (R package MESS), scaled by 100,000 for ease of visualization.
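The summary measures described above can be sketched in Python as follows. Several details are assumptions rather than the study's implementation: ‘Breadth’ is taken as the full width between the nearest windows on either side of the dip where diversity is back at or above ½ baseline, the area under the peak uses a plain trapezoid rule over the points inside the ± 100 kb interval (no interpolation at the interval edges), and the final scaling by 100,000 is left out. Function names and example values are hypothetical.

```python
def dip_summary(positions, diversity, baseline, unit=10_000):
    """Return (magnitude, breadth, area) for a dip in windowed diversity."""
    i_min = min(range(len(diversity)), key=diversity.__getitem__)
    magnitude = baseline - diversity[i_min]
    half = baseline / 2
    # Walk outward from the minimum until diversity reaches half of baseline.
    left = i_min
    while left > 0 and diversity[left] < half:
        left -= 1
    right = i_min
    while right < len(diversity) - 1 and diversity[right] < half:
        right += 1
    breadth = (positions[right] - positions[left]) / unit
    area = magnitude * breadth / 2  # triangle approximation of the dip
    return magnitude, breadth, area


def peak_auc(positions, values, centre, half_width=100_000):
    """Trapezoid-rule area for points within half_width of centre."""
    pts = [(x, y) for x, y in zip(positions, values)
           if abs(x - centre) <= half_width]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical window midpoints and values.
pos = [400_000, 450_000, 500_000, 550_000, 600_000]
print(dip_summary(pos, [10, 4, 1, 4, 10], baseline=10))
print(peak_auc(pos, [0, 1, 2, 1, 0], centre=500_000))
```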
Additional details can be found in the supplementary file associated with the manuscript. Specifically, Supplemental Figure 1 illustrates the scenario that we simulated via the following mssel command:
~/.mssel <nsam> <nreps> <nanc> <nder> <trajectory_file> <sel_spot> -r <rho> <length> -t <theta> -I 2 0 <nder> <nanc> 0 -ej <fuseTime> 2 1
Angle brackets denote parameters to be specified. The parameters are:
nsam = Number of sampled alleles/haplotypes; equals n * k, where n equals the number of sampled individuals and k equals the ploidy level.
nreps = Number of replicates. This was set to 1 for all simulations because we only want one simulation per trajectory file; we achieve replication by using multiple independently generated trajectory files.
nanc = Number of haplotypes with the ancestral (non-selected) allele; equals (n * k) / 2
nder = Number of haplotypes with the derived (selected) allele; also equals (n * k) / 2
trajectory_file = path to a text file containing the allele frequency trajectory simulated via the scheme described in the Methods of the manuscript. See also PloidyHitch.R at https://github.com/pmonnahan/PloidySim
sel_spot = position of the selected mutation
rho = population recombination rate; ρ=2NkrL, where L is the number of sites (length)
length = number of sites
theta = population mutation rate; θ=2NkμL
The parameters following the flag, -I, specify the demographic scenario with the form “npop n1anc n1der n2anc n2der”, where,
npop = number of populations (always 2 in our case)
n1anc = number of ancestral alleles sampled from population 1. We set this to 0 because selection has fixed the derived allele in population 1.
n1der = number of derived alleles sampled from population 1. Since the derived allele is fixed in this population, we want to sample entirely derived alleles from this population.
n2anc = number of ancestral alleles sampled from population 2. Since the derived allele arises in population 1 following the split of the ancestral population, the derived allele is not present in population 2, and thus we sample only haplotypes with the ancestral allele from population 2.
n2der = number of derived alleles sampled from population 2. Always set to 0.
The parameters following the flag, -ej, specify the coalescent time (fuseTime in the figure and command; in units of k * N) at which the two populations fused. The “2 1” argument specifies that population 2 fuses into population 1, although this choice is arbitrary and does not affect the results in any way.
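To show how the parameters described above fit together, the following Python sketch assembles the argument list for the mssel command. The function name, the default executable name, and every example value are hypothetical; only the order of the arguments is taken from the command shown earlier.

```python
def mssel_args(n, k, traj_file, sel_spot, rho, length, theta, fuse_time,
               nreps=1, exe="mssel"):
    """Build the mssel argument list for n sampled individuals of ploidy k.

    ASSUMPTION: exe is the path to a local mssel binary; adjust as needed.
    """
    nsam = n * k
    if nsam % 2:
        raise ValueError("n * k must be even to split equally between populations")
    nanc = nder = nsam // 2
    return [
        exe, str(nsam), str(nreps), str(nanc), str(nder),
        traj_file, str(sel_spot),
        "-r", str(rho), str(length),
        "-t", str(theta),
        # npop n1anc n1der n2anc n2der: derived alleles only from
        # population 1, ancestral alleles only from population 2.
        "-I", "2", "0", str(nder), str(nanc), "0",
        # Population 2 fuses into population 1 at fuse_time.
        "-ej", str(fuse_time), "2", "1",
    ]


# Hypothetical example: 10 tetraploid individuals.
args = mssel_args(n=10, k=4, traj_file="traj.txt", sel_spot=500_000,
                  rho=400, length=1_000_000, theta=400, fuse_time=0.5)
print(" ".join(args))
```

The list form can be passed directly to subprocess.run() in Python, which avoids shell quoting issues with file paths.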