Data from: Multilocus approaches for the measurement of selection on correlated genetic loci

Gompert, Zachariah1; Egan, Scott P.2; Barrett, Rowan D. H.3; Feder, Jeffrey L.4 5; Nosil, Patrik6

Published Sep 29, 2016 on Dryad. https://doi.org/10.5061/dryad.ns47q

Data files

Sep 29, 2016 version files 914.44 MB

Abstract

The study of ecological speciation is inherently linked to the study of selection. Methods for estimating phenotypic selection within a generation based on associations between trait values and fitness (e.g. survival) of individuals are established. These methods attempt to disentangle selection acting directly on a trait from indirect selection caused by correlations with other traits via multivariate statistical approaches (i.e. inference of selection gradients). The estimation of selection on genotypic or genomic variation could also benefit from disentangling direct and indirect selection on genetic loci. However, achieving this goal is difficult with genomic data because the number of potentially correlated genetic loci (p) is very large relative to the number of individuals sampled (n). In other words, the number of model parameters exceeds the number of observations (p ≫ n). We present simulations examining the utility of whole-genome regression approaches (i.e. Bayesian sparse linear mixed models) for quantifying direct selection in cases where p ≫ n. Such models have been used for genome-wide association mapping and are common in artificial breeding. Our results show they hold promise for studies of natural selection in the wild and thus of ecological speciation. But we also demonstrate important limitations to the approach and discuss study designs required for more robust inferences.

Null model simulations

This R script was used to simulate random mortality and then infer total selection on genotypes based on allele frequency changes while ignoring the effect of genetic drift.

nullSim.R

Simulated phenotypes based on Timema GBS data

This compressed folder contains the simulated data sets where Timema cristinae GBS data were used as the starting point. It contains three gemma formatted genotype files (geno_sims.txt, geno2500_sims.txt, and geno592r_sims.txt) that contain the underlying genotypic data for the base simulations, those with 2500 individuals and those with replicated genotypes, respectively. The remaining files are named based on the simulations condition (simsN): 1, h2=0.3, L = 100; 2, h2=0.3, L=10; 3, h2=0.05, L=100; 4, h2=0.05, L=10; 5, h2=0.3, L=1000; and 6, h2=0.05, L=1000. Files with 'bin' in their name correspond to results with a binary measure of fitness (e.g., alive or dead). The general file types are as follows: pheno* = gemma formatted phenotype file with one column per simulated phenotypic data set, bv* = breeding values (expected phenotypic values) for each individual based on the genotypic data, effect* = phenotypic effect sizes for causal variants, functvar* = a list of the causal variants (by SNP number, and in the same order as in effect*).

simdata_timema.tar.gz

Simulated phenotypes based on the Rhagoletis GBS data

This compressed folder contains the simulated data sets where Rhagoletis pomonella GBS data were used as the starting point. It contains two gemma formatted genotype files (geno_sims.txt and geno592r_sims.txt) that contain the underlying genotypic data for the base simulations and those with replicated genotypes, respectively. The remaining files are named based on the simulations condition (simsN): 1, h2=0.3, L = 100; 2, h2=0.3, L=10; 3, h2=0.05, L=100; 4, h2=0.05, L=10; 5, h2=0.3, L=1000; and 6, h2=0.05, L=1000. The general file types are as follows: pheno* = gemma formatted phenotype file with one column per simulated phenotypic data set, bv* = breeding values (expected phenotypic values) for each individual based on the genotypic data, effect* = phenotypic effect sizes for causal variants, functvar* = a list of the causal variants (by SNP number, and in the same order as in effect*).

simdata_rhag.tar.gz

Timema simulation scripts

This compressed directory contains four R scripts. The first three sims.R, sims2500.R and sims592r.R were used to simulate the quantitative phenotypes (fitness values) from the Timema GBS data for the original genotypic data, data with 2500 individuals, and with replicated genotypes, respectively. The last script, liability2binary.R, was used to convert quantitative fitness values to binary data.

simscripts_timema.tar.gz

Rhagoletis simulation scripts

This compressed directory contains two R scripts. These scripts (sims.R and sims592r.R) were used to simulate the quantitative phenotypes (fitness values) from the Rhagoletis GBS data for the original genotypic data and data with replicated genotypes, respectively.

simscripts_rhag.tar.gz

Summary scripts

This compressed directory contains the perl scripts used to summarize (1) the posterior distribution of genome-level genetic architecture parameters (calcpost*), and (2) SNP specific selection coefficients (summarizeSNP*pl).

analysis_scripts.tar.gz