
Natural selection drives genome-wide evolution via chance genetic associations

Cite this dataset

Gompert, Zachariah; Feder, Jeff; Nosil, Patrik (2021). Natural selection drives genome-wide evolution via chance genetic associations [Dataset]. Dryad.


Understanding selection's impact on the genome is a major theme in biology. Functionally-neutral genetic regions can be affected indirectly by natural selection, via their statistical association with genes under direct selection. The genomic extent of such indirect selection, particularly across loci not physically linked to those under direct selection, remains poorly understood, as does the time scale at which indirect selection occurs. Here we use field experiments and genomic data in stick insects, deer mice and stickleback fish to show that widespread statistical associations with genes known to affect fitness cause many genetic loci across the genome to be impacted indirectly by selection. This includes regions physically distant from those directly under selection. Then, focusing on the stick insect system, we show that statistical associations between SNPs and other unknown, causal variants result in additional indirect selection in general and specifically within genomic regions of physically linked loci. This widespread indirect selection necessarily makes aspects of evolution more predictable. Thus, natural selection combines with chance genetic associations to affect genome-wide evolution across linked and unlinked loci and even in modest-sized populations. This process has implications for the application of evolutionary principles in basic and applied science.


Whole genome DNA sequence data were previously generated from 491 Timema cristinae stick insects that were part of a release-recapture selection experiment (available from the NCBI SRA under accession PRJNA356801). For the current study, we aligned the whole genome DNA sequence data from each of these 491 T. cristinae to the T. cristinae reference genome (version 1.3) using the bwa mem algorithm (bwa version 0.7.10-r789) with a band width of 100, a 20 bp seed length and a minimum score for output of 30. We then used samtools (version 1.5) to compress, sort and index the alignments, and to remove PCR duplicates. Next, we used the GATK HaplotypeCaller and GenotypeGVCFs modules (version 3.5) to call variants and calculate genotype likelihoods. We required a minimum base quality of 30, set the prior probability of heterozygosity to 0.001, and only called variants with a minimum phred-scaled confidence of 50.
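As a rough illustration of the alignment settings above, the following Python sketch builds a bwa mem argument list. The flags -w (band width), -k (minimum seed length) and -T (minimum score for output) are standard bwa mem options; the function name and the file names are hypothetical, and the actual command lines used for this dataset are not given here.

```python
def bwa_mem_command(reference, reads, band_width=100, seed_length=20, min_score=30):
    """Build a `bwa mem` argument list with the settings described in the text."""
    return [
        "bwa", "mem",
        "-w", str(band_width),   # band width
        "-k", str(seed_length),  # minimum seed length in bp
        "-T", str(min_score),    # minimum alignment score for output
        reference,
        *reads,
    ]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("tcr_reference.fasta", ["sample_001.fastq"])
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.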

The following filters were then applied using custom Perl scripts: a minimum coverage of 1X per individual (i.e., 491X coverage across all individuals), a minimum ratio of variant confidence to non-reference read depth of 2, a minimum mapping quality of 40, a maximum phred-scaled P-value of Fisher's exact test for strand bias of 60, and a minimum minor allele frequency of 0.01. Further, we only retained SNPs mapped to one of the 13 T. cristinae linkage groups. This resulted in 7,243,463 SNPs, which were used in subsequent analyses.
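The filters listed above can be sketched as a single predicate over one variant site. This is a minimal Python illustration, not the custom Perl scripts used for the dataset; the dictionary keys (total_depth, qual_by_depth, and so on) are hypothetical names chosen for this example.

```python
N_INDIVIDUALS = 491


def passes_filters(site, n_individuals=N_INDIVIDUALS):
    """Return True if a variant site meets every threshold described in the text.

    `site` is a dict with hypothetical keys; `alt_freq` is the frequency of the
    non-reference allele and `linkage_group` is None for unplaced scaffolds.
    """
    minor_allele_freq = min(site["alt_freq"], 1.0 - site["alt_freq"])
    return (
        site["total_depth"] >= 1 * n_individuals   # 1X per individual on average
        and site["qual_by_depth"] >= 2.0           # variant confidence / non-reference depth
        and site["mapping_quality"] >= 40.0
        and site["strand_bias_phred"] <= 60.0      # Fisher's exact test, phred-scaled
        and minor_allele_freq >= 0.01
        and site["linkage_group"] in range(1, 14)  # linkage groups 1 to 13
    )


# Made-up example record, for illustration only.
example_site = {
    "total_depth": 900,
    "qual_by_depth": 12.5,
    "mapping_quality": 55.0,
    "strand_bias_phred": 3.2,
    "alt_freq": 0.2,
    "linkage_group": 8,
}
print(passes_filters(example_site))
```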

Next, we obtained maximum likelihood estimates of allele frequencies for all experimental samples using an expectation-maximization (EM) algorithm as implemented in estpEM (version 0.1). For this, we used a convergence tolerance of 0.001 and allowed for a maximum of 30 EM iterations. We then used these allele frequency estimates and the genotype likelihoods from GATK to calculate empirical Bayesian genotype estimates. These point estimates range from zero to two, and are not constrained to be integer values.
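The two steps above can be sketched in plain Python. This is a minimal illustration, not the estpEM implementation: it assumes genotype likelihoods are supplied on the natural (not phred) scale, that the prior on genotypes follows Hardy-Weinberg proportions given the allele frequency, and that the EM starts at 0.5. The function names are hypothetical; only the tolerance of 0.001 and the limit of 30 iterations come from the text.

```python
def hwe_prior(p):
    """Genotype prior for 0, 1, 2 non-reference alleles (Hardy-Weinberg assumption)."""
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)


def posterior_mean_genotype(lik, p):
    """Posterior mean number of non-reference alleles for one individual."""
    post = [l * pr for l, pr in zip(lik, hwe_prior(p))]
    return (post[1] + 2 * post[2]) / sum(post)


def em_allele_frequency(likelihoods, tol=0.001, max_iter=30, start=0.5):
    """Maximum likelihood allele frequency from per-individual genotype likelihoods."""
    p = start
    for _ in range(max_iter):
        # E-step: expected allele count per individual; M-step: average over 2N copies.
        expected = sum(posterior_mean_genotype(lik, p) for lik in likelihoods)
        p_new = expected / (2 * len(likelihoods))
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p


def genotype_estimates(likelihoods, p):
    """Empirical Bayesian point estimates, each a real number between 0 and 2."""
    return [posterior_mean_genotype(lik, p) for lik in likelihoods]


# Made-up likelihoods (L0, L1, L2) for four individuals at one SNP.
toy_likelihoods = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
p_hat = em_allele_frequency(toy_likelihoods)
print(p_hat, genotype_estimates(toy_likelihoods, p_hat))
```

Because each estimate is a posterior mean rather than a called genotype, an individual with ambiguous likelihoods receives a fractional value, which is why the point estimates are not constrained to integers.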

Thus, this data set includes the genotype estimates for the 491 individuals as well as the data on survival, i.e., whether or not they were re-captured at the end of the experiment.

Usage notes

pntest_LG_*_mod_filtered1X_tcrExperimentVariants.txt.gz (* denotes 1, 2, ..., 13)

These text files contain the genotype estimates (Bayesian point estimates of the number of non-reference alleles). There is one file per chromosome (linkage group, numbered 1 to 13). Each contains one row per SNP locus and one column per individual.
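A minimal Python sketch for reading one of these genotype files follows. It assumes the values are whitespace-delimited with no header row and no leading identifier column, which should be checked against the actual files; the function name is hypothetical.

```python
import gzip


def read_genotypes(path):
    """Read a gzipped genotype file into a list of rows.

    Each row corresponds to one SNP locus and holds one float per individual.
    Assumes whitespace-delimited numeric fields and no header line.
    """
    rows = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(value) for value in fields])
    return rows


# Example call (requires the file from this dataset in the working directory):
# genotypes = read_genotypes("pntest_LG_1_mod_filtered1X_tcrExperimentVariants.txt.gz")
# len(genotypes) is the number of SNPs; len(genotypes[0]) should equal 491.
```

With millions of SNPs in total, loading a whole linkage group into nested Python lists may be memory-hungry; iterating over the file line by line is a reasonable alternative.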


These text files provide information about the genomic location of each SNP in the genotype files described above. There is one file per chromosome (linkage group, numbered 1 to 13). Each contains one row per SNP locus with the scaffold number (1st column), linkage group number (2nd column), 3 estimates of map position (in cM, columns 3-5), and the position in base pairs.


This text file contains one row per individual with 1 or 0 denoting whether the stick insect survived (1) or died (0).


This text file contains one row per individual stick insect with the first column denoting the block number and the second column denoting the host plant (A = Adenostoma, C = Ceanothus).
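As one way to combine the survival and host plant information, the sketch below computes the proportion of survivors per host. It assumes the rows of the two files list individuals in the same order, which is not stated above and should be verified; the function name and the toy values are made up for illustration.

```python
from collections import defaultdict


def survival_by_host(survival, hosts):
    """Proportion of survivors per host plant.

    `survival` holds 1 (survived) or 0 (died) per individual; `hosts` holds
    "A" (Adenostoma) or "C" (Ceanothus) in the same individual order.
    """
    if len(survival) != len(hosts):
        raise ValueError("survival and hosts must have one entry per individual")
    counts = defaultdict(lambda: [0, 0])  # host -> [survivors, total]
    for alive, host in zip(survival, hosts):
        counts[host][0] += alive
        counts[host][1] += 1
    return {host: survivors / total for host, (survivors, total) in counts.items()}


# Made-up values for four individuals, not taken from the dataset.
toy_rates = survival_by_host([1, 0, 1, 1], ["A", "A", "C", "C"])
print(toy_rates)
```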


National Science Foundation, Award: DEB 1844941

European Research Council, Award: EE-Dynamics 770826