
Genomic time-series data show that gene flow maintains high genetic diversity despite substantial genetic drift in a butterfly species

Cite this dataset

Gompert, Zachariah et al. (2021). Genomic time-series data show that gene flow maintains high genetic diversity despite substantial genetic drift in a butterfly species [Dataset]. Dryad.


Effective population size affects the efficacy of selection, rate of evolution by drift, and neutral diversity levels. When species are subdivided into multiple populations connected by gene flow, evolutionary processes can depend on global or local effective population sizes. Theory predicts that high levels of diversity might be maintained by gene flow, even very low levels of gene flow, consistent with a species' long-term effective population size, but tests of this idea are mostly lacking. Here, we show that Lycaeides butterfly populations maintain low contemporary (variance) effective population sizes (e.g., ~200 individuals) and thus evolve rapidly by genetic drift. In contrast, populations harbored high levels of genetic diversity consistent with an effective population size several orders of magnitude larger. We hypothesized that the differences in the magnitude and variability of contemporary versus long-term effective population sizes were caused by gene flow of sufficient magnitude to maintain diversity but only subtly affect evolution on generational time scales. Consistent with this hypothesis, we detected low but non-trivial gene flow among populations. Furthermore, using short-term population-genomic time-series data, we documented patterns consistent with predictions from this hypothesis, including a weak but detectable excess of evolutionary change in the direction of the mean (migrant gene pool) allele frequencies across populations, and consistency in the direction of allele frequency change over time. The documented decoupling of diversity levels and short-term change by drift in Lycaeides has implications for our understanding of contemporary evolution and the maintenance of genetic variation in the wild.


A genotyping-by-sequencing approach was used to generate the DNA sequence data. The reads were then aligned to the Lycaeides melissa genome. Variant (SNP) calling was done using samtools and bcftools. Variant calling followed by quality filtering produced the included VCF file. Genotypes were then inferred from genotype likelihoods using the (ad)mixture model in Entropy. Allele frequencies for each population and year were estimated using an expectation-maximization algorithm.
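The last step above can be illustrated with a minimal sketch of an expectation-maximization estimate of allele frequency from genotype likelihoods. This is a generic textbook EM that assumes Hardy-Weinberg genotype proportions; it is not necessarily the implementation used to produce the files in this dataset. The function name `em_allele_frequency` and the input layout (one row per individual, likelihoods for 0, 1, and 2 non-reference copies) are assumptions made for illustration.

```python
import numpy as np


def em_allele_frequency(gl, tol=1e-8, max_iter=200):
    """Estimate the non-reference allele frequency at a single SNP.

    gl: array-like of shape (n_individuals, 3) holding genotype likelihoods
        (plain probabilities, not logs or Phred scores) for 0, 1, and 2
        copies of the non-reference allele. This layout is an assumption.
    """
    gl = np.asarray(gl, dtype=float)
    copies = np.array([0.0, 1.0, 2.0])
    p = 0.5  # arbitrary starting value
    for _ in range(max_iter):
        # E-step: posterior genotype probabilities with Hardy-Weinberg priors.
        prior = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
        post = gl * prior
        post /= post.sum(axis=1, keepdims=True)
        # M-step: expected non-reference allele count divided by 2N.
        p_new = (post @ copies).sum() / (2 * gl.shape[0])
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p
```

Applied to the likelihoods of all individuals from one population and year, the function returns one frequency per SNP when called in a loop over SNPs.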

Usage notes

This data set contains the following files.

G_SAM_sub.txt = A comma-delimited text file with the genotype estimates from Entropy. There is one row per individual (1536) and one column per SNP (12886). These are non-integer Bayesian point estimates of the number of non-reference allele copies.

indIds.txt = IDs for the individuals in G_SAM_sub.txt. These are in the same order as the individuals in the genotype file. Each row gives an individual's population ID followed by the collection year.
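A minimal sketch of how the genotype matrix and the ID file might be read together is given here. It assumes that neither file has a header row and that the population ID and year in indIds.txt are separated by whitespace; neither detail is stated above, so check the files before relying on it. The function names are hypothetical.

```python
import numpy as np


def load_genotypes(geno_path, ids_path):
    """Read the comma-delimited genotype matrix and its row labels."""
    # Assumption: no header row; rows are individuals, columns are SNPs.
    G = np.loadtxt(geno_path, delimiter=",", ndmin=2)
    with open(ids_path) as fh:
        # Assumption: "POP YEAR" separated by whitespace on each row.
        labels = [tuple(line.split()[:2]) for line in fh if line.strip()]
    if G.shape[0] != len(labels):
        raise ValueError("genotype rows and ID rows do not match")
    return G, labels


def mean_allele_frequency_by_group(G, labels):
    """Average genotype estimates within each (population, year) group.

    Genotypes count non-reference allele copies (0 to 2), so the mean
    divided by 2 is a simple per-SNP allele frequency estimate.
    """
    groups = {}
    for row, label in enumerate(labels):
        groups.setdefault(label, []).append(row)
    return {label: G[rows].mean(axis=0) / 2.0 for label, rows in groups.items()}
```

For the files described above, `load_genotypes("G_SAM_sub.txt", "indIds.txt")` should return a matrix with 1536 rows and 12886 columns if the assumptions hold.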

p_combined_SAM_sub.txt = A text file with allele frequency estimates. There is one row per SNP (12886) and one column for each population-by-year sample (33 columns). The columns are in alphanumeric order: BCR 2013, BCR 2015, BCR 2017, BNP 2013, BNP 2015, BNP 2017, BTB 2013, BTB 2014, BTB 2015, BTB 2017, GNP 2013, GNP 2015, GNP 2017, HNV 2013, HNV 2014, HNV 2015, HNV 2017, MRF 2013, MRF 2015, MRF 2017, PSP 2013, PSP 2015, PSP 2017, RNV 2013, RNV 2014, RNV 2017, SKI 2013, SKI 2014, SKI 2015, SKI 2017, USL 2013, USL 2015, USL 2017.
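Because the allele frequency file carries no column labels, it can help to rebuild them from the order listed above. The sketch below does that and computes the per-SNP change in allele frequency between two sampling years for one population. The field delimiter is not stated above, so the loader takes it as a parameter (whitespace by default); that default and the function names are assumptions.

```python
import numpy as np

# Sampling years per population, copied from the column order listed above.
SAMPLES = {
    "BCR": [2013, 2015, 2017],
    "BNP": [2013, 2015, 2017],
    "BTB": [2013, 2014, 2015, 2017],
    "GNP": [2013, 2015, 2017],
    "HNV": [2013, 2014, 2015, 2017],
    "MRF": [2013, 2015, 2017],
    "PSP": [2013, 2015, 2017],
    "RNV": [2013, 2014, 2017],
    "SKI": [2013, 2014, 2015, 2017],
    "USL": [2013, 2015, 2017],
}
COLUMNS = [(pop, year) for pop in sorted(SAMPLES) for year in SAMPLES[pop]]


def load_frequencies(path, delimiter=None):
    """Read the SNP-by-sample matrix; delimiter=None means whitespace."""
    P = np.loadtxt(path, delimiter=delimiter, ndmin=2)
    if P.shape[1] != len(COLUMNS):
        raise ValueError("unexpected number of columns")
    return P


def frequency_change(P, pop, start=2013, end=2017):
    """Per-SNP allele frequency change for one population between two years."""
    i = COLUMNS.index((pop, start))
    j = COLUMNS.index((pop, end))
    return P[:, j] - P[:, i]
```

For example, `frequency_change(P, "BTB", 2014, 2015)` gives the change over one interval in a population that was sampled in consecutive years.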

morefilter_filtered2x_lyc_timeseries_samtbcft_vcf.gz = The filtered, compressed VCF (variant) file from samtools/bcftools. It includes the full set of SNPs from samtools/bcftools, not just those also called by GATK.
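As a quick sanity check on the compressed VCF, the sketch below reads it with Python's standard gzip module and reports the sample names and the number of variant records. It relies only on the general VCF layout (meta lines start with `##`, the column header starts with `#CHROM`, and sample names begin at the tenth tab-separated column); the function name is hypothetical, and nothing here is specific to how this particular file was produced.

```python
import gzip


def summarize_vcf(path):
    """Return (sample_names, n_records) for a gzip-compressed VCF file."""
    samples = []
    n_records = 0
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed fields; the rest are sample names.
                samples = line.rstrip("\n").split("\t")[9:]
                continue
            if line.strip():
                n_records += 1  # one variant record per data line
    return samples, n_records
```

Dedicated VCF tools such as bcftools are better suited to filtering or subsetting the file.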