A combination of selective and neutral evolutionary forces shapes patterns of genetic diversity in nature. Among the insects, most previous analyses of the roles of drift and selection in shaping variation across the genome have focused on the genus Drosophila. A more complete understanding of these forces will come from analysing other taxa that differ in population demography and other aspects of biology. We have analysed diversity and signatures of selection in the neotropical Heliconius butterflies using resequenced genomes from 58 wild-caught individuals of H. melpomene, and another 21 resequenced genomes representing 11 related species. By comparing intra-specific diversity and inter-specific divergence, we estimate that 31% of amino acid substitutions between Heliconius species are adaptive. Diversity at putatively neutral sites is negatively correlated with the local density of coding sites as well as non-synonymous substitutions, and positively correlated with recombination rate, indicating widespread linked selection. This process also manifests in significantly reduced diversity on longer chromosomes, consistent with lower recombination rates. Although hitchhiking around beneficial non-synonymous mutations has significantly shaped genetic variation in H. melpomene, evidence for strong selective sweeps is limited overall. We did, however, identify two regions where distinct haplotypes have swept in different populations, leading to increased population differentiation. On the whole, our study suggests that positive selection is less pervasive in these butterflies than in fruit flies, a fact that curiously results in very similar levels of neutral diversity in these very different insects.
genotyping_summaries
[4 files] Genotyping summaries (numbers of genotyped sites, heterozygosity etc.) for all samples. Summaries are provided for all sites, and also for codon positions 1, 2 and 3.
structure results
[4 files] Results from STRUCTURE analyses, with k = 5, 6, 7 and 8.
structure.tar.gz
PCA results
Results of Eigenstrat PCA analysis.
mel58.Zupdated.realign80.GQ30.4D.min50.minVar5.eigenstrat.PCA.evec
dxy (absolute divergence) for 100kb windows
[8 files] Absolute divergence between all pairs of taxa for 100kb windows. Eight types of sites were considered: All sites, intergenic, intronic, codon positions 1, 2, and 3, 4D sites, and 4D sites in low-codon-usage-bias genes.
dxy.tar.gz
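As a reminder of what absolute divergence measures, the following is a minimal Python sketch, not the pipeline used to produce these files. It assumes phased sequences of equal length with no missing data; the function name `dxy` and the toy sequences are hypothetical and do not come from the dataset.

```python
from itertools import product


def dxy(pop_x, pop_y):
    """Mean number of differences per site, averaged over every pair of
    sequences that takes one sequence from each population."""
    length = len(pop_x[0])
    total_differences = 0
    n_pairs = 0
    for seq_a, seq_b in product(pop_x, pop_y):
        # Count mismatching positions for this cross-population pair.
        total_differences += sum(1 for a, b in zip(seq_a, seq_b) if a != b)
        n_pairs += 1
    return total_differences / (n_pairs * length)


# Toy sequences (invented for illustration only).
toy_pop_1 = ["ACGTAC", "ACGTAT"]
toy_pop_2 = ["ACGAAC", "TCGAAC"]
print(dxy(toy_pop_1, toy_pop_2))
```

In a windowed analysis the same calculation would be repeated for the sites falling in each 100kb window, with some rule for handling missing genotypes.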
Heterozygosity in windows
[2 files] Heterozygosity (here called "indPi") calculated for non-overlapping 100kb windows. One file corresponds to all samples and the other to selected Panamanian samples: the inbred reference strain and two outbred individuals.
heterozygosity.tar.gz
Linkage Disequilibrium
[3 files] Linkage disequilibrium data for the Eastern and Western populations. "Background LD" refers to that between unlinked SNPs on different chromosomes. "top100" refers to LD calculated for all SNP pairs on the same scaffold, averaged over the top 100 scaffolds.
LD.tar.gz
Codon Usage
Effective number of codons and GC content at the third codon position for each gene.
final.codonW_with_positions.csv.gz
Alpha
Estimates of alpha, the genome-wide proportion of adaptive substitutions, at different minor allele frequency thresholds.
alpha_by_threshold.csv
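For orientation, a common McDonald-Kreitman-style estimator is alpha = 1 - (Ds * Pn) / (Dn * Ps), where Dn and Ds are non-synonymous and synonymous substitution counts and Pn and Ps are the corresponding polymorphism counts. The Python sketch below shows how alpha could be recomputed after discarding polymorphisms below a minor allele frequency threshold. It is an illustration under stated assumptions, not necessarily the exact procedure behind this file: the function names and all numbers are hypothetical, and the column layout of alpha_by_threshold.csv is not assumed.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimate alpha as 1 - (Ds * Pn) / (Dn * Ps)."""
    return 1.0 - (ds * pn) / (dn * ps)


def count_at_or_above(minor_allele_freqs, threshold):
    """Number of polymorphisms whose minor allele frequency passes the threshold."""
    return sum(1 for freq in minor_allele_freqs if freq >= threshold)


# Toy inputs (invented for illustration only).
toy_syn_mafs = [0.05, 0.10, 0.25, 0.40, 0.50]
toy_nonsyn_mafs = [0.02, 0.05, 0.30]
toy_dn, toy_ds = 40, 100

for threshold in (0.0, 0.1, 0.2):
    pn = count_at_or_above(toy_nonsyn_mafs, threshold)
    ps = count_at_or_above(toy_syn_mafs, threshold)
    print(threshold, mk_alpha(toy_dn, toy_ds, pn, ps))
```

Raising the threshold removes low-frequency polymorphisms, which are the ones most likely to include slightly deleterious non-synonymous variants that bias alpha downwards.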
PSMC results
[12 files] PSMC results for twelve selected samples.
psmc.tar.gz
SweeD Results
[42 files] SweeD output for each of the 21 chromosomes, run for the Eastern and Western populations separately.
SweeD.tar.gz
Analyses of chromosomes 11 and 12
[2 files] Statistics calculated in 50kb sliding windows for chromosomes 11 and 12. Nucleotide diversity, absolute divergence and Kst.
chr11_chr12.tar.gz
Genotypes 4D sites
Genotype calls for all individuals at all fourfold degenerate (4D) sites.
set80.Zupdated.realign80.GQ30.ALLSITES.4D.geno.gz
Genotypes Codon Pos 1
Genotype calls for all individuals at first codon positions.
set80.Zupdated.realign80.GQ30.CODON1.ALLSITES.geno.gz
Genotypes Codon Pos 2
Genotype calls for all individuals at second codon positions.
set80.Zupdated.realign80.GQ30.CODON2.ALLSITES.geno.gz
Genotypes Codon Pos 3
Genotype calls for all individuals at third codon positions.
set80.Zupdated.realign80.GQ30.CODON3.ALLSITES.geno.gz
Genotypes Intronic
Genotype calls for all individuals at intronic sites.
set80.Zupdated.realign80.GQ30.INTRON.ALLSITES.geno.gz
Genotypes 4D lowCUB
Genotypes for all individuals at 4D sites in genes showing minimal codon usage bias.
set80.Zupdated.realign80.GQ30.4D.lowCUB.geno.gz
Genotypes All Sites Part 1 of 4
Genotype calls for all individuals at all sites. Part 1 of 4.
set80.Zupdated.realign80.GQ30.ALLSITES.geno.part1.gz
Genotypes All Sites Part 2 of 4
Genotype calls for all individuals at all sites. Part 2 of 4.
set80.Zupdated.realign80.GQ30.ALLSITES.geno.part2.gz
Genotypes All Sites Part 3 of 4
Genotype calls for all individuals at all sites. Part 3 of 4.
set80.Zupdated.realign80.GQ30.ALLSITES.geno.part3.gz
Genotypes All Sites Part 4 of 4
Genotype calls for all individuals at all sites. Part 4 of 4.
set80.Zupdated.realign80.GQ30.ALLSITES.geno.part4.gz
Genotypes Intergenic Part 1 of 3
Genotype calls for all individuals at intergenic sites. Part 1 of 3.
set80.Zupdated.realign80.GQ30.INTERGENIC.ALLSITES.geno.part1.gz
Genotypes Intergenic Part 2 of 3
Genotype calls for all individuals at intergenic sites. Part 2 of 3.
set80.Zupdated.realign80.GQ30.INTERGENIC.ALLSITES.geno.part2.gz
Genotypes Intergenic Part 3 of 3
Genotype calls for all individuals at intergenic sites. Part 3 of 3.
set80.Zupdated.realign80.GQ30.INTERGENIC.ALLSITES.geno.part3.gz
asymptotic_alpha
[5 files] Four files give site frequency spectra for each gene separately, sampling down to 16 individuals from the Western population. The four files correspond to i) Autosomal genes, synonymous SNPs, ii) Autosomal genes, non-synonymous SNPs, iii) Z chromosome genes, synonymous SNPs, iv) Z chromosome genes, non-synonymous SNPs. The R script was used to calculate alpha and produce the plot in Fig. 4. See the text for further details.
sfs_v2
[18 files] Site frequency spectra for autosomal scaffolds considering different site classes: 4D sites, introns and intergenic sites. Each site was downsampled to 5 individuals, either considering all samples, or only those from a similar locality ('close'). Spectra for the Western and Eastern populations, downsampled to 20 individuals per site, are also given. See the text for further details.
multiple_regression_v2
[14 files] Data and R code used for multiple regression analysis of neutral diversity. The raw unprocessed data are given in the files "set80.Zupdated.realign80.GQ30.4D.lowCUB.autoScafs.PiDxyGC.w100m150s100.csv" (pi, dxy and GC content for 100 kb windows); "set80.Zupdated.realign80.GQ30.4D.autoScafs.PiDxyGC.w100m250s100.csv" (the same but only for low-CUB genes); "mel1_hec1_wal1_hecu1_era1.realign80.GQ30.ALLSITES.cons.codeml_nW.w100m500.csv" (PAML analysis results, where branch 8..1 refers to the branch leading to H. melpomene); and "MK_gene_a.csv" (gene-by-gene alpha estimates). The "model data" files give the processed data that is imported by the R scripts to run the model, as described in the text.
diversity_around_substitutions.tar
[5 files] Nucleotide diversity (pi) and divergence (dx) for each 4D site. The "distToNearest" files give, for each 4D site, the distance to the nearest substitution, either synonymous or nonsynonymous. For synonymous substitutions, there are 100 bootstrapped distances, obtained by subsampling synonymous substitutions. There are distance files for two groups of substitutions, identified using different outgroups, as described in the text.
simulated_error_rates
[20 files] Counts of pairs of actual and inferred genotype patterns for simulated data at different percentage divergences and sequence coverage depths. For example, "counts_01_00" refers to the number of sites at which the actual genotype was 0/1 and the inferred genotype was 0/0.
simulated_error.tar.gz
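The "counts_XX_YY" naming convention described above can be decoded programmatically. The following Python sketch assumes the labels are available as keys mapped to integer counts; the internal layout of the files is not assumed, and the function names and toy counts are hypothetical.

```python
def parse_label(label):
    """Split a label such as 'counts_01_00' into (actual, inferred) genotypes."""
    prefix, actual, inferred = label.split("_")
    if prefix != "counts":
        raise ValueError(f"unexpected label: {label}")
    # '01' becomes '0/1': the two characters are the two alleles.
    return "/".join(actual), "/".join(inferred)


def genotype_error_rate(counts):
    """Fraction of sites where the inferred genotype differs from the actual one."""
    total = sum(counts.values())
    wrong = 0
    for label, n_sites in counts.items():
        actual, inferred = parse_label(label)
        if actual != inferred:
            wrong += n_sites
    return wrong / total


# Toy counts (invented for illustration only).
toy_counts = {"counts_00_00": 90, "counts_01_00": 5, "counts_01_01": 5}
print(parse_label("counts_01_00"))
print(genotype_error_rate(toy_counts))
```

Computing the rate separately for each divergence and coverage combination would show how genotyping error changes across the simulated conditions.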
Nucleotide diversity for 100 kb windows V2
[17 files] Nucleotide diversity for each population calculated in 100kb windows. Seven different types of sites were considered: all sites, intergenic, intron, codon positions 1, 2 and 3, and four-fold degenerate (4D) sites. Dxy values between populations are also given, but these are based on just a single pair of samples and may be less accurate than those given in the dxy data. There are 8 files giving nucleotide diversity statistics considering only samples with at least 25x coverage, which should be more reliable. These correspond to the same seven site types listed above, plus one file for 4D sites only in genes showing minimal codon usage bias (CUB). There are also two files giving nucleotide diversity at intergenic sites and 3rd codon positions that were used to test whether the proportion of missing data in H. melpomene samples was correlated with nucleotide diversity.
pi_v2.tar.gz