Most speciation events probably occur gradually, without complete and immediate reproductive isolation, but the full extent of gene flow between diverging species has rarely been characterized on a genome-wide scale. Documenting the extent and timing of admixture between diverging species can clarify the role of geographic isolation in speciation. Here we use new methodology to quantify admixture at different stages of divergence in Heliconius butterflies, based on whole genome sequences of 31 individuals. Comparisons between sympatric and allopatric populations of H. melpomene, H. cydno and H. timareta revealed a genome-wide trend of increased shared variation in sympatry, indicative of pervasive interspecific gene flow. Up to 40% of 100 kb genomic windows clustered by geography rather than by species, demonstrating that a very substantial fraction of the genome has been shared between sympatric species. Analyses of genetic variation shared over different time intervals suggested that admixture between these species has continued since early in speciation. Alleles shared between species during recent time intervals displayed higher levels of linkage disequilibrium than those shared over longer time intervals, suggesting that this admixture took place at multiple points during divergence and is probably ongoing. The signal of admixture was significantly reduced around loci controlling divergent wing patterns, as well as throughout the Z chromosome, consistent with strong selection for Müllerian mimicry and with known Z-linked hybrid incompatibility. Overall these results show that species divergence can occur in the face of persistent and genome-wide admixture over long periods of time.
set31.Zupdated.autoANDchrZ.union.SNP.Fst.w100m2.5s20.csv
Pairwise Fst values for 100 kb sliding windows, sliding in increments of 20 kb. Fst for each pair of populations, for each window was calculated using the EGGLIB Python library.
PANAMA.ALLSHARED.w100.rerooted
Maximum likelihood trees for all non-overlapping 100 kb windows for the cydno-melpomene data set. Genomic locations of trees, and the designated topologies are given by the file PANAMA.ALLSHARED.w100.topology.csv.
PERU.ALLSHARED.w100.rerooted
Maximum likelihood trees for all non-overlapping 100 kb windows for the timareta-melpomene data set. Genomic locations of trees, and the designated topologies are given by the file PERU.ALLSHARED.w100.topology.csv.
PANAMA.ALLSHARED.w100.topology
Genomic locations of non-overlapping 100 kb windows and topologies supported by the corresponding maximum likelihood tree in the cydno-melpomene dataset. The full newick trees are given in the file PANAMA.ALLSHARED.w100.nwk.
PERU.ALLSHARED.w100.topology
Genomic locations of non-overlapping 100 kb windows and topologies supported by the corresponding maximum likelihood tree in the timareta-melpomene dataset. The full newick trees are given in the file PERU.ALLSHARED.w100.nwk.
set31.Zupdated.autoANDchrZ.ALLSHARED.SNP.derFreqs.csv
Derived allele frequencies for each population sample at each variable site for which there was a high-quality genotype call for all 31 individuals, and the outgroup species were fixed for the ancestral state. Columns with the suffixes "_1" and "_2" give the allele frequencies for sample subsets of two individuals, used for calculating the denominator of the fraction introgression f statistic. Values were calculated using the script freq.py. Values for Z and autosomes were calculated separately due to ploidy differences between the sexes, and then the outputs were combined.
Hmel1-1_Scaff-Chrom_Zcorrected_20120730
Chromosomal assignments of scaffolds and scaffold fragments. This file builds upon the chromosomal assignments provided by Dasmahapatra et al. 2012 doi:10.1038/nature11041, based on our read-depth analysis described in the supplementary information. Each scaffold is assigned to one or more scaffold "fragments", which are each assigned to a chromosome. Most scaffolds correspond to a single fragment that spans the length of the scaffold, but those that were found to be chimeric were split into 2 or 3 fragments, each of which then received a chromosomal assignment.
Hmel1-1_hox_RAD_matepair_chromosomes_Zupdated
File providing the chromosomal assignment and scaffolds mapped to chromosomes in the Hmel1-1 assembly. This version has been modified to accommodate the new assignments of Z-linked scaffolds and those that that have been broken and re-named based on the read-depth analysis described in the supplementary information. See the file Hmel1-1_Scaff-Chrom_Zcorrected_20120730.txt submitted herewith for a summary of this information. Newly-assigned Z-linked scaffolds have been labled chrZ_unmapped, as their chromosomal position has not been mapped.
ABBA_BABA_tests
This is a necessary input for the script auto_ABBA_BABA_jackknife.r. It defines the populations that will form p1 (non-recipient), p2 (recipient) and p3 (donor) in each test.
freq.py
Script to calculate allele frequencies from a filtered VCF (.geno) file. Open in a text editor to see instructions on how to run it.
auto_ABBA_BABA_jackknife.r
Code for running multiple ABBA BABA tests to calculate D an f statistics, using block-jackknifing to estimate variance.
four_pop_test.r
Code to run the four-population test for admixture, with block-jackknifing to estimate variance.
LD_with_snpStats.r
Code to estimate linkage disequilibrium between pairs of sites.
filterTrees.py
Code used to filter newick-format trees based on their topology.
set31.Zupdated.union.geno.part_1_of_2
Filtered VCF or "calls" format file. Columns 1 and 2 are scaffold and position and columns 3 onwards give genotypes for each individual. Ambiguous bases are given for heterozygous sites. Only calls that passed the quality filtering described in the manuscript are given. Missing data or low-quality calls are given as Ns. This is the first half of a file that was split into two for uploading purposes.
set31.Zupdated.union.geno.part_2_of_2
This is the second part of a two part file. Please see the description for set31.Zupdated.union.geno.part_1_of_2.
set31.Zupdated.autoAndZ.ALLSHARED.PiDiv.w100m10s20.csv
Nucleotide diversity for each population and divergence (dxy) between each population pair for 100 kb sliding windows, sliding in increments of 20 kb.