Data from: Population genomic evidence of selection on structural variants in a natural hybrid zone
Data files
Apr 14, 2022 version files 2.02 GB
- Lmel_dovetailPacBio_genome.fasta
- merged_SURVIVOR_1kbpdist_typesave_new.vcf
- README
Abstract
Structural variants (SVs) can promote speciation by directly causing reproductive isolation or by suppressing recombination across large genomic regions. Whereas examples of each mechanism have been documented, systematic tests of the role of SVs in speciation are lacking. Here, we take advantage of long-read (Oxford nanopore) whole-genome sequencing and a hybrid zone between two Lycaeides butterfly taxa (L. melissa and Jackson Hole Lycaeides) to comprehensively evaluate genome-wide patterns of introgression for SVs and relate these patterns to hypotheses about speciation. We found >100,000 SVs segregating within or between the two hybridizing species. SVs and SNPs exhibited similar levels of genetic differentiation between species, with the exception of inversions, which were more differentiated. We detected credible variation in patterns of introgression among SV loci in the hybrid zone, with 562 of 1419 ancestry-informative SVs exhibiting genomic clines that deviated from null expectations based on genome-average ancestry. Overall, hybrids exhibited a directional shift towards Jackson Hole Lycaeides ancestry at SV loci, consistent with the hypothesis that these loci experienced more selection on average than SNP loci. Surprisingly, we found that deletions, rather than inversions, showed the highest skew towards excess ancestry from Jackson Hole Lycaeides. Excess Jackson Hole Lycaeides ancestry in hybrids was also especially pronounced for Z-linked SVs and inversions containing many genes. In conclusion, our results show that SVs are ubiquitous and suggest that SVs in general, but especially deletions, might disproportionately affect hybrid fitness and thus contribute to reproductive isolation.
Methods
This data set includes a new whole-genome reference sequence for Lycaeides melissa. We used PacBio SMRT sequencing to improve our existing L. melissa reference genome, which was assembled from Chicago and Hi-C genomic libraries and comprised 24 large scaffolds corresponding to the 23 L. melissa linkage groups (22 autosomes and the Z sex chromosome, with all but one chromosome represented by a single scaffold). High molecular weight DNA was extracted from a single female L. melissa. Following quality control, a HiFi library was constructed from purified DNA and sequenced on a PacBio Sequel II with a 30-hour HiFi SMRT cell. This generated 978,737 (>Q20) reads with a mean length of 18,683 bp (total length = 18,286,022,199 >Q20 bp). The BYU DNA Sequencing Center bioinformatics team then used these data to improve our existing genome by extending it and replacing Ns with known bases. Gaps in the assembly from Dovetail Genomics were filled with the PacBio HiFi reads using TGS-GapCloser v1.0.1. Given that the reads were Q20 or higher, the "--ne" (i.e., skip error correction) option was used. Other than required parameters (e.g., to provide the input reads and assembly), the only other non-default parameter was "--tgstype pb" to specify the read type. Our input reference genome comprised 362 Mbp with 56% Ns, whereas after the gap-filling process our new reference genome spanned 521 Mbp with only 6.8% Ns.
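The assembly length and percentage of Ns reported above can be checked directly from a FASTA file. The following is a minimal sketch, not the procedure used by the authors; the function name `fasta_stats` and the toy sequences are illustrative assumptions.

```python
import io

def fasta_stats(handle):
    """Return (total_bases, n_bases, n_sequences) for a FASTA text stream.

    Hypothetical helper for illustration; counts both 'N' and 'n' as gaps.
    """
    total = n_count = n_seqs = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start of a new scaffold.
            n_seqs += 1
            continue
        total += len(line)
        n_count += line.upper().count("N")
    return total, n_count, n_seqs

# Toy input standing in for a real assembly.
demo = io.StringIO(">scaf1\nACGTNNNN\nACGT\n>scaf2\nnnAC\n")
total, n_count, n_seqs = fasta_stats(demo)
print(f"{n_seqs} scaffolds, {total} bp, {100 * n_count / total:.1f}% Ns")

# For the real file, one would pass an open file handle instead, e.g.:
# with open("Lmel_dovetailPacBio_genome.fasta") as fh:
#     print(fasta_stats(fh))
```

Reading line by line keeps memory use small, which matters for an assembly of several hundred Mbp.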
The data set also includes a VCF file with structural variant calls for Lycaeides. These variants were identified from Oxford Nanopore DNA sequence data generated with a MinION. Genomic libraries for 37 butterflies were loaded onto R9.4 flow cells (FLO-MIN106, ONT) and sequenced in full 72-hour runs on a MinION. This generated an average of 7 GB of sequence data per library, with N50 read lengths ranging from 1114 to 20,944 bp (mean read length = 9073 bp; maximum read length = 342,840 bp). Base calling was performed using the ont-guppy algorithm from guppy_basecaller (version 4.2.2). We then used guppy_barcoder (version 4.2.2) to demultiplex the nanopore sequence reads and subsequently remove the barcode sequences. We used Minimap2 (version 2.17) to align the nanopore reads to the reference genome, and we used Sniffles (version 1.0.12) and SURVIVOR (version 1.03) for structural variant calling. We then discarded all structural variants less than 1000 bp in length.
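The final length filter can be illustrated with a short script. This is a sketch under stated assumptions, not the authors' code: it assumes each record carries an `SVLEN` entry in the INFO column (possibly negative for deletions), and it drops records that lack a usable `SVLEN`, which is an arbitrary choice made here. The names `parse_info` and `filter_by_length` and the toy records are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-only entries map to True."""
    out = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def filter_by_length(lines, min_len=1000):
    """Yield header lines plus records whose |SVLEN| is at least min_len.

    Assumption of this sketch: records without a numeric SVLEN are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])  # INFO is the eighth VCF column
        try:
            length = abs(int(info["SVLEN"]))
        except (KeyError, ValueError):
            continue
        if length >= min_len:
            yield line

# Toy records standing in for real variant calls.
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-2500;END=2600\n",
    "1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=350;END=900\n",
    "2\t50\tsv3\tN\t<INV>\t.\tPASS\tSVTYPE=INV;SVLEN=1000;END=1050\n",
]
kept = [l for l in filter_by_length(demo) if not l.startswith("#")]
print([l.split("\t")[2] for l in kept])
```

Taking the absolute value handles callers that report deletion lengths as negative numbers.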
Usage notes
Lmel_dovetailPacBio_genome.fasta = genome assembly for Lycaeides melissa in FASTA format.
merged_SURVIVOR_1kbpdist_typesave_new.vcf = structural variant calls in VCF format.
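As a starting point for working with the VCF file, the sketch below tallies records by structural variant type. It assumes the type is stored as an `SVTYPE` entry in the INFO column, which is common for structural variant VCFs but has not been verified against this file; the function name `count_sv_types` and the toy records are illustrative.

```python
from collections import Counter

def count_sv_types(lines):
    """Count VCF records per SVTYPE value; records without one are 'UNKNOWN'."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the eighth column
        svtype = "UNKNOWN"
        for item in info.split(";"):
            if item.startswith("SVTYPE="):
                svtype = item[len("SVTYPE="):]
                break
        counts[svtype] += 1
    return counts

# Toy records standing in for the real file.
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-2500\n",
    "1\t900\tsv2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-1200\n",
    "2\t50\tsv3\tN\t<INV>\t.\tPASS\tSVTYPE=INV;SVLEN=4000\n",
]
sv_counts = count_sv_types(demo)
print(dict(sv_counts))

# For the real file, one would pass an open file handle instead, e.g.:
# with open("merged_SURVIVOR_1kbpdist_typesave_new.vcf") as fh:
#     print(count_sv_types(fh))
```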