Whole-genome sequencing reveals asymmetric introgression between two sister species of cold-resistant leaf beetles
Cite this dataset
Lukicheva, Svitlana; Mardulyn, Patrick (2021). Whole-genome sequencing reveals asymmetric introgression between two sister species of cold-resistant leaf beetles [Dataset]. Dryad. https://doi.org/10.5061/dryad.x3ffbg7jz
A large number of genetic variation studies have identified cases of mitochondrial genome introgression in animals, indicating that reproductive barriers among closely related species are often permeable. Because of its sheer size, the impact of hybridization on the evolution of the nuclear genome is more difficult to apprehend. Only a few studies have explored it recently thanks to recent progresses in DNA sequencing and genome assembly. Here, we analysed whole-genome sequence variation among multiple individuals of two sister species of leaf beetles inside their hybrid zone, in which asymmetric mitochondrial genome introgression had previously been established. We used a machine learning approach based on computer simulations for training to identify regions of the nuclear genome that were introgressed. We inferred asymmetric introgression of ≈2% of the genome, in the same direction than was observed for the mitochondrial genome. Because a previous study based on a reduced-representation sequencing approach was not able to detect this introgression, we conclude that whole-genome sequencing is necessary when the fraction of the introgressed genome is small. We also analysed the whole-genome sequence of a hybrid individual, demonstrating that hybrids have the capacity to backcross with the species for which virtually no introgression was observed. Our data suggest that one species has recently invaded the range of the other and/or that positive selection has favoured the mitochondrial genes of foreign origin for the invading species, along with a series of nuclear genes, some of which possibly being co-adapted with the mitochondrial genes.
The reference genome of G. quinquepunctata (gq__wtdbg2.fa) was assembled from a single individual (pupa). We used 2 sequencing libraries: 85x Illumina 2x250 bp and 27x ONT. The ONT reads were assembled using wtdbg2 and the resulting assembly was polished twice using the Illumina sequences.
The file gq__repeats.gff contains the coordinates of the repeated regions identified within the reference genome of G. quinquepunctata. It was computed by first creating a species-specific repeat library using RepeatModeler v.1.0.11 and then combining this library with the Repbase library (RepeatMasker edition 20181026) to identify the coordinates of the repeated regions using RepeatMasker v.4.0.9.
To detect introgression, we used 10 G. intermedia individuals, 9 G. quinquepunctata individuals and one hybrid. The corresponding Illumina libraries are available on the NCBI Sequence Read Archive (SRA) under the accession number XXXX.
The VCF files were computed by aligning these reads to the reference assembly and performing the variant calling using GATK v.18.104.22.168. The file full-dataset.vcf contains the variant calls computed for all the individuals. The files gi__snp.vcf and gq__snp.vcf contain the SNPs computed separately for Gonioctena intermedia and Gonioctena quinquepunctata individuals (hybrid excluded). The files gi__all-positions.vcf and gq__all-positions.vcf were computed with the option "--emitRefConfidence" and contain confidence scores for each site the the genome, computed separately for Gonioctena intermedia and Gonioctena quinquepunctata individuals (hybrid excluded).
The simulated datasets (training__no-introgression.txt, training__introgression12.txt, training__introgression21.txt) were generated using msmove. The first dataset contains simuated data without introgression, the second dataset contains simulated data with the introgression in the direction G. intermedia -> G. quinquepunctata and the last dataset contains simulated data with the introgression in the direction G. quinquepunctata -> G. intermedia. For each parameter provided to msmove, we defined an interval around the estimated value v as follows: [v - v/2; v + v/2] and generated values uniformly distributed across this interval. The parameter values (estimated by GADMA) were the following:
- Ancestral population size (Nanc) = 9,808,986
- Divergence time (Tdiv) = 501,675
- Initial population size of G. intermedia (Nint0) = 9,673,164
- Final population size of G. intermedia (Nint) = 107,758
- Initial population size of G. quinquepunctata (Nqui0) = 135,823
- Final population size of G. quinquepunctata (Nqui) = 1,292,448
For the datasets with introgression, we defined an interval of [0; Tdiv/4] for the introgression time and in interval of [0.01; 1] for the probability of introgression.
The file filet-classification.zip contains the final classification produced by FILET. FILET was run following the example given at https://github.com/kr-colab/FILET/blob/master/examplePipeline.sh. We used the sliding windows approach. The window size was set to 10k, the sliding step to 1k.
The reference genome, the file containing repeats coordinates and the four VCF files computed separately for G. intermedia and G. quinquepunctata are used during the first two steps of the data masking procedure explained here: https://github.com/kr-colab/FILET/tree/master/maskingScripts . These files should be used as follows:
python makeMaskArmFilesFromQualAndRM.py gi__all-positions.vcf gq__all-positions.vcf gq__repeats.gff gq__wtdbg2.fa 20 0.7 maskedRefFaFileName
python makeFilteredSnpVcfsForEachArm.py gi__snp.vcf gq__snp.vcf gi__all-positions.vcf gq__all-positions.vcf maskedRefFaFileName 20 filteredVcfPathPrefix_
Fund for Scientific Research, Award: CDR J.0075.18