Data from: Cryptic hybridization between the ancient lineages of Natterer's bat (Myotis nattereri)
Data files
May 14, 2024 version files 104.21 GB
-
465.bam
-
9982_psmc.bam
-
9992_psmc.bam
-
all_samples_all_scaffolds_filtered.vcf.gz
-
README.md
Abstract
Studying hybrid zones that form between morphologically cryptic taxa offers valuable insights into the mechanisms of cryptic speciation and the evolution of reproductive barriers. Although hybrid zones have long been the focus of evolutionary studies, the awareness of cryptic hybrid zones increased recently due to rapidly growing evidence of biological diversity lacking obvious phenotypic differentiation. The characterization of cryptic hybrid zones with genome-wide analysis is in its early stages and offers new perspectives for studying population admixture and thus the impact of gene flow. In this study, we investigate the population genomics of the Myotis nattereri complex in one of its secondary contact zones, where a putative hybrid zone is formed between two of its cryptic lineages. By utilizing a whole genome shotgun sequencing approach, we aim to characterize this cryptic hybrid zone in detail. Demographic analysis suggests that the cryptic lineages diverged during the Pliocene, approximately 3.6 million years ago. Despite this ancient separation, the populations in the contact zone exhibit mitochondrial introgression and a considerable amount of mixing in nuclear genomes. The genomic structure of the populations corresponds to geographic locations and the genomic admixture changes along a geographic gradient. These findings suggest that there is no effective hybridization barrier between both lineages, nevertheless their population structure is shaped by dispersal barriers. Our findings highlight how such deeply diverged cryptic lineages can still readily hybridize in secondary contact.
Methods
DNA extraction
DNA was extracted either by using QIAGEN DNeasy Blood and Tissue kit following the manufacturer’s recommendation or with a salt-chloroform extraction method (Dietz et al., 2016; Müllenbach et al., 1989). Tissue samples were digested in a lysis buffer for one hour at 56°C followed by 37°C overnight. In the end, DNA was eluted in 90 µl of AE buffer (Qiagen). We checked the DNA concentration and purity by using a NanoDrop spectrophotometer (Thermo Fisher Scientific Inc). The degree of DNA degradation was visualized by agarose gel (1%) electrophoresis.
For genomic library preparation 500 ng of DNA was mechanically sheared by a S220 Focused-ultrasonicator (Covaris Inc) to achieve a fragment length below 800 bp. The fragment size distribution of sheared DNA was checked with Bioanalyzer (High Sensitivity DNA Chip, Agilent Inc) and concentration was determined using Qubit 2.0 Fluorometer and dsDNA HS Assay Kit (Thermo Fisher Scientific Inc). 80 ng of fragmented DNA was used to prepare libraries for paired-end Illumina sequencing with NEXTflex Rapid DNA Seq Kit 2.0 (PerkinElmer Inc) (Table S2).
During the study, a switch was made to the NEXTflex Rapid XP Library Preparation Kit (PerkinElmer Inc) employing an enzymatic fragmentation. 100 ng of unsheared DNA was used for preparing lllumina libraries (Table S2).
Genome shotgun sequencing
In total, three different NEXTFLEX library preparation kits (PerkinElmer Inc) were used following the manufacturer’s instructions: NEXTflex Rapid XP DNASeq Kit for 45 enzymatically sheared samples, NEXTflex Rapid DNASeq Kit for six mechanically sheared samples, and NEXTflex Rapid DNASeq Kit 2.0 for 11mechanically sheared samples. All libraries were made according to the manufacturer's “bead size selection” protocol aiming for a library size of approximately 500 bp and libraries were barcoded for multiplexing with NEXTflex HT Barcodes (PerkinElmer Inc).
Library size and quantity were determined with a Bioanalyzer or a TapeStation (High Sensitivity DNA Chip or D1000 Kit, Agilent Technologies Inc) and a Qubit 2.0 Fluorometer (dsDNA HS Assay Kit, Thermo Fisher Scientific Inc) following the manufacturer’s recommendations.
Genomic libraries were all sequenced at 150 bp paired-end on three different Illumina sequencing instruments NextSeq500, HiSeq2500, and NovaSeq6000. Five samples were additionally sequenced as 150bp single-end on a NextSeq500 instrument (Table S2).
Merging and cleaning reads
Following sequencing, the quality of raw reads was inspected with FastQC v0.11.8 (Andrews, 2010) after which the adapters were removed using Trimmomatic v0.36 (Bolger et al., 2014). Overlapping reads were then merged using PEAR v0.9.11 (J. Zhang et al., 2014). The p-value for the assembly using PEAR had to exceed the lowest possible value of 0.0001, the minimum overlap size had to be at least 20 bp, and a minimum possible read length of 30 bp was required. Next, quality trimming was done by Trimmomatic v0.36 (Bolger et al., 2014) to trim read ends with phred base quality below 15 and to discard reads shorter than 30 bp. Following cleaning, the polished reads data were then re-checked with FastQC v0.11.8 to ensure that all possible adapters and low-quality sequences were removed. Parameters not mentioned in this section were left at their default or as recommended in the programs’ manuals.
Mapping to the Myotis myotis reference genome
We mapped the cleaned reads to the M. myotis reference genome (GenBank assembly accession: GCA_014108235.1) that was provided by the Bat1K genome project allowing us early access to the non-annotated 92 scaffold assembly (Jebb et al., 2020). We used Burrows-Wheeler Aligner (Li & Durbin, 2009)BWA v0.7.17 mem algorithm, using the -M parameter to ensure compatibility with Picard Tools(broadinstitute.github.io/picard/). Both paired and unpaired reads were mapped separately to produce a bam (binary alignment) file for each individual. We used Picard Tools MarkDuplicates v2.18.25 to remove duplicates. We then used samtools-1.9 (Li et al., 2009) to sort and merge the bam files with paired and unpaired reads. After merging, we used Qualimap v2.2.1 (Okonechnikov et al., 2015) to evaluate the mapping results for each individual.
Variant calling
We used two parallel workflows for variant calling, using hard-called genotypes and genotype likelihoods to assess whether the heterogeneity in coverage between samples (Table S1) influenced the inference of population structure.
For the pipeline based on genotypes, we used Freebayes v1.3.1 (Garrison & Marth, 2012). Only reads with a mapping quality score above 10 were used to generate a multi-sample VCF file. Subsequently, we filtered with VCFtools v0.1.17 (Danecek et al., 2011) using the following parameters: only biallelic sites, coverage per site between 2 and 40 for each individual, and mean coverage between 5 and 60 over all included individuals. Minimal mapping quality was set to 20, and we allowed for 10% missing data per site. Finally, indels were removed and the minor allele frequency threshold was set to 0.03.
Genotype likelihoods (GLs) were estimated using the GATK model in ANGSD v0.928 (Korneliussen et al., 2014; McKenna et al., 2010), which only uses the observed bases that overlap at a specific position together with their associated sequencing quality scores. Only sites with a minimum mapping quality of 20 and minimum coverage of 2 were considered. The SNP p-value threshold for the likelihood ratio test was set to 1e-6. We excluded sites that showed a significant p-value for being triallelic, and those that had a minor allele frequency below 0.05.