Data from: Genome-wide support for incipient Tula orthohantavirus species within a single rodent host lineage
Data files
Jan 12, 2024 version files (28.64 KB total)
- Central_Porcelain_keys.txt (27.50 KB)
- README.md (1.14 KB)
Abstract
Evolutionary divergence of viruses is most commonly driven by co-divergence with their hosts or through isolation of transmission after host-shifts. It remains mostly unknown, however, whether divergent phylogenetic clades within named virus species represent functionally equivalent byproducts of high evolutionary rates or rather incipient virus species. Here, we test these alternatives with genomic data from two widespread phylogenetic clades in Tula orthohantavirus (TULV) within a single evolutionary lineage of their natural rodent host, the common vole Microtus arvalis. We examined voles from 42 locations in the contact region between clades for TULV infection by RT-PCR. Sequencing yielded 23 TULV Central North and 21 TULV Central South genomes which differed by 14.9-18.5% at the nucleotide and 2.2-3.7% at the amino acid level without evidence of recombination or reassortment. Geographic cline analyses demonstrated an abrupt (<1 km wide) transition between the parapatric TULV clades in continuous landscape. This transition was located within the Central mitochondrial lineage of M. arvalis and genomic SNPs showed gradual mixing of host populations across it. Genomic differentiation of hosts was much weaker across the TULV Central North to South transition than across the nearby hybrid zone between two evolutionary lineages in the host. We suggest that these parapatric TULV clades represent functionally distinct, incipient species which are likely differently affected by genetic polymorphisms in the host. This highlights the potential of natural viral contact zones as systems for investigating the genetic and evolutionary factors enabling or restricting the transmission of RNA viruses.
README: Genome-wide support for incipient Tula orthohantavirus species within a single rodent host lineage
https://doi.org/10.5061/dryad.w0vt4b905
Raw data belonging to: Labutin, A. and Heckel, G. (2024) Genome-wide support for incipient Tula orthohantavirus species within a single rodent host lineage. Virus Evolution
Metadata for tab "Central_Porcelain_keys.txt"
Column entry and explanation:
Flowcell: Identifier of the Flowcell containing this sample
Lane: Number of lane for sequencing the sample
Barcode: Unique Barcode within the Flowcell for identifying the sample
FullSampleName: Same as DNASample
LibraryPlateID: Identifier for every library plate
Row: Identifier of Flowcell row of each sample
Col: Identifier of Flowcell column of each sample
Enzyme: Restriction enzymes used for the digestion of genomic DNA during GBS
Note: The keyfile is required for assembly and SNP calling of the raw GBS data of the corresponding manuscript, which are available from the SRA under https://www.ncbi.nlm.nih.gov/bioproject/PRJNA869681 and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA767008. The files from both repositories are required.
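As a quick sanity check before running a pipeline, the column list above can be verified programmatically. The following is a minimal sketch in Python. It assumes the keyfile is tab-delimited with a header row, which is not stated explicitly here. The function names and all example values (flowcell ID, barcodes, sample names) are made up for illustration and are not taken from the deposited file.

```python
import csv
import io
from collections import Counter

# Column names as listed in the README above.
EXPECTED_COLUMNS = [
    "Flowcell", "Lane", "Barcode", "FullSampleName",
    "LibraryPlateID", "Row", "Col", "Enzyme",
]


def read_keyfile(handle):
    """Read a keyfile into a list of dicts.

    Assumption: the file is tab-delimited and starts with a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"keyfile lacks columns: {missing}")
    return list(reader)


def duplicated_barcodes(rows):
    """Return (Flowcell, Lane, Barcode) combinations that occur more than once."""
    counts = Counter((r["Flowcell"], r["Lane"], r["Barcode"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical two-sample keyfile; every value below is invented.
EXAMPLE = (
    "Flowcell\tLane\tBarcode\tFullSampleName\tLibraryPlateID\tRow\tCol\tEnzyme\n"
    "FC001\t1\tACGT\tsample_A\tplate1\tA\t1\tPstI-MspI\n"
    "FC001\t1\tTGCA\tsample_B\tplate1\tA\t2\tPstI-MspI\n"
)

rows = read_keyfile(io.StringIO(EXAMPLE))
print(len(rows), "samples;", "duplicates:", duplicated_barcodes(rows))
```

To check the deposited file instead of the invented example, one would pass `open("Central_Porcelain_keys.txt", newline="")` to `read_keyfile`, again assuming a tab-delimited layout.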
Methods
Genotyping of host nuclear DNA
Genome-wide nuclear DNA (nucDNA) was used to infer the genetic structure of hosts via Genotyping by Sequencing (GBS) (Elshire et al., 2011). We sequenced at least five individuals per sampling site across the Central transect whenever possible, for a total of 200 individuals. Sequencing was carried out by LGC Biosearch Technologies (Berlin, Germany) using Illumina NovaSeq 6000 and PstI/MspI as restriction enzymes. We combined our dataset with GBS data of 216 additional individuals from the Porcelain transect (Saxenhofer et al., 2022) processed under the same conditions. This separate dataset consisted of voles from the Central and Eastern lineage as well as admixed individuals and served as a reference for the assignment of the newly genotyped 200 individuals to the evolutionary lineages. SNP calling was performed for all 416 individuals together using the GBS v2 pipeline (TASSEL 5) (Glaubitz et al., 2014) with the M. arvalis genome (BioProject ID: PRJNA737461, Gouy et al., submitted) as reference. We used default parameters, except that we required a minimum of five reads to identify a unique tag. We only retained bi-allelic SNPs and called genotypes if individuals had a read depth of at least five at the locus. After SNP calling, we removed all loci with complex indels, a minor allele frequency of less than 5%, more than 20% missing data or an observed heterozygosity greater than 50%; such high heterozygosity may indicate loci that contain paralogs merged together (White et al., 2013). Individuals with more than 50% missing data were also removed (seven individuals, all from Saxenhofer et al. (2022)). We performed an LD-KNNi imputation in TASSEL 5 (Glaubitz et al., 2014) for the remaining missing data based on the most common state of the allele across the 10 closest genetic neighbors, calculated across the 30 SNPs with the highest LD towards the missing site and keeping Ns in case of ties. A total of 12.8% of the data were missing in the dataset of 409 individuals, of which 99.93% were imputed.
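The locus and individual filters described above can be illustrated with a short Python sketch. This is not the TASSEL 5 implementation used for the dataset. It only restates the stated thresholds on a hypothetical genotype encoding (0, 1 or 2 copies of the alternative allele, `None` for a missing call), and the function names are invented for illustration. Observed heterozygosity is computed here among called genotypes only, which is an assumption.

```python
# Thresholds taken from the Methods text.
MIN_MAF = 0.05           # loci with a minor allele frequency below 5% are removed
MAX_SITE_MISSING = 0.20  # loci with more than 20% missing data are removed
MAX_OBS_HET = 0.50       # loci with observed heterozygosity above 50% are removed
MAX_IND_MISSING = 0.50   # individuals with more than 50% missing data are removed


def site_stats(calls):
    """Return (minor allele frequency, missing fraction, observed heterozygosity).

    `calls` holds one genotype per individual at a bi-allelic SNP:
    0, 1 or 2 alternative alleles, or None if no genotype was called.
    """
    called = [g for g in calls if g is not None]
    missing = 1 - len(called) / len(calls)
    if not called:
        return 0.0, missing, 0.0
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    # Assumption: heterozygosity is measured among called genotypes only.
    het = sum(1 for g in called if g == 1) / len(called)
    return maf, missing, het


def keep_site(calls):
    """Apply the three locus-level thresholds."""
    maf, missing, het = site_stats(calls)
    return maf >= MIN_MAF and missing <= MAX_SITE_MISSING and het <= MAX_OBS_HET


def keep_individual(calls):
    """`calls` holds one genotype per locus for a single individual."""
    missing = sum(g is None for g in calls) / len(calls)
    return missing <= MAX_IND_MISSING


# Toy loci scored in ten individuals.
polymorphic = [0, 1, 2, 0, 1, 2, 0, 0, 1, 2]   # passes all filters
all_het = [1] * 10                             # possible merged paralogs
print(keep_site(polymorphic), keep_site(all_het))
```

The all-heterozygous toy locus shows why the heterozygosity cap is useful: its minor allele frequency is 0.5 and nothing is missing, so only the third threshold rejects it.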
Sites that still contained missing data after imputation were discarded. To address potential batch effects of combining two independent GBS datasets, six of the 200 sequenced individuals were replicates of samples from Saxenhofer et al. (2022). One of the replicates was among the seven samples which failed quality control, leaving a total of five effective replicates. We performed all analyses of host population structure with the original dataset before imputation, a second dataset after imputation of missing data and a third, very stringently filtered dataset in which we removed any loci at which SNPs differed between the originals and replicates. We found minor quantitative differences between the three datasets but identical qualitative patterns across all analyses.
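The stringent third dataset relies on dropping loci that disagree between original and replicate samples. A minimal Python sketch of that idea follows. The sample names, the genotype encoding (0, 1 or 2 alternative alleles, `None` for missing) and both function names are hypothetical. The sketch also assumes that a locus counts as discordant only when both members of a pair have a called genotype, a detail the text above does not specify.

```python
def discordant_loci(genotypes, replicate_pairs):
    """Return indices of loci whose genotype differs within any replicate pair.

    genotypes: dict mapping sample name -> list of genotypes, one per locus.
    replicate_pairs: list of (original, replicate) sample-name tuples.
    Assumption: loci where either call is missing are not counted as discordant.
    """
    bad = set()
    for original, replicate in replicate_pairs:
        pair = zip(genotypes[original], genotypes[replicate])
        for index, (a, b) in enumerate(pair):
            if a is not None and b is not None and a != b:
                bad.add(index)
    return bad


def drop_loci(genotypes, loci_to_drop):
    """Return a copy of the genotype table without the listed locus indices."""
    return {
        sample: [g for i, g in enumerate(calls) if i not in loci_to_drop]
        for sample, calls in genotypes.items()
    }


# Invented example: one replicate pair and one unrelated sample, four loci.
genotypes = {
    "ind1": [0, 1, 2, None],
    "ind1_rep": [0, 2, 2, 1],
    "ind2": [1, 1, 0, 0],
}
bad = discordant_loci(genotypes, [("ind1", "ind1_rep")])
filtered = drop_loci(genotypes, bad)
print(sorted(bad), filtered["ind2"])
```

Note that the discordant locus is removed for every sample, not just for the replicate pair, which matches the description of removing the locus from the analysis as a whole.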