Raw motif mapping bedfile data and model training set class probabilities
Data files
Jun 30, 2023 version files 10.96 GB
-
plos_comp_raw_data.tar.gz
10.96 GB
-
README.md
911 B
Abstract
Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (SADS-CoV) genome sequences as non-human pathogens. Feature selection using L1 regularization identified several degenerate nucleotide predictor motifs with high model coefficients for the human pathogen class that were present across widely disparate clades of coronaviruses. However, these motifs differed in which genes they were present in, what specific codons were used to encode them, and what the translated amino acid motif was. This emphasizes the importance of a phenetic view of emerging pathogenic RNA viruses, as opposed to the canonical phylogenetic interpretations most commonly used to track and manage viral zoonoses. Applying our model to more recent Orthocoronavirinae genomes deposited since October 2018 yields a novel contextual view of pathogen potential across bat-related, canine-related, porcine-related, and rodent-related coronaviruses and critical adaptations which may have contributed to the emergence of the pandemic SARS-CoV-2 virus. Finally, we discuss the next steps to achieve robust predictive ensembles and the utility of these models (and their associated predictor motifs) to novel biosurveillance protocols that substantially increase the ‘pound-for-pound’ information content of field-collected sequencing data and make a strong argument for the necessity of routine collection and sequencing of zoonotic viruses.
The data was produced by hierarchically clustering K-mers to produce degenerate motifs that were then mapped back to the originating 2276 coronavirus genomes they were drawn from. Class probabilities we produced by regularized logistic regression models fit for a human pathogen phenotype label applied to those input genomes.
tar, gzip. Files can be read using any text reader, and processed with bedtools compatible software.