Raw motif mapping bedfile data and model training set class probabilities

Published Jun 30, 2023 on Dryad. https://doi.org/10.5061/dryad.tdz08kq3w

Data files

Jun 30, 2023 version files (10.96 GB total)

plos_comp_raw_data.tar.gz (10.96 GB)
README.md (911 B)
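
The archive's internal layout is not described on this page, so the following is only a sketch for reading motif-mapping records once the tarball has been unpacked. It assumes the BED files follow the standard convention of tab-separated columns whose first three fields are sequence name, 0-based start, and exclusive end. The names `BedRecord` and `parse_bed_lines` are hypothetical, and any further columns are kept as uninterpreted strings because their meaning is not documented here.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class BedRecord(NamedTuple):
    """One BED interval; field names are illustrative, not taken from the dataset."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    extra: Tuple[str, ...]  # any further columns, left uninterpreted


def parse_bed_lines(lines: Iterable[str]) -> Iterator[BedRecord]:
    """Yield records from BED-formatted text, skipping blank and header lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip empty lines, comments, and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
        yield BedRecord(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which avoids loading a large file into memory at once.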

Abstract

Leveraging prior viral genome sequencing data to predict whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging for RNA viruses because of the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation in modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross-validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by identifying models that accurately predict withheld SARS-CoV-2 genome sequences as human pathogens and withheld Swine Acute Diarrhea Syndrome coronavirus (SADS-CoV) genome sequences as non-human pathogens. Feature selection using L1 regularization identified several degenerate nucleotide predictor motifs with high model coefficients for the human pathogen class that were present across widely disparate clades of coronaviruses. However, these motifs differed in the genes in which they occurred, the specific codons that encoded them, and the translated amino acid motif. This emphasizes the importance of a phenetic view of emerging pathogenic RNA viruses, as opposed to the canonical phylogenetic interpretations most commonly used to track and manage viral zoonoses. Applying our model to more recent Orthocoronavirinae genomes deposited since October 2018 yields a novel contextual view of pathogen potential across bat-related, canine-related, porcine-related, and rodent-related coronaviruses and of critical adaptations that may have contributed to the emergence of the pandemic SARS-CoV-2 virus. Finally, we discuss the next steps to achieve robust predictive ensembles and the utility of these models (and their associated predictor motifs) for novel biosurveillance protocols that substantially increase the ‘pound-for-pound’ information content of field-collected sequencing data, and we make a strong argument for the necessity of routine collection and sequencing of zoonotic viruses.
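
To make the idea of a degenerate k-mer concrete, the sketch below scans a sequence for a motif written with IUPAC ambiguity codes. It is an illustration only, not the authors' implementation: the abstract does not say how degeneracy is encoded, so the use of IUPAC codes, the `find_degenerate_kmer` function, and the example motif are all assumptions.

```python
from typing import List

# IUPAC nucleotide ambiguity codes. Sequences are assumed to use T;
# convert U to T beforehand if working with RNA notation.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_degenerate_kmer(sequence: str, motif: str) -> List[int]:
    """Return 0-based start positions where the degenerate motif matches."""
    sequence = sequence.upper()
    # For each motif position, the set of concrete bases it permits.
    allowed = [IUPAC[code] for code in motif.upper()]
    k = len(allowed)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if all(base in options for base, options in zip(window, allowed)):
            hits.append(i)
    return hits
```

For example, `find_degenerate_kmer("ACGTACATAC", "ACR")` returns `[0, 4]`, because `R` permits either `A` or `G` and so matches both `ACG` and `ACA`. Counting such matches per genome is one plausible way to turn degenerate motifs into model features, though the feature encoding actually used is not stated here.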
