
Predicting reservoir hosts based on early SARS-CoV-2 samples and analyzing later world-wide pandemic

Cite this dataset

Guo, Qian et al. (2020). Predicting reservoir hosts based on early SARS-CoV-2 samples and analyzing later world-wide pandemic [Dataset]. Dryad.


The SARS-CoV-2 pandemic has raised concern about the reservoir hosts of the virus since the early-stage outbreak. To address this problem, we proposed a deep learning method, DeepHoF, which extracts viral genomic features to calculate infection likelihoods and predict the probable hosts of novel viruses. Overcoming the limitations of sequence similarity-based methods, DeepHoF was applied to the analysis of SARS-CoV-2 in the 2020 pandemic. Using the isolates sequenced in the earliest stage of COVID-19, DeepHoF identified that minks, bats, dogs and cats may be highly susceptible to SARS-CoV-2, and that minks might be one of the most noteworthy reservoir hosts. Several SARS-CoV-2 genes proved significant in determining the infection likelihood in humans or the host range. A large-scale genome analysis of the later world-wide pandemic, based on DeepHoF's computation, suggests that the probable bidirectional transmission of SARS-CoV-2 between humans and minks should not be overlooked.


Datasets construction for training and test

We downloaded 63,049 whole viral genomes available in GenBank as of 9 July 2019 and tagged them with five host labels (plant, germ, invertebrate, non-human vertebrate and human), which were integrated from the host metadata provided by GenBank (Supplementary Table 3). Viruses infecting multiple host types were given multiple labels. After data collection, short fragments were generated randomly from the tagged whole genomes, because processing long sequences is computationally costly. The training set was constructed from short fragments of the 55,283 genomes released before 1 January 2018, and the test set was constructed from the rest (the accession list and the host information of the genomes used for training and testing are in Supplementary Data 6).
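The labelling, fragment sampling and date-based split described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the fragment count and the fragment length range are all hypothetical, since the source does not state them here.

```python
import random
from datetime import date

# Host labels in the order used in the text.
HOST_TYPES = ["plant", "germ", "invertebrate", "non-human vertebrate", "human"]
# Genomes released before this date go to training, the rest to test.
SPLIT_DATE = date(2018, 1, 1)


def multi_hot(hosts):
    """Encode a set of host labels as a 5-element 0/1 vector (multi-label)."""
    return [1 if h in hosts else 0 for h in HOST_TYPES]


def sample_fragments(genome, n_fragments, min_len, max_len, rng):
    """Draw random substrings of a genome. Lengths and counts are assumptions."""
    fragments = []
    for _ in range(n_fragments):
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        fragments.append(genome[start:start + length])
    return fragments


def assign_split(release_date):
    """Return 'train' or 'test' according to the release date."""
    return "train" if release_date < SPLIT_DATE else "test"


rng = random.Random(0)  # fixed seed so the toy example is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # toy genome
frags = sample_fragments(genome, n_fragments=4, min_len=100, max_len=800, rng=rng)
labels = multi_hot({"human", "non-human vertebrate"})
print(labels, [len(f) for f in frags], assign_split(date(2017, 6, 30)))
```

Every fragment inherits the multi-hot label vector of the genome it was drawn from, so a virus with several host types contributes positive examples to several outputs at once.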

Mathematical representation of viral whole genomes

Owing to long-term adaptation to their natural reservoirs, viruses share some evolutionary signatures in their nucleotide sequences, such as codon pair, dinucleotide, codon, and amino acid biases, with those reservoirs [17]. In addition, viral proteins, especially the receptors that effectively attach to the host cell membrane, are crucial factors for viruses to invade and infect host cells [39]. In brief, the genome composition of a virus can inform host-virus correlation.

Herein, we represent a given viral sequence with a base one-hot matrix (BOH) and a codon one-hot matrix (COH), which digitize the genetic information of the virus at the nucleotide and codon levels respectively. Bases and codons are encoded in one-hot format so that they can be used with deep learning algorithms. For BOH, the query sequence is joined with its complementary strand and each consecutive base is one-hot encoded. For COH, we do not extract ORFs, since coding sequences make up most of the viral genome. Instead, we directly concatenate the six phases of the input sequence (Supplementary Fig. 3), and each consecutive codon of the joined sequences is then one-hot encoded. Consequently, an input sequence of length L is transformed into a BOH matrix of size 2L×4 and a COH matrix of size 2L×64.
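A minimal sketch of the two encodings is given below. It assumes that "linked by its complementary strand" means appending the reverse complement, and that the six phases are the three reading frames of each strand read as non-overlapping codons. The function names are hypothetical, and ambiguous bases such as N are encoded here as all-zero rows, which is also an assumption.

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 codons
BASE_INDEX = {b: i for i, b in enumerate(BASES)}
CODON_INDEX = {c: i for i, c in enumerate(CODONS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(index, size):
    """One-hot row; all zeros when index is None (unknown symbol)."""
    row = [0] * size
    if index is not None:
        row[index] = 1
    return row


def base_one_hot(seq):
    """BOH: forward strand followed by reverse complement, shape 2L x 4."""
    joined = seq + reverse_complement(seq)
    return [one_hot(BASE_INDEX.get(b), 4) for b in joined]


def codon_one_hot(seq):
    """COH: codons of the six phases concatenated, one 64-wide row per codon."""
    rows = []
    for strand in (seq, reverse_complement(seq)):
        for phase in range(3):
            # Non-overlapping codons starting at this phase offset.
            for i in range(phase, len(strand) - 2, 3):
                rows.append(one_hot(CODON_INDEX.get(strand[i:i + 3]), 64))
    return rows


seq = "ATGGCCAAGTAA"  # toy sequence, L = 12
boh = base_one_hot(seq)
coh = codon_one_hot(seq)
print(len(boh), len(boh[0]), len(coh), len(coh[0]))
```

In this sketch the six phases yield 2(L−2) codons in total rather than exactly 2L, so reaching the stated 2L×64 shape would need a few padding rows; how the original implementation handles this is not described in this section.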

BiPathCNN Model descriptions

To build the DeepHoF framework, we first use a BiPathCNN [40], which contains two CNN paths that extract information from the BOH matrix and the COH matrix respectively. This information naturally corresponds to the genomic features shared by viruses that infect the same kind of host. After independent convolution and pooling operations at the beginning, the two paths are combined by a concatenation layer. Following a normalization layer, five sub-paths provide five prediction scores, corresponding to the probabilities of infecting plants, germs, invertebrates, non-human vertebrates and humans individually. The architecture of DeepHoF is shown in Supplementary Fig. 4 and the details of each layer in BiPathCNN are described in Supplementary Methods.

Implementation of DeepHoF

In practical application to a viral whole genome sequence (or a partial genome sequence), a window moves along the long sequence without overlapping and cuts it into fragments suitable for the pre-trained BiPathCNN model. DeepHoF calculates the final score as a length-weighted sum of the likelihoods predicted for each fragment. For example, a 2,000 bp query sequence is separated into three consecutive fragments, corresponding to the first 800 bp, the middle 800 bp and the last 400 bp of the query sequence. DeepHoF then predicts the three fragments independently and calculates the weighted average of the three predicted likelihood vectors with weights of 800/2,000, 800/2,000, and 400/2,000 respectively. For each input sequence, DeepHoF outputs five scores corresponding to the probabilities of the virus infecting each of the five host types. In addition, DeepHoF provides a p-value for each score, which statistically measures how distinct the likelihood is from those of non-infectious viruses [22]. For example, if an input virus has a probability of 0.4 of infecting humans, we compare 0.4 with the scores of the non-human-infecting viruses in our dataset and provide the p-value as a basis for judgment. If the p-value is less than 0.05, we conclude that the input virus can infect humans, with an infection likelihood significantly different from that of non-human-infecting viruses.
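The windowing and length-weighted averaging can be sketched as follows, reproducing the 2,000 bp example. The predictor here is a toy stand-in for the trained BiPathCNN, and all function names are hypothetical. The p-value routine is only one plausible reading of the comparison against non-infecting viruses (an empirical upper-tail fraction with an add-one correction); the source does not specify the exact statistic.

```python
def split_windows(seq, window=800):
    """Cut a sequence into consecutive, non-overlapping windows."""
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def aggregate(fragments, predict):
    """Length-weighted average of the per-fragment 5-score vectors."""
    total = sum(len(f) for f in fragments)
    scores = [0.0] * 5
    for frag in fragments:
        weight = len(frag) / total  # e.g. 800/2000, 800/2000, 400/2000
        for k, p in enumerate(predict(frag)):
            scores[k] += weight * p
    return scores


def empirical_p_value(score, background_scores):
    """Assumed statistic: share of background scores at least as large."""
    exceed = sum(1 for s in background_scores if s >= score)
    return (exceed + 1) / (len(background_scores) + 1)


def toy_predict(fragment):
    """Stand-in for the model: uses GC content, purely for illustration."""
    gc = (fragment.count("G") + fragment.count("C")) / len(fragment)
    return [0.0, 0.0, 0.0, 1.0 - gc, gc]


query = "G" * 800 + "A" * 800 + "G" * 400  # toy 2,000 bp query
windows = split_windows(query)
final_scores = aggregate(windows, toy_predict)
# Background values are invented for the example.
p_human = empirical_p_value(final_scores[4], [0.1, 0.2, 0.3, 0.7])
print([len(w) for w in windows], final_scores, p_human)
```

Weighting by fragment length keeps a short trailing fragment, such as the final 400 bp, from counting as much as a full-length window.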

Because the infection likelihood profile of a virus, consisting of the five scores predicted by DeepHoF, can be regarded as an infection-related feature vector, we use it to characterize the virus. It is logical to regard the hosts that can be infected by viruses with similar profiles as the probable hosts of the given virus. To compare infection likelihood profiles between viruses quantitatively, we calculated the Euclidean distance between the profiles. In the case of SARS-CoV-2, we searched for the specific vertebrate hosts of the earliest detected isolates, which are closer to the most recent common ancestor of SARS-CoV-2. First, we added the host annotations provided by Virus-Host DB [41] to the vertebrate-infecting viruses included in GenBank. The average of the infection likelihood profiles of the 17 earliest sequenced isolates was used to represent SARS-CoV-2. We calculated the Euclidean distance between the infection likelihood profile of SARS-CoV-2 and that of each vertebrate-infecting virus (discovered before the outbreak of SARS-CoV-2). We assumed that a vertebrate infected by a virus with a profile close to that of SARS-CoV-2 was a probable host of SARS-CoV-2.

Data filtering and trimming for SARS-CoV-2 genome sequences

There were 102,804 SARS-CoV-2 genomes released in the GISAID EpiCoV database as of 15 September 2020. We downloaded all the sequences and filtered them using the quality standard given by the Chinese Academy of Sciences [34]. Because the untranslated regions (UTRs) are of less interest than the protein-coding regions and the lengths of the sequenced UTRs vary widely among SARS-CoV-2 genomes, we trimmed the 5′- and 3′-UTRs according to the annotation of NC_045512 to reduce noise. This left 53,759 clean sequences.

Usage notes

# DeepHoF: Predicting reservoir hosts based on early SARS-CoV-2 samples and analyzing later world-wide pandemic

* [Introduction](#introduction)
* [Version](#version)
* [Requirements](#requirements)
* [Installation](#installation)
* [Usage](#usage)
* [Output](#output)
* [Citation](#citation)
* [Contact](#contact)

## Introduction

DeepHoF (a deep learning-based virus-host finder) is designed to predict the potential host types (plant, germ, invertebrate, vertebrate, human) of a given virus, which is represented by its nucleotide sequences. The tool provides five scores and the corresponding p-values, which reflect the probabilities of the virus infecting each host type. In addition, the infection likelihood profile of the given virus is provided.

## Version
+ DeepHoF 1.0 (Tested on Ubuntu 16.04)

## Requirements
### To run the physical host version of DeepHoF, you need to install:
+ Python 3.6.10
+ numpy 1.17.5
+ h5py 2.10.0
+ pandas 0.25.3
+ TensorFlow 1.4.0
+ Keras 2.1.3
+ MATLAB R2018a

(1) DeepHoF should be run under a Linux operating system.  
(2) For compatibility, we recommend installing versions similar to those listed above.  
(3) If a GPU is available on your machine, we recommend installing the GPU version of TensorFlow to speed up the program.  

## Installation

### 1. Prerequisites
  First, please install **numpy, h5py, pandas, TensorFlow** and **Keras** according to their manuals. All of these are Python packages, which can be installed with ``pip``. If ``pip`` is not already installed on your machine, use the command ``sudo apt-get install python-pip python-dev`` to install it. Here are example commands for installing the above Python packages with ``pip``:
    pip install numpy
    pip install h5py
    pip install pandas
    pip install tensorflow==1.4.0  #CPU version
    pip install tensorflow-gpu==1.4.0  #GPU version
    pip install keras==2.1.3

  Alternatively, you can use the command ``conda env create -p DeepHoF  -f DeepHoF_env.yaml`` to install all the prerequisites of DeepHoF automatically.
  If you are going to install the GPU version of TensorFlow, the required NVIDIA software must be installed first. Consult the TensorFlow documentation to check whether your machine supports TensorFlow with GPU support.  

  To run DeepHoF, MATLAB must also be installed; please follow the MATLAB installation instructions.  
### 2. Install DeepHoF using git
  Clone DeepHoF package
    git clone
  Change directory to DeepHoF:
    cd DeepHoF/DeepHoF
  All scripts are under the folder.

## Usage

### Input

  Nucleotide sequence
### Command

  Please execute the following command directly in MATLAB command window:
  For example, if you want to identify the sequences in "example.fna", please execute:
  Please remember to set the working path of MATLAB to DeepHoF folder before running the programme.
### Output

The output of DeepHoF consists of 11 columns:

Header | plant_score | germ_score | invertebrate_score | vertebrate_score | human_score | plant_pvalue | germ_pvalue | invertebrate_pvalue | vertebrate_pvalue | human_pvalue |
------ | ----------- | ---------- | ------------------ | ---------------- | ----------- | ------------ | ----------- | ------------------- | ----------------- | ------------ |

The content of the `Header` column is the same as the header of the corresponding sequence in the input file. Given a viral nucleotide sequence as input, DeepHoF outputs five scores, one per host type, reflecting the likelihood of infecting each host type. In addition, DeepHoF provides five p-values, which are statistical measures of how distinct these scores are from those of non-infection events.

# Citation

# Note
DeepHoF is also available at our website and the Dryad git repository. If you have problems downloading DeepHoF from GitHub, or if you want to use the large training and test datasets of DeepHoF, you can use these alternatives.

# Contact
Please direct your questions to us, or


Ministry of Science and Technology of the People's Republic of China, Award: 2017YFC1200205

National Natural Science Foundation of China, Award: 32070667

National Natural Science Foundation of China, Award: 31671366

High Performance Computing Platform of the Center for Life Science of Peking University