Dryad

Information and sequences of the six lineages identified in the H1N1 influenza A virus

Cite this dataset

Cheng, Chaoyuan; Zhang, Zhibin (2021). Information and sequences of the six lineages identified in the H1N1 influenza A virus [Dataset]. Dryad. https://doi.org/10.5061/dryad.r2280gbcw

Abstract

The influenza virus mutates and spreads rapidly, making it an ideal model for studying evolutionary and ecological processes. The ecological factors and processes by which different lineages compete or coexist within hosts, through time and across geographical space, are poorly understood. We hypothesize that competition would be higher for influenza viruses sharing the same host than for those infecting different hosts (Host Barrier Hypothesis), and higher for influenza viruses with a higher cross-region transmission intensity than for those with a lower cross-region transmission intensity (Geographic Barrier Hypothesis). Using available sequences of influenza A (H1N1) virus in GenBank, we identified six lineages of H1N1 and twelve clades with several replacement events. We found that the human-hosted lineages had a higher cross-region transmission intensity than the swine-hosted lineages. The estimated co-occurrence probability of lineages sharing the same host is much lower than that of lineages infecting different hosts, and human-hosted lineages had lower co-occurrence probability and genetic diversity than swine-hosted lineages. Our results indicate that H1N1 lineages sharing the same host or having a higher cross-region transmission intensity experienced higher competition and extinction pressure. Our study highlights the significant roles of host and geographic barriers in shaping the competition, extinction and coexistence patterns of H1N1 lineages or clades.

Methods

We collected H1N1 IAV sequence data from GenBank (https://www.ncbi.nlm.nih.gov/) and extracted the sequences encoding hemagglutinin (HA). The dataset comprised 32,759 H1N1 virus records, from the first case reported on May 11, 1918 to the latest case reported on October 30, 2018, and included the sampling location and date for each record. We assigned each sampling location a latitude and longitude based on the coordinates of its administrative center.

We excluded samples without accurate sampling dates (i.e., daily resolution) or strain codes, as well as those with sequences shorter than 1600 bp (29.8% of the total sequences). To reduce the impact of sampling-effort bias, we kept only one sample per location per month (removing 25.9% of the total sequences). In the end, 6097 H1N1 samples from 1279 locations were used (Fig. S1). For subsequent analysis at the amino acid codon level, all sequences were aligned using an H1N1 sequence (A/swine/Hong Kong/61/1977) as the template, and all gaps and redundant bases were manually deleted relative to the template so that each sequence could be converted to codons.
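The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the record layout (dictionaries keyed by the variable names listed under Usage notes), the `YYYY-MM-DD` date format, and the choice to keep the first record encountered for each location and month are all assumptions, since the text does not say which sample was retained.

```python
from datetime import datetime

MIN_LENGTH = 1600  # bp threshold stated in the Methods text


def parse_full_date(text):
    """Return a date if `text` has daily resolution (YYYY-MM-DD), else None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def filter_records(records):
    """Apply the exclusion rules and keep one sample per location per month.

    Assumption: the first record seen for a (lat, lon, year, month) key is
    the one that is kept; the source does not specify this choice.
    """
    kept = []
    seen = set()
    for rec in records:
        date = parse_full_date(rec.get("time"))
        if date is None or not rec.get("id"):
            continue  # no daily-resolution date or no strain code
        if len(rec.get("seqs", "")) < MIN_LENGTH:
            continue  # sequence shorter than 1600 bp
        key = (rec["lat"], rec["lon"], date.year, date.month)
        if key in seen:
            continue  # this location already has a sample for this month
        seen.add(key)
        kept.append(rec)
    return kept
```

Keying on the assigned coordinates rather than on a place name is a design choice made for this sketch; any stable location identifier would work the same way.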

We divided the cleaned sequences into three host types (human, swine and avian) based on each sample's host. We calculated the genetic distance (GD) between each sequence and the oldest sequence (A/swine/Hong Kong/61/1977) among all cleaned sequences, then plotted GD against sampling date separately for each host type. Under molecular clock theory, the mutation rate of H1 can be considered constant, so six lineages were readily identified from the linear relationships between GD and sampling time (see Fig. 1A). Based on the similarities and differences in how influenza viruses infect their hosts, hosts are generally divided into three categories (humans, pigs and birds), and most spread of influenza A viruses occurs within the same host category. We therefore used the following criteria to assign samples to a lineage: (1) the samples infect the same host category; (2) the samples form a close and continuous cluster in the two-dimensional space defined by GD and sampling time.
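As a rough illustration of the GD-versus-time idea, the sketch below computes a distance to a reference sequence and fits a straight line through (time, GD) points. The text does not state which distance measure was used, so the p-distance here (the proportion of compared sites that differ) is an assumption, and `p_distance` and `fit_line` are hypothetical helper names.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a character other than
    A, C, G or T are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With sampling time expressed as a decimal year, the slope of the fitted line for one cluster approximates that cluster's rate of divergence from the reference. Points belonging to the same lineage should fall near a common line, which is what criterion (2) relies on.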

Usage notes

The explanation of each variable in the dataset is as follows:

id: the virus name

lat: latitude

lon: longitude

time: sampling time of the virus

seqs: aligned HA-gene sequences of the virus

host: host type

year: sampling year
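A minimal sketch of reading these variables in Python follows. It assumes the table has been exported to CSV with the seven column names above as its header; the file format of the deposited dataset is not stated here, so that export step, the function name `load_records`, and the numeric conversions are assumptions.

```python
import csv
from collections import Counter

EXPECTED_COLUMNS = ["id", "lat", "lon", "time", "seqs", "host", "year"]


def load_records(handle):
    """Read rows from an open CSV handle and convert the numeric fields."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    records = []
    for row in reader:
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        row["year"] = int(row["year"])
        records.append(row)
    return records


def count_by_host(records):
    """Number of records for each host type."""
    return Counter(rec["host"] for rec in records)
```

Passing an open file object (for example from `open(path, newline="")`) keeps the reader independent of where the exported file lives.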