Supporting data and code for: Phylogenetic identification of influenza virus candidates for seasonal vaccines

Hayati, Maryam1 ; Sobkowiak, Benjamin1 ; Stockdale, Jessica1 ; Colijn, Caroline1

Published Jun 16, 2023; Updated Dec 18, 2023 on Dryad. https://doi.org/10.5061/dryad.x95x69pqh

Data files

Jun 16, 2023 version files 116.05 MB

flu_vaccine-main.zip

116.04 MB
README.md

9.87 KB

Aug 16, 2023 version files 112.93 MB

flu_vaccine-main.zip

112.92 MB
README.md

9.87 KB

Dec 18, 2023 version files 112.93 MB

Abstract

The seasonal influenza (flu) vaccine is designed to protect against those influenza viruses predicted to circulate during the upcoming flu season, but identifying which viruses are likely to circulate is challenging. We use features from phylogenetic trees reconstructed from hemagglutinin (HA) and neuraminidase (NA) sequences, together with a support vector machine, to predict future circulation. We obtain accuracies of 0.75–0.89 (Area under the curve AUC 0.83–0.91) over 2016–2020. We explore ways to select potential candidates for a seasonal vaccine and find that the machine learning model has a moderate ability to select strains that are close to future populations. However, consensus sequences among the most recent three years also do well at this task. We identify similar candidate strains to those proposed by the World Health Organization, suggesting that this approach can help inform vaccine strain selection.

README: Phylogenetic identification of influenza virus candidates for seasonal vaccines - supporting data and code

This repository contains the code, data and materials developed for 'Phylogenetic identification of influenza virus candidates for seasonal vaccines'. The seasonal influenza (flu) vaccine is designed to protect against those influenza viruses predicted to circulate during the upcoming flu season, but identifying which viruses are likely to circulate is challenging. We use features from phylogenetic trees reconstructed from hemagglutinin (HA) and neuraminidase (NA) sequences, together with a support vector machine, to predict future circulation. We obtain accuracies of 0.75-0.89 (Area under the curve AUC 0.83-0.91) over 2016-2020. We explore ways to select potential candidates for a seasonal vaccine and find that the machine learning model has a moderate ability to select strains that are close to future populations. However, consensus sequences among the most recent three years also do well at this task. We identify similar candidate strains to those proposed by the World Health Organization (WHO), suggesting that this approach can help inform vaccine strain selection.

We downloaded all HA and NA human H3N2 influenza sequences collected from 1980 to February 2020 from the Global Initiative on Sharing Avian Influenza Data (GISAID). Accession numbers and references to the GISAID submitting laboratories for the sequences used in this study are included in this repository. As per GISAID access terms, the sequences used in this study are not reproduced here but may be downloaded from the GISAID server.

This repository contains all code to compute the features, train and test the machine learning models, predict the next year's flu vaccine candidates, and generate the plots for the paper. Derived data e.g. all the influenza trees for the experiments in years 2016 to 2020 are included.

Description of the data and file structure

This repository contains all the code to compute the features, train and test the machine learning models, predict the next year's flu vaccine candidates, and generate the plots for the paper. These are included in the 'code' folder.

'Code' folder

"AUC_subtree.R" code to make the support vector machine (SVM) models for each year.
"AllClassifiers.R" code to compare different binary classifiers on each year datasets.
"CorrelationPlot.R" code to generate correlation plot
"Data_5_1_3.R" code to compute the features.
"Downsample_2020.R" code to perform downsampling experiment
"Feature_selection.R" code to perform feature selection in the downsampling experiment
"general_functions.R" general support functions, called within other scripts
"GWAS.R" code to perform genome wide association study
"LBI_cor.R" code to calculate the correlation between local branching index (LBI) and other tree shape statistics.
"LBI_cor_tip.R" code to calculate the correlation between LBI and diversificationRate statistic.
"msvmRFE.R" calls function to run 'Support Vector Machine Recursive Feature Elimination (mSVM-RFE) feature ranking algorithm' https://github.com/johncolby/SVM-RFE
"paper_plot.R" code to generate violin plot of epitope distances (vaccine candidates vs realized)
"SVM_featurs.R" code to run the various SVM experiments
"Tree_Statistics.R" code to compute tree shape statistics.
"VaccineDist.R" code to compute the distance between our suggested candidates and the next year's circulating sequences.

Sharing/Access information

Data was derived from the following sources:

Global Initiative on Sharing Avian Influenza Data (GISAID) https://gisaid.org/

As per GISAID access terms, the sequences used in this study are not reproduced in this repository, but may be downloaded from the GISAID server. Folder 'gisaid_acknowledge_tables.zip' contains accession numbers of all sequences used, and references to their submitting laboratories whom we gratefully thank for providing the data for this study.

Data

All data files are in CSV format. All code was written in R (open-source), and influenza trees are included in RDATA files to be read into R.
Accession numbers and references to the GISAID submitting laboratories for the sequences used in this study are included in a zip folder.
To recreate the analysis in full, these accession numbers may be used to download the influenza sequences directly from GISAID. All required R packages are loaded within the scripts - a user who has not previously used these will need to first install the listed packages from CRAN. Code is provided to install any packages not on CRAN.

The numbered 2016-2020 folders contain all the influenza trees and other data for the experiments in years 2016 to 2020 respectively.
The 'General_data' folder contains general data that is not specific to any particular year.

2016-2020 data folders

Aux_data. Auxiliary data listing GISAID sequence IDs and linked collection dates. Variables: TipLabels_Data (GISAID sequence labels) and Date (date of collection)
flutree. RDATA file containing phylo object of influenza tree, tip labels matching Aux_data
mycurrentdata, myfutdata. Metric scores (columns B:CB, see below) per tip (column CC, labels match above)

Metrics used are as defined in full in the manuscript, see references there for full description. We provide a summary here. Base names describe HA results, NA results are indivated with '_NA'.

Betweenness centrality: Max. number of shortest paths through nodes
Weighted betweenness: As above, but with weighted edges
Closeness centrality: Max. total distance to all other nodes
Weighted closeness: As above, but with weighted edges
Eigenvector centrality: Max. value in Perron-Frobenius vector
Weighted eigenvector: As above, but with weighted edges
Diameter: Largest distance between 2 nodes
WienerIndex: Sum of all distances between 2 nodes
Mean tips pairwise distance: Average distance between 2 tips
Max. tips pairwise distance: Max. distance between 2 tips (with branch lengths)\
Min. adjacency: Min. adjacency matrix eigenvalue >0
Max. adjacency: Max. adjacency matrix eigenvalue
Min. Laplacian: Min. Laplacian matrix eigenvalue >0
Max. Laplacian: Max. Laplacian matrix eigenvalue
Cherry number: Number of nodes with 2 tip children
Normalized Pitchforks: 3x(Number of nodes with 3 tip descendants)/n
Normalized Colless imbalance: 1/n^{3/2}*sum{i in I{r}} |r_i-s_i|
Normalized Sackin imbalance: 1/n^{3/2}*sum_{i in L} N_i
Normalized Maximum height: The maximum height of tips in the tree/(n-1)
Maximum width: Max. number of nodes at the same depth
Stairs1: The portion of imbalanced subtrees
Stairs2: The average of min(s_i,r_i)/max(s_i,r_i) over all internal nodes
Max. difference in widths: max_i (n_{i+1}-n_i)
Variance: The variance of internal node depth
I2: sum{i in I{r}, r_i + s_i > 2} |r_i-s_i|/|r_i+s_i-2|
B1: sum{i in I} M_i^{-1}
B2: sum{i in L} N_i/2^{N_i}
Normalized Average ladder: The mean size of ladders in the tree/(n-2)
Normalized ILnumber: Number of internal nodes with a single tip child/(n-2)
Branching speed: The ratio of the number of tips to the height of the tree
Branching next index: Mean of indicator: does the next branching event descend from this node
Generalized branching next: Number of next two branching events descending from this node
Skewness: The skewness of the internal branch lengths (describes the degree of asymmetry of a distribution)
Kurtosis: The kurtosis of the internal branch lengths (describes the degree of asymmetry of a distribution)
Diversification rate: The reciprocal sum of the branches from a tip to the root of the tree

From manuscript table 1:
Here r_i and s_i are the number of tips of the left and right subtrees of an internal node i. n is the number of tips of a subtree. n_i is the number of nodes at depth i, M_i represents the height of the subtree rooted at an internal node i, and N_i is equal to the depth of node i. A ladder in a tree is a set of consecutive nodes with one tip child. We represent the set of all internal nodes of a tree as I, the set of all tips (or external nodes) as L, and the root of a subtree as r. In generalized branching next we chose m = 2. The tree shape statistics induced by betweenness centrality, closeness centrality and eigenvector centrality are defined as the maximum values of each centrality over all the nodes of a tree, and distances are in units of number of edges (without branch lengths).

'General_data' folder

Countrycode_tips. Appends Aux_data with additional metadata. Variables: TipLabels_Data (GISAID sequence labels), Date (date of collection), places (country or city of sample), country (country of sample) and region (geographical region of sample)
epitopeDis. Derived epitope distances associated with each tip. Variables: tip (tip number from R analysis), dist (epitope distance), and tip_name (GISAID sequence labels)
pop_proportions. Proportion of the world population by region. Variables: Region (global region), Proportion (proportion of world population residing in that region)
vaccine_subtree. Distances between predictions and circulating viruses. Variables: year (year of the experiment), min (minimum distance of the successful sequences to the sequences circulating in the following year), max (maximum distance " "), median (median distance " "), who (the distance of the vaccine candidates introduced by WHO to the next year circulating sequences)
var_data. Distances between predictions and circulating viruses in different years and differents models. Variables: name (year of experiment), model (the model that we used to choose the successful sequences. Either SVM or Random), value (the distance of the suggested candidate to the next year circulating sequences)