Genetic diversity and spread dynamics of SARS-CoV-2 variants present in African populations
Data files
May 31, 2024 version files 164.52 MB
Abstract
The dynamics of coronavirus disease 2019 (COVID-19) have been extensively researched in many settings around the world, but little is known about these patterns in Africa. To examine the genetic diversity and spread dynamics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) lineages circulating in Africa, 7540 complete genomes from 51 African nations were obtained from the National Center for Biotechnology Information (NCBI) and Global Initiative on Sharing All Influenza Data (GISAID) databases and analysed. Using a variety of clade and lineage nomenclature schemes, we examined their diversity, and we used maximum parsimony inference methods to reconstruct their evolutionary divergence and history. According to this study, only 465 of the 2610 Pango lineages reported worldwide had circulated in Africa after three years of the COVID-19 pandemic, with five different lineages dominating at various points during the outbreak. We identified South Africa, Kenya, and Nigeria as key sources of viral transmission between Sub-Saharan African nations. These findings provide insight into the viral strains circulating in Africa and their evolutionary patterns.
README: Genetic diversity and spread dynamics of SARS-CoV-2 variants present in African populations
https://doi.org/10.5061/dryad.1c59zw42d
Description of the data
A.23.1 – folder with information of Variant A (A.23.1) lineage
- _out.220209031522640G0LQd4KgcYqlpNnB3CSHJlsfnormal.txt – all A.23.1 sequences used in this study
- A.23.1dates.csv – information on sequence ID, where they were collected (State) and when they were collected (Date) in csv file format separated by a comma
- A.23.1dates.xlsx – information on sequence ID, where they were collected (State) and when they were collected (Date) in xlsx file format
- A.23.1dates2.txt – information on sequence ID, where they were collected (State) and when they were collected (Date) in tab-separated text format
- DedupA.23.1.fasta – a fasta file with A.23.1 sequences cleaned of duplicates
- DedupA.23.1.fasta.bionj – the BIONJ starting tree (BIONJ is an improved version of the neighbor-joining algorithm of Saitou and Nei)
- DedupA.23.1.fasta.ckp.gz – a checkpoint file of the entire run, gzip-compressed to save space
- DedupA.23.1.fasta.iqtree – the main report file, which is self-readable and contains the computational results along with a textual representation of the final tree
- DedupA.23.1.fasta.log – the log file of the entire run
- DedupA.23.1.fasta.mldist – the pairwise maximum-likelihood distance matrix computed by IQ-TREE
- DedupA.23.1.fasta.treefile – the ML tree in NEWICK format, which can be visualized with any supported tree viewer, such as FigTree or iTOL
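The .mldist files can be inspected without IQ-TREE. The following is a minimal sketch, assuming the file is a PHYLIP-style square matrix (a first line with the number of sequences, then one row per sequence consisting of its name followed by the distances). The function names and the three example sequences are hypothetical and are not taken from the dataset.

```python
from io import StringIO


def read_mldist(handle):
    """Parse a PHYLIP-style square distance matrix into (names, rows)."""
    n = int(handle.readline().split()[0])  # first line: number of sequences
    names, rows = [], []
    for _ in range(n):
        fields = handle.readline().split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:1 + n]])
    return names, rows


def closest_pair(names, rows):
    """Return (distance, name_a, name_b) for the smallest off-diagonal entry."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if best is None or rows[i][j] < best[0]:
                best = (rows[i][j], names[i], names[j])
    return best


# Hypothetical three-sequence matrix standing in for a real .mldist file.
example = StringIO(
    "3\n"
    "seqA 0.0 0.002 0.010\n"
    "seqB 0.002 0.0 0.009\n"
    "seqC 0.010 0.009 0.0\n"
)
names, rows = read_mldist(example)
print(closest_pair(names, rows))
```

To use it on a real file, pass an open file handle (for example open("DedupA.23.1.fasta.mldist")) instead of the StringIO object.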
B.1.1.529 - folder with information of Omicron (B.1.1.529) lineage
- B.1.1.529dates.csv – information on sequence ID, where they were collected (State) and when they were collected (Date) in csv file format separated by a comma
- B.1.1.529dates.txt – information on sequence ID, where they were collected (State) and when they were collected (Date) in tab-separated text format
- B.1.1.529dates.xlsx – information on sequence ID, where they were collected (State) and when they were collected (Date) in xlsx file format
- DedupB.1.1.529.fasta – a fasta file with B.1.1.529 sequences cleaned of duplicates
- DedupB.1.1.529.fasta.bionj – the BIONJ starting tree (BIONJ is an improved version of the neighbor-joining algorithm of Saitou and Nei)
- DedupB.1.1.529.fasta.ckp.gz – a checkpoint file of the entire run, gzip-compressed to save space
- DedupB.1.1.529.fasta.iqtree – the main report file, which is self-readable and contains the computational results along with a textual representation of the final tree
- DedupB.1.1.529.fasta.log – the log file of the entire run
- DedupB.1.1.529.fasta.mldist – the pairwise maximum-likelihood distance matrix computed by IQ-TREE
- DedupB.1.1.529.fasta.treefile – the ML tree in NEWICK format, which can be visualized with any supported tree viewer, such as FigTree or iTOL
B.1.351 - folder with information of Beta (B.1.351) lineage
- B.1.351dates.csv – information on sequence ID, where they were collected (State) and when they were collected (Date) in csv file format separated by a comma
- B.1.351dates.txt – information on sequence ID, where they were collected (State) and when they were collected (Date) in tab-separated text format
- B.1.351dates.xlsx – information on sequence ID, where they were collected (State) and when they were collected (Date) in xlsx file format
- DedupB.1.351.fasta – a fasta file with B.1.351 sequences cleaned of duplicates
- DedupB.1.351.fasta.bionj – the BIONJ starting tree (BIONJ is an improved version of the neighbor-joining algorithm of Saitou and Nei)
- DedupB.1.351.fasta.ckp.gz – a checkpoint file of the entire run, gzip-compressed to save space
- DedupB.1.351.fasta.iqtree – the main report file, which is self-readable and contains the computational results along with a textual representation of the final tree
- DedupB.1.351.fasta.log – the log file of the entire run
- DedupB.1.351.fasta.mldist – the pairwise maximum-likelihood distance matrix computed by IQ-TREE
- DedupB.1.351.fasta.treefile – the ML tree in NEWICK format, which can be visualized with any supported tree viewer, such as FigTree or iTOL
B.1.525 - folder with information of Eta (B.1.525) lineage
- B.1.525dates.csv – information on sequence ID, where they were collected (State) and when they were collected (Date) in csv file format separated by a comma
- B.1.525dates.txt – information on sequence ID, where they were collected (State) and when they were collected (Date) in tab-separated text format
- B.1.525dates.xlsx – information on sequence ID, where they were collected (State) and when they were collected (Date) in xlsx file format
- DedupB.1.525.fasta – a fasta file with B.1.525 sequences cleaned of duplicates
- DedupB.1.525.fasta.bionj – the BIONJ starting tree (BIONJ is an improved version of the neighbor-joining algorithm of Saitou and Nei)
- DedupB.1.525.fasta.ckp.gz – a checkpoint file of the entire run, gzip-compressed to save space
- DedupB.1.525.fasta.iqtree – the main report file, which is self-readable and contains the computational results along with a textual representation of the final tree
- DedupB.1.525.fasta.log – the log file of the entire run
- DedupB.1.525.fasta.mldist – the pairwise maximum-likelihood distance matrix computed by IQ-TREE
- DedupB.1.525.fasta.treefile – the ML tree in NEWICK format, which can be visualized with any supported tree viewer, such as FigTree or iTOL
B.1.640 - folder with information of IHU Variant (B.1.640) lineage
- B.1.640dates.csv – information on sequence ID, where they were collected (State) and when they were collected (Date) in csv file format separated by a comma
- B.1.640dates.txt – information on sequence ID, where they were collected (State) and when they were collected (Date) in tab-separated text format
- B.1.640dates.xlsx – information on sequence ID, where they were collected (State) and when they were collected (Date) in xlsx file format
- DedupB.1.640.fasta – a fasta file with B.1.640 sequences cleaned of duplicates
- DedupB.1.640.fasta.bionj – the BIONJ starting tree (BIONJ is an improved version of the neighbor-joining algorithm of Saitou and Nei)
- DedupB.1.640.fasta.ckp.gz – a checkpoint file of the entire run, gzip-compressed to save space
- DedupB.1.640.fasta.iqtree – the main report file, which is self-readable and contains the computational results along with a textual representation of the final tree
- DedupB.1.640.fasta.log – the log file of the entire run
- DedupB.1.640.fasta.mldist – the pairwise maximum-likelihood distance matrix computed by IQ-TREE
- DedupB.1.640.fasta.treefile – the ML tree in NEWICK format, which can be visualized with any supported tree viewer, such as FigTree or iTOL
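Each lineage folder above carries a dates file with the sequence ID, State, and Date of collection. The following is a minimal sketch of reading such a file and finding the earliest collection date per State. It assumes the header row uses the column names ID, State, and Date, and that dates are written as YYYY-MM-DD; both are assumptions to check against the actual files. The example rows are hypothetical.

```python
import csv
from datetime import date
from io import StringIO


def earliest_per_state(handle, delimiter=","):
    """Map each State to the earliest Date seen in a dates file.

    Assumes columns named 'State' and 'Date', with ISO (YYYY-MM-DD) dates.
    """
    first = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        collected = date.fromisoformat(row["Date"])
        state = row["State"]
        if state not in first or collected < first[state]:
            first[state] = collected
    return first


# Hypothetical rows standing in for a file such as B.1.351dates.csv.
example = StringIO(
    "ID,State,Date\n"
    "seq1,South Africa,2020-12-03\n"
    "seq2,Kenya,2021-02-10\n"
    "seq3,South Africa,2020-10-15\n"
)
first_seen = earliest_per_state(example)
for state, day in sorted(first_seen.items()):
    print(state, day.isoformat())
```

For the tab-separated .txt versions, the same function can be called with delimiter="\t".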
Variants – contains PASTML information on all variants
- PASTMLallDates.csv – information on Accession, Release Date, Pangolin, Length, Geo Location, and Collection Date of all sequences used in this study
- Sequences_gisaid.fasta – all sequences from GISAID database used in this study
- Sequences_ncbi.fasta – all sequences from NCBI database used in this study
- Sequence_ref.fasta – reference sequence from NCBI database
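The metadata in PASTMLallDates.csv can be summarised by lineage with a few lines of Python. This is a minimal sketch, assuming the header row uses exactly the field names listed above (in particular a column named Pangolin). The accessions and values in the example are hypothetical.

```python
import csv
from collections import Counter
from io import StringIO


def lineage_counts(handle, column="Pangolin"):
    """Count how many sequences are assigned to each lineage."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader if row.get(column))


# Hypothetical rows using the column names given in the file description.
example = StringIO(
    "Accession,Release Date,Pangolin,Length,Geo Location,Collection Date\n"
    "X0001,2021-03-01,B.1.351,29800,South Africa,2021-01-15\n"
    "X0002,2021-03-02,B.1.351,29790,Kenya,2021-02-01\n"
    "X0003,2021-04-10,A.23.1,29850,Uganda,2021-01-20\n"
)
counts = lineage_counts(example)
print("distinct lineages:", len(counts))
for lineage, n in counts.most_common():
    print(lineage, n)
```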
Pastml_A.23.1.zip – all information generated with PASTML
- Colours.character_states.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-state.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-state.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_state.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_dedupA.23.1.fasta_I5UVQqr.nexus – contains the tree in nexus format
- named.tree_dedupA.23.1.fasta_I5UVQqr.nwk – contains the tree in nwk format
- params.character_state.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
- pastml_full_tree_visualisation.html – contains the visualization of the full tree
pastml_all.zip
- Colours.character_Geo_location.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-Geo_location.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-Geo_location.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_Geo_location.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_newick_format_tree_O932IMJ.nexus – contains the tree in nexus format
- named.tree_newick_format_tree_O932IMJ.nwk – contains the tree in nwk format
- params.character_Geo_location.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
pastml_B.1.1.529.zip
- Colours.character_states.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-state.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-state.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_state.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_dedupB.1.1.529.fasta_ps1F5W0.nexus – contains the tree in nexus format
- named.tree_dedupB.1.1.529.fasta_ps1F5W0.nwk – contains the tree in nwk format
- params.character_state.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
- pastml_full_tree_visualisation.html – contains the visualization of the full tree
pastml_B.1.640.zip
- Colours.character_states.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-state.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-state.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_state.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_dedupB.1.640mafft.fasta_xuZ6Isj.nexus – contains the tree in nexus format
- named.tree_dedupB.1.640mafft.fasta_xuZ6Isj.nwk – contains the tree in nwk format
- params.character_state.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
- pastml_full_tree_visualisation.html – contains the visualization of the full tree
pastml_B.1351.2.zip
- Colours.character_states.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-state.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-state.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_state.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_dedupB.1.351mafft.fasta_FaWRLNq.nexus – contains the tree in nexus format
- named.tree_dedupB.1.351mafft.fasta_FaWRLNq.nwk – contains the tree in nwk format
- params.character_state.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
- pastml_full_tree_visualisation.html – contains the visualization of the full tree
pastml_B.1525.zip
- Colours.character_Geo_location.tab – it shows states and the corresponding colour
- Combined_ancestral_states.tab – it shows nodes and corresponding states
- iTOL_colorstrip-Geo_location.txt – it shows node ids with their corresponding color, labels, shapes and title of state
- iTOL_popup_info.txt – shows ancestral character reconstruction (ACR) results
- iTOL_style-Geo_location.txt – shows node id with corresponding type, node, color size factor and label
- iTOL_tree_id.txt – contains the tree id number
- iTOL_url.txt – contains the URL where the tree can be viewed online
- marginal_probabilities.character_Geo_location.model_F81.tab – contains the marginal probabilities of each state at each node under the F81 model
- named.tree_mafftVOCs.fasta.nexus – contains the tree in nexus format
- named.tree_mafftVOCs.fasta.nwk – contains the tree in nwk format
- params.character_Geo_location.method_MPPA.model_F81.tab – contains the parameters of the reconstruction (MPPA method, F81 model)
- pastml_compressed_visualisation.html – contains the visualization of the compressed tree
- pastml_full_tree_visualisation.html – contains the visualization of the full tree
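The marginal_probabilities .tab files in the PASTML archives above can be read as plain tab-separated tables. The following is a minimal sketch, assuming the first column holds the node name and each remaining column holds the probability of one state. It reports only the single most probable state per node, which is a simplification and not the MPPA method itself (MPPA may retain several states for a node). The node names, states, and probabilities in the example are hypothetical.

```python
import csv
from io import StringIO


def most_probable_states(handle):
    """Return {node: (state, probability)} for the highest-probability state."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    states = header[1:]  # first column is assumed to be the node name
    result = {}
    for row in reader:
        if not row:
            continue
        probs = [float(x) for x in row[1:]]
        best = max(range(len(states)), key=probs.__getitem__)
        result[row[0]] = (states[best], probs[best])
    return result


# Hypothetical table standing in for a marginal_probabilities .tab file.
example = StringIO(
    "node\tKenya\tNigeria\tSouth Africa\n"
    "root\t0.10\t0.15\t0.75\n"
    "n1\t0.60\t0.30\t0.10\n"
)
best_states = most_probable_states(example)
for node, (state, prob) in best_states.items():
    print(node, state, prob)
```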
Sharing/Access information
Data was derived from the following sources: the NCBI and GISAID databases.
Code/Software
GOALIGN DEDUP (to remove duplicates)
goalign dedup -i 'name of fasta file' > 'name of output fasta file without duplicates'
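For readers without goalign installed, the idea behind the deduplication step can be sketched in Python. This is an illustration of the concept only, assuming that "duplicate" means an identical sequence string and that the first record with a given sequence is kept; goalign's exact behaviour (for example which name it keeps) is not confirmed here. The function names and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dedup(records):
    """Keep the first record for each distinct sequence string."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


# Hypothetical alignment: s1 and s2 carry the same sequence.
example = ">s1\nACGT\nACGT\n>s2\nACGTACGT\n>s3\nACGA\n"
unique = dedup(parse_fasta(example))
print([header for header, _ in unique])
```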
IQ-TREE v1.6.12
iqtree -s NameOfFastaFile -m GTR+G
Methods
Dataset mining and workflow
SARS-CoV-2 genome sequences collected in Africa were obtained from the NCBI and GISAID databases on February 26, 2023. In total, 24415 African sequences were retrieved from both databases to examine the number of lineages circulating within Africa. The two databases together held only 8044 complete genome sequences from Africa; these sequences, excluding those flagged as low coverage by NextClade, were retrieved to determine spread dynamics. 5908 sequences from 23 African countries were available in NCBI and 2137 sequences from 41 African countries in GISAID. The sequences were aligned using the online version of the MAFFT multiple sequence alignment tool, with Wuhan-Hu-1 (MN908947.3) as the reference sequence, and sequences with more than 5.0% ambiguous letters were removed. Duplicates were removed using goalign dedup, leaving only high-quality complete African sequences (n=7540).
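The 5.0% ambiguity filter described above can be sketched as follows. The text does not say how ambiguity was counted, so this sketch assumes that any character other than A, C, G, or T counts as ambiguous and that alignment gaps ("-") are excluded from the denominator. The function names and the example sequences are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")


def ambiguous_fraction(sequence):
    """Fraction of non-gap characters that are not A, C, G, or T."""
    residues = [c for c in sequence.upper() if c != "-"]
    if not residues:
        return 1.0  # treat an all-gap sequence as fully ambiguous
    ambiguous = sum(1 for c in residues if c not in UNAMBIGUOUS)
    return ambiguous / len(residues)


def filter_ambiguous(records, max_fraction=0.05):
    """Keep records whose ambiguous fraction does not exceed max_fraction."""
    return [(name, seq) for name, seq in records
            if ambiguous_fraction(seq) <= max_fraction]


# Hypothetical 40-base sequences: 0, 1, and 4 ambiguous characters.
records = [
    ("clean", "ACGT" * 10),
    ("one_n", "ACGT" * 9 + "ACGN"),
    ("noisy", "ACGT" * 9 + "NNNN"),
]
kept = filter_ambiguous(records)
print([name for name, _ in kept])
```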
Phylogenetic reconstruction
Phylogenetic reconstruction of the dataset was performed multiple times using IQ-TREE multicore version 1.6.12 and NextClade.
Lineage classification
PANGOLin, a web application, was used to classify sequences into lineages. Because this naming system integrates genetic and geographic data on SARS-CoV-2 dynamics, the objective was to determine which SARS-CoV-2 lineages circulating in Africa are most important from an epidemiological perspective, as well as the lineage dynamics within and across the African continent.
Phylogeographic reconstruction
Variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM) were designated based on the WHO framework as of 20 January 2022. We included one additional lineage, A.23.1, and labelled it as a VOI for the purposes of this analysis. This lineage was included because it demonstrated the continued evolution of African lineages into potentially more transmissible variants. VOI, VOC, and VUM that emerged on the African continent were marked. These were A.23.1 (VOI), B.1.351 and B.1.1.529 (VOC), and B.1.640 and B.1.525 (VUM). Genome sequences of these five lineages were extracted from the NCBI database for phylogeographic reconstruction. A similar approach to that described above (including alignment using online MAFFT) was employed. Phylogeographic reconstruction for all variants circulating in Africa and for all VOI, VOC, and VUM was conducted using PASTML.