Recombination is an important driver of genetic diversity, though it is relatively uncommon in hepatitis C virus (HCV). Recent investigation of sequence data acquired from HCV clinical trials produced 21 full-genome recombinant viruses belonging to three putative inter-subtype forms: 2b/1a, 2b/1b, and 2k/1b. The 2k/1b chimera is the only known HCV circulating recombinant form (CRF), provoking interest in its genetic structure and origin. Discovered in Russia in 1999, 2k/1b cases have since been detected throughout the former Soviet Union, Western Europe, and North America. Although 2k/1b prevalence is highest in the Caucasus mountain region (i.e., Armenia, Azerbaijan, and Georgia), the origin and migration patterns of CRF 2k/1b have remained obscure due to a paucity of available sequences. We assembled an alignment that spans the entire coding region of the HCV genome and contains all available 2k/1b sequences (>500 nucleotides; n=109) sampled in 19 countries. The sequences come from public databases (102 individuals), additional newly sequenced genomic regions (from 48 of these 102 individuals), unpublished isolates with newly sequenced regions (5 additional individuals), and novel complete genomes (2 additional individuals) generated in this study. Analysis of this expanded dataset reconfirmed the monophyletic origin of 2k/1b with a recombination breakpoint at position 3,187 (95% confidence interval: 3,172–3,202; HCV GT1a reference strain H77). Phylogeography is a valuable tool used to reveal viral migration dynamics. Inference of the timed history of spread in a Bayesian framework identified Russia as the ancestral source of the CRF 2k/1b clade. Further, we found evidence for migration routes leading out of Russia to other former Soviet Republics or countries under the Soviet sphere of influence. These findings suggest an interplay between geopolitics and the historical spread of CRF 2k/1b.
2k/1b full coding region sequence alignment
Contains 109 sequences of Hepatitis C virus circulating recombinant form 2k/1b spanning the entire coding region. Aligned with MUSCLE v2.0 in AliView and edited manually.
2k1b_109_fullCodingRegion.fasta
2k/1b densely sampled coding region sequence alignment
Contains 109 sequences of Hepatitis C virus circulating recombinant form 2k/1b spanning the most densely sampled coding regions (i.e., regions covered by at least 25 sequences): Core/E1, NS2/NS3, NS4B, and NS5A/NS5B, all concatenated. Aligned with MUSCLE v2.0 in AliView and edited manually.
2k1b_109_denselySampled_concatenatedCoreE1NS23NS4BNS5AB.fasta
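The "at least 25 sequences" criterion for the densely sampled regions can be illustrated with a short Python sketch. This is not the procedure used to build the file above. The region coordinates, the toy alignment, and the rule that a sequence "covers" a region when it has at least one informative base there are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def region_coverage(
    alignment: Dict[str, str],
    regions: Dict[str, Tuple[int, int]],
    missing: str = "-?Nn",
) -> Dict[str, int]:
    """Count, per region, the sequences with at least one informative base.

    Region bounds are 0-based, end-exclusive alignment columns (an assumption).
    """
    counts: Dict[str, int] = {}
    for label, (start, end) in regions.items():
        counts[label] = sum(
            any(base not in missing for base in seq[start:end])
            for seq in alignment.values()
        )
    return counts


def dense_regions(counts: Dict[str, int], min_sequences: int = 25) -> List[str]:
    """Return the labels of regions covered by at least min_sequences sequences."""
    return [label for label, n in counts.items() if n >= min_sequences]


# Toy alignment with hypothetical region names and coordinates.
toy = ">s1\nACGT----\n>s2\nACGTACGT\n>s3\n----NNNN\n"
toy_regions = {"regionA": (0, 4), "regionB": (4, 8)}
toy_counts = region_coverage(parse_fasta(toy), toy_regions)
# toy_counts == {"regionA": 2, "regionB": 1}
# dense_regions(toy_counts, min_sequences=2) == ["regionA"]
```

For the real alignment, the region boundaries would need to be taken from the HCV genome annotation mapped onto the alignment columns, which this sketch does not attempt.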
Caucasus sensitivity analysis MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic Caucasus sensitivity analysis. The Caucasus mountain countries Armenia, Azerbaijan, and Georgia are combined into a single geographic location.
2k1b_FullCOO_Caucasus11.MCC.tree
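The location recoding behind the Caucasus sensitivity analysis can be sketched in a few lines of Python. The taxon names and the merged label "Caucasus" are hypothetical, and the sketch only shows the idea of collapsing three countries into one discrete state. It does not reproduce how the trait was encoded in the BEAST XML.

```python
from typing import Dict

# Countries merged into a single location in the sensitivity analysis.
CAUCASUS = {"Armenia", "Azerbaijan", "Georgia"}


def recode_location(country: str, merged_label: str = "Caucasus") -> str:
    """Map a Caucasus country to the merged label; leave other countries unchanged."""
    return merged_label if country in CAUCASUS else country


def recode_traits(traits: Dict[str, str]) -> Dict[str, str]:
    """Apply the recoding to a {taxon: country} trait table."""
    return {taxon: recode_location(country) for taxon, country in traits.items()}


# Hypothetical taxon names, for illustration only.
example_traits = {
    "isolate_1": "Armenia",
    "isolate_2": "Russia",
    "isolate_3": "Georgia",
}
recoded = recode_traits(example_traits)
# recoded == {"isolate_1": "Caucasus", "isolate_2": "Russia", "isolate_3": "Caucasus"}
```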
2k/1b dense genome 1 GTRΓ4, 1 clock country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from densely sampled genome analysis with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_denseGenome_1GTR1clock_COO_BSSVSMJ.MCC.tree
2k/1b dense genome 4 GTRΓ4, 1 clock country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from densely sampled genome analysis with four partitioned GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
2k1b_denseGenome_4GTR1clock_COO_BSSVSMJ.MCC.tree
2k/1b dense genome 4 GTRΓ4, 4 clocks country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from densely sampled genome analysis with four partitioned GTRΓ4 substitution models and four uncorrelated relaxed lognormal clocks.
2k1b_denseGenome_4GTR4clocks_COO_BSSVSMJ.MCC.tree
2k/1b full genome 1 GTRΓ4, 1 clock country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from full coding region analysis with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_fullGenome_1GTR1clock_COO_BSSVSMJ.MCC.tree
2k/1b full genome 9 GTRΓ4, 1 clock country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from full coding region analysis with nine partitioned GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
2k1b_fullGenome_9GTR1clock_COO_BSSVSMJ.MCC.tree
2k/1b full genome 9 GTRΓ4, 9 clocks country of origin MCC tree
Maximum clade credibility (MCC) tree from country of origin phylogeographic analysis. Performed on posterior distribution of trees from full coding region analysis with nine partitioned GTRΓ4 substitution models and nine uncorrelated relaxed lognormal clocks.
2k1b_fullGenome_9GTR9clocks_COO_BSSVSMJ.MCC.tree
Alignment containing HCV 1a full genomes and recombinant 2b/1a hemigenomes
All publicly available HCV 1a sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2b/1a hemigenomes (the 1a portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2b1a_1a_ext.fasta
Alignment containing HCV 2b full genomes and recombinant 2b/1a hemigenomes
All publicly available HCV 2b sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2b/1a hemigenomes (the 2b portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2b1a_2b_ext.fasta
Alignment containing HCV 1b full genomes and recombinant 2b/1b hemigenomes
All publicly available HCV 1b sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2b/1b hemigenomes (the 1b portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2b1b_1b_ext.fasta
Alignment containing HCV 2b full genomes and recombinant 2b/1b hemigenomes
All publicly available HCV 2b sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2b/1b hemigenomes (the 2b portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2b1b_2b_ext.fasta
Alignment containing HCV 1b full genomes and recombinant 2k/1b hemigenomes
All publicly available HCV 1b sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2k/1b hemigenomes (the 1b portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2k1b_1b_ext.fasta
Alignment containing HCV 2k full genomes and recombinant 2k/1b hemigenomes
All publicly available HCV 2k sequences were downloaded from GenBank and the Los Alamos National Laboratory (LANL) HCV database. These were aligned with the 2k/1b hemigenomes (the 2k portion of the coding region, separated at the breakpoint inferred by the Recombination Detection Program, RDP) using the fast option of MAFFT version 7 in AliView, then edited manually. Recombinant genomes were acquired from this study or these databases.
2k1b_2k_ext.fasta
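The hemigenome alignments listed above each hold one parental portion of a recombinant genome, cut at the inferred breakpoint. A minimal Python sketch of such a split follows. It assumes the breakpoint is given as a 1-based position marking the last nucleotide of the 5' portion, and that this position has already been mapped from reference numbering (for example H77) to the coordinates of the sequence being cut. Both assumptions are illustrative and are not taken from the original workflow.

```python
from typing import Tuple


def split_at_breakpoint(sequence: str, breakpoint: int) -> Tuple[str, str]:
    """Split a sequence into its 5' and 3' hemigenomes.

    breakpoint is 1-based and is treated as the last position of the 5' part
    (an assumption; the source does not state which side the position falls on).
    """
    if not 0 <= breakpoint <= len(sequence):
        raise ValueError("breakpoint lies outside the sequence")
    return sequence[:breakpoint], sequence[breakpoint:]


# Toy sequence, not real HCV data: the first 4 bases stand in for the
# 5' parental portion and the last 4 for the 3' parental portion.
five_prime, three_prime = split_at_breakpoint("AAAACCCC", 4)
# five_prime == "AAAA", three_prime == "CCCC"
```

In a recombinant named 2k/1b, the 5' part would correspond to the 2k portion and the 3' part to the 1b portion.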
2b/1a and 1a IQ-Tree
Midpoint-rooted maximum likelihood (ML) phylogeny produced in IQ-Tree from a FASTA file containing all publicly available HCV 1a sequences and recombinant 2b/1a hemigenomes. All non-recombinant clades comprising 4 or more taxa are collapsed. Bootstrap support values are available as node labels.
2b1a_1a_ext_IQTree_midpoint.tre
2b/1a, 2b/1b and 2b IQ-Tree
Midpoint-rooted maximum likelihood (ML) phylogeny produced in IQ-Tree from a FASTA file containing all publicly available HCV 2b sequences and recombinant 2b/1a and 2b/1b hemigenomes. All non-recombinant clades comprising 4 or more taxa are collapsed. Bootstrap support values are available as node labels.
2b1a-2b1b_2b_IQTree_midpoint.tre
2k/1b and 2k IQ-Tree
Midpoint-rooted maximum likelihood (ML) phylogeny produced in IQ-Tree from a FASTA file containing all publicly available HCV 2k sequences and recombinant 2k/1b hemigenomes. All non-recombinant clades comprising 4 or more taxa are collapsed. Bootstrap support values are available as node labels.
2k1b_2k_ext_IQTree_midpoint.tre
2k/1b, 2b/1b and 1b IQ-Tree
Midpoint-rooted maximum likelihood (ML) phylogeny produced in IQ-Tree from a FASTA file containing all publicly available HCV 1b sequences and recombinant 2k/1b and 2b/1b hemigenomes. All non-recombinant clades comprising 4 or more taxa are collapsed. Bootstrap support values are available as node labels.
2k1b-2b1b_1b_IQTree_midpoint.tre
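Since the ML trees above carry bootstrap support as node labels, the values can be pulled out of a Newick string with a short Python sketch. It assumes the files are plain Newick with purely numeric internal node labels and no quoted taxon names containing parentheses. The actual label format in these files has not been checked, and the example tree is made up.

```python
import re
from typing import List

# An internal node label follows a closing parenthesis and is itself followed
# by a branch length (":"), a sibling (","), a parent close (")"), or ";".
_SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?)(?=[:,);])")


def support_values(newick: str) -> List[float]:
    """Return numeric internal node labels in the order they appear."""
    return [float(label) for label in _SUPPORT_PATTERN.findall(newick)]


# Made-up four-taxon tree with two labelled internal nodes.
example_tree = "((A:0.1,B:0.2)95:0.05,(C:0.3,D:0.4)71:0.02);"
values = support_values(example_tree)
# values == [95.0, 71.0]
```

For anything beyond a quick check, a dedicated tree library would be a safer choice than a regular expression.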
2k/1b Caucasus sensitivity analysis country of origin XML
XML for BEAST analysis on 2k/1b country of origin containing Armenia, Azerbaijan, and Georgia as a single geographic location. Performed on posterior distribution of trees from full coding region analysis with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_FullCOO_Caucasus11.xml
2k/1b dense genome 1 GTRΓ4, 1 clock country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from densely sampled region analysis with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_denseGenome_1GTR1clock_COO_BSSVS_MarkovJumps.xml
2k/1b dense genome 4 GTRΓ4, 1 clock country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from densely sampled region analysis with four partitioned GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
2k1b_denseGenome_4GTR1clock_COO_BSSVS_MarkovJumps.xml
2k/1b dense genome 4 GTRΓ4, 4 clocks country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from densely sampled region analysis with four partitioned GTRΓ4 substitution models and four uncorrelated relaxed lognormal clocks.
2k1b_denseGenome_4GTR4clocks_COO_BSSVS_MarkovJumps.xml
2k/1b full genome 1 GTRΓ4, 1 clock country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from full coding region analysis with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_fullGenome_1GTR1clock_COO_BSSVS_MarkovJumps.xml
2k/1b full genome 9 GTRΓ4, 1 clock country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from full coding region analysis with nine partitioned GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
2k1b_fullGenome_9GTR1clock_COO_BSSVS_MarkovJumps.xml
2k/1b full genome 9 GTRΓ4, 9 clocks country of origin XML
XML for BEAST analysis on 2k/1b country of origin on posterior distribution of trees from full coding region analysis with nine partitioned GTRΓ4 substitution models and nine uncorrelated relaxed lognormal clocks.
2k1b_fullGenome_9GTR9clocks_COO_BSSVS_MarkovJumps.xml
2k/1b full coding region 9 GTRΓ4, 1 clock XML
XML for BEAST analysis on HCV 2k/1b full coding region with nine GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
2k1b_FullGenome_9GTR1clock.xml
2k/1b full coding region 9 GTRΓ4, 9 clocks XML
XML for BEAST analysis on HCV 2k/1b full coding region with nine GTRΓ4 substitution models and nine uncorrelated relaxed lognormal clocks.
2k1b_FullGenome_9GTR9clocks.xml
2k/1b full coding region 1 GTRΓ4, 1 clock XML
XML for BEAST analysis on HCV 2k/1b full coding region with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
2k1b_Genome_skyride_109.xml
2k/1b densely sampled coding region 4 GTRΓ4, 1 clock XML
XML for BEAST analysis on HCV 2k/1b densely sampled coding region with four GTRΓ4 substitution models and a single uncorrelated relaxed lognormal clock.
Skyride_denselySampled_4GTR1clock.xml
2k/1b densely sampled coding region 4 GTRΓ4, 4 clocks XML
XML for BEAST analysis on HCV 2k/1b densely sampled coding region with four GTRΓ4 substitution models and four uncorrelated relaxed lognormal clocks.
Skyride_denselySampled_4GTR4clocks.xml
2k/1b densely sampled coding region 1 GTRΓ4, 1 clock XML
XML for BEAST analysis on HCV 2k/1b densely sampled coding region with a single GTRΓ4 substitution model and a single uncorrelated relaxed lognormal clock.
Skyride_denselySampled_concatenatedCoreE1NS23NS4BNS5AB.xml