Skip to main content
Dryad logo

Data from: Construction of a virtual Mycobacterium tuberculosis consensus genome and its application to data from a next generation sequencer

Citation

Okumura, Kayo; Kirikae, Teruo; Kayano, Mitsunori; Miyoshi-Akiyama, Tohru (2015), Data from: Construction of a virtual Mycobacterium tuberculosis consensus genome and its application to data from a next generation sequencer, Dryad, Dataset, https://doi.org/10.5061/dryad.nq070

Abstract

Background: Although Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results. Those analyses are essentially based genome sequence information of M. tuberculosis and could be performed in sillico in theory, with whole genome sequence (WGS) data available in the databases and obtained by next generation sequencers (NGSs). As an approach to establish higher resolution methods for such analyses, whole genome sequences of the M. tuberculosis complexes (MTBCs) strains available on databases were aligned to construct virtual reference genome sequences called the consensus sequence (CS), and evaluated its feasibility in in sillico epidemiological analyses. Results: The consensus sequence (CS) was successfully constructed and utilized to perform phylogenetic analysis, evaluation of read mapping efficacy, which is crucial for detecting single nucleotide polymorphisms (SNPs), and various MTBC typing methods virtually including spoligotyping, VNTR, Long sequence polymorphism and Beijing typing. SNPs detected based on CS, in comparison with H37Rv, were utilized in concatemer-based phylogenetic analysis to determine their reliability relative to a phylogenetic tree based on whole genome alignment as the gold standard. Statistical comparison of phylogenic trees based on CS with that of H37Rv indicated the former showed always better results that that of later. SNP detection and concatenation with CS was advantageous because the frequency of crucial SNPs distinguishing among strain lineages was higher than those of H37Rv. The number of SNPs detected was lower with the consensus than with the H37Rv sequence, resulting in a significant reduction in computational time. Performance of each virtual typing was satisfactory and accorded with those published when those are available. Conclusions: These results indicated that virtual CS constructed from genome sequence data is an ideal approach as a reference for MTBC studies.

Usage Notes