Motivation: 16S rDNA hypervariable tag sequencing has become the de facto method for assessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze the data from each read separately and underutilize the information contained in the paired-end reads. Results: We created a workflow known as Illinois Mayo Taxon Organization from RNA Dataset Operations (IM-TORNADO) for processing non-overlapping reads while retaining maximal information content. Using synthetic mock datasets, we show that using both reads produces results that correlate more strongly with those from full-length 16S rDNA in terms of taxonomy, phylogeny, and beta diversity. Availability and Implementation: IM-TORNADO is freely available at http://sourceforge.net/projects/imtornado and produces BIOM format output for cross-compatibility with other pipelines such as QIIME, mothur, and phyloseq.
Files for taxonomy comparison
Synthetic mock dataset used for taxonomy assignment comparison, across multiple read lengths. File includes scripts used and raw validation data.
comparison_taxonomy.tar.bz2
Files for comparison of sequence aligners
Synthetic mock datasets used to compare the multiple sequence aligners cmalign (from the Infernal package) and the NAST algorithm (as implemented in PyNAST). File includes R scripts and resulting raw validation data.
comparison_aligners.tar.bz2
Files for validation with realistic synthetic reads
Synthetic mock datasets used for validation, based on a realistic human stool microbiome dataset. These are NOT real bacterial reads; the real reads are available as example data from the IM-TORNADO pipeline. File includes scripts and raw validation data.
validation_realistic.tar.bz2
Files for taxonomy comparison using reads with errors
Synthetic mock datasets (100 replicates) used for validation of taxonomy using reads with errors and different lengths. File also includes scripts and resulting raw data for the validation.
validation_taxonomy_errors.tar.bz2
Files for validation of beta diversity using 16S rDNA regions V3 to V5
Synthetic mock datasets (100 replicates) used for validation of beta diversity across synthetic communities using 16S rDNA regions V3 to V5. File also includes scripts and resulting raw data for the validation.
validation_V3V5.tar.bz2
Files for validation of beta diversity using 16S rDNA regions V6 to V9
Synthetic mock datasets (100 replicates) used for validation of beta diversity across synthetic communities using 16S rDNA regions V6 to V9. File also includes scripts and resulting raw data for the validation.
validation_V6V9.tar.bz2
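The archives listed above are bzip2-compressed tar files. As a minimal sketch of one way to unpack them, the following uses only the Python standard library. The helper name extract_archive and the choice of output directory names are assumptions made for illustration; they are not part of IM-TORNADO or of the supplementary material, and the internal layout of each archive is not assumed here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a bzip2-compressed tar archive and return its member names.

    Hypothetical helper for illustration; not part of IM-TORNADO.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    dest_root = dest.resolve()
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        members = archive.getmembers()
        for member in members:
            # Refuse entries that would be written outside dest_dir.
            target = (dest_root / member.name).resolve()
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest_root)
        return [member.name for member in members]


if __name__ == "__main__":
    # Archive names as given in the listing above; any that have not been
    # downloaded into the current directory are skipped.
    archive_names = [
        "comparison_taxonomy.tar.bz2",
        "comparison_aligners.tar.bz2",
        "validation_realistic.tar.bz2",
        "validation_taxonomy_errors.tar.bz2",
        "validation_V3V5.tar.bz2",
        "validation_V6V9.tar.bz2",
    ]
    for name in archive_names:
        if Path(name).exists():
            # Output directory is the archive name without ".tar.bz2".
            out_dir = name[: -len(".tar.bz2")]
            contents = extract_archive(name, out_dir)
            print(f"{name}: extracted {len(contents)} entries into {out_dir}/")
```

Each archive is unpacked into its own directory so that scripts and raw data from different validations do not overwrite one another. The path check is a precaution that is reasonable for any downloaded archive.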