Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies
Data files
Dec 07, 2023 version files 4.67 GB
- 2021-04_UW_M027_config.yaml
- 2021-04_UW_M027.fastq.gz
- consensus_M004.tar.gz
- consensus_M005.tar.gz
- consensus_M027.tar.gz
- consensus_M1567.tar.gz
- consensus_M2199.tar.gz
- M027_chunked_demux_config.yaml
- M027_demux.tar.gz
- README.md
- tagged_M004.tar.gz
- tagged_M005.tar.gz
- tagged_M027.tar.gz
- tagged_M1567.tar.gz
- tagged_M2199.tar.gz
Abstract
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
README
DW 6Dec2023
Optimized SMRT-UMI protocol produces highly accurate sequence datasets
from diverse populations – application to HIV-1 quasispecies
These files support the analysis of dUMI samples described in Westfall et al.
"Optimized SMRT-UMI protocol produces highly accurate sequence
datasets from diverse populations – application to HIV-1
quasispecies". This research identified optimized conditions to
prevent error during PCR and the supporting software PORPIDpipeline
for sequence generation and filtering.
These files are used as input for different pipelines and analyses.
Supplement Figure S4 in the manuscript shows the entire analysis
workflow and indicates which files are used as input for each step.
Included here is one dataset which can be run through the two pipelines (M027)
as well as the outputs from the sUMI_dUMI_comparison pipeline from the other four
datasets:
2021-04_UW_M027.fastq.gz
-CCS fastq.gz file from PacBio sequencing from M027 dataset
-Used as input to chunked_demux pipeline, which demultiplexes based on index PCR primer sequences
-(https://github.com/MullinsLab/chunked_demux.git)
M027_chunked_demux_config.yaml
-Config file used to run chunked_demux pipeline for dataset M027
M027_demux.tar.gz
-Compressed fastq collection of demultiplexed reads output from chunked_demux pipeline from dataset M027
-Fastq files used as input to sUMI_dUMI_comparison pipeline
-(https://github.com/MullinsLab/sUMI_dUMI_comparison.git)
2021-04_UW_M027_config.yaml
-Config file used to run sUMI_dUMI_comparison pipeline for demultiplexed fastq files from dataset M027
consensus.tar.gz files for each dataset (M027, M1567, M2199, M004, M005)
-Compressed fasta consensus sequence files output from sUMI_dUMI_comparison pipeline
-Used as input for R script Sequence_Analysis.Rmd
tagged.tar.gz files for each dataset (M027, M1567, M2199, M004, M005)
-Compressed read collections and data tables created by sUMI_dUMI_comparison pipeline
-Some read collections used to generate alignments using Geneious software
-If the samples were indexed, at the top level are folders containing the samples labeled
with each Index primer combination. Otherwise each sample is listed.
-These correspond to the read collections output from chunked_demux pipeline.
-Within each sample's folders are two csv files and four additional folders
  -{sample}_family_tags.csv
    -table summarizing each sUMI family, the family size, and what tag it was given
    -only families with likely_real tag are carried forward for dUMI analysis
  -{sample}_dUMI_ranked.csv
    -table summarizing the UMI1 and UMI2 sequences found in each likely_real sUMI family and their prevalence
    -Used as input for R script Identifying_Recombinant_Reads.Rmd
  -UMI1 folder
    -Read collections from all UMI1 families found in the sample
    -Summarized in {sample}_family_tags.csv
  -UMI1_keeping folder
    -Read collections from all likely_real UMI families. These are deemed real and carried forward
  -UMI2 folder
    -Read collections from all UMI2 families found in the likely_real UMI1 families
  -dUMI folder
    -Read collections from each dUMI family
    -Summarized in {sample}_dUMI_ranked.csv
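The filtering rule described above (only families tagged likely_real are carried forward) can be sketched in Python as follows. This is an illustration, not the pipeline's own code. The column name "tag" and the toy rows are assumptions; check the header of the real {sample}_family_tags.csv before using it.

```python
import csv
import io


def likely_real_families(csv_text, tag_column="tag", keep="likely_real"):
    """Return the rows of a family_tags table whose tag equals `keep`.

    `tag_column` is a hypothetical header name chosen for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if tag_column not in (reader.fieldnames or []):
        raise KeyError(f"column {tag_column!r} not found in {reader.fieldnames}")
    return [row for row in reader if row[tag_column] == keep]


# Toy table with made-up columns and values, for illustration only.
example = "UMI,family_size,tag\nAAAA,12,likely_real\nCCCC,1,other\n"
kept = likely_real_families(example)
print(f"{len(kept)} of 2 toy families kept")
```

To apply it to a real file, read the file into a string first and pass the actual tag column name as tag_column.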
Both pipelines are hosted on GitHub; all R code and supplemental info described below are uploaded to Zenodo.
Three R workbooks and one Python script were used to analyse the outputs from each
dataset from the sUMI_dUMI_comparison pipeline (https://github.com/MullinsLab/sUMI_dUMI_comparison.git).
The output files tagged.tar.gz and consensus.tar.gz for each dataset are stored in the Dryad repository.
Included here:
Sequence_Analysis.Rmd
-Workbook documenting the R code used to combine the sequences from each dataset and create
specific sequence collections and data tables summarizing sequencing results
from sUMI and dUMI consensus sequences
-Uses the output from sUMI_dUMI_comparison pipeline (consensus.tar.gz) for each dataset as input
-Outputs:
-sequence_counts.csv
-read_counts.csv
-pass07filter.fasta
-purified.fasta
-purified_pass07filter.fasta
-less100templates.fasta
-less100templates_pass07filter.fasta
-less100templates_purified.fasta
-less100templates_purified_pass07filter.fasta
-dUMI_rank1.fasta
-sUMI_rank1.fasta
compare_seqs.py
-Python script which compares two sequence sets and notes any sequence differences between
sequences with the same UMI. Used to compare sUMI and dUMI methods.
-Uses dUMI_rank1.fasta and sUMI_rank1.fasta from Sequence_Analysis.Rmd as input.
-This program compares sequences derived from sUMI and dUMI datasets. To run it
you need to install BLAST+ executables from NCBI
(https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and set
your computer's PATH system variable to the location of the downloaded
BLAST+ executables.
-Output:
-discordant_sequences.csv
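Before running compare_seqs.py, it can help to confirm that the BLAST+ executables are visible on PATH. The sketch below uses Python's standard shutil.which. The README does not state which executables compare_seqs.py calls, so blastn and makeblastdb are assumed examples.

```python
import shutil


def find_executables(names=("blastn", "makeblastdb")):
    """Map each executable name to its full path, or None if not on PATH.

    The default names are assumed examples of BLAST+ tools, not a list
    taken from compare_seqs.py.
    """
    return {name: shutil.which(name) for name in names}


for name, path in find_executables().items():
    status = path if path else "NOT FOUND (check your PATH setting)"
    print(f"{name}: {status}")
```

If a tool is reported as not found, add the BLAST+ bin directory to PATH and open a new terminal before retrying.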
Identifying_Recombinant_Reads.Rmd
-Workbook documenting the R code used to combine the csv files from each dataset to
create a single output containing all UMI combinations and calculating Levenshtein distances for each
-Uses the output from sUMI_dUMI_comparison pipeline (tagged.tar.gz) for each dataset and the data tables
from Sequence_Analysis.Rmd as input
-Output:
-dUMI_df.csv
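For readers unfamiliar with the Levenshtein distance used in this workbook, it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The workbook itself is written in R; the Python sketch below is a standard dynamic-programming version for illustration, and the example UMI strings are made up.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit cost per edit)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Two made-up UMIs that differ at one position.
print(levenshtein("ACGTACGT", "ACGTACGA"))
```

A small distance between two UMIs in the same sample suggests that one may be an error-derived copy of the other rather than an independent template.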
Figures.Rmd
-Workbook documenting the R code used to create figure drafts and individual csv datasets
for each figure. These were loaded to Prism software to generate the final figures.
-Uses dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input
-Output:
-Fig2A - family sizes.csv
-Fig2B - levenstein distance read counts.csv
-Fig2C - recombinant reads by sample prep.csv
-Fig2D - families below 30 percent recombinant.csv
-Fig2E - families passing 0.7 filter.csv
-Fig2F - reads passing all filters.csv
-plot_drafts folder with R plots as drafts for figures
Three supplemental files were created as part of the analysis
Sample_Info_Table.csv
-This file contains information about the preparation of each sample in our experiments
-Used as input for Sequence_Analysis.Rmd, Identifying_Recombinant_Reads.Rmd, and Figures.Rmd
Sequence Alignments.geneious
-Collection of alignments of reads from tagged.tar.gz
-Used to understand the cause of discordance between sUMI and dUMI consensus sequences
Table 2 - Sequence Discordance.xlsx
-Summary tables describing discordant sequences and error rates
-Arranged by hand from sequence_counts.csv, discordant_sequences.csv and Geneious read counts
-Precursor to Table 2 within manuscript
Further info about the overall analysis:
Five different PacBio sequencing datasets were used for this analysis:
M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from
PacBio sequencing files and the chunked_demux_config files were used
as input for the chunked_demux pipeline. Each config file lists the
different Index primers added during PCR to each sample. The pipeline
produces one fastq file for each Index primer combination in the
config. For example, in dataset M027 there were 3-4 samples using each
Index combination. The fastq files from each demultiplexed read set
were moved to the sUMI_dUMI_comparison pipeline fastq folder for
further demultiplexing by sample and consensus generation with that
pipeline. More information about the chunked_demux pipeline can be
found in the README.md file on github.
The demultiplexed read collections from the chunked_demux pipeline or
CCS read files from datasets which were not indexed (M1567, M004,
M005) were each used as input for the sUMI_dUMI_comparison pipeline
along with each dataset's config file. Each config file contains the
primer sequences for each sample (including the sample ID block in the
cDNA primer) and further demultiplexes the reads to prepare data
tables summarizing all of the UMI sequences and counts for each family
(tagged.tar.gz) as well as consensus sequences from each sUMI and rank
1 dUMI family (consensus.tar.gz). More information about the
sUMI_dUMI_comparison pipeline can be found in the paper and the
README.md file on github.
The consensus.tar.gz and tagged.tar.gz files were moved from
sUMI_dUMI_comparison pipeline directory on the server to the
Pipeline_Outputs folder in this analysis directory for each dataset
and appended with the dataset name (e.g. consensus_M027.tar.gz). Also
in this analysis directory is a Sample_Info_Table.csv containing
information about how each of the samples was prepared, such as
purification methods and number of PCRs. There are also three other
folders: Sequence_Analysis, Identifying_Recombinant_Reads, and
Figures. Each has an .Rmd file with the same name inside which is
used to collect, summarize, and analyze the data. All of these
collections of code were written and executed in RStudio to track
notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the
consensus.tar.gz files, combine them, and create two fasta files, one
with all sUMI and one with all dUMI sequences. Using these as input,
two data tables were created that summarize all sequences and read
counts for each sample that pass various criteria. These are used to
help create Table 2 and as input for Identifying_Recombinant_Reads.Rmd
and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI
sequences and the matching sUMI sequences were created. These were used
as input for the Python
script compare_seqs.py which identifies any matched sequences that are
different between sUMI and dUMI read collections. This information was
also used to help create Table 2. Finally, to populate the table with
the number of sequences and bases in each sequence subset of interest,
different sequence collections were saved and viewed in the Geneious
program.
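The first step described above (decompressing the consensus archives and combining their sequences) is done in R in Sequence_Analysis.Rmd. As a rough Python sketch of the same idea, the function below concatenates every FASTA member found in a set of .tar.gz archives. The ".fasta" and ".fa" suffixes and the output file name are assumptions. It does not separate sUMI from dUMI sequences, because the naming that distinguishes them is not described here.

```python
import tarfile

FASTA_SUFFIXES = (".fasta", ".fa")  # assumed file extensions


def combine_fasta_from_archives(archives, out_path):
    """Write all FASTA records from the archives to out_path.

    Returns the number of records (header lines starting with '>').
    """
    n_records = 0
    with open(out_path, "w") as out:
        for archive in archives:
            with tarfile.open(archive, "r:gz") as tar:
                for member in tar.getmembers():
                    if not (member.isfile() and member.name.endswith(FASTA_SUFFIXES)):
                        continue
                    text = tar.extractfile(member).read().decode()
                    for line in text.splitlines():
                        if line.startswith(">"):
                            n_records += 1
                        out.write(line + "\n")
    return n_records


# Example call (file names from this repository; output name is hypothetical):
# combine_fasta_from_archives(
#     ["consensus_M027.tar.gz", "consensus_M004.tar.gz"], "all_consensus.fasta")
```

Reading members directly from the archive avoids leaving a large uncompressed copy on disk.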
To investigate the cause of sequences where the sUMI and dUMI sequences
do not match, tagged.tar.gz was decompressed and for each family with
discordant sUMI and dUMI sequences the reads from the UMI1_keeping
directory were aligned using Geneious. Reads from dUMI families failing
the 0.7 filter were also aligned in Geneious. The uncompressed tagged
folder was then removed to save space. These read collections contain
all of the reads in a UMI1 family and still include the UMI2 sequence.
By examining the alignment and specifically the UMI2 sequences, the site
of the discordance and its cause were identified for each family as
described in the paper. These alignments were saved as "Sequence
Alignments.geneious". The counts of how many families were the result of
PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file
from each sample was extracted from all of the tagged.tar.gz
files, combined and used as input to create a single dataset containing
all UMI information from all samples. This file dUMI_df.csv was used
as input for Figures.Rmd.
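The extraction-and-combine step above is performed in R by the workbook. A Python sketch of the same idea is shown below: it pulls every member ending in "_dUMI_ranked.csv" out of the tagged archives and stacks the rows. The added "source_file" column is a hypothetical convenience for tracing rows back to their sample. The real csv column names are not assumed, and rows are kept exactly as read.

```python
import csv
import io
import tarfile


def collect_dumi_ranked(archives, suffix="_dUMI_ranked.csv"):
    """Return all rows from matching csv members as a list of dicts."""
    rows = []
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getmembers():
                if not (member.isfile() and member.name.endswith(suffix)):
                    continue
                text = tar.extractfile(member).read().decode()
                for row in csv.DictReader(io.StringIO(text)):
                    row["source_file"] = member.name  # hypothetical extra column
                    rows.append(row)
    return rows


# Example call using archive names from this repository:
# rows = collect_dumi_ranked(["tagged_M027.tar.gz", "tagged_M2199.tar.gz"])
```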
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv
as input to create draft figures and then individual datasets for each
Figure. These were copied into Prism software to create the final
figures for the paper.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Identifying_Recombinant_Reads, and Figures. Each has an `.Rmd` file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
`Sequence_Analysis.Rmd` has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for `Identifying_Recombinant_Reads.Rmd` and `Figures.Rmd`. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using `Identifying_Recombinant_Reads.Rmd`, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
`Figures.Rmd` used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
Usage notes
Sequence analysis pipelines, R code, and supplemental info for these data are found at: