Next-generation sequencing (NGS) has revolutionized genetics and enabled the accurate identification of many genetic variants across many genomes. However, detection of biologically important low-frequency variants within genetically heterogeneous populations remains challenging, because they are difficult to distinguish from intrinsic NGS sequencing errors. Approaches to overcome these limitations are essential to detect rare mutations in large cohorts, viral or microbial populations, mitochondrial heteroplasmy, and other heterogeneous mixtures such as tumors. Modifications in library preparation can overcome some of these limitations, but are experimentally challenging and restricted to skilled biologists. This paper describes a novel quality filtering and base pruning pipeline, called Complex Heterogeneous Overlapped Paired-End Reads (CHOPER), designed to detect sequence variants in a complex population with high sequence similarity derived from All-Codon-Scanning (ACS) mutagenesis. A novel fast alignment algorithm, designed for the specified application, has O(n) time complexity. CHOPER was applied to a p53 cancer mutant reactivation study derived from ACS mutagenesis. Relative to error filtering based on Phred quality scores, CHOPER improved accuracy by about 13% while discarding only half as many bases. These results are a step toward extending the power of NGS to the analysis of genetically heterogeneous populations.
ACAGTG1_part1.fq
M237I_ACS sample/ forward reads/ part 1 (please join with ACAGTG1_part2.fq for a complete sample). This sample corresponds to ACS mutagenesis performed on the p53 core domain containing the cancer mutant M237I. No selective pressure for p53 activity was applied. This sample controls for, and allows analysis of, the diversity of the ACS library.
ACAGTG1_part2.fq
M237I_ACS sample/ forward reads/ part 2 (please join with ACAGTG1_part1.fq for a complete sample). This sample corresponds to ACS mutagenesis performed on the p53 core domain containing the cancer mutant M237I. No selective pressure for p53 activity was applied. This sample controls for, and allows analysis of, the diversity of the ACS library.
ACAGTG2_part1.fq
M237I_ACS sample/ reverse reads/ part 1 (please join with ACAGTG2_part2.fq for a complete sample). This sample corresponds to ACS mutagenesis performed on the p53 core domain containing the cancer mutant M237I. No selective pressure for p53 activity was applied. This sample controls for, and allows analysis of, the diversity of the ACS library.
ACAGTG2_part2.fq
M237I_ACS sample/ reverse reads/ part 2 (please join with ACAGTG2_part1.fq for a complete sample). This sample corresponds to ACS mutagenesis performed on the p53 core domain containing the cancer mutant M237I. No selective pressure for p53 activity was applied. This sample controls for, and allows analysis of, the diversity of the ACS library.
CAGATC1.fq
M237I_RESCUE sample/ forward reads. Identical to M237I_ACS sample except that transformants were selected for active p53 by culturing the cells in media lacking uracil, thus requiring active p53 for growth.
CAGATC2.fq
M237I_RESCUE sample/ reverse reads. Identical to M237I_ACS sample except that transformants were selected for active p53 by culturing the cells in media lacking uracil, thus requiring active p53 for growth.
TGACCA1_part1.fq
M237I sample/ forward reads/ part 1 (please join with TGACCA1_part2.fq for a complete sample). Represents the unprocessed p53 core domain that contains the cancer mutation M237I but no other introduced mutations. This sample was prepared to analyze and control the baseline error rates associated with the procedure, as well as the intrinsic NGS sequencing error rates.
TGACCA1_part2.fq
M237I sample/ forward reads/ part 2 (please join with TGACCA1_part1.fq for a complete sample). Represents the unprocessed p53 core domain that contains the cancer mutation M237I but no other introduced mutations. This sample was prepared to analyze and control the baseline error rates associated with the procedure, as well as the intrinsic NGS sequencing error rates.
TGACCA2_part1.fq
M237I sample/ reverse reads/ part 1 (please join with TGACCA2_part2.fq for a complete sample). Represents the unprocessed p53 core domain that contains the cancer mutation M237I but no other introduced mutations. This sample was prepared to analyze and control the baseline error rates associated with the procedure, as well as the intrinsic NGS sequencing error rates.
TGACCA2_part2.fq
M237I sample/ reverse reads/ part 2 (please join with TGACCA2_part1.fq for a complete sample). Represents the unprocessed p53 core domain that contains the cancer mutation M237I but no other introduced mutations. This sample was prepared to analyze and control the baseline error rates associated with the procedure, as well as the intrinsic NGS sequencing error rates.
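The split files listed above need to be rejoined before use. A minimal sketch of one way to do this in Python follows. It assumes that each part was split on a FASTQ record boundary and ends with a newline, so that plain concatenation in part order restores the original file; the function name join_parts and the output file name are hypothetical and not part of this dataset.

```python
import shutil


def join_parts(parts, output):
    """Concatenate FASTQ part files, in the order given, into one output file.

    Assumption: every part ends on a complete FASTQ record, so byte-level
    concatenation is sufficient and no record needs to be repaired.
    """
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the part into the output without loading it in memory.
                shutil.copyfileobj(src, out)
    return output


# Example call (input names from the listing, output name hypothetical):
# join_parts(["ACAGTG1_part1.fq", "ACAGTG1_part2.fq"], "ACAGTG1.fq")
```

The same call, with the corresponding part names, applies to the ACAGTG2, TGACCA1, and TGACCA2 files.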
m237i1
The gene sequence of the p53 core domain carrying the cancer mutation M237I.
choper
The CHOPER filtering implementation in Python.
codons
Code for counting codons after the CHOPER filters have been applied, for use in statistical analysis.
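To illustrate what codon counting involves, the following is a minimal sketch and not the code distributed in this file. It assumes that the filtered reads are already aligned to the start of the reference in the correct reading frame, and that pruned bases are masked as "N"; both are assumptions made for illustration, and the function name count_codons is hypothetical.

```python
from collections import Counter


def count_codons(reads, frame_offset=0):
    """Count codons at each codon position across in-frame, pre-aligned reads.

    Returns a dict mapping codon index (0-based) to a Counter of codon strings.
    Codons containing 'N' (assumed here to mark pruned bases) are skipped.
    """
    counts = {}
    for read in reads:
        # Step through the read three bases at a time, ignoring a trailing
        # partial codon.
        for i in range(frame_offset, len(read) - 2, 3):
            codon = read[i:i + 3].upper()
            if "N" in codon:
                continue
            index = (i - frame_offset) // 3
            counts.setdefault(index, Counter())[codon] += 1
    return counts


# Example with three short, made-up reads:
# count_codons(["ATGATA", "ATGNTA", "ATGATC"])
```

Comparing the per-position counts of a mutagenized sample with those of the unmutagenized control would then indicate which codon substitutions rise above the baseline error rate.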
PhredFilter
Phred filtering code. To run, use the following:
"java -jar PhredFilter.jar
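For readers unfamiliar with Phred-based filtering, the general idea can be sketched as follows. This is not the code in PhredFilter.jar. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings and a hypothetical threshold of Q30, and it masks low-quality bases with "N" rather than reproducing whatever the distributed filter does. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: Phred+33 encoding (offset=33).
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq, quality_line, min_q=30, offset=33):
    """Replace every base whose Phred score is below min_q with 'N'.

    min_q=30 is an illustrative threshold, not a value taken from the paper.
    """
    scores = phred_scores(quality_line, offset)
    return "".join(
        base if q >= min_q else "N" for base, q in zip(seq, scores)
    )


# Example: the third base has quality '5' (Q20) and is masked at min_q=30.
# mask_low_quality("ACGT", "II5I")
```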