The accurate detection of induced mutations is critical for both forward and reverse genetics studies. Experimental chemical mutagenesis induces relatively few single base changes per individual. In a complex eukaryotic genome, false positive detection of mutations can occur at or above this mutagenesis rate. We demonstrate here, using a population of ethyl methanesulfonate (EMS) treated Sorghum bicolor BTx623 individuals, that using replication to detect false positive induced variants in next-generation sequencing data permits higher throughput variant detection with greater accuracy. We sequenced 586 independently mutagenized individuals at low coverage depth (average of 7X) and detected 5,399,493 homozygous SNPs. Of these, 76% originated from only 57,872 genomic positions prone to false positive variant calling. These positions are characterized by high copy number paralogs, with the error-prone SNP positions located in copies that carry a variant at the SNP position. The ability of short stretches of homology to generate these error-prone positions suggests that incompletely assembled or poorly mapped repeated sequences are one driver of such errors. Removal of these false positives left 1,275,872 homozygous and 477,531 heterozygous EMS-induced SNPs, of which more than 98% were G:C to A:T transitions, congruent with the mutagenic mechanism of EMS. Through this analysis we generated a database of sequence-indexed mutants of Sorghum. This collection contains 4,035 high-impact homozygous mutations in 3,637 genes and 56,514 homozygous missense mutations in 23,227 genes. Each line contains, on average, 2,177 annotated homozygous SNPs per genome, including seven likely gene knockouts and 96 missense mutations. The number of mutations in a transcript was linearly correlated with the transcript length and with the G+C count, but not with the GC/AT ratio. 
Analysis of the detected mutagenized positions identified CG-rich patches, and flanking sequences strongly influenced EMS-induced mutation rates. Our method for detecting false-positive induced mutations is generally applicable to any organism, is independent of the choice of in silico variant-calling algorithm, and is most valuable when the true mutation rate is likely to be low, such as in laboratory-induced mutations or somatic mutation detection in medicine.
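The replication idea described above can be illustrated with a minimal Python sketch. It rests on the premise that a true induced mutation should rarely recur at the same genomic position in independently mutagenized individuals, so positions called in several individuals are flagged as probable false positives. The function names, the tuple layout (chromosome, position, reference base, alternate base), the toy data and the `max_individuals` threshold are all hypothetical choices made for illustration; they are not taken from the authors' pipeline, whose actual criteria are not given in this abstract.

```python
from collections import Counter

def find_replicated_positions(calls_by_individual, max_individuals=1):
    """Return positions called in more than `max_individuals` individuals.

    `max_individuals` is an assumed threshold for this sketch, not a value
    reported by the source.
    """
    counts = Counter()
    for calls in calls_by_individual.values():
        # Count each position at most once per individual.
        counts.update({(chrom, pos) for chrom, pos, _ref, _alt in calls})
    return {site for site, n in counts.items() if n > max_individuals}

def remove_replicated(calls_by_individual, replicated):
    """Drop every call that falls on a flagged (replicated) position."""
    return {
        ind: [c for c in calls if (c[0], c[1]) not in replicated]
        for ind, calls in calls_by_individual.items()
    }

def gc_to_at_fraction(calls):
    """Fraction of SNPs that are G>A or C>T, the changes expected from EMS."""
    calls = list(calls)
    if not calls:
        return 0.0
    ems_like = {("G", "A"), ("C", "T")}
    return sum((ref, alt) in ems_like for _c, _p, ref, alt in calls) / len(calls)

# Toy data (invented): one position recurs in all three lines.
calls = {
    "line_A": [("Chr01", 100, "G", "A"), ("Chr01", 500, "T", "C")],
    "line_B": [("Chr01", 500, "T", "C"), ("Chr02", 42, "C", "T")],
    "line_C": [("Chr01", 500, "T", "C"), ("Chr03", 7, "A", "G")],
}

replicated = find_replicated_positions(calls)
filtered = remove_replicated(calls, replicated)
kept = [c for per_line in filtered.values() for c in per_line]
fraction = gc_to_at_fraction(kept)
```

Because the filter works only on the positions of the calls, it can be applied to the output of any variant caller, and the fraction of G:C to A:T changes among the retained SNPs offers a simple sanity check on the result.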
Supplemental File S1
Standard filtered SNPs detected in each of the 586 sorghum individuals.
Supplemental_File-S1.zip
Supplemental File S2
Probable error-prone SNP genomic positions in the sorghum reference genome (version 2.1).
Supplemental_File-S2.zip
Supplemental File S3
Non-replicate and likely EMS-induced homozygous SNPs in each of the 586 sorghum individuals.
Supplemental_File-S3.zip
Supplemental File S4
Non-replicate and likely EMS-induced heterozygous SNPs in each of the 586 sorghum individuals.
Supplemental_File-S4.zip
Supplemental File S5
Counts of homozygous and heterozygous SNPs in all the 586 sequenced lines.
Supplemental_File-S5.zip
Supplemental File S6
Replicate and likely false-negative EMS-induced homozygous G:C to A:T SNPs in each of the 586 sorghum individuals.
Supplemental_File-S6.zip
Supplemental File S7
Replicate and likely false-negative EMS-induced heterozygous G:C to A:T SNPs in each of the 586 sorghum individuals.
Supplemental_File-S7.zip
Supplemental File S8
SnpEff functional classification of the homozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S8.zip
Supplemental File S9
Function description for the genes containing the SnpEff-annotated homozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S9.zip
Supplemental File S10
List of SnpEff-annotated EMS-induced SNPs predicted to trigger silent stop codon lost substitutions in genes of the 586 sorghum individuals.
Supplemental_File-S10.zip
Supplemental File S11
SnpEff functional classification of the heterozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S11.zip
Supplemental File S12
Function description for the genes containing the SnpEff-annotated heterozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S12.zip
Supplemental File S13
Function description for the genes containing the SnpEff-annotated tentative false-negative homozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S13.zip
Supplemental File S14
Function description for the genes containing the SnpEff-annotated tentative false-negative heterozygous EMS-induced SNPs in all the 586 sorghum individuals.
Supplemental_File-S14.zip
Supplemental File S15
SnpEff functional classification of the likely EMS-induced indels in all the 586 sorghum individuals.
Supplemental_File-S15.zip
Supplemental File S16
Function description for the genes containing the SnpEff-annotated likely EMS-induced indels in all the 586 sorghum individuals.
Supplemental_File-S16.zip
Supplemental File S17
SIFT prediction results for the missense-annotated homozygous EMS-induced SNPs in all the 586 sequenced individuals.
Supplemental_File-S17.zip
Supplemental File S18
SIFT prediction results for the missense-annotated heterozygous EMS-induced SNPs in all the 586 sequenced individuals.
Supplemental_File-S18.zip
Supplemental File S19
Summary statistics for the sequencing, mapping, variant prediction, filtering, annotation and classification of the detected EMS-induced variants in all 586 sequenced individuals.
Supplemental_File-S19.zip
Supplemental File S20
Detailed classification, gene function description and annotation of the medium or high impact homozygous EMS-induced SNPs.
Supplemental_File-S20.zip
Supplemental File S21
Detailed classification, gene function description and annotation of the medium or high impact heterozygous EMS-induced SNPs.
Supplemental_File-S21.zip
Supplemental File S22
Detailed classification, gene function description and annotation of the medium or high impact EMS-induced indels.
Supplemental_File-S22.zip
Supplemental File S23
The detailed attributes of gene transcripts, including number of SNPs, GC-content, sequence length, mononucleotide and dinucleotide counts, which were used in the linear correlation analysis.
Supplemental_File-S23.zip
Supplemental File S24
The subset of BLAST ungapped global alignments for the 51-nt sequence contexts of the error-prone SNP positions, randomly selected EMS-induced SNP positions and randomly selected genome positions.
Supplemental_File-S24.zip
Supplemental File S25
Sample scripts for the variant detection and annotation pipeline.
Supplemental_File-S25.zip