Dryad

CGG, CAG, and GAA: genome-wide comparison of the disease-linked trinucleotide short tandem repeats

Abstract

Short tandem repeats (STRs) are tracts of 1–6 bp DNA motifs repeated in a head-to-tail fashion, collectively accounting for approximately 3% of the human genome. Among these, trinucleotide STRs hold particular relevance due to their involvement in human genetic disorders, with CGG, CAG, and GAA repeats being causative of Fragile X Syndrome, Huntington’s Disease, and Friedreich’s Ataxia, respectively. In this study, we systematically examined the genomic distribution, abundance, repeat length, and polymorphism of 5,963 CGG, 11,220 CAG, and 16,105 GAA loci across a cohort of 191 healthy individuals. Marked differences were observed between the three repeat classes. CGG STRs, while the least abundant, were strongly enriched within exonic and promoter regions and exhibited the highest levels of polymorphism, particularly in genic regions. GAA STRs were by far the most abundant and displayed the greatest overall variability, with the majority located in intergenic and intronic regions, but showing minimal polymorphism in exons and 5′-UTRs. In contrast, CAG STRs were more evenly distributed across genic and intergenic regions and were strikingly stable, despite being known to drive pathogenic expansions when exceeding certain thresholds. These findings demonstrate that trinucleotide STR classes are not interchangeable but exhibit unique genomic and evolutionary characteristics. Nucleotide composition emerges as a key determinant of STR localization, stability, and variability, suggesting that the biological roles of these repeats are intrinsically tied to their motif sequence. Our study underscores the importance of analyzing STR classes individually, as grouping them solely by motif length risks overlooking significant functional distinctions.
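As an illustration of the head-to-tail definition given in the abstract, the following minimal sketch locates uninterrupted tracts of a single trinucleotide motif in a DNA string and reports their repeat length. It is not the pipeline used in the study: the function name `find_motif_tracts`, the `min_repeats` threshold, and the toy sequence are assumptions made for the example, and the sketch scans only the given strand for perfect repeats.

```python
import re


def find_motif_tracts(sequence, motif, min_repeats=3):
    """Return (start, end, repeat_count) for each perfect tandem run of `motif`.

    Hypothetical helper for illustration only: forward strand, no
    interruptions allowed, coordinates are 0-based and end-exclusive.
    """
    # Match the motif repeated at least `min_repeats` times in a row.
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif.upper()), min_repeats))
    tracts = []
    for match in pattern.finditer(sequence.upper()):
        repeat_count = (match.end() - match.start()) // len(motif)
        tracts.append((match.start(), match.end(), repeat_count))
    return tracts


# Toy sequence (assumed, not real genomic data): 4 CAG units and 5 GAA units.
example = "TT" + "CAG" * 4 + "TT" + "GAA" * 5 + "TT"

for motif in ("CGG", "CAG", "GAA"):
    print(motif, find_motif_tracts(example, motif))
```

Comparing the repeat count at the same locus across individuals is one simple way to describe polymorphism; a real analysis would also need to handle both strands, imperfect repeats, and genotyping from sequencing reads.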