Short tandem repeats (STRs) are tracts of 1–6 bp DNA motifs repeated in a head-to-tail fashion, collectively accounting for approximately 3% of the human genome. Among these, trinucleotide STRs hold particular relevance due to their involvement in human genetic disorders, with CGG, CAG, and GAA repeats being causative of Fragile X Syndrome, Huntington’s Disease, and Friedreich’s Ataxia, respectively. In this study, we systematically examined the genomic distribution, abundance, repeat length, and polymorphism of 5,963 CGG, 11,220 CAG, and 16,105 GAA loci across a cohort of 191 healthy individuals. Marked differences were observed between the three repeat classes. CGG STRs, while the least abundant, were strongly enriched within exonic and promoter regions and exhibited the highest levels of polymorphism, particularly in genic regions. GAA STRs were by far the most abundant and displayed the greatest overall variability, with the majority located in intergenic and intronic regions, but showing minimal polymorphism in exons and 5′-UTRs. In contrast, CAG STRs were more evenly distributed across genic and intergenic regions and were strikingly stable, despite being known to drive pathogenic expansions when exceeding certain thresholds. These findings demonstrate that trinucleotide STR classes are not interchangeable but exhibit unique genomic and evolutionary characteristics. Nucleotide composition emerges as a key determinant of STR localization, stability, and variability, suggesting that the biological roles of these repeats are intrinsically tied to their motif sequence. Our study underscores the importance of analyzing STR classes individually, as grouping them solely by motif length risks overlooking significant functional distinctions.

Dataset DOI: 10.5061/dryad.5tb2rbpgt

Description of the data and file structure

Project Title

CGG, CAG, and GAA: Genome-wide comparison of the Disease Linked Trinucleotide Short Tandem Repeats

Overview

Short tandem repeats (STRs) of trinucleotide motifs play distinct roles in genome biology and human disease. This project focuses on the distribution, variability, and polymorphism of CAG, CGG, and GAA repeats—associated with Huntington’s Disease, Fragile X Syndrome, and Friedreich’s Ataxia, respectively—across 191 healthy genomes.

Directory Contents

The dataset is organized into several CSV files, grouped by repeat motif and analysis type:

1. Summary Files by Motif

For each repeat class (CAG, CGG, GAA), the following summary files are included:

File Name	Description
`*_Platinum_by_chr.csv`	Summary statistics of STRs aggregated by chromosome.
`*_Platinum_by_sample.csv`	Per-sample statistics (e.g., average repeat length, heterozygosity).
`*_Platinum_by_heterozygosity.csv`	Summary of heterozygosity levels across samples.
`*_Platinum_by_locus.csv`	Locus-level data including genomic position, motif length, variability, and annotation category.

Example:
GAA_Platinum_by_locus.csv contains detailed locus-wise STR metrics for all GAA loci found in the Platinum genome cohort.

File Structures:
*_Platinum_by_chr.csv

Column Name	Description
Chr	Chromosome
Mbp	Chromosome size in megabase pairs (Mbp)
Total_Reps	Total repeat loci on the chromosome
Mean_Units	Mean repeat length on the chromosome in the screened population
Med_Units	Median repeat length on the chromosome in the screened population
Polymorphic_Reps	Number of polymorphic repeats detected on the chromosome
poly_Ratio	Ratio of polymorphic repeats to stable repeats
Reps_per_Mbp	Total density of repeats on the chromosome (per Mbp)
Poly_per_Mbp	Density of polymorphic repeats on the chromosome (per Mbp)
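
The derived columns in this table can be sanity-checked against the count columns. The sketch below assumes, since this README does not state the formulas, that each density is a count divided by `Mbp` and that the number of stable repeats is `Total_Reps` minus `Polymorphic_Reps`. The function names and the example row are hypothetical.

```python
def per_mbp(count, chrom_size_mbp):
    """Return a density per megabase pair (count / chromosome size in Mbp)."""
    if chrom_size_mbp <= 0:
        raise ValueError("chromosome size must be positive")
    return count / chrom_size_mbp


def poly_to_stable_ratio(total_reps, polymorphic_reps):
    """Ratio of polymorphic to stable loci; stable = total - polymorphic (assumed)."""
    stable = total_reps - polymorphic_reps
    if stable < 0:
        raise ValueError("polymorphic count exceeds total")
    return polymorphic_reps / stable if stable else float("inf")


# Made-up row for illustration only; these values are not from the dataset.
row = {"Chr": "chrN", "Mbp": 100.0, "Total_Reps": 500, "Polymorphic_Reps": 100}

reps_per_mbp = per_mbp(row["Total_Reps"], row["Mbp"])
poly_per_mbp = per_mbp(row["Polymorphic_Reps"], row["Mbp"])
ratio = poly_to_stable_ratio(row["Total_Reps"], row["Polymorphic_Reps"])
print(reps_per_mbp, poly_per_mbp, ratio)
```

If the recomputed values differ from the published columns, the assumed formulas above are the first thing to revisit.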

*_Platinum_by_sample.csv

Column Name	Description
Sample_ID	Sample identifier
Highest_Repeat	Largest repeat length detected in the sample
Total_Repeats	Total number of repeat alleles genotyped within the sample
Mean_Rep_length	Mean repeat length within the given sample
Unstable_Reps	Number of polymorphic repeats detected within the sample
Percentage_Unstable_Reps	Percentage of detected repeats that were polymorphic

*_Platinum_by_heterozygosity.csv

Column Name	Description
Med_Units	Population median repeat length used to group the repeats
Stable_Reps	The number of repeats that were stable across the population
Unstable_Reps	The number of repeats that were polymorphic across the population
Total_Reps	Total number of repeats with that median population repeat length
Flagged_Unstable	Proportion of repeats with that median population repeat length that were polymorphic
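
A table of this shape can be rebuilt from locus-level rows. The following is a minimal sketch, assuming the table is a grouping of loci by `Med_Units` with `Flagged_Unstable` equal to `Unstable_Reps / Total_Reps`; the README does not confirm that this is how the published file was produced. The function name and the inline rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def heterozygosity_table(locus_rows):
    """Group locus rows by Med_Units and count STABLE vs POLYMORPHIC loci."""
    counts = defaultdict(lambda: {"Stable_Reps": 0, "Unstable_Reps": 0})
    for row in locus_rows:
        key = float(row["Med_Units"])
        column = "Stable_Reps" if row["Status"] == "STABLE" else "Unstable_Reps"
        counts[key][column] += 1

    table = []
    for med_units in sorted(counts):
        stable = counts[med_units]["Stable_Reps"]
        unstable = counts[med_units]["Unstable_Reps"]
        total = stable + unstable
        table.append({
            "Med_Units": med_units,
            "Stable_Reps": stable,
            "Unstable_Reps": unstable,
            "Total_Reps": total,
            "Flagged_Unstable": unstable / total,  # assumed definition
        })
    return table


# Made-up locus rows for illustration only; not taken from the dataset.
example = io.StringIO(
    "Call_ID,Med_Units,Status\n"
    "locus_a,5,STABLE\n"
    "locus_b,5,POLYMORPHIC\n"
    "locus_c,5,STABLE\n"
    "locus_d,9,POLYMORPHIC\n"
)
table = heterozygosity_table(csv.DictReader(example))
for entry in table:
    print(entry)
```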

*_Platinum_by_locus.csv

Column Name	Description
Call_ID	Repeat locus identifier
Chr	Chromosome
Start	Genomic coordinate start position
Ref_Units	Repeat length in hg38 reference genome
Mean_Units	Mean repeat length in screened population
Med_Units	Median repeat length in screened population
Min_Units	Smallest repeat length in screened population
Max_Units	Largest repeat length in screened population
Hits	Number of times repeat was successfully genotyped
Instability_Rating	Rate of repeat instability (0 - 1)
SD	Standard deviation of repeat length
EE50	Number of repeat expansion events detected (above 50)
EE100	Number of repeat expansion events detected (above 100)
Status	Repeat status (STABLE or POLYMORPHIC)
Gene	Gene most positionally associated with repeat
Region	Genetic region in which the repeat occurs
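
As an illustration of working with the locus-level columns, the sketch below filters rows by `Status`, `Region`, and `Hits` using only the Python standard library. It assumes the files are comma-separated with a header row matching the column names above. The function name, the inline rows, and the `Region` labels used here are hypothetical.

```python
import csv
import io


def select_loci(rows, status="POLYMORPHIC", region=None, min_hits=0):
    """Return locus rows matching a Status, an optional Region, and a minimum Hits count."""
    selected = []
    for row in rows:
        if row["Status"] != status:
            continue
        if region is not None and row["Region"] != region:
            continue
        if int(row["Hits"]) < min_hits:
            continue
        selected.append(row)
    return selected


# Made-up rows with a subset of the columns; not taken from the dataset.
example = io.StringIO(
    "Call_ID,Chr,Start,Hits,Status,Gene,Region\n"
    "locus_a,chr1,1000,191,POLYMORPHIC,GENE1,exonic\n"
    "locus_b,chr1,2000,150,STABLE,GENE2,intronic\n"
    "locus_c,chr2,3000,40,POLYMORPHIC,GENE3,intronic\n"
)
rows = list(csv.DictReader(example))
well_genotyped = select_loci(rows, min_hits=100)
print([row["Call_ID"] for row in well_genotyped])
```

To run this against a real file, replace the `io.StringIO` object with a handle from `open("GAA_Platinum_by_locus.csv", newline="")`, and check the actual `Region` labels in the file before filtering on them.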

2. Read Coverage Files

File Name	Description
`ReadCoverage_*_Platinum.csv`	Read depth information for each STR locus, used to assess reliability of genotyping across the cohort.

Structure:

ReadCoverage_*_Platinum.csv

Column Name	Description
Sample_ID	Sample identifier
Call_ID	Repeat locus identifier
MaxSpanningRead	Number and size of spanning reads detected
MaxFlankingRead	Number and size of flanking reads detected
MaxInrepeatRead	Number and size of in-repeat reads detected

3. Raw and Filtered Repeat Data

File Name	Description
`Repeats_*_Platinum.csv`	Raw STR calls from the Platinum genomes.
`Repeats_*_Platinum_filtered.csv`	High-confidence subset of STR calls after quality filtering.

Structure:
Repeats_*_Platinum.csv & Repeats_*_Platinum_filtered.csv

Column Name	Description
Call_ID	Repeat locus identifier
Sample_ID	Sample identifier
Chr	Chromosome
Start	Genomic coordinate start position
End	Genomic coordinate end position
GT	Overall genotype, vcf notation
Ref_Units	Repeat length in hg38 reference genome
Allele1_Units	Repeat length of the 1st allele
Allele2_Units	Repeat length of the 2nd allele
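
The per-sample calls can be rolled up into locus-level statistics comparable to those in the `*_Platinum_by_locus.csv` files. The sketch below is one possible aggregation, not the pipeline used to generate the published summaries: it pools both alleles of every sample per `Call_ID`, skips empty allele fields, and assumes allele lengths are numeric repeat-unit counts. `Distinct_Lengths` is a hypothetical helper field, not a dataset column, and the inline rows are made up.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarise_calls(call_rows):
    """Pool both allele lengths per locus and compute simple summary statistics."""
    alleles = defaultdict(list)
    for row in call_rows:
        for column in ("Allele1_Units", "Allele2_Units"):
            value = (row.get(column) or "").strip()
            if value:  # skip missing genotypes
                alleles[row["Call_ID"]].append(float(value))

    summary = {}
    for call_id, units in alleles.items():
        summary[call_id] = {
            "Mean_Units": statistics.mean(units),
            "Med_Units": statistics.median(units),
            "Min_Units": min(units),
            "Max_Units": max(units),
            "Distinct_Lengths": len(set(units)),  # hypothetical helper field
        }
    return summary


# Made-up calls for illustration only; not taken from the dataset.
example = io.StringIO(
    "Call_ID,Sample_ID,Allele1_Units,Allele2_Units\n"
    "locus_a,sample_1,10,10\n"
    "locus_a,sample_2,10,12\n"
    "locus_b,sample_1,7,7\n"
    "locus_b,sample_2,,\n"
)
summary = summarise_calls(csv.DictReader(example))
for call_id, stats in summary.items():
    print(call_id, stats)
```

A locus with more than one distinct allele length is a natural candidate for the POLYMORPHIC status, but the exact criterion behind the `Status` column is not given in this README.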

4. MSSNG Unaffected Dataset

These files provide repeat data from unaffected individuals in the MSSNG cohort, serving as an independent control dataset:

File Name	Description
`*_Repeats_MSSNG_Unaffected_by_locus.csv`	Locus-level repeat data for unaffected individuals in the MSSNG project.

Structure:
*_Repeats_MSSNG_Unaffected_by_locus.csv has the same file structure as *_Platinum_by_locus.csv outlined in Section 1.

File Naming Convention

CAG, CGG, GAA: Motif class
Platinum: Refers to the 191 high-quality genomes used for the core analysis
MSSNG_Unaffected: Refers to unaffected individuals from the MSSNG autism genomics cohort
by_chr, by_sample, by_locus, etc.: Summary scope
filtered: Indicates data has passed stringent quality control filters
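
The naming convention lends itself to programmatic parsing. The sketch below assumes the `*` wildcard in the file patterns stands for the motif class, as in the `GAA_Platinum_by_locus.csv` example; the `kind` labels and the function name are hypothetical.

```python
import re

MOTIFS = "CAG|CGG|GAA"

# Each pattern mirrors one file-name template listed in this README.
PATTERNS = [
    (re.compile(rf"^(?P<motif>{MOTIFS})_Platinum_by_"
                r"(?P<scope>chr|sample|heterozygosity|locus)\.csv$"), "summary"),
    (re.compile(rf"^ReadCoverage_(?P<motif>{MOTIFS})_Platinum\.csv$"), "read_coverage"),
    (re.compile(rf"^Repeats_(?P<motif>{MOTIFS})_Platinum"
                r"(?P<filtered>_filtered)?\.csv$"), "calls"),
    (re.compile(rf"^(?P<motif>{MOTIFS})_Repeats_MSSNG_Unaffected_by_"
                r"(?P<scope>locus)\.csv$"), "mssng_summary"),
]


def parse_name(filename):
    """Return the kind, motif, and scope/filtered flags for a file name, or None."""
    for pattern, kind in PATTERNS:
        match = pattern.match(filename)
        if match is None:
            continue
        groups = match.groupdict()
        info = {"kind": kind, "motif": groups["motif"]}
        if "scope" in groups:
            info["scope"] = groups["scope"]
        if "filtered" in groups:
            info["filtered"] = groups["filtered"] is not None
        return info
    return None


for name in ("GAA_Platinum_by_locus.csv", "Repeats_CAG_Platinum_filtered.csv"):
    print(name, parse_name(name))
```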

Notes

All files are in CSV or TSV format and can be opened in R, Python, or spreadsheet software.
Locus-based files include genomic coordinates (chr, start, end), motif size, repeat counts, and annotation categories (e.g., exonic, intronic).
Summary statistics were generated using custom pipelines for STR genotyping and annotation.

Access information

Data was derived from the following sources:

The MSSNG Project (https://research.mss.ng/)
Platinum Genomes (https://emea.illumina.com/platinumgenomes.html)

Human subjects data

I confirm that we have received explicit consent and that all data has been de-identified.
