CGG, CAG, and GAA: Genome-wide comparison of the disease-linked trinucleotide short tandem repeats
Data files
Oct 23, 2025 version files 58.26 MB
- CAG_Platinum_by_chr.csv (990 B)
- CAG_Platinum_by_heterozygosity.csv (800 B)
- CAG_Platinum_by_locus.csv (834.10 KB)
- CAG_Platinum_by_sample.csv (416 B)
- CAG_Repeats_MSSNG_Unaffected_by_locus.csv (885.88 KB)
- CGG_Platinum_by_chr.csv (993 B)
- CGG_Platinum_by_heterozygosity.csv (836 B)
- CGG_Platinum_by_locus.csv (429.67 KB)
- CGG_Platinum_by_sample.csv (424 B)
- CGG_Repeats_MSSNG_Unaffected_by_locus.csv (477.64 KB)
- GAA_Platinum_by_chr.csv (1.04 KB)
- GAA_Platinum_by_heterozygosity.csv (1.47 KB)
- GAA_Platinum_by_locus.csv (1.27 MB)
- GAA_Platinum_by_sample.csv (436 B)
- GAA_Repeats_MSSNG_Unaffected_by_locus.csv (1.38 MB)
- ReadCoverage_CAG_Platinum.csv (6.53 MB)
- ReadCoverage_CGG_Platinum.csv (4.04 MB)
- ReadCoverage_GAA_Platinum.csv (10.84 MB)
- README.md (9.95 KB)
- Repeats_CAG_Platinum_filtered.csv (5.15 MB)
- Repeats_CAG_Platinum.csv (5.48 MB)
- Repeats_CGG_Platinum_filtered.csv (2.75 MB)
- Repeats_CGG_Platinum.csv (2.93 MB)
- Repeats_GAA_Platinum_filtered.csv (7.38 MB)
- Repeats_GAA_Platinum.csv (7.87 MB)
Abstract
Short tandem repeats (STRs) are tracts of 1–6 bp DNA motifs repeated in a head-to-tail fashion, collectively accounting for approximately 3% of the human genome. Among these, trinucleotide STRs hold particular relevance due to their involvement in human genetic disorders, with CGG, CAG, and GAA repeats being causative of Fragile X Syndrome, Huntington’s Disease, and Friedreich’s Ataxia, respectively. In this study, we systematically examined the genomic distribution, abundance, repeat length, and polymorphism of 5,963 CGG, 11,220 CAG, and 16,105 GAA loci across a cohort of 191 healthy individuals. Marked differences were observed between the three repeat classes. CGG STRs, while the least abundant, were strongly enriched within exonic and promoter regions and exhibited the highest levels of polymorphism, particularly in genic regions. GAA STRs were by far the most abundant and displayed the greatest overall variability, with the majority located in intergenic and intronic regions, but showing minimal polymorphism in exons and 5′-UTRs. In contrast, CAG STRs were more evenly distributed across genic and intergenic regions and were strikingly stable, despite being known to drive pathogenic expansions when exceeding certain thresholds. These findings demonstrate that trinucleotide STR classes are not interchangeable but exhibit unique genomic and evolutionary characteristics. Nucleotide composition emerges as a key determinant of STR localization, stability, and variability, suggesting that the biological roles of these repeats are intrinsically tied to their motif sequence. Our study underscores the importance of analyzing STR classes individually, as grouping them solely by motif length risks overlooking significant functional distinctions.
Dataset DOI: 10.5061/dryad.5tb2rbpgt
Description of the data and file structure
Project Title
CGG, CAG, and GAA: Genome-wide comparison of the Disease Linked Trinucleotide Short Tandem Repeats
Overview
Short tandem repeats (STRs) of trinucleotide motifs play distinct roles in genome biology and human disease. This project focuses on the distribution, variability, and polymorphism of CAG, CGG, and GAA repeats—associated with Huntington’s Disease, Fragile X Syndrome, and Friedreich’s Ataxia, respectively—across 191 healthy genomes.
Directory Contents
The dataset is organized into several CSV files, grouped by repeat motif and analysis type:
1. Summary Files by Motif
For each repeat class (CAG, CGG, GAA), the following summary files are included:
| File Name | Description |
|---|---|
| *_Platinum_by_chr.csv | Summary statistics of STRs aggregated by chromosome. |
| *_Platinum_by_sample.csv | Per-sample statistics (e.g., average repeat length, heterozygosity). |
| *_Platinum_by_heterozygosity.csv | Summary of heterozygosity levels across samples. |
| *_Platinum_by_locus.csv | Locus-level data including genomic position, motif length, variability, and annotation category. |
Example:
GAA_Platinum_by_locus.csv contains detailed locus-wise STR metrics for all GAA loci found in the Platinum genome cohort.
File Structures:
*_Platinum_by_chr.csv
| Column Name | Description |
|---|---|
| Chr | Chromosome |
| Mbp | Chromosome size (Mbp) |
| Total_Reps | Total repeat loci on the chromosome |
| Mean_Units | Mean repeat length on the chromosome in the screened population |
| Med_Units | Median repeat length on the chromosome in the screened population |
| Polymorphic_Reps | Number of polymorphic repeats detected on the chromosome |
| poly_Ratio | Ratio of polymorphic repeats to stable repeats |
| Reps_per_Mbp | Total density of repeats on the chromosome (per Mbp) |
| Poly_per_Mbp | Density of polymorphic repeats on the chromosome (per Mbp) |
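
The two density columns can presumably be recomputed from the other columns in the table. The following is a minimal sketch that assumes `Reps_per_Mbp` is `Total_Reps / Mbp` and `Poly_per_Mbp` is `Polymorphic_Reps / Mbp`; the README does not state these formulas explicitly. The chromosome names and counts in the demo are made up and are not taken from the dataset.

```python
import csv
import io


def chromosome_densities(handle):
    """Recompute per-Mbp densities from a *_Platinum_by_chr.csv-style file.

    Assumption: density = count / chromosome size in Mbp.
    """
    results = []
    for row in csv.DictReader(handle):
        mbp = float(row["Mbp"])
        total = float(row["Total_Reps"])
        poly = float(row["Polymorphic_Reps"])
        results.append({
            "Chr": row["Chr"],
            "Reps_per_Mbp": total / mbp,
            "Poly_per_Mbp": poly / mbp,
        })
    return results


# Made-up values for illustration only; not real dataset contents.
demo = io.StringIO(
    "Chr,Mbp,Total_Reps,Polymorphic_Reps\n"
    "chrA,200,500,50\n"
    "chrB,100,300,60\n"
)
densities = chromosome_densities(demo)
for d in densities:
    print(d["Chr"], round(d["Reps_per_Mbp"], 3), round(d["Poly_per_Mbp"], 3))
```

To use it on a real file, replace the `io.StringIO` demo with `open("GAA_Platinum_by_chr.csv", newline="")` and compare the result against the reported columns.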
*_Platinum_by_sample.csv
| Column Name | Description |
|---|---|
| Sample_ID | Sample identifier |
| Highest_Repeat | Largest repeat length detected in the sample |
| Total_Repeats | Total number of repeat alleles genotyped within the sample |
| Mean_Rep_length | Mean repeat length within the given sample |
| Unstable_Reps | Number of polymorphic repeats detected within the sample |
| Percentage_Unstable_Reps | Percentage of detected repeats that were polymorphic |
*_Platinum_by_heterozygosity.csv
| Column Name | Description |
|---|---|
| Med_Units | Repeats grouped by population median length |
| Stable_Reps | The number of repeats that were stable across the population |
| Unstable_Reps | The number of repeats that were polymorphic across the population |
| Total_Reps | Total number of repeats with that median population repeat length |
| Flagged_Unstable | Proportion of polymorphic repeats among repeats with that median population repeat length |
*_Platinum_by_locus.csv
| Column Name | Description |
|---|---|
| Call_ID | Repeat locus identifier |
| Chr | Chromosome |
| Start | Genomic coordinate start position |
| Ref_Units | Repeat length in hg38 reference genome |
| Mean_Units | Mean repeat length in screened population |
| Med_Units | Median repeat length in screened population |
| Min_Units | Smallest repeat length in screened population |
| Max_Units | Largest repeat length in screened population |
| Hits | Number of times repeat was successfully genotyped |
| Instability_Rating | Rate of repeat instability (0 - 1) |
| SD | Standard deviation of repeat length |
| EE50 | Number of repeat expansion events detected (above 50) |
| EE100 | Number of repeat expansion events detected (above 100) |
| Status | Repeat status (STABLE or POLYMORPHIC) |
| Gene | Gene most positionally associated with repeat |
| Region | Genetic region in which the repeat occurs |
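
As an illustration of how the locus-level columns might be used, the sketch below computes the fraction of loci marked `POLYMORPHIC` within each `Region`. It relies only on the `Status` and `Region` columns described above. The locus identifiers, gene names, and region labels in the demo are hypothetical, and this is not necessarily how the summary statistics in the study were produced.

```python
import csv
import io
from collections import Counter


def polymorphic_fraction_by_region(handle):
    """Return {Region: fraction of loci with Status == POLYMORPHIC}."""
    totals = Counter()
    polymorphic = Counter()
    for row in csv.DictReader(handle):
        region = row["Region"]
        totals[region] += 1
        if row["Status"].strip().upper() == "POLYMORPHIC":
            polymorphic[region] += 1
    return {region: polymorphic[region] / totals[region] for region in totals}


# Hypothetical rows for illustration; only a subset of columns is shown.
demo = io.StringIO(
    "Call_ID,Chr,Start,Status,Gene,Region\n"
    "locus1,chr1,1000,STABLE,GENE_A,exonic\n"
    "locus2,chr1,2000,POLYMORPHIC,GENE_A,intronic\n"
    "locus3,chr2,3000,POLYMORPHIC,GENE_B,intronic\n"
    "locus4,chr2,4000,STABLE,GENE_C,intronic\n"
)
fractions = polymorphic_fraction_by_region(demo)
for region, fraction in sorted(fractions.items()):
    print(region, round(fraction, 3))
```

The same function should work unchanged on the MSSNG `*_by_locus.csv` files, since they are described as sharing this structure.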
2. Read Coverage Files
| File Name | Description |
|---|---|
| ReadCoverage_*_Platinum.csv | Read depth information for each STR locus, used to assess reliability of genotyping across the cohort. |
Structure:
ReadCoverage_*_Platinum.csv
| Column Name | Description |
|---|---|
| Sample_ID | Sample identifier |
| Call_ID | Repeat locus identifier |
| MaxSpanningRead | Number and size of spanning reads detected |
| MaxFlankingRead | Number and size of flanking reads detected |
| MaxInrepeatRead | Number and size of in-repeat reads detected |
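
Because the read coverage files and the repeat call files both carry `Sample_ID` and `Call_ID`, the two can presumably be joined on that pair. The sketch below assumes the pair identifies a row uniquely in the coverage file, which the README does not state. The coverage fields are kept as raw strings because their internal format is not documented here; the demo values are placeholders.

```python
import csv
import io


def index_coverage(handle):
    """Map (Sample_ID, Call_ID) to the coverage row, fields left unparsed."""
    return {
        (row["Sample_ID"], row["Call_ID"]): row
        for row in csv.DictReader(handle)
    }


def calls_with_coverage(calls_handle, coverage_index):
    """Yield (call_row, coverage_row or None) for every repeat call."""
    for call in csv.DictReader(calls_handle):
        key = (call["Sample_ID"], call["Call_ID"])
        yield call, coverage_index.get(key)


# Placeholder data; the real coverage field format is not assumed here.
coverage_demo = io.StringIO(
    "Sample_ID,Call_ID,MaxSpanningRead,MaxFlankingRead,MaxInrepeatRead\n"
    "S1,locus1,placeholder,placeholder,placeholder\n"
)
calls_demo = io.StringIO(
    "Call_ID,Sample_ID,Allele1_Units,Allele2_Units\n"
    "locus1,S1,10,12\n"
    "locus1,S2,10,10\n"
)
joined = list(calls_with_coverage(calls_demo, index_coverage(coverage_demo)))
for call, cov in joined:
    print(call["Sample_ID"], call["Call_ID"], "coverage found:", cov is not None)
```

Calls without a matching coverage row are kept and paired with `None`, so missing coverage is visible rather than silently dropped.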
3. Raw and Filtered Repeat Data
| File Name | Description |
|---|---|
| Repeats_*_Platinum.csv | Raw STR calls from the Platinum genomes. |
| Repeats_*_Platinum_filtered.csv | High-confidence subset of STR calls after quality filtering. |
Structure:
Repeats_*_Platinum.csv & Repeats_*_Platinum_filtered.csv
| Column Name | Description |
|---|---|
| Call_ID | Repeat locus identifier |
| Sample_ID | Sample identifier |
| Chr | Chromosome |
| Start | Genomic coordinate start position |
| End | Genomic coordinate end position |
| GT | Overall genotype, vcf notation |
| Ref_Units | Repeat length in hg38 reference genome |
| Allele1_Units | Repeat length of the 1st allele |
| Allele2_Units | Repeat length of the 2nd allele |
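
The per-call columns above are enough to derive simple per-locus summaries. The sketch below counts, for each `Call_ID`, how many calls have two numeric alleles, how many of those are heterozygous (`Allele1_Units` differs from `Allele2_Units`), and the largest allele seen. Treating empty or non-numeric allele fields as missing is an assumption, as is skipping calls with only one allele. This is not the definition of `Instability_Rating` or `Status` used in the study, and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_units(value):
    """Return a repeat count as int, or None for missing/non-numeric entries."""
    try:
        return int(float(value))
    except (TypeError, ValueError):
        return None


def summarise_calls(handle):
    """Return {Call_ID: {"calls", "het", "max_units"}} from a Repeats_* file."""
    per_locus = defaultdict(lambda: {"calls": 0, "het": 0, "max_units": 0})
    for row in csv.DictReader(handle):
        a1 = parse_units(row["Allele1_Units"])
        a2 = parse_units(row["Allele2_Units"])
        if a1 is None or a2 is None:
            continue  # assumption: skip calls lacking two numeric alleles
        stats = per_locus[row["Call_ID"]]
        stats["calls"] += 1
        if a1 != a2:
            stats["het"] += 1
        stats["max_units"] = max(stats["max_units"], a1, a2)
    return dict(per_locus)


# Hypothetical rows for illustration; not real dataset contents.
demo = io.StringIO(
    "Call_ID,Sample_ID,Chr,Start,End,GT,Ref_Units,Allele1_Units,Allele2_Units\n"
    "locus1,S1,chr1,1000,1030,0/0,10,10,10\n"
    "locus1,S2,chr1,1000,1030,0/1,10,10,12\n"
    "locus1,S3,chr1,1000,1030,./.,10,,\n"
    "locus2,S1,chr2,5000,5021,1/1,7,8,8\n"
)
summary = summarise_calls(demo)
for call_id, stats in sorted(summary.items()):
    print(call_id, stats)
```

Running the same function on the raw and the filtered file for one motif gives a quick view of how many calls per locus the quality filter removed.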
4. MSSNG Unaffected Dataset
These files provide repeat data from unaffected individuals in the MSSNG cohort, serving as an independent control dataset:
| File Name | Description |
|---|---|
| *_Repeats_MSSNG_Unaffected_by_locus.csv | Locus-level repeat data for unaffected individuals in the MSSNG project. |
Structure:
*_Repeats_MSSNG_Unaffected_by_locus.csv has the same file structure as *_Platinum_by_locus.csv outlined in Section 1.
File Naming Convention
- CAG, CGG, GAA: Motif class
- Platinum: Refers to the 191 high-quality genomes used for the core analysis
- MSSNG_Unaffected: Refers to unaffected individuals from the MSSNG autism genomics cohort
- by_chr, by_sample, by_locus, etc.: Summary scope
- filtered: Indicates data has passed stringent quality control filters
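
The naming convention can be turned into a small helper that classifies each file before loading it. The sketch below is based only on the file names listed in this dataset. The function name `describe_file` and the `kind` labels (`summary`, `calls`, `read_coverage`) are illustrative choices and not part of the dataset.

```python
MOTIFS = ("CAG", "CGG", "GAA")


def describe_file(name):
    """Split a dataset file name into motif, cohort, kind, scope, and filtered."""
    stem = name[:-4] if name.endswith(".csv") else name
    tokens = stem.split("_")

    motif = next((t for t in tokens if t in MOTIFS), None)

    if "MSSNG" in tokens:
        cohort = "MSSNG_Unaffected"
    elif "Platinum" in tokens:
        cohort = "Platinum"
    else:
        cohort = None

    if tokens[0] == "ReadCoverage":
        kind = "read_coverage"
    elif tokens[0] == "Repeats":
        kind = "calls"
    else:
        kind = "summary"

    # Summary scope is everything from the "by" token onward, e.g. "by_locus".
    scope = "_".join(tokens[tokens.index("by"):]) if "by" in tokens else None

    return {
        "motif": motif,
        "cohort": cohort,
        "kind": kind,
        "scope": scope,
        "filtered": tokens[-1] == "filtered",
    }


for example in (
    "GAA_Platinum_by_locus.csv",
    "Repeats_CGG_Platinum_filtered.csv",
    "CAG_Repeats_MSSNG_Unaffected_by_locus.csv",
    "ReadCoverage_CAG_Platinum.csv",
):
    print(example, describe_file(example))
```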
Notes
- All files are in CSV or TSV format and can be opened in R, Python, or spreadsheet software.
- Locus-based files include genomic coordinates (chr, start, end), motif size, repeat counts, and annotation categories (e.g., exonic, intronic).
- Summary statistics were generated using custom pipelines for STR genotyping and annotation.
Access information
Data was derived from the following sources:
- The MSSNG Project (https://research.mss.ng/)
- Platinum Genomes (https://emea.illumina.com/platinumgenomes.html)
Human subjects data
I confirm that we have received explicit consent and all data has been de-identified.
