
CGG short tandem repeat genotype predictions

Cite this dataset

Annear, Dale (2022). CGG short tandem repeat genotype predictions [Dataset]. Dryad.


As expansions of CGG short tandem repeats (STR) are established as the genetic aetiology of many neurodevelopmental disorders, we aimed to elucidate the inheritance patterns and role of CGG STRs in autism-spectrum disorder (ASD). By genotyping 6,063 CGG STR loci in a large cohort of trios and quads with an ASD-affected proband, we determined an unprecedented rate of CGG repeat length deviation across a single generation. While the concept of repeat length being linked to deviation rate was solidified, we demonstrate how shorter STRs display greater degrees of size variation. We observed that CGG STRs did not segregate by Mendelian principles, with a bias against longer repeats, which appeared to magnify as repeat length increased. Through logistic regression, we identified 19 genes that displayed significantly higher rates and degrees of CGG STR expansion within the ASD-affected probands (p < 1 × 10⁻⁵). This study not only highlights novel repeat expansions that may play a role in ASD but also reinforces the hypothesis that CGG STRs are specifically linked to human cognition.
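To make "repeat length deviation across a single generation" concrete, the following minimal sketch measures how far a child's two alleles are from what the parents could have transmitted. It is an illustration only: the function names are hypothetical, genotypes are assumed to be pairs of repeat counts at an autosomal locus, and the exact deviation definition used in the study is not stated in this description.

```python
def min_trio_deviation(child, mother, father):
    """Smallest total change (in repeat units) needed to explain the child's
    two alleles as one maternally and one paternally inherited allele.

    Each argument is a pair of repeat counts, e.g. (20, 30).
    Assumption: autosomal locus, one allele transmitted by each parent.
    """
    c1, c2 = child
    deviations = []
    # Try both ways of assigning the child's alleles to the two parents.
    for from_mother, from_father in ((c1, c2), (c2, c1)):
        dev_mother = min(abs(from_mother - m) for m in mother)
        dev_father = min(abs(from_father - f) for f in father)
        deviations.append(dev_mother + dev_father)
    return min(deviations)


def is_mendelian_consistent(child, mother, father):
    """True when the child's genotype needs no length change to be explained."""
    return min_trio_deviation(child, mother, father) == 0


# Illustrative (made-up) genotypes: the child's 22-repeat allele sits one
# unit from the father's 21-repeat allele.
print(min_trio_deviation((30, 22), (20, 30), (21, 25)))
print(is_mendelian_consistent((20, 25), (20, 30), (21, 25)))
```

A deviation of zero corresponds to Mendelian-consistent transmission; any positive value flags a locus where at least one allele changed length between generations.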


Study cohorts and whole genome sequencing

Within this project, whole genome sequencing (WGS) data were obtained from 5,889 samples from the MSSNG project and 501 samples from the NGC project. For the MSSNG cohort, PCR-free WGS was conducted on DNA samples from 1,811 trios, each consisting of an autism-affected proband and unaffected parents, and 114 quads, each consisting of an autism-affected proband, an unaffected sibling, and unaffected parents. Library preparation and sequencing were conducted on the Illumina HiSeq X platform. All samples were aligned to the GRCh38/hg38 reference genome using BWA-MEM. Full details on the MSSNG data pipelines can be obtained from the MSSNG website.

For the NGC cohort, PCR-free WGS was conducted on DNA samples obtained from blood of 167 trios, each consisting of an affected proband (young children needing intensive care) and unaffected parents. DNA samples were shipped to Illumina (UK) for sequencing and were prepared with the Illumina TruSeq DNA PCR-Free Sample Preparation kit (Illumina, Inc.) as previously described (63,64). Samples were sequenced on the Illumina HiSeq 2500, and quality control and read alignment to the human reference genome GRCh38 were performed by Illumina as previously described. Average coverage obtained was 30–40× for the nuclear genome and 800–1000× for the mitochondrial genome.
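The sample totals quoted above follow directly from the family structures (three samples per trio, four per quad). A short check, with a hypothetical helper name, is sketched here:

```python
def expected_samples(trios, quads=0):
    """Number of sequenced individuals implied by the family counts."""
    return 3 * trios + 4 * quads


mssng = expected_samples(trios=1811, quads=114)  # 5,433 + 456
ngc = expected_samples(trios=167)

print(mssng)  # 5889
print(ngc)    # 501
```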

Genome-wide CGG STR genotyping through ExpansionHunter

Genome-wide CGG repeat genotyping was conducted on the GRCh38-aligned CRAM files generated by the WGS described in the previous section. The short tandem repeat genotyping algorithm ExpansionHunter (version 5.0.0), developed by Dolzhenko et al., was used with default parameters, and the GRCh38 FASTA file was supplied as the reference genome argument. For the “--variant-catalog” argument, a custom CGG-repeat catalogue JSON file was used, as developed and described previously by Annear et al. and updated here for the GRCh38 reference assembly. The resulting VCF and JSON output files were processed through a bioinformatic pipeline, STaRparse, to automatically parse, filter, analyse, and annotate the extracted CGG STR data. STaRparse was built using Python (3.6.8) and R (3.6.3) environments. To ensure the accuracy of repeat length predictions, loci were excluded based on sequence coverage and the presence of only flanking reads.
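The genotyping and filtering steps described above could be sketched roughly as follows. This is not the STaRparse implementation. The file names are placeholders, the coverage threshold is a hypothetical value (the actual cut-off is not given here), and "only flanking reads" is interpreted as calls in which every allele is supported solely by flanking reads. The `--reads`, `--reference`, `--variant-catalog` and `--output-prefix` options and the `LC` (locus coverage) and `SO` (supporting read type) fields are taken from ExpansionHunter's documented interface and VCF output.

```python
def build_expansionhunter_command(cram, reference_fasta, catalog_json,
                                  output_prefix, executable="ExpansionHunter"):
    """Assemble an ExpansionHunter command line with default parameters.

    The returned list can be passed to subprocess.run(); nothing is executed here.
    """
    return [
        executable,
        "--reads", str(cram),
        "--reference", str(reference_fasta),
        "--variant-catalog", str(catalog_json),
        "--output-prefix", str(output_prefix),
    ]


# Hypothetical threshold: the study's actual coverage cut-off is not stated.
MIN_LOCUS_COVERAGE = 10.0


def keep_call(call, min_coverage=MIN_LOCUS_COVERAGE):
    """Decide whether a genotype call passes the two filters.

    `call` is a dict holding the per-sample VCF fields, e.g.
    {"LC": 32.5, "SO": "SPANNING/FLANKING"}.
    """
    if call["LC"] < min_coverage:
        return False  # insufficient sequence coverage at the locus
    support = call["SO"].split("/")
    # Drop calls where every allele is supported by flanking reads only.
    return any(read_type != "FLANKING" for read_type in support)


# Placeholder paths, for illustration only.
command = build_expansionhunter_command(
    "sample.cram", "GRCh38.fa", "cgg_catalog.json", "sample_cgg")
print(" ".join(command))

calls = [
    {"LC": 32.5, "SO": "SPANNING/FLANKING"},
    {"LC": 32.5, "SO": "FLANKING/FLANKING"},
    {"LC": 3.0, "SO": "SPANNING/SPANNING"},
]
print([keep_call(c) for c in calls])
```

Keeping the threshold as a parameter makes it straightforward to test how sensitive the retained locus set is to the chosen coverage cut-off.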


Steunfonds Marguerite-Marie Delacroix