Dryad

Data from: Length variation in short tandem repeats affects gene expression in natural populations of Arabidopsis thaliana

Cite this dataset

Reinar, William B et al. (2021). Data from: Length variation in short tandem repeats affects gene expression in natural populations of Arabidopsis thaliana [Dataset]. Dryad. https://doi.org/10.5061/dryad.fttdz08sg

Abstract

The genetic basis for the fine-tuned regulation of gene expression is complex and ultimately influences the phenotype and thus the local adaptation of natural populations. Short tandem repeats (STRs) consisting of repetitive DNA motifs have been shown to regulate gene expression. STRs are variable in length within a population and serve as a heritable, but semi-reversible, reservoir of standing genetic variation. For sessile organisms such as plants, STRs could be of major importance in fine-tuning gene expression as a response to a shifting local environment. Here, we used a transcriptome dataset from natural accessions of Arabidopsis thaliana to investigate population-wide gene expression patterns in light of genome-wide STR variation. We empirically modeled gene expression as a response to the STR length within and around the gene and demonstrated that an association between gene expression and STR length variation is unequivocally present in the sampled population. To support our model, we explored the promoter activity in a transcriptional regulator involved in root hair formation and provided experimentally determined causality between coding sequence length variation and promoter activity. Our results support a general link between gene expression variation and STR length variation in A. thaliana.

Methods

Please see the manuscript for details.

Usage notes

This repository contains the Supplemental Data Sets of "Length variation in short tandem repeats affects gene expression in natural populations of Arabidopsis thaliana" (Reinar et al. 2021).

Supplemental Data Set S1: Gene Ontology enrichment of genes with STRs in close proximity (Excel sheet)

Supplemental Data Set S2: Diploid STR unit number counts in the sampled population (tab separated values)

Supplemental Data Set S3: Genetic relatedness of samples (covariance matrix) (tab separated values)

Supplemental Data Set S4: Comparison of scored STR lengths to an independent study (comma separated values)

Supplemental Data Set S5: Modelling results, mock STR genotypes (665,330 tests) (tab separated values)

Supplemental Data Set S6: Modelling results, true STR genotypes (665,364 tests) (tab separated values)

Supplemental Data Set S7: Modelling results, SNPs (893,372 tests) (tab separated values)

Supplemental Data Set S8: Modelling results, SNPs close to eSTRs (2,306 tests) (tab separated values)

Supplemental Data Set S9: Counts used in Fisher Exact tests (Excel sheet)

Supplemental Data Set S10: Named genes affected by eSTRs (Excel sheet)

Supplemental Data Set S11: Sequenced AL6 cDNA from Col-0 and natural sample CS77246 (FASTA file)

Supplemental Data Set S12: GUS and FRET experimental data and statistical analysis (Excel sheet)
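
Several of the data sets listed above are plain tab-separated or comma-separated text, so they can be read without the Notebooks. The following is a minimal sketch using only the Python standard library. The column names and values in the in-memory example are invented placeholders and do not reflect the actual layout of any Supplemental Data Set; the helper name read_delimited is likewise hypothetical.

```python
import csv
import io


def read_delimited(handle, delimiter="\t"):
    """Return (header, rows) from an open text handle of delimited data.

    Assumes the first line is a header row; this is an assumption,
    not something stated in the usage notes.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader]
    return header, rows


# Placeholder content standing in for a real tab-separated file.
# The columns "col_a" and "col_b" are hypothetical.
example = io.StringIO("col_a\tcol_b\nx\t1\ny\t2\n")
header, rows = read_delimited(example)
print(header, len(rows))
```

For a file on disk, the same helper can be given a handle from open(path, newline=""), with delimiter="," for the comma-separated data set. The modelling result tables contain several hundred thousand rows each, so iterating over the reader row by row may be preferable to building a full list.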

To reproduce the figures and run the modelling, we have included Python scripts as Jupyter Notebooks:

GeneExp.Figures.ipynb

GeneExp.Modelling.ipynb

Note that the Python dependencies have to be installed before running the code in the Jupyter Notebooks. Running these Notebooks also requires some additional materials, which are included in this repository.

Extra Material: Log-transformed gene expression values

Extra Material: STRs within 100kb of genes

Extra Material: Gene regions in the Araport11 annotation of TAIR10 (Python pickle format)
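
A file in Python pickle format can be opened with the standard pickle module. The sketch below shows the general pattern as a round trip through a temporary file. The structure of the pickled gene-region object is not described here, so the dictionary used is an invented placeholder, and the file name and the helper load_pickle are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load and return the object stored in a pickle file."""
    # Pickle files must be opened in binary mode.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder object; the real file's structure is an unknown here.
placeholder = {"gene_id_placeholder": ("chromosome_placeholder", 0, 1)}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"  # hypothetical file name
    with open(path, "wb") as handle:
        pickle.dump(placeholder, handle)
    loaded = load_pickle(path)

print(type(loaded).__name__)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust. If the real file contains objects from third-party libraries, those libraries need to be installed before loading.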

Note that the .html versions of the Notebooks can be viewed in a web browser:

GeneExp.Figures.html

GeneExp.Modelling.html

Funding

The Research Council of Norway, Award: 251076, 230849