Data from: Experimental demonstration and pan-structurome prediction of climate-associated riboSNitches in Arabidopsis

Citation

Ferrero-Serrano, Ángel et al. (2022), Data from: Experimental demonstration and pan-structurome prediction of climate-associated riboSNitches in Arabidopsis, Dryad, Dataset, https://doi.org/10.5061/dryad.mw6m905zj

Abstract

Background
Genome-Wide Association Studies (GWAS) aim to correlate phenotypic changes with genotypic variation. Single nucleotide variants (SNVs) within transcripts may alter mRNA structure, with potential impacts on transcript stability, macromolecular interactions and translation. However, no plant genomes have yet been assessed for the presence of these structure-altering polymorphisms, or “riboSNitches”.
Results
We experimentally demonstrate the presence of riboSNitches in transcripts of two Arabidopsis genes, ZINC RIBBON 3 (ZR3) and COTTON GOLGI-RELATED 3 (CGR3), which are associated with continentality and temperature variation in the natural environment. These riboSNitches are associated with differences in the abundance of their respective transcripts, implying a role in regulating gene expression in adaptation to local climate conditions. We computationally predict transcriptome-wide riboSNitches in 879 naturally inbred Arabidopsis accessions. We also characterize correlations between SNPs/riboSNitches in these accessions and 434 climate descriptors of local environments, suggesting a role for these variants in local adaptation. We integrate this information in CLIMtools V2.0 and provide a new web resource, T-CLIM, which allows users to determine the association of transcript abundance variation with climate variation.
Conclusions
We functionally validate two plant riboSNitches and, for the first time, demonstrate that a riboSNitch is conditionally dependent on temperature, coining the term “conditional riboSNitch”. We provide the first pan-genome-wide prediction of riboSNitches in plants. We expand our previous CLIMtools web resource with riboSNitch information and with 1868 additional Arabidopsis genomes and 269 additional climate conditions, which will facilitate in silico studies of natural genetic variation, its phenotypic consequences and its role in local adaptation.

Methods

Compiled geo-climatic variables were obtained through curation of databases. We extracted information on 473 climate variables (see description of environmental parameters) for a comprehensive description of the local environment of previously sequenced Arabidopsis accessions [1–7].

For the calculation of genotype × environment associations (GEA), we used a GWAS approach with each of the numerical environmental parameters included in this study. The online tool GWAPP (http://gwas.gmi.oeaw.ac.at/) was employed using an accelerated mixed model (AMM) [8], which addresses the confounding effects of population stratification, family structure, and cryptic relatedness [9]. This correction can, however, introduce false negatives. We therefore also include here the results of GEA using a linear regression model that does not correct for population structure, likewise obtained using GWAPP [8].

We performed a transcriptome-wide association (TWA) analysis of correlations between transcript abundance and each of the environmental variables included in this study. This analysis used the 558 accessions, within the set of 879 Eurasian accessions, for which transcript abundance information was available from a previous study [10] (GEO dataset with accession number GSE80744 and SRA study SRP074107). Spearman’s rank correlation coefficients between individual climate variables and individual transcript abundance values were calculated using the correlation function of the Hmisc package in R.
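
As an illustration of the statistic only, the following minimal Python sketch computes Spearman's rank correlation as the Pearson correlation of average ranks. It is not the Hmisc code used in the study. The function names, the choice to skip pairs with missing values, and the five example values are all assumptions made for this sketch and are not taken from the dataset.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks.

    Pairs in which either value is None are skipped (an assumption of this sketch).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical values for five accessions (not taken from the dataset).
temperature = [4.1, 7.3, 9.8, 12.0, 15.5]
transcript_abundance = [30.0, 22.5, 25.0, 11.0, 8.2]
rho = spearman(temperature, transcript_abundance)
print(f"Spearman's rho = {rho:.2f}")
```

In a transcriptome-wide setting, the same calculation would be repeated for every combination of transcript and climate variable, giving a matrix of coefficients.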

The SNPfold program [11] was used to predict riboSNitches genome-wide. SNPfold was applied to the 3,830,264 natural variants of protein-coding genes (including introns) within the 879 Eurasian Arabidopsis accessions in our study.

References

1. Alonso-Blanco C, Andrade J, Becker C, Bemm F, Bergelson J, Borgwardt KM, et al. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016;166:481–91.

2. Durvasula A, Fulgione A, Gutaker RM, Alacakaptan SI, Flood PJ, Neto C, et al. African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2017;114:5213–8.

3. Frachon L, Bartoli C, Carrere S, Bouchez O, Chaubet A, Gautier M, et al. A genomic map of climate adaptation in Arabidopsis thaliana at a micro-geographic scale. Front Plant Sci. 2018;9:967.

4. Fulgione A, Koornneef M, Roux F, Hermisson J, Hancock AM. Madeiran Arabidopsis thaliana reveals ancient long-range colonization and clarifies demography in Eurasia. Mol Biol Evol. 2018;35:564–74.

5. Horton MW, Hancock AM, Huang YS, Toomajian C, Atwell S, Auton A, et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nat Genet. 2012;44:212–6.

6. Hsu CW, Lo CY, Lee CR. On the postglacial spread of human commensal Arabidopsis thaliana: journey to the East. New Phytol. 2019;222:1447–57.

7. Zou YP, Hou XH, Wu Q, Chen JF, Li ZW, Han TS, et al. Adaptation of Arabidopsis thaliana to the Yangtze River basin. Genome Biol. 2017;18:239.

8. Seren U, Vilhjalmsson BJ, Horton MW, Meng D, Forai P, Huang YS, et al. GWAPP: a web application for genome-wide association mapping in Arabidopsis. Plant Cell. 2012;24:4793–805.

9. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11:459–63.

10. Kawakatsu T, Huang SSC, Jupe F, Sasaki E, Schmitz RJ, Urich MA, et al. Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell. 2016;166:492–505.

11. Halvorsen M, Martin JS, Broadaway S, Laederach A. Disease-associated mutations that alter the RNA structural ensemble. PLoS Genet. 2010;6:e1001074.

Usage Notes

This Dataset includes the following files:

1. "EURASIAN_ACCESSIONS.TAIR10_52_RiboSNitch_prediction.csv": Genome-wide riboSNitch prediction. Output from the SNPfold program, which was applied to the 3,830,264 natural variants of protein-coding genes (including introns) within the 879 Eurasian Arabidopsis accessions in our study. Column labels in this file represent:
        -chr: chromosome.
        -pos: position.
        -REF: reference allele.
        -ALT: alternative allele.
        -INFO: SNP effect prediction.
        -SYMBOL: symbols for the gene where the SNP is located.
        -LOCUS_ID: locus ID for the gene where the SNP is located.
        -TRANSCRIPT: transcript ID for the gene where the SNP is located.
        -ANNOTATION_CHECK: column confirming that the input file came from the .gff file for the Arabidopsis TAIR10_52 annotation (http://ftp.ensemblgenomes.org/pub/plants/release-52/gff3/arabidopsis_thaliana/). All values in this column should read "gff".
        -LOCATION: column confirming that the input file came from the gene body. All values in this column should read "body".
        -ADDED: nucleotide window used for the prediction on each SNP. All values should read "40".
        -STEP: difference in length between the left and right flanks. All values should read "0".
        -CC_BPRPOB: correlation coefficients for base-pairing probability.
        -CC_SHANNON: Shannon entropy change for all possible single mutations in the RNA.

2. "Environmental_data.csv": environmental data for the accessions included in this study. The column "accession_id" identifies each accession. The remaining column names identify individual environmental variables. Each of these names can be looked up in the "ID" column of the file "Description_of_environmental_variables.xlsx", also provided in this dataset, to obtain more detailed information on that environmental variable.

3. "CLIMATE_TWAS.csv": Spearman’s rank correlation coefficients between individual climate variables (columns) and individual transcript abundance values (rows) genome-wide. The first column, named "LOCUS_ID", provides the locus ID, while the remaining column names identify individual environmental variables. Each of these names can be looked up in the "ID" column of the file "Description_of_environmental_variables.xlsx", also provided in this dataset, to obtain more detailed information on that environmental variable.
 
4. "Environmental_GWAS_2021_LM.zip": We include here the full pre-annotated and pre-filtered results of genome–environment associations using a linear regression model that does not correct for population structure, obtained using GWAPP [8]. The file names coincide with the "ID" provided in the "Description_of_environmental_variables.xlsx" file, where the user can obtain more detailed information on any particular environmental variable. The list of files included in the compressed folder is provided in the file "File_names_Environmental_GWAS_2021_LM.csv". For each of the files in this compressed folder, the column labels represent:
        -chr: chromosome.
        -pos: position.
        -score: association strength between any given environmental variable and individual SNPs (negative logarithm of the P-value).
        -maf: minor allele frequency for individual SNPs (frequency).
        -mac: minor allele count (count).
        -GVE: Genetic variance explained.

5. "Environmental_GWAS_2021_AMM.zip": We include here the full pre-annotated and pre-filtered results of genome–environment associations using a mixed model that corrects for population structure, obtained using GWAPP [8]. The file names coincide with the "ID" provided in the "Description_of_environmental_variables.xlsx" file, where the user can obtain more detailed information on any particular environmental variable. The list of files included in the compressed folder is provided in the file "File_names_Environmental_GWAS_2021_AMM.csv". For each of the files in this compressed folder, the column labels represent:
        -chr: chromosome.
        -pos: position.
        -score: association strength between any given environmental variable and individual SNPs (negative logarithm of the P-value).
        -maf: minor allele frequency for individual SNPs (frequency).
        -mac: minor allele count (count).
        -GVE: Genetic variance explained.

6. "Description_of_environmental_variables.xlsx": This file contains the description of the environmental variables used in this study. The name in the "ID" column coincides with the ID provided for these environmental variables in "CLIMATE_TWAS.csv", as well as with the individual file names in both compressed folders that contain the GWAS output (one folder per GWAS method). For each "ID", this file provides a description of the source, the variable, its units, the analyzed timeframe, the link from which the source files were obtained, the date on which the data were accessed, the type of variable, and descriptive statistics, including the number of missing values per variable.

7. "File_names_Environmental_GWAS_2021_LM.csv": list of files included in the "Environmental_GWAS_2021_LM.zip" folder.

8. "File_names_Environmental_GWAS_2021_AMM.csv": list of files included in the "Environmental_GWAS_2021_AMM.zip" folder.

9. "README.txt"

10. "AraCLIM-V2.zip": Code and data for AraCLIM V2.0. For the latest version, visit https://github.com/CLIMtools/AraCLIM-V2.

11. "GenoCLIM-V2.zip": Code and data for GenoCLIM V2.0. For the latest version, visit https://github.com/CLIMtools/GenoCLIM-V2.

12. "CLIMGeno-V2.zip": Code and data for CLIMGeno V2.0. For the latest version, visit https://github.com/CLIMtools/CLIMGeno-V2.

13. "T-CLIM.zip": Code and data for T-CLIM V2.0. For the latest version, visit https://github.com/CLIMtools/T-CLIM.
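
As a hedged usage sketch, the Python example below shows one way the riboSNitch prediction file and a single GWAS result file from either compressed folder might be combined to list SNPs that are both predicted riboSNitches and associated with an environmental variable. Several points are assumptions of this sketch rather than statements about the dataset: that "score" is the base-10 negative logarithm of the P-value, that lower CC_BPRPOB values indicate a larger predicted structural change, that both files encode chromosomes identically, and that the files are comma-separated with the column labels listed above as headers. The cutoffs (0.8 and 1e-5), the function names, and the in-memory sample rows (including the locus IDs) are hypothetical.

```python
import csv
import io

# Hypothetical excerpts standing in for the real files, which are much larger.
ribosnitch_csv = io.StringIO(
    "chr,pos,REF,ALT,LOCUS_ID,CC_BPRPOB\n"
    "1,1000,A,G,AT1G00001,0.99\n"
    "1,2000,C,T,AT1G00002,0.42\n"
    "2,3000,G,A,AT2G00003,0.75\n"
)
gwas_csv = io.StringIO(
    "chr,pos,score,maf,mac,GVE\n"
    "1,1000,7.5,0.12,105,0.04\n"
    "1,2000,6.2,0.25,220,0.03\n"
    "2,3000,1.3,0.30,264,0.01\n"
)

def load_ribosnitches(handle, cc_cutoff=0.8):
    """Map (chr, pos) to the row of each SNP whose CC_BPRPOB is below cc_cutoff."""
    hits = {}
    for row in csv.DictReader(handle):
        try:
            cc = float(row["CC_BPRPOB"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric coefficient
        if cc < cc_cutoff:
            hits[(row["chr"], int(row["pos"]))] = {**row, "CC_BPRPOB": cc}
    return hits

def climate_associated_ribosnitches(gwas_handle, ribosnitches, p_cutoff=1e-5):
    """Return (locus, (chr, pos), P) for GWAS SNPs that are also predicted riboSNitches."""
    results = []
    for row in csv.DictReader(gwas_handle):
        key = (row["chr"], int(row["pos"]))
        # Assumption: score = -log10(P), so P = 10 ** (-score).
        p_value = 10 ** (-float(row["score"]))
        if key in ribosnitches and p_value <= p_cutoff:
            results.append((ribosnitches[key]["LOCUS_ID"], key, p_value))
    return results

ribosnitches = load_ribosnitches(ribosnitch_csv)
overlap = climate_associated_ribosnitches(gwas_csv, ribosnitches)
for locus, (chrom, pos), p in overlap:
    print(f"{locus} chr{chrom}:{pos} P={p:.2e}")
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Any significance cutoff should be chosen to suit the analysis, since the GWAS files are described as pre-filtered.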

Funding

National Science Foundation, Award: IOS-2122357

USDA-ARS, Award: 8062-21000-041-00D

Pennsylvania State University