Meta analysis of public drought gene expression data in plants

Published May 05, 2026 on Dryad. https://doi.org/10.5061/dryad.7sqv9s50g

Data files

May 05, 2026 version files 1.34 GB

Arabidopsis_TPM_8-14-23.tsv (276.72 MB)
Heat_TPM.txt (32.64 MB)
maize_TPM.tsv (504.31 MB)
README.md (3.51 KB)
rice_TPM.txt.tsv (269.33 MB)
soy_TPM.txt (74.65 MB)
TPM_tomato.tsv (91.88 MB)
wheat_TPM.txt (85.92 MB)

Abstract

Physiologically relevant drought stress is difficult to apply consistently, and the heterogeneity in experimental design, growth conditions, and sampling schemes makes it challenging to compare water deficit studies in plants. Here, we re-analyzed hundreds of drought gene expression experiments across diverse model and crop species and quantified the variability across studies. We found that drought studies are surprisingly uncomparable, even when accounting for differences in genotype, environment, drought severity, and method of drying. Many studies, including most Arabidopsis work, lack high-quality phenotypic and physiological datasets to accompany gene expression, making it impossible to assess the severity or, in some cases, the occurrence of water deficit stress events. From these datasets, we developed supervised learning classifiers that can accurately predict whether RNA-seq samples have experienced a physiologically relevant drought stress, and we suggest that these can be used as a quality control for future studies. Together, our analyses highlight the need for more community standardization and the importance of paired physiology data to quantify stress severity for reproducibility and future data analyses.


Here, we re-analyzed hundreds of drought gene expression experiments across diverse model and crop species and quantified the variability across studies. We assembled a database of drought RNA-seq data in Arabidopsis, soybean, tomato, rice, maize, and wheat from the NCBI Sequence Read Archive (SRA). Bulk data was retrieved using a series of drought stress related keywords with the SRA Advanced Search Builder. The following metadata was collected for each experiment: tissue type(s), developmental stage, environment (e.g., greenhouse, field, or growth chamber), media type, duration of stress, mechanism of drying, associated physiology datasets, genotype, number of timepoints, and number of replicates. Across all six species, 112 studies had a linked publication in the NCBI metadata and 130 had no associated publication. Similar metadata was retrieved for individual SRA samples, along with a binary classification of treatment (drought or control) where possible.

Description of the data and file structure

Raw transcript abundance values are provided as transcripts per million, or TPM, along with the corresponding SRA identifier for each RNA-seq sample. Each species has a separate tab-delimited file. These files contain gene-level TPM values for public drought and control RNA-seq datasets used in the meta-analysis.

The dataset includes the following expression matrices:

Arabidopsis_TPM_8-14-23.tsv
Gene-level TPM matrix for Arabidopsis thaliana RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.

Heat_TPM.txt
Gene-level TPM matrix for heat stress RNA-seq samples used as an additional comparative stress dataset. Rows correspond to genes and columns correspond to SRA samples.

maize_TPM.tsv
Gene-level TPM matrix for maize, Zea mays, RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.

rice_TPM.txt.tsv
Gene-level TPM matrix for rice, Oryza sativa, RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.

soy_TPM.txt
Gene-level TPM matrix for soybean, Glycine max, RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.

TPM_tomato.tsv
Gene-level TPM matrix for tomato, Solanum lycopersicum, RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.

wheat_TPM.txt
Gene-level TPM matrix for wheat, Triticum aestivum, RNA-seq samples included in the drought meta-analysis. Rows correspond to genes and columns correspond to SRA samples.
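
As an illustration of the layout described above (rows are genes, columns are SRA samples, tab-delimited), the following is a minimal Python sketch of a reader that uses only the standard library. It assumes that the first row is a header whose first cell labels the gene column and whose remaining cells are SRA identifiers; that layout has not been confirmed against the files themselves. The function name read_tpm_matrix, the gene identifiers, the SRR accessions, and the values are all hypothetical.

```python
import csv
import io


def read_tpm_matrix(handle):
    """Read a tab-delimited TPM matrix into (sample_ids, {gene_id: [tpm, ...]}).

    Assumption: row 1 is a header (gene column label, then SRA identifiers)
    and every later row is a gene ID followed by one TPM value per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if len(values) != len(sample_ids):
            raise ValueError(
                f"{gene_id}: expected {len(sample_ids)} values, got {len(values)}"
            )
        matrix[gene_id] = [float(v) for v in values]
    return sample_ids, matrix


# Tiny in-memory stand-in for a real file; identifiers and values are made up.
example = io.StringIO(
    "gene_id\tSRR0000001\tSRR0000002\n"
    "geneA\t12.5\t0.0\n"
    "geneB\t3.0\t48.25\n"
)
samples, tpm = read_tpm_matrix(example)
```

For a real file, the same function could be called on an open handle, for example open("maize_TPM.tsv", newline=""). The larger matrices are several hundred megabytes, so a dataframe library may be more practical than nested Python lists.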

Sharing/Access information

Links to other publicly accessible locations of the data:

https://github.com/bobvanburen/Drought_meta_analysis_VanBuren_etal_2024/

Code/Software

Jupyter notebooks are available on GitHub for all analyses of the drought expression data, including filtering, batch effect correction, dimensionality reduction, clustering, and machine-learning-based predictive modeling: https://github.com/bobvanburen/Drought_meta_analysis_VanBuren_etal_2024/

We assembled a database of drought RNA-seq data in Arabidopsis, soybean, tomato, rice, maize, and wheat from the NCBI Sequence Read Archive (SRA). Bulk data was retrieved using a series of drought or heat stress related keywords with the SRA Advanced Search Builder. The following metadata was collected for each experiment: tissue type(s), developmental stage, environment (e.g., greenhouse, field, or growth chamber), media type, duration of stress, mechanism of drying, associated physiology datasets, genotype, number of timepoints, and number of replicates. Across all six species, 112 studies had a linked publication in the NCBI metadata and 130 had no associated publication. Similar metadata was retrieved for individual SRA samples, along with a binary classification of treatment (drought or control) where possible. Metadata was retrieved from the SRA and associated publications, but the lack of publications and the ambiguity of some labels left metadata missing or sparse for many samples. Our manual annotations were therefore conservative, to reduce the mislabeling of samples used for analysis and downstream predictive modeling.
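
The experiment-level and sample-level metadata described above could be represented roughly as follows. This is a hypothetical Python sketch, not the schema used by the authors: the class names, field names, and types are assumptions chosen to mirror the fields listed in the text. Every field that may be missing is optional, and a sample whose treatment is ambiguous stays unlabeled rather than being forced into one of the two classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExperimentMetadata:
    """One record per study; field names are illustrative assumptions."""
    study_id: str
    species: str
    tissue_types: List[str] = field(default_factory=list)
    developmental_stage: Optional[str] = None
    environment: Optional[str] = None      # e.g. greenhouse, field, growth chamber
    media_type: Optional[str] = None
    stress_duration: Optional[str] = None
    drying_mechanism: Optional[str] = None
    has_physiology_data: Optional[bool] = None
    genotype: Optional[str] = None
    n_timepoints: Optional[int] = None
    n_replicates: Optional[int] = None
    linked_publication: Optional[str] = None  # None when no publication is linked


@dataclass
class SampleLabel:
    """One record per SRA sample with a binary treatment label where possible."""
    sra_id: str
    treatment: Optional[str] = None  # "drought", "control", or None if ambiguous

    def __post_init__(self):
        if self.treatment not in ("drought", "control", None):
            raise ValueError(f"unexpected treatment label: {self.treatment!r}")


def labeled_only(samples):
    """Keep only samples with an unambiguous label (conservative filtering)."""
    return [s for s in samples if s.treatment is not None]


# Made-up accessions for illustration only.
samples = [
    SampleLabel("SRR0000001", "drought"),
    SampleLabel("SRR0000002", "control"),
    SampleLabel("SRR0000003"),  # ambiguous, left unlabeled
]
usable = labeled_only(samples)
```

Keeping unlabeled samples in the table while excluding them from modeling keeps the missingness visible instead of hiding it behind a guessed label.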

Raw RNA-seq reads were downloaded from the NCBI SRA and processed with a pipeline that trims reads, aligns them, and quantifies gene expression (https://github.com/pardojer23/RNAseqV2). Briefly, sequence adapters were trimmed and a quality check was performed on the raw FASTQ files using the fastp program (v0.23.2). The cleaned sequencing reads were then pseudo-aligned to the Arabidopsis TAIR10 (Cheng et al., 2017), maize (Zea mays B73 V5) (Hufford et al., 2021), rice (Oryza sativa Kitaake v3.1) (Jain et al., 2019), tomato (Solanum lycopersicum ITAG4.0) (Hosmani et al., 2019), soybean (Glycine max var. Williams 82 V4) (Valliyodan et al., 2019), and wheat (Triticum aestivum cv. Chinese Spring RefSeq v2.1) (Zhu et al., 2021) genomes using salmon (v1.6) (Patro et al., 2017). The transcript-level counts were converted to gene level using the R package tximport (v1.22.0) (Soneson et al., 2015). Raw TPMs or log2(TPM + 1)-transformed values were used for downstream analyses. The median alignment rate was 69.1% across all species, with 70.8% in Arabidopsis, 79.5% in maize, 62.0% in rice, 65.1% in tomato, 64.7% in soybean, and 64.9% in wheat. These alignment rates are consistent with other meta-analyses of gene expression in Arabidopsis (Zhang et al., 2020).
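
The arithmetic behind the last two processing steps can be sketched in a few lines of Python. This is not the pipeline itself, which uses salmon and the R package tximport; it only illustrates that a gene-level TPM is the sum of the TPMs of that gene's transcripts, and that the log transformation adds a pseudocount of 1 before taking log2 so that zero expression maps to zero. The function names, transcript identifiers, and values are hypothetical.

```python
import math
from collections import defaultdict


def summarize_to_gene(transcript_tpm, tx2gene):
    """Sum transcript-level TPMs into gene-level TPMs.

    transcript_tpm: {transcript_id: tpm}
    tx2gene:        {transcript_id: gene_id}
    """
    gene_tpm = defaultdict(float)
    for transcript_id, value in transcript_tpm.items():
        gene_tpm[tx2gene[transcript_id]] += value
    return dict(gene_tpm)


def log2_plus_one(gene_tpm):
    """Apply log2(TPM + 1) to every gene; a TPM of 0 maps to 0."""
    return {gene: math.log2(value + 1.0) for gene, value in gene_tpm.items()}


# Made-up values for one sample: geneA has two transcripts, geneB has one.
transcript_tpm = {"geneA.1": 2.0, "geneA.2": 1.0, "geneB.1": 0.0}
tx2gene = {"geneA.1": "geneA", "geneA.2": "geneA", "geneB.1": "geneB"}

gene_tpm = summarize_to_gene(transcript_tpm, tx2gene)  # geneA: 3.0, geneB: 0.0
log_tpm = log2_plus_one(gene_tpm)                      # geneA: 2.0, geneB: 0.0
```

The pseudocount keeps unexpressed genes defined on the log scale and compresses the dynamic range, which is typically helpful before dimensionality reduction or clustering.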