Data from: A comparative analysis of stably expressed genes across diverse angiosperms exposes flexibility in underlying promoter architecture
Cite this dataset
Yang, Eric; Maranas, Cassandra; Nemhauser, Jennifer (2023). Data from: A comparative analysis of stably expressed genes across diverse angiosperms exposes flexibility in underlying promoter architecture [Dataset]. Dryad. https://doi.org/10.5061/dryad.9w0vt4bmk
Abstract
Promoters regulate both the amplitude and pattern of gene expression—key factors needed for optimization of many synthetic biology applications. Previous work in Arabidopsis found that promoters that contain a TATA-box element tend to be expressed only under specific conditions or in particular tissues, while promoters which lack any known promoter elements, thus designated as Coreless, tend to be expressed more ubiquitously. To test whether this trend represents a conserved promoter design rule, we identified stably expressed genes across multiple angiosperm species using publicly available RNA-seq data. Comparisons between core promoter architectures and gene expression stability revealed differences in core promoter usage in monocots and eudicots. Furthermore, when tracing the evolution of a given promoter across species, we found that core promoter type was not a strong predictor of expression stability. Our analysis suggests that core promoter types are correlative rather than causative in promoter expression patterns and highlights the challenges in finding or building constitutive promoters that will work across diverse plant species.
README: A comparative analysis of stably expressed genes across diverse angiosperms exposes flexibility in underlying promoter architecture
The dataset contains raw counts from RNA-seq atlases (RNA-seq datasets on multiple tissues throughout development) from fifteen species: Arabidopsis thaliana, Camelina sativa, Cucumis melo, Glycine max, Phaseolus vulgaris, Pisum sativum, Vigna unguiculata, Sorghum bicolor, Zea mays, Solanum lycopersicum, Actinidia chinensis, Triticum aestivum, Arachis hypogaea, Cicer arietinum, and Solanum tuberosum.
Expression stability (coefficient of variation) and mean expression strength (geometric mean) were calculated for each transcript to determine expression pattern.
Core promoter elements were scanned for each transcripts to determine promoter core type.
Relationships between core promoter elements and expression pattern were explored.
We were able to map gene expression pattern onto core promoter type in multiple genomic contexts. We found that while TATA-box-containing promoters are consistently over-represented in selectively expressed genes, Coreless promoters are only associated with stable expression in eudicots but not monocots. Additionally, by identifying orthologous gene groups within these species, we were able to track changes in core promoter type and expression pattern for groups of evolutionarily related promoters. We found that stably expressed genes are also more likely to have orthologs in other species compared to unstably expressed genes, and the orthologs tend to retain similar expression patterns. Lastly, we show that changes in core promoter types do not explain changes in expression pattern. This evolution-guided approach reveals design rules surrounding core promoter architecture and expression patterns.
Description of data and file structure
There are three main folders:
- Data: Contains all necessay input data to run the scripts as well as the output data from the scripts. Files in the folders are reference by relative path in the scripts.
- Summary.csv: unformatted version of Supplemental_Table_S3
- Raw_Counts: Output from Kallisto broken down by species
- Promoters:
- Y_40pcent_Z: Subset of promoters used to generate Figure2
- Y_orthologs_Z: Subset of promoters used to generate Figure4
- X_Y_annotation: details and genomic position of the promoter regions after parsing the gff3
- X_Y_sequence: actual sequence of promoter region
- Orthologs: Intermediate files generated for ortholog related analysis
- Normalized_Counts: Raw_Counts processed by DESeq2
- Metadata: Metadata regarding experimental setup for the RNAseq datasets. Used to perform DESeq2
- Genomes: Reference transcriptome and gene annotation (gff3) organized by species
- !!! Please note: due to licencing reasons, files in this folder is uploaded as supplemental information (Data_Genome.zip) on Dryad. Please download the file and unzip the folders into Data/Genomes/ and the scripts should work as intended !!!
- Core_Promoters: Analysis regarding core promoter types
- 40pcent_promoters: Analysis for Figure2 and Supplemental_Figure_S1
- X_MotifScores: numerical value of predicted motif scores for each promoter
- X_MotifsLabeled: Binary True/False for each motif based on motif scores with CV and geometric mean information
- Orthologs: Final Candidates
- Motif_results.csv: Motif scoring result (numeric value)
- Octamer_results.csv: Octamer presence or absence (True/False)
- Figures: Output folder for figures generated from the analysis.
- Scripts: All the scripts necessay to perform the analysis. The datasets are numbered by the sequence they need to be run to recreate the analysis, and each script explicitly refers to necessay input files in the Data folder.
- 0_Slurm_Pipeline are slurm/shell scripts performed on UW's Hyak computing cluster, taking in NCBI SRA numbers and generating transcript count tables
Details for 0_Slurm_Pipeline
This folder contains slurm/shell scripts
- Each step (with the exception of 5_kallisto_index.slurm) is performed in parallel
- Output files are saved under folder [Species_name]
- The scripts always take three arguments:
1. path to list_of_SRA.txt
2. SE (single end) or PE (paired ends)
3. Species_name (For the purpose of directing to the correct input/output folder)
Not all the arguments are used for all scripts but they're always included for consistency sake
- !!!Before running, the scripts will have to be modified to direct slurm to your folder of choice as they are coded currently with absolute path!!!
- 1_prefetch_parallel.slurm: Prefetch data given a list of NCBI SRA numbers
- modules: sratoolkit 3.0.1
- usage example: 1_prefetch_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 2_fasterq-dump_parallel.slurm: Download fasta files given a list of NCBI SRA numbers
- modules: sratoolkit 3.0.1
- usage example: 2_fasterq-dump_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 3_fastqc_parallel.slurm: fastqc on downloaded fasta files
- modules: java 17.0.5, fastqc 0.11.9
- usage example: 3_fastqc_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 4_trimmomatic_parallel.slurm: performs trimmomatic QC step
- modules: java 17.0.5, trimmomatic 0.39
- usage example: 4_trimmomatic_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 5_kallisto_index.slurm: Kallisto requires building an index from the reference transcriptome
- modules: kallisto 0.44.0
- usage example: The script has to be modified to direct slurm to the folder output_species_name/reference_transcriptome.fa.gz\, and the script can be run by calling 5_kallisto_index.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 6_kallisto_parallel.slurm: running Kallisto
- modules: kallisto 0.44.0
- usage example: 6_kallisto_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 7_consolidate_parallel.slurm: Copies individual count files (abundance.tsv) into the same folder (Counts)
- modules: None
- usage example: 7_consolidate_parallel.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
- 8_fastqc_parallel_postTrim.slurm: Another fastqc after the trimming step to ensure the trim was successful
- modules: java 17.0.5, fastqc 0.11.9
- usage example: 8_fastqc_parallel_postTrim.slurm [list_of_SRA.txt] [SE or PE] [Species_name]
Details for 1_Metadata_from_RUNselector.Rmd
R packages:
- dplyr 1.1.2
Purpose:
- Generate DESeq2 compatible metadata files
Inputs:
- Run-selector information from NCBI database or additional information gathered from the original study
- Additional information can be located in ../Data/Metadata/
- ../Data/Metadata/[Species_name].txt
Outputs:
- DEseq2 compatible metadata files
- ../Data/Metadata/[Species_name].csv
Details for 2_MOR_Normalization.Rmd
R packages:
- tidyverse 2.0.0
- DESeq2 1.38.3
Purpose:
- Takes output from Kallisto (in ../Data/Raw_Counts) and normalize according to the experimental setup (Metadata) using DESeq2's Normalization function.
Inputs:
- Run-selector information from NCBI database or additional information gathered from the original study
- Additional information can be located in ../Data/Metadata/
- ../Data/Metadata/[Species_name].txt
Outputs:
- DEseq2 compatible metadata files
- ../Data/Metadata/[Species_name].csv
Details for 3_ExtractPromUTR(ALL_Transcripts).ipynb
Python packages:
- gffpandas.gffpandas 0.1.0
- pandas 2.0.0
Purpose:
- This samples 40% of the total transcriptome of a species and first extract the intergenic region and 5'UTR annotation according to the associated gff3 file, and then use the annotation to extract actual sequences from the reference genome.
Parsing GFF3 annotation:
Inputs:
- List of transcripts (../Data/Normalized_Counts/[Genus_species_Summary].csv)
- Gff3 annotation file (../Data/Genomes/[Genus_species]/*.gff3)
Outputs:
- Position information for intergenic region and 5'UTR (../Data/Promoters/[Species]_40pcent_annotation.csv)
Extracting Sequences:
Inputs:
- Position information for intergenic region and 5'UTR (../Data/Promoters/[Species]_40pcent_annotation.csv)
- Reference genomes (../Data/Genomes/[Genus_species]/*.fa)
Outputs:
- Extracted Sequences ("../Data/Promoters/[Species]_40pcent_sequence.csv")
Note: the codes uses pd.str.contains() to search for seach gene in a loop, which isn't terribly efficient once the file gets big and might take hours to run. Parsing the Attributes might be a better way to go.
Details for 4_Label_Promoters.Rmd
R packages:
- tidyverse 2.0.0
- universalmotif 1.16.0
- ggplot2 3.4.2
- showtext 0.9.6
Purpose:
- Label the "promoter" regions of the 40% extracted genes as TATA, Inr, Ypatch, or Coreless
Input files:
- Extracted promoter sequences stored in ../Data/Promoters/[Genus_species]_40pcent_sequence.csv
- CPEs.meme files for the motifs stored in ../Data/Core_Promoters/ (obtained from Jores et al.)
- maize_TATA.meme is generated by modifying the meme file above and changing the TATA motif to a maize specific motif published in (Mejía-Guerra et al. 2015)
Output files:
- Scores from motif scanning are stored in ../Data/Core_Promoters/40pcent_promoters/[Genus_species]_MotifScores.csv
- Scores are summarized to presence or absence of motif and merged with CV, geom_mean data are stored in ../Data/Core_Promoters/40pcent_promoters/[Genus_species]_MotifsLabeled.csv
- Figure2 is saved in ../Figures/Figure2.svg
- Supplemental_Figure_S1_A is saved in ../Figures/Supplemental_Figure_S1_A.svg
- Supplemental_Figure_S2 is saved in ../Figures/Supplemental_Figure_S2.svg
- Supplemental_Figure_S1_B is saved in ../Figures/Supplemental_Figure_S1_B.svg
Details for 5_At_gene_ranking.Rmd
R packages:
- tidyverse 2.0.0
Purpose:
- Extract the top 5% stable, bottom 5% stable, and 5% random Arabidopsis genes for homology analysis
Inputs:
- ../Data/Normalized_Counts/Arabidopsis_thaliana_Summary.csv
(Generated from 2_MOR_Normalization.Rmd)
Output:
- ../Data/Orthologs/arabidopsis_gene_set.csv
Details for 6_Identifying_orthologs.Rmd
R packages:
- biomaRt 2.54.1
- tidyverse 2.0.0
Purpose:
- Takes the top 5% stable, 5% unstable, and 5% random genes from Arabidopsis and fetch their orthologs via biomaRt
inputs:
- Arabidopsis gene sets generated from step 5_At_gene_ranking.Rmd (../Data/Orthologs/arabidopsis_gene_set.csv)
outputs:
- Raw output from Phytozome is saved in ../Data/Orthologs/Phytozome_orthologs.csv
- Entries fetched from different species versions are collapsed as the same species in ../Data/Orthologs/Phytozome_orthologs_condensed.csv
- Output from Ensembl is saved in ../Data/Orthologs/Ensembl_orthologs.csv
- Merged orthologs with arabidopsis CV and geom_mean information is stored in ../Data/Orthologs/All_orthologs.csv
- ../Figures/Figure3.svg
Details for 7_Processing_orthologs.Rmd
R packages:
- tidyverse 2.0.0
Purpose:
- Taking the ortholog list, filtering down the interesting candidates with flipped expression patterns in at least two species.
inputs:
- ../Data/Orthologs/All_orthologs.csv generated from 6_Identifying_orthologs.Rmd
outputs:
- Candidates that
1. Changed expression pattern in at least 2 species
2. Had changes that mapped onto phylogenetic tree in a reasonable manner
were selected for and written to ../Data/Orthologs/High_Confidence_Candidates.csv
- Candidates that also had been verified by blast-align tree pipeline (see manuscript) were selected for with CV and geom_mean information attached in ../Data/Orthologs/Verified_Candidates.csv
Details for 8_ExtractPromUTR(Orthologs).ipynb
Python packages:
- gffpandas.gffpandas 0.1.0
- pandas 2.0.0
Purpose:
- Similar to 3_ExtractPromUTR(ALL_Transcripts).ipynb, parse gff3 file for just the Verified_Candidates list, and extract promoters.
inputs:
- ../Data/Orthologs/Verified_Candidates.csv generated from 7_Processing_orthologs.Rmd
- Gff3 files from ../Data/Genomes/[Genus_species]/*.gff3
- reference genome from ../Data/Genomes/[Genus_species]/*.fa
outputs:
- extracted gff3 info into ../Data/Promoters/[Genus_species]_orthologs_annotation.csv (from extract_prom_utr function)
- extracted sequences into ../Data/Promoters/[Genus_species]_orthologs_sequence.csv (from extract_sequence function)
Details for 9_Motif_Scan.Rmd
R packages:
- readr 2.1.4
- tidyverse 2.0.0
- tidyr 1.3.0
- tibble 3.2.1
- stringr 1.5.0
- universalmotif 1.16.0
Purpose:
- First merge Verified_Candidates list with promoter info, then label the candidate promoters as TATA, Coreless, Ypatch, Inr similar to 4_Label_Promoters.Rmd
- The code is modified from Jores et al., Nat. Plants 2021
Inputs:
- Verified_candidates list from ../Data/Orthologs/Verified_candidates.csv generated by 7_Processing_orthologs.Rmd
- "promoter" and UTR sequence from ../Data/Promoters/[Genus_species]_orthologs_sequence.csv generated by 8_ExtractPromUTR(Orthologs).ipynb
Note the "promoter" here is up to 2kb of the intergenic region as defined in 8_ExtractPromUTR(Orthologs).ipynb
Outputs:
- Verified_candidates list merged with sequence info: ../Data/Orthologs/Verified_candidates_seq.csv
- Motif scan merged with Verified_candidates list in ../Data/Core_Promoters/Orthologs/Motif_results.csv
Details for 10_Octamer_Scan.ipynb
Python packages:
- pandas 2.0.0
- Biopython 1.81
Purpose:
- Scans input promoter sequences for TATA, Y patch, CA, GA octamer motifs
- The octamers are defined by Yamamoto et al., 2009
Inputs:
- Verified_candidates_seq list from ../Data/Orthologs/Verified_candidates_seq.csv generated by 8_ExtractPromUTR(Orthologs).ipynb
Note the "promoter" region in the dataframe is up to 2kb of intergenic region as defined in 8_ExtractPromUTR(Orthologs).ipynb
Outputs:
- Output result from motif scanning: ../Data/Core_Promoters/Orthologs/Octamer_results.csv
Details for 11_Append_Geom_mean_PR.Rmd
R packages:
- tidyverse 2.0.0
Purpose:
- Finish building the final dataframe:
- Merge Motif screen results with Octamer scan results
- Append geom_mean PR info
Inputs:
- Motif scan output from ../Data/Core_Promoters/Orthologs/Motif_results.csv
- Octomer scan output from ../Data/Core_Promoters/Orthologs/Octamer_results.csv
Outputs:
- ../Data/Summary.csv is written. It is turned into xlsx and conditionally formatted to become Supplemental_Table_S3
Sharing/Access information
Data was derived from the following sources:
RNA-seq Datasets were retrived from NCBI via their BioProject number: PRJNA268115, PRJNA314076, PRJNA194429, PRJNA231618, PRJDB6414, PRJNA291488, PRJNA413872, PRJNA79597, PRJNA221782, PRJNA277076, PRJNA389300, SRA558272, SRA558514, SRA558539, PRJNA171684, SRP010680, PRJNA507622, PRJNA691387, PRJEB25639, PRJEB2430
Reference transcriptomes and reference genomes were downloaded from the Ensembl Plants (http://plants.ensembl.org/index.html) for Arabidopsis thaliana, Camelina sativa, Cucumis melo, Glycine max, Phaseolus vulgaris, Pisum sativum, Vigna unguiculata, Sorghum bicolor, Zea mays, Solanum lycopersicum, Actinidia chinensis, Triticum aestivum. and Phytozome (https://phytozome-next.jgi.doe.gov) for Arachis hypogaea, Cicer arietinum, and Solanum tuberosum.
Methods
RNA-seq dataset processing
(Relevant files: 0_Slurm_Pipeline)
RNA-seq atlases were located in the NCBI Sequence Read Archive (SRA) database. The references for the datasets can be found in Supplemental Table S1. The individual datasets were retrieved using sratoolkit-3.0.1 prefetch followed by fasterq-dump functions. Fastqc-0.11.9 were used to generate a QC report for each dataset. Trimmomatic-0.39 were used for adaptor and low quality ends trimming using the following settings: ‘SLIDINGWINDOW:4:20 MINLEN:36’. ILLUMINACLIP files TruSEq3-PE-2.fa was supplied for paired end data and TruSEq3-SE.fa were supplied for single end data. Reference transcriptome were downloaded from the Ensembl Plants (http://plants.ensembl.org/index.html) for Arabidopsis thaliana, Camelina sativa, Cucumis melo, Glycine max, Phaseolus vulgaris, Pisum sativum, Vigna unguiculata, Sorghum bicolor, Zea mays, Solanum lycopersicum, Actinidia chinensis, Triticum aestivum and Phytozome (https://phytozome-next.jgi.doe.gov) for Arachis hypogaea, Cicer arietinum, and Solanum tuberosum (Cunningham et al., 2021; Goodstein et al., 2012). An index file was generated and the reads aligned and counted using Kallisto-0.44.0 with ‘-o counts -b 500’. For single end data, Fragment Length and Standard Deviation were required, but the information is difficult to locate, and so a default value of ‘-l 200 -s 20’ were used across the board.
Another Fastqc was performed on the trimmed files, and a final MultiQC-1.13 were run on the entire folder encompassing all the log files that Fastqc, Trimmomatic, and Kallisto generated. The MultiQC report was inspected to ensure the trimming step improved read quality and there were no major warnings.
Normalizing count, Calculating CV and Percent Ranking
(Relevant files: 1_Metadata_from_RUNselector.Rmd, 2_MOR_Normalization.Rmd)
Using an R script, the raw counts for each species were normalized using the DESeq2 package using a metadata file curated from the original study for the RNA-seq datasets. The coefficient of variation across all samples for a given atlas was used as a metric for stability for each gene, and the percentile ranking for each gene was calculated. The geometric mean for each gene was also calculated across all samples.
Extracting intergenic region and 5’UTR
(Relevant files: 3_ExtractPromUTR(ALL_Transcripts).ipynb, 8_ExtractPromUTR(Orthologs).ipynb)
Gff3 annotation files and reference genomes were downloaded from Ensembl or Phytozome depending on where the reference transcriptomes were retrieved from. 40% of transcripts were selected from the total transcriptome and their intergenic region and 5’UTR were extracted from the Gff3 annotation. Intergenic region and 5’UTRs of identified orthologs were extracted in a similar manner.
Labeling core promoter types
(Relevant files: 4_Label_Promoters.Rmd, 9_Motif_Scan.Rmd, 10_Octamer_Scan.ipynb)
Motif Scan: Intergenic regions and 5’UTR sequences are trimmed to only regions to be scanned for each core promoter types: TATA box (-100 to TSS), Y patch (-100 to +100), and Inr (-10 to +10). Intergenic regions shorter than 100bps were excluded from analysis. Each regions were scanned for their respective motifs according using motif files as well as methods outlined in (Jores et al., 2021). A motif is considered to be present when the relative motif scores are above 0.85.
Octamer Scan: Intergenic regions and 5’UTR sequences were trimmed based on the positions relative to the TSS outlined in Yamamoto et al. 2009 (TATA, −45 to −18; Y Patch, −50 to +50; CA, −35 to −1; GA, −35 to +75). Each region was scanned for the presence of octamer motifs from the TATA, Y patch, GA, and CA lists outlined in Yamamoto et al. 2009. If the specified region contained at least one motif for a given promoter type, it was labeled as positive.
Ortholog Analysis
(Relevant files: 5_At_gene_ranking.Rmd, 6_Identifying_orthologs.Rmd, 7_Processing_orthologs.Rmd)
The Arabidopsis transcriptome was filtered to only include primary transcripts, and mitochondria as well as chloroplast transcripts were removed. Top 5% stable genes by CV, bottom 5% stable genes by CV and a random set of 1343 genes (5%) were randomly selected.
Using biomaRt in R, the Ensembl and Phytozome databases were queried for orthologs for the selected set of Arabdiopsis genes for each species (Durinck et al., 2009). Orthologs from Arachis hypogaea, Cicer arietinum, and Solanum tuberosum were retrieved from Phytozome, and the rest of the species from Ensembl. For analysis in Figure3B, significance test of done by ANOVA followed by Tukey’s HSD. For each target gene that matched to an Arabidopsis transcript, only the highest expressing transcript was kept. If an Arabidopsis transcript retrieved more than one orthologs from a target species, these pairs of orthologs were removed from analysis. We only kept orthologous gene groups that had a “change” in expression pattern, defined as crossing the 50th percentile CV, in two target species, and the remaining candidates were manually mapped onto the phylogenetic tree to identify gene groups that had changes in expression pattern that are consistent with the tree. This means having changes in expression pattern that are mostly found in the same clade. Gene trees were built for these candidates using blast-align-tree (https://github.com/steinbrennerlab/blast-align-tree) and the candidate lists were further trimmed based on the gene trees to ensure a 1:1 relationship between all members in the gene group.
The dataset contains all the necessary scripts to transform the data as described in the manuscript and perform the analysis in the paper.
Usage notes
Scripts are written in R and Python.
Funding
National Institute of General Medical Sciences, Award: R01-GM107084
National Science Foundation, Award: IOS-1546873
Howard Hughes Medical Institute, Award: Faculty Scholar Award