The cacao gene atlas: A transcriptome developmental atlas reveals highly tissue-specific and dynamically-regulated gene networks in Theobroma cacao L
Data files
Aug 14, 2023 version files 329.48 MB
- README.md
- SupplementalDataSetS4.xlsx
- SupplementalDataSetS5.xlsx
Sep 05, 2023 version files 329.47 MB
- Additional_File_13._CPM_Normalized_Counts.xlsx
- Additional_File_14._Fractional_Counts.xlsx
- README.md
Jun 27, 2024 version files 347.94 MB
- Additional_File_6._Replicate_Count_Data_and_Metadata.xlsx
- README.md
Abstract
A large dataset of replicated transcriptomes was developed to accelerate Theobroma cacao genomics research, with the long-term goal of advancing breeding toward high-yielding elite varieties of cacao. RNAs were extracted and transcriptomes were sequenced from 123 different tissues and stages of development representing major organs and developmental stages of the cacao lifecycle. In addition, several experimental treatments and time courses were performed to measure gene expression in tissues responding to biotic and abiotic stressors. Samples were collected in replicates (3-5) to enable statistical analysis of gene expression levels, for a total of 390 transcriptomes. We describe the creation of the atlas and its global characterization, and define sets of genes co-regulated in highly organ- and temporally-specific manners. To promote wider use of these data, all raw sequencing data, expression read mapping matrices, scripts, and other information used to create the resource are freely available online. A gene expression browser with a graphical user interface was developed to display gene expression patterns and to provide easy access to raw data and statistical analyses.
README: The cacao gene atlas: A transcriptome developmental atlas reveals highly tissue-specific and dynamically-regulated gene networks in Theobroma cacao L
Description of the Data and file structure
- The first row lists all tissues, replicates and time points for each sample. The first column lists each cacao gene that was detected. All other cells contain the number of transcripts that were mapped for each gene/sample combination.
- CPM counts are normalized to counts per million; these are the values used on the BAR website
- To compare values in the gene expression matrix with the BAR website, be sure to use the CPM read counts
- Fractional reads are unnormalized raw reads to be used for downstream analysis such as DESeq2; do not compare these counts with the counts on BAR
- The genotype of each tissue is indicated in the metadata. The data were mapped to multiple genomes, which may differ from the genotype of the tissue
- Example: CCN51 tissues were mapped to both the CCN51 genome and the SCA6 genome
- The Leaf Development and Leaf Infection atlases were mapped to the Criollo v2.0 genome
- Note that gene accession numbers for identical genes are different across the different assemblies
- To identify identical genes across assemblies, perform a reciprocal BLAST or consult Additional File 11.
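The reciprocal BLAST mentioned in the last item can be sketched as follows. This is a minimal illustration, not the procedure used to build Additional File 11: it assumes each search was saved in standard tab-separated BLAST tabular output (12 columns, query ID first, subject ID second, bit score last), and the gene IDs shown are hypothetical placeholders rather than real accessions. A pair is kept only when each gene is the other's best-scoring hit.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, str]:
    """Map each query ID to the subject ID with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in rows:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed layout: column 1 = query, column 2 = subject, column 12 = bit score.
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[str], b_vs_a: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (gene_in_A, gene_in_B) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical gene IDs and scores, for illustration only.
ccn_vs_sca = [
    "ccn_g1\tsca_g9\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t540",
    "ccn_g1\tsca_g7\t80.0\t300\t60\t0\t1\t300\t1\t300\t1e-50\t210",
    "ccn_g2\tsca_g7\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-70\t250",
]
sca_vs_ccn = [
    "sca_g9\tccn_g1\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t540",
    "sca_g7\tccn_g3\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-120\t400",
]
pairs = reciprocal_best_hits(ccn_vs_sca, sca_vs_ccn)
```

Here `ccn_g2` is dropped because its best hit, `sca_g7`, points back to a different gene. In practice the input lines would come from the two BLAST result files rather than from in-memory lists.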
Sharing/access Information
Links to other publicly accessible locations of the data:
NCBI Master Bioproject Guiltinan-Maximova Lab: https://www.ncbi.nlm.nih.gov/bioproject/936437
The Cacao SCA eFP Browser is at https://bar.utoronto.ca/efp_cacao_sca/cgi-bin/efpWeb.cgi
The Cacao CCN eFP Browser is at https://bar.utoronto.ca/efp_cacao_ccn/cgi-bin/efpWeb.cgi
The Cacao TC eFP Browser is at https://bar.utoronto.ca/efp_cacao_tc/cgi-bin/efpWeb.cgi
Was data derived from another source? No. The original data were produced in the Guiltinan-Maximova lab at Penn State.
Methods
RNA was extracted from about 400 different tissues/treatments and replicates. Transcriptome sequencing was performed by QuantSeq (Lexogen). Raw QuantSeq reads were first examined with FastQC (v0.11.9; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the overall data quality before processing. Reads were then processed using bbduk (BBMap tools v37.76; https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) to trim the adapter sequences, poly-A tails, and low-quality bases and to discard fragments less than 20 bp in length after trimming. Trimmed reads were mapped to the CCN-51 and SCA6 Theobroma cacao genotype reference genomes using the STAR Aligner version 2.7.5b (Dobin et al. 2013). Expression quantification was performed with featureCounts from the Subread package version 2.0.1 (Liao et al. 2013) in a fractional read-counting mode to proportionally distribute multi-mapping reads among features, using gene annotation GFF3 files modified with GenomeTools version 1.5.9 (Gremme et al. 2013) to include intron coordinates. The count matrices were normalized to counts per million (CPM) values using the default parameters of the cpm function in the edgeR Bioconductor package (Robinson et al. 2010). Annotations were performed as described in Winters (2023). Analysis commands used in the QuantSeq read processing are reported in Supplemental Data Set 3.
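The CPM normalization step can be illustrated with a short sketch. It is not the edgeR implementation; it only reproduces the basic arithmetic (each count divided by its sample's total count, multiplied by one million), assuming the library size is the column sum and no additional normalization factors are applied. The gene names and count values are hypothetical.

```python
from typing import Dict, List


def counts_per_million(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale every sample (column) so that its counts sum to one million."""
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    # Library size: total count in each sample column (an assumption of this sketch).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            row[j] / lib_sizes[j] * 1e6 if lib_sizes[j] else 0.0
            for j in range(n_samples)
        ]
        for gene, row in counts.items()
    }


# Hypothetical genes and fractional counts: rows are genes, columns are samples.
example = {
    "gene_A": [10.0, 0.5],
    "gene_B": [30.0, 1.5],
    "gene_C": [60.0, 2.0],
}
cpm = counts_per_million(example)
```

Because fractional counting can assign non-integer counts to a gene, the sketch uses floats throughout. After scaling, each sample column sums to one million, which makes expression values comparable across samples with different sequencing depths.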
Usage notes
The data files can be opened with Excel or any other spreadsheet program; the README can be read in any text editor.