A large dataset of replicated transcriptomes was developed to accelerate Theobroma cacao genomics research, with the long-term goal of advancing the breeding of high-yielding elite cacao varieties. RNA was extracted and transcriptomes were sequenced from 123 different tissues and stages of development, representing the major organs and developmental stages of the cacao life cycle. In addition, several experimental treatments and time courses were performed to measure gene expression in tissues responding to biotic and abiotic stressors. Samples were collected in replicates (3-5) to enable statistical analysis of gene expression levels, for a total of 390 transcriptomes. We describe the creation of the atlas and its global characterization, and define sets of genes co-regulated in highly organ-specific and temporally specific manners. To promote wider use of these data, all raw sequencing data, expression read mapping matrices, scripts, and other information used to create the resource are freely available online. A gene expression browser with a graphical user interface was developed to display gene expression patterns and to provide easy access to raw data and statistical analyses.

RNA was extracted from about 400 different tissues/treatments and replicates. Transcriptome sequencing was performed by QuantSeq (Lexogen). Raw QuantSeq reads were first examined with FastQC (v0.11.9; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess overall data quality before processing. Reads were then processed with bbduk (BBMap tools v37.76; https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) to trim adapter sequences, poly-A tails, and low-quality bases, and to discard fragments shorter than 20 bp after trimming. Trimmed reads were mapped to the reference genomes of the Theobroma cacao genotypes CCN-51 and SCA6 using the STAR aligner version 2.7.5b (Dobin et al. 2013). Expression was quantified with featureCounts from the Subread package version 2.0.1 (Liao et al. 2013) in fractional read-counting mode, which distributes multi-mapping reads proportionally among features. The gene annotation GFF3 files used for counting were modified with GenomeTools version 1.5.9 (Gremme et al. 2013) to include intron coordinates. The count matrices were normalized to counts per million (CPM) using the default parameters of the cpm function in the edgeR Bioconductor package (Robinson et al. 2010). Annotations were performed as described in Winters (2023). The analysis commands used in QuantSeq read processing are reported in Supplemental Data Set 3.
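The two quantification steps above, fractional counting of multi-mapping reads and CPM normalization, can be illustrated with a minimal Python sketch. It is not the pipeline used for the atlas (the actual commands are in Supplemental Data Set 3); it only mirrors the arithmetic. It assumes that a read with x reported alignments contributes 1/x to each feature it hits, and that CPM is the count divided by the sample's total count, times one million, with no additional normalization factors. The read IDs, gene IDs, and function names are hypothetical.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Sum per-gene counts, splitting each read evenly over its alignments.

    alignments: dict mapping a read ID to the list of gene IDs it aligned to,
    one entry per reported alignment (hypothetical input structure).
    """
    counts = defaultdict(float)
    for genes in alignments.values():
        if not genes:
            continue  # unassigned read: contributes nothing
        weight = 1.0 / len(genes)  # assumption: 1/x per alignment
        for gene in genes:
            counts[gene] += weight
    return dict(counts)


def counts_per_million(counts):
    """Scale one sample's gene counts so that they sum to one million."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no assigned reads")
    return {gene: value / library_size * 1e6 for gene, value in counts.items()}


# Toy sample: read r2 maps to two genes, so each gene receives 0.5 from it.
sample = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneB"],
    "r4": ["geneC"],
}
raw = fractional_counts(sample)
cpm = counts_per_million(raw)
```

Because fractional counts can be non-integer, the resulting count matrices may contain decimal values even before CPM scaling.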

The data files can be opened with Excel, another spreadsheet program, or any text editor.

The cacao gene atlas: A transcriptome developmental atlas reveals highly tissue-specific and dynamically-regulated gene networks in Theobroma cacao L.

Data files

Abstract

Description of the Data and file structure

Sharing/access Information

Methods

Usage notes
