LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce

Denby, Katherine1; Kara, Mehmet Fatih1; Guo, Wenbin2; Zhang, Runxuan2

Published Feb 08, 2024; Updated May 29, 2024 on Dryad. https://doi.org/10.5061/dryad.xwdbrv1m8

Data files

Feb 08, 2024 version files (473.75 MB)

lettuceRTDv1.fasta	294.66 MB
lettuceRTDv1.gtf	179.08 MB
README.md	5.10 KB

May 29, 2024 version files (981.16 MB)

LettuceRTDv1_annotated.gtf	507.41 MB
lettuceRTDv1.fasta	294.66 MB
lettuceRTDv1.gtf	179.08 MB
README.md	5.35 KB

Abstract

Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.

The genome assembly of cultivated lettuce was published in 2017 (Reyes-Chin-Wo et al., 2017), and an updated genome version (version 11) is available on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002870075.4/). Here, we introduce the first lettuce reference transcript dataset (LsRTDv1), which integrates long-read Iso-seq and short-read RNA-seq data from diverse lettuce tissue and treatment samples with the GenBank and RefSeq transcript annotations, using stringent quality measures. The final LsRTDv1 includes 179,404 non-redundant transcripts encoded by 65,724 genes, greatly expanding the existing lettuce transcriptome and increasing the number of transcripts per gene from 1.4 to 2.7. LsRTDv1 identifies 3,696 novel gene models, predominantly long non-coding RNAs, that are absent from both the GenBank and RefSeq annotations.
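As a quick check of the transcripts-per-gene figure, the ratio can be recomputed from the two totals quoted above. This is only arithmetic on the numbers in the text; the variable names are illustrative.

```python
# Totals quoted in the description of LsRTDv1.
n_transcripts = 179_404
n_genes = 65_724

# Mean number of transcripts per gene in LsRTDv1.
transcripts_per_gene = n_transcripts / n_genes
print(f"{transcripts_per_gene:.1f} transcripts per gene")  # prints "2.7 transcripts per gene"
```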

Description of the data and file structure

We provide three files for LsRTDv1:

lettuceRTDv1.fasta, which provides the sequences of all annotated transcripts in LsRTDv1
lettuceRTDv1.gtf, which provides the coordinates of every transcript and transcript feature (e.g. intron, exon, transcription start site, transcription end site)
LettuceRTDv1_annotated.gtf, which provides the same information as lettuceRTDv1.gtf together with an annotation of each transcript (e.g. the protein encoded)

These files can be used for transcript-specific RNA-seq analysis (e.g. using Salmon or kallisto to quantify the reads assigned to each transcript) and hence for accurate gene expression analysis and analysis of alternative splicing.
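Gene-level summaries of transcript quantifications usually need a transcript-to-gene table. The sketch below shows one way such a table might be derived from a GTF file. It assumes the attribute column carries the standard `gene_id` and `transcript_id` keys. The function names and the three example records are hypothetical and are not taken from lettuceRTDv1.gtf.

```python
import re
from typing import Dict, Iterable

# Matches GTF attribute pairs of the form: key "value";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a key -> value dictionary."""
    return dict(_ATTR.findall(field))


def tx2gene(gtf_lines: Iterable[str]) -> Dict[str, str]:
    """Map each transcript_id to its gene_id (assumes standard GTF keys)."""
    mapping: Dict[str, str] = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        tid, gid = attrs.get("transcript_id"), attrs.get("gene_id")
        if tid and gid:
            mapping[tid] = gid
    return mapping


# Made-up records for illustration only.
example = [
    'chr1\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
    'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
    'chr1\tdemo\ttranscript\t100\t950\t.\t+\t.\tgene_id "G1"; transcript_id "G1.2";\n',
]
mapping = tx2gene(example)
print(mapping)
```

With the real file, the same function could be called on an open file handle, for example `with open("lettuceRTDv1.gtf") as fh: mapping = tx2gene(fh)`.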

Sharing/Access information

The short- and long-read sequencing data underlying this transcript annotation can be found in the NCBI Sequence Read Archive (SRA) under BioProject PRJNA1018253.

Data were derived from the following sources. Each sequencing library was made from several pooled samples and was sequenced with both long- and short-read methods (PacBio and Illumina).

Library ID	Sample ID	Sample Name	Description
P1	S19	Ls-1W-NT-CTY	cotyledons from 1-week-old seedlings w/o treatment
	S20	Ls-1W-NT-R	radicles from 1-week-old seedlings w/o treatment
	S26	Ls-1W-NT-HYP	hypocotyls from 1-week-old seedlings w/o treatment
P2	S4	Ls-6W-NT-L	leaves from 6-week-old plants w/o treatment
	S22	Ls-10W-NT-L10	6-10 mature leaves from 10-week-old plants w/o treatment
	S15	Ls-12W-NT-L	leaves from 12-week-old plants w/o treatment
	S21	Ls-10W-NT-L5	1-5 old leaves from 10-week-old plants w/o treatment
P3	S6	Ls-6W-HT-L	leaves from 6-week-old plants exposed to heat-shock treatment
	S8	Ls-6W-CT-L	leaves from 6-week-old plants exposed to chilling treatment
	S11	Ls-6W-WT-L	leaves from 6-week-old plants exposed to waterlogging treatment
P4	S10	Ls-6W-BCIN-L	leaves from 6-week-old plants infected with Botrytis cinerea
	S13	Ls-6W-DT-L	leaves from 6-week-old plants exposed to drought treatment
	S18	Ls-6W-WoT-L	leaves from 6-week-old plants exposed to wounding treatment
P5	S3	Ls-2W-NT-R	roots from 2-week-old plantlets w/o treatment
	S5	Ls-10W-NT-R	roots from 10-week-old plantlets w/o treatment
	S16	Ls-12W-NT-R	roots from 12-week-old plants w/o treatment
	S7	Ls-10W-HT-R	roots from 10-week-old plants exposed to heat-shock treatment
P6	S9	Ls-6W-CT-R	roots from 6-week-old plants exposed to chilling treatment
	S12	Ls-10W-WT-R	roots from 10-week-old plants exposed to waterlogging treatment
	S14	Ls-10W-DT-R	roots from 10-week-old plants exposed to drought treatment
P7	S2	Ls-2W-NT-L	leaves from 2-week-old plantlets w/o treatment
	S23	Ls-10W-NT-L15	11-15 young leaves from 10-week-old plants w/o treatment
	S24	Ls-10W-NT-L20	apical meristems including young leaves from 10-week-old plants w/o treatment

We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data were generated from 23 lettuce samples capturing different tissues, plant ages and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were pooled equally into seven libraries (P1 to P7) prior to sequencing.
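For downstream scripts it can help to hold the pooling design as a small lookup structure. The sketch below transcribes the library and sample IDs from the table above into a dictionary and checks the totals stated in the text. The variable names are illustrative.

```python
# Library ID -> sample IDs, transcribed from the sample table.
pools = {
    "P1": ["S19", "S20", "S26"],
    "P2": ["S4", "S22", "S15", "S21"],
    "P3": ["S6", "S8", "S11"],
    "P4": ["S10", "S13", "S18"],
    "P5": ["S3", "S5", "S16", "S7"],
    "P6": ["S9", "S12", "S14"],
    "P7": ["S2", "S23", "S24"],
}

# Reverse lookup: which library a given sample was pooled into.
library_of = {sample: lib for lib, samples in pools.items() for sample in samples}

n_libraries = len(pools)
n_samples = len(library_of)
print(n_libraries, "libraries,", n_samples, "samples")  # prints "7 libraries, 23 samples"
```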

Short-read assembly

The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using the STAR aligner in 2-pass mode to increase mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). The number of allowed mismatches was set to 1, with minimum and maximum intron sizes of 60 and 15,000 bp, respectively. Two transcript assemblers, StringTie (Pertea et al., 2015) and Scallop (Shao and Kingsford, 2017), were used to assemble transcripts for each sample. The assemblies were then merged and refined using RTDmaker (https://github.com/anonconda/RTDmaker) to remove low-quality transcripts. These included redundant transcripts with intron combinations identical to those of longer transcripts, fragmented transcripts with length <70% of gene length, transcripts with non-canonical SJs, transcripts with SJs supported by <5 spliced reads in <2 samples, and lowly expressed transcripts with <1 transcript per million reads (TPM) in <2 samples.
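The filtering criteria above can be sketched as a single predicate. This is not RTDmaker code; it is a minimal illustration under stated assumptions. A transcript is kept only if every SJ has at least 5 spliced reads in at least 2 samples and the transcript reaches at least 1 TPM in at least 2 samples. Canonical SJs are taken to be GT-AG, GC-AG and AT-AC. "Length" is taken to be whatever length measure is compared against gene length. The redundancy filter is omitted, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Assumed set of canonical donor/acceptor dinucleotide pairs.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


@dataclass
class Junction:
    donor: str                    # first two intronic bases
    acceptor: str                 # last two intronic bases
    reads_per_sample: List[int]   # spliced reads supporting the SJ, per sample


@dataclass
class Transcript:
    length: int
    gene_length: int
    junctions: List[Junction]
    tpm_per_sample: List[float]


def samples_reaching(values: Sequence[float], threshold: float) -> int:
    """Count samples whose value is at or above the threshold."""
    return sum(v >= threshold for v in values)


def keep(t: Transcript, min_reads: int = 5, min_tpm: float = 1.0,
         min_samples: int = 2, min_frac: float = 0.7) -> bool:
    """Apply the fragment, SJ and expression filters (assumed interpretation)."""
    if t.length < min_frac * t.gene_length:
        return False  # fragmented transcript
    for sj in t.junctions:
        if (sj.donor, sj.acceptor) not in CANONICAL:
            return False  # non-canonical SJ
        if samples_reaching(sj.reads_per_sample, min_reads) < min_samples:
            return False  # weakly supported SJ
    return samples_reaching(t.tpm_per_sample, min_tpm) >= min_samples


# Made-up examples for illustration only.
good = Transcript(800, 1000, [Junction("GT", "AG", [6, 7, 0])], [2.0, 1.5, 0.0])
weak_sj = Transcript(800, 1000, [Junction("GT", "AG", [6, 2, 0])], [2.0, 1.5, 0.0])
odd_sj = Transcript(800, 1000, [Junction("CT", "AC", [9, 9, 9])], [2.0, 1.5, 0.0])
fragment = Transcript(500, 1000, [Junction("GT", "AG", [6, 7, 0])], [2.0, 1.5, 0.0])
print([keep(t) for t in (good, weak_sj, odd_sj, fragment)])
```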

Long-read assembly

We employed the IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) to pre-process the Iso-seq data from the seven samples. The CCS method was used to generate circular consensus sequences (CCS) from raw subreads, and reads with minimum predicted accuracy <90% were discarded (--min-rq=0.9). Barcodes associated with the CCS reads were removed using the lima method. To further refine the reads, Isoseq3 was applied to trim poly(A) tails and to identify and remove concatemers. The resulting full-length, non-concatemer (FLNC) reads were mapped to the reference genome using Minimap2 (Li, 2018). TAMA-collapse was used to collapse redundant transcript models in each sample, with variation at the 5’ and 3’ ends and at SJs not allowed (-a = 0, -m = 0 and -z = 0) to ensure high accuracy of boundaries. Reads with errors within 10 bp upstream or downstream of an SJ were removed. TAMA-merge was used to merge transcript models from the seven samples (Kuo et al., 2020).

To improve the quality of the assembly, we implemented well-established methods for SJ, transcript start site (TSS) and transcript end site (TES) analyses previously used for Arabidopsis AtRTD3 and barley BaRTv2 (Zhang et al., 2022b; Coulter et al., 2022). We removed transcripts with non-canonical or low-quality SJs unless those SJs were also present in the short-read assembly. We applied a binomial test to identify high-confidence TSS and TES at a false discovery rate (FDR) <0.05. Statistical testing is challenging for genes with limited read support, so we also kept TSS/TES that were supported by at least 2 Iso-seq reads. Transcripts that differed only by ±50 nucleotides at their TSS/TES were merged as redundant. In addition, transcripts supported by only a single Iso-seq read were removed from the final dataset.
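The ±50 nucleotide redundancy rule can be illustrated with a small predicate. This is a sketch rather than the code used to build LsRTDv1. It assumes two models are redundant when they share chromosome, strand and an identical intron chain and both ends differ by at most 50 nucleotides (inclusive). The `Model` type and the coordinates are hypothetical.

```python
from typing import NamedTuple, Tuple


class Model(NamedTuple):
    chrom: str
    strand: str
    start: int                             # leftmost genomic coordinate
    end: int                               # rightmost genomic coordinate
    introns: Tuple[Tuple[int, int], ...]   # (start, end) of each intron


def is_redundant(a: Model, b: Model, wobble: int = 50) -> bool:
    """True if a and b differ only by small shifts at their two ends."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.introns == b.introns
        and abs(a.start - b.start) <= wobble
        and abs(a.end - b.end) <= wobble
    )


# Made-up models for illustration only.
m1 = Model("chr1", "+", 1000, 5000, ((1500, 2000), (2500, 3000)))
m2 = Model("chr1", "+", 1030, 4980, ((1500, 2000), (2500, 3000)))  # ends shifted by 30 and 20
m3 = Model("chr1", "+", 1100, 5000, ((1500, 2000), (2500, 3000)))  # start shifted by 100
m4 = Model("chr1", "+", 1000, 5000, ((1500, 2000),))               # different intron chain
print(is_redundant(m1, m2), is_redundant(m1, m3), is_redundant(m1, m4))
```

Which boundaries the merged model keeps is not specified here, so the sketch stops at detecting redundancy.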

Integration of multiple annotations

We integrated four transcript annotations: the long-read assembly, the short-read assembly and the two Lsat_Salinas_v11 genome annotations, GenBank (GCA_002870075.4) and RefSeq (GCF_002870075.4). The Iso-seq long-read assembly served as the reliable backbone, while the other three annotations were incorporated in a step-wise manner to improve the completeness of the RTD. First, transcripts in the short-read assembly that introduced novel SJs and/or novel gene loci were integrated into the long-read assembly. Subsequently, we added transcripts from the GenBank and RefSeq annotations that contributed novel SJs or gene loci to build the lettuce RTD (LsRTDv1). Where two transcripts from GenBank and RefSeq had identical SJ combinations, or were mono-exonic transcripts with overlapping regions exceeding 30% of both transcripts, we collapsed them into a single transcript, and the longest TSS and TES were used as the start and end points of the collapsed transcript. In LsRTDv1, overlapping transcripts were assigned the same gene ID. However, if a set of overlapping transcripts resided entirely within an intron of other transcripts, they were treated as intronic transcripts and assigned a different gene ID. Where the overlapping transcripts could be divided into multiple groups and adjacent groups overlapped by less than 5% of the group lengths, the groups were assigned separate gene IDs.
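The collapsing rule for mono-exonic transcripts can be sketched as follows. This is an illustration only, not the code used to build LsRTDv1. It assumes 1-based inclusive coordinates as in GTF, reads "exceeding 30% of both transcripts" as a reciprocal overlap test, and reads "the longest TSS and TES" as the outermost start and end. All names and coordinates are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlap_fractions(a: Interval, b: Interval) -> Tuple[float, float]:
    """Overlap length as a fraction of each interval's own length."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return overlap / (a[1] - a[0] + 1), overlap / (b[1] - b[0] + 1)


def should_collapse(a: Interval, b: Interval, min_frac: float = 0.30) -> bool:
    """True if the overlap exceeds min_frac of both transcripts."""
    frac_a, frac_b = overlap_fractions(a, b)
    return frac_a > min_frac and frac_b > min_frac


def collapse(a: Interval, b: Interval) -> Interval:
    """Collapsed model spanning the outermost start and end (assumed reading)."""
    return min(a[0], b[0]), max(a[1], b[1])


# Made-up mono-exonic transcripts for illustration only.
t1 = (100, 199)   # 100 nt
t2 = (150, 249)   # 100 nt, 50 nt shared with t1 (50% of each)
t3 = (150, 349)   # 200 nt, 50 nt shared with t1 (50% of t1, 25% of t3)

print(should_collapse(t1, t2), collapse(t1, t2))  # prints "True (100, 249)"
print(should_collapse(t1, t3))                    # prints "False"
```

Testing the overlap against both transcripts prevents a short transcript that sits inside a much longer one from being absorbed into it.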