LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce
Data files
Feb 08, 2024 version files 473.75 MB
- lettuceRTDv1.fasta
- lettuceRTDv1.gtf
- README.md
May 29, 2024 version files 981.16 MB
- LettuceRTDv1_annotated.gtf
- lettuceRTDv1.fasta
- lettuceRTDv1.gtf
- README.md
Abstract
Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.
README: LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce
https://doi.org/10.5061/dryad.xwdbrv1m8
The genome assembly of cultivated lettuce was published in 2017 (Reyes-Chin-Wo et al., 2017), with an updated genome version (version 11) available on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002870075.4/). Here, we introduce the first lettuce reference transcript dataset (LsRTDv1), which integrates long-read Iso-seq and short-read RNA-seq of diverse tissue and treatment samples from lettuce with the GenBank and RefSeq transcript annotations, using stringent quality measures. The final LsRTDv1 includes 179,404 non-redundant transcripts encoded by 65,724 genes, greatly expanding the existing lettuce transcriptome and increasing the number of transcripts per gene from 1.4 to 2.7. LsRTDv1 identifies 3,696 novel gene models, predominantly long non-coding RNAs, that are absent from both the GenBank and RefSeq annotations.
Description of the data and file structure
We provide three files for LsRTDv1:
- lettuceRTDv1.fasta, which provides the sequence of all annotated transcripts in LsRTDv1
- lettuceRTDv1.gtf, which provides the coordinates of every transcript and transcript feature (e.g. intron, exon, transcription start site, transcription end site)
- LettuceRTDv1_annotated.gtf, which provides the information in the above GTF file, along with annotation of transcripts (e.g. protein encoded)
These files can be used for transcript-specific RNA-seq analysis (e.g. using Salmon or Kallisto to quantify reads against each transcript) and hence for accurate gene expression analysis and analysis of alternative splicing.
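Gene-level analysis from transcript-level quantification needs a transcript-to-gene mapping, which can be derived from the GTF file. The following is a minimal Python sketch, assuming the GTF uses the standard `gene_id "..."; transcript_id "...";` attribute syntax in its ninth column. The function names and the example IDs are hypothetical and are not part of LsRTDv1 or of any quantification tool.

```python
import re

# Matches standard GTF attribute pairs, e.g.: gene_id "G1"; transcript_id "G1.1";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def tx2gene_from_gtf_lines(lines):
    """Return {transcript_id: gene_id} from an iterable of GTF lines."""
    mapping = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def gene_level_tpm(transcript_tpm, tx2gene):
    """Sum transcript TPM values per gene; unmapped transcripts are ignored."""
    totals = {}
    for tx, tpm in transcript_tpm.items():
        gene = tx2gene.get(tx)
        if gene is not None:
            totals[gene] = totals.get(gene, 0.0) + tpm
    return totals


# Hypothetical records; real IDs come from lettuceRTDv1.gtf.
example_gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tsrc\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "G1.2";',
    'chr1\tsrc\texon\t900\t990\t.\t-\t.\tgene_id "G2"; transcript_id "G2.1";',
]
example_map = tx2gene_from_gtf_lines(example_gtf)
example_genes = gene_level_tpm({"G1.1": 2.0, "G1.2": 3.5, "G2.1": 1.0}, example_map)
```

To use this with the real file, pass an open file handle for lettuceRTDv1.gtf to `tx2gene_from_gtf_lines` and supply the per-transcript TPM values reported by the quantification tool.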
Sharing/Access information
The short- and long-read sequencing data underlying this transcript annotation can be found in the NCBI Sequence Read Archive under BioProject PRJNA1018253.
Data were derived from the following sources. Each sequencing library was made from several pooled samples, and each library was sequenced with both long- and short-read methods (PacBio and Illumina).
Library ID | Sample ID | Sample Name | Description |
---|---|---|---|
P1 | S19 | Ls-1W-NT-CTY | cotyledons from 1-week-old seedlings w/o treatment |
P1 | S20 | Ls-1W-NT-R | radicles from 1-week-old seedlings w/o treatment |
P1 | S26 | Ls-1W-NT-HYP | hypocotyls from 1-week-old seedlings w/o treatment |
P2 | S4 | Ls-6W-NT-L | leaves from 6-week-old plants w/o treatment |
P2 | S22 | Ls-10W-NT-L10 | 6-10 mature leaves from 10-week-old plants w/o treatment |
P2 | S15 | Ls-12W-NT-L | leaves from 12-week-old plants w/o treatment |
P2 | S21 | Ls-10W-NT-L5 | 1-5 old leaves from 10-week-old plants w/o treatment |
P3 | S6 | Ls-6W-HT-L | leaves from 6-week-old plants exposed to heat-shock treatment |
P3 | S8 | Ls-6W-CT-L | leaves from 6-week-old plants exposed to chilling treatment |
P3 | S11 | Ls-6W-WT-L | leaves from 6-week-old plants exposed to waterlogging treatment |
P4 | S10 | Ls-6W-BCIN-L | leaves from 6-week-old plants infected with Botrytis cinerea |
P4 | S13 | Ls-6W-DT-L | leaves from 6-week-old plants exposed to drought treatment |
P4 | S18 | Ls-6W-WoT-L | leaves from 6-week-old plants exposed to wounding treatment |
P5 | S3 | Ls-2W-NT-R | roots from 2-week-old plantlets w/o treatment |
P5 | S5 | Ls-10W-NT-R | roots from 10-week-old plantlets w/o treatment |
P5 | S16 | Ls-12W-NT-R | roots from 12-week-old plants w/o treatment |
P5 | S7 | Ls-10W-HT-R | roots from 10-week-old plants exposed to heat-shock treatment |
P6 | S9 | Ls-6W-CT-R | roots from 6-week-old plants exposed to chilling treatment |
P6 | S12 | Ls-10W-WT-R | roots from 10-week-old plants exposed to waterlogging treatment |
P6 | S14 | Ls-10W-DT-R | roots from 10-week-old plants exposed to drought treatment |
P7 | S2 | Ls-2W-NT-L | leaves from 2-week-old plantlets w/o treatment |
P7 | S23 | Ls-10W-NT-L15 | 11-15 young leaves from 10-week-old plants w/o treatment |
P7 | S24 | Ls-10W-NT-L20 | apical meristems including young leaves from 10-week-old plants w/o treatment |
Methods
We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data were generated from 23 different lettuce samples capturing different tissues, plant ages and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were combined equally into seven pooled samples prior to sequencing.
Short-read assembly
The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using the STAR aligner in 2-pass mode to increase the mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). The number of mismatches allowed was set to 1, with minimum and maximum intron sizes of 60 and 15,000 bp respectively. Two transcript assemblers, StringTie (Pertea et al., 2015) and Scallop (Shao and Kingsford, 2017), were used to assemble transcripts for each sample. The assemblies were then merged and refined using RTDmaker (https://github.com/anonconda/RTDmaker) to remove low-quality transcripts, including redundant transcripts with identical intron combinations to longer transcripts, fragmented transcripts with length <70% of gene length, transcripts with non-canonical SJs, transcripts with SJs only supported by <5 spliced reads in <2 samples, and lowly expressed transcripts with <1 transcript per million reads (TPM) in <2 samples.
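The read-support and expression thresholds above can be illustrated with a short Python sketch. It is not RTDmaker's implementation. It assumes one reading of the criteria: a transcript is retained only if every SJ has at least 5 spliced reads in at least 2 samples and the transcript reaches at least 1 TPM in at least 2 samples. The function names and input layout are hypothetical.

```python
def passes_support(values_per_sample, min_value, min_samples):
    """True if at least `min_samples` samples have a value >= `min_value`."""
    return sum(v >= min_value for v in values_per_sample) >= min_samples


def keep_transcript(junction_read_counts, tpm_per_sample,
                    min_reads=5, min_tpm=1.0, min_samples=2):
    """Apply the assumed SJ-support and expression filters to one transcript.

    junction_read_counts: one list per splice junction, holding the spliced
        read count of that junction in each sample.
    tpm_per_sample: TPM of the transcript in each sample.
    """
    # Every junction must be well supported. A mono-exonic transcript has no
    # junctions, so this check passes trivially for it.
    junctions_ok = all(
        passes_support(counts, min_reads, min_samples)
        for counts in junction_read_counts
    )
    expressed = passes_support(tpm_per_sample, min_tpm, min_samples)
    return junctions_ok and expressed
```

For example, a transcript with one junction supported by 5, 4 and 0 reads across three samples would be dropped under this reading, because only one sample reaches 5 reads.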
Long-read assembly
We employed the IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) to pre-process the Iso-seq data from the seven samples. The CCS method was used to generate circular consensus sequences (CCS) from raw subreads, and reads with minimum predicted accuracy <90% were discarded (--min-rq=0.9). Barcodes associated with the CCS reads were removed using the lima method. To further refine the reads, Isoseq3 was applied to trim poly(A) tails and to identify and remove concatemers. The resulting full-length, non-concatemer (FLNC) reads were mapped to the reference genome using Minimap2 (Li, 2018). TAMA-collapse was used to collapse redundant transcript models in each sample, with no variation allowed at the 5’ and 3’ ends or at SJs (-a = 0, -m = 0 and -z = 0) to ensure high accuracy of boundaries. Reads with errors within 10 bp up- or downstream of an SJ were removed. TAMA-merge was used to merge transcript models from the seven samples (Kuo et al., 2020). To improve the quality of the assembly, we implemented well-established methods for SJ and transcript start site (TSS) and end site (TES) analyses previously used for Arabidopsis AtRTD3 and barley BaRTv2 (Zhang et al., 2022b; Coulter et al., 2022). We removed low-quality transcripts that exhibited non-canonical SJs or low-quality SJs unless these were also present in the short-read assembly. We applied a binomial test to distinguish high-confidence TSS and TES with a false discovery rate (FDR) <0.05. Because statistical testing is challenging for genes with limited read support, we also kept TSS/TES if they were supported by at least 2 Iso-seq reads. Transcripts that differed only within ±50 nucleotides at their TSS/TES were merged as redundant. In addition, transcripts supported by only a single Iso-seq read were removed from the final dataset.
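The ±50-nucleotide redundancy merge can be sketched in Python as follows. This is an illustration only, not the pipeline's code. It assumes that two transcripts are redundant when they share an identical intron chain and both ends differ by at most 50 nucleotides, and that the longest member of each redundant group is kept as its representative. The source does not state which member is kept, strand and chromosome are ignored for brevity, and all names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    tid: str
    start: int                            # left-most genomic coordinate
    end: int                              # right-most genomic coordinate
    introns: Tuple[Tuple[int, int], ...]  # ordered (start, end) of each intron


def is_redundant(a: Transcript, b: Transcript, wobble: int = 50) -> bool:
    """Same intron chain and both ends within `wobble` nucleotides."""
    return (
        a.introns == b.introns
        and abs(a.start - b.start) <= wobble
        and abs(a.end - b.end) <= wobble
    )


def merge_redundant(transcripts: List[Transcript], wobble: int = 50) -> List[Transcript]:
    """Greedily keep one representative per group of redundant transcripts."""
    kept: List[Transcript] = []
    # Visit longer transcripts first so each group is represented by its
    # longest member (an assumption); ties are broken by transcript ID.
    for t in sorted(transcripts, key=lambda t: (-(t.end - t.start), t.tid)):
        if not any(is_redundant(t, k, wobble) for k in kept):
            kept.append(t)
    return kept
```

Sorting before the greedy pass makes the result independent of input order, which is convenient when models from several samples are merged.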
Integration of multiple annotations
We integrated four transcript annotations: the long-read assembly, the short-read assembly and two versions of the Lsat_Salinas_v11 genome annotation, GenBank (GCA_002870075.4) and RefSeq (GCF_002870075.4). The Iso-seq long-read assembly served as the reliable backbone, while the other three annotations were incorporated in a step-wise manner to improve the completeness of the RTD. First, transcripts in the short-read assembly that introduced novel SJs and/or novel gene loci were integrated into the long-read assembly. Subsequently, we added transcripts from the GenBank and RefSeq annotations that contributed novel SJs or gene loci to build the lettuce RTD (LsRTDv1). Where two transcripts from GenBank and RefSeq had identical SJ combinations, or were mono-exonic transcripts with overlapping regions exceeding 30% of both transcripts, we collapsed them into a single transcript, and the TSS and TES giving the longest transcript were used as the start and end points of the collapsed transcript. In LsRTDv1, overlapping transcripts were assigned the same gene ID. However, if a set of overlapping transcripts resided entirely within an intron of other transcripts, they were treated as intronic transcripts and assigned a different gene ID. Where the overlapping transcripts could be divided into multiple groups and adjacent groups overlapped by less than 5% of the group lengths, the groups were assigned separate gene IDs.
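The 30% reciprocal-overlap rule for mono-exonic transcripts can be written as a small Python check. This is a sketch rather than the code used to build LsRTDv1. It assumes 1-based inclusive coordinates as in GTF, represents each transcript as a hypothetical `(start, end)` tuple, and assumes the collapsed transcript spans the outermost start and end of the pair.

```python
def overlap_length(a, b):
    """Number of shared bases between two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def should_collapse_monoexonic(a, b, min_fraction=0.30):
    """True if the overlap exceeds `min_fraction` of BOTH transcripts."""
    shared = overlap_length(a, b)
    length_a = a[1] - a[0] + 1
    length_b = b[1] - b[0] + 1
    return shared / length_a > min_fraction and shared / length_b > min_fraction


def collapse(a, b):
    """Merge two transcripts, keeping the outermost start and end."""
    return (min(a[0], b[0]), max(a[1], b[1]))
```

Requiring the threshold on both transcripts prevents a short transcript that lies inside a much longer one from being absorbed purely because most of the short transcript overlaps.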