Aminoacyl-tRNA synthetase gene alignments from multiple Sileneae species generated from full-length transcripts using Iso-Seq

Citation

Warren, Jessica (2022), Aminoacyl-tRNA synthetase gene alignments from multiple Sileneae species generated from full-length transcripts using Iso-Seq, Dryad, Dataset, https://doi.org/10.5061/dryad.0k6djhb20

Abstract

Trimmed and untrimmed alignments for the final aminoacyl-tRNA synthetases in Sileneae species and Arabidopsis thaliana. We investigated the evolution of subcellular localization of aaRS enzymes in five different species from the plant lineage Sileneae that has experienced extensive and rapid mitochondrial tRNA loss. By analyzing full-length mRNA transcripts with single-molecule sequencing technology (PacBio Iso-Seq) and searching genome sequences, we found instances of predicted retargeting of an ancestrally cytosolic aaRS to the mitochondrion as well as scenarios where enzyme localization does not appear to change despite functional tRNA replacement.

Methods

RNA was extracted from A. githago (hermaphrodite), S. conica (hermaphrodite), S. latifolia (male), and S. vulgaris (male-fertile hermaphrodite) with a Qiagen RNeasy Plant Mini Kit, using RLT buffer with 10 µl beta-mercaptoethanol. RNA was DNase treated with a Qiagen RNase-Free DNase Set. Separate RNA extractions were performed on leaf tissue and an immature flower sample (~5 days post flower development) for A. githago, S. vulgaris, and S. latifolia. Two different tissues were used to increase detection of diverse transcripts, but the two RNA samples were pooled equally by mass for each species prior to library construction, so individual reads cannot be assigned to leaf or floral tissues. Only leaf tissue was used for S. conica as the individual had not yet begun flowering at the time of RNA extraction. Both tissue types were harvested at 4 weeks post-germination, and RNA integrity and purity were checked on a TapeStation 2200 and a Nanodrop 2000.

Iso-Seq library construction and sequencing were performed at the Arizona Genomics Institute. Library construction was done using PacBio’s SMRTbell Express Template Prep Kit 2.0. The four libraries were barcoded and pooled. The multiplexed pool was sequenced with a PacBio Sequel II platform on two SMRT Cells using a Sequencing Primer V4, Sequel II Bind Kit 2.0, Internal Control 1.0, and Sequel II Sequencing Kit 2.0. Raw movie files were processed to generate circular consensus sequences (CCSs) using PacBio’s SMRT Link v9.0.0.92188 software (Pacific Biosciences 2020). Demultiplexing was performed with lima v2.0.0 and the --isoseq option. Full-length non-chimeric (FLNC) sequences were generated with the refine command and the --require_polya option in the IsoSeq3 (v3.4.0) pipeline. Clustering of FLNCs into isoforms was then performed with the cluster command in IsoSeq3 with the --use-qvs option.

Arabidopsis aaRS genes were identified from published sources, and the corresponding protein sequences were obtained from the Araport11 genome annotation (201606 release). Homologs from the high-quality (HQ) clustered isoforms from each species were identified with a custom Perl script (iso-seq_blast_pipeline.pl available at GitHub: https://github.com/warrenjessica/Iso-Seq_scripts) that performed a tBLASTn search with each Arabidopsis aaRS sequence, requiring a minimum sequence identity of 50% and a minimum query length coverage of 50%. All HQ clusters that satisfied these criteria were retained by setting the --min_read parameter to 2 (the IsoSeq3 clustering step already excludes singleton transcripts). 
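The identity and coverage thresholds described above can be illustrated with a minimal Python sketch. This is not the Perl script from the repository: the `Hit` record, its field names, and the choice to compute query coverage from a single aligned span are all assumptions made for illustration, and the actual pipeline may compute coverage differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One tBLASTn alignment (hypothetical record; field names are assumptions)."""
    query_id: str            # Arabidopsis aaRS protein used as the query
    subject_id: str          # Iso-Seq HQ clustered isoform that was hit
    percent_identity: float  # percent identity over the aligned region
    query_start: int         # 1-based, inclusive
    query_end: int           # 1-based, inclusive
    query_length: int        # full length of the query protein


def query_coverage(hit: Hit) -> float:
    """Percent of the query covered by this single alignment (an assumption)."""
    aligned = abs(hit.query_end - hit.query_start) + 1
    return 100.0 * aligned / hit.query_length


def filter_hits(hits: List[Hit],
                min_identity: float = 50.0,
                min_coverage: float = 50.0) -> List[Hit]:
    """Keep hits meeting both the identity and the query coverage thresholds."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity
        and query_coverage(h) >= min_coverage
    ]


# Illustrative, made-up hits (not data from this dataset).
example_hits = [
    Hit("query_1", "isoform_A", 62.0, 1, 300, 500),   # 60% coverage: kept
    Hit("query_1", "isoform_B", 45.0, 1, 500, 500),   # identity too low
    Hit("query_1", "isoform_C", 80.0, 10, 200, 500),  # coverage too low
]
kept = filter_hits(example_hits)
print([h.subject_id for h in kept])
```

Both thresholds are treated here as inclusive (greater than or equal to 50%), which is also an assumption about how a "minimum" is applied.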

The longest ORF was extracted from each aaRS transcript using the EMBOSS v. 6.6.0 (Rice et al. 2000) getorf program with the options: -minsize 75 -find 1. Many Iso-Seq transcripts differed in length by only a few nucleotides in the UTR region but resulted in identical ORFs. Therefore, all identical ORFs were collapsed for downstream targeting and phylogenetic analysis. Collapsed ORFs were translated into protein coding sequences.
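Collapsing identical ORFs amounts to grouping transcripts by their exact ORF sequence. The following Python sketch shows one way this could be done; the function name, the dictionary input, and the case-insensitive comparison are assumptions for illustration and do not describe the code actually used for this dataset.

```python
from collections import defaultdict
from typing import Dict, List


def collapse_identical_orfs(orfs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group transcript IDs that share an identical ORF sequence.

    orfs maps a transcript ID to its ORF nucleotide sequence. The result maps
    each distinct ORF sequence to the transcript IDs that carry it, in input
    order. Sequences are upper-cased before comparison (an assumption).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for transcript_id, sequence in orfs.items():
        groups[sequence.upper()].append(transcript_id)
    return dict(groups)


# Made-up example: two transcripts differ only in their UTRs, so the
# extracted ORFs are identical and collapse into one entry.
example_orfs = {
    "transcript_1": "ATGGCTTAA",
    "transcript_2": "ATGGCTTAA",
    "transcript_3": "ATGGCATAA",
}
collapsed = collapse_identical_orfs(example_orfs)
for sequence, ids in collapsed.items():
    print(sequence, ids)
```

Keeping the list of transcript IDs for each collapsed ORF makes it possible to trace a unique ORF back to every transcript that produced it.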

Very similar transcripts can be the product of different genes, alleles, or sequencing errors. In order to infer the number of unique genes for each related set of transcripts in a species, CD-HIT-EST v. 4.8.1 was used to cluster the transcripts. For this clustering step, sequences were first aligned with MAFFT v. 7.245 with default settings and trimmed by eye to remove terminal sequence ends with gaps and N-terminal extensions that were not present on all sequences. Any two sequences in which the coding region shared greater than 98% sequence similarity were collapsed into a single gene cluster (CD-HIT-EST options -c 0.98 -n 5 -d 0). Each cluster of transcripts was considered a single gene, and the transcript with the highest expression and longest length was retained as the representative sequence for the gene.
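Choosing a representative transcript for each gene cluster can be sketched as a simple ranking. The description above does not say how expression and length were combined, so this Python sketch assumes that expression is ranked first and length is used only to break ties, and that expression is measured as a count of supporting full-length reads. The `Transcript` record and its fields are hypothetical.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    """Hypothetical record for one transcript within a gene cluster."""
    transcript_id: str
    read_count: int  # assumed proxy for expression (supporting reads)
    length: int      # transcript length in nucleotides


def pick_representative(cluster: List[Transcript]) -> Transcript:
    """Return the transcript with the highest read count.

    Ties on read count are broken by the longer length (an assumed ordering).
    """
    if not cluster:
        raise ValueError("cluster must contain at least one transcript")
    return max(cluster, key=lambda t: (t.read_count, t.length))


# Made-up cluster: two transcripts tie on read count, so the longer one wins.
example_cluster = [
    Transcript("transcript_1", read_count=12, length=2100),
    Transcript("transcript_2", read_count=12, length=2250),
    Transcript("transcript_3", read_count=4, length=2400),
]
representative = pick_representative(example_cluster)
print(representative.transcript_id)
```

If length were meant to take priority over expression, swapping the two elements of the sort key would give that behavior instead.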

Funding

National Science Foundation, Award: MCB-2048407