Aminoacyl-tRNA synthetase gene alignments from multiple Sileneae species generated from full-length transcripts using Iso-Seq and raw microscopy image files

Cite this dataset

Warren, Jessica (2022). Aminoacyl-tRNA synthetase gene alignments from multiple Sileneae species generated from full-length transcripts using Iso-Seq and raw microscopy image files [Dataset]. Dryad.


Trimmed and untrimmed alignments for the final aminoacyl-tRNA synthetases in Sileneae species and Arabidopsis thaliana. We investigated the evolution of subcellular localization of aaRS enzymes in five different species from the plant lineage Sileneae that has experienced extensive and rapid mitochondrial tRNA loss. By analyzing full-length mRNA transcripts with single-molecule sequencing technology (PacBio Iso-Seq) and searching genome sequences, we found instances of predicted retargeting of an ancestrally cytosolic aaRS to the mitochondrion as well as scenarios where enzyme localization does not appear to change despite functional tRNA replacement.

Nikon .nd2 raw microscopy files for the transient expression and imaging of predicted transit peptides and colocalization assays in N. benthamiana epithelial cells. The amino acid sequence plus 10 upstream amino acids of the protein body were fused to GFP and co-transfected with an eqFP611-tagged transit peptide from a known mitochondrially localized protein (isovaleryl-CoA dehydrogenase).


For trimmed and untrimmed alignments:

RNA was extracted from A. githago (hermaphrodite), S. conica (hermaphrodite), S. latifolia (male), and S. vulgaris (male-fertile hermaphrodite) with a Qiagen RNeasy Plant Mini Kit, using RLT buffer with 10 µl beta-mercaptoethanol. RNA was DNase treated with a Qiagen RNase-Free DNase Set. Separate RNA extractions were performed on leaf tissue and an immature flower sample (~5 days post flower development) for A. githago, S. vulgaris, and S. latifolia. Two different tissues were used to increase detection of diverse transcripts, but the two RNA samples were pooled equally by mass for each species prior to library construction, so individual reads cannot be assigned to leaf or floral tissues. Only leaf tissue was used for S. conica, as the individual had not yet begun flowering at the time of RNA extraction. Both tissue types were harvested at 4 weeks post-germination, and RNA integrity and purity were checked on a TapeStation 2200 and a Nanodrop 2000.

Iso-Seq library construction and sequencing was performed at the Arizona Genomics Institute. Library construction was done using PacBio’s SMRTbell Express Template Prep Kit 2.0. The four libraries were barcoded and pooled. The multiplexed pool was sequenced with a PacBio Sequel II platform on two SMRT Cells using a Sequencing Primer V4, Sequel II Bind Kit 2.0, Internal Control 1.0, and Sequel II Sequencing Kit 2.0. Raw movie files were processed to generate circular consensus sequences (CCSs) using PacBio’s SMRT Link v9.0.0.92188 software (Pacific Biosciences 2020). Demultiplexing was performed with lima v2.0.0 and the --isoseq option. Full-length non-chimeric (FLNC) sequences were generated with the refine command and the --require_polya option in the IsoSeq3 (v3.4.0) pipeline. Clustering of FLNCs into isoforms was then performed with the cluster command in IsoSeq3 with the --use-qvs option.

Arabidopsis aaRS genes were identified from published sources, and the corresponding protein sequences were obtained from the Araport11 genome annotation (201606 release). Homologs from the high-quality (HQ) clustered isoforms from each species were identified with a custom Perl script (available on GitHub) that performed a tBLASTn search with each Arabidopsis aaRS sequence, requiring a minimum sequence identity of 50% and a minimum query length coverage of 50%. All HQ clusters that satisfied these criteria were retained by setting the --min_read parameter to 2 (the IsoSeq3 clustering step already excludes singleton transcripts).
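The identity and query-coverage filter described above can be sketched in Python. This is not the authors' Perl script: it assumes the tBLASTn hits were written in BLAST+ tabular format with the columns listed in FIELDS, that coverage is computed per alignment (HSP) rather than across merged HSPs, and that both thresholds are inclusive. The query and transcript names are hypothetical.

```python
import csv
import io

# Assumed column layout, i.e. BLAST+ run with
# -outfmt "6 qseqid sseqid pident length qstart qend qlen"
FIELDS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend", "qlen"]


def filter_hits(handle, min_identity=50.0, min_query_cov=50.0):
    """Yield (query, subject, identity, coverage) for hits passing both cutoffs."""
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        pident = float(row["pident"])
        qstart, qend, qlen = int(row["qstart"]), int(row["qend"]), int(row["qlen"])
        # Percentage of the protein query spanned by this alignment.
        query_cov = 100.0 * (qend - qstart + 1) / qlen
        if pident >= min_identity and query_cov >= min_query_cov:
            yield row["qseqid"], row["sseqid"], pident, round(query_cov, 1)


# Hypothetical hits: only the first passes both thresholds.
example = io.StringIO(
    "query_aaRS\ttranscript_1\t72.5\t400\t1\t450\t500\n"   # 90% coverage
    "query_aaRS\ttranscript_2\t45.0\t480\t1\t490\t500\n"   # identity too low
    "query_aaRS\ttranscript_3\t80.0\t100\t1\t100\t500\n"   # coverage too low
)
hits = list(filter_hits(example))
print(hits)
```

In this toy input only transcript_1 is retained, because the other two hits each fail one of the two cutoffs.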

The longest ORF was extracted from each aaRS transcript using the EMBOSS v. 6.6.0 (Rice et al. 2000) getorf program with the options: -minsize 75 -find 1. Many Iso-Seq transcripts differed in length by only a few nucleotides in the UTR region but resulted in identical ORFs. Therefore, all identical ORFs were collapsed for downstream targeting and phylogenetic analysis. Collapsed ORFs were translated into protein coding sequences.
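To illustrate the ORF extraction and collapsing steps, here is a simplified Python sketch. It is not a reimplementation of getorf: it scans only the forward strand, treats an ORF as the region from the first ATG to the next in-frame stop codon (stop excluded), and applies the 75-nucleotide minimum from the text. The function names and the toy transcripts are hypothetical.

```python
from collections import defaultdict

STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq, min_size=75):
    """Longest forward-strand ATG-to-stop ORF of at least min_size nt, else ''."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                orf = seq[start:i]             # stop codon excluded
                if len(orf) >= min_size and len(orf) > len(best):
                    best = orf
                start = None                   # look for the next ATG
    return best


def collapse_identical(orfs):
    """Group transcript IDs that share exactly the same ORF sequence."""
    groups = defaultdict(list)
    for transcript_id, orf in orfs.items():
        groups[orf].append(transcript_id)
    return dict(groups)


# Two toy transcripts that differ only in UTR length share one ORF.
cds = "ATG" + "GCT" * 30
transcripts = {"t1": "CC" + cds + "TAA" + "GG", "t2": "CCCCC" + cds + "TAA" + "G"}
orfs = {tid: longest_orf(seq) for tid, seq in transcripts.items()}
collapsed = collapse_identical(orfs)
print(collapsed)
```

Keying the groups on the ORF string itself means that transcripts differing only outside the coding region fall into the same entry.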

Very similar transcripts can be the product of different genes, alleles, or sequencing errors. To infer the number of unique genes for each related set of transcripts in a species, transcripts were clustered with CD-HIT-EST v. 4.8.1. For this clustering step, sequences were first aligned with MAFFT v. 7.245 with default settings and trimmed by eye to remove terminal sequence ends with gaps and N-terminal extensions that were not present on all sequences. Any two sequences in which the coding region shared greater than 98% sequence similarity were collapsed into a single gene cluster (CD-HIT-EST options -c 0.98 -n 5 -d 0). Each cluster of transcripts was considered a single gene, and the transcript with the highest expression and longest length was retained as the representative sequence for the gene.
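Choosing the representative transcript for each gene cluster could look like the following Python sketch. The text does not say how expression and length were combined, so this assumes expression (taken here as the number of supporting full-length reads) is compared first and length breaks ties. The record fields and IDs are hypothetical.

```python
def pick_representative(cluster):
    """Return the ID of the transcript with the most reads, then the longest."""
    # Tuples compare element by element: reads first, length as tie-breaker.
    return max(cluster, key=lambda t: (t["reads"], t["length"]))["id"]


# Hypothetical cluster of three near-identical transcripts.
cluster = [
    {"id": "transcript_a", "reads": 12, "length": 1800},
    {"id": "transcript_b", "reads": 30, "length": 1750},
    {"id": "transcript_c", "reads": 30, "length": 1790},
]
representative = pick_representative(cluster)
print(representative)
```

Here transcript_b and transcript_c tie on read support, so the longer of the two is kept.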

For Nikon .nd2 raw image files:

Constructs were made from putative transit peptides predicted by TargetP v.2.0 (Almagro Armenteros et al., 2019). Each transit peptide plus the following 30 bp (10 amino acids) was placed between the attLR1 (5′) and attLR2 (3′) Gateway cloning sites. The desired constructs were synthesized and cloned into pUC57 (Ampr) using EcoRI and BamHI restriction sites by GenScript, transferred into the constitutive plant destination vector pK7FWG2 (bacterial Specr/plant Kanr) (Karimi et al., 2002), which contains a C-terminal GFP fusion, using Gateway® LR Clonase II Enzyme Mix, and transformed into E. coli DH5α. Two colonies were selected for each construct, and DNA was purified using the GeneJet Plasmid Miniprep Kit (Thermo Scientific) and verified by full-length plasmid sequencing (Plasmidsaurus). The putative transit peptides and following 10 amino acids were confirmed to be in-frame with the C-terminal GFP fusion protein by sequence alignment. Positive clones were used to transform electrocompetent Agrobacterium C58C1-RifR (also known as GV3101::pMP90; Hellens et al., 2000). Colonies were selected on Rif/Spec/Gent (50 mg/mL each) and confirmed by PCR using primers directed to the 5′ (Cam35S promoter) and 3′ (GFP) regions flanking the constructs.

Agrobacterium transient transformation of N. benthamiana leaves was done using the method of Mangano et al. (2014), but scaled up to accommodate N. benthamiana instead of Arabidopsis leaves. The species N. benthamiana was used for transformation because it does not have a hypersensitive response to Agrobacterium at the infiltration site.

Leaf samples were imaged after 48 hr on a Nikon A1-NiE confocal microscope equipped with a CFI Plan Apo VC 60 XC WI objective. GFP, eqFP611, and chlorophyll were excited and collected sequentially using the following excitation/emission wavelengths: 488 nm / 525/50 nm (GFP), 561 nm / 595/50 nm (red fluorescent protein eqFP611), 640 nm / 700 (663 – 738) nm (chlorophylls). Imaging was done using Nikon NIS-Elements 5.21.03 (Build 1489) and image analysis was performed using Nikon NIS-Elements 5.41.01 (Build 1709). Maximum Intensity Projections in Z were produced after using the Align Current ND Document function (settings: Align to Previous Frame, The intersection of moved images, Process the entire image), and 500 pixel × 500 pixel (103.56 µm × 103.56 µm) cropped images were created from each projection for publication.
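For readers who want to reproduce the projection and crop outside NIS-Elements, the Python sketch below shows the idea on nested lists. It is an illustration only: the published images were produced in NIS-Elements, the frame alignment step is not reproduced, the function names are hypothetical, and a real Z-stack would normally be handled as an array. The pixel size follows from the stated crop dimensions (103.56 µm across 500 pixels).

```python
def max_intensity_projection(stack):
    """Collapse a Z-stack (list of planes, each a list of rows) along Z.

    Each output pixel is the maximum value found at that (row, column)
    position across all planes.
    """
    return [[max(pixels) for pixels in zip(*rows)] for rows in zip(*stack)]


def crop(image, top, left, size=500):
    """Return a size x size square whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


# Pixel size implied by the stated crop: 103.56 µm across 500 pixels.
PIXEL_SIZE_UM = 103.56 / 500

# Toy 2-plane, 2 x 2 stack.
stack = [
    [[1, 5], [3, 0]],
    [[4, 2], [1, 9]],
]
projection = max_intensity_projection(stack)
print(projection, crop(projection, 0, 0, size=1), PIXEL_SIZE_UM)
```

The same pixel size can be used to convert any measurement made on the cropped images from pixels to micrometers.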

Usage notes

These files can be viewed using Nikon's NIS-Elements software. They can also be viewed using ImageJ for PC (but not Mac).


National Science Foundation, Award: MCB-2048407