In silico subcellular targeting predictions for cytosolic aminoacyl tRNA-synthetases (aaRS) in parasitic plants
Data files
Aug 02, 2023 version files (776.54 KB)
- 20230315_PresenceAbsencematrix.csv (3.83 KB)
- 20230329_orthogroups_geneID_editforheatmaps.csv (3.58 KB)
- 20230329_orthogroups_geneID.csv (3.93 KB)
- input_prot_fasta.zip (392.43 KB)
- README.md (6.89 KB)
- Subcellular_Predictions.zip (365.87 KB)
Oct 18, 2024 version files (2.13 MB)
- 20240805_aaRS_presenceabsence.csv (3.90 KB)
- input_prot_fasta.zip (744.95 KB)
- README.md (8.08 KB)
- Subcellular_Predictions.zip.zip (1.37 MB)
- targeting_orthogroups_Rheatmaps.csv (580 B)
- targeting_orthogroups.csv (3.44 KB)
Abstract
Eukaryotic nuclear genomes often encode distinct sets of protein translation machinery for function in the cytosol vs. organelles (mitochondria and plastids). This phenomenon raises questions about why multiple translation systems are maintained even though they are capable of comparable functions, and whether they evolve differently depending on the compartment where they operate. These questions are particularly interesting in land plants because translation machinery, including aminoacyl-tRNA synthetases (aaRS), is often dual-targeted to both the plastids and mitochondria. These two organelles have quite different metabolisms, with much higher rates of translation in plastids to supply the abundant, rapid-turnover proteins required for photosynthesis. Previous studies have indicated that plant organellar aaRS evolve more slowly compared to mitochondrial aaRS in other eukaryotes that lack plastids. Thus, we investigated the evolution of nuclear-encoded organellar and cytosolic translation machinery across a broad sampling of angiosperms, including non-photosynthetic (heterotrophic) plant species with reduced rates of plastid gene expression to test the hypothesis that translational demands associated with photosynthesis constrain the evolution of bacterial-like enzymes involved in organellar tRNA metabolism. Remarkably, heterotrophic plants exhibited wholesale loss of many organelle-targeted aaRS and other enzymes, even though translation still occurs in their mitochondria and plastids. These losses were often accompanied by apparent retargeting of cytosolic enzymes and tRNAs to the organelles, sometimes preserving aaRS-tRNA charging relationships but other times creating surprising mismatches between cytosolic aaRS and mitochondrial tRNA substrates. Our findings indicate that the presence of a photosynthetic plastid drives the retention of specialized systems for organellar tRNA metabolism.
README: Subcellular targeting of aminoacyl tRNA synthetases (aaRS) and other proteins in parasitic plants
Here is the output of subcellular targeting prediction programs for aminoacyl tRNA-synthetases (aaRS) and other proteins from a sampling of parasitic and autotrophic plant species. Provided are the protein models used to generate predictions, along with scripts and files for downstream analysis, including filtering out sequences that lack an N-terminus starting within 100 AA of the start of the orthologous protein from the model plant Arabidopsis thaliana. Jupyter notebooks used for statistics and plotting are also included.
Description of the data and file structure
Zipped data directories:
input_prot_fasta.zip contains fasta files of the protein sequences used for the analysis. These files contain all transcripts with sufficient similarity to the Arabidopsis ortholog from heterotrophic species and close relatives. To recapitulate my analysis, run all scripts in this directory.
Subcellular_Predictions.zip contains the predictions for protein sequences, prior to or after tidying. There are two subfolders, containing predictions for all protein models (alltrans_predictions/) or only for those models that start within 100 AA of the start of the Arabidopsis ortholog (first100bpAthal_predictions).
Metadata files needed for various steps in the analysis:
20240805_aaRS_presenceabsence.csv - contains a data matrix with information about presence (2), partial presence (1), or absence (0) of aminoacyl tRNA synthetases in parasitic plants and related species. For more information, see manuscript or https://doi.org/10.5061/dryad.0cfxpnw7p. Used for checking and plotting concordance between loss of organellar enzymes and retargeting of cytosolic enzymes.
targeting_orthogroups.csv - Metadata for each aaRS enzyme, including targeting, description including target amino acid, and orthogroup from the OrthoFinder analysis (see manuscript for more details). Formatted for use by the .ipynb notebooks.
targeting_orthogroups_Rheatmaps.csv - Metadata for each aaRS enzyme, including targeting, description including target amino acid, and orthogroup from the OrthoFinder analysis (see manuscript for more details). Formatted slightly differently for plotting heatmaps for each enzyme using the R script.
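As a rough illustration of the 2/1/0 coding described for 20240805_aaRS_presenceabsence.csv, the sketch below reads a small matrix and lists the rows scored as absent for one species. The layout (one row per enzyme, one column per species) and every row and column name here are assumptions made up for the example; check the actual file before adapting it.

```python
import csv
import io

# Codes as described in this README: 2 = present, 1 = partial, 0 = absent.
CODE_LABELS = {2: "present", 1: "partial", 0: "absent"}


def read_presence_absence(handle):
    """Return {row_id: {column_name: code}} from a CSV with IDs in column 1.

    Assumed layout (hypothetical): one row per enzyme, one column per species.
    """
    reader = csv.reader(handle)
    header = next(reader)
    species = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        codes = [int(value) for value in row[1:]]
        unknown = set(codes) - set(CODE_LABELS)
        if unknown:
            raise ValueError(f"unexpected code(s) {sorted(unknown)} in row {row[0]!r}")
        matrix[row[0]] = dict(zip(species, codes))
    return matrix


def absent_in(matrix, species):
    """List row IDs scored 0 (absent) for the given species column."""
    return sorted(row_id for row_id, codes in matrix.items() if codes[species] == 0)


# Toy data with made-up names, standing in for the real file.
example_csv = io.StringIO(
    "enzyme,species_A,species_B\n"
    "enzyme_1,2,0\n"
    "enzyme_2,1,2\n"
    "enzyme_3,0,0\n"
)
example_matrix = read_presence_absence(example_csv)
print(absent_in(example_matrix, "species_B"))
```

To use it on the real file, pass an open file handle (for example `open("20240805_aaRS_presenceabsence.csv", newline="")`) instead of the toy `StringIO` object.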
Scripts for data processing:
bash scripts for running steps of analysis:
step1_predicttargeting.sh - does targeting prediction
step2_mafft.sh - aligns all orthologues using mafft, and generates information about the alignment, including flagging sequences that do not start within 100 AA of the Arabidopsis ortholog N-terminus, and thus are likely truncated due to low quality transcript assembly.
step3_subsetpredictions.sh - writes predictions for sequences that start within 100 AA of the start of the Arabidopsis ortholog to a separate file. Also writes a separate file of the most retargeted protein model/transcript that starts within 100 AA of the Arabidopsis ortholog, which is used for downstream plotting. I provide these files in the .zip directories described above.
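The N-terminus filter used in steps 2 and 3 can be sketched as follows. This is not the code in mafftparse4.0.py, only a minimal illustration of the criterion stated in the version notes of this README: at least 15 contiguously aligned residues (no gaps) within the first 100 residues of the Arabidopsis reference. The function name, its parameters, and the decision to skip columns that are gaps in both sequences are assumptions.

```python
def starts_near_reference(ref_aligned, query_aligned, window=100, min_run=15, gap="-"):
    """Check whether a query has an N-terminus aligned near the reference start.

    Both arguments are rows taken from the same multiple sequence alignment,
    so they must have equal length. Returns True if the query has at least
    `min_run` consecutive residues aligned to reference residues within the
    first `window` residues of the reference.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")

    ref_residues = 0  # reference residues seen so far (gaps not counted)
    run = 0           # current run of gap-free aligned columns
    for ref_char, query_char in zip(ref_aligned, query_aligned):
        if ref_char == gap and query_char == gap:
            continue  # column exists only because of other sequences
        if ref_char != gap:
            ref_residues += 1
        if ref_residues > window:
            break  # past the first `window` residues of the reference
        if ref_char != gap and query_char != gap:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0  # a gap in either sequence breaks the run
    return False


# Toy alignment rows (hypothetical sequences, 150 columns each).
reference = "M" + "A" * 149
full_length = "M" + "A" * 149
truncated = "-" * 120 + "A" * 30

print(starts_near_reference(reference, full_length))
print(starts_near_reference(reference, truncated))
```

In this toy case the full-length model passes and the model missing its first 120 aligned positions is flagged as likely truncated.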
Jupyter notebooks for plotting and statistics:
- To make heatmaps, box plots of retargeting ratios (Fig. 2): 20240815_parse_targeting_aaRS_cutoff_0.50_dryad.ipynb
- To make Figure S2: 20240815_parse_overlap_presenceabsence_retargetboth.ipynb
- To make Presence/Absence-Retargeting matrix for tRNA plot: 20240820_parse_overlap_presenceabsence_retarget_ind.ipynb
R script for plotting all protein models:
- transcript_by_transcript_heatmap.r
Helper scripts and functions called through pipeline:
- mafftparse4.0.py
- tidy_predictions_100bpAthal.py
- tidy_predictionscmd_alltrans.py
Sharing/Access information
Data were generated by running in silico programs that predict the subcellular localization/targeting of proteins. Wrapper scripts for upstream processing of protein sequences via BLAST, for running the prediction programs, and for downstream analysis and visualization are included. Protein sequences (provided as .fa files) come from the orthogroup-finding step; see Dryad dataset https://doi.org/10.5061/dryad.0cfxpnw7p and the publication for details.
Programs for subcellular targeting prediction:
TargetP2.0:
Armenteros JJA, Salvatore M, Emanuelsson O et al., 2019. Detecting sequence signals in targeting peptides using deep learning. Life Science Alliance 2.
LOCALIZER
Sperschneider J, Catanzariti AM, Deboer K et al., 2017. LOCALIZER: Subcellular localization prediction of both plant and effector proteins in the plant cell. Scientific Reports 7.
MAFFT was used for multiple sequence alignment:
Katoh K, Misawa K, Kuma K-I, Miyata T, 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059–3066.
Blast+ was used for upstream filtering of protein sequences:
Camacho C, Coulouris G, Avagyan V et al., 2009. BLAST+: Architecture and applications. BMC Bioinformatics 10.
Seqkit is used for basic fasta manipulation:
Shen W, Le S, Li Y, Hu F, 2016. SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11.
Code/Software
Includes python scripts, bash scripts, and one R script used for upstream protein sequence processing and downstream analysis and data visualization.
Analysis also requires MAFFT, TargetP2.0 and LOCALIZER to be installed locally.
Starting with individual .fa files, one per gene of interest and each containing sequences from all species, in a single folder, navigate to that folder (i.e. input_prot_fasta/) and run:
bash ../step1_predicttargeting.sh
bash ../step2_mafft.sh
bash ../step3_subsetpredictions.sh
Navigate to PLOTTING/
Use Jupyter notebooks for plotting and statistics:
To make heatmaps, box plots of retargeting ratios (Fig. 2): 20240815_parse_targeting_aaRS_cutoff_0.50_dryad.ipynb
To make Figure S2: 20240815_parse_overlap_presenceabsence_retargetboth.ipynb
To make Presence/Absence-Retargeting matrix for tRNA plot: 20240820_parse_overlap_presenceabsence_retarget_ind.ipynb
Note: To plot targeting prediction for each protein model on a gene-by-gene basis as seen in supplemental dataset 6, use R script:
transcript_by_transcript_heatmap.r
python packages required include:
pandas
numpy
seaborn
scipy.stats
matplotlib
re
os
datetime
math
R packages required include:
tidyverse
aplot
dplyr
Version Changes
As of October 2024:
20240805: Use of new input orthogroup sequence files/orthogroup names due to a rerun of OrthoFinder with new gene models. Addition of plastid-specific targeting control sequences, including enzymes involved in fatty acid biosynthesis and the Clp protease complex. Changes are reflected in the updated folder input_prot_fasta/ and in the metadata files targeting_orthogroups.csv and 20240805_aaRS_presenceabsence.csv.
Also, the provided protein sequences are filtered: after finding all protein models for a given locus, BLASTP against the *A. thaliana* protein model was used to exclude any models from that locus that may be from a different reading frame and thus are not true orthologs. Originally, the unfiltered files with all protein models for a given locus were provided along with a bash script used to run blastp. These are excluded from the updated submission for ease of use, but more information about where to find these models is in the publication.
20240820: Changes to scripts:
- 20240815_parse_targeting_aaRS_cutoff_0.50_dryad.ipynb - raised the threshold for the probability/likelihood of a targeting prediction from TargetP or LOCALIZER to 0.5 (formerly 0.25) for a protein to be considered "retargeted".
- mafftparse4.0.py - updated to require 15 contiguously aligned amino acids (no gaps) within 100 amino acids of the start of the Arabidopsis thaliana reference sequence. Formerly (mafftparse3.0.py), any number of aligned amino acids within the first 100 amino acids was sufficient.
- Minor cosmetic changes to 20240820_parse_overlap_presenceabsence_retarget_ind.ipynb, 20240815_parse_overlap_presenceabsence_retargetboth.ipynb.
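The effect of raising the retargeting cutoff from 0.25 to 0.5 can be illustrated with a short sketch. It is not taken from the notebooks: the data structure, the compartment labels, the scores, and the use of "greater than or equal to" at the cutoff are all assumptions for illustration.

```python
def retargeted_compartments(scores, cutoff=0.5):
    """Return organelles predicted at or above `cutoff` by any program.

    `scores` is a hypothetical structure: {program: {compartment: probability}}.
    """
    hits = set()
    for by_compartment in scores.values():
        for compartment, probability in by_compartment.items():
            if probability >= cutoff:
                hits.add(compartment)
    return sorted(hits)


def is_retargeted(scores, cutoff=0.5):
    """A protein counts as retargeted if any program passes the cutoff."""
    return bool(retargeted_compartments(scores, cutoff))


# Made-up scores for one cytosolic protein model.
example_scores = {
    "TargetP": {"mitochondrion": 0.31, "plastid": 0.05},
    "LOCALIZER": {"mitochondrion": 0.42, "plastid": 0.10},
}

print(is_retargeted(example_scores, cutoff=0.25))  # former threshold
print(is_retargeted(example_scores, cutoff=0.5))   # current threshold
```

With these made-up scores, the protein model would have counted as retargeted to the mitochondrion under the former 0.25 threshold but not under the current 0.5 threshold.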
Methods
Subcellular predictions were generated with the in silico prediction programs TargetP2.0 (Armenteros et al. 2019) and LOCALIZER (Sperschneider et al. 2017). Also included are the input sequences for aaRS and other enzymes used for predictions, which were pulled from publicly available sequence data, along with scripts and metadata files for reproducing the results if desired.
Usage notes
MEGA11 can be used to view all multiple sequence alignments.
Jupyter Notebook and RStudio are needed to view the scripts.
Running scripts/analysis requires dependencies described in README.md