Skip to main content

The expansion and diversification of pentatricopeptide repeat RNA editing factors in plants

Cite this dataset

Small, Ian et al. (2019). The expansion and diversification of pentatricopeptide repeat RNA editing factors in plants [Dataset]. Dryad.


The RNA-binding pentatricopeptide repeat (PPR) family comprises hundreds to thousands of genes in most plants, but only a few dozen in algae, evidence of massive gene expansions during land plant evolution. The nature and timing of these expansions has not been well-defined due to the sparse sequence data available from early-diverging land plant lineages. We exploit the comprehensive OneKP dataset of over 1,000 transcriptomes from diverse plants and algae to establish a clear picture of the evolution of this massive gene family, focusing on the proteins typically associated with RNA editing, which show the most spectacular variation in numbers and domain composition across the plant kingdom. We characterise over 2,250,000 PPR motifs in over 400,000 proteins. In lycophytes, polypod ferns and hornworts, nearly 10% of expressed protein-coding genes encode putative PPR editing factors, whereas they are absent from algae and complex-thalloid liverworts. We show that rather than a single expansion, most land plant lineages with high numbers of editing factors have continued to generate novel sequence diversity. We identify sequence variation that implies functional differences between PPR proteins in seed plants versus non-seed plants and which we propose to be linked to seed-plant-specific editing cofactors. Finally, using the sequence variation across the dataset, we develop a structural model of the catalytic DYW domain associated with C-to-U editing and identify a clade of unique DYW variants that are strong candidates as U-to-C RNA editing factors, given their phylogenetic distribution and sequence characteristics.


Sources of data

Nucleotide sequences from the 1000 plants initiative (oneKP) (Matasci et al., 2014) were downloaded from The files used for this study are the SOAPdenovo-Trans (Xie et al., 2014)assemblies (file names of the form ‘xxxx-SOAPdenovo-Trans-assembly.fa’ where xxxx is the four-letter sample code).

Prediction of PPR motif arrays

Assemblies were translated in all six frames, retaining open reading frames of at least 31 amino acids (the length of an S motif, the shortest of the PPR motif variants). Hmmsearch from the HMMER 3.2.1 package (Eddy, 2018) was used to search for PPR motifs, initially using the motif profiles developed by(Cheng et al., 2016), but subsequently we re-defined the DYW motif and defined the new DYW:KP variant using oneKP sequences and used these profiles in place of the DYW profile of (Cheng et al., 2016)Hmmsearch settings were ‘--domtblout --noali -E 0.1’. Our own in-house code (PPRfinder) was used to select the best-scoring non-overlapping chain of motifs from the domain table output of hmmsearch, with a motif score threshold of 0 for all motif types, except SS motifs (score >= 10) and DYW motifs (score >= 30). All motifs on the same strand, whatever frame they were in, were considered in this process. Amino acid sequences from different frames were joined where necessary to connect PPR motif arrays, with ‘X’ residues inserted to maintain the approximate length and to indicate the junction. For all the studies reported here, the protein sequences were filtered such that only sequences with either at least two adjacent PPR motifs and a sum of motif scores >= 40, or a DYW motif, were retained. These are stringent thresholds that we believe eliminate almost all non-PPR sequences.

Alignment of PPR motifs

For most large alignments based on single motifs, sequences were aligned with hmmalign from the HMMER 3.2.1 package (Eddy, 2018)Hmmalign tends to give terminal gaps and unaligned tails where the sequence is a relatively poor match to the profile HMM and there is no surrounding sequence context provided. As all of the aligned motifs were originally identified using hmmsearch with the same profile HMM, we know that in its native sequence context each sequence does align to the HMM. Thus we wrote a script to replace terminal gaps at each end of the motif by the corresponding unaligned tails, as long as the gap and the tail were of equal length and thus there is no possible ambiguity in the placement of the amino acid residues. This procedure avoided overly sparse positions at the motif termini and a bias towards residues that best matched the HMM. For generating consensus sequences and structural models, the alignments were filtered to remove positions at which at least 50% of the sequences contained gaps. The alignments were then filtered a second time to remove sequences containing at least 5% gaps. For DYW alignments, the alignments were also filtered to remove sequences lacking the cytidine deaminase signature (HxEx25CxxC), to avoid including degenerate, non-functional sequences.

 Construction of phylogenetic trees

Large trees (thousands of sequences) were constructed using the approximate maximum likelihood approach of FastTree 2 (v. 2.1.11 double precision) (Price et al., 2010) using the JTT model of amino acid evolution (Jones et al., 1992), pseudocounts (-pseudo option) and a discrete gamma model with 20 rate categories ( -gamma option). For local support values, FastTree uses the Shimodaira-Hasegawa test on alternate nearest-neighbour interchanges for each split in the tree.

Modelling the DYW domain

Contact-derived ab initio modelling was performed using the RaptorX contact web server, available at, using a trimmed alignment of 15,177 DYW sequences as input for the contact prediction tool, and a consensus of the DYW domain as the subject for modelling (Ma et al., 2015; Wang et al., 2016; Wang et al., 2017). Models were visualised using the PyMOL Molecular Graphics System, Version 2.0 (Schrödinger, LLC), with zinc and 3-deazacytidine modelled into the best prediction based on the crystal structure of E. coli cytidine deaminase complexed to 3-deazacytidine, PDB ID 1ALN (Xiang et al., 1996).

Usage notes

Description of the files stored in this package
Figure numbers refer to Gutmann et al., Molecular Plant, 2019

DYW.pdb   538.32 kB    PDB format model of the DYW domain as depicted in Fig 6A and Fig S7A
DYW_contactmap.txt   182.63 kB    Contact map from Raptor X, depicted in Fig S7B
LL.pdb   448.41 kB    PDB format model of the LL motif as depicted in Fig 3C
LL_contactmap.txt   131.78 kB    Contact map from Raptor X, used to build the model depicted in Fig 3C
PPRfinder_output.tgz   193.58 MB    compressed output from PPRfinder for all 1KP samples; see for details
all_PPR.hmm   295.33 kB    all PPR motif profiles built by hmmbuild and used in hmmsearch; the profiles are identical to those developed by Cheng et al. 2016 except for the addition of the "KPAKA" profile (DYW:KP domain); see for details
all_dyw_motifs.ali.fa   2.58 MB    Fasta format alignment of DYW domain sequences built using hmmalign and used for constructing trees and contact maps
all_dyw_motifs.nw   924.10 kB Newick format tree of DYW domain sequences, as depicted in Fig 2 and Fig 5
all_dyw_motifs_inc_algae.ali.fa   2.58 MB    Fasta format alignment of DYW domain sequences, including those from algae, built using hmmalign
all_dyw_motifs_inc_algae.nw   1 MB    Newick format tree of DYW domain sequences, including those from algae, as depicted in Fig S9B
dyw.hmm   64.25 kB    DYW domain profile built by hmmbuild and used in hmmsearch (is also included in all_PPR.hmm); see for details
dyw_kp.hmm   63.79 kB    DYW:KP domain profile built by hmmbuild and used in hmmsearch (is also included in all_PPR.hmm); see for details (the profile is named "KPAKA")
dywkp.ali.fa   291.55 kB    Fasta format alignment of DYW:KP domain sequences, built using hmmalign
dywnotkp.ali.fa   2.68 MB    Fasta format alignment of DYW domain sequences, not including DYW:KP sequences, built using hmmalign
ll_triplet.ali.fa   197.50 kB    Fasta format alignment of P1-LL-S1 triplet sequences, built using hmmalign
morf.hmm   51.23 kB    MORF domain profile built by hmmbuild and used in hmmsearch; used to construct Fig 4B
one_sample_per_clade_cter.nw   23.75 kB    Newick format tree of P2-L2-S2-E1-E2-DYW sequences, as depicted in Fig S3
pls_triplet.ali.fa   8.79 MB    Fasta format alignment of P1-L1-S1 triplet sequences, built using hmmalign
ss_triplet.ali.fa   1.10 MB    Fasta format alignment of SS-SS-SS triplet sequences, built using hmmalign


Australian Research Council, Award: FL140100179

Australian Research Council, Award: DP150102692

Australian Research Council, Award: CE140100008

Australian Research Council, Award: DE150101484

German Research Foundation (DFG), Award: SCHA1952/2-1