Deep sequencing datasets from: Witnessing the structural evolution of an RNA enzyme

Portillo, Xavier 1 ; Huang, Yu-Ting2 ; Breaker, Ronald1 ; David, Horning 2 ; Joyce, Gerald 2

Published Sep 29, 2021 on Dryad. https://doi.org/10.5061/dryad.c866t1g78

Abstract

An RNA polymerase ribozyme that has been the subject of extensive directed evolution efforts has attained the ability to synthesize complex functional RNAs, including a full-length copy of its own evolutionary ancestor. During the course of evolution, the catalytic core of the ribozyme has undergone a major structural rearrangement, resulting in a novel tertiary structural element that lies in close proximity to the active site. Through a combination of site-directed mutagenesis, structural probing, and deep sequencing analysis, the trajectory of evolution was seen to involve the progressive stabilization of the new structure, which provides the basis for improved catalytic activity of the ribozyme. Multiple paths to the new structure were explored by the evolving population, converging upon a common solution. Tertiary structural remodeling of RNA is known to occur in nature, as evidenced by the phylogenetic analysis of extant organisms, but this type of structural innovation had not previously been observed in an experimental setting. Despite prior speculation that the catalytic core of the ribozyme had become trapped in a narrow local fitness optimum, the evolving population has broken through to a new fitness locale, raising the possibility that further improvement of polymerase activity may be achievable.

Deep Sequencing of 19 rounds of evolution

Sequencing of PCR products obtained after various rounds of evolution was performed at the Yale Center for Genome Analysis on an Illumina NovaSeq 6000, which generated ~20 million paired reads per sample. Reads were trimmed, combined, demultiplexed and filtered using Illumina standard paired end sequencing protocol (#1005063). The sequence datasets were quality filtered (phred >33), and trimmed (>150 nucleotides) using the paired-end read merger program PEAR (v 0.9.11). Individual sequences were enumerated and converted to a fastq file format using a custom Python script (eLife '21 suppfileA). The file sizes were reduced by removing sequences with less than 10 reads for rounds 16 and 31, less than 10,000 reads for round 27, and less than 1,000 reads for all other rounds. The fastq file entries were then aligned using MUSCLE (v 3.8.31). The aligned reads were trimmed to the region encompassing the P7 and P8 stems (nucleotides 9–17 and 83–95) using AliView (v 1.26), then clustered using cd-hit-est (v 4.8.1), with a clustering threshold of 100% identity (-c 1.0), maximum unmatched length of 2 nucleotides (-U 2), and length difference cutoff of 2 nucleotides (-S 2). Clusters with >1% representation in any given round were identified. Resulting tables were manually processed to produce heat map plots and a table of P8 variant percentages through generations.

Determination of polymerase fidelity by deep sequencing.

The hammerhead and class I ligase ribozymes were synthesized by the 52-2 polymerase under standard reaction conditions (100 nM polymerase, 100 nM template, 80 nM primer, 4 mM each NTP, and 50 or 200 mM MgCl2 at pH 8.3 and 17 °C). For the hammerhead, all partial- and full-length products were collected after the reactions yielded 2% of full-length products using 200 or 50 mM MgCl2 (40 m and 6 h, respectively), and analyzed. For the ligase, only gel-purified full-length ligase from a 24 hour extension reaction with 200 mM MgCl2 (yielding 2% extension to full-length) was analyzed. The products were converted to DNA molecules for Illumina sequencing as described previously (Tjhung et al., 2020). Briefly, RNA products were ligated to the Universal miRNA Cloning Linker using K227Q T4 RNA Ligase 2 and reverse transcribed with Superscript IV using primer Rev2. The resulting cDNA was isolated and tailed with poly(C) using terminal transferase, and then amplified by PCR using Q5 Hot Start High-Fidelity DNA Polymerase and primers Fwd2 and Rev2. For ligase products, cDNA tailing with terminal transferase and PCR amplification were not performed. For both ribozyme products, Illumina adapter sequences were added to the ends of the cDNA using primers Fwd3 or Fwd4 (for hammerhead or ligase, respectively) and Rev3, followed by amplification using the Illumina Nextera Index primers. Sequencing was carried out by the Salk Next Generation Sequencing Core on an Illumina MiniSeq, with either a 75- or 150-cycle paired-end run for the hammerhead or ligase, respectively. The sequence data were processed to categorize all mutations relative to the expected sequence, as described previously (Tjhung et al., 2020).

For the ligase ribozyme, an updated method from (Tjhung et al., 2020) was used for the distinct reference sequence: reads were first trimmed using cutadapt v3.4 using parameters --trimmed-only -e 0.25 --pair-filter both -M 120 -n 2 -j 2 –a CTGTAGGCACCATCAATCTGTCTCTTATACACATCTCCGAGCCC -G ATTGATGGTGCCTACAG -A CTGTCTCTTATACACATCTGACGCTGCCGACGA. Sequences without barcodes were filtered out with cutadapt using parameters -g GGAAAAGACAAATCTGCCCT --action none --discard-untrimmed -e 0.05 --pair-filter both -n 1 -j 2. Paired reads were merged using FLASH v1.2.11 with arguments -t 1 -m 50 -M 100 -x 0, and quality filtered using FASTX Toolkit v0.0.14 with –q 36 –p 100. Then bowtie2 v2.4.2 was used to align merged to the template sequence (containing both constant regions) using the following parameters --end-to-end –score-min L,0,-1.2 –rdg 3,5 –rfg 3,5 -L 5 –reorder and a reference sequence "GGAAAAGACAAATCTGCCCTCAGAGCTTGAGAACATCTTCGGATGCAGGGGAGGCAGCCCCCGGTGGCTTTAACGCCAACGTTCTCAACAATAGTGATTTTTTCTGTAGGCACCATCAAT" with constant regions not included in later fidelity measurements underlined. The generated sam file was converted into a sorted indexed bam file using SAMtools v1.9, and edit distances extracted from the bam file to determine the distribution of product Levenshtein distances from the ligase reference sequence. Breseq v0.35.5 bamtoaln was used to create a gapped alignment file (.txt) of the reads (specifying the number of aligned reads with -n 4150646) to the template sequence. A custom java script (Tjhung et al., 2020) was used to calculate the number of matches, mismatches, deletions, and insertions as a function of the template position. The script was compiled after extracting the java files using javac v16.0.2 (Oracle Inc.), and run with java using the output tables from breseq and a template length of 120. Resulting tables were manually processed to produce position-specific plots and analyses.

For the hammerhead ribozyme, an identical method was used as previously reported (Tjhung et al., 2020): reads were first trimmed using cutadapt v2.4 using parameters --trimmed-only -e 0.25 --pair-filter both -n 2 -j 2 -a CTGTAGGCACCATCAATCTGTCTCTTATACACATCTCCGAGCCC -G ATTGATGGTGCCTACAG -A CTGTCTCTTATACACATCTGACGCTGCCGACGA. Sequences without barcodes were filtered out with cutadapt -g CTACAGGGCACTCCACAC --action none --discard-untrimmed -e 0.05 --pair-filter both -n 1 -j 2. Paired reads were merged using FLASH v1.2.11 with arguments -t 1 -m 4 -M 48 -x 0.3, and quality filtered using FASTX Toolkit v0.0.13 with -Q33 -q 30 -p 100. Then bowtie2 v2.4.2 was used to align merged to the template sequence (containing both constant regions) using the following parameters --end-to-end -L 5 -reorder and a reference sequence "CTACAGGGCACTCCACACGACGTACTGATGAGGCCGAAAGGCCGAAAAGCGTTTTTTGTCATTGTCCTGTAGGCACCATCAAT" with constant regions not included in later fidelity measurements underlined. The generated sam file was converted into a sorted indexed bam file using SAMtools v1.9. Breseq v0.27.0 bamtoaln was used to create a gapped alignment file (.txt) of the reads (-n 2000000) to the template sequence. A custom java script (Tjhung et al., 2020) was used to calculate the number of matches, mismatches, deletions, and insertions as a function of the template position and the read length class. The script was compiled after extracting the java files using javac v1.7 (Oracle Inc.), and run with java using the output tables from breseq and a template length of 83. Resulting tables were manually processed to produce position-specific and length-specific plots and analyses.

Fwd2 GGGGGGATGCTACATG

Fwd3 TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTACAGGGCACTCCACAC

Fwd4 TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGGAAAAGACAAATCTGCC

Rev2 ATTGATGGTGCCTACAG

Rev3 GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGATTGATGGTGCCTACAG

Tjhung KF, Shokhirev MN, Horning DP, Joyce GF. 2020. An RNA polymerase ribozyme that synthesizes its own ancestor. Proc. Natl. Acad. Sci. USA 117:2906–2913.

Directory organization of files

Each selection round has its own directory (ex. eLife21_R#) containing five files corresponding to the aligned reads (eLife_R#_aligned), two cluster files (eLife21_R#_Clusters_a/b) and two files corresponding to the raw Illumina reads (eLife21_R#_R1/2.fastq.gz). A source data spreadsheet (eLife '21 suppfile2 source data.xlsx) and the python script used to enumerate the sequence reads (eLife '21 suppfileA.py) are provided as separate files. Fidelity data is provided as separate files consisting of raw Illumina reads of polymerase product RNA, (eLife21_###_###_R1/2.fastq.gz), aligned bam file (eLife21_c1l_200_align_sort.bam), and the final source data spreadsheet (eLife21_52-2_fidelity_table.xlsx) for polymerase fidelity calculations.

Deep Sequencing of 19 rounds of evolution

Classification of P8 stem variants - Microsoft Excel table

Excel spreadsheet parses cd-hit cluster output (eLife21_R#_Clusters_a/b) to determine the number of total reads associated with each cluster, and identifies clusters with >1% representation in the round (sheets R6-52). Clusters are manually classified and cross-checked against the aligned full sequences (eLife21_R#_aligned) to generate the table for Supplementary file 2 (sheet "prevalent clusters"). This information is used to generate the cluster representation-by-round heatmap (sheet "summary") that was used to generate Figure 5B.

"eLife '21 suppfile2 source data.xlsx"

P8 region clusters - text files

Output of cd-hit clustering of aligned regions in nucleotides 9–17 and 83–95 (as referenced to the 52-2 polymerase sequence). Used to generate high-representation cluster tables in "eLife '21 suppfile2 source data.xlsx".

"eLife21_R#_Clusters_a.txt"; "eLife21_R#_Clusters_b.txt"

Aligned polymerase sequences by round - fastq file

Gapped alignment of all sequences, by round, generated by MUSCLE alignment. These files were trimmed in AliView to the region encompassing the P7 and P8 stems (nucleotides 9–17 and 83–95) and clustered using cd-hit-est.

"eLife21_R#_aligned.fastq"

Enumeration of individual sequences - script

Python script to enumerate individual sequences in a fastq file generates a new fastq file of unique sequences, named by number of reads of that specific sequence in the input file. Used on merged, filtered, and trimmed reads, with output aligned by MUSCLE to generate "eLife21_R#_aligned.fastq"

"eLife '21 suppfileA.py"

Raw Fastq files from individual rounds of directed in vitro evolution

Fastq files pertaining to sequences present in selected rounds of polymerase evolution. Select rounds (24, 38, 43, and 49) were sequenced as multiple sample sets, R24 as 3 sample sets and R38/43/49 as 2 sample sets. Paired-end reads for each round were merged, quality filtered (phred >33), and trimmed (>150 nucleotides) using PEAR (v 0.9.11), followed by enumeration of individual sequences using the "eLife '21 suppfileA.py" script.

"eLife21_R#_raw_R1.fastq.gz"; "eLife21_R#_raw_R2.fastq.gz"

Determination of polymerase fidelity by deep sequencing.

Raw fastq files from ribozyme product libraries

Fastq files were generated by Illumina MiniSeq runs from dsDNA libraries of ribozyme products synthesized by the 52-2 polymerase, as described in the methods section. 75-cycle paired-end runs were used for hammerhead products, and 150-cycle paired-end run for the ligase products. Files are labeled by ribozyme (hhead or c1l) and Mg++ concentration (200 or 50). Sequences of both sets were trimmed using cutadapt, merged using FLASH, and filtered for high quality reads using FASTX Toolkit, with parameters described in the methods section. These trimmed, merged, and filtered reads were then aligned to the full length sequence (hammerhead and separately class I ligase) using bowtie2. The sam files generated by bowtie2 were subsequently converted into a bam file and sorted using SAMtools.

“eLife21_c1l_200_R1/2.fastq.gz”

“eLife21_hhead_200_R1/2.fastq.gz”

“eLife21_hhead_50_R1/2.fastq.gz”

Compressed binary sequence alignment/map file of class I ligase products (*.bam)

Edit distances to the reference sequence are listed by bowtie2 in field 17 (NM:i:<N>), and were extracted using SAMtools view and the following unix command: | awk '{print $17}' | sort | uniq -c | awk '{print $2"\t"$1}'. A gapped alignment file was generated from the sorted bam file using breseq (v0.35.5) bamtoaln. Finally, the resulting gapped alignment table was processed using a custom java script (Tjhung et al., 2020) to tabulate the number of substitution, deletion, and insertion mutations for each aligned read at each position.

“eLife21_c1l_200_align_sort.bam”

Polymerase fidelity tables and distribution of errors between products – Microsoft Excel table

Tabulated mutations by position for hammerhead and ligase products were added as separate tabs to an excel file, which were then used to calculate frequency of polymerase errors by base identity and average fidelity as the geometric mean of fidelity for the four bases. The distribution of edit distances determined by the bowtie2 alignment of ligase product reads were used to determine the distribution of errors between ligase products, and are shown below the ligase fidelity table. Observed errors are compared to expected errors for an average fidelity of 84.1%, using the binomial distribution. Results are presented in two charts to the right of the data.

“eLife21_52-2_fidelity_table.xlsx”

Deep sequencing datasets from: Witnessing the structural evolution of an RNA enzyme

Data files

Abstract

Methods

Usage notes

Works referencing this dataset