Deep sequencing datasets from: RNA-catalyzed evolution of catalytic RNA
Cite this dataset
Papastavrou, Nikos; Horning, David; Joyce, Gerald (2024). Deep sequencing datasets from: RNA-catalyzed evolution of catalytic RNA [Dataset]. Dryad. https://doi.org/10.5061/dryad.rxwdbrvgs
Abstract
This dataset includes raw and processed sequencing data from evolving RNA populations described in Nikos Papastavrou, David P Horning, Gerald F Joyce. "RNA-Catalyzed Evolution of Catalytic RNA" (submitted). Briefly, directed evolution of a hammerhead ribozyme sequence was carried out over eight rounds of three steps each: 1) templated synthesis of the reverse-complement hammerhead RNA by a polymerase ribozyme; 2) templated synthesis of a new copy of the hammerhead RNA from the reverse-complement by a polymerase ribozyme; 3) selective recovery of hammerhead RNA that cleaved an attached RNA substrate. Cleaved RNA was reverse transcribed, PCR amplified, and archived for sequencing, while a portion was in vitro transcribed with T7 RNA polymerase to initiate the next round of evolution. Two distinct branches of evolution were carried out for 8 rounds, using the '52-2' or '71-89' polymerase ribozymes to replicate RNA, respectively. Sequenced RNA populations were analyzed to determine polymerase ribozyme fidelity and study the evolution of hammerhead sequences replicated by polymerases with low or high RNA copying fidelities. The dataset includes raw sequence files, processed tables of mutations by position along the sequence for each polymerase, processed tables of the sequence frequency distribution from each round in the evolving populations, and spreadsheets containing final processed data used directly in manuscript figures and tables.
README: Deep sequencing datasets from: RNA-Catalyzed Evolution of Catalytic RNA.
https://doi.org/10.5061/dryad.rxwdbrvgs
This dataset includes raw and processed sequencing data from evolving RNA populations described in Nikos Papastavrou, David P Horning, Gerald F Joyce. "RNA-Catalyzed Evolution of Catalytic RNA" (submitted). DNA libraries from each round of selection were prepared as described in the "Materials and Methods" and "Supplementary Information: Supporting Text" sections of that manuscript. Briefly, directed evolution of a hammerhead ribozyme sequence was carried out in three steps: 1) templated synthesis of the reverse-complement hammerhead RNA by a polymerase ribozyme; 2) templated synthesis of a new copy of the hammerhead RNA from the reverse-complement by a polymerase ribozyme; 3) selective recovery of hammerhead RNA that cleaved an attached RNA substrate. RNA products from each of these three steps ('HHR-', 'HHR+ pre-cleaved', and 'HHR+ cleaved', respectively) were reverse transcribed, amplified by PCR, and sequenced in an Illumina NextSeq2000 with a 300-cycle paired-end run. Two distinct branches of evolution were carried out for 8 rounds, using the '52-2' or '71-89' polymerase ribozymes to replicate RNA, respectively. A third 'reselection' branch took six distinct hammerhead ribozyme sequences, and subjected them to a single round of directed evolution either individually or as an equimolar mixture, using the '71-89' polymerase for RNA replication.
Raw paired-end sequence reads were trimmed, merged, and filtered as described below. Two different analytical pipelines were performed. In one, data from the first round of evolution in either of the first two branches were used to determine the per-nucleotide fidelity of each polymerase in synthesizing the HHR- and HHR+ pre-cleaved RNA populations. In a second pipeline, data over all eight rounds of evolution were analyzed to determine the frequency of each distinct sequence in each round, and further processed to determine distance to reference sequences, whether the sequence matched the canonical hammerhead motif, and whether the sequence formed part of a high-abundance cluster of closely-related sequences. These processed data were then used to generate figures and plots for the manuscript. Details of each pipeline are included below, and all custom scripts in Python and R are included in the dataset.
Description of the data and file structure
Data list
The dataset is organized into the following directory structure, with annotation provided to the right:
DATA each directory compressed separately
|- raw_data/
| |- R1_89_S1_R1_001.fastq.gz 71-89 branch Round 1 HHR- paired end reads
| |- R1_89_S1_R2_001.fastq.gz
| |- R1_89_E1_S2_R1_001.fastq.gz 71-89 branch Round 1 HHR+ pre-cleaved paired end reads
| |- R1_89_E1_S2_R2_001.fastq.gz
| |- R1_89_E2_S3_R1_001.fastq.gz 71-89 branch Round 1 HHR+ cleaved paired end reads
| |- R1_89_E2_S3_R2_001.fastq.gz
| |- R1_52_2_S5_R1_001.fastq.gz 52-2 branch Round 1 HHR- paired end reads
| |- R1_52_2_S5_R2_001.fastq.gz
| |- R1_52_2_E1_S6_R1_001.fastq.gz 52-2 branch Round 1 HHR+ pre-cleaved paired end reads
| |- R1_52_2_E1_S6_R2_001.fastq.gz
| |- R1_52_2_E2_S7_R1_001.fastq.gz 52-2 branch Round 1 HHR+ cleaved paired end reads
| |- R1_52_2_E2_S7_R2_001.fastq.gz
| |- R2_89_S9_R1_001.fastq.gz 71-89 branch Round 2 HHR- paired end reads
| |- R2_89_S9_R2_001.fastq.gz
| |- R2_89_E1_S10_R1_001.fastq.gz 71-89 branch Round 2 HHR+ pre-cleaved paired end reads
| |- R2_89_E1_S10_R2_001.fastq.gz
| |- R2_89_E2_S11_R1_001.fastq.gz 71-89 branch Round 2 HHR+ cleaved paired end reads
| |- R2_89_E2_S11_R2_001.fastq.gz
| |- R2_52_2_S13_R1_001.fastq.gz 52-2 branch Round 2 HHR- paired end reads
| |- R2_52_2_S13_R2_001.fastq.gz
| |- R2_52_2_E1_S14_R1_001.fastq.gz 52-2 branch Round 2 HHR+ pre-cleaved paired end reads
| |- R2_52_2_E1_S14_R2_001.fastq.gz
| |- R2_52_2_E2_S15_R1_001.fastq.gz 52-2 branch Round 2 HHR+ cleaved paired end reads
| |- R2_52_2_E2_S15_R2_001.fastq.gz
| |- R3_89_S17_R1_001.fastq.gz 71-89 branch Round 3 HHR- paired end reads
| |- R3_89_S17_R2_001.fastq.gz
| |- R3_89_E1_S18_R1_001.fastq.gz 71-89 branch Round 3 HHR+ pre-cleaved paired end reads
| |- R3_89_E1_S18_R2_001.fastq.gz
| |- R3_89_E2_S19_R1_001.fastq.gz 71-89 branch Round 3 HHR+ cleaved paired end reads
| |- R3_89_E2_S19_R2_001.fastq.gz
| |- R3_52_2_S21_R1_001.fastq.gz 52-2 branch Round 3 HHR- paired end reads
| |- R3_52_2_S21_R2_001.fastq.gz
| |- R3_52_2_E1_S22_R1_001.fastq.gz 52-2 branch Round 3 HHR+ pre-cleaved paired end reads
| |- R3_52_2_E1_S22_R2_001.fastq.gz
| |- R3_52_2_E2_S23_R1_001.fastq.gz 52-2 branch Round 3 HHR+ cleaved paired end reads
| |- R3_52_2_E2_S23_R2_001.fastq.gz
| |- R4_89_S25_R1_001.fastq.gz 71-89 branch Round 4 HHR- paired end reads
| |- R4_89_S25_R2_001.fastq.gz
| |- R4_89_E1_S26_R1_001.fastq.gz 71-89 branch Round 4 HHR+ pre-cleaved paired end reads
| |- R4_89_E1_S26_R2_001.fastq.gz
| |- R4_89_E2_S27_R1_001.fastq.gz 71-89 branch Round 4 HHR+ cleaved paired end reads
| |- R4_89_E2_S27_R2_001.fastq.gz
| |- R4_52_2_S29_R1_001.fastq.gz 52-2 branch Round 4 HHR- paired end reads
| |- R4_52_2_S29_R2_001.fastq.gz
| |- R4_52_2_E1_S30_R1_001.fastq.gz 52-2 branch Round 4 HHR+ pre-cleaved paired end reads
| |- R4_52_2_E1_S30_R2_001.fastq.gz
| |- R4_52_2_E2_S31_R1_001.fastq.gz 52-2 branch Round 4 HHR+ cleaved paired end reads
| |- R4_52_2_E2_S31_R2_001.fastq.gz
| |- R5_89_S33_R1_001.fastq.gz 71-89 branch Round 5 HHR- paired end reads
| |- R5_89_S33_R2_001.fastq.gz
| |- R5_89_E1_S34_R1_001.fastq.gz 71-89 branch Round 5 HHR+ pre-cleaved paired end reads
| |- R5_89_E1_S34_R2_001.fastq.gz
| |- R5_89_E2_S35_R1_001.fastq.gz 71-89 branch Round 5 HHR+ cleaved paired end reads
| |- R5_89_E2_S35_R2_001.fastq.gz
| |- R5_52_2_S37_R1_001.fastq.gz 52-2 branch Round 5 HHR- paired end reads
| |- R5_52_2_S37_R2_001.fastq.gz
| |- R5_52_2_E1_S38_R1_001.fastq.gz 52-2 branch Round 5 HHR+ pre-cleaved paired end reads
| |- R5_52_2_E1_S38_R2_001.fastq.gz
| |- R5_52_2_E2_S39_R1_001.fastq.gz 52-2 branch Round 5 HHR+ cleaved paired end reads
| |- R5_52_2_E2_S39_R2_001.fastq.gz
| |- R6_89_S41_R1_001.fastq.gz 71-89 branch Round 6 HHR- paired end reads
| |- R6_89_S41_R2_001.fastq.gz
| |- R6_89_E1_S42_R1_001.fastq.gz 71-89 branch Round 6 HHR+ pre-cleaved paired end reads
| |- R6_89_E1_S42_R2_001.fastq.gz
| |- R6_89_E2_S43_R1_001.fastq.gz 71-89 branch Round 6 HHR+ cleaved paired end reads
| |- R6_89_E2_S43_R2_001.fastq.gz
| |- R6_52_2_S45_R1_001.fastq.gz 52-2 branch Round 6 HHR- paired end reads
| |- R6_52_2_S45_R2_001.fastq.gz
| |- R6_52_2_E1_S46_R1_001.fastq.gz 52-2 branch Round 6 HHR+ pre-cleaved paired end reads
| |- R6_52_2_E1_S46_R2_001.fastq.gz
| |- R6_52_2_E2_S47_R1_001.fastq.gz 52-2 branch Round 6 HHR+ cleaved paired end reads
| |- R6_52_2_E2_S47_R2_001.fastq.gz
| |- R7_89_S49_R1_001.fastq.gz 71-89 branch Round 7 HHR- paired end reads
| |- R7_89_S49_R2_001.fastq.gz
| |- R7_89_E1_S50_R1_001.fastq.gz 71-89 branch Round 7 HHR+ pre-cleaved paired end reads
| |- R7_89_E1_S50_R2_001.fastq.gz
| |- R7_89_E2_S51_R1_001.fastq.gz 71-89 branch Round 7 HHR+ cleaved paired end reads
| |- R7_89_E2_S51_R2_001.fastq.gz
| |- R7_52_2_S53_R1_001.fastq.gz 52-2 branch Round 7 HHR- paired end reads
| |- R7_52_2_S53_R2_001.fastq.gz
| |- R7_52_2_E1_S54_R1_001.fastq.gz 52-2 branch Round 7 HHR+ pre-cleaved paired end reads
| |- R7_52_2_E1_S54_R2_001.fastq.gz
| |- R7_52_2_E2_S55_R1_001.fastq.gz 52-2 branch Round 7 HHR+ cleaved paired end reads
| |- R7_52_2_E2_S55_R2_001.fastq.gz
| |- R8_89_S57_R1_001.fastq.gz 71-89 branch Round 8 HHR- paired end reads
| |- R8_89_S57_R2_001.fastq.gz
| |- R8_89_E1_S58_R1_001.fastq.gz 71-89 branch Round 8 HHR+ pre-cleaved paired end reads
| |- R8_89_E1_S58_R2_001.fastq.gz
| |- R8_89_E2_S59_R1_001.fastq.gz 71-89 branch Round 8 HHR+ cleaved paired end reads
| |- R8_89_E2_S59_R2_001.fastq.gz
| |- R8_52_2_S61_R1_001.fastq.gz 52-2 branch Round 8 HHR- paired end reads
| |- R8_52_2_S61_R2_001.fastq.gz
| |- R8_52_2_E1_S62_R1_001.fastq.gz 52-2 branch Round 8 HHR+ pre-cleaved paired end reads
| |- R8_52_2_E1_S62_R2_001.fastq.gz
| |- R8_52_2_E2_S63_R1_001.fastq.gz 52-2 branch Round 8 HHR+ cleaved paired end reads
| |- R8_52_2_E2_S63_R2_001.fastq.gz
| |- HHR_E1_wt_S9_R1_001.fastq.gz re-selection branch wild-type alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_wt_S9_R2_001.fastq.gz
| |- HHR_E1_3_S10_R1_001.fastq.gz re-selection branch Seq3 alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_3_S10_R2_001.fastq.gz
| |- HHR_E1_2_S11_R1_001.fastq.gz re-selection branch Seq2 alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_2_S11_R2_001.fastq.gz
| |- HHR_E1_5_S12_R1_001.fastq.gz re-selection branch Seq5 alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_5_S12_R2_001.fastq.gz
| |- HHR_E1_15_S13_R1_001.fastq.gz re-selection branch Seq15 alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_15_S13_R2_001.fastq.gz
| |- HHR_E1_35_S15_R1_001.fastq.gz re-selection branch Seq35 alone HHR+ pre-cleaved paired end reads
| |- HHR_E1_35_S15_R2_001.fastq.gz
| |- HHR_E1_c_S16_R1_001.fastq.gz re-selection branch wt+Seq3+Seq2+Seq5+Seq15+Seq35 HHR+ pre-cleaved paired end reads
| |- HHR_E1_c_S16_R2_001.fastq.gz
| |- HHR_E2_wt_S17_R1_001.fastq.gz re-selection branch wild-type alone HHR+ cleaved paired end reads
| |- HHR_E2_wt_S17_R2_001.fastq.gz
| |- HHR_E2_3_S18_R1_001.fastq.gz re-selection branch Seq3 alone HHR+ cleaved paired end reads
| |- HHR_E2_3_S18_R2_001.fastq.gz
| |- HHR_E2_2_S19_R1_001.fastq.gz re-selection branch Seq2 alone HHR+ cleaved paired end reads
| |- HHR_E2_2_S19_R2_001.fastq.gz
| |- HHR_E2_5_S20_R1_001.fastq.gz re-selection branch Seq5 alone HHR+ cleaved paired end reads
| |- HHR_E2_5_S20_R2_001.fastq.gz
| |- HHR_E2_15_S21_R1_001.fastq.gz re-selection branch Seq15 alone HHR+ cleaved paired end reads
| |- HHR_E2_15_S21_R2_001.fastq.gz
| |- HHR_E2_35_S23_R1_001.fastq.gz re-selection branch Seq35 alone HHR+ cleaved paired end reads
| |- HHR_E2_35_S23_R2_001.fastq.gz
| |- HHR_E2_c_S24_R1_001.fastq.gz re-selection branch wt+Seq3+Seq2+Seq5+Seq15+Seq35 HHR+ cleaved paired end reads
| |- HHR_E2_c_S24_R2_001.fastq.gz
| |- R8_89_m_S25_R1_001.fastq.gz 71-89 branch Round 8 HHR+ mock selection paired end reads
| |- R8_89_m_S25_R2_001.fastq.gz
| |- R8_52_m_S26_R1_001.fastq.gz 52-2 branch Round 8 HHR+ mock selection paired end reads
| |- R8_52_m_S26_R2_001.fastq.gz
| |- CT_S27_R1_001.fastq.gz re-selection branch wt+Seq3+Seq2+Seq5+Seq15+Seq35 HHR+ starting mixture paired end reads
| |- CT_S27_R2_001.fastq.gz
|- processed_data/
| |- HH_p52and71_m_n0_abd_aln_R1_52_2_muts.txt Base calls for HHR- RNA from Round 1 of 52-2 branch
| |- HH_p52and71_m_n0_abd_aln_R1_89_muts.txt Base calls for HHR- RNA from Round 1 of 71-89 branch
| |- HH_p52and71_E1_n0_abd_aln_R1_52_2_muts.txt Base calls for HHR+ pre-cleaved RNA from Round 1 of 52-2 branch
| |- HH_p52and71_E1_n0_abd_aln_R1_89_muts.txt Base calls for HHR+ pre-cleaved RNA from Round 1 of 71-89 branch
| |- HH_p52_n0_abd_clst_more_M_dist_refs_dists.txt Sequence-Frequency table for 52-2 branch
| |- HH_p71_n0_abd_clst_more_M_dist_refs_dists.txt Sequence-Frequency table for 71-89 and re-selection branches
| |- HH_p52and71_n0_abd_M_dist_refs_dists.txt Sequence-Frequency table for HHR+ cleaved RNA only from 52-2 and 71-89 branches
| |- dist_refs.fasta fasta file for 4 reference sequences for Levenshtein distance calculations
|- figure_data/
| |- Papastavrou_23_fidelity_Table_S2_and_S3.xlsx Excel spreadsheet for calculating fidelity table and average fidelity from base call tables
| |- Papastavrou_23_fig_2_and_S2_data.xlsx Excel spreadsheet with underlying data for figure 2 and S2
| |- Papastavrou_23_fig_3_data.xlsx Excel spreadsheet with underlying data for figure 3
| |- Papastavrou_23_fig_S3_data.xlsx Excel spreadsheet with underlying data for figure S3
| |- Papastavrou_23_fig_4B_data.xlsx Excel spreadsheet with underlying data for figure 4B
Data description
The 'raw_data' directory contains raw paired-end reads for each RNA population, from directed evolution with 52-2, 71-89, or re-selection with 71-89. For the first two, data correspond to 3 sets of paired-end reads for the HHR-, HHR+ pre-cleaved, and HHR+ cleaved populations from 8 rounds of evolution, plus a set of paired-end reads corresponding to a mock round 8 selection in which polymerase ribozyme replication and hammerhead ribozyme cleavage steps were omitted. For the re-selection branch, data correspond to 2 sets of paired-end reads for HHR+ pre-cleaved and HHR+ cleaved populations after a single round of evolution, starting from one of six individual hammerhead ribozyme sequences or a mixture of all six. An additional set of paired-end reads is the sequenced mixture of RNAs prior to selection as a control.
The 'processed_data' directory contains data tables generated from raw sequence files as described below. Four text files, ending 'muts.txt' correspond to the number of base calls at each position along a gapped alignment to the wild type hammerhead sequence for HHR- or HHR+ pre-cleaved RNA after a single round of evolution with either polymerase. Next are three text files representing tables of distinct sequences, their frequency in the evolving population, and associated characteristics for populations from evolution using 52-2 ('p52'), 71-89 ('p71'), or both populations ('p52and71'), all ending in 'dists.txt'. The tables include a sequence name (a unique identifier generated by the Sequence_Tabulator.py script), sequence, Levenshtein distance to four reference sequences, frequency in each sequenced round, a boolean value corresponding to whether the sequence matches the biochemically-defined hammerhead motif using the HHmotif_Matchor.py script, and a numerical identifier for whether the sequence was part of a larger cluster in a given round using the Peak_Clusteror.py script. The p52 table contained frequency data for HHR+ cleaved populations and the round 8 mock selection population from evolution with 52-2. The p71 table contained frequency data for HHR+ cleaved populations and the round 8 mock selection population from evolution with 71-89 and all sequenced populations (HHR+ pre-cleaved and cleaved) from the 71-89 reselection. The p52and71 table contains frequency data for HHR+ cleaved populations from both branches of evolution. 'dist_refs.fasta' is the fasta file of the four reference sequences used as distance standards to generate each table. These 3 tables were processed with a variety of R notebooks in Rstudio to generate or plot data for Figures 2-5 and S1-S4 in the manuscript.
The 'figure_data' directory contains four excel files with data used directly in the indicated figures. The excel file for Tables S2 and S3, the fidelity tables, takes as input the 'muts.txt' processed files to yield fidelity tables and average fidelity using spreadsheet formulas. Detailed use instructions are provided in the file. The remaining files contain data plotted in the indicated figures. In many cases, this data was plotted directly in Rstudio or Graphpad Prism, these files are merely a record of the underlying data sets that those figures represent. Plotting of data using R scripts is described in the pipeline below.
Description of data processing pipeline
Initial raw read trimming and merging
The PCR products from RNA-catalyzed evolution were subject to deep sequencing, together with corresponding HHR– RNAs and HHR+ RNAs that had not undergone hammerhead cleavage and had been reverse transcribed and PCR amplified. All three sets of cDNAs were prepared for sequencing by nested PCR amplification and sequenced by the Salk Next Generation Sequencing Core on an Illumina NextSeq2000 with a 300-cycle paired-end run.
Paired-end fastq files were first trimmed of Illumina adapter sequences using cutadapt v2.4 and the following command for HHR- data:
cutadapt --trimmed-only -e 0.25 --pair-filter=both -n 2 -j 2 -m 63 -M 73 -a ACTGTAGGCACCATCAATCTGTCTCTTATACACATCTCCGAGCCC -G ATTGATGGTGCCTACAGT -A CTGTCTCTTATACACATCTGACGCTGCCGACGA -o TRIMMED_R1.fastq.gz -p TRIMMED_R2.fastq.gz RAW_R1.fastq.gz RAW_R2.fastq.gz
or for HHR+ data:
cutadapt --trimmed-only -e 0.25 --pair-filter=both -n 2 -j 2 -m 60 -M 71 -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -A CTGTCTCTTATACACATCTGACGCTGCCGACGA -o TRIMMED_R1.fastq.gz -p TRIMMED_R2.fastq.gz RAW_R1.fastq.gz RAW_R2.fastq.gz
Reads lacking the correct primer sequences (allowing no more than 2 mismatches) were removed using cutadapt v4.3 using the following command for HHR- data:
cutadapt -g CAATGACAAAAAAGAAGTGAATGTC --action=retain --discard-untrimmed -e 0.02 --pair-filter=both -n 1 -j 2 -o PRIMER_MATCH_R1.fastq.gz -p PRIMER_MATCH_R2.fastq.gz TRIMMED_R1.fastq.gz TRIMMED_R2.fastq.gz
or the following two commands for HHR+ data:
cutadapt -g CAATGACAAAAAAACTATGTCGCGA --action=retain --discard-untrimmed -e 0.02 --pair-filter=both -n 1 -j 2 -o PRIMER_MATCH_1_R1.fastq.gz -p PRIMER_MATCH_1_R2.fastq.gz TRIMMED_R1.fastq.gz TRIMMED_R2.fastq.gz
cutadapt -G GAAGTGAATGTCCGCATT --action=retain --discard-untrimmed -e 0.02 --pair-filter=both -n 1 -j 2 -o PRIMER_MATCH_2_R1.fastq.gz -p PRIMER_MATCH_2_R2.fastq.gz PRIMER_MATCH_1_R1.fastq.gz PRIMER_MATCH_1_R2.fastq.gz
Paired reads were merged using FLASH v1.2.11 using the following command for all data:
flash -t 1 -m 33 -M 73 -x 0.3 -o FLASH_EXT -d ./OUTPUT_DIRECTORY/ PRIMER_MATCH_R1.fastq.gz PRIMER_MATCH_R2.fastq.gz
Merged files where then filtered for reads with a quality score ≥30 at every position using FASTX Toolkit v0.0.14 with the following command:
fastq_quality_filter -Q 33 -q 30 -p 100 -v -i FLASH_EXT.extendedFrags.fastq -o PROCESSED.fastq
Processing merged reads for fidelity tables
Sequences of HHR– and pre-cleaved HHR+ RNAs from the first round of evolution were processed separately to determine the fidelity of a half- and full-cycle of RNA-catalyzed replication, respectively. Fastq files were processed using Sequence_Tabulator.py to enumerate the reads for each distinct sequence in each library, and the results were tabulated into a table that included sequence name, nucleotide sequence, and number of reads. A fasta file for each distinct sequence was also generated. The command used was:
python Sequence_Tabulator.py SEQ_TABLE y 0 PROCESSED_1.fastq PROCESSED_2.fastq
where input fastq files were the corresponding processed HHR– and pre-cleaved HHR+ sequences from round 1 of both branches of evolution.
Distinct sequences in output fasta files were aligned to the corresponding wild-type sequence (HHR–, CAATGACAAAAAAGAAGTGAATGTCCGCATTTCGTGCCGTAGCACTCATCAGTACGTCTCGCGACATAGTTTTTTT, or HHR+, CAATGACAAAAAAACTATGTCGCGAGACGTACTGATGAGTGCTACGGCACGAAATGCGGACATTCACTTC) using bowtie2 v2.4.2, with a permissive scoring function to ensure that highly mutated sequences in the 52-2 lineage were included in the alignment:
bowtie2 --end-to-end --reorder --score-min L,0,-1.25 -L 5 -D 240 -R 32 -x HH_WILD_TYPE.fasta -f SEQ_TABLE.fasta -S ALIGNED.sam
The generated sam file was converted to a sorted, indexed bam file using SAMtools v1.9:
samtools view -b -S ALIGNED.sam > ALIGNED.bam
samtools view -h -b -F 4 ALIGNED.bam > ALIGNED_ONLY.bam
samtools sort ALIGNED_ONLY.bam > SORTED_ALIGNED_ONLY.bam
samtools index SORTED_ALIGNED_ONLY.bam
A gapped alignment was then generated using the bamtoaln module of breseq v0.35.5:
breseq bam2aln --format txt -b SORTED_ALIGNED_ONLY.bam -f HH_WILD_TYPE.fasta -n X1 -o GAPPED_ALIGNMENT X2:X3
where X1 was the number of aligned sequences by bowtie2, X2 is the name given to wild type in the corresponding fasta file, and X3 is 1-76 or 1-70 for HHR- or HHR+, respectively.
The first line of the gapped alignment was extracted by the command:
head -n 1 GAPPED_ALIGNMENT > HH_WILD_TYPE_GAPPED.fasta
and edited into correct fasta format ('>NAME\nSEQUENCE') using a text editor.
The remainder of the gapped alignment was converted into a new fasta file with the command:
cat GAPPED_ALIGNMENT | sed 's/</>/' | awk -F' > ' '{gsub(" ",".",$1); print $1" > "$2}' | awk -v total=`cat GAPPED_ALIGNMENT | wc -l` -F' > ' 'NR > 2 && NR < (total-2) {print ">"$2"\n"$1}' > GAPPED_ALIGNMENT.fasta
The sequence table from Sequence_Tabulator.py was updated with the gapped, aligned sequences using Aligned_Sequence_Updator.py:
python Aligned_Sequence_Updator.py GAPPED_ALIGNMENT.fasta SEQ_TABLE.txt
The results were tabulated to include the number of substitutions, insertions, or deletions at each nucleotide position using Base_Count_Tabulator.py:
python Base_Count_Tabulator.py SEQ_TABLE_aln.txt HH_WILD_TYPE_GAPPED.fasta PROCESSED_1 PROCESSED_2
where the last two arguments specified the column names for sequence read counts from the two input fastq files. The output for both HHR- and pre-cleaved HHR+ inputs yielded two tables as text files, corresponding to data from round 1 of the 52-2 or 71-89 branch of evolution, respectively, for a total of four text files ending in 'muts.txt' in the 'processed_data' directory. These tables were processed using Microsoft Excel to obtain the average mutation frequencies for each of the four nucleotide bases, using 'Papastavrou_23_fidelity_Table_S2_and_S3.xlsx'.
Processing merged reads for sequence-frequency tables
To analyze changes in RNA populations over the course of evolution, sequences of cleaved HHR+ RNAs from evolution with the 52-2 or 71-89 polymerases were processed separately to generate tables of distinct sequences and corresponding number of reads in each round of evolution, using Sequence_Tabulator.py:
python Sequence_Tabulator.py SEQ_TABLE n 0 PROCESSED_ROUND_1.fastq PROCESSED_ROUND2.fastq ... PROCESSED_ROUND8.fastq
A separate table was generated using as input fastq files from both branches, so that three output tables were carried forward overall: SEQ_TABLE_52_2.txt, SEQ_TABLE_71_89.txt, SEQ_TABLE_ALL.txt.
Additional fastq files, including the 'mock' round 8 for the 52-2 branch or the 'mock' round 8 and the individual and mixed re-selection rounds (including pre-cleaved and cleaved HHR+ sequences) for the 71-89 branch, were added to SEQ_TABLE_52_2.txt and SEQ_TABLE_71_89.txt using Sequence_Expandor.py:
python Sequence_Expandor.py SEQ_TABLE.txt SEQ_TABLE_EXPANDED.txt 0 9 y PROCESSED_MOCK.fastq [PROCESSED_RESELECTION_1.fastq ...]
This step was for convenience only. Sequence_Tabulator.py names sequences procedurally, sorting all distinct sequences in descending order by maximum frequency across all input fastq files and naming numerically, from 0 upward. To ensure names reflected frequency only across the 8 rounds of evolution in either branch, only fastq files for those populations were used to generate the table. Additional populations were added with this separate script that did not re-name original sequences defined in the original step.
Sequences that included the strictly conserved hammerhead nucleotides 5´ CUGANGA… GAAA-3´, together with a Watson-Crick base pair at the base of stem I and either an R:Y or Y:Y pair at the base of stem II, were identified as matching the biochemically-defined hammerhead motif (D. E. Ruffner, G. D. Stormo, O. C. Uhlenbeck, Sequence requirements of the hammerhead RNA self-cleavage reaction. Biochemistry 29, 10695–10702 (1990)). HHmotif_Matchor.py added an additional column of boolean values to sequence tables corresponding to a match or non-match.
python HHmotif_Matchor.py SEQ_TABLE.txt
A peak-finding algorithm identified local peak sequences in each population (above a minimum threshold of 0.1%) and joined sequences within 2 mutations of the most abundant nearby peak in a greedy fashion for each round. Clustering was not carried out for additional populations added by Sequence_Expandor.py. Peak_Clusteror.py is derived from ClusterBOSS (https://github.com/ichen-lab-ucsb/ClusterBOSS). The command used was:
python Peak_Clusteror.py SEQ_TABLE.txt ROUND_# 2 2000 1 1000
Where ROUND_# identified the round by the corresponding fastq file name.
The output cluster files for each round were then processed with Peak_Updator.py to update the sequence table with columns representing the cluster membership in a given round:
python Peak_Updator.py CLUSTER_FILE_LIST SEQ_TABLE.txt
where CLUSTER_FILE_LIST is a text file containing the names of the cluster file generated for each round as separate lines.
Finally, LevDist_Measuror.py was used to determine the Levenshtein distance between each distinct sequence and four reference sequences. The script outputs a sequence-frequency table containing the sequence name, nucleotide sequence, Levenshtein distance to each of the reference sequences, normalized read frequency to 8 decimals, a Boolean value for whether the sequence matches the hammerhead motif (as defined in the main text), and membership within any cluster. The command used was:
python LevDist_Measuror.py SEQ_TABLE.txt dist_refs.fasta #
where dist_refs.fasta is the fasta file of reference sequences included in the 'processed_data' directory and # indicates the number of distinct populations in the table (9 for 52-2, 24 for 71-89, and 18 for the combined sequence dataset). The output are three tables in the 'processed_data' directory all ending in 'dists.txt', corresponding to the 3 sequence datasets, 52-2, 71-89, and combined. These tables were used as input for processing in Python or R as described below to plot data in figures.
Processing sequence-frequency tables to generate figures
The three sequence-frequency tables in the frequency subdirectory where used in Rstudio with the included R notebooks to plot data for Figures 2-5 and S1-S4 in the manuscript.
For Figure 2, biochemical data on population RNA cleavage activity was measured as described in the manuscript and included in 'Papastavrou_23_fig_2_and_S2_data.xlsx' in the 'figure_data' directory. Fraction of population matching the known hammerhead motif was determined using the 'HH_population_values.R' notebook in R studio, plotted using Graphpad Prism, and data is included in 'Papastavrou_23_fig_2_and_S2_data.xlsx'. Data for violin plots of mutation distributions were generated using the Python script LevDistViolin.py, and then plotted using the matplotlib and seaborn libraries in Python as follows:
#installs pandas to manipulate and analyze data
Import pandas as pd
#opens txt file and creates a table with two columns
df = pd.read_csv(‘71-89_vpdata’, sep=’\t’, names=[‘Copy’, ’Round’], header=None
#installs two software packages. One to create the violin plot (seaborn) and the other to easily format it (matplotlib.pyplot)
import seaborn as sns
import matplotlib.pyplot as plt
#creates the figure frame
fig, ax = plt.subplots(figsize=(20,10))
#creates the violin plot
sns.violinplot(data=df, x="Round", y="Copy", bw=0.4, color="steelblue")
#formats the violin plot
plt.yscale('log')
plt.xlabel("Round", fontsize=50, weight='bold')
plt.ylabel("Copy", fontsize=50, weight='bold')
plt.xticks(fontsize=45)
plt.yticks(fontsize=45)
#exports the violin plot as a vector file
fig.savefig('89_C_violin plot.svg', format="svg", dpi=1200)
Underlying distribution of data from the LevDistViolin.py output is included in 'Papastavrou_23_fig_2_and_S2_data.xlsx'.
For Figure S1, Relative changes in frequency between round 7 and either round 8 or a mock round 8 were calculated for each sequence that was present in round 7 at >0.01% frequency, corresponding to a sequencing depth of at least 50 reads using the 'HH_population_values.R' notebook in Rstudio and plotted using ggplot2 in the same script.
For Figure S2, the average distance between sequences in each population of cleaved HHR+ RNAs was determined by randomly sampling 100,000 pairs of sequences from the list of distinct sequences, based on the frequency of the sequence in a given round and averaging the Levenshtein distance between each pair divided by the number of variable nucleotides in the first member of the pair. The normalized Shannon population entropy, which was also determined from a sample of 100,000 sequences, is defined as the sum over all distinct sequences: ∑ Fs • ln(Fs) / ln(1/N), where Fs is the frequency of each distinct sequence in the sample and N is the total number of sequences in the population. Sub-sampling of sequences was carried out to ensure that entropy values were determined at the same sequencing depth. Both approaches used the 'HH_population_values.R' notebook in Rstudio and were plotted using Graphpad Prism. Underlying data for plots is included in 'Papastavrou_23_fig_2_and_S2_data.xlsx'.
For Figure 3A and S3A, phylogenetic trees rooted to the wild-type sequence were generated using the neighbor-joining algorithm of Saitou and Nei (N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987)), encompassing all sequences that reached a maximum frequency >0.1% for the 52-2 lineage and >0.5% for the 71-89 lineage during rounds 3–8 of evolution. For Figure 3B and S3B, the frequency distributions of peak sequences, phylogenetically neighboring sequences, and clusters were determined using sequence-frequency tables and phylogenetic trees generated from them. All calculations were performed and plotted in Rstudio using the script 'HH_phylogeny.R', data for plots is included in 'Papastavrou_23_fig_3_and_S4_data.xlsx' and 'Papastavrou_23_fig_S3_data.xlsx'.
Figure 4B was derived from data in 'HH_p71_n0_abd_clst_more_M_dist_refs_dists.txt' from the third reselection branch involving a single round of hammerhead evolution using the 71-89 polymerase and the wild type, Seq2, Seq3, Seq5, Seq15, or Seq 35 HHR+ template RNAs, both individually and as an equimolar mixture. The frequency of cleaved HHR+ RNA sequences in the mixed population were fit to a multiple linear regression of the frequencies of sequences in each population that were propagated individually. Regression coefficients for the individually propagated populations were used to assign the fractional contribution of corresponding starting RNAs to progeny RNAs in the mixed population. Fitness values for the new hammerhead variants (Seq2-35) relative to the starting variant (Seq0) were calculated in 3 different ways. First, by multiplying the experimentally determined values for HHR- synthesis, HHR+ synthesis, and RNA cleavage yields of each of the variants, as plotted in figure 4A. Second, by the same method, but replacing RNA cleavage yield with an estimate of specific activity by multiplying RNA cleavage yield by the fraction of pre-cleaved HHR+ RNA sequences that are consistent with the hammerhead motif from each population propagated from individual hammerhead variants . Third, by the relative enrichment of copy number for each variant, as determined from the frequency of each variant in the starting mixture of all six variants and the frequency of progeny from each variant in the HHR+ cleaved population. All population measures, including predicted progeny and fraction of hammerhead products, were determined in Rstudio using the HH_reselection.R notebook, and data was plotted in Graphpad Prism. Underlying data, including the determination of predicted progeny from relative fitness values, are provided in 'Papastavrou_23_fig_4B_data.xlsx'.
For Figure 5 and Movie S1, a 2-dimensional map of the evolving populations in sequence space was constructed based on the Levenshtein distance from each distinct sequence in the population to four reference sequences, as included in the 'HH_p71_n0_abd_clst_more_M_dist_refs_dists.txt' and 'HH_p52_n0_abd_clst_more_M_dist_refs_dists.txt' sequence-frequency tables. The reference sequences, included in 'dist_refs.fasta' were the wild type, a distant non-hammerhead from the 52-2 lineage (variable sequence 5′-…GCUGGUUGCUACAGCCG…-3′, lacking any discernible features of the hammerhead motif), and Seq15 and Seq35 from the 71-89 lineage. A 2-dimensional plane was defined by the first two principal components of variation of the distance matrix between these four sequences, and the position of all distinct cleaved HHR+ sequences was projected onto this plane. The density of hammerhead functionality across the 2-dimensional plane was estimated as the local average fraction of all sequences in the cleaved HHR+ RNA populations that match the hammerhead motif, using the 'HH_p52and71_n0_abd_M_dist_refs_dists.txt' sequence-frequency table. These calculations were performed and data plotted using Rstudio and the HH_scatterplots.R notebook.
Final formatting of figures was performed in Adobe Illustrator. MovieS1 was generated from Illustrator-formatted scatterplots in Apple Keynote.
Sharing/Access information
Licenses/restrictions placed on the data:
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information below.
In no way are the patent or trademark rights of any person affected by CC0, nor are the rights that other persons may have in the work or in how the work is used, such as publicity or privacy rights. Unless expressly stated otherwise, the person who associated a work with this deed makes no warranties about the work, and disclaims liability for all uses of the work, to the fullest extent permitted by applicable law. When using or citing the work, you should not imply endorsement by the author or the affirmer.
Links to publications that cite or use the data:
Nikos Papastavrou, David P Horning, Gerald F Joyce. "RNA-Catalyzed Evolution of Catalytic RNA" submitted. (2023)
Code/Software
The following Python scripts are described in the pipeline for analysis above and included in this dataset. A description of each is provided, including arguments for input and output format. These descriptions are also available in the header comment section of each script.
Sequence_Tabulator.py
This script takes sequencing data from multiple rounds of a selection, in the form of fastq files and constructs a list of distinct sequences and the number of occurences in each fastq file, as well as the total number of reads in each fastq file. It names each sequence numerically in descending order from most frequent (in any round) to least frequent, and prints them in descending order in a table of comma separated values, name first, then sequence, then number of occurences in each fastq file. The first line of the table provides column headings, followed by a line which lists total reads in each fastq file.
The distinct name given to each sequence is "SeqN" with N being an integer from 0 upward.
Optionally, one may specify whether a fasta file of these sequences is also generated. In that case, the fasta file will contain sequences in the same descending order, with the distinct name as the ID.
the script takes at least 4 arguments
- the first is the name for the compiled table of distinct sequences (and is also used for the fasta file if generated). This name can include directory paths if desired, but should contain no file extensions, which are added by the script as needed. The script appends the threshold value and some other specifying text to the output: "NAME_n#_abd.txt" and "NAME_n#.fasta" if specified.
- the second is simply "y" or "n" to specify whether a separate fasta file of the distinct sequences should be generated.
- the third is an integer put in the variable "threshold" and specifies a minimum frequency in reads per million a read must have in at least one fastq file to be included in the output file (0 includes all reads of any frequency)
- the fourth and additional arguments are fastq files to be scanned for distinct sequences to add into the fasta file output these files must all exist and be correctly formatted as fastq files; it is also advisable to have trimmed and filtered these files so any incidental variation doesn't expand the number of reads - you don't want one fastq file to have an extra "G" at the start of each sequence, or all those reads will become separate entries, even if the following sequence after the "G" all occur in other fastq files that you are combining. also note that the script will use the full text of the file name as an identifier in the heading line of the output table, starting from the last directory "/" to the ".fastq" extension. thus its highly advisable to use fastq filenames that are short but distinct.
the standard input argument for calling this scripts would be:
python Sequence_Tablulator.py {output name} {y/n} {threshold} {round1.fastq round2.fastq ...}
where
- threshold is an integer from 0 to 1000000 (typically less than 100, chosen to exclude low frequency reads)
- round1.fastq round2.fastq ... are all the fastq files to be searched for distinct reads
Sequence_Updator.py
This script takes a table of unique sequences generated from sequencing data from multiple rounds of a selection and a new set of sequencing data from additional rounds of selection, in the form of fastq files. It scans the input fastq file for unique sequences and the number of occurences in each new round, then line-by-line updates the input table with additional columns representing occurence in the new input rounds. If specified by the user, it will also add new sequences to the end of the table, ordered in frequency but not with original sequences, named numerically starting with one greater than the previous seq name. Regardless, all sequences in a fastq file are counted in the total read number for that round, which is added to the second line of the table.
The script takes at least 6 arguments
- the first is the name of the input table to modify. this table should have been generated by the "Uniq_Seq_Tabulator" script and will remain unmodified by this script
- the second is the name for the output table of unique sequences. it can be specified by any name, directory info, or extension as desired, but any existing file will be completely overwritten.
- the third is the threshold value used for new distinct sequences, in reads per million. Generally, this should match the previous threshold used, but this script does not check.
- the fourth is the point after which to insert new fastq file read totals in the output table. Typically, this should be the last fastq file read count column in the input file, before any non-readcount columns (cluster, motif match etc.)
- the fifth is 'y' or 'n' and specifies whether to add new sequences to the end of the output table that are found in the new fastq files but weren't in the original file. If 'y', these sequences are sorted in descending order, but are all placed at the end, not interspersed with previous distinct sequences, even if they have higher frequencies.
- the sixth and additional arguments are fastq files to be scanned for unique sequences that are inturn used to update occurence number of the existing sequences in the input table for new rounds. these files must all exist and be correctly formatted as fastq files; it is also advisable to have trimmed and filtered these files in the same way as the files used to generate the input table, to ensure a single sequence isn't represented by multiple sequences or a different sequence in the new fastq files. If 'y' is specified above, new sequences are appended to the end of the table. also note that the script will use the full text of the file name as an identifier in the heading line of the output table, starting from the last directory "/" to the ".fastq" extension. thus its highly advisable to use fastq filenames that are short but unique.
the standard input argument for calling this scripts would be:
python Sequence_Expandor.py {input name} {output name} {threshold} {insertion point} {add new seq? y/n} {round1.fastq round2.fastq ...}
where round1.fastq round2.fastq ... are all the fastq files to be searched for unique reads
If the input table layout looks like this:
name,sequence,round1,round2,round3,round4,HHmatch
total reads,,#,#,#,#,1
Seq0,NNNNNN,#,#,#,#,0
Seq1,NNNNNN,#,#,#,#,1
and so on, with # representing number of reads in the round and NNNNNN representing a unique sequence found at least once in one of the fastq files.
If the new fastq files are of the form ~/dir/roundN.fastq from N=5 to 8, then the output file will look like this:
name,sequence,round1,round2,round3,round4,round5,round6,round7,round8,HHmatch
total reads,,#,#,#,#,#,#,#,#,1
Seq0,NNNNNN,#,#,#,#,#,#,#,#,0
Seq1,NNNNNN,#,#,#,#,#,#,#,#,1
If new sequences are allowed, then at the end of the file, if the last entries from the input file are:
Seq{n-1},NNNNNN,#,#,#,#,1
Seq{n},NNNNNN,#,#,#,#,0
Then the new output file has additional sequences appended:
Seq{n-1},NNNNNN,#,#,#,#,#,#,#,#,1
Seq{n},NNNNNN,#,#,#,#,#,#,#,#,0
Seq{n+1},NNNNNN,0,0,0,0,#,#,#,#,
Seq{n+2},NNNNNN,0,0,0,0,#,#,#,#,
Seq{n+3},NNNNNN,0,0,0,0,#,#,#,#,
where n is the highest previous Seq number, and new NNNNNN sequences are distinct from all previous sequences.
note that the earlier fastq file read count columns are filled with 0, and any non read count columns are filled with null.
HHmotif_Matchor.py
This script uses the Python re regular expression library to search for sequences in a table generated by Sequence_Tabulator.py that match the definition of the hammerhead motif determined in vitro by Ruffner, DE, Stormo, GD, Uhlenbeck OC Biochemistry 29, 10695-10702 (1990)
The script searches a sequence for:
"GTCGCGA", which is not part of the hammerhead motif but is a flanking sequence that should not vary
during the directed evolution procedure for which this script was designed.followed by a variable number of bases
followed by a single base (B1), see below
followed by "CTGANGA", the central catalytic motif of the hammerhead
followed by a base (B2), see below
followed by 0 or more bases
followed by a base (B3), see below
followed by "GAAATGCGG", which includes the second catalytic motif of the hammerhead, "GAAA" and part of
the substrate binding arm "TGCGG" which can partially vary biochemically but was not free to do so during
the directed evolution procedure for which this script was designed.It then checks that B1 is 'A', thus matching the 'U' in the substrate to form the terminal base-pair of stem I.
It also checks that B2 matches B3 as either an R:Y or Y:Y pair, all of which showed biochemical activity as the terminal base pair of stem II.
The normal format for calling this script would be:
python HHmotif_Matchor.py DistinctSequenceTable.txt
If the input table named DistinctSequenceTable.txt looks like this:
name,sequence,round1,round2,round3,round4
total reads,,#,#,#,#
Seq0,NNNNNN,#,#,#,#
Seq1,NNNNNN,#,#,#,#
and Seq0 matched the hammerhead motif while Seq1 did not then the output table, named "DistinctSequenceTable_M.txt", will look like this:
name,sequence,round1,round2,round3,round4,HHmatch
total reads,,#,#,#,#,
Seq0,NNNNNN,#,#,#,#,1
Seq1,NNNNNN,#,#,#,#,0
Peak_Clusteror.py
This script is designed to take a table of sequences and abundances generated by the Sequence_Tabulator script, and identify peaks within sequence space.
This script is adapted from the ClusterBOSS script written by Celia Blanco and Irene Chen: https://github.com/ichen-lab-ucsb/ClusterBOSS any errors were introduced by our changes.
The user inputs the sequence abundance table and specifies a column of read counts within the table to use for peak determination.
They also specify the maximum peak radius, the minimum number of sequences in a peak, and the minimum height of a peak center and all members.
The script prints all peaks in a new table, from largest to smallest (by peak center abundance).
Each peak in the list has lines showing name, sequence, abundance, frequency, and distance from center. They are ordered from center to furthest.
A call of this script would be formatted as follows:
python Peak_Clusteror.py input_table.txt {column(str)} {peak radius(int)} {min peak size(int)} {min abundance(int)} {min peak abundance(int)}
The output file will be named:
input_table_r{round}_e{peak radius}_ms{min peak size}_ma{min abundance}_mac{min peak abundance}_peaks.txt
By default, this script uses the Levenshtein package for Python, https://pypi.org/project/python-Levenshtein/
This distance calculation is fast, as it's implemented in C. An alternative implemented directly in Python is available in commented code, if needed.
Peak_Updator.py
This script is designed to update sequences in a table generated by Sequence_Tabulator.py with sequences from peak files generated by Peak_Clusteror, from the same table. The user must specify the list of peak files to be used in a text file. The script will then process each peak file to get the peak assignment (or 0 for none) for each sequence. It then reads through each line of the input table, and adds columns for each peak file, giving the corresponding peak assignment for that sequence.
The normal format for calling this script would be:
python Peak_Updator.py Peak_file_list.txt DistinctSequenceTable.txt
Peak_file_list.txt should be formatted as 1 line for each distinct peak file (corresponding
to read counts from one column). In each line there is the name of the corresponding column
for peak read counts and the corresponding peak file name (with full path if in a different
directory than where script is called), separated by spaces.
If the input table named DistinctSequenceTable.txt looks like this:
name,sequence,round1,round2,round3,round4
total reads,,#,#,#,#
Seq0,NNNNNN,#,#,#,#
Seq1,NNNNNN,#,#,#,#
and Seq0 is in peak 1 for all 4 rounds, but Seq1 is not in a peak in round1 and 2, but peak 2 for 3 and 4, output will look like this:
name,sequence,round1,round2,round3,round4,r1_clst,r2_clst,r3_clst,r4_clst
total reads,,#,#,#,#,,,,
Seq0,NNNNNN,#,#,#,#,1,1,1,1
Seq1,NNNNNN,#,#,#,#,0,0,2,2
LevDist_Measuror.py
This script generates takes an input table of sequences generated by Sequence_Tabulator.py, and outputs a similar table where Levenshtein distances to a user specified number of reference sequences are listed, and read counts are converted to frequency (value 0 to 1), rounded to 8 decimal places. The 2nd row of the Sequence_Tabulator output, containing total read counts from each round, is removed. Other columns from other scripts, like HHmatch or peak identities, are preserved, as are sequence names and sequences.
This script is normally called in the following manner:
python LevDist_Measuror.py DistinctSequenceTable.txt reference.fasta number
where DistinctSequenceTable.txt is the Sequence_Tabulator generated sequence and read
count table.
reference.fasta is a list of reference sequences (no gaps allowed) used to calculate
distances.
number is an integer specifying the number of read count columns in the input file
output is formatted as a CSV text, with first column as sequence name, then N columns corresponding to Levenshtein distances to N reference sequences in input fasta file, then sequence, then frequency in each fastq file/round as an 8 digit number between 0 and 1, then any remaining columns from the input file.
LevDistViolin.py
This script converts an existing sequence-frequency table into a format compatible with an existing violin plot script. To do so, it takes each distinct sequence in the table, and input-specified column in the table for Levenshtein distance to a reference sequence and additional columns for sequence frequency in rounds of evolution, and generates a 2-column table, the first corresponding to the Levenshtein distance, the second to round number. Each entry is repeated N times, where N is copies per million, determined from the sequence frequency in a round.
This script is normally called in the following manner:
python LevDistViolin.py seqtable.txt dist_column first_round last_round
where seqtable.txt is the sequence and frequency table.
dist_column is the name of the column in the table specifying
levenshtein distances to a reference sequence.
first_round is the name of the column for population frequencies in the
first round of evolution
last_round is the same for the last round
output is formatted as a tab-separated text, with first column as distance from reference and second column as round number as "R#". Each distinct sequence is printed N times for each round, corresponding to the copies per million determined from round frequency.
The following R notebooks are are described in the pipeline for analysis above and included in this dataset. Each is intended for use in Rstudio. A brief description of the application for each is provided, including libraries used. Each notebook includes detailed comments describing intended use.
HH_population_values.R
This notebook is used to extract population features from sequence-frequency tables, including the frequency of functional hammerhead sequences plotted in Figure 2, the distribution of round 8 and mock round 8 sequence enrichment values in Figure S1, and the Shannon Population Entropy and Average Sequence Distance data plotted in Figure S2. This notebook uses the dplyr, tidyr, glue, ggplot2, and stringdist libraries.
HH_phylogeny.R
This notebook is used to determine neighbor-joining phylogenies and plot corresponding population frequencies by round for Figure 3 and Figure S3. This notebook uses the dplyr, tidyr, glue, ggplot2, stringdist, ape, tidytree, treeio, and ggtree libraries.
HH_reselection.R
This notebook is used to determine the contributions of each hammerhead variant to the mixed re-selection population in Figure 3D and Figure S4. This notebook uses the dplyr and tidyr libraries.
HH_scatterplots.R
This notebook uses sequence-frequency tables to prepare scatterplots of the evolving populations' distribution in sequence space for Figure 5 and movie S1. This notebook uses the ggplot2, ggfortify, plotly, dplyr, tidyr, and glue libraries.
Methods
Directed evolution of a hammerhead ribozyme sequence was carried out in three steps: 1) templated synthesis of the reverse-complement hammerhead RNA by a polymerase ribozyme; 2) templated synthesis of a new copy of the hammerhead RNA from the reverse-complement by a polymerase ribozyme; 3) selective recovery of hammerhead RNA that cleaved an attached RNA substrate. Cleaved RNA was reverse transcribed, PCR amplified, and archived for sequencing, while a portion was in vitro transcribed with T7 RNA polymerase to initiate the next round of evolution. Two distinct branches of evolution were carried out for 8 rounds, using the '52-2' or '71-89' polymerase ribozymes to replicate RNA, respectively. A third 'reselection' branch took six distinct hammerhead ribozyme sequences, and subjected them to a single round of directed evolution either individually or as an equimolar mixture, using the '71-89' polymerase for RNA replication.
RNA products from each of three steps of each round of evolution ('HHR-', 'HHR+ pre-cleaved', and 'HHR+ cleaved', respectively) were reverse transcribed, amplified by PCR, and sequenced in an Illumina NextSeq2000 with a 300-cycle paired-end run. Raw paired-end fastq files were trimmed of Illumina adapter sequences and reads lacking the correct primer sequences (allowing no more than 2 mismatches) were removed using cutadapt v4.3. Paired reads were merged using FLASH v1.2.11, then filtered for reads with a quality score ≥30 at every position using FASTX Toolkit v0.0.14.
Two different analytical pipelines were performed. In one, data from the first round of evolution in either of the first two branches were used to determine the per-nucleotide fidelity of each polymerase in synthesizing the HHR- and HHR+ pre-cleaved RNA populations.
- Fastq files for each library were processed using a custom Python script to enumerate the reads for each distinct sequence in each library, and the results were tabulated to include sequence name, nucleotide sequence, and number of reads. A fasta file of all distinct sequences was also generated.
- These fasta files were aligned to the corresponding wild-type sequence using bowtie2 v2.4.2, with a permissive scoring function to ensure that highly mutated sequences in the 52-2 lineage were included in the alignment.
- The generated sam file was converted to a sorted, indexed bam file using SAMtools v1.9.
- A gapped alignment was generated using the bamtoaln module of breseq v0.35.5.
- The corresponding sequence table was updated with the gapped, aligned sequences using a custom Python script.
- The results were tabulated to include the number of substitutions, insertions, or deletions at each nucleotide position using a custom Python script.
- These tables were processed using Microsoft Excel to obtain the average mutation frequencies for each of the four nucleotide bases.
In a second pipeline, data over all eight rounds of evolution were processed for evolution with either polymerase.
- Fastq files were processed into a table of each distinct sequence and corresponding number of reads in each round using a custom Python script.
- For each round of evolution, the peak sequences were identified using a modified version of the ClusterBOSS Python script of Blanco and Chen (E. Janzen, Y. Shen, A. Vázquez-Salazar, Z. Liu, C. Blanco, J. Kenchel, I. A. Chen, Emergent properties as by-products of prebiotic evolution of aminoacylation ribozymes. Nat. Commun. 13, 3631 (2022)). For each peak sequence, additional sequences were clustered together with the peak based on the following criteria: i) a peak sequence with ≥1000 reads; ii) additional sequences having no more than 2 mutations relative to the peak; and iii) a cluster, including the peak, with ≥2000 reads. The algorithm sorted all distinct sequences from most to least frequent, determined if a sequence was more abundant than all nearby sequences within 2 mutations, and if so, defined it as the peak sequence and clustered to that peak all other sequences within 2 mutations that were not already members of a cluster. A second script updated the sequence table with columns indicating membership within a cluster for a given round of evolution.
- Sequences that included the strictly conserved hammerhead nucleotides 5´‑CUGANGA… GAAA-3´, together with a Watson-Crick base pair at the base of stem I and either an R:Y or Y:Y pair at the base of stem II, were identified as matching the biochemically-defined hammerhead motif (D. E. Ruffner, G. D. Stormo, O. C. Uhlenbeck, Sequence requirements of the hammerhead RNA self-cleavage reaction. Biochemistry 29, 10695–10702 (1990)). A custom Python script updated the sequence table with a new column containing Boolean values for whether each sequence matched the hammerhead motif.
- Finally, a custom Python script was used to determine the Levenshtein distance between each distinct sequence and four reference sequences. The script outputs a sequence-frequency table containing the sequence name, nucleotide sequence, Levenshtein distance to each of the reference sequences, normalized read frequency, a Boolean value for matching the hammerhead motif, and membership within any cluster.
Output sequence-frequency tables were further processed to generate figures and tables as presented in N. Papastavrou, D. P. Horning, G. F. Joyce, RNA-Catalyzed Evolution of Catalytic RNA, submitted (2023). The following describes presentation related to the sequencing dataset directly. Methods for measuring and plotting biochemically-derived data is provided directly in the manuscript.
- Table S2 and S3, Copying and replication fidelity of the 52-2 and 71-89 polymerases. These tables were prepared directly from the spreadsheet generated by the polymerase fidelity pipeline described above, without further processing.
-
Fig. 2. RNA-catalyzed evolution of the hammerhead ribozyme.
- A custom Python script was used to plot the mutation-frequency distribution relative to the wild-type sequence as smoothed violin plots using the matplotlib and seaborn libraries.
- The frequency of sequences matching the hammerhead motif was determined for each round of each lineage in Rstudio using a custom R notebook and plotted using Graphpad Prism.
- These two plots, along with the plot of hammerhead population RNA cleavage activity, were combined in Adobe Illustrator.
- Fig. S1. Selective enrichment of hammerhead variants during RNA-catalyzed evolution. Relative changes in frequency between round 7 and either round 8 or a mock round 8 were calculated for each sequence that was present in round 7 at >0.01% frequency, corresponding to a sequencing depth of at least 50 reads, in Rstudio using a custom R notebook. Histograms were generated in Rstudio using ggplot2.
- Fig. S2. Sequence diversity over the course of RNA-catalyzed evolution. The average mutational distance between sequences and the Shannon entropy of the population were determined from sub-sampled sequence datasets in Rstudio using a custom R notebook and plotted using Graphpad Prism. The average distance between sequences in each population of cleaved HHR+ RNAs was determined by randomly sampling 100,000 pairs of sequences from the list of distinct sequences, based on the frequency of the sequence in a given round and averaging the Levenshtein distance between each pair divided by the number of variable nucleotides in the first member of the pair. The normalized Shannon population entropy, which was also determined from a sample of 100,000 sequences, is defined as the sum over all distinct sequences: ∑ Fs • ln(Fs) / ln(1/N), where Fs is the frequency of each distinct sequence in the sample and N is the total number of sequences in the population. Sub-sampling of sequences was carried out to ensure that entropy values were determined at the same sequencing depth (J. Gregori, C. Perales, F. Rodriguez-Frias, J. I. Esteban, J. Quer, E. Domingo, Viral quasispecies complexity measures. Virology 493, 227–237 (2016)).
-
Fig. 3 and S3. Emergence of peak sequences and clusters over the course of evolution catalyzed by the 71-89 and 52-2 polymerases.
- For 3A and S3A, phylogenetic trees rooted to the wild-type sequence were generated using the neighbor-joining algorithm of Saitou and Nei, encompassing all sequences that reached a maximum frequency >0.1% for the 52-2 lineage and >0.5% for the 71-89 lineage during rounds 3–8 of evolution. Trees were determined in Rstudio using a custom R notebook with the neighbor-joining function in the ape R library and plotted using the ggtree and treeio libraries.
- For 3B and S3B, the frequency distributions of peak sequences, phylogenetically neighboring sequences, and clusters were determined using the tidytree library and plotted using ggplot in Rstudio
-
Fig. 4B. relative fitness values were determined from experimentally determined values for HHR- RNA synthesis, HHR+ RNA synthesis, and RNA cleavage yields and population values for the fraction of functional hammerheads produced from each variant and the fraction of progeny derived from each variant after re-selection. These latter values were determined from the sequencing of RNA population libraries from the re-selection of hammerhead variants Seq0, Seq2, Seq3, Seq5, Seq15, and Seq35, alone or as a mixture, with all calculations performed in Rstudio using a custom R notebook. Relative fitness values were calculated from this data in Microsoft Excel and plotted in Graphpad Prism.
- The fraction of functional hammerheads produced from each variant was determined as the fraction of pre-cleaved HHR+ RNA sequences propagated from each individual variant that matched the biochemically determined hammerhead sequence motif. The specific activity of each variant was estimated by multiplying the experiementally determined yield of RNA cleavage by the fraction of functional hammerheads produced from each variant.
- The frequency of cleaved HHR+ RNA sequences in the mixed population were fit to a multiple linear regression of the frequencies of sequences in each population that were propagated individually. Regression coefficients for the individually propagated populations were used to assign the fractional contribution of corresponding starting RNAs to progeny RNAs in the mixed population.
- Fitness values for the new hammerhead variants (Seq2-35) relative to the starting variant (Seq0) were calculated in 3 different ways. First, by multiplying the experimentally determined values for HHR- synthesis, HHR+ synthesis, and RNA cleavage yields of each of the variants, as plotted in figure 4A. Second, by the same method, but replacing RNA cleavage yield with the estimate of specific activity determined above. Third, by the relative enrichment of copy number for each variant, as determined from the frequency of each variant in the starting mixture of all six variants and the frequency of progeny from each variant after selection.
- Fig. 5 and Movie S1. Scatterplots of the evolving populations of hammerhead ribozymes. A 2-dimensional map of the evolving populations in sequence space was constructed based on the Levenshtein distance from each distinct sequence in the population to four reference sequences. The reference sequences were the wild type, a distant non-hammerhead from the 52-2 lineage (variable sequence 5′-…GCUGGUUGCUACAGCCG…-3′, lacking any discernible features of the hammerhead motif) , and Seq15 and Seq35 from the 71-89 lineage. A 2-dimensional plane was defined by the first two principal components of variation of the distance matrix between these four sequences, and the position of all distinct cleaved HHR+ sequences was projected onto this plane. The density of hammerhead functionality across the 2-dimensional plane was estimated as the local average fraction of all sequences in the cleaved HHR+ RNA populations that match the hammerhead motif. Processing and plotting was done in Rstudio with a custom R notebook and the plotly library. Animation of the scatterplots was prepared in Apple Keynote.
Details of each pipeline are included in this dataset's README file, including detailed command line calls for each script or program. The dataset includes raw Illumina fastq files, processed fidelity and sequence-frequency tables, and Excel spreadsheets containing data generated by the pipeline that were used to plot manuscript figures. All custom scripts in python and R are also included.
Funding
National Aeronautics and Space Administration, Award: 80NSSC22K0973, Exobiology Program Element
Simons Foundation, Award: 287624, Simons Collaboration on the Origins of Life