Data and code from: A small polymerase ribozyme that can synthesize itself and its complementary strand
Data files (Feb 09, 2026 version, 139.94 GB total):
- dryad_full_HH_fidelity.zip (2.48 GB)
- fidelity_dryad.zip (20.02 GB)
- fitness_landscape_dryad.zip (90.32 GB)
- optimized_fitness_landscape_dryad_v3_2.tar.gz (27.12 GB)
- README.md (34.69 KB)
Abstract
The emergence of a chemical system capable of self-replication and evolution is a critical event in the origin of life. RNA polymerase ribozymes can replicate RNA, but their large size and structural complexity impede self-replication and preclude their spontaneous emergence. Here we describe QT45: a 45-nucleotide polymerase ribozyme, discovered from random sequence pools, that catalyzes general RNA-templated RNA synthesis using trinucleotide triphosphate (triplet) substrates in mildly alkaline eutectic ice. QT45 can synthesize both its complementary strand using a random triplet pool at 94.1% per-nucleotide fidelity, and a copy of itself using defined substrates, both with yields of ~0.2% in 72 days. The discovery of polymerase activity in a small RNA motif suggests that polymerase ribozymes are more abundant in RNA sequence space than previously thought.
https://doi.org/10.5061/dryad.6hdr7sr9n
Description of the data and file structure
Deep sequencing datasets from: A small polymerase ribozyme that can synthesize itself and its complementary strand
This dataset contains raw sequencing data from:
- The fitness landscapes of the QT45 polymerase ribozyme (optimized library, single UGC template, mutants + single deletions)
- The fitness landscapes of the QT45 polymerase ribozyme (library with multiple registers, three different templates, mutants only, no deletion data) and of the QT39 polymerase ribozyme
- Fidelity measurements of:
  - the QT51 and 5TU ribozyme-catalyzed synthesis of a mini hammerhead ribozyme
  - the QT45 and 5TU ribozyme-catalyzed synthesis of a full hammerhead ribozyme
  - the QT45 and 5TU ribozyme-catalyzed synthesis of the QT45 (+) and (-) strands
Also contained within this repository are instructions in the form of bash scripts and commands that can be used to process this raw sequencing data into processed reads used for analysis and generation of figures found in the paper. Downstream processing code and processed data are available in a Zenodo repository.
Files and variables
File: optimized_fitness_landscape_dryad_v3_2.tar.gz
Description:
- optimised_QT45_dryad: dataset and scripts for processing raw .fastq reads of the QT45 fitness landscape (optimized library)
  - raw_reads: .fastq files used for further processing
  - run_pear.sh: bash script used for merging reads
  - barcodes: barcodes used for demultiplexing using cutadapt
  - demultiplex_fastqs.sh: bash script used for demultiplexing using cutadapt
  - filter_fastqs.sh: bash script used for quality filtering reads using fastx-toolkit
  - collapse_fastqs.sh: bash script used for collapsing reads using fastx-toolkit
File: fitness_landscape_dryad.zip
Description:
- QT45_dryad: dataset and scripts for processing raw .fastq reads of the QT45 fitness landscape
  - raw_reads: .fastq files used for further processing
  - run_pear.sh: bash script used for merging reads
  - combine_fastqs.sh: bash script used for concatenating reads from different NGS runs
  - barcodes: barcodes used for demultiplexing using cutadapt
  - demultiplex_fastqs.sh: bash script used for demultiplexing using cutadapt
  - filter_fastqs.sh: bash script used for quality filtering reads using fastx-toolkit
  - collapse_fastqs.sh: bash script used for collapsing reads using fastx-toolkit
- QT39_dryad: dataset and scripts for processing raw .fastq reads of the QT39 fitness landscape
  - raw_reads: .fastq files used for further processing
  - run_pear.sh: bash script used for merging reads
  - barcodes.txt: barcode used for demultiplexing using fastx-toolkit
  - trim_bcsplit.sh: bash script used for trimming reads and demultiplexing using fastx-toolkit
  - left_trim_fastqs.sh: bash script used for trimming adapters using cutadapt
  - right_trim_fastqs.sh: bash script used for trimming adapters using cutadapt
  - filter_fastqs.sh: bash script used for quality filtering reads using fastx-toolkit
  - collapse_fastqs.sh: bash script used for collapsing reads using fastx-toolkit
File: fidelity_dryad.zip
Description:
- HHz_pre_processing_dryad
  - barcodes: barcodes used for demultiplexing using cutadapt
  - demultiplex_fastas.sh: bash script used for demultiplexing using cutadapt
  - HHz_read_processing.txt: description of the commands used for processing raw reads
  - seq3_R1.fastq.gz: raw read 1
  - seq3_R2.fastq.gz: raw read 2
- self_pre_processing_dryad
  - barcodes: barcodes used for demultiplexing using cutadapt
  - demultiplex_fastas.sh: bash script used for demultiplexing using cutadapt
  - self_read_processing.txt: description of the commands used for processing raw reads
  - hello_S1_L001_R1_001.fastq.gz: raw reads file
File: dryad_full_HH_fidelity.zip
Description:
- barcodes: barcodes used for demultiplexing using cutadapt
- process_fullHH_reads.ipynb: jupyter notebook with commands for initial data processing
- seq0run_S1_L001_R1_001.fastq.gz: raw read 1
- seq0run_S1_L001_R2_001.fastq.gz: raw read 2
Code/software
Pre-processing of raw .fastq reads
Pre-processing of QT45 fitness landscape reads (optimized library, single UGC template, mutants + single deletions):
Here are the steps used to process the raw reads:
1. If not already installed, install PEAR, cutadapt, and fastx_toolkit, add their directories to your PATH by editing your shell's configuration file, and source it. Note that a computing cluster was used: in steps 3 and 5, the -j 112 flag in the bash scripts sets the number of cores used; if you do not have 112 cores available, change it to the number of cores you have.
2. Go to the optimized_QT45_dryad directory.
3. Merge reads with PEAR by running the bash script:
   ./run_pear.sh
4. Demultiplex with cutadapt by running the bash script:
   ./demultiplex_fastqs.sh
5. Quality filter with fastx by running the bash script:
   ./filter_fastqs.sh
6. Collapse all reads with fastx by running the bash script:
   ./collapse_fastqs.sh
After the above steps, you will find the processed reads in the collapsed_reads folder (from step 6), reads from intermediate steps in appropriately named folders (merged_reads from step 3, demultiplexed_reads from step 4, filtered_reads from step 5), and summaries of the processing steps in the 'summaries' folder.
Pre-processing of QT45 fitness landscape reads (library with multiple registers, three different templates, mutants only, no deletion data):
Here are the steps used to process the raw reads:
1. If not already installed, install PEAR, cutadapt, and fastx_toolkit, add their directories to your PATH by editing your shell's configuration file, and source it. Note that a computing cluster was used: in steps 3 and 5, the -j 112 flag in the bash scripts sets the number of cores used; if you do not have 112 cores available, change it to the number of cores you have.
2. Go to the fitness_landscape_dryad/QT45_dryad directory.
3. Merge reads with PEAR by running the bash script:
   ./run_pear.sh
4. Combine all files by running the bash script:
   ./combine_fastqs.sh
5. Demultiplex with cutadapt by running the bash script:
   ./demultiplex_fastqs.sh
6. Quality filter with fastx by running the bash script:
   ./filter_fastqs.sh
7. Collapse all reads with fastx by running the bash script:
   ./collapse_fastqs.sh
After the above steps, you will find the processed reads in the collapsed_reads folder (from step 7), reads from intermediate steps in appropriately named folders (merged_reads from step 3, combined_reads from step 4, demultiplexed_reads from step 5, filtered_reads from step 6), and summaries of the processing steps in the 'summaries' folder.
Pre-processing of QT39 fitness landscape reads:
Here are the steps used to process the raw reads:
1. If not already installed, install PEAR, cutadapt, and fastx_toolkit, add their directories to your PATH by editing your shell's configuration file, and source it. Note that a computing cluster was used: in steps 3, 5, and 6, the -j 112 flag in the bash scripts sets the number of cores used; if you do not have 112 cores available, change it to the number of cores you have.
2. Go to the fitness_landscape_dryad/QT39_ directory.
3. Merge reads with PEAR by running the bash script:
   ./run_pear.sh
4. Trim the leftmost 4 bases with fastx_trimmer and demultiplex with fastx_barcode_splitter by running the bash script:
   ./trim_bcsplit.sh
5. After the barcode is trimmed away in the step above, trim away the next leftmost 20 bases by running the bash script:
   ./left_trim_fastqs.sh
6. Trim away the adaptor on the right with cutadapt by running the bash script:
   ./right_trim_fastqs.sh
7. Quality filter with fastx by running the bash script:
   ./filter_fastqs.sh
8. Collapse all reads with fastx by running the bash script:
   ./collapse_fastqs.sh
After the above steps, you will find the processed reads in the collapsed_reads folder (from step 8), reads from intermediate steps in appropriately named folders (merged_reads from step 3, demultiplexed_reads from step 4, left_trimmed from step 5, right_trimmed from step 6, filtered_reads from step 7), and summaries of the processing steps in the 'summaries' folder.
Pre-processing of reads for fidelity of the mini hammerhead ribozyme synthesis:
Raw sequencing data and read processing pipeline used to measure the fidelity of synthesis of the mini hammerhead ribozyme. The forward read is seq3_R1.fastq.gz and the reverse read is seq3_R2.fastq.gz. These are found in fidelity_dryad.zip>HHz_pre_processing_dryad.
The reads were processed using the commands reported below. These require the installation of PEAR, BBTools, fastx_toolkit and cutadapt. They are written to run in the directory containing the raw fastq reads.
Reads were merged using PEAR:
pear -e -j 4 -f seq3_R1.fastq.gz -r seq3_R2.fastq.gz -o ./HHz_reads
The first three nucleotides of the reads were trimmed using BBTools and the reads were then quality filtered using the fastx_toolkit, requiring 100% of the bases in each read to have Q30 or above:
bbduk.sh in1=HHz_reads.assembled.fastq out=stdout.fastq ftl=3 | fastq_quality_filter -Q 33 -q 30 -p 100 -v -o HHz_reads_trimmed.fastq
BBTools was then used to convert the reads to fasta format:
reformat.sh in=HHz_reads_trimmed.fastq out=HHz_processed.fasta fastawrap=300
Cutadapt was used for the demultiplexing of the reads using the barcodes found in the barcodes subfolder.
bash demultiplex_fastas.sh
The processed, demultiplexed reads and the code used to analyze them can be found on Zenodo.
Pre-processing of reads for fidelity of the full hammerhead ribozyme synthesis:
Raw sequencing data and read processing pipeline used to measure the fidelity of synthesis of the full hammerhead (seq0HH) ribozyme. This is found in dryad_full_HH_fidelity.zip. The forward read is seq0run_S1_L001_R1_001.fastq.gz and the reverse read is seq0run_S1_L001_R2_001.fastq.gz.
The reads were processed using the commands reported below. These require the installation of fastp and cutadapt. They are written to run in the directory containing the raw fastq reads.
Reads were trimmed, quality filtered, and merged using fastp:
fastp --in1 ./seq0run_S1_L001_R1_001.fastq.gz \
--in2 ./seq0run_S1_L001_R2_001.fastq.gz \
--out1 seq0run_S1_L001_R1_001_filt.fastq --out2 seq0run_S1_L001_R2_001_filt.fastq \
--merge --merged_out seq0run_S1_merged.fastq \
--trim_front1 3 \
--trim_front2 3 \
--qualified_quality_phred 30 --unqualified_percent_limit 5 \
--length_required 49
Cutadapt was used to demultiplex the reads using the barcodes found in the barcodes subfolder. A first round of demultiplexing used the barcodes in barcodes1.fasta with the following commands:
mkdir -p round1
cutadapt -g file:barcodes1.fasta --no-indels -e 0 --action=retain -o "round1/{name}.fastq.gz" seq0run_S1_merged.fastq > ./round1/step1_report.txt
Then a second round of demultiplexing was done using the barcodes in barcodes2.fasta. After making a new subfolder:
mkdir -p demultiplexed_final
the following python script was run:
import os

for file in os.listdir("round1/"):
    if file.endswith(".fastq.gz"):
        base_name = file.replace(".fastq.gz", "")
        cmd = f"cutadapt -a file:./barcodes/barcodes2.fasta --no-indels -e 0 --action=retain -o 'demultiplexed_final/{base_name}_{{name}}.fasta' round1/{file} >> ./demultiplexed_final/step2_report.txt"
        os.system(cmd)
The processed, demultiplexed reads and the code used to analyze them can be found on Zenodo.
Pre-processing of reads for fidelity of (-) and (+) strand self-synthesis:
Raw sequencing data and read processing pipeline to measure the fidelity of (-) and (+) strand synthesis. These are found in fidelity_dryad.zip>self_pre_processing_dryad.
The reads were processed using the commands reported below. These require the installation of BBTools, fastx_toolkit and cutadapt. They are written to run in the directory containing the raw fastq reads.
The first three nucleotides of the reads were trimmed using BBTools and the reads were then quality filtered using the fastx_toolkit, requiring 100% of the bases in each read to have Q30 or above:
bbduk.sh in1=hello_S1_L001_R1_001.fastq.gz out=stdout.fastq ftl=3 | fastq_quality_filter -Q 33 -q 30 -p 100 -v -o self_reads_trimmed.fastq
BBTools was then used to convert the reads to fasta format:
reformat.sh in=self_reads_trimmed.fastq out=self_processed.fasta fastawrap=300
Cutadapt was used for the demultiplexing of the reads using the barcodes found in the barcodes subfolder.
bash demultiplex_fastas.sh
The processed, demultiplexed reads and the code used to analyze them can be found on Zenodo.
Analysis pipeline of processed reads
The processed reads can be generated from the raw reads on Dryad using the pipelines above, or are provided for convenience on Zenodo. The python code to analyze the processed reads is found on Zenodo.
The data structure of the files on Zenodo is the following:
optimized_fitness_landscape_zenodo.zip
- QT45 (optimized library, single UGC template, mutants + single deletions)
  - collapsed_reads
  - basepair_colour
  - fitness_landscape_analysis.ipynb: code for analysis of the processed reads
- 5TU (to generate sub-figures to put together with sub-figures from the QT45 optimized library)
  - collapsed_reads
  - 5TU_statistical_analysis.ipynb: code for analysis of the processed reads
fitness_landscape_zenodo.zip
- QT45
  - processed_reads
  - heatmap_analysis.ipynb: code for analysis of the processed reads
- QT39
  - processed_reads
  - heatmap_analysis.ipynb: code for analysis of the processed reads
- 5TU
  - neutral_variant_percentage_calc.ipynb
fidelity_zenodo.zip
- HHz_analysis_zenodo.zip
  - processed_HHz: processed reads for hammerhead
  - p10HHp1.fasta: template for alignment
  - HHz_fid_slim.ipynb: code for analysis of the processed reads
- self_analysis_zenodo.zip
  - processed_self: processed reads for self synthesis
  - self_fid_slim.ipynb
  - QT45minusref.fasta: template for alignment
  - QT45plusref.fasta: template for alignment

zenodo_full_HH.zip
- demultiplexed_final: processed reads for full hammerhead
- reference_fastas: templates for alignment
- seq0HHsynthesis_fidelity.ipynb
Fitness landscape analysis
QT45 analysis (optimized library, single UGC template, mutants + single deletions) (found on zenodo in optimized_fitness_landscape_zenodo>QT45)
This subfolder contains the dataset and code used to generate the fitness landscape of ribozyme QT45 (optimized library, single UGC template, mutants + single deletions). Data and code in this subfolder were used to generate Figure 2D (Fitness heatmap of all measured single and double mutants of QT45 ribozyme) and Figure 2E (Predicted secondary structure of QT45 ribozyme coloured according to single mutant fitness and single deletion fitness). Data and code in this subfolder were also used to generate Figure S14 (Replicates show strong correlation in fitness measurements), Figure S15 (Fitness value distributions suggest that QT45 shows greater sensitivity to mutations than 5TU), Figure S20 (QT45 epistasis is biased towards negative values, but less so than 5TU), Figure S21 (Overview of evidence for base pairing), and Figure S22 (Epistatic landscape reveals secondary structure interactions).
Jupyter notebook fitness_landscape_analysis.ipynb contains the pipeline for analyzing processed sequencing data found in collapsed_reads. Here is a general outline of the pipeline: The processed data is first read into a dictionary of read counts. To calculate single mutant fitness values used for plotting the colours on the secondary structure and to calculate double mutant fitness values used for plotting the fitness landscape, reads from ‘out nodel A’, ‘out nodel B’, ‘out nodel C’ were combined and reads from ‘out del A’, ‘out del B’, ‘out del C’ were combined. Then, sequences for all possible single and double mutants are generated and their read counts are looked up in the dictionary. These read counts are used to shortlist sequences that meet the minimum read count requirements (at least 10 reads in the starting library and at least one read in the combined after library). The read counts of shortlisted sequences are then used to calculate their fractional abundances in each library. The enrichment of each sequence is then calculated by dividing its fractional abundance after selection by its fractional abundance in the starting library. The enrichment of the sequence is then divided by the enrichment of the wild-type sequence to generate a normalized enrichment. Finally, taking the log2 of this normalized enrichment gives the fitness value, which is then plotted on a heatmap using seaborn.
To calculate epistasis values used to plot Figure S21 and S22, the two constituent single mutant fitness values calculated from combining replicates were subtracted from each double mutant fitness value. To calculate fitness and epistasis values shown in Table S4, S5, and S6, reads from ‘out nodel A’, ‘out nodel B’, ‘out nodel C’ were kept separate. Similarly, reads from ‘out del A’, ‘out del B’, ‘out del C’ were kept separate. Genotypes containing 10 reads or more in the input libraries, as well as at least 1 read in each of the three corresponding combined output libraries were retained for downstream analysis to generate fitness values and subsequently used to calculate epistasis values.
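The fitness and epistasis calculations described above can be sketched in a few lines of Python. This is a minimal illustration with toy read counts and a four-nucleotide "wild type"; the function and variable names are ours, not the notebook's:

```python
import math

def fitness(seq, counts_before, counts_after, wt):
    """log2 of the enrichment of seq, normalized to the wild-type enrichment.

    counts_before / counts_after map sequence -> read count in the starting
    and selected libraries. Sequences need >= 10 reads before selection and
    >= 1 read after, mirroring the shortlisting criteria described above.
    """
    if counts_before.get(seq, 0) < 10 or counts_after.get(seq, 0) < 1:
        return None  # fails the minimum read-count requirements
    tot_before, tot_after = sum(counts_before.values()), sum(counts_after.values())
    # fractional abundance in each library, then enrichment across selection
    enrich = (counts_after[seq] / tot_after) / (counts_before[seq] / tot_before)
    enrich_wt = (counts_after[wt] / tot_after) / (counts_before[wt] / tot_before)
    return math.log2(enrich / enrich_wt)

def epistasis(f_double, f_single_a, f_single_b):
    """Double-mutant fitness minus its two constituent single-mutant fitnesses."""
    return f_double - f_single_a - f_single_b

# toy libraries: wild type, two single mutants, one double mutant
before = {"ACGU": 1000, "AAGU": 50, "ACGA": 40, "AAGA": 30}
after = {"ACGU": 2000, "AAGU": 60, "ACGA": 20, "AAGA": 10}
f_wt = fitness("ACGU", before, after, wt="ACGU")  # 0.0 by construction
```

The normalization to the wild-type enrichment means the wild type has fitness 0 by definition, so negative values indicate mutants less fit than wild type.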
Running the code in fitness_landscape_analysis.ipynb generates fitness values that are saved in single_double_mutant_csvs, figures that are saved in figures_from_code, and data found in basepair_colour that is used to make Figure S21 (Overview of evidence for base pairing) using Adobe Illustrator. After running the code, the folder single_double_mutant_csvs will contain:
- combined_single_mutant_fitness.csv that has the single mutant fitness values (calculated from combining replicates). This data is used to plot Figure 2D (Fitness heatmap of all measured single and double mutants of QT45 ribozyme) and Figure 2E (Predicted secondary structure of QT45 ribozyme coloured according to single mutant fitness and single deletion fitness).
- combined_del_single_mutant_fitness.csv that has the single deletion fitness values (calculated from combining replicates). This data is used to plot Figure 2E (Predicted secondary structure of QT45 ribozyme coloured according to single mutant fitness and single deletion fitness).
- double_mutant_fitness_all.csv that has the double mutant fitness values (calculated from combining replicates). This data is used to plot Figure 2D (Fitness heatmap of all measured single and double mutants of QT45 ribozyme) and used to calculate epistasis values used to plot Figure S22 (Epistatic landscape reveals secondary structure interactions).
- complete10_single_mutant_fitness_all.csv that has the single mutant fitness values (calculated from keeping replicates separate). Part of this data is used to calculate epistasis values shown in Tables S4, S5, and S6.
- complete10_double_mutant_fitness_all.csv that has the double mutant fitness values (calculated from keeping replicates separate). Part of this data is shown in Tables S4, S5, and S6 and used to calculate epistasis values shown in Tables S4, S5, and S6.
- deletion_mutation_analysis.csv that has the fitness values of sequences containing both a single mutation and a single deletion (calculated from combining replicates). This data is used in Figure S22 (Epistatic landscape reveals secondary structure interactions).
- bp_mutants_fitness.csv contains fitness values of both single and double mutants involved in base pair breaking and base pair retaining of base pairs in the predicted secondary structure (calculated from combining replicates). This data is used to plot Figure S21 (Overview of evidence for base pairing).
After running the code, the folder figures_from_code will contain:
- combined_single_mutant_heatmap.svg that is not shown in the paper, but is a different visualisation of the single mutant and single deletion data shown in another form in Figure 2E (Predicted secondary structure of QT45 ribozyme coloured according to single mutant fitness and single deletion fitness).
- double_mutant_FL_all.svg is used to make Figure 2D (Fitness heatmap of all measured single and double mutants of QT45 ribozyme).
- doublemut_epistasis_distribution.svg is used to make Figure S20 (QT45 epistasis is biased towards negative values, but less so than 5TU).
- doubleptmut_replicate_distribution.svg is used to make Figure S14 (Replicates show strong correlation in fitness measurements).
- epistasis_combined.svg is used to make Figure S22 (Epistatic landscape reveals secondary structure interactions).
- epistasis_vs_first_fitness.svg is used to make Figure S20 (QT45 epistasis is biased towards negative values, but less so than 5TU).
- fitness_distributions_comparison.png is used to make Figure S15 (Fitness value distributions suggest that QT45 shows greater sensitivity to mutations than 5TU).
- singleptmut_replicate_distribution.svg is used to make Figure S14 (Replicates show strong correlation in fitness measurements).
5TU analysis (found on zenodo in optimized_fitness_landscape_zenodo>5TU)
This subfolder contains the dataset (from [1]) (in subfolder collapsed_reads) and code (in 5TU_statistical_analysis.ipynb) used to analyse 5TU fitness data and make subfigures used in generating Figure S20 (QT45 epistasis is biased towards negative values, but less so than 5TU) and Figure S15 (Fitness value distributions suggest that QT45 shows greater sensitivity to mutations than 5TU). Epistasis and fitness values were calculated as in the QT45 analysis described above (keeping replicates separate). Running the code in 5TU_statistical_analysis.ipynb generates fitness values that are saved in 5TU_fitness.csv and figures that are saved in figures_from_code.
QT45 analysis (library with multiple registers, three different templates, mutants only, no deletion data) (found on zenodo in fitness_landscape_zenodo.zip>QT45)
This subfolder contains the dataset and code used to generate the fitness landscape of ribozyme QT45. Data and code in this subfolder were used to generate Figure 2D (Fitness heatmap of all measured single and double mutants of QT45 ribozyme) and Figure 2E (Predicted secondary structure of QT45 ribozyme coloured according to single mutant fitness).
Jupyter notebook heatmap_analysis.ipynb contains the pipeline for analyzing processed sequencing data found in processed_data. Here is a general outline of the pipeline: The processed data is first read into a dictionary of read counts. Then, sequences for all possible single and double mutants are generated and their read counts are looked up in the dictionary. These read counts are used to shortlist sequences that meet the minimum read count requirements (at least 10 reads in the starting library and at least one read in the after library). The read counts of shortlisted sequences are then used to calculate their fractional abundances in each library. The enrichment of each sequence is then calculated by dividing its fractional abundance after selection by its fractional abundance in the starting library. The enrichment of the sequence is then divided by the enrichment of the wild-type sequence to generate a normalized enrichment. Finally, taking the log2 of this normalized enrichment gives the fitness value, which is then plotted on a heatmap using seaborn.
Running the code in heatmap_analysis.ipynb generates fitness values that are saved in single_double_mutant_csvs and figures that are saved in figures_from_code. Note that in single_double_mutant_csvs, the numbering for bases starts after the first two ‘G’s in QT45 that are fixed and not mutagenised.
QT39 analysis (found on zenodo in fitness_landscape_zenodo.zip>QT39)
This subfolder contains the dataset and code used to generate the fitness landscape of ribozyme QT39. Data and code in this subfolder were used to generate Figure S12 (Fitness heatmap of all measured single and double mutants of QT39 ribozyme).
Jupyter notebook heatmap_analysis.ipynb contains the pipeline for analyzing processed sequencing data found in processed_data (as described in the Materials and Methods section). The general outline for this pipeline is the same as the one used in the QT45 analysis.
Running the code in heatmap_analysis.ipynb generates fitness values that are saved in single_double_mutant_csvs and figures that are saved in figures_from_code. Note that in single_double_mutant_csvs, the numbering for bases starts from the first base in QT39 including the first two ‘G’s that are fixed and not mutagenised.
Note that the processed data files splitUGC_A.fastq, splitUGC_B.fastq, and splitUGC_C.fastq are not used to generate figures in the paper, but have been included in this dataset in case a similar dataset that utilized a different template is of interest. Reads in these files do not have to be reverse complemented, unlike the CUA and START files.
5TU (found on zenodo in fitness_landscape_zenodo.zip>5TU)
Jupyter notebook neutral_variant_percentage_calc.ipynb that calculates the percentage of neutral single and double mutants using fitness values obtained from ‘Cryo-EM structure and functional landscape of an RNA polymerase ribozyme’ (E. K. S. McRae, C. J. K. Wan et al. 2024) [1]. Briefly, a csv file containing fitness values is loaded into a dictionary and sequences with values equal to or greater than the threshold value of log2(0.9) are counted as neutral or better than wildtype.
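The threshold rule amounts to a couple of lines of Python. The fitness values below are invented for illustration, not the published 5TU data:

```python
import math

# hypothetical log2 fitness values for a handful of mutants (not real data)
fitness_values = {"mut1": 0.05, "mut2": -0.10, "mut3": -3.20, "mut4": 0.50}

threshold = math.log2(0.9)  # about -0.152; at or above counts as neutral or better
neutral = [m for m, f in fitness_values.items() if f >= threshold]
pct_neutral = 100 * len(neutral) / len(fitness_values)
```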
Fidelity analysis
Mini hammerhead synthesis fidelity (found on zenodo in fidelity_zenodo.zip>HHz_analysis_zenodo)
Dataset and code used to calculate average per-nucleotide fidelity and generate Supplementary Figure 25 (Positional fidelity of QT ribozyme-catalyzed synthesis of a mini hammerhead ribozyme). Subfolders contain the processed reads (processing pipeline described on Dryad).
The pre-processed reads in fasta format and the necessary code are provided to be able to re-generate the figures. The subsequent data clean up, read alignment, and data plotting are described in the HHz_fid_slim.ipynb jupyter notebook. Python 3.9 is required to re-run the code, with the following libraries and modules: sys, pandas 1.4.4, re, collections, numpy 1.21.5, matplotlib 3.5.2, seaborn 0.11.2, scipy 1.9.1.
Broadly, the analysis pipeline involves an initial clean-up step where the reads from each individual library are converted into their reverse complement. Then only reads containing exactly the primer sequence, and the sequence of the adapter used in recovery (HDVlig) with no mismatches and of the correct length are shortlisted, with the flanking sequence trimmed away. These reads are then aligned using BBTools and Samtools to a reference sequence (p10HHp1.fasta), and a pileup file is generated. This file is then parsed using a custom python script, and the resulting data stored as a pandas dataframe. This dataframe is used for figure generation and fidelity calculations.
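The clean-up and shortlisting step can be sketched as below. PRIMER, ADAPTER, and PRODUCT_LEN are placeholders, not the actual primer or HDVlig sequences:

```python
COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA-alphabet read."""
    return seq.translate(COMP)[::-1]

# placeholder flanks and expected insert length (not the real sequences)
PRIMER = "ACGTACGT"
ADAPTER = "TTGGCC"
PRODUCT_LEN = 10

def clean_read(read):
    """Reverse-complement a read, require exact flanking sequences (no
    mismatches) and the correct insert length, and trim the flanks away.
    Returns the trimmed insert, or None if the read fails any filter."""
    rc = revcomp(read)
    if not (rc.startswith(PRIMER) and rc.endswith(ADAPTER)):
        return None
    insert = rc[len(PRIMER):-len(ADAPTER)]
    return insert if len(insert) == PRODUCT_LEN else None
```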
Full hammerhead (seq0HH) synthesis fidelity (found on zenodo in zenodo_full_HH.zip)
Dataset and code used to calculate average per-nucleotide fidelity and generate Figure 3D (Positional fidelity of QT ribozyme-catalyzed synthesis of an active ribozyme) and Supplementary Figure 24. Subfolders contain the processed reads (processing pipeline described on Dryad).
The pre-processed reads in fasta format and the necessary code are provided to be able to re-generate the figures. The subsequent data clean up, read alignment, and data plotting are described in the seq0HHsynthesis_fidelity.ipynb jupyter notebook. Python 3.9 is required to re-run the code, with the following libraries and modules: sys, pandas 1.4.4, re, collections, numpy 1.21.5, matplotlib 3.5.2, scipy 1.9.1.
Broadly, the analysis pipeline involves an initial clean-up step where the reads from each individual library are converted into their reverse complement. Then only reads containing exactly the primer sequence (P10), and the sequence of the adapter used in recovery (HDVlig) with no mismatches and of the correct length are shortlisted, with the flanking sequence trimmed away. These reads are then aligned using BBTools and Samtools to a reference sequence (in the reference_fastas folder), and a pileup file is generated. This file is then parsed using a custom python script, and the resulting data stored as a pandas dataframe. This dataframe is used for figure generation and fidelity calculations.
(+) and (-) strand synthesis fidelity analysis (found on zenodo in fidelity_zenodo.zip>self_analysis_zenodo)
Dataset and code used to calculate average per-nucleotide fidelity and generate Figure 4C (Positional fidelity of QT-ribozyme-catalyzed synthesis of its complementary (-) strand and of itself (+ strand)).
The pre-processed reads in fasta format and the necessary code are provided to be able to re-generate the figures. The data clean up, read alignment, and plotting are described in the self_fid_slim.ipynb jupyter notebook. Python 3.9 is required to re-run the code, with the following libraries and modules: sys, pandas 1.4.4, re, collections, numpy 1.21.5, matplotlib 3.5.2, seaborn 0.11.2, scipy 1.9.1.
Broadly, the analysis pipeline involves an initial clean-up step where the reads from each individual library are converted into their reverse complement. Then only reads containing exactly the primer sequence, and the sequence of the adapter used in recovery (HDVlig) with no mismatches and of the correct length are shortlisted, with the flanking sequence trimmed away. These reads are then aligned using BBTools and Samtools to a reference sequence (QT45minusref.fasta for (-) strand synthesis, QT45plusref.fasta for (+) strand synthesis), and a pileup file is generated. This file is then parsed using a custom python script, and the resulting data stored as a pandas dataframe. This dataframe is used for figure generation and fidelity calculations.
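For orientation, counting matches and mismatches from a samtools mpileup base column looks roughly like this. It is a simplified stand-in for the custom parsing script: it handles match symbols, mismatch bases, read start/end markers, and indel runs, and ignores deletions and other pileup features:

```python
import re

def parse_pileup_bases(bases):
    """Count matches ('.' and ',') and mismatches (A/C/G/T) in an mpileup
    base string. Read-start markers (^ plus a quality char), read-end
    markers ($), and indel runs (+N/-N followed by N bases) are stripped."""
    bases = re.sub(r"\^.", "", bases)   # drop read-start marker + mapping quality
    bases = bases.replace("$", "")      # drop read-end markers
    # drop exactly N inserted/deleted bases after each +N/-N indel marker
    bases = re.sub(r"[+-](\d+)([ACGTNacgtn]+)",
                   lambda m: m.group(2)[int(m.group(1)):], bases)
    matches = bases.count(".") + bases.count(",")
    mismatches = sum(bases.upper().count(b) for b in "ACGT")
    return matches, mismatches

def position_fidelity(bases):
    """Fraction of reads matching the reference at this position."""
    m, mm = parse_pileup_bases(bases)
    return m / (m + mm) if (m + mm) else None
```

Per-position fidelities computed this way can then be collected in a pandas dataframe for plotting, as the notebooks do.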
QT ribozyme Quasispecies
This repository contains code and data to simulate and analyze quasispecies dynamics of a catalytic RNA (QTribozyme) under different mutation rates and fitness landscapes.
Repository structure
code/ Python scripts for simulation and plotting
data/ Input data (mutation tables, fitness files, master sequence)
results/ Output directory where simulation results are generated
Usage
Run the main simulation:
python3 code/QT_quasispecies_dynamics.py -mutation_table data/mutation_table_950.csv
This will generate population dynamics files (CSV) in results/ and plots in figures/.
Plot the steady-state results across fidelities:
python3 code/plot_v6_mutrate_vs_quasispecies.py results/*.csv -o mutation_rate_vs_concentration.png
Options for QT_quasispecies_dynamics.py
| Option | Description | Default |
|---|---|---|
| -mutation_table <file> | CSV with mutation probabilities (e.g. data/mutation_table_950.csv) | data/mutation_table.csv |
| -fasta <file> | FASTA file for the master sequence | data/master_sequence.fasta |
| -f_hd2 <float> | Fitness value for Hamming-distance-2 mutants | 0.01 |
| -fitness_scale <factor> | Multiplies fitness tables to test scaling effects | (none) |
Example:
python3 code/QT_quasispecies_dynamics.py -mutation_table data/mutation_table_980.csv -f_hd2 0.02
Requirements
Python ≥ 3.8 with:
pip install numpy pandas scipy matplotlib
Notes
Running the simulation will automatically create the results/ and figures/ directories if they do not exist.
Large result files are not included in the repository but are generated locally when the model is run.
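For readers new to the model: the dynamics simulated here follow the classical quasispecies (replicator-mutator) equation. Below is a minimal two-class sketch with a hypothetical copying fidelity q and fitness values that are not the repository's actual parameters:

```python
import numpy as np

def quasispecies_steady_state(fitness, Q, x0, steps=5000, dt=0.01):
    """Euler-iterate dx/dt = Q @ (f * x) - phi * x to its steady state.

    fitness: per-class replication rates
    Q[i, j]: probability that replicating class j yields class i
    phi (the mean fitness) is the dilution term keeping sum(x) = 1.
    """
    x = np.array(x0, dtype=float)
    f = np.asarray(fitness, dtype=float)
    for _ in range(steps):
        growth = Q @ (f * x)
        phi = growth.sum()              # mean fitness of the population
        x += dt * (growth - phi * x)
        x = np.clip(x, 0.0, None)
        x /= x.sum()                    # renormalise against numerical drift
    return x

# two classes: master sequence and its mutant cloud; copying fidelity q
q = 0.95
f = [1.0, 0.1]
Q = np.array([[q,     0.0],   # master copied correctly with probability q
              [1 - q, 1.0]])  # errors feed the mutant class (no back-mutation)
x = quasispecies_steady_state(f, Q, [0.5, 0.5])
```

At steady state the master frequency satisfies q·f0 = mean fitness; lowering q toward f1/f0 drives the master class extinct (the error threshold the mutation-rate sweep explores).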
Access information
Licenses/restrictions placed on the data:
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
No Copyright
The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
In no way are the patent or trademark rights of any person affected by CC0, nor are the rights that other persons may have in the work or in how the work is used, such as publicity or privacy rights. Unless expressly stated otherwise, the person who associated a work with this deed makes no warranties about the work, and disclaims liability for all uses of the work, to the fullest extent permitted by applicable law. When using or citing the work, you should not imply endorsement by the author or the affirmer.
5TU data was derived from the following source:
[1] E. K. S. McRae, C. J. K. Wan, E. L. Kristoffersen, K. Hansen, E. Gianni, I. Gallego, J. F. Curran, J. Attwater, P. Holliger, E. S. Andersen, Cryo-EM structure and functional landscape of an RNA polymerase ribozyme. Proc Natl Acad Sci U S A 121 (2024).
