RNA polymerase evolution data files and code (1/2)

Choudhury, Alaksh, ESPCI Paris, https://orcid.org/0000-0002-2080-3551

alaksh.choudhury@colorado.edu

Published Nov 07, 2023 on Dryad. https://doi.org/10.5061/dryad.n8pk0p30n

Cite this dataset

Choudhury, Alaksh (2023). RNA polymerase evolution data files and code (1/2) [Dataset]. Dryad. https://doi.org/10.5061/dryad.n8pk0p30n

Abstract

RNA polymerase (RNAP) is emblematic of complex biological systems that control multiple traits involving trade-offs such as growth versus maintenance. Laboratory evolution has revealed that mutations in RNAP subunits, including RpoB, are frequently selected. However, we lack a systems view of how mutations alter the RNAP molecular functions to promote adaptation. We, therefore, measured the fitness of thousands of rpoB variants under multiple conditions and genetic backgrounds, to find that adaptive mutations cluster in two separate modules. Mutations in one module favor growth over maintenance through a partial loss of an interaction associated with faster elongation. Mutations in the other favor maintenance over growth through a destabilized RNAP-DNA complex. The two molecular handles capture the versatile RNAP-mediated adaptations. Combining both interaction losses simultaneously improved maintenance and growth, challenging the idea that the growth-maintenance tradeoff arises only from limited resources, and revealing how compensatory evolution operates within RNAP. The current dataset contains code files associated with the above study. You can follow the readme for details on the code files. The data is submitted as a separate submission: DOI: 10.5061/dryad.zw3r228c4.

README: RNA polymerase evolution

The current dataset contains data files associated with the above study. You can follow the Readme for details on the code files. The FASTQ files are submitted as a separate submission: DOI: 10.5061/dryad.zw3r228c4.

Description of the data and file structure

Information about the fastq data and the samples from which they were extracted can be found in the tsv file: Sample checklist_1674571220193.tsv. The map between files and experimental samples can be found in the tsv file: fastq2_template_1674570722019.tsv

Here is a description of the other files.
Place all these files in a subfolder Fitness\ to be able to run the code:

_Preenrichment.csv: There are multiple files whose names end in "_Preenrichment.csv". Each "x_Preenrichment.csv" file contains the raw read count data for one evolution experiment. The name of the selection is given before the "_". For example, GlucoseA_Preenrichment contains the count data for each variant for selection in glucose. In addition, the preenrichment file contains the nucleotide change and amino acid change in each variant.
_slopes.csv: Contains the fitness/enrichment value for each variant in the database. The methods for slope calculation are explained in the Materials and Methods of the manuscript and in the code in the GitHub repository.
table_with_slopes.csv: Combines the fitness and enrichment from multiple experiments.
rpoB_structure.txt: the file contains the PDB coordinates for the RNA polymerase Beta subunit. They have been used in the code to map the correlation between fitness and distance from ligands in the RNA polymerase.
Grantham.tsv: Contains the Grantham scores for the change in amino acids. It has been used in the code to correlate the Grantham score to fitness.
combined_with_cluster.csv: A dataset with the reads for all the experiments combined and clustered for analysis. The clustering was done using Code 2, described below.
dummy_data.csv: A smaller dummy dataset to run the code in case the big file is computationally expensive.
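
The file naming convention above can be used to locate the count tables programmatically. The following is a minimal sketch, not part of the published code: the function names are hypothetical, and it only assumes that the files sit in a folder named Fitness and end in "_Preenrichment.csv" as described above.

```python
from pathlib import Path

SUFFIX = "_Preenrichment.csv"


def selection_name(path):
    """Return the selection name, i.e. the file name without the suffix."""
    name = Path(path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a preenrichment file: {name}")
    return name[: -len(SUFFIX)]


def find_preenrichment_files(folder="Fitness"):
    """Map each selection name to the path of its preenrichment table."""
    paths = sorted(Path(folder).glob("*" + SUFFIX))
    return {selection_name(p): p for p in paths}
```

Each returned path can then be read with pandas.read_csv. The column names inside the tables are not listed in this README, so they should be inspected before use.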

Code/Software

The detailed code can be found at: https://github.com/Alaksh/RNA-Polymerase-Evolution.git and Zenodo link 10.5281/zenodo.8144064. All code was written in Python.

RNA-Polymerase-Evolution

Code for submission of RNA polymerase evolution data
The supplied code summarizes all the preprocessing, calculations, and codes to generate graphs for the submitted manuscript.

The code can be downloaded from GitHub and run on the computer.

To run the code, you will need Anaconda Python with the following dependencies installed: Pandas, Biopython, Seaborn, Scipy, and Numpy.
Additionally, all sequence analysis was done using the USEARCH algorithm, which needs to be installed as well.
Please check specific installation instructions for each of these tools.
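
Before opening the notebooks, it can help to check that the dependencies listed above are visible to the Python interpreter. The sketch below is an illustration only and is not part of the repository. The executable name "usearch" is an assumption, since USEARCH binaries are often distributed under version-specific names; adjust it to match your installation.

```python
import importlib.util
import shutil

# Import names of the packages listed above (Biopython is imported as "Bio").
REQUIRED_MODULES = ["pandas", "Bio", "seaborn", "scipy", "numpy"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found by the import system."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def usearch_on_path(executable="usearch"):
    """Return True if the (assumed) USEARCH executable name is on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print("Missing Python packages:", missing_modules())
    print("USEARCH found on PATH:", usearch_on_path())
```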

Here is a brief description of each file.

For several codes, the preenrichment tables could not be uploaded. So, I have uploaded alternate processed files to run the code if needed.
The code needed to run these files has been marked as comments and can be run if needed.

Preenrichment_calculation.ipynb: Preliminary code to process the sequencing data and generate counts tables.

Codes 1 through 3 cannot be run without downloading files from the sequencing repository: DOI: 10.5061/dryad.zw3r228c4.

Code 1 Preprocessing:
The preprocessing code processes the raw reads: it merges the Fasta files and maps them to the RNA polymerase target sequence.

Code 2 Clustering of Data:
The code is to cluster the reads. We wrote a custom algorithm to cluster the sequences, which takes the sequence read count into consideration.
Some regions within the target were sequenced but not mutated. We used these regions to estimate an error frequency.
We set a threshold above which we considered a variant to be real and not an artifact of sequencing errors.
We then used the error frequency to identify possible sequences that were a part of the cluster.
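
The idea described above can be illustrated with a small sketch. This is not the custom algorithm from the repository; it is a simplified, hypothetical version that assumes equal-length reads, a single-mismatch neighbourhood, and a lenient bound (error rate times the parent count) for what counts as a likely sequencing error. All function and parameter names are illustrative.

```python
def error_frequency(reads, reference, constant_positions):
    """Per-base mismatch rate at positions that were sequenced but not mutated.

    reads maps each sequence to its read count.
    """
    mismatches = total = 0
    for seq, count in reads.items():
        for i in constant_positions:
            total += count
            if seq[i] != reference[i]:
                mismatches += count
    return mismatches / total if total else 0.0


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, error_rate, min_count):
    """Greedy, count-aware clustering (assumed scheme, see lead-in).

    Sequences with at least min_count reads are treated as real variants.
    A rarer sequence is merged into the most abundant real variant that is
    one mismatch away, provided its count is within error_rate * parent count.
    """
    ordered = sorted(reads.items(), key=lambda kv: -kv[1])
    seeds = {seq: count for seq, count in ordered if count >= min_count}
    clusters = dict(seeds)
    unassigned = {}
    for seq, count in ordered:
        if seq in seeds:
            continue
        parent = None
        for seed, seed_count in seeds.items():  # most abundant seed first
            if (len(seed) == len(seq) and hamming(seq, seed) == 1
                    and count <= error_rate * seed_count):
                parent = seed
                break
        if parent is None:
            unassigned[seq] = count
        else:
            clusters[parent] += count
    return clusters, unassigned
```

In this sketch, the error rate returned by error_frequency would be passed to cluster_reads, and min_count plays the role of the threshold separating real variants from sequencing artifacts.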

Code 3 Fitness estimates:
For each condition, the fitness was estimated using the algorithm described in the manuscript.
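
The algorithm itself is described in the manuscript and is not reproduced here. As a rough illustration of what a slope-based enrichment estimate can look like, the sketch below fits a least-squares line to the log2 ratio of variant counts to reference counts over time. The normalization to a reference, the pseudocount, and all names are assumptions made for this example only.

```python
import math


def log_ratio(variant_count, reference_count, pseudocount=0.5):
    """log2 ratio of variant to reference counts (pseudocount is assumed)."""
    return math.log2((variant_count + pseudocount)
                     / (reference_count + pseudocount))


def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def enrichment_slope(times, variant_counts, reference_counts, pseudocount=0.5):
    """Slope of the log2 count ratio over time for one variant."""
    ratios = [log_ratio(v, r, pseudocount)
              for v, r in zip(variant_counts, reference_counts)]
    return slope(times, ratios)
```

A variant whose counts double at every time point relative to a constant reference would have a slope of 1 under this definition (with the pseudocount set to 0).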

Code 4 Glucose fitness effects (Alternate file: Dummy_data.csv)
The code covers all panels of Figure 5, where we estimate the growth-associated fitness.

Code 5 Analysis KEIO deltolC (Alternate file: keio_deltoC_slopes.csv)
The code covers all panels of Figure 5, where we estimate the CBR703-associated enrichment.

Code 6 and 7 epistasis (Alternate file: Dummy_data.csv and keio_deltoC_slopes.csv):
The code covers all panels of Figure 5, with growth-associated and CBR703-associated epistasis.
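
This README does not state how epistasis is computed; the manuscript and notebooks are the reference. For orientation only, a common convention on log-scale fitness values is the deviation of the double mutant from the sum of the two single mutants. The sketch below assumes that additive convention and uses hypothetical names.

```python
def additive_epistasis(f_double, f_single_a, f_single_b):
    """Deviation of the double-mutant fitness from the additive expectation.

    All values are assumed to be log-scale fitness relative to wild type.
    Positive results mean the double mutant is fitter than expected.
    """
    return f_double - (f_single_a + f_single_b)
```

Under this convention, a value near zero indicates that the two mutations act independently.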

Code 8 (Alternate file: Dummy_data.csv and keio_deltoC_slopes.csv):
Code 8 covers Figure 2, where fitness is compared between conditions.

Code 9 and code 10 fitness and epistasis in delta relA spoT strains (Alternate file: relA_spoT_slopes.csv, Dummy_data.csv and keio_deltoC_slopes.csv):
Code 9 covers Figure 4, where we describe stringent mutations and epistasis for stringent enrichment.

Code 11 (delrelAspoT comparison and analysis): describes stringent mutations and epistasis for stringent enrichment.

Funding

Agence Nationale pour le Développement de la Recherche Universitaire, Award: ANR-18-CE35-0005-01

Fondation pour la Recherche Médicale, Award: EQU201903007848

United States Department of Energy, Award: DE-SC0018368