Data from: In vivo functional phenotypes from a computational epistatic model of evolution
Data files
Jan 17, 2024 version files 342.46 MB
- Figure_3_Sanger_Sequencing_50_ug_ml_Ampicillin_Data.zip
- Figure_4_Sanger_Sequencing_100_ug_ml_Ampicillin_Data.zip
- PhaseI_MSA.fasta
- PhaseII_bmDCA_Parameters.mat
- PhaseII_mfDCA_Parameters.mat
- PhaseII_MSA.fasta
- README.md
- SEEC_nt_sequence_trajectories.mat
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, for inferring phylogenetic relationships or potential evolutionary pathways, and for biomedical and industrial applications. Despite these benefits, few studies have validated the propensity of such models to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. Utilizing the Hamiltonian of the joint probability of sequences in the family as a fitness metric, we sampled and experimentally tested for in vivo beta-lactamase activity in E. coli TEM-1 variants. These variants retain family-like functionality while being more active than their WT predecessor. We found that, depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. In this dataset, we include input datasets, simulation trajectories, and experimental data to support the publication: "In vivo functional phenotypes from a computational epistatic model of evolution".
README: Data from: In vivo functional phenotypes from a computational epistatic model of evolution
This dataset includes sequence data, model parameters, simulation trajectories, and experimental Sanger sequencing data related to a model of sequence evolution called Sequence Evolution with Epistatic Contributions (SEEC), applied to beta-lactamase TEM-1.
Description of the data and file structure
SI Dataset S4 (PhaseI_MSA.fasta)
Phase I multiple sequence alignment used for SEEC-AA mfDCA and bmDCA statistical inference. It was obtained from Pfam and pre-processed to remove sequences with more than 5% consecutive gaps.
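The gap filter described for the Phase I alignment could be sketched as follows. This is a minimal illustration, not the authors' pre-processing script: it assumes that "more than 5% consecutive gaps" means the longest uninterrupted run of gap characters exceeds 5% of the aligned length, and that gaps are written as `-` or `.`. The function names and the toy sequences are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_gap_run(seq: str, gap_chars: str = "-.") -> int:
    """Length of the longest run of consecutive gap characters."""
    best = current = 0
    for ch in seq:
        current = current + 1 if ch in gap_chars else 0
        best = max(best, current)
    return best


def filter_by_gap_run(records, max_fraction: float = 0.05) -> List[Tuple[str, str]]:
    """Keep records whose longest gap run is at most max_fraction of the length.

    ASSUMPTION: this reading of the 5% rule is ours, not taken from the source.
    """
    kept = []
    for header, seq in records:
        if seq and longest_gap_run(seq) / len(seq) <= max_fraction:
            kept.append((header, seq))
    return kept


# Toy alignment with made-up sequences, 20 columns each.
toy = """>seq_a
ACDEFGHIKLMNPQRSTVWY
>seq_b
ACDE--HIKLMNPQRSTVWY
>seq_c
ACDEFGHIKL-NPQRSTVWY
""".splitlines()

kept = filter_by_gap_run(read_fasta(toy))
kept_names = [header for header, _ in kept]
print(kept_names)
```

To apply the same check to a real alignment, pass an open file handle for a FASTA file to `read_fasta` instead of the toy lines.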
SI Dataset S5 (PhaseII_MSA.fasta)
Phase II multiple sequence alignment used for SEEC-NT mfDCA and bmDCA statistical inference. It was generated using HMMTools with the TEM-1 sequence as seed (excluding the signal peptide) and default parameters.
SI Dataset S6 (PhaseII_mfDCA_Parameters.mat)
Coupling and local field matrices inferred using mean field DCA with the PhaseII MSA as the input.
Objects:
- PhaseII_mfDCA_eij (size=5523x5523)
- PhaseII_mfDCA_hi (size=21x263)
This file is a .mat file readable in Matlab and compatible with the code found on GitHub (github.com/morcoslab/SEEC-NT).
SI Dataset S7 (PhaseII_bmDCA_Parameters.mat)
Objects:
- eij (size=5523x5523)
- hi (size=21x263)
Coupling (eij) and local field (hi) matrices inferred using Boltzmann machine learning DCA with the PhaseII MSA as the input. The eij matrices have been converted into the format that matches the output of mfDCA. This file is a .mat file readable in Matlab and compatible with the code found on GitHub (github.com/morcoslab/SEEC-NT).
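The matrix sizes above are consistent with 263 alignment positions and 21 states per position (263 x 21 = 5523). As a rough illustration of how such parameters enter a Potts-type Hamiltonian, H(s) = -sum over i<j of e_ij(s_i, s_j) - sum over i of h_i(s_i), here is a small sketch. The index layout is an assumption and is not documented in this README: couplings for the position pair (i, j) and states (a, b) are taken to sit at row i*q + a and column j*q + b, and local fields at hi[a][i], with 0-based indices. Check the SEEC-NT code for the actual convention before using real parameters. The function name and toy values are hypothetical.

```python
def potts_hamiltonian(seq, eij, hi, q=21):
    """Potts-type Hamiltonian of one sequence.

    seq : integer state indices (0-based), one per alignment position
    eij : (L*q) x (L*q) couplings; ASSUMED layout eij[i*q + a][j*q + b]
    hi  : q x L local fields;      ASSUMED layout hi[a][i]
    Works with nested lists or any 2D array that supports x[r][c].
    """
    n_pos = len(seq)
    energy = 0.0
    for i in range(n_pos):
        a = seq[i]
        energy -= hi[a][i]                            # local field term
        for j in range(i + 1, n_pos):
            energy -= eij[i * q + a][j * q + seq[j]]  # pairwise coupling term
    return energy


# Toy model: 3 positions, 2 states (made-up numbers, not from the dataset).
q_toy, n_toy = 2, 3
hi_toy = [
    [0.1, 0.2, 0.3],  # state 0 at positions 0, 1, 2
    [0.4, 0.5, 0.6],  # state 1 at positions 0, 1, 2
]
eij_toy = [[0.0] * (n_toy * q_toy) for _ in range(n_toy * q_toy)]


def set_coupling(i, a, j, b, value):
    """Store a symmetric coupling between (position i, state a) and (j, b)."""
    eij_toy[i * q_toy + a][j * q_toy + b] = value
    eij_toy[j * q_toy + b][i * q_toy + a] = value


set_coupling(0, 1, 1, 0, 1.0)
set_coupling(1, 0, 2, 1, -0.5)

toy_energy = potts_hamiltonian([1, 0, 1], eij_toy, hi_toy, q=q_toy)
print(round(toy_energy, 6))
```

For the toy sequence, the field terms sum to 1.2 and the coupling terms to 0.5, so the printed energy is the negative of their total.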
SI Dataset S8 (SEEC_nt_sequence_trajectories.mat)
Objects:
- SEEC_nt_bmDCA_Trajectory_amino_T0_75_3 (size=5000x263)
- SEEC_nt_mfDCA_Trajectory_amino_T1_5_1 (size=5000x263)
Sequences output from SEEC-NT used for variant selection. This file is a .mat file readable in Matlab.
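One simple way to inspect a trajectory of this shape is to track how far each sequence has drifted from the starting sequence. The sketch below assumes that each of the 5000 rows is one simulation step and each of the 263 columns is one alignment position; whether the stored entries are characters or integer codes is not stated here, and the comparison works for either. The function names and the toy trajectory are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def divergence_from_start(trajectory):
    """Hamming distance of every row to the first row of the trajectory."""
    start = trajectory[0]
    return [hamming(start, row) for row in trajectory]


# Toy trajectory (made-up 4-residue sequences, one row per step).
toy_trajectory = ["ACDE", "ACDF", "GCDF", "GCKF"]
toy_divergence = divergence_from_start(toy_trajectory)
print(toy_divergence)
```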
SI Dataset S9 (Figure_3_Sanger_Sequencing_50_ug_ml_Ampicillin_Data.zip), SI Dataset S10 (Figure_4_Sanger_Sequencing_100_ug_ml_Ampicillin_Data.zip)
Raw Sanger sequencing chromatograms collected from plasmid samples isolated from assay cultures. Naming of chromatograms is as follows:
- The first number refers to the batch of sequencing.
- The second number is the sample run within that batch.
- "For" or "rev" refers to the forward or reverse sequencing direction, respectively.
- The rest of the name comes from the variant name as used in the manuscript, where Beg, Mid or Late refer to positions in the simulation trajectory, bm or mf refer to the DCA implementation used to infer the coupling and local field parameters, the number is the variant number, and NT indicates the algorithm used was SEEC-nucleotide.
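The naming scheme above can be turned into a small parser. The exact delimiters, token order within the variant name, and file extension are not specified in this README, so the sketch below assumes tokens separated by non-alphanumeric characters and uses a hypothetical file name (`2_14_for_Mid_bm_7_NT.ab1`) purely for illustration. Adjust it after inspecting the real files in the archives.

```python
import re

STAGES = {"beg": "Beg", "mid": "Mid", "late": "Late"}
METHODS = {"bm": "bmDCA", "mf": "mfDCA"}
DIRECTIONS = {"for": "forward", "rev": "reverse"}


def parse_chromatogram_name(name):
    """Split a chromatogram file name into its documented components.

    ASSUMPTIONS: tokens are separated by non-alphanumeric characters, and the
    batch number, run number, and direction are the first three tokens.
    """
    stem = name.rsplit(".", 1)[0]  # drop a trailing extension, if any
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", stem) if t]
    if len(tokens) < 3 or not (tokens[0].isdigit() and tokens[1].isdigit()):
        raise ValueError(f"unexpected name format: {name!r}")
    direction = DIRECTIONS.get(tokens[2].lower())
    if direction is None:
        raise ValueError(f"missing 'for'/'rev' token in {name!r}")

    info = {
        "batch": int(tokens[0]),
        "run": int(tokens[1]),
        "direction": direction,
        "stage": None,      # Beg, Mid or Late
        "dca": None,        # bmDCA or mfDCA
        "variant": None,    # variant number
        "seec_nt": False,   # True if the NT tag is present
    }
    for tok in tokens[3:]:
        low = tok.lower()
        if low in STAGES:
            info["stage"] = STAGES[low]
        elif low in METHODS:
            info["dca"] = METHODS[low]
        elif low == "nt":
            info["seec_nt"] = True
        elif tok.isdigit():
            info["variant"] = int(tok)
    return info


# Hypothetical file name, not taken from the dataset.
parsed = parse_chromatogram_name("2_14_for_Mid_bm_7_NT.ab1")
print(parsed)
```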
Sanger sequencing chromatograms can be viewed using free software such as 4peaks (https://nucleobytes.com/4peaks/), Benchling (https://www.benchling.com/), or a number of other platforms.
Code/Software
Code and scripts used to generate the data in this repository can be found at https://github.com/morcoslab/SEEC-NT
The Boltzmann machine DCA (bmDCA) implementation used can be found at https://github.com/matteofigliuzzi/bmDCA
Methods
The direct coupling analysis methods used were mean field (https://github.com/morcoslab/SEEC-nt) or Boltzmann machine learning (https://github.com/matteofigliuzzi/bmDCA).
Usage notes
.mat files are intended to be opened in Matlab.
Sanger sequencing data can be viewed using 4peaks (https://nucleobytes.com/4peaks/), Benchling (https://www.benchling.com/), or a number of other platforms.
FASTA files can be read using Matlab, Biopython, or any multiple sequence alignment visualization software.