Deep learning guided design of dynamic proteins
Data files
Jul 02, 2025 version files 1.85 GB
frame2seq_scores.csv - 171.50 KB
MD_files.zip - 1.82 GB
PDBs.zip - 33.57 MB
Plasmids.zip - 11.30 KB
README.md - 16.85 KB
Scripts.zip - 455.17 KB
Abstract
Deep learning (DL) has advanced the design of static protein structures, but the controlled conformational changes that are hallmarks of natural signaling proteins have remained inaccessible to de novo design. Here, we describe a general DL-guided approach for the de novo design of dynamic changes between intradomain geometries of proteins, similar to switch mechanisms prevalent in nature, with atomic-level precision. Our method involves three general stages: (1) identifying alternative structural states through computational conformational sampling, (2) using DL sequence-to-structure models to restrict the designable sequence space explored during multi-state design, and (3) understanding the molecular basis underlying dynamics through simulations and DL predictions. We solve four structures that validate the designed conformations, demonstrate modulation of the conformational landscape by orthosteric ligands and allosteric mutations, and show that physics-based simulations are in agreement with DL predictions and experimental data. Our approach demonstrates that new modes of motion can now be realized through de novo design and provides a framework for constructing biology-inspired, tunable, and controllable protein signaling behavior de novo. This dataset includes the necessary python scripts, plasmid backbones, computational structural models, experimental data, and simulation trajectories to reproduce our results.
This dataset contains the plasmid backbone sequences (Plasmids.zip), molecular dynamics trajectory data (MD_files.zip), design scripts (Scripts.zip), and computational structure files (PDBs.zip) associated with the publication “Deep learning guided design of dynamic proteins”.
Note: The PDB files have been renumbered to be consistent with our experimentally solved structures deposited in the PDB: residues are indexed starting from 1, including the 4-residue N-terminal thrombin cleavage site scar. If the scar is not modeled explicitly, the numbering begins from 5. All data deposited in this repository are numbered according to this convention. However, the single-state and multi-state design scripts using Rosetta/ProteinMPNN use a numbering system where the first residue of the PDB file is indexed as position one (regardless of the residue number assigned in the PDB file itself), as is standard for Rosetta/ProteinMPNN software.
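The two numbering conventions above can be reconciled with a small helper. The following is a minimal sketch, not part of the deposited scripts; the function names are hypothetical, and it assumes the residues in a file are numbered contiguously with no gaps.

```python
def deposited_to_pose_index(resnum, first_resnum):
    """Convert a deposited PDB residue number to a 1-based Rosetta/ProteinMPNN index.

    first_resnum is the number of the first residue present in the file:
    1 if the 4-residue scar is modeled, 5 if it is not.
    Assumes contiguous numbering (no gaps or insertion codes).
    """
    if resnum < first_resnum:
        raise ValueError("residue number precedes the first residue in the file")
    return resnum - first_resnum + 1


def pose_index_to_deposited(pose_index, first_resnum):
    """Inverse of deposited_to_pose_index."""
    if pose_index < 1:
        raise ValueError("pose indices start at 1")
    return pose_index + first_resnum - 1


# Position 89 in a file whose first residue is numbered 5 (scar not modeled)
print(deposited_to_pose_index(89, first_resnum=5))
```

When the scar is modeled (first residue numbered 1), the two conventions coincide and no conversion is needed.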
Description of the data and code
File usage notes
To view the file types included in this repository, we recommend the following open source/free software:
*.gro - molecular structure in Gromos87 format (https://manual.gromacs.org/archive/5.0.4/online/gro.html) - use PyMOL (https://www.pymol.org/) or VMD (https://www.ks.uiuc.edu/Research/vmd/)
*.pdb - molecular structure in Protein Data Bank format (https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html) - use PyMOL or VMD
*.xtc - molecular dynamics trajectory file (https://manual.gromacs.org/archive/5.0.4/online/xtc.html) - use PyMOL or VMD
*.mdp - molecular dynamics parameters options (https://manual.gromacs.org/current/user-guide/mdp-options.html) - use any text editor
*.py - Python code - use any text editor to view or a shell/integrated development environment to run (e.g. Spyder https://www.spyder-ide.org/ or VS Code https://code.visualstudio.com/)
*.ipynb - Jupyter notebook - use Jupyter (https://jupyter.org/) or Google Colab (https://colab.research.google.com/) to view/run these notebooks
*.R - R code - use any text editor to view or RStudio to run (https://posit.co/download/rstudio-desktop/)
*.sh - Bash shell script - run in a terminal window
*.gb - GenBank file containing a DNA sequence (https://www.genbeans.org/ibe/5.3/help/org-genbeans-modules-seqfiles/working_genbank.html) - can open in a regular text editor or sequence analysis software such as Benchling (https://benchling.com/)
Design scripts (Scripts.zip)
Scripts are organized in sequential order by directory prefix (i.e. 1, 2, 3, etc.). Here are descriptions of the scripts in each subdirectory and their purpose:
1_state_2_generation/1_loop_helix_loop_reshaping
These scripts and input files are used to generate backbone conformations for state 2.
1smg_cleaned.pdb - Initial PDB structure to reshape using the loop-helix-loop unit combinatorial sampling method (LUCS).
insertion_points.json - Input file for LUCS describing the first and last residue spanning the reshaped region and their secondary structure (used for selecting appropriate loops).
select_linkers.py, screen_single_insertion_loop_helix_loop_units.py, screen_compatible_loop_helix_loop_units.py - Job scripts for LUCS to sample and build new loop-helix-loop geometries. For more detail on this method generally, please see: https://doi.org/10.1126/science.abc0881 and https://github.com/Kortemme-Lab/loop_helix_loop_reshaping. These scripts were used to generate candidate backbone conformations for state 2 using Rosetta software with the modification that the clash probe residue was changed from valine to alanine given the all-helical backbone of our initial structure.
calculate_reshaped_helix_RMSD.py - Calculates the C-alpha root mean squared deviation (RMSD) of the reshaped helix (defined by having helical secondary structure in both states) for all the LUCS output backbones in reference to the original starting structure and outputs the values in a json file. Used for filtering backbones (i.e. C-alpha RMSD should be > 3 Angstroms to be experimentally distinguishable from the initial starting structure but generally < 10 Angstroms to maintain a well-packed core).
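The RMSD filter described above can be illustrated as follows. This is a minimal sketch rather than the contents of calculate_reshaped_helix_RMSD.py: the function names are hypothetical, coordinates are passed as plain (x, y, z) tuples, and the two structures are assumed to be already superimposed (no fitting is performed here).

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superimposed; no alignment is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def passes_rmsd_filter(rmsd, lower=3.0, upper=10.0):
    """Keep backbones that are distinguishable (> lower) but still compact (< upper)."""
    return lower < rmsd < upper


# Toy example: every C-alpha displaced by 4 Angstroms along y
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reshaped = [(0.0, 4.0, 0.0), (3.8, 4.0, 0.0), (7.6, 4.0, 0.0)]
print(passes_rmsd_filter(ca_rmsd(reference, reshaped)))
```

The default bounds mirror the 3 and 10 Angstrom values quoted above; the upper bound is described there as a general guideline, so it is exposed as a parameter.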
2_single_state_design
These scripts were used to do single-state design on the raw LUCS output backbones, which currently have a poly-alanine or poly-valine sequence in the reshaped region.
1_mutate_metal_ligands.py - Mutates the residues critical for coordinating the ligand to the appropriate amino acid.
2_run_design_info_file_generator.py - Wrapper to run the design_info_file_generator.py script, which outputs design_info.json for each LUCS output backbone describing which residues should be designable and repackable (includes all residues in the reshaped region and their neighbors).
3_single_state_design.py - Runs FastRelax in Rosetta to output n=20 possible sequences for stabilizing the LUCS backbone and outputs their Rosetta score into design_number_info.json.
1_state_2_generation/3_rosetta_abinitio
1_run_ab_initio_frag_generator.py - Runs the appropriate scripts to generate fragments for fragment-based forward folding in Rosetta.
2_ab_initio_frag_dependencies.py - Contains local paths to fragment picker dependencies and defines a function to get fragment quality scores (i.e. C-alpha RMSD between fragments from the PDB matching the amino acid sequence compared to the segment in the desired structure). You must change these paths to match your system.
3_ab_initio_frag_func_lib.py - Function definitions for running fragment-picking.
4_run_biased_frag_generator.py - Wrapper for running the 5_biased_frag_generator.py script. This outputs fragment files where only the 3 fragments closest in C-alpha RMSD to the segment of the desired structure are used during forward folding simulations.
6_run_biased_ff.py - Runs the forward folding simulation in Rosetta using the biased selected fragments. Generates 30 decoy structures by default.
placeholder_seqs - Basic Local Alignment Search Tool (BLAST, see: https://blast.ncbi.nlm.nih.gov/Blast.cgi) sequences used during fragment generation (regular text file).
standard.wghts - Weights to use during fragment generation (regular text file).
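The fragment biasing step (keeping only the 3 fragments closest in C-alpha RMSD to the desired structure at each position) amounts to a per-position sort and truncation. The sketch below is illustrative only: the data layout (a dictionary mapping a position to a list of (fragment_id, rmsd) pairs) and the function names are assumptions, not the format used by 5_biased_frag_generator.py or by Rosetta fragment files.

```python
def select_closest_fragments(candidates, n_keep=3):
    """Return the n_keep (fragment_id, rmsd) pairs with the lowest RMSD."""
    return sorted(candidates, key=lambda item: item[1])[:n_keep]


def bias_fragment_library(library, n_keep=3):
    """Apply select_closest_fragments to every position in the library.

    library: dict mapping position -> list of (fragment_id, ca_rmsd) pairs
    (hypothetical layout used only for this example).
    """
    return {
        position: select_closest_fragments(candidates, n_keep)
        for position, candidates in library.items()
    }


# Made-up RMSD values for two positions
example_library = {
    1: [("f1", 2.1), ("f2", 0.4), ("f3", 1.3), ("f4", 0.9), ("f5", 3.0)],
    2: [("g1", 0.7), ("g2", 0.2)],
}
biased = bias_fragment_library(example_library)
print(biased[1])
```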
2_multi_state_design/1_ProteinMPNN_scripts
This script is used to run multi-state design with position-tied ProteinMPNN.
submit_weighted_biased_homomer.sh - Runs position-tied ProteinMPNN with equal weighting between structural states and a bias towards certain amino acid types at certain positions. For more information about this design algorithm, see the ProteinMPNN publication and GitHub repository: https://doi.org/10.1126/science.add2187 and https://github.com/dauparas/ProteinMPNN.
2_multi_state_design/2_ColabFold
These scripts run the computational mutational scanning used to restrict the designable sequence space during multi-state design. They also predict the structures of the multi-state designs with ColabFold and analyze the C-alpha RMSD between the predictions and the desired states to select designs for characterization.
1_make_pt_mut_fastas.py - Makes fasta files for computational mutational scanning. These output fasta files should be input for ColabFold (for details, see: https://www.nature.com/articles/s41592-022-01488-1 and https://github.com/sokrypton/ColabFold). There are two mutational scanning strategies: (1) conservative - test mutations to the amino acid in the corresponding position in the single-state design for the other state and (2) deep - test mutations to all allowable amino acid types at all positions that differ between single-state designs.
2_run_colabfold.py - Runs ColabFold for all fasta files in an input directory.
3_run_colabfold.sh - Job submission shell script for running ColabFold on GPUs.
4_structure_pred_eval.py - Calculates the C-alpha RMSD between the ColabFold predictions and each of the two user-defined states after alignment on the non-reshaped residues. The RMSD is reported over the entire backbone and within the reshaped region only. Also outputs the confidence score averaged over the entire backbone and over the reshaped region only. The output file is called structure_pred_eval.json.
5_summarize_structure_pred_eval.py - Concatenates all of the outputs from 4_structure_pred_eval.py (which should be run in parallel for speedup) into one summary file called summarized_structure_pred_eval.json.
6_select_switch_designs.py - Contains functions for selecting switch designs based on the C-alpha RMSD of ColabFold predictions to the user-defined structural states as well as the prediction confidence and variance between ColabFold models. A function is also provided for finding designs that are predicted to only adopt state 1 according to ColabFold that are highly similar sequence-wise to designs predicted to only adopt state 2. This allows the user to find dynamic designs to characterize that may make it easier to draw sequence-structure relationships (note: despite ColabFold only predicting one state for these particular designs, we have found them to be dynamic experimentally).
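The two mutational scanning strategies described for 1_make_pt_mut_fastas.py can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the deposited script: function names and the mutant naming scheme (e.g. I3S, using 1-based sequence positions) are hypothetical, both single-state sequences are assumed to have equal length, and "all allowable amino acid types" is taken here to mean the 20 canonical amino acids unless a narrower set is passed in.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutant(sequence, index, new_aa):
    """Return sequence with the residue at 0-based index replaced by new_aa."""
    return sequence[:index] + new_aa + sequence[index + 1:]


def conservative_scan(seq_a, seq_b):
    """At each differing position, mutate seq_a to the residue found in seq_b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    mutants = {}
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            mutants[f"{a}{i + 1}{b}"] = point_mutant(seq_a, i, b)
    return mutants


def deep_scan(seq_a, seq_b, allowed=AMINO_ACIDS):
    """At each differing position, mutate seq_a to every allowed amino acid."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    mutants = {}
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            for aa in allowed:
                if aa != a:
                    mutants[f"{a}{i + 1}{aa}"] = point_mutant(seq_a, i, aa)
    return mutants


def to_fasta(mutants):
    """Format a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in mutants.items())


# Toy sequences that differ at one position
print(to_fasta(conservative_scan("MKIL", "MKSL")))
```

The conservative scan produces one mutant per differing position, while the deep scan produces up to 19 per differing position, so the number of ColabFold jobs grows accordingly.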
3_data_analysis/NMR
These scripts analyze experimental NMR data and reproduce the figures shown in our study.
I85_coord.pdb, S85_coord.pdb - Coordinates for the backbone of state 1 and state 2, respectively, after alignment to the non-reshaped region.
NOE_contact_map_plotter.py - Plots the upper limit distance restraints and the difference between state 1 and state 2 contact maps to visualize whether Nuclear Overhauser Effect Spectroscopy (NOESY)-derived distance restraints are consistent with state 1, state 2, or both.
plot_chemical_env_difference.py - Plots the median change in C-alpha RMSD between close contacts (defined by dist_cutoff in Angstroms) between states 1 and 2 as a proxy for changes in the local environment for each residue.
3_data_analysis/NMR/two_timescale_analysis
R2/ - Site-specific observed R2 rate. The files are named with the following convention: field-strength__temp.txt
R2_15C.txt - Consolidated R2 rates measured at each effective field strength for each residue position at 15C.
R2_25C.txt - Consolidated R2 rates measured at each effective field strength for each residue position at 25C.
weff_15C.txt - Effective field strength for each residue corresponding to each R2 rate in R2_15C.txt.
weff_25C.txt - Effective field strength for each residue corresponding to each R2 rate in R2_25C.txt.
weff.txt - Effective field strength for each residue corrected by the offset from the center carrier frequency.
r2_two_timescale_fit.R - R script to fit two fast exchange processes with separate timescales simultaneously to the relaxation dispersion data (R2_15C.txt, R2_25C.txt, weff_15C.txt, weff_25C.txt, weff.txt). Will conduct a grid search over a range of possible timescale values and plot error contours.
4_MSM_jupyter_notebooks
MSM.ipynb - Jupyter notebook for fitting Markov state models to molecular dynamics (MD) simulation data.
RMSD_featurization.ipynb - Jupyter notebook for generating features from MD simulation trajectories (C-alpha RMSD of the reshaped region compared to user-defined states 1 and 2).
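To make the relationship between the RMSD features and a Markov state model concrete, here is a deliberately simplified sketch. It is not the procedure in MSM.ipynb: it assigns each frame to whichever state has the lower RMSD (a two-state discretization assumed only for illustration) and estimates a transition matrix by counting transitions at a fixed lag and normalizing each row. Production MSM workflows typically use dedicated packages with clustering and more careful estimators.

```python
def assign_states(rmsd_to_state_1, rmsd_to_state_2):
    """Label each frame 0 if it is closer to state 1, otherwise 1."""
    return [0 if r1 <= r2 else 1 for r1, r2 in zip(rmsd_to_state_1, rmsd_to_state_2)]


def transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up RMSD features (Angstroms) for six frames
rmsd_1 = [0.5, 0.8, 1.0, 4.0, 4.2, 0.9]
rmsd_2 = [4.1, 3.9, 3.5, 0.7, 0.6, 3.8]
dtraj = assign_states(rmsd_1, rmsd_2)
print(transition_matrix(dtraj, n_states=2))
```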
PDB files (PDBs.zip)
alternative_backbones_undesigned
Contains the raw LUCS output backbones derived from Scripts/1_state_2_generation. These have been filtered to only include outputs with the same length as the original structure.
RMSD_data/ - The C-alpha RMSD of the reshaped helix in each LUCS output backbone relative to the original backbone (PDB ID: 1SMG).
ordered_single_state_alternative_backbones_designed_PDBs
PDB files of ordered single-state state 2 designs.
multistate_design_input_structures
The PDB files used as inputs to position-tied ProteinMPNN. These should be combined into one PDB file where each structure is separated by a large distance (e.g. over twice the diameter of a given state) - this can be done simply in a modeling software such as PyMOL.
state_1.pdb - 1SMG after a constrained relax in Rosetta with the Ca2+ ion removed.
high_sequence_identity_state_2.pdb - The ColabFold prediction of the single-state state 2 design after mutations to increase sequence identity to state 1 were made. These mutations can be identified with Scripts/2_multi_state_design/2_ColabFold.
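As an alternative to combining the two input structures interactively in PyMOL, the translation can be scripted. The sketch below is an assumption-laden example, not part of the deposited scripts: it shifts the second structure along x by a user-chosen distance using the fixed-column layout of PDB ATOM/HETATM records, assumes the two structures start out roughly superimposed, and leaves chain identifiers untouched (check what the ProteinMPNN submission script expects before using the output).

```python
def translate_pdb_lines(lines, dx):
    """Shift ATOM/HETATM records along x by dx Angstroms (columns 31-38)."""
    shifted = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            x = float(line[30:38]) + dx
            line = f"{line[:30]}{x:8.3f}{line[38:]}"
        shifted.append(line)
    return shifted


def combine_states(state_1_lines, state_2_lines, separation):
    """Return one list of PDB lines with state 2 moved `separation` Angstroms in x."""
    def coordinates(lines):
        return [l.rstrip("\n") for l in lines if l.startswith(("ATOM", "HETATM"))]

    body_1 = coordinates(state_1_lines)
    body_2 = translate_pdb_lines(coordinates(state_2_lines), separation)
    return body_1 + ["TER"] + body_2 + ["TER", "END"]


# Single-atom toy input; real use would read state_1.pdb and
# high_sequence_identity_state_2.pdb with open(...).readlines()
atom = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C"
for record in combine_states([atom], [atom], separation=100.0):
    print(record)
```

The separation should be chosen according to the guidance above (for example, more than twice the diameter of a given state); the value of 100.0 here is arbitrary.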
two_state_design_AF2_models
Contains ColabFold predictions for all dynamic designs (i.e. point mutations at position 89) and the rational mutants identified through MD analysis (I89_K68E, I89_Y64F).
Sequences (Plasmids.zip)
The bacterial expression vector and yeast display vector used. Full DNA sequences can be found in the paper supplementary information and on Addgene.
MD trajectories (MD_files.zip)
1_I89_state_1_apo - Trajectories for all n=10 runs of design I89 initialized from the ColabFold prediction (i.e. state 1) without Ca2+. Runs 1-3 are 2µs long while the rest are 1µs long.
2_I89_state_1_Ca2 - Trajectories for n=3 2µs long runs of design I89 initialized from the ColabFold prediction (i.e. state 1) with Ca2+.
3_I89_state_2_apo - Trajectories for all n=10 runs of design I89 initialized from a state 2-like conformation identified from 1_I89_state_1_apo/run3 without Ca2+.
4_I89_transition_state - Trajectories for all n=10 runs of design I89 initialized from an intermediate-like conformation identified from 3_I89_state_2_apo/run4 without Ca2+.
5_S89_apo - Trajectories for n=3 2µs long runs of design S89 initialized from the ColabFold prediction (i.e. state 2) without Ca2+.
6_S89_Ca2 - Trajectories for n=3 2µs long runs of design S89 initialized from the ColabFold prediction (i.e. state 2) with Ca2+.
MDP_files - GROMACS (see: https://www.gromacs.org/) input files for preparing the system (solvation - ions.mdp, minimization - minim.mdp, NVT equilibration - nvt.mdp, NPT equilibration - npt.mdp) and running the simulation (md_1_us.mdp). For a relevant GROMACS tutorial using the same workflow, see: http://www.mdtutorials.com/gmx/lysozyme/. The forcefield used is described in https://pmc.ncbi.nlm.nih.gov/articles/PMC6003505/ and can be downloaded from https://github.com/paulrobustelli/Force-Fields.
all_res_I85.reslist-nsims2-structs1001-bin30_bootstrap_avg_mutinf_res_sum_0diag.txt - Mutual information matrix.
Movie_S1.mpg - Movie showing a transition from state 1 to state 2 (derived from the trajectory 1_I89_state_1_apo/run3).
Other miscellaneous data files
frame2seq_scores.csv - Frame2seq scores of each amino acid at each position given the structure for state 1 or state 2 compared to the original amino acid. For more details on Frame2seq, please see the GitHub repository https://github.com/dakpinaroglu/Frame2seq.
Sharing/Access information
These data are also available as a GitHub repository (https://github.com/amyguo1997/dynamic_protein_design).
Expression plasmids for single-state and switch designs have been deposited to Addgene with accession codes 231958, 231959, 231960, 231961, 231962, 231963, 231964, 231965, and 231966. Experimentally solved structures have been deposited to the Protein Data Bank (PDB) with accession codes 9CIC, 9CID, 9CIE, 9CIF, and 9CIG. NMR data have been deposited to the Biological Magnetic Resonance Data Bank with accession codes 31182, 31183, 31184, 31185, and 31186. The original Ca2+ binding protein structure used as a starting point for design can be found in the PDB with accession code 1SMG.
