A Discard-and-Restart MD algorithm for the sampling of protein intermediate states
Data files
Mar 07, 2025 version files (5.83 GB)
-D_R_repository.7z (5.83 GB)
-README.md (3.28 KB)
Abstract
We introduce a Discard‑and‑Restart molecular dynamics (MD) algorithm tailored for the sampling of realistic protein intermediate states. It aids computational structure‑based drug discovery by reducing the simulation time needed to compute a "quick sketch" of folding pathways by up to 2000x. The algorithm iteratively performs short MD simulations and measures their proximity to a target state via a collective variable (CV) loss, which can be defined flexibly, either locally or globally. If the loss shows that the trajectory proceeds toward the target, the MD simulation continues. Otherwise, the trajectory is discarded and a new MD simulation is started, with new initial velocities randomly drawn from a Maxwell‑Boltzmann distribution. The discard‑and‑restart algorithm demonstrates efficacy and atomistic accuracy in capturing the folding pathways in several contexts: (1) fast‑folding small protein domains; (2) the folding intermediate of the prion protein PrP; and (3) the spontaneous partial unfolding of α‑Tubulin, a crucial event for microtubule severing. During each iteration of the algorithm, we can perform AI-based analysis of the transitory conformations to find potential binding pockets, which could represent druggable sites. Overall, our algorithm enables systematic and computationally efficient exploration of conformational landscapes, enhancing the design of ligands targeting dynamic protein states.
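The accept-or-discard loop described in the abstract can be illustrated with a toy example. The sketch below is not the authors' run_folding.py: it replaces the MD engine with a one-dimensional random "short simulation", uses the distance to a target position as the CV loss, and draws each new velocity from a Gaussian (the one-dimensional form of a Maxwell-Boltzmann velocity component). The function name and all parameters are hypothetical.

```python
import numpy as np

def discard_and_restart(x0, target, n_iterations=200, n_steps=20,
                        dt=0.05, temperature=1.0, seed=0):
    """Toy 1D discard-and-restart loop (illustration only, not real MD)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    accepted = [x]      # configurations kept so far
    n_discarded = 0     # short runs that moved away from the target
    for _ in range(n_iterations):
        # New initial velocity: Gaussian with variance ~ temperature
        # (unit mass and unit Boltzmann constant are assumed).
        v = rng.normal(0.0, np.sqrt(temperature))
        # Stand-in for a short MD run: noisy ballistic motion.
        x_new = x
        for _ in range(n_steps):
            v += rng.normal(0.0, 0.1)
            x_new += v * dt
        # CV loss: distance to the target state.
        if abs(x_new - target) < abs(x - target):
            x = x_new               # progress: continue from the new state
            accepted.append(x)
        else:
            n_discarded += 1        # no progress: discard and restart from x
    return np.array(accepted), n_discarded
```

Calling `discard_and_restart(x0=5.0, target=0.0)` returns the accepted positions, whose loss decreases monotonically by construction, together with the number of discarded attempts.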
This README file was generated by Alan Ianeselli on Feb 27th, 2025
DOI
10.5061/dryad.cc2fqz6h7
Date of Data Collection
2023 - 2024
Contributors
Alan Ianeselli, Yale University, New Haven CT, USA
Joe Howard, Yale University, New Haven CT, USA
Mark B Gerstein, Yale University, New Haven CT, USA
Summary
This repository is integral to our research paper, and contains the data that support the findings of our study in order to ensure transparency and reproducibility. It includes the frames and trajectories generated using the discard-and-restart (D&R) algorithm as well as the Python script and model files.
Description of the data and file structure
The data for each protein are contained in the respective folder (beta_hairpin, Fip35, Prion, TrpCage_explicit, TrpCage_implicit, Tubulin, Villin_explicit, Villin_implicit). Pca_model.pkl is the PCA model used to calculate the CV for folding; run_folding.py is the Python script for the D&R algorithm; utilities.py contains the functions used by the D&R algorithm. The following libraries are needed to run the Python script: numpy, joblib, mdtraj, sklearn (scikit-learn), matplotlib, termcolor. A local installation of GROMACS is also required (>2021 for explicit solvent, 4.6.5 for implicit solvent).
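Before running the script, it can help to confirm that the listed libraries are importable and that a GROMACS executable is on the PATH. The following is a minimal sketch using only the Python standard library; the helper name check_environment is hypothetical, and the executable name "gmx" is an assumption (it applies to GROMACS 5 and later, whereas 4.6.5 ships separate tools such as grompp and mdrun, and MPI builds may use a different name).

```python
import importlib.util
import shutil

# Import names of the libraries listed above
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["numpy", "joblib", "mdtraj", "sklearn",
                    "matplotlib", "termcolor"]

def check_environment(modules=REQUIRED_MODULES, gromacs_binary="gmx"):
    """Report missing Python modules and the location of a GROMACS binary."""
    missing = [name for name in modules
               if importlib.util.find_spec(name) is None]
    # shutil.which returns None when the executable is not on the PATH.
    gromacs_path = shutil.which(gromacs_binary)
    return {"missing_modules": missing, "gromacs_path": gromacs_path}
```

For the implicit-solvent setup, the same check could be run with `gromacs_binary="mdrun"`, assuming that is how the 4.6.5 tools are named in the local installation.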
In every protein folder, information about the native state is stored in the subfolder "NATIVE_FILES", which contains:
-the energy-minimized structure (em.pdb)
-the distogram .pkl file output of AlphaFold 2
-the atom indexes used to calculate the CA dihedrals (c_index_array.ndx)
-the subfolder "data", which contains the RMSD of the native structure (rmsd.xvg) and the CA-CA distance matrix (mindist.xvg) used to calculate the contact maps
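The .xvg files listed above are plain text in which header lines start with "#" or "@". A minimal sketch for loading the numeric part with numpy is shown below; the helper name read_xvg is hypothetical, and the two-column layout (time, RMSD) of rmsd.xvg is an assumption that should be checked against the file header.

```python
import io
import numpy as np

def read_xvg(source):
    """Load the numeric columns of a GROMACS .xvg file.

    `source` may be a file path or an open text file object.
    Lines starting with '#' (comments) or '@' (plot directives) are skipped.
    """
    return np.loadtxt(source, comments=["#", "@"], ndmin=2)

# Example with an in-memory stand-in for rmsd.xvg (made-up values).
sample = io.StringIO('# comment line\n@ title "RMSD"\n0.0 0.10\n1.0 0.12\n')
data = read_xvg(sample)
time, rmsd = data[:, 0], data[:, 1]
```

With the extracted archive, the call would look like `read_xvg("NATIVE_FILES/data/rmsd.xvg")`, run from inside one of the protein folders.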
The individual trajectories are contained in folders named "run#". Each "run#" folder contains:
-the structure's topology (topol.top)
-the molecular dynamics protocol (md.mdp)
-the CV values of the individual short simulations (figure#.txt, figure#.png)
-OUTPUT_RC.txt, which contains the values of the collective variables for the selected configurations (in order: iteration #, PCA comp 0, PCA comp 1, # of attempts, CA_dih_difference, z, AlphaFold CV, RMSD)
-subfolder full_traj/ which contains the frames of the selected configurations together with the concatenation (concatenated.pdb)
-subfolder step_#, which contains the short trajectory; its RMSD over time and CA-CA distances over time are in the data/ subfolder
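The collective variables in OUTPUT_RC.txt can be loaded into named arrays for plotting or filtering. The sketch below assumes that the file is whitespace-delimited with no header and that its columns follow the order given above; the column labels and the helper name read_output_rc are hypothetical and were chosen here for illustration.

```python
import io
import numpy as np

# Hypothetical labels, in the order listed for OUTPUT_RC.txt.
COLUMNS = ["iteration", "pca_0", "pca_1", "n_attempts",
           "ca_dih_difference", "z", "alphafold_cv", "rmsd"]

def read_output_rc(source):
    """Return a dict mapping each column label to a 1D numpy array."""
    table = np.loadtxt(source, ndmin=2)
    if table.shape[1] != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, found {table.shape[1]}")
    return {name: table[:, i] for i, name in enumerate(COLUMNS)}

# Example with made-up values standing in for a real file.
sample = io.StringIO("0 0.1 0.2 3 1.5 0.4 0.9 0.8\n"
                     "1 0.2 0.1 1 1.2 0.5 0.7 0.6\n")
cv = read_output_rc(sample)
total_attempts = int(cv["n_attempts"].sum())
```

The explicit column-count check is there so that a file with a different layout fails loudly instead of silently mislabeling the CVs.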
Data formats
*.pdb , coordinates and atom types of the structures
*.xtc , compressed coordinate file
*.txt , raw data files
*.png , image files of the plots
*.py , Python scripts
*.pkl , Python pickle format
*.top , GROMACS topology file
*.mdp , GROMACS protocol file
*.xvg , data produced by GROMACS
*.ndx , index file produced by GROMACS
Usage, Compatibility, and Accessibility
We provide the data in their original formats. Other researchers are encouraged to use these data to replicate the study, to perform further analyses or research, and to verify the reliability of our results. Researchers are welcome to contact the authors for further assistance with data conversion, analysis, and explanation.
