Data from: Towards high-resolution modeling of small molecule - ion channel interactions
Data files
Abstract
Ion channels are critical drug targets for a range of pathologies, such as epilepsy, pain, itch, autoimmunity, and cardiac arrhythmias. To develop effective and safe therapeutics, it is necessary to design small molecules with high potency and selectivity for specific ion channel subtypes. There has been increasing implementation of structure-guided drug design for the development of small molecules targeting ion channels. We evaluated the performance of two Rosetta ligand docking methods, RosettaLigand and GALigandDock, on structures of known ligand - cation channel complexes. Ligands were docked to voltage-gated sodium (NaV), voltage-gated calcium (CaV), and transient receptor potential vanilloid (TRPV) channel families. For each test case, RosettaLigand and GALigandDock methods were able to frequently sample a ligand binding pose within 1-2 Å root mean square deviation (RMSD) relative to the experimental ligand coordinates. However, RosettaLigand and GALigandDock scoring functions cannot consistently identify experimental ligand coordinates as top-scoring models. Our study reveals that the proper scoring criteria for RosettaLigand and GALigandDock modeling of ligand - ion channel complexes should be assessed on a case-by-case basis using sufficient ligand and receptor interface sampling, knowledge about state specific interactions of the ion channel and inherent receptor site flexibility that could influence ligand binding.
README: Data from: Towards high-resolution modeling of small molecule - ion channel interactions
This README file was generated on 2024-05-02 by Brandon Harris and Vladimir Yarov-Yarovoy.
GENERAL INFORMATION
1. Title of Dataset: Towards high-resolution modeling of small molecule – ion channel interactions
2. Author Information
A. Principal Investigator Contact Information
Name: Vladimir Yarov-Yarovoy
Institution: University of California, Davis
Address: 1275 Med Science Drive, Davis, CA 95616-5271 USA
Email: yarovoy@ucdavis.edu
3. Date of data collection (single date, range, approximate date): 2023-2024
4. Geographic location of data collection: 1275 Med Science Drive, Davis, CA 95616-5271 USA
5. Information about funding sources that supported the collection of the data:
National Institutes of Health, Award: R61NS127285
National Institutes of Health, Award: R01HL128537
National Institutes of Health, Award: R01HL159304
National Institutes of Health, Award: R01GM132110
National Institutes of Health, Award: R56NS9706
National Institutes of Health, Award: R01NS128180
National Institutes of Health, Award: F31NS124337
National Institutes of Health, Award: T32GM007377
SHARING/ACCESS INFORMATION
1. Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
2. Links to publications that cite or use the data:
Brandon J Harris, Phuong T. Nguyen, Guangfeng Zhou, Heike Wulff, Frank DiMaio, Vladimir Yarov-Yarovoy (2024). Towards high-resolution modeling of small molecule – ion channel interactions. bioRxiv 2024.04.02.587818; doi: https://doi.org/10.1101/2024.04.02.587818
3. Links to other publicly accessible locations of the data: None
4. Links/relationships to ancillary data sets: None
5. Was data derived from another source? No
A. If yes, list source(s): NA
6. Recommended citation for this dataset:
Brandon J Harris, Phuong T. Nguyen, Guangfeng Zhou, Heike Wulff, Frank DiMaio, Vladimir Yarov-Yarovoy (2024). Towards high-resolution modeling of small molecule – ion channel interactions. bioRxiv 2024.04.02.587818; https://doi.org/10.5061/dryad.m63xsj49v
DATA & FILE OVERVIEW
1. File List:
A) 5ek0.zip
B) 6j8g.zip
C) 6j8i.zip
D) 6jp5.zip
E) 6jp8.zip
F) 6jpaV1.zip
G) 6jpaV2.zip
H) 6jpb.zip
I) 6kzp.zip
J) 6u88.zip
K) 6uz0.zip
2. Relationship between files, if important: None
3. Additional related data collected that was not included in the current data package: The entire pose population for each test case ran was not provided, as that would result in approximately 6.6 million poses. Instead, for each test case, the input files for RosettaLigand and GALigandDock, and all score file data are provided. All models used to generate the main and supplemental figures are provided. Together, these are the minimal components needed for replication of the data presented in the study.
4. Are there multiple versions of the dataset? No
A. If yes, name of file(s) that was updated: NA
i. Why was the file updated? NA
ii. When was the file updated? NA
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 5ek0.zip
1. Data for PDB ID: 5ek0 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6j8g.zip
1. Data for PDB ID: 6j8g contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6j8i.zip
1. Data for PDB ID: 6j8i contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6jp5.zip
1. Data for PDB ID: 6jp5 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6jp8.zip
1. Data for PDB ID: 6jp8 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6jpaV1.zip
1. Data for PDB ID: 6jpaV1 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6jpaV2.zip
1. Data for PDB ID: 6jpaV2 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6jpb.zip
1. Data for PDB ID: 6jpb contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6kzp.zip
1. Data for PDB ID: 6kzp contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6u88.zip
1. Data for PDB ID: 6u88 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
DATA-SPECIFIC INFORMATION FOR: 6uz0.zip
1. Data for PDB ID: 6uz0 contains RosettaLigand inputs, GALigandDock inputs, text-readable score files from all test conditions, and models of the top ten RMSD and interface score poses from each test condition (in Protein Data Bank .pdb format).:
2. Each .pdb file should be opened using UCSF ChimeraX software: https://www.cgl.ucsf.edu/chimerax/
3. Missing data codes: None
4. Specialized formats or other abbreviations used: A .slurm file is a Slurm Workload Manager-executable file. A .sc file is a score file in text-readable format.
#########################################################################
Methods
Ligand generation
Ligands were extracted as Structure Data Files (.sdf) from PubChem (Kim et al, 2023). Using the Avogadro software (Hanwell et al, 2012), each ligand structure underwent bond-correction, protonation at pH 7.4, and energy minimization using the Merck Molecular Force Field (Halgren, 1996a; b; c; Halgren and Nachbar, 1999; Halgren, 1999a; b). The resulting models were saved as Tripos Mol2 (.mol2) files. The protonation and bond order of saxitoxin and tetrodotoxin was matched to experimentally reported work (Hinman and DuBois, 2003; Thomas-Tran and Du Bois, 2016). Both experimentally resolved structures of verapamil docked to rabbit CaV1.1 were tested (Zhao et al, 2019).
Next, using Amber Tools’ Antechamber protocol, the partial atomic charge, atom, and bond type assignments for each ligand were AM1-BCC corrected (Case et al, 2021; Salomon-Ferrer, Case, and Walker, 2012; Appendix S1). AM1-BCC correction is commonly used in Rosetta-based ligand docking protocols (Smith and Meiler, 2020; Park et al, 2021) and has demonstrated a similar performance correlation with other RosettaLigand input preparation protocols (Smith and Meiler, 2020).
The AM1-BCC corrected ligands were used for subsequent steps specific to each method. For RosettaLigand, an in-house script (Appendix S2) using the OpenEye Omega toolkit (Hawkins et al, 2010) was used to generate the conformer library, followed by Rosetta to generate the associated ligand parameters file. For GALigandDock, the input conformer was generated using the RosettaGenFF crystal structure prediction protocol (Park et al, 2021; Appendix S3), taking the lowest energy packing arrangement as input.
Ion channel preparation
Ion channel structures were downloaded from the Protein Data Bank (Berman et al, 2000). Prior to RosettaLigand docking, structures were relaxed with backbone constraints using the RosettaRelax protocol (Nivón, Moretti, and Baker, 2013). This protocol allows the repacking of protein sidechains and minimization of the structure into the Rosetta score function for comparison between poses. The lowest energy pose from 100 relaxed poses was used for docking.
RosettaLigand docking
RosettaLigand docking was performed using previously described RosettaScripts protocols (Davis and Baker, 2009; DeLuca, Khar, and Meiler, 2015; Appendix S4-S7). Briefly, the initial placement of all ligands into their respective ion channels was performed by superimposing the initial ligand poses onto the experimental ligand coordinates.
RosettaLigand uses grid-based sampling to score the ion channel – ligand interface, with a low-resolution and high-resolution sampling phase. For an unbiased sampling of the local space, an initial transformation during the low-resolution sampling phase is performed on the initial ligand position using Rosetta’s Transform mover, The low-resolution Transform mover is a grid-based Monte Carlo simulation, where the ligand is translated up to 0.2 Å and rotated up to 20 degrees per iteration, for a total of 500 iterations. A box size of 7-8 Å restrains the ligand center, while all ligand atoms are constrained to a grid with a user defined width to prevent ligand scoring outside the target site. This scoring grid width was calculated uniquely for each ligand to ensure all ligand conformers would not automatically fail the Transform step. As used previously, the scoring grid width was calculated as the maximum conformer atom-atom distance plus twice the box size value used in the Transform mover (Moretti, Bender, Allison, and Meiler, 2016; Appendix S4). The pose with the lowest score is then used as the starting pose for high-resolution docking. For high-resolution docking using Rosetta’s HighResDocker mover, six docking cycles of rotamer sampling are performed with repacking of sidechains every third iteration. Lastly, the ion channel – ligand complex is minimized using Rosetta’s FinalMinimizer mover, and the interface scores are reported using the InterfaceScoreCalculator.
For all RosettaLigand docking runs, the ligand_soft_rep and hard_rep scoring functions were reweighted based on previous work assessing Rosetta score functions with the Comparative Assessment of Score Function 2016 (CASF-2016) dataset (Su et al., 2019; Smith and Meiler, 2020). Specifically, reweights of Coulombic electrostatic potential (fa_elec), Lennard-Jones repulsive energy between atoms in the same residue (fa_intra_rep), sidechain-backbone hydrogen bond energy (hbond_bb_sc), sidechain-sidechain hydrogen bond energy (hbond_sc), and Ramachandran preferences (rama) were applied for the soft-repulsive and hard-repulsive docking phases (Appendix S6).
For each docking run, either 20,000 poses or 100,000 poses were generated to assess if there are statistically significant differences in the lowest recorded pose RMSD (RMSDMin) to experimental ligand coordinates. In RosettaLigand, a ligand interface is defined either by a representative ligand atom (a “neighbor atom”, defined as the geometric center of mass by default), or all ligand atoms, relative to all ion channel Cβ atoms within a specified radius from the ligand (commonly default to 6 or 7 Å). For RosettaLigand, two mutually exclusive ligand area interface modes are available for scoring the pose interface: the ligand neighbor atom cutoffs mode (add_nbr_radius=”True”), and the all ligand atom cutoffs mode (all_atom_mode=”True”). In previous work, both modes were used (Moretti, Bender, Allison, and Meiler, 2016; Smith and Meiler, 2020), thus, both modes were evaluated for any differences in performance. Four individual RosettaLigand docking sets were performed for each PDB structure, by combining different pose totals and ligand area interface modes, for performance comparison.
GALigandDock docking
As described in the original study, docking was performed using the RosettaScripts’ GALigandDock mover in DockFlex mode (Park et al, 2021; Appendix S8-S10). Replicate runs of GALigandDock were performed in parallel for each structure evaluated. Each run consisted of 20 generations with a pool of 100 poses, where each generation updates the pool by total energy. By default, GALigandDock outputs the top 20 structures from the final generation, however, for this study, the entire pool of poses were used. For each docking run, 20,000 poses (1,000 runs) or 100,000 poses (5,000 runs) were output with a padding value of 2 Å, 4 Å, or 7 Å to test for statistical differences in the RMSDMin. In total, six individual GALigandDock docking sets were performed for each PDB structure, by combining different pose totals and padding sizes, for performance comparison.
PNear calculation
PNear is a quantitative metric to evaluate an energy funnel’s quality from a population of poses, with the lowest-scoring pose as the converged minima, or reference state. PNear was calculated as previously defined (Bhardwaj, Mulligan, Bahl, et al., 2016), and applied as previously described for ligand docking (Smith and Meiler, 2020), with the reference state being the experimental ligand coordinates. PNear ranges from 0 (protein will not converge to reference state) to 1 (protein will always converge to the reference state).
The numerator is the Boltzmann probability of an individual pose being near the reference state, governed by the “likeness” parameter (λ), and the thermal energy, the product of the Boltzmann constant and absolute temperature (kBT). The denominator is the partition function of a canonical ensemble. For small molecule docking, the root mean square deviation (RMSD) of the pose’s ligand coordinates relative to experimental ligand coordinates when the protein pose is superimposed to the entire protein structure, or the receptor site is used. The energy scoring metric (Ei) is the interface energy: the energy solely composed of protein-ligand interactions.
A previous Rosetta small-molecule docking study (Park et al., 2021) categorized reference-like poses at 1 Å or 2 Å RMSD without calculating PNear, while another study evaluating RosettaLigand performance calculated PNear using λ =1.5 Å, and kBT =0.62 (Meiler and Smith, 2020). Therefore, we calculated PNear with reference-like poses defined by λ=2.0 Å and kBT =0.62 Rosetta energy units (REU), however we calculated PNear using all previously reported values for the parameters λ and kBT (Tables S4.1-4.10). A specific PNear cutoff indicative for drug discovery pipelines has not been established, hence we will refer to PNear ≥ 0.5 as a ‘first pass’ cutoff for this study when evaluating energy funnel convergence to the experimental ligand coordinates.
Statistical tests
Tests for normality, heteroscedasticity, and Pearson’s correlation between covariates were performed in Python using NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and Pingouin (Vallat, 2018) with a significance level (α) of 0.05. Population data consisted of the RMSDMin for each docking set. Not all RMSDMin data for each docking set fit a normal distribution. Since RMSDMin data should skew towards zero Å, a lognormal base 10 transformation was applied to all RMSDMin population data when comparing across docking sets. Shapiro Wilk tests for normality and Levene heteroscedasticity tests were performed to ensure transformed data were normal and of equal variance, respectively.
Covariates for both methods were the number of rotatable bonds of the ligand, the number of ligand heavy atoms, and the resolution of the structure. Statistical tests for RosettaLigand also included the number of conformers generated, transformed with a lognormal base 10, as a covariate. The ligand molecular weight was initially included as a covariate but was discarded after identifying strong Pearson’s correlation with the number of ligand heavy atoms (Table S3, Figure S1.).
To assess RMSDMin across all docking sets for a given method, a repeated measures factorial analysis of variance (ANOVA) with covariates were performed using IBM SPSS (version 29). For RosettaLigand, the two factors were sample size and ligand area interface mode, while for GALigandDock the two factors were sample size and padding value. Mauchly’s test for sphericity was performed for levels of padding (p = 0.63).