Fragment and torsion biasing algorithms for construction of small organic molecules in proteins using DOCK

Rizzo, Robert C.; Bickel, John D.; Boysan, Brock T.

Published Jul 22, 2025 on Dryad. https://doi.org/10.5061/dryad.905qfttxf

Data files

Jul 22, 2025 version files (60.52 MB total)

000_1AJV_system_files.tar.gz (23.40 MB)
001_de_novo_inputs.tar.gz (186.76 KB)
002_serial_vs_parallel.tar.gz (149.75 KB)
003_single_biasing_experiments.tar.gz (30.25 MB)
004_combination_biasing_experiments.tar.gz (6.53 MB)
README.md (5.68 KB)

Abstract

The computational construction of small organic molecules (de novo design), directly in a protein binding site, is an effective means for generating novel ligands tailored to fit the pocket environment. In this work, we present two new methods, which aim to improve de novo design outcomes using (1) biasing algorithms to prioritize selection and/or acceptance of fragments and torsions during growth, and (2) parallel‐based clustering and pruning algorithms to remove duplicate molecules as candidate fragments are added. Large‐scale testing encompassing thousands of simulations was employed to interrogate the methods in terms of multiple metrics, which include numbers of duplicate molecules generated, pairwise similarity, focused library reconstruction rates, fragment and torsion frequencies, fragment and torsion rank scores, interaction energy and drug‐likeness scores, and 3D pose comparisons. The biasing algorithms, particularly those that include fragment and torsion components simultaneously, led to molecules that more closely mimicked the distributions of fragments and torsions found in drug‐like libraries. The new parallel‐based clustering and pruning algorithms, compared with the existing serial approach, also led to larger ensembles composed of topologically unique molecules with much greater efficiency by removing redundant growth paths.

This dataset corresponds to several experiments performed in the parent manuscript of the same title.

Each compressed archive corresponds to a different section of the manuscript, as described under "Files and variables".

Description of the data and file structure

All data in this set were collected using DOCK6's De Novo Design algorithm (DOCK_DN) under various conditions to examine the effects of biasing fragment library selection and acceptance by a distribution of populations. All input files provided are for experiments using 1AJV with anchor0, but all anchors are provided in 001_de_novo_inputs.tar.gz.

Files and variables

000_1AJV_system_files.tar.gz:

All base files (protein + cognate ligand setup) needed to run a DOCK6 experiment on HIV Protease PDB 1AJV. These files (and those of the other 56 systems used in the manuscript) were generated using the standard procedure used to build the SB2012 test set. 1AJV was the individual test system used for the parallel vs. serial analysis in Section 4.1 and Figures 5 and 6 of the parent manuscript.

1AJV.lig.am1bcc.mol & 1AJV.lig.gast.mol2: The cognate crystallized ligand for PDB 1AJV with two different charge models. While most of our experiments are done with AM1BCC charges, DOCK_DN experiments use a Gasteiger model, so both are provided for appropriate comparison.

1AJV_gridbox.pdb: The grid box used to calculate the energy grid.

1AJV.rec.bmp, 1AJV.rec.nrg: Calculated energy grid files for running a DOCK experiment.

1AJV.rec.clean.mol2: The cleaned receptor file into which molecules are docked or built, used for initial energy grid calculations and eventual Cartesian scoring.

1AJV.rec.clust.close.sph: Processed spheres defining the binding pocket, utilized for orienting initial anchors.

001_de_novo_inputs.tar.gz:

Fragment library used for standard de novo design experiments with DOCK6 (DOCK_DN). Includes the 30 anchors used for initial seeds to begin growth. Along with 000_1AJV_system_files.tar.gz and an input file, these files can be used to perform a basic de novo design experiment.

anchors: 30 most populous fragments used as anchors for DOCK_DN

fraglib: The standard ZINC13M Fragment Library used for DOCK_DN
fraglib_linker.mol2, fraglib_scaffold.mol2, fraglib_sidechain.mol2: Multi-mol2 files of the fragments.
fraglib_torenv.dat: All torsion environments, used to ensure chemical sanity during growth.
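
The multi-mol2 fragment files listed above can be inspected without DOCK. The following is a minimal Python sketch, assuming only the standard Tripos convention that each record starts with an `@<TRIPOS>MOLECULE` line followed by the molecule name. The helper name `mol2_molecule_names` and the two sample records are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, List

MOLECULE_TAG = "@<TRIPOS>MOLECULE"


def mol2_molecule_names(lines: Iterable[str]) -> List[str]:
    """Return the name of every molecule record found in mol2-formatted lines."""
    names: List[str] = []
    expect_name = False
    for line in lines:
        stripped = line.strip()
        if expect_name:
            # In the Tripos format, the line after the tag holds the molecule name.
            names.append(stripped)
            expect_name = False
        elif stripped == MOLECULE_TAG:
            expect_name = True
    return names


# Hypothetical two-record sample standing in for a real fragment file.
SAMPLE = """\
@<TRIPOS>MOLECULE
frag_1
 1 0 1 0 0
SMALL
USER_CHARGES
@<TRIPOS>MOLECULE
frag_2
 1 0 1 0 0
SMALL
USER_CHARGES
"""

print(mol2_molecule_names(SAMPLE.splitlines()))
```

For a real file, an open file handle can be passed in place of the sample lines, for example `mol2_molecule_names(open("fraglib_scaffold.mol2"))`. The location of that file inside the extracted archive is an assumption.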

002_serial_vs_parallel.tar.gz

Input files for the serial vs. parallel section of the manuscript (Sections 4.1 and 4.2, Figures 5 and 6, and Table 2), and results for the serial portion of the experiments. The parallel portion of the experiments is included as the R (Random) data sets in 003_single_biasing_experiments.tar.gz.

*dock_input.in: DOCK input files for Serial or Parallel experiments.

*duplicate_info.dat: Results for duplicate analysis of all tested anchors for 1AJV in Serial and Parallel. Used to plot Figure 5B.

1AJV_serial_results: The output fragment library and number of generated molecules for the serial experiment of 1AJV.

003_single_biasing_experiments.tar.gz

Input files and output results for a "standard" DOCK_DN experiment (Random, R) as well as for the individual biasing methods outlined in the manuscript (FAF, FSF, TAF).

*.in: Input file used for DOCK_DN experiments for each biasing type.

FAF, FSF, R, TAF directories: Fragment library values, number of molecules constructed, and Grid Scores for individual molecules. These were used in the generation of Figures 7 and 9 of the parent manuscript.
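
The per-molecule Grid Scores described above can be summarized with a few lines of Python. The sketch below assumes the scores are stored as DOCK-style mol2 header comments of the form `##########  Grid_Score:  <value>`; the actual layout of the result files in this archive may differ, so the pattern should be treated as an assumption to check against the files. The function name `grid_scores` and the sample values are hypothetical.

```python
import re
from statistics import mean
from typing import Iterable, List

# Assumed header layout: leading '#' characters, then "Grid_Score:", then a number.
SCORE_LINE = re.compile(r"^#+\s*Grid_Score:\s*(-?\d+(?:\.\d+)?)")


def grid_scores(lines: Iterable[str]) -> List[float]:
    """Collect every Grid_Score value found in header comment lines."""
    scores: List[float] = []
    for line in lines:
        match = SCORE_LINE.match(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


# Hypothetical sample; the names and values are illustrative, not dataset results.
SAMPLE = """\
##########                                Name:                mol_1
##########                          Grid_Score:           -45.500000
@<TRIPOS>MOLECULE
mol_1
##########                                Name:                mol_2
##########                          Grid_Score:           -30.250000
@<TRIPOS>MOLECULE
mol_2
"""

scores = grid_scores(SAMPLE.splitlines())
print(len(scores), mean(scores))
```

The same function could be applied to each of the FAF, FSF, R, and TAF result files in turn to compare score distributions between biasing methods.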

004_combination_biasing_experiments.tar.gz

Input files and output results for the combination biasing methods outlined in the manuscript (FAF_TAF, FSF_FAF, FSF_FAF_TAF, FSF_TAF).

*.in: Input file used for DOCK_DN experiments for each biasing combination.

FAF_TAF, FSF_FAF, FSF_FAF_TAF, FSF_TAF directories: Fragment library values, number of molecules constructed, and Grid Scores for individual molecules. These were used in the generation of Figures 8 and 9 of the parent manuscript.

Sharing/Access information

Data for the remaining 56 systems are available upon request from the authors, though the working example here is sufficient for replication of the work.

Code/Software

All molecular construction performed for this data set was done using a development branch of DOCK6.12, with code modifications to be released as DOCK6.13. Analyses performed are described in the parent manuscript, and were done utilizing standard Python packages for scientific computing (matplotlib, numpy, pandas).

For accessing files:

*.pdb, *.mol2, *.sph, *.dat, *.num, and *.in files are plain text and can be read and edited with standard text editors such as Notepad, Sublime Text, and Vim. They can also be read with the file-reading facilities of Python and other languages.
Structures found in *.pdb, *.mol2, and *.sph files can also be viewed in most molecular visualization software (such as Chimera, ChimeraX, or PyMOL).

*.nrg and *.bmp files are generated by Grid and readable by DOCK6 (free academic license) for use in experiments and by Chimera for visualization.
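
As a starting point for working with the archives programmatically, the following Python sketch lists the regular files inside a `.tar.gz` archive grouped by extension, using only the standard library. The function name `members_by_extension` is hypothetical, and the example assumes the archive has been downloaded to the current working directory.

```python
import tarfile
from collections import defaultdict
from pathlib import Path, PurePosixPath
from typing import Dict, List


def members_by_extension(archive_path: str) -> Dict[str, List[str]]:
    """Group the regular files in a gzip-compressed tar archive by file extension."""
    groups: Dict[str, List[str]] = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                # Files without an extension are grouped under "(none)".
                suffix = PurePosixPath(member.name).suffix or "(none)"
                groups[suffix].append(member.name)
    return dict(groups)


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("001_de_novo_inputs.tar.gz")
    if archive.exists():
        for suffix, names in sorted(members_by_extension(str(archive)).items()):
            print(f"{suffix}: {len(names)} file(s)")
```

Listing members before extracting makes it easy to see which files are plain text and which (such as *.nrg and *.bmp) are meant to be read by DOCK6 or Chimera.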