Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores

Cite this dataset

Bucinsky, Lukas et al. (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. Dryad. https://doi.org/10.5061/dryad.zgmsbccg7

Abstract

Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).

Two data sets containing the AutoDock Vina docking scores are provided in the xyz format. These files were used as input and/or reference for machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their ability to predict docking scores. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. Both sets were downloaded from the ZINC15 database on 10 December 2021. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in the mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
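As a quick consistency check, the exclusions listed above can be tallied against the original set sizes. The short Python sketch below uses only the counts stated in this description; the dictionary keys are illustrative labels, not names used in the data files.

```python
# Original set sizes as downloaded from ZINC15 (numbers from the description).
initial = {"in-vivo": 60_411, "in-vitro-only": 175_696}

# Compounds excluded from each set, by reason (labels are illustrative).
excluded = {
    "in-vivo": {"contains Si": 4, "no charges in mol2": 523},
    "in-vitro-only": {
        "contains Si": 12,
        "no charges in mol2": 1_666,
        "docking score > 1 kcal/mol": 4,
    },
}

# Remaining compounds = original size minus all exclusions.
remaining = {
    name: initial[name] - sum(excluded[name].values()) for name in initial
}

for name, count in remaining.items():
    print(f"{name}: {count:,} compounds remain")
```

The resulting totals agree with the sizes of the provided in-vivo.xyz and in-vitro-only.xyz files.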

The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
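The following is a minimal sketch of how a multi-molecule xyz file could be read in Python, assuming the conventional xyz layout (an atom-count line, one comment line, then one line per atom with element symbol and Cartesian coordinates). How the docking score is encoded in each record is not specified here, so the comment line is returned verbatim rather than parsed. The function name `iter_xyz_frames` and the toy record are hypothetical and are not taken from the data set.

```python
from typing import Iterable, Iterator, List, Tuple

Atom = Tuple[str, float, float, float]

# Elements reported to occur in both provided sets.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def iter_xyz_frames(lines: Iterable[str]) -> Iterator[Tuple[str, List[Atom]]]:
    """Yield (comment, atoms) for each record of a concatenated xyz file.

    Assumes the conventional xyz layout; the comment line is not interpreted.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        n_atoms = int(header)
        comment = next(it).rstrip("\n")
        atoms: List[Atom] = []
        for _ in range(n_atoms):
            fields = next(it).split()
            atoms.append(
                (fields[0], float(fields[1]), float(fields[2]), float(fields[3]))
            )
        yield comment, atoms


# Toy record for illustration only (NOT from the data set).
demo_lines = """3
placeholder comment line
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467
""".splitlines()

frames = list(iter_xyz_frames(demo_lines))
elements_seen = {atom[0] for _, atoms in frames for atom in atoms}
unexpected = elements_seen - ALLOWED_ELEMENTS
print(len(frames), sorted(elements_seen), sorted(unexpected))
```

To apply the same reader to a provided file, an open file handle (for example for in-vivo.xyz) can be passed in place of `demo_lines`, since the function only requires an iterable of lines.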

The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets were created in each of the in-vivo 80-5-15 splits to monitor the convergence of the training process. These subsets were constructed such that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of the given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
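The nested-subset construction described above can be illustrated with a short Python sketch. This is not the procedure used to generate in-vivo-splits-data.csv (the random seeds and implementation are not given here); to reproduce the study, the compositions in that file should be used directly. The function names, the seed values, and the `cmpd_*` identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_80_5_15(
    ids: Sequence[str], seed: int
) -> Tuple[List[str], List[str], List[str]]:
    """Randomly split identifiers into 80% train, 5% validation, 15% test."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 80 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def nested_train_subsets(train: Sequence[str], n_steps: int = 8) -> List[List[str]]:
    """Return n_steps nested subsets, each larger by 1/n_steps of the train set.

    Subset k is a prefix of the (already shuffled) train list, so it contains
    every compound of subset k-1; the last subset is the full train set.
    """
    step = len(train) // n_steps
    subsets = []
    for k in range(1, n_steps + 1):
        end = len(train) if k == n_steps else k * step
        subsets.append(list(train[:end]))
    return subsets


# Hypothetical identifiers, for illustration only.
ids = [f"cmpd_{i}" for i in range(100)]
train, val, test = split_80_5_15(ids, seed=1)
subsets = nested_train_subsets(train)

# Validation and test sets stay fixed; only the train portion grows.
for k, subset in enumerate(subsets, start=1):
    print(f"in_vivo_{10 * k}: train={len(subset)}, val={len(val)}, test={len(test)}")
```

Taking prefixes of a single shuffled list is one simple way to guarantee that each subset contains the previous one while the validation and test sets remain unchanged across subsets.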

Methods

Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].

Reference
[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.

Funding

Science and Technology Assistance Agency, Award: APVV-20-0213

Slovak Grant Agency VEGA, Award: 1/0718/19

Ministry of Education, Science, Research and Sport of the Slovak Republic, Award: Excellent research teams scheme

Slovak Infrastructure of High Performance Computing (SIVVP) project funded by ERDF, Award: ITMS code 26210120002

"Strategic research in the field of SMART monitoring, treatment and preventive protection against coronavirus (SARS-CoV-2)" project co-financed by ERDF, Award: 313011ASS8

Science and Technology Assistance Agency, Award: APVV-20-0127

Science and Technology Assistance Agency, Award: APVV-19-0087

Science and Technology Assistance Agency, Award: APVV-17-0513

Slovak Grant Agency VEGA, Award: 1/0139/20

Slovak Grant Agency VEGA, Award: 1/0777/19

Slovak Infrastructure of High Performance Computing (SIVVP) project funded by ERDF, Award: ITMS code 26230120002