Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores
Citation
Bucinsky, Lukas et al. (2023), Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores, Dryad, Dataset, https://doi.org/10.5061/dryad.zgmsbccg7
Abstract
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).
Two data sets are provided in xyz format, containing the AutoDock Vina docking scores. These files were used as input and/or reference in machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their ability to predict docking scores. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. Both sets were downloaded from the ZINC15 database on the 10th of December 2021. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in the mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
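The exclusions described above can be tallied with a short bookkeeping sketch. All numbers are taken from the text; the dictionary layout and variable names are illustrative only.

```python
# Tally of the exclusions described in the abstract (numbers from the text).
original = {"in-vivo": 60_411, "in-vitro-only": 175_696}

excluded = {
    "in-vivo": {
        "contains Si": 4,
        "no charges in mol2": 523,
    },
    "in-vitro-only": {
        "contains Si": 12,
        "no charges in mol2": 1_666,
        "docking score > 1 kcal/mol": 4,
    },
}

# Compounds remaining after all exclusions, per data set.
remaining = {
    name: original[name] - sum(excluded[name].values())
    for name in original
}

for name, count in remaining.items():
    print(f"{name}: {count:,} compounds remain")
```

The tally gives 59,884 in-vivo and 174,014 in-vitro-only compounds.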
The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
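As a minimal sketch of how the element composition of such a file might be checked, the following reader assumes the conventional multi-record XYZ layout (an atom-count line, one comment line, then one line per atom with element symbol and Cartesian coordinates). Where the docking score sits within each record is not specified in this description, so the comment line is returned unparsed. The function names and the toy input are hypothetical and are not part of the data set.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Elements stated to occur in both data sets.
ALLOWED = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

Atom = Tuple[str, float, float, float]


def iter_xyz(lines: Iterable[str]) -> Iterator[Tuple[str, List[Atom]]]:
    """Yield (comment, atoms) for each record of a concatenated XYZ stream."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        n_atoms = int(header.split()[0])
        comment = next(it).rstrip("\n")  # kept as raw text (format assumed unknown)
        atoms: List[Atom] = []
        for _ in range(n_atoms):
            sym, x, y, z = next(it).split()[:4]
            atoms.append((sym, float(x), float(y), float(z)))
        yield comment, atoms


def element_counts(lines: Iterable[str]) -> Counter:
    """Count how often each element symbol occurs over all records."""
    counts: Counter = Counter()
    for _, atoms in iter_xyz(lines):
        counts.update(sym for sym, *_ in atoms)
    return counts


# Toy two-record input for illustration only (not taken from the data set).
sample = """3
placeholder comment 1
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
2
placeholder comment 2
Cl 0.000 0.000 0.000
H 0.000 0.000 1.275
""".splitlines()

counts = element_counts(sample)
unexpected = set(counts) - ALLOWED  # empty if only the listed elements occur
```

For the real files, an open file handle (for example for in-vivo.xyz) could be passed in place of `sample`, provided the files follow the layout assumed here.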
The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the convergence of the training process. These subsets were constructed so that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) training set of the given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
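The nested construction described above can be illustrated with a short sketch. This is not the authors' code: the function names, the seed, and the toy compound identifiers are hypothetical, and the actual split memberships should be taken from in-vivo-splits-data.csv. The sketch assumes that the validation and test parts stay fixed within a split while only the training part grows.

```python
import random
from typing import Dict, List


def split_80_5_15(ids: List[str], seed: int) -> Dict[str, List[str]]:
    """Randomly partition ids into 80% train, 5% validation, 15% test."""
    rng = random.Random(seed)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 80 // 100
    n_val = n * 5 // 100
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def nested_train_subsets(train: List[str]) -> Dict[str, List[str]]:
    """Build eight nested training subsets, each one eighth of train larger."""
    n_steps = 8
    step = len(train) // n_steps
    subsets: Dict[str, List[str]] = {}
    for k in range(1, n_steps + 1):
        # The last subset is the full training set, whatever the rounding.
        end = len(train) if k == n_steps else k * step
        # One eighth of the 80% training part is 10% of the whole set,
        # hence the labels in_vivo_10, in_vivo_20, ..., in_vivo_80.
        subsets[f"in_vivo_{10 * k}"] = train[:end]
    return subsets


# Toy identifiers for illustration only.
ids = [f"cmpd{i}" for i in range(1000)]
split = split_80_5_15(ids, seed=0)
subsets = nested_train_subsets(split["train"])
```

Taking prefixes of one shuffled training list is one simple way to guarantee that every subset contains all compounds of the previous one.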
Methods
Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].
Reference
[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023) DOI: 10.1002/qua.27110.
Funding
Science and Technology Assistance Agency, Award: APVV-20-0213
Slovak Grant Agency VEGA, Award: 1/0718/19
Ministry of Education, Science, Research and Sport of the Slovak Republic, Award: Excellent research teams scheme
Slovak Infrastructure of High Performance Computing (SIVVP) project funded by ERDF, Award: ITMS code 26210120002
"Strategic research in the field of SMART monitoring, treatment and preventive protection against coronavirus (SARS-CoV-2)" project co-financed by ERDF, Award: 313011ASS8
Science and Technology Assistance Agency, Award: APVV-20-0127
Science and Technology Assistance Agency, Award: APVV-19-0087
Science and Technology Assistance Agency, Award: APVV-17-0513
Slovak Grant Agency VEGA, Award: 1/0139/20
Slovak Grant Agency VEGA, Award: 1/0777/19
Slovak Infrastructure of High Performance Computing (SIVVP) project funded by ERDF, Award: ITMS code 26230120002