Curated in vivo subset of ZINC15: Correction of formal charge and molecular structure in mol2 format
Data files
May 14, 2026 version files (20.05 MB)
• download.sh (5.66 KB)
• In_vivo.csv (18.53 MB)
• In_vivo.zip (1.51 MB)
• README.md (4.04 KB)
May 19, 2026 version files (20.05 MB)
• download.sh (5.66 KB)
• In_vivo.csv (18.53 MB)
• In_vivo.zip (1.51 MB)
• README.md (4.04 KB)
Abstract
The database of molecules, referred to as in vivo, was downloaded on 10/12/2021 from ZINC15 (https://zinc15.docking.org/). The database contained a total of 60,411 molecules, all available in mol2 format. Errors in structure (number of atoms) and formal charge were found when the mol2 files of individual molecules were compared with their InChI codes given on the ZINC15 website. The reference number of atoms was obtained from the InChI code as the number of atoms in the chemical formula (main layer), adjusted by the number of protons added or removed according to the protonation sublayer (/p). This reference number was compared with the number of atoms in the mol2 file. The reference formal charge was obtained as the sum of the charges given in the charge sublayer (/q) and the protonation sublayer (/p) of the InChI code. It was compared with the formal charge obtained by summing the partial charges in the mol2 file, with 0.1 e chosen as the acceptable deviation. In total, 1,115 corrected molecules (curated with Open Babel 3.1.1) are provided in mol2 format. In addition, a bash script for downloading the in vivo database is provided, along with detailed instructions for reproducing the described workflow.
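The reference values described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and it assumes a standard InChI (prefix InChI=1S) in which the /q and /p sublayers each occur at most once.

```python
import re

def count_formula_atoms(formula):
    """Count all atoms (hydrogens included) in an InChI formula layer."""
    total = 0
    for component in formula.split("."):      # components are dot-separated
        prefix = re.match(r"\d+", component)  # optional multiplier, e.g. 2ClH
        multiplier = int(prefix.group()) if prefix else 1
        body = component[prefix.end():] if prefix else component
        atoms = sum(int(n) if n else 1
                    for _, n in re.findall(r"([A-Z][a-z]?)(\d*)", body))
        total += multiplier * atoms
    return total

def sum_signed_values(layer_body):
    """Sum the body of a /q or /p sublayer, e.g. '+1', ';+1' or '2*-1;+2'."""
    total = 0
    for multiplier, value in re.findall(r"(?:(\d+)\*)?([+-]\d+)", layer_body):
        total += (int(multiplier) if multiplier else 1) * int(value)
    return total

def inchi_reference(inchi):
    """Return (reference number of atoms, reference formal charge)."""
    layers = inchi.split("/")[1:]             # drop the "InChI=1S" prefix
    # The formula layer starts with an element symbol or a multiplier;
    # all other layers start with a lowercase identifier.
    has_formula = bool(layers) and re.match(r"[A-Z0-9]", layers[0])
    formula = layers[0] if has_formula else ""
    protons = sum(sum_signed_values(l[1:]) for l in layers if l.startswith("p"))
    charge = sum(sum_signed_values(l[1:]) for l in layers if l.startswith("q"))
    return count_formula_atoms(formula) + protons, charge + protons

# Sodium acetate: C2H4O2 + Na = 9 atoms, minus one proton (/p-1) = 8 atoms;
# charge +1 (/q) - 1 (/p) = 0.
print(inchi_reference("InChI=1S/C2H4O2.Na/c1-2(3)4;/h1H3,(H,3,4);/q;+1/p-1"))  # (8, 0)
# Ammonium: H3N = 4 atoms, plus one proton (/p+1) = 5 atoms; charge +1.
print(inchi_reference("InChI=1S/H3N/h1H3/p+1"))  # (5, 1)
```

Each /p entry changes both the atom count and the charge by the same amount, which is why the protonation sublayer enters both reference values.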
File In_vivo.csv (semicolon as a column separator)
1. column (name):
• {ZINC-name}_{protomer-number}
2. column (smiles):
• SMILES code
3. column (inchi):
• InChI code
4. column (inchikey):
• InChIKey code
5. column (link):
• Protomer link for download from ZINC15 (https://zinc15.docking.org/).
6. column (correct):
• -1 (4)
▸ Presence of a Si atom.
▸ Molecules are not provided in In_vivo.zip.
• 0 (1,115)
▸ Molecule needed a correction.
▸ Molecules are provided in In_vivo.zip.
• 1 (59,292)
▸ Molecule did not need a correction; only Gasteiger charges were assigned.
▸ Molecules are not provided in In_vivo.zip.
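As an illustration of the file layout described above, the following Python sketch reads semicolon-separated rows and counts them by the value in the correct column. It assumes that the file starts with a header row using the column names given in parentheses above, and that any field containing a semicolon (InChI codes of multi-component molecules can contain one) is quoted. The sample rows, keys and links are hypothetical placeholders, not entries from In_vivo.csv.

```python
import csv
import io
from collections import Counter

def read_rows(csv_text):
    """Parse semicolon-separated text with a header row into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text), delimiter=";"))

def count_by_status(rows):
    """Count rows per value of the 'correct' column (-1, 0 or 1)."""
    return Counter(int(row["correct"]) for row in rows)

def select_names(rows, status):
    """Return the 'name' entries whose 'correct' value equals status."""
    return [row["name"] for row in rows if int(row["correct"]) == status]

# Hypothetical sample data that mimics the described columns.
SAMPLE = """name;smiles;inchi;inchikey;link;correct
ZINC000000000001_1;C;InChI=1S/CH4/h1H4;KEY-A;https://example.org/a.mol2.gz;1
ZINC000000000002_1;O;InChI=1S/H2O/h1H2;KEY-B;https://example.org/b.mol2.gz;0
ZINC000000000003_1;N;InChI=1S/H3N/h1H3;KEY-C;https://example.org/c.mol2.gz;1
"""

rows = read_rows(SAMPLE)
print(count_by_status(rows))    # Counter({1: 2, 0: 1})
print(select_names(rows, 0))    # ['ZINC000000000002_1']
```

For the real file, the same reader can be built with open("In_vivo.csv", newline="") in place of io.StringIO; if the header assumption holds, the counts should match the numbers listed above (4, 1,115 and 59,292).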
File In_vivo.zip
• A total of 1,115 corrected molecules in mol2 format.
File download.sh
• Bash script for downloading the molecules of the in vivo subset of ZINC15 in mol2 format.
• The script requires the standalone Linux commands wget and gunzip.
• Time limit for downloading a single molecule is set to 180 seconds.
• Time limit can be changed by adjusting the time_limit variable directly in the script.
To reproduce the described workflow, follow these steps:
1. Prepare the download script and input file:
• Ensure that download.sh and In_vivo.csv are both located in the current working directory.
• Make the script executable:
▸ "chmod +x download.sh"
2a. Download only correct molecules from the in vivo database (ZINC15):
• Run:
▸ "./download.sh In_vivo.csv 1" (59,292 molecules).
2b. Download the original in vivo database (ZINC15):
• Run:
▸ "./download.sh In_vivo.csv" (60,411 molecules).
• This step is not necessary to reproduce the described workflow.
3. If molecule downloads fail in step 2a or 2b, rerun the script using the generated error file:
• Run:
▸ "./download.sh In_vivo_err.csv".
• This step is necessary only if the error file contains entries.
• Repeat until the error file contains only the header.
• If the script is interrupted during step 2a or 2b, the In_vivo_err.csv file is not created.
▸ In that case, run step 2a or 2b again (there is no need to remove the mol2 directory).
4. Assign Gasteiger charges with Open Babel 3.1.1:
• For each of the 59,292 correct molecules (downloaded in step 2a), run the command:
▸ "obabel ZINCXXXXXXXXXXXX_Y.mol2 -O ZINCXXXXXXXXXXXX_Y_gas.mol2 --partialcharge gasteiger"
5. Move the modified molecules (1,115, provided in In_vivo.zip) to the mol2 directory.
*Please note that the database can be downloaded only as long as the links on the ZINC15 website exist [3 February 2026].
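Step 4 applies the same obabel command to every downloaded file. A minimal Python sketch of such a batch run is given below; it is an illustration rather than part of the deposited workflow. The function names are hypothetical, the default directory name mol2 is taken from step 3, and the obabel executable (Open Babel 3.1.1) is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path

def gasteiger_command(mol2_path):
    """Build the obabel call from step 4 for one mol2 file."""
    source = Path(mol2_path)
    # ZINC..._Y.mol2 -> ZINC..._Y_gas.mol2, stored next to the input file
    target = source.with_name(source.stem + "_gas.mol2")
    return ["obabel", str(source), "-O", str(target),
            "--partialcharge", "gasteiger"]

def assign_gasteiger(directory="mol2"):
    """Run the command for every mol2 file in the directory."""
    for source in sorted(Path(directory).glob("*.mol2")):
        if source.stem.endswith("_gas"):
            continue  # skip outputs of a previous run
        subprocess.run(gasteiger_command(source), check=True)

print(" ".join(gasteiger_command("ZINC000000000001_1.mol2")))
```

Calling assign_gasteiger() after step 2a processes the whole directory; check=True stops the loop at the first file for which obabel returns a non-zero exit status.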
Citation
Any use of the modified compounds of the in vivo dataset must be accompanied by a citation of the original ZINC15 database:
Sterling, T., & Irwin, J. J. (2015). ZINC 15–ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324-2337.
http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559
If any idea related to this Dryad contribution is used, please cite the Dryad dataset and the following article:
Boršová, V., Zajaček, D., Bucinsky, L., Štekláč, M. (2026). Integrated computational protocol for sampling molecular databases towards allosteric inhibition of SARS-CoV-2 spike activation. J. Biomol. Struct. Dyn.
https://doi.org/10.1080/07391102.2026.2658697
A total of 60,411 molecules were downloaded from the in vivo subset of the ZINC15 [1] database. Subsequently, a working protocol was defined and used to prepare the molecules. It involved the following steps:
a) Control of non-parameterized atoms.
1a) Four molecules were excluded from the database due to the presence of a Si atom.
2a) Parameterized atoms (H, C, O, N, S, F, Cl, Br, I and P) were exclusively present in 60,407 molecules.
b) Correct structure - agreement between the reference number of atoms and the number of atoms obtained from the mol2 format (59,806 molecules).
1b) Present charge
Partial charges were given in mol2 format for 59,292 molecules. For all of them, the reference formal charge and the formal charge obtained from the mol2 format (sum of partial charges) were identical. In a final step, these molecules were assigned Gasteiger charges using Open Babel 3.1.1 [2] (--partialcharge gasteiger).
2b) Absent charge
In total, 514 molecules were missing partial charge information in the mol2 format. Three procedures were used to achieve a match between the reference formal charge and the formal charge from mol2. Firstly, direct Gasteiger charge assignment was successful in two cases. Secondly, an initial conversion with added mmff94 charges (--partialcharge mmff94), followed by a conversion to Gasteiger charges, was used for 486 molecules. Finally, corrections of mol2 atom types (C.cat to C.2, N.pl3 to N.4, N.ar to N.4, etc.) made before assigning the Gasteiger charges were necessary for the remaining 26 molecules.
c) Incorrect structure - discrepancy between the reference number of atoms and the number of atoms obtained from the mol2 format (601 molecules).
For the molecules with an incorrect number of atoms, it was necessary to regenerate their 3D structures regardless of whether the molecule had (592 molecules) or did not have (9 molecules) partial charges listed in the mol2 format. The structure was generated from the SMILES code (available at ZINC15 [1]) with the mmff94 force field using Open Babel 3.1.1 [2] (--gen3D --best). Subsequently, the 3D geometries of the molecules were refined by an additional conformational search using Open Babel 3.1.1 [2] (--confab). The correct formal charge was ensured by directly assigning Gasteiger charges to 588 molecules. An initial generation of mmff94 charges followed by conversion to Gasteiger charges was required for ten molecules. Finally, an initial Mulliken charge assignment in Gaussian 16 [3] (B3LYP/6-31G*) followed by conversion to Gasteiger charges was used for the remaining three molecules.
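The two checks that drive the protocol above (number of atoms, and sum of partial charges within 0.1 e of the reference formal charge) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the partial charge is assumed to be the ninth whitespace-separated field of each line in the @<TRIPOS>ATOM section, and the ammonium record uses made-up coordinates and charges.

```python
def mol2_atoms_and_charge(mol2_text):
    """Return (number of atoms, sum of partial charges or None if absent)."""
    n_atoms, total, has_charges, in_atoms = 0, 0.0, True, False
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            continue
        if not in_atoms or not line.strip():
            continue
        fields = line.split()
        n_atoms += 1
        if len(fields) >= 9:
            total += float(fields[8])   # partial charge column
        else:
            has_charges = False         # charge column missing
    return n_atoms, (total if has_charges else None)

def agrees_with_reference(mol2_text, ref_atoms, ref_charge, tolerance=0.1):
    """Return (atom count matches, formal charge matches within tolerance)."""
    n_atoms, charge = mol2_atoms_and_charge(mol2_text)
    charge_ok = charge is not None and abs(charge - ref_charge) <= tolerance
    return n_atoms == ref_atoms, charge_ok

# Hypothetical ammonium record; geometry and charges are illustrative only.
AMMONIUM = """@<TRIPOS>MOLECULE
ammonium
 5 4 0 0 0
SMALL
USER_CHARGES

@<TRIPOS>ATOM
      1 N1    0.0000  0.0000  0.0000 N.4  1 UNL1  0.2000
      2 H1    0.6000  0.6000  0.6000 H    1 UNL1  0.2000
      3 H2   -0.6000 -0.6000  0.6000 H    1 UNL1  0.2000
      4 H3   -0.6000  0.6000 -0.6000 H    1 UNL1  0.2000
      5 H4    0.6000 -0.6000 -0.6000 H    1 UNL1  0.2000
@<TRIPOS>BOND
     1     1     2    1
     2     1     3    1
     3     1     4    1
     4     1     5    1
"""

# Reference values for ammonium from its InChI: 5 atoms, formal charge +1.
print(agrees_with_reference(AMMONIUM, ref_atoms=5, ref_charge=1))  # (True, True)
```

In terms of the protocol, a failed atom-count check corresponds to branch c), while a matching atom count with absent charges corresponds to branch 2b).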
References:
[1] Sterling, T., & Irwin, J. J. (2015). J. Chem. Inf. Model. 55, 2324-2337.
[2] O'Boyle, N. M., et al. (2011). J. Cheminf. 3, 1-14.
[3] Frisch, M. J., et al. (2016). Gaussian 16, Revision C.01.
