Dual loop active learning of hydrophobicity of patterned SAMs

Cite this dataset

Kelkar, Atharva; Dallin, Bradley; Van Lehn, Reid (2022). Dual loop active learning of hydrophobicity of patterned SAMs [Dataset]. Dryad.


Hydrophobic interactions drive numerous biological and synthetic processes. The materials used in these processes often possess chemically heterogeneous surfaces that are characterized by diverse chemical groups positioned in close proximity at the nanoscale; examples include functionalized nanomaterials and biomolecules like proteins and peptides. Nonadditive contributions to the hydrophobicity of such surfaces depend on the chemical identities and spatial patterns of polar and nonpolar groups in ways that remain poorly understood. Here, we develop a dual-loop active learning framework that combines a fast, reduced-accuracy method (a convolutional neural network) with a slow, higher-accuracy method (molecular dynamics simulations with enhanced sampling) to efficiently predict the hydration free energy, a thermodynamic descriptor of hydrophobicity, for nearly 200,000 chemically heterogeneous self-assembled monolayers (SAMs). Analysis of this data set reveals that SAMs with distinct polar groups exhibit substantial variations in hydrophobicity as a function of their composition and patterning, but the clustering of nonpolar groups is a common signature of highly hydrophobic patterns. Further MD analysis relates such clustering to the perturbation of interfacial water structure. These results provide new insight into the influence of chemical heterogeneity on hydrophobicity via quantitative analysis of a large set of surfaces, enabled by the active learning approach.

Paper title: Identifying Nonadditive Contributions to the Hydrophobicity of Chemically Heterogeneous Surfaces via Dual-Loop Active Learning
Authors: Atharva Kelkar, Bradley Dallin, Reid Van Lehn


This folder contains the files needed to reproduce and analyze molecular dynamics (MD) trajectories of a large set of patterned self-assembled monolayers (SAMs). Each pattern combines a nonpolar group with one polar group (amine, amide, or hydroxyl). The dataset is split into two major parts -

1. trajectories - Tar files containing equilibrium and short production trajectories (GROMACS xtc files) for INDUS-labelled patterns from 3 different parts of the dual loop active learning algorithm. This folder also contains the initial configurations, topology files, mdp files, and CHARMM inputs needed to reproduce or extend the trajectories. The 3 different parts of the dual loop active learning process are as follows -
    a. Seed runs - Randomly chosen patterns used to initiate the dual loop active learning process.
    b. GPR runs - Patterns identified during the slow loop of the active learning algorithm.
    c. Max-dev runs - Patterns predicted to have the highest and lowest HFEs for a given polar area fraction, identified after training of the active learning loop was complete.

Each trajectory folder contains a file titled "hfe_label.txt" which contains the calculated value of the HFE in units of kBT (with T = 300 K). All simulations were performed using the force field files supplied in the top-level charmm36-jul2017.ff directory, using the TIP4P/2005 water model, at constant volume and temperature (NVT). The name of the polar end group for each SAM is specified in the folder name.
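
As an illustration of how the per-trajectory labels might be collected once the tar files are extracted, the following is a minimal sketch rather than code from the authors. It assumes that each "hfe_label.txt" begins with a single numeric value; the function names and the extraction directory are hypothetical. The conversion factor uses kBT = R x 300 K, which is about 2.494 kJ/mol.

```python
from pathlib import Path

# kB*T at 300 K in kJ/mol, computed from R = 8.314462618e-3 kJ/(mol K).
KBT_300K_KJ_PER_MOL = 8.314462618e-3 * 300.0


def read_hfe_labels(root):
    """Map each trajectory folder name under `root` to its HFE label in kBT.

    Assumption: the first whitespace-separated token in hfe_label.txt
    is the numeric HFE value.
    """
    labels = {}
    for label_file in sorted(Path(root).rglob("hfe_label.txt")):
        tokens = label_file.read_text().split()
        if not tokens:
            continue  # skip empty files rather than guess a value
        labels[label_file.parent.name] = float(tokens[0])
    return labels


def kbt_to_kj_per_mol(value_kbt):
    """Convert an HFE from units of kBT (T = 300 K) to kJ/mol."""
    return value_kbt * KBT_300K_KJ_PER_MOL
```

Calling read_hfe_labels on the directory where the trajectories were extracted would return a dictionary keyed by folder name, which also carries the name of the polar end group.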

2. 'collated_histograms.pickle' - Pickle file containing a pre-processed dataset of 20x20 oxygen and hydrogen number density histograms corresponding to the trajectories in the 'trajectories/' folder. The pickle file contains the following data -
    a. 'histograms' - Hydrogen and water density histograms (numpy arrays) of size [n_frames, 2, 400], where 400 = 20x20 bins
    b. 'labels' - INDUS-calculated HFE labels for each of the histograms
    c. 'ligand' - Ligand associated with each histogram
    d. 'run_type' - Classification of category of runs (from point 1 above - Seed, GPR, or Max-dev)
    e. 'folder_name' - Folder name of the trajectory associated with each histogram

The objective of collated histograms is to enable scientists to load in a curated dataset with labels and histograms and apply data-centric tools to study the hydrophobicity of a large set of chemically heterogeneous surfaces with diverse end group chemistries.
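
A minimal sketch of loading the collated dataset is given here for illustration. It is not the authors' code. It assumes that the pickle holds a dict-like object with the keys listed above, that there is one label per histogram, and that the 400 bins are stored in row-major order; the function names are hypothetical. Note that pickle files can execute code on loading, so only load files from a trusted source.

```python
import pickle
from collections import defaultdict

import numpy as np


def load_collated(path):
    """Load the collated pickle; assumes a dict-like object keyed as described above."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def as_grids(histograms, n_bins=20):
    """Reshape flattened [n, 2, 400] histograms to [n, 2, 20, 20].

    Assumption: the 400 bins are flattened in row-major order.
    """
    arr = np.asarray(histograms)
    if arr.ndim != 3 or arr.shape[-1] != n_bins * n_bins:
        raise ValueError(f"unexpected histogram shape {arr.shape}")
    return arr.reshape(arr.shape[0], arr.shape[1], n_bins, n_bins)


def mean_label_by(data, key):
    """Mean HFE label (in kBT) grouped by a per-histogram field, e.g. 'ligand'."""
    groups = defaultdict(list)
    for group, label in zip(data[key], data["labels"]):
        groups[str(group)].append(float(label))
    return {g: float(np.mean(v)) for g, v in groups.items()}
```

For example, mean_label_by(data, "run_type") would give the average HFE for the Seed, GPR, and Max-dev categories, and as_grids(data["histograms"]) would give arrays that can be passed to image-style models.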

All the data needed to train the 3D CNN referenced in the paper, i.e., idealized SAMs with amine, amide, and hydroxyl end groups, were shared publicly with our previous publication (Kelkar, Dallin, and Van Lehn, J Phys Chem B 124 (41), 2020) at the following link:

All the code required to generate results and analyze data using the dual-loop active learning algorithm, along with the trained GPR models, is uploaded to a git repo:

Usage notes

A README file has been added to the base folder which details the data present in the backup folder.


National Science Foundation, Award: 2044997

National Science Foundation, Award: ACI-1548562