Data from: Machine learning-based discovery of molecular descriptors that control polymer gas permeation
Data files
Feb 29, 2024 version files 1.16 MB
- Perm_Data_refs-1.csv
- Perm_Data-1.csv
- README.md
Abstract
While machine learning has found increasing use in predicting the properties of polymeric materials with only a knowledge of chain architecture, determining the molecular factors underpinning properties ("interpretable AI") has remained less well explored. We show that encoding chain chemistry in commonly employed formats, e.g., binary-valued fingerprints, leads to uniqueness issues during the hashing process to save storage space. This is because the hashing algorithm can map several chemical moieties into the same bit. These issues carry over into the ML algorithms, especially for “inverse” design and interpretable AI, and cannot be avoided by changing the length of the fingerprint. Using MACCS key featurizations of monomer repeats resolves some of these issues, and we show that a few substructures consistently appear in top features for maximizing permeability across several gases and ML models. These are carbon-carbon double bonds (as in polyacetylenes) especially when they are associated with methyl groups (found in branching architectures). These results, derived from the limited data set of ~500 polymers with experimental gas permeation data, are in agreement with physical insight and thus provide a robust foundation which could further enable study of these material classes through detailed experiments and simulations.
README: Machine learning-based discovery of molecular descriptors that control polymer gas permeation
https://doi.org/10.5061/dryad.5x69p8dbm
Description of the data and file structure
Dataset, Shastry et al. Journal of Membrane Science (2024)
Perm_Data.csv contains the pure-gas permeability values for the polymer membranes used to train the machine learning models in the paper. Perm_Data_refs.csv contains citation information for the publications used in the analysis. Some lines in that file are left blank: redundant data were removed, and the blank lines keep the reference indexing aligned with the main database file.
Columns 1 and 2 contain the polymer name and its Simplified Molecular Input Line Entry System (SMILES) string. Columns 3-8 contain the permeability values for each of the six gases under study. Columns 9 and 10 contain the experimental temperature and pressure, if reported in the literature. Column 11 contains the index of the entry in Perm_Data_refs.csv for the literature reference from which the data were sourced.
Blank lines separate families of polymers. Blank cells within a row indicate that data for a polymer mentioned in a publication were not available in a form that readily yields a SMILES string. These cells are intentionally left blank so as not to introduce text that could interfere with filtering scripts.
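The column layout and blank-line conventions described above can be handled with a short filtering script. The following is a minimal sketch, not the authors' code. It assumes the file has a single header row and addresses columns by position only, because the README does not give the header names. The sample rows, names, SMILES strings, and numbers are invented for illustration, and `load_permeability` and `to_float` are hypothetical helpers.

```python
import csv
import io

# Invented sample that mimics the described 11-column layout.
# Header names and all values are placeholders, not real data.
SAMPLE = """\
name,smiles,g1,g2,g3,g4,g5,g6,temp,pressure,ref
Polymer A,*CC*,1.0,2.0,,4.0,5.0,6.0,35,1,3
Polymer B,,,,,,,,,,4

,,,,,,,,,,
Polymer C,*C=C(C)*,10.0,,30.0,,50.0,60.0,,,7
"""


def to_float(cell):
    """Return the cell as a float, or None if it is blank or non-numeric."""
    try:
        return float(cell.strip())
    except ValueError:
        return None


def load_permeability(handle, has_header=True):
    """Read rows, skipping family separators and rows without a SMILES string."""
    reader = csv.reader(handle)
    if has_header:  # assumption: one header row
        next(reader, None)
    records = []
    for row in reader:
        # Separator lines arrive either as [] or as rows of empty cells.
        if not any(cell.strip() for cell in row):
            continue
        row = row + [""] * (11 - len(row))  # pad short rows to 11 columns
        smiles = row[1].strip()
        if not smiles:  # no usable SMILES for this polymer
            continue
        records.append({
            "name": row[0].strip(),
            "smiles": smiles,
            "permeabilities": [to_float(c) for c in row[2:8]],  # columns 3-8
            "temperature": to_float(row[8]),                    # column 9
            "pressure": to_float(row[9]),                       # column 10
            "ref_index": row[10].strip() or None,               # column 11
        })
    return records


records = load_permeability(io.StringIO(SAMPLE))
```

To read the actual file, pass an open file handle instead of the in-memory sample, for example `open("Perm_Data.csv", newline="")`. The reference index is kept as a string here because the README does not state how it maps onto lines of Perm_Data_refs.csv.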
Sharing/Access information
Data were derived from the following sources:
Papers whose citations are provided in the Perm_Data_refs.csv file.