Data from: Machine learning assisted designing of organic solar cell hole-transport molecules with promising short circuit current density

Noreen, Sadaf¹; Aljaafreh, Mamduh²; Sumrra, Sajjad¹

Published Apr 13, 2026 on Dryad. https://doi.org/10.5061/dryad.xsj3tx9t4

Data files

Apr 13, 2026 version files (3.82 MB total)

File	Size
`mordred_-_Copy-checkpoint.ipynb`	2.07 MB
`rdkitpce.xlsx`	1.43 MB
`README.md`	26.55 KB
`sali.xlsx`	121.57 KB
`SM.xlsx`	84.86 KB
`smiles.xlsx`	84.86 KB
`Untitled-checkpoint.ipynb`	72 B

Abstract

Organic solar cells (OSCs) have shown tremendous potential as a renewable energy source, but their efficiency is largely dependent on the design of the hole-transport layer. In this study, we employed machine learning (ML) techniques to design and optimize organic donors for OSCs. A dataset of 940 small molecule donors (SMDs) was curated from peer-reviewed research papers, along with their experimental short-circuit current density (Jsc) values. Using gradient boosting and AdaBoost regressors, we achieved high prediction accuracy for Jsc, with an R-squared (R²) value above 0.90. Our feature importance analysis revealed that MinAbsEStateIndex and fr_thiazole have a significant impact on the model. Leveraging the trained model, we designed 1726 new SMDs with structure-activity landscape index (SALI) scores of up to 9.6, indicating their potential as efficient hole-transport materials. Further, t-SNE and K-Means clustering analyses were performed to identify patterns and clusters in the designed SMDs. This work demonstrates the power of ML in reducing the computational and experimental costs associated with the design and optimization of SMDs for OSCs. By streamlining the design process, our approach can accelerate the development of more efficient OSCs, ultimately contributing to the advancement of renewable energy technologies.

Access this dataset on Dryad — Dataset DOI: 10.5061/dryad.xsj3tx9t4

This dataset accompanies a study that employs machine learning (ML) methods to screen, predict, and generate novel organic hole-transport molecules (HTMs) for use in organic solar cells (OSCs), with a specific focus on maximizing short circuit current density (Jsc). The workflow integrates molecular descriptor calculation (via Mordred), quantitative structure–property relationship (QSPR) modeling, and structure–activity landscape analysis (SALI) to identify promising HTM candidates from a large virtual chemical space.

Description of the Data and File Structure

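This README describes seven files. Six are included in the Dryad deposit alongside README.md; the seventh, new_smiles.xlsx (File 4), is available only as supplementary material to the journal article.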

File 1: `SM.xlsx`

Format: Microsoft Excel Workbook (.xlsx)
Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
Contents: Curated molecular structures used for descriptor calculation and model training.
Usage: Provides molecular inputs for Mordred descriptor generation.

File 2: `smiles.xlsx`

Format: Microsoft Excel Workbook (.xlsx)
Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
Contents: Extended set of SMILES strings for molecules used in descriptor calculation.
Usage: Supplies additional molecular structures for training and validation.

File 3: `rdkitpce.xlsx`

Format: Microsoft Excel Workbook (.xlsx)
Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
Contents: RDKit-derived molecular descriptors and short circuit current density (Jsc) values for 902 molecules (211 columns total). All descriptor values are computed using RDKit.
Usage: Enables correlation analysis between Jsc and molecular descriptors, and serves as the primary feature matrix for QSPR model training.
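
The correlation analysis mentioned above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' notebook code: it assumes the workbook has already been loaded into a dictionary of columns, and the numeric values shown are made up for demonstration. Only the column names (`JSC`, `MolWt`, `TPSA`, `fr_azide`) come from this README.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Constant columns (e.g. a fragment count that is always 0) carry
        # no correlation information; report 0.0 instead of dividing by zero.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_descriptors(table, target="JSC"):
    """Rank descriptor columns by the absolute value of their correlation with the target."""
    y = table[target]
    scores = {name: pearson(col, y) for name, col in table.items() if name != target}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data: values are illustrative only, not taken from rdkitpce.xlsx.
table = {
    "JSC": [8.0, 10.0, 12.0, 14.0],
    "MolWt": [400.0, 500.0, 600.0, 700.0],
    "TPSA": [90.0, 70.0, 80.0, 60.0],
    "fr_azide": [0, 0, 0, 0],
}
ranking = rank_descriptors(table)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Many of the fr_* columns are likely to be constant across a set of conjugated donor molecules, which is why the sketch handles zero-variance columns explicitly.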

Variable Definitions for `rdkitpce.xlsx`

Note: Definitions, units, and interpretation keys for all 211 variables in rdkitpce.xlsx are provided below, grouped by descriptor category. All descriptors are computed using RDKit unless otherwise stated.

Target Variable

Variable	Description	Unit	Notes
`JSC`	Short circuit current density	mA cm⁻²	Experimentally reported or ML-predicted value; the response variable for QSPR modeling

Electronic State Descriptors (EState)

Variable	Description	Unit
`MaxAbsEStateIndex`	Maximum absolute EState index across all atoms	Dimensionless
`MaxEStateIndex`	Maximum EState index	Dimensionless
`MinAbsEStateIndex`	Minimum absolute EState index	Dimensionless
`MinEStateIndex`	Minimum EState index	Dimensionless

Drug-likeness and Complexity

Variable	Description	Unit
`qed`	Quantitative Estimate of Drug-likeness (Bickerton et al.)	Dimensionless; range 0–1 (higher = more drug-like)
`SPS`	Spatial score (Krzyzanowski et al.), normalized by heavy-atom count	Dimensionless; higher = greater three-dimensional structural complexity

Molecular Weight Descriptors

Variable	Description	Unit
`MolWt`	Average molecular weight (including isotopes)	g mol⁻¹
`HeavyAtomMolWt`	Molecular weight of heavy atoms only	g mol⁻¹
`ExactMolWt`	Exact monoisotopic molecular weight	g mol⁻¹

Electronic Properties

Variable	Description	Unit
`NumValenceElectrons`	Total number of valence electrons	Count (integer)
`NumRadicalElectrons`	Total number of radical electrons	Count (integer)
`MaxPartialCharge`	Maximum Gasteiger partial charge on any atom	e (elementary charge)
`MinPartialCharge`	Minimum Gasteiger partial charge on any atom	e
`MaxAbsPartialCharge`	Maximum absolute Gasteiger partial charge	e
`MinAbsPartialCharge`	Minimum absolute Gasteiger partial charge	e

Morgan Fingerprint Density

Variable	Description	Unit
`FpDensityMorgan1`	Number of distinct Morgan fingerprint features at radius 1 per heavy atom	Dimensionless; ≥ 0 (can exceed 1)
`FpDensityMorgan2`	Number of distinct Morgan fingerprint features at radius 2 per heavy atom	Dimensionless; ≥ 0 (can exceed 1)
`FpDensityMorgan3`	Number of distinct Morgan fingerprint features at radius 3 per heavy atom	Dimensionless; ≥ 0 (can exceed 1)

BCUT2D Descriptors (eigenvalue-based topological descriptors encoding MW, charge, logP, and molar refractivity)

Variable	Description
`BCUT2D_MWHI`	Highest eigenvalue of the Burden matrix weighted by atomic mass
`BCUT2D_MWLOW`	Lowest eigenvalue of the Burden matrix weighted by atomic mass
`BCUT2D_CHGHI`	Highest eigenvalue weighted by Gasteiger charge
`BCUT2D_CHGLO`	Lowest eigenvalue weighted by Gasteiger charge
`BCUT2D_LOGPHI`	Highest eigenvalue weighted by atomic contribution to logP
`BCUT2D_LOGPLOW`	Lowest eigenvalue weighted by atomic contribution to logP
`BCUT2D_MRHI`	Highest eigenvalue weighted by molar refractivity contribution
`BCUT2D_MRLOW`	Lowest eigenvalue weighted by molar refractivity contribution

All BCUT2D descriptors are dimensionless.

Topological/Graph-based Descriptors

Variable	Description	Unit
`AvgIpc`	Average information content of the coefficients of the characteristic polynomial of the adjacency matrix	Dimensionless
`BalabanJ`	Balaban's connectivity index (J)	Dimensionless
`BertzCT`	Bertz complexity index	Dimensionless

Molecular Connectivity Indices (Chi)

These are topological indices encoding molecular branching and connectivity. All are dimensionless.

Variable	Description
`Chi0`	Zeroth-order molecular connectivity index
`Chi0n`	Zeroth-order normalized connectivity index
`Chi0v`	Zeroth-order valence connectivity index
`Chi1`	First-order connectivity index
`Chi1n`	First-order normalized connectivity index
`Chi1v`	First-order valence connectivity index
`Chi2n`	Second-order normalized connectivity index
`Chi2v`	Second-order valence connectivity index
`Chi3n`	Third-order normalized connectivity index
`Chi3v`	Third-order valence connectivity index
`Chi4n`	Fourth-order normalized connectivity index
`Chi4v`	Fourth-order valence connectivity index
`HallKierAlpha`	Hall–Kier alpha correction term for connectivity indices
`Ipc`	Information content of the coefficients of the characteristic polynomial of the adjacency matrix
`Kappa1`	First Kier shape index
`Kappa2`	Second Kier shape index
`Kappa3`	Third Kier shape index

Surface Area Descriptors

Variable	Description	Unit
`LabuteASA`	Labute's approximate surface area	Å²
`TPSA`	Topological polar surface area	Å²

PEOE_VSA Descriptors (Partial Equalization of Orbital Electronegativities — Van der Waals Surface Area, binned by charge)

PEOE_VSA1 through PEOE_VSA14: Each variable reports the surface area (Å²) of atoms falling within a specific partial charge bin. Higher bin indices correspond to more positive partial charges.

SMR_VSA Descriptors (Molar Refractivity — VSA, binned by atomic molar refractivity contribution)

SMR_VSA1 through SMR_VSA10: Each variable reports the surface area (Å²) of atoms within a specific molar refractivity bin.

SlogP_VSA Descriptors (Wildman–Crippen logP contribution — VSA, binned by logP contribution)

SlogP_VSA1 through SlogP_VSA12: Each variable reports the surface area (Å²) of atoms within a specific logP contribution bin.

EState_VSA and VSA_EState Descriptors

EState_VSA1 through EState_VSA11: Surface area (Å²) of atoms in EState index bins.

VSA_EState1 through VSA_EState10: Sum of EState indices for atoms in surface area bins. Units are dimensionless (EState contribution per bin).

Simple Count Descriptors

Variable	Description	Unit
`FractionCSP3`	Fraction of sp³-hybridized carbons	Dimensionless; range 0–1
`HeavyAtomCount`	Total number of heavy atoms	Count (integer)
`NHOHCount`	Number of N–H and O–H bonds	Count (integer)
`NOCount`	Number of nitrogen and oxygen atoms	Count (integer)
`NumAliphaticCarbocycles`	Number of aliphatic carbocyclic rings	Count (integer)
`NumAliphaticHeterocycles`	Number of aliphatic heterocyclic rings	Count (integer)
`NumAliphaticRings`	Total number of aliphatic rings	Count (integer)
`NumAromaticCarbocycles`	Number of aromatic carbocyclic rings	Count (integer)
`NumAromaticHeterocycles`	Number of aromatic heterocyclic rings	Count (integer)
`NumAromaticRings`	Total number of aromatic rings	Count (integer)
`NumHAcceptors`	Number of hydrogen bond acceptors	Count (integer)
`NumHDonors`	Number of hydrogen bond donors	Count (integer)
`NumHeteroatoms`	Number of heteroatoms (non-C, non-H)	Count (integer)
`NumRotatableBonds`	Number of rotatable bonds	Count (integer)
`NumSaturatedCarbocycles`	Number of saturated carbocyclic rings	Count (integer)
`NumSaturatedHeterocycles`	Number of saturated heterocyclic rings	Count (integer)
`NumSaturatedRings`	Total number of saturated rings	Count (integer)
`RingCount`	Total number of rings	Count (integer)

Lipophilicity and Refractivity

Variable	Description	Unit
`MolLogP`	Wildman–Crippen logP (octanol–water partition coefficient)	Dimensionless (log-scale)
`MolMR`	Wildman–Crippen molar refractivity	cm³ mol⁻¹

Functional Group Fragment Counts (fr_*)

All fr_* variables are integer counts of specific functional group substructures, identified by SMARTS pattern matching in RDKit. All values are dimensionless counts (integers ≥ 0). The complete set of 85 fragment descriptors is listed below:

Variable	Functional Group / Substructure
`fr_Al_COO`	Aliphatic carboxylic acids
`fr_Al_OH`	Aliphatic hydroxyl groups
`fr_Al_OH_noTert`	Aliphatic hydroxyl groups (excluding tertiary)
`fr_ArN`	N functional groups attached to aromatic rings
`fr_Ar_COO`	Aromatic carboxylic acids
`fr_Ar_N`	Aromatic N (general)
`fr_Ar_NH`	Aromatic N–H
`fr_Ar_OH`	Aromatic hydroxyl groups (phenols)
`fr_COO`	Carboxylate groups (–COO⁻)
`fr_COO2`	Carboxylate groups (broader match)
`fr_C_O`	C=O groups (general)
`fr_C_O_noCOO`	C=O groups excluding carboxylates
`fr_C_S`	C=S groups
`fr_HOCCN`	HOCCN motifs
`fr_Imine`	Imine groups (C=N)
`fr_NH0`	Tertiary amines (N with no H)
`fr_NH1`	Secondary amines (N with one H)
`fr_NH2`	Primary amines (N with two H)
`fr_N_O`	Hydroxylamine groups (N–O)
`fr_Ndealkylation1`	N-dealkylation sites (type 1)
`fr_Ndealkylation2`	N-dealkylation sites (type 2)
`fr_Nhpyrrole`	Pyrrole-type N–H
`fr_SH`	Thiol groups (–SH)
`fr_aldehyde`	Aldehyde groups
`fr_alkyl_carbamate`	Alkyl carbamate groups
`fr_alkyl_halide`	Alkyl halides
`fr_allylic_oxid`	Allylic oxidation sites
`fr_amide`	Amide bonds
`fr_amidine`	Amidine groups
`fr_aniline`	Aniline substructures (aromatic amine)
`fr_aryl_methyl`	Aryl methyl groups
`fr_azide`	Azide groups
`fr_azo`	Azo groups (–N=N–)
`fr_barbitur`	Barbiturate substructures
`fr_benzene`	Benzene rings
`fr_benzodiazepine`	Benzodiazepine substructures
`fr_bicyclic`	Bicyclic systems
`fr_diazo`	Diazo groups
`fr_dihydropyridine`	Dihydropyridine substructures
`fr_epoxide`	Epoxide rings
`fr_ester`	Ester groups
`fr_ether`	Ether linkages
`fr_furan`	Furan rings
`fr_guanido`	Guanidine groups
`fr_halogen`	Halogen atoms (F, Cl, Br, I)
`fr_hdrzine`	Hydrazine groups
`fr_hdrzone`	Hydrazone groups
`fr_imidazole`	Imidazole rings
`fr_imide`	Imide groups
`fr_isocyan`	Isocyanate groups
`fr_isothiocyan`	Isothiocyanate groups
`fr_ketone`	Ketone groups
`fr_ketone_Topliss`	Ketone groups (Topliss definition)
`fr_lactam`	Lactam rings
`fr_lactone`	Lactone rings
`fr_methoxy`	Methoxy groups (–OCH₃)
`fr_morpholine`	Morpholine rings
`fr_nitrile`	Nitrile groups (–C≡N)
`fr_nitro`	Nitro groups (–NO₂)
`fr_nitro_arom`	Aromatic nitro groups
`fr_nitro_arom_nonortho`	Non-ortho aromatic nitro groups
`fr_nitroso`	Nitroso groups (–N=O)
`fr_oxazole`	Oxazole rings
`fr_oxime`	Oxime groups
`fr_para_hydroxylation`	Para-hydroxylation sites
`fr_phenol`	Phenol groups
`fr_phenol_noOrthoHbond`	Phenols without ortho H-bonding
`fr_phos_acid`	Phosphoric acid groups
`fr_phos_ester`	Phosphate ester groups
`fr_piperdine`	Piperidine rings
`fr_piperzine`	Piperazine rings
`fr_priamide`	Primary amide groups
`fr_prisulfonamd`	Primary sulfonamide groups
`fr_pyridine`	Pyridine rings
`fr_quatN`	Quaternary nitrogen atoms
`fr_sulfide`	Sulfide groups (–S–)
`fr_sulfonamd`	Sulfonamide groups
`fr_sulfone`	Sulfone groups
`fr_term_acetylene`	Terminal alkyne groups
`fr_tetrazole`	Tetrazole rings
`fr_thiazole`	Thiazole rings
`fr_thiocyan`	Thiocyanate groups
`fr_thiophene`	Thiophene rings
`fr_unbrch_alkane`	Unbranched alkane carbons
`fr_urea`	Urea groups

File 4: `new_smiles.xlsx`

Format: Microsoft Excel Workbook (.xlsx)
Role in workflow: Contains SMILES strings of newly generated virtual HTM candidates.
Contents: Canonical SMILES strings and unique compound identifiers.
Usage: Used for virtual compound generation and screening.
Availability Note: This file is not included in the Dryad deposit because it cannot be released under the CC0 1.0 Universal waiver that Dryad requires. The data are instead provided as a supplementary file with the corresponding journal publication. Users who need new_smiles.xlsx should consult the supplementary materials of the published article, or contact the corresponding author directly if access difficulties arise.

File 5: `sali.xlsx`

Format: Microsoft Excel Workbook (.xlsx)
Role in workflow: Output of the SALI analysis step in Untitled-checkpoint.ipynb.
Contents: Structure–Activity Landscape Index (SALI) values for 1,700 molecule entries, used to identify activity cliffs among the HTM candidates.

Variable Definitions for `sali.xlsx`

Clarification Note: An earlier version of this README described five columns for sali.xlsx (Compound_1, Compound_2, Tanimoto_Similarity, Delta_Jsc, SALI_Value), which did not match the three columns actually present in the data file. The definitions below reflect the actual column names as they appear in sali.xlsx (SMILES, JSC, SALI). These are the authoritative, correct definitions.

Variable	Description	Type	Unit / Range	Interpretation
`SMILES`	Canonical SMILES string of the molecule	Text	N/A	Unique molecular structure identifier; can be used to reconstruct the molecule in RDKit or other cheminformatics tools
`JSC`	Predicted short circuit current density for the molecule	Numeric (float)	mA cm⁻²	Higher values indicate greater predicted photovoltaic performance; used as the activity measure in SALI calculation
`SALI`	Structure–Activity Landscape Index score for the molecule	Numeric (float)	Dimensionless (observed range: ~2.4–9.6)	Quantifies how abruptly Jsc changes relative to structural similarity for each molecule across its pairwise comparisons. Higher SALI values indicate that the molecule participates in activity cliffs — small structural changes lead to large changes in Jsc. Values above ~7 may be considered high-cliff candidates.

How SALI is calculated: For a given molecule i paired with molecule j, SALI(i,j) = |Jsc(i) − Jsc(j)| / (1 − Tanimoto similarity(i,j)), where Tanimoto similarity is calculated from Morgan fingerprints (radius 2) using RDKit. The per-molecule SALI value reported in this file represents the maximum SALI score across all pairwise comparisons involving that molecule.
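
The calculation described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' notebook code. Fingerprints are represented as sets of on-bit indices (with RDKit these would be derived from radius-2 Morgan fingerprints), and the toy fingerprints and Jsc values are made up. The formula is undefined when two fingerprints are identical (similarity of 1), so the sketch clamps the denominator to a small `eps`; that clamp is an assumption and is not described in this README.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_sali(fingerprints, jsc, eps=1e-6):
    """Return, for each molecule, the maximum SALI over all pairs it takes part in."""
    best = [0.0] * len(jsc)
    for i, j in combinations(range(len(jsc)), 2):
        distance = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
        # eps guards against division by zero for identical fingerprints (assumption).
        sali = abs(jsc[i] - jsc[j]) / max(distance, eps)
        best[i] = max(best[i], sali)
        best[j] = max(best[j], sali)
    return best


# Toy example: molecules 0 and 1 are similar but differ strongly in Jsc.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
jsc = [10.0, 16.0, 11.0]
scores = max_sali(fps, jsc)
print(scores)
```

Molecules 0 and 1 share three of five distinct bits (similarity 0.6) while their Jsc values differ by 6 mA cm⁻², so both receive the highest score, which is the activity-cliff behaviour the SALI column is meant to flag.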

File 6: `mordred_-_Copy-checkpoint.ipynb`

Format: Jupyter Notebook (.ipynb)
Role in workflow: Step 1 — Descriptor calculation pipeline.
Contents: Computes >1,800 2D molecular descriptors using Mordred + RDKit.
Dependencies: Python 3.13.5, RDKit 2025.03.3, Mordred, Pandas, NumPy.
Input files used: SM.xlsx, smiles.xlsx, rdkitpce.xlsx.

File 7: `Untitled-checkpoint.ipynb`

Format: Jupyter Notebook (.ipynb)
Role in workflow: Step 2 — ML modeling and analysis.
Contents: Feature selection, regression model training, evaluation, virtual screening, SALI analysis, and figure generation.
Dependencies: Python 3.13.5, RDKit 2025.03.3, Pandas, NumPy, Scikit-learn, Matplotlib.
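
For the evaluation step, the abstract reports an R² above 0.90. As a reminder of what that metric measures, here is a minimal standard-library sketch of the coefficient of determination; the measured and predicted Jsc values are made up for illustration, and the notebook itself may well rely on Scikit-learn's built-in scoring instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted Jsc values (mA cm⁻²).
measured = [8.0, 10.0, 12.0, 14.0]
predicted = [8.5, 9.5, 12.5, 13.5]
score = r_squared(measured, predicted)
print(f"R² = {score:.2f}")
```

An R² of 1 means the predictions reproduce the measured values exactly, while 0 means the model does no better than always predicting the mean.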

Sharing/Access information

Other publicly available locations where the data may be accessed: N/A

Data were derived from the following sources: Molecular structures and associated Jsc values were curated from published literature and supplemented with computationally generated structures. All descriptor calculations were performed using RDKit (open source; https://www.rdkit.org) and Mordred (open source; https://github.com/mordred-descriptor/mordred).

The new_smiles.xlsx file, containing newly generated virtual HTM candidates, is available as supplementary data in the associated journal publication and is not included in this Dryad deposit due to licensing constraints.
