Data from: Machine learning assisted designing of organic solar cell hole-transport molecules with promising short circuit current density
Data files
Apr 13, 2026 version files 3.82 MB
- mordred_-_Copy-checkpoint.ipynb (2.07 MB)
- rdkitpce.xlsx (1.43 MB)
- README.md (26.55 KB)
- sali.xlsx (121.57 KB)
- SM.xlsx (84.86 KB)
- smiles.xlsx (84.86 KB)
- Untitled-checkpoint.ipynb (72 B)
Abstract
Organic solar cells (OSCs) have shown tremendous potential as a renewable energy source, but their efficiency is largely dependent on the design of the hole-transport layer. In this study, we employed machine learning (ML) techniques to design and optimize organic donors for OSCs. A dataset of 940 small molecule donors (SMDs) was curated from peer-reviewed research papers, along with their experimental short-circuit current density (Jsc) values. Using gradient boosting and AdaBoost regressors, we achieved a high prediction accuracy for Jsc, with an R-squared (R²) value of over 0.90. Our feature importance analysis revealed that MinAbsEStateIndex and fr_thiazole have a significant impact on the model. Leveraging the trained model, we designed 1726 new SMDs with a high structure-activity landscape index (SALI) score of up to 9.6, indicating their potential as efficient hole-transport materials. Further, t-SNE and K-Means clustering analyses were performed to identify patterns and clusters in the designed SMDs. This work demonstrates the power of ML in reducing the computational and experimental costs associated with the design and optimization of SMDs for OSCs. By streamlining the design process, our approach can accelerate the development of more efficient OSCs, ultimately contributing to the advancement of renewable energy technologies.
Access this dataset on Dryad — Dataset DOI: 10.5061/dryad.xsj3tx9t4
This dataset accompanies a study that employs machine learning (ML) methods to screen, predict, and generate novel organic hole-transport molecules (HTMs) for use in organic solar cells (OSCs), with a specific focus on maximizing short circuit current density (Jsc). The workflow integrates molecular descriptor calculation (via Mordred), quantitative structure–property relationship (QSPR) modeling, and structure–activity landscape analysis (SALI) to identify promising HTM candidates from a large virtual chemical space.
Description of the Data and File Structure
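This README describes seven files. Six are included in this deposit alongside README.md; File 4 (new_smiles.xlsx) is available only through the associated journal publication: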
File 1: SM.xlsx
- Format: Microsoft Excel Workbook (.xlsx)
- Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
- Contents: Curated molecular structures used for descriptor calculation and model training.
- Usage: Provides molecular inputs for Mordred descriptor generation.
File 2: smiles.xlsx
- Format: Microsoft Excel Workbook (.xlsx)
- Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
- Contents: Extended set of SMILES strings for molecules used in descriptor calculation.
- Usage: Supplies additional molecular structures for training and validation.
File 3: rdkitpce.xlsx
- Format: Microsoft Excel Workbook (.xlsx)
- Role in workflow: Input file referenced in mordred_-_Copy-checkpoint.ipynb.
- Contents: RDKit-derived molecular descriptors and short circuit current density (Jsc) values for 902 molecules (211 columns total). All descriptor values are computed using RDKit.
- Usage: Enables correlation analysis between Jsc and molecular descriptors, and serves as the primary feature matrix for QSPR model training.
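The correlation analysis mentioned above could be sketched roughly as follows. This is a minimal illustration, not the authors' notebook code: the helper name `rank_descriptors_by_correlation` is hypothetical, the choice of Pearson correlation is an assumption, and the small table is made-up data standing in for rdkitpce.xlsx (only the column names `JSC`, `MolWt`, `TPSA`, and `fr_azide` come from this README).

```python
import pandas as pd


def rank_descriptors_by_correlation(df: pd.DataFrame, target: str = "JSC") -> pd.Series:
    """Pearson correlation of each numeric descriptor with the target,
    ordered by absolute value (strongest first)."""
    numeric = df.select_dtypes(include="number")
    # Constant columns have an undefined correlation, so drop them first.
    numeric = numeric.loc[:, numeric.nunique() > 1]
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values for illustration only; not taken from the dataset.
demo = pd.DataFrame(
    {
        "JSC": [10.0, 12.0, 14.0, 16.0, 18.0],
        "MolWt": [500.0, 600.0, 700.0, 800.0, 900.0],
        "TPSA": [90.0, 70.0, 80.0, 60.0, 65.0],
        "fr_azide": [0, 0, 0, 0, 0],
    }
)
ranking = rank_descriptors_by_correlation(demo)
print(ranking)
```

With the real file, the same function could be applied to `pd.read_excel("rdkitpce.xlsx")`, assuming the workbook loads as a single sheet with `JSC` as one of its columns.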
Variable Definitions for rdkitpce.xlsx
Note (addressing Query 1): Definitions, units, and interpretation keys for all 211 variables in rdkitpce.xlsx are provided below, grouped by descriptor category. All descriptors are computed using RDKit unless otherwise stated.
Target Variable
| Variable | Description | Unit | Notes |
|---|---|---|---|
| JSC | Short circuit current density | mA cm⁻² | Experimentally reported or ML-predicted value; the response variable for QSPR modeling |
Electronic State Descriptors (EState)
| Variable | Description | Unit |
|---|---|---|
| MaxAbsEStateIndex | Maximum absolute EState index across all atoms | Dimensionless |
| MaxEStateIndex | Maximum EState index | Dimensionless |
| MinAbsEStateIndex | Minimum absolute EState index | Dimensionless |
| MinEStateIndex | Minimum EState index | Dimensionless |
Drug-likeness and Complexity
| Variable | Description | Unit |
|---|---|---|
| qed | Quantitative Estimate of Drug-likeness (Bickerton et al.) | Dimensionless; range 0–1 (higher = more drug-like) |
| SPS | Spacial score (Krzyzanowski et al.), a measure of molecular complexity; RDKit's default descriptor normalizes it by heavy-atom count | Dimensionless; higher = more complex |
Molecular Weight Descriptors
| Variable | Description | Unit |
|---|---|---|
| MolWt | Average molecular weight (including isotopes) | g mol⁻¹ |
| HeavyAtomMolWt | Molecular weight of heavy atoms only | g mol⁻¹ |
| ExactMolWt | Exact monoisotopic molecular weight | g mol⁻¹ |
Electronic Properties
| Variable | Description | Unit |
|---|---|---|
| NumValenceElectrons | Total number of valence electrons | Count (integer) |
| NumRadicalElectrons | Total number of radical electrons | Count (integer) |
| MaxPartialCharge | Maximum Gasteiger partial charge on any atom | e (elementary charge) |
| MinPartialCharge | Minimum Gasteiger partial charge on any atom | e |
| MaxAbsPartialCharge | Maximum absolute Gasteiger partial charge | e |
| MinAbsPartialCharge | Minimum absolute Gasteiger partial charge | e |
Morgan Fingerprint Density
| Variable | Description | Unit |
|---|---|---|
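| FpDensityMorgan1 | Number of distinct Morgan fingerprint features at radius 1, divided by the heavy-atom count | Dimensionless; ≥ 0 (can exceed 1) |
| FpDensityMorgan2 | Number of distinct Morgan fingerprint features at radius 2, divided by the heavy-atom count | Dimensionless; ≥ 0 (can exceed 1) |
| FpDensityMorgan3 | Number of distinct Morgan fingerprint features at radius 3, divided by the heavy-atom count | Dimensionless; ≥ 0 (can exceed 1) |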
BCUT2D Descriptors (eigenvalue-based topological descriptors encoding MW, charge, logP, and molar refractivity)
| Variable | Description |
|---|---|
| BCUT2D_MWHI | Highest eigenvalue of the Burden matrix weighted by atomic mass |
| BCUT2D_MWLOW | Lowest eigenvalue of the Burden matrix weighted by atomic mass |
| BCUT2D_CHGHI | Highest eigenvalue weighted by Gasteiger charge |
| BCUT2D_CHGLO | Lowest eigenvalue weighted by Gasteiger charge |
| BCUT2D_LOGPHI | Highest eigenvalue weighted by atomic contribution to logP |
| BCUT2D_LOGPLOW | Lowest eigenvalue weighted by atomic contribution to logP |
| BCUT2D_MRHI | Highest eigenvalue weighted by molar refractivity contribution |
| BCUT2D_MRLOW | Lowest eigenvalue weighted by molar refractivity contribution |

All BCUT2D descriptors are dimensionless.
Topological/Graph-based Descriptors
| Variable | Description | Unit |
|---|---|---|
| AvgIpc | Average information content of the coefficients of the characteristic polynomial of the adjacency matrix | Dimensionless |
| BalabanJ | Balaban's connectivity index (J) | Dimensionless |
| BertzCT | Bertz complexity index | Dimensionless |
Molecular Connectivity Indices (Chi)
These are topological indices encoding molecular branching and connectivity. All are dimensionless.
| Variable | Description |
|---|---|
| Chi0 | Zeroth-order molecular connectivity index |
| Chi0n | Zeroth-order normalized connectivity index |
| Chi0v | Zeroth-order valence connectivity index |
| Chi1 | First-order connectivity index |
| Chi1n | First-order normalized connectivity index |
| Chi1v | First-order valence connectivity index |
| Chi2n | Second-order normalized connectivity index |
| Chi2v | Second-order valence connectivity index |
| Chi3n | Third-order normalized connectivity index |
| Chi3v | Third-order valence connectivity index |
| Chi4n | Fourth-order normalized connectivity index |
| Chi4v | Fourth-order valence connectivity index |
| HallKierAlpha | Hall–Kier alpha correction term for connectivity indices |
| Ipc | Information content of the coefficients of the characteristic polynomial of the adjacency matrix |
| Kappa1 | First Kier shape index |
| Kappa2 | Second Kier shape index |
| Kappa3 | Third Kier shape index |
Surface Area Descriptors
| Variable | Description | Unit |
|---|---|---|
| LabuteASA | Labute's approximate surface area | Ų |
| TPSA | Topological polar surface area | Ų |
PEOE_VSA Descriptors (Partial Equalization of Orbital Electronegativities — Van der Waals Surface Area, binned by charge)
PEOE_VSA1 through PEOE_VSA14: Each variable reports the surface area (Ų) of atoms falling within a specific partial charge bin. Higher bin indices correspond to more positive partial charges.
SMR_VSA Descriptors (Molar Refractivity — VSA, binned by atomic molar refractivity contribution)
SMR_VSA1 through SMR_VSA10: Each variable reports the surface area (Ų) of atoms within a specific molar refractivity bin.
SlogP_VSA Descriptors (Wildman–Crippen logP contribution — VSA, binned by logP contribution)
SlogP_VSA1 through SlogP_VSA12: Each variable reports the surface area (Ų) of atoms within a specific logP contribution bin.
EState_VSA and VSA_EState Descriptors
EState_VSA1 through EState_VSA11: Surface area (Ų) of atoms in EState index bins.
VSA_EState1 through VSA_EState10: Sum of EState indices for atoms in surface area bins. Units are dimensionless (EState contribution per bin).
Simple Count Descriptors
| Variable | Description | Unit |
|---|---|---|
| FractionCSP3 | Fraction of sp³-hybridized carbons | Dimensionless; range 0–1 |
| HeavyAtomCount | Total number of heavy atoms | Count (integer) |
| NHOHCount | Number of N–H and O–H bonds | Count (integer) |
| NOCount | Number of nitrogen and oxygen atoms | Count (integer) |
| NumAliphaticCarbocycles | Number of aliphatic carbocyclic rings | Count (integer) |
| NumAliphaticHeterocycles | Number of aliphatic heterocyclic rings | Count (integer) |
| NumAliphaticRings | Total number of aliphatic rings | Count (integer) |
| NumAromaticCarbocycles | Number of aromatic carbocyclic rings | Count (integer) |
| NumAromaticHeterocycles | Number of aromatic heterocyclic rings | Count (integer) |
| NumAromaticRings | Total number of aromatic rings | Count (integer) |
| NumHAcceptors | Number of hydrogen bond acceptors | Count (integer) |
| NumHDonors | Number of hydrogen bond donors | Count (integer) |
| NumHeteroatoms | Number of heteroatoms (non-C, non-H) | Count (integer) |
| NumRotatableBonds | Number of rotatable bonds | Count (integer) |
| NumSaturatedCarbocycles | Number of saturated carbocyclic rings | Count (integer) |
| NumSaturatedHeterocycles | Number of saturated heterocyclic rings | Count (integer) |
| NumSaturatedRings | Total number of saturated rings | Count (integer) |
| RingCount | Total number of rings | Count (integer) |
Lipophilicity and Refractivity
| Variable | Description | Unit |
|---|---|---|
| MolLogP | Wildman–Crippen logP (octanol–water partition coefficient) | Dimensionless (log-scale) |
| MolMR | Wildman–Crippen molar refractivity | cm³ mol⁻¹ |
Functional Group Fragment Counts (fr_*)
All fr_* variables are integer counts of specific functional group substructures, identified by SMARTS pattern matching in RDKit. All values are dimensionless counts (integers ≥ 0). The complete set of 85 fragment descriptors is listed below:
| Variable | Functional Group / Substructure |
|---|---|
| fr_Al_COO | Aliphatic carboxylic acids |
| fr_Al_OH | Aliphatic hydroxyl groups |
| fr_Al_OH_noTert | Aliphatic hydroxyl groups (excluding tertiary) |
| fr_ArN | N functional groups attached to aromatic rings |
| fr_Ar_COO | Aromatic carboxylic acids |
| fr_Ar_N | Aromatic N (general) |
| fr_Ar_NH | Aromatic N–H |
| fr_Ar_OH | Aromatic hydroxyl groups (phenols) |
| fr_COO | Carboxylate groups (–COO⁻) |
| fr_COO2 | Carboxylate groups (broader match) |
| fr_C_O | C=O groups (general) |
| fr_C_O_noCOO | C=O groups excluding carboxylates |
| fr_C_S | C=S groups |
| fr_HOCCN | HOCCN motifs |
| fr_Imine | Imine groups (C=N) |
| fr_NH0 | Tertiary amines (N with no H) |
| fr_NH1 | Secondary amines (N with one H) |
| fr_NH2 | Primary amines (N with two H) |
| fr_N_O | Hydroxylamine groups (N–O) |
| fr_Ndealkylation1 | N-dealkylation sites (type 1) |
| fr_Ndealkylation2 | N-dealkylation sites (type 2) |
| fr_Nhpyrrole | Pyrrole-type N–H |
| fr_SH | Thiol groups (–SH) |
| fr_aldehyde | Aldehyde groups |
| fr_alkyl_carbamate | Alkyl carbamate groups |
| fr_alkyl_halide | Alkyl halides |
| fr_allylic_oxid | Allylic oxidation sites |
| fr_amide | Amide bonds |
| fr_amidine | Amidine groups |
| fr_aniline | Aniline substructures (aromatic amine) |
| fr_aryl_methyl | Aryl methyl groups |
| fr_azide | Azide groups |
| fr_azo | Azo groups (–N=N–) |
| fr_barbitur | Barbiturate substructures |
| fr_benzene | Benzene rings |
| fr_benzodiazepine | Benzodiazepine substructures |
| fr_bicyclic | Bicyclic systems |
| fr_diazo | Diazo groups |
| fr_dihydropyridine | Dihydropyridine substructures |
| fr_epoxide | Epoxide rings |
| fr_ester | Ester groups |
| fr_ether | Ether linkages |
| fr_furan | Furan rings |
| fr_guanido | Guanidine groups |
| fr_halogen | Halogen atoms (F, Cl, Br, I) |
| fr_hdrzine | Hydrazine groups |
| fr_hdrzone | Hydrazone groups |
| fr_imidazole | Imidazole rings |
| fr_imide | Imide groups |
| fr_isocyan | Isocyanate groups |
| fr_isothiocyan | Isothiocyanate groups |
| fr_ketone | Ketone groups |
| fr_ketone_Topliss | Ketone groups (Topliss definition) |
| fr_lactam | Lactam rings |
| fr_lactone | Lactone rings |
| fr_methoxy | Methoxy groups (–OCH₃) |
| fr_morpholine | Morpholine rings |
| fr_nitrile | Nitrile groups (–C≡N) |
| fr_nitro | Nitro groups (–NO₂) |
| fr_nitro_arom | Aromatic nitro groups |
| fr_nitro_arom_nonortho | Non-ortho aromatic nitro groups |
| fr_nitroso | Nitroso groups (–N=O) |
| fr_oxazole | Oxazole rings |
| fr_oxime | Oxime groups |
| fr_para_hydroxylation | Para-hydroxylation sites |
| fr_phenol | Phenol groups |
| fr_phenol_noOrthoHbond | Phenols without ortho H-bonding |
| fr_phos_acid | Phosphoric acid groups |
| fr_phos_ester | Phosphate ester groups |
| fr_piperdine | Piperidine rings |
| fr_piperzine | Piperazine rings |
| fr_priamide | Primary amide groups |
| fr_prisulfonamd | Primary sulfonamide groups |
| fr_pyridine | Pyridine rings |
| fr_quatN | Quaternary nitrogen atoms |
| fr_sulfide | Sulfide groups (–S–) |
| fr_sulfonamd | Sulfonamide groups |
| fr_sulfone | Sulfone groups |
| fr_term_acetylene | Terminal alkyne groups |
| fr_tetrazole | Tetrazole rings |
| fr_thiazole | Thiazole rings |
| fr_thiocyan | Thiocyanate groups |
| fr_thiophene | Thiophene rings |
| fr_unbrch_alkane | Unbranched alkane carbons |
| fr_urea | Urea groups |
File 4: new_smiles.xlsx
- Format: Microsoft Excel Workbook (.xlsx)
- Role in workflow: Contains SMILES strings of newly generated virtual HTM candidates.
- Contents: Canonical SMILES strings and unique compound identifiers.
- Usage: Used for virtual compound generation and screening.
- Availability Note (addressing Query 3): This file is not included directly in the Dryad submission due to licensing constraints (non-compliance with the CC0 1.0 Universal license waiver required by Dryad). The data are instead provided as a supplementary file associated with the corresponding journal publication. Users wishing to access new_smiles.xlsx should refer to the supplementary materials of the published article. If access difficulties are encountered, the corresponding author may be contacted directly.
File 5: sali.xlsx
- Format: Microsoft Excel Workbook (.xlsx)
- Role in workflow: Output of the SALI analysis step in Untitled-checkpoint.ipynb.
- Contents: Structure–Activity Landscape Index (SALI) values for 1,700 molecule entries, used to identify activity cliffs among the HTM candidates.
Variable Definitions for sali.xlsx (addressing Query 2)
Clarification Note: An earlier version of this README described five columns for sali.xlsx (Compound_1, Compound_2, Tanimoto_Similarity, Delta_Jsc, SALI_Value), which did not match the three columns actually present in the data file. The definitions below reflect the actual column names as they appear in sali.xlsx (SMILES, JSC, SALI). These are the authoritative, correct definitions.
| Variable | Description | Type | Unit / Range | Interpretation |
|---|---|---|---|---|
| SMILES | Canonical SMILES string of the molecule | Text | N/A | Unique molecular structure identifier; can be used to reconstruct the molecule in RDKit or other cheminformatics tools |
| JSC | Predicted short circuit current density for the molecule | Numeric (float) | mA cm⁻² | Higher values indicate greater predicted photovoltaic performance; used as the activity measure in SALI calculation |
| SALI | Structure–Activity Landscape Index score for the molecule | Numeric (float) | Dimensionless (observed range: ~2.4–9.6) | Quantifies how abruptly Jsc changes relative to structural similarity for each molecule across its pairwise comparisons. Higher SALI values indicate that the molecule participates in activity cliffs, where small structural changes lead to large changes in Jsc. Values above ~7 may be considered high-cliff candidates. |
How SALI is calculated: For a given molecule i paired with molecule j, SALI(i,j) = |Jsc(i) − Jsc(j)| / (1 − Tanimoto similarity(i,j)), where Tanimoto similarity is calculated from Morgan fingerprints (radius 2) using RDKit. The per-molecule SALI value reported in this file represents the maximum SALI score across all pairwise comparisons involving that molecule.
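The per-molecule maximum described above can be illustrated with a small, dependency-free sketch. It is not the authors' code: the function names `tanimoto` and `max_sali`, the toy molecules, and the `eps` guard for identical fingerprints are all assumptions made for illustration, and fingerprints are represented as plain sets of on-bits rather than RDKit Morgan fingerprints.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def max_sali(fingerprints: dict, jsc: dict, eps: float = 1e-6) -> dict:
    """Per-molecule maximum of SALI(i, j) = |Jsc(i) - Jsc(j)| / (1 - sim(i, j))."""
    best = {name: 0.0 for name in fingerprints}
    for i, j in combinations(fingerprints, 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        # eps avoids division by zero when two fingerprints are identical;
        # how the original analysis handles that case is not stated in this README.
        sali = abs(jsc[i] - jsc[j]) / max(1.0 - sim, eps)
        best[i] = max(best[i], sali)
        best[j] = max(best[j], sali)
    return best


# Toy on-bit sets standing in for Morgan fingerprints of hypothetical molecules.
fps = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({7, 8, 9}),
}
jsc_values = {"mol_a": 10.0, "mol_b": 13.0, "mol_c": 10.5}
scores = max_sali(fps, jsc_values)
print(scores)
```

In this toy case, `mol_a` and `mol_b` share most of their bits but differ by 3 mA cm⁻² in Jsc, so both receive the highest score, while the structurally unrelated `mol_c` scores lower. In practice the sets would be replaced by RDKit fingerprints and a similarity call such as `DataStructs.TanimotoSimilarity`.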
File 6: mordred_-_Copy-checkpoint.ipynb
- Format: Jupyter Notebook (.ipynb)
- Role in workflow: Step 1 — Descriptor calculation pipeline.
- Contents: Computes >1,800 2D molecular descriptors using Mordred + RDKit.
- Dependencies: Python 3.13.5, RDKit 2025.03.3, Mordred, Pandas, NumPy.
- Input files used: SM.xlsx, smiles.xlsx, rdkitpce.xlsx.
File 7: Untitled-checkpoint.ipynb
- Format: Jupyter Notebook (.ipynb)
- Role in workflow: Step 2 — ML modeling and analysis.
- Contents: Feature selection, regression model training, evaluation, virtual screening, SALI analysis, and figure generation.
- Dependencies: Python 3.13.5, RDKit 2025.03.3, Pandas, NumPy, Scikit-learn, Matplotlib.
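As a rough sketch of the regression and evaluation step, the following trains the two regressor types named in the abstract (gradient boosting and AdaBoost) and reports R² on a held-out split. Everything specific here is an assumption: the data are synthetic rather than the descriptor matrix from rdkitpce.xlsx, and the default hyperparameters, 80/20 split, and random seeds are placeholders, not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the descriptor matrix (X) and Jsc target (y); not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "adaboost": AdaBoostRegressor(random_state=0),
}
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))

# Impurity-based importances from the fitted gradient boosting model,
# one value per input column.
importances = models["gradient_boosting"].feature_importances_
print(r2_scores)
print(importances)
```

With real data, `X` would hold the descriptor columns and `y` the `JSC` column, and the importances could be paired with column names to see which descriptors the model relies on most.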
Sharing/Access information
Other publicly available locations where the data may be accessed: N/A
Data were derived from the following sources: Molecular structures and associated Jsc values were curated from published literature and supplemented with computationally generated structures. All descriptor calculations were performed using RDKit (open source; https://www.rdkit.org) and Mordred (open source; https://github.com/mordred-descriptor/mordred).
The new_smiles.xlsx file, containing newly generated virtual HTM candidates, is available as supplementary data in the associated journal publication and is not included in this Dryad deposit due to licensing constraints.
