Data and code from: Discovery of wurtzite solid solutions with enhanced piezoelectric response using machine learning
Data files
May 28, 2026 version files 547.98 MB
- PiezoML-dryad.zip (547.97 MB)
- README.md (8.78 KB)
Abstract
While many piezoelectric materials are known, there is still great potential to improve the figures of merit of existing materials through compositional doping and forming solid solutions. Specifically, it has been shown that doping and alloying wurtzite-structured materials can improve the piezoelectric response; however, a vast compositional space has remained unexplored. In this work, we apply a multilevel screening protocol combining machine learning, chemical intuition, and thermodynamics to systematically discover dopant combinations in the wurtzite material space that improve the desired piezoelectric response. Through our protocol, we use computationally inexpensive screening calculations to consider more than 3000 possible ternary wurtzite solid solutions from nine different wurtzite base systems: AlN, BeO, CdS, CdSe, GaN, ZnO, ZnS, ZnSe, and AgI. Finally, based on thermodynamic analysis and explicit piezoelectric response calculations, we predict 11 materials with improved piezoelectric response, due to the incorporation of electropositive dopants.
Dataset DOI: https://doi.org/10.5061/dryad.vmcvdnd6s
1. Dataset Overview
This dataset accompanies the publication:
Behrendt, D.; Banerjee, S.; Zhang, J.; Rappe, A. M.
Discovery of wurtzite solid solutions with enhanced piezoelectric response using machine learning, Phys. Rev. Materials (2024).
The dataset contains all computational data, machine learning features, and analysis used to identify doped wurtzite materials with enhanced piezoelectric response.
The study explores over 3000 candidate ternary solid solutions derived from nine base wurtzite materials:
AlN, BeO, CdS, CdSe, GaN, ZnO, ZnS, ZnSe, and AgI.
Using a multilevel screening workflow combining density functional theory (DFT) and machine learning:
- ~500 candidates were screened using a proxy model
- 30 promising materials were identified
- 11 materials were predicted to have significantly improved piezoelectric response
2. File Inventory and Structure
Main archive:
PiezoML-dryad.zip
2.1 Basemats/
Optimized base wurtzite materials used as starting structures.
Contents:
- Relaxed structures (DFT outputs)
- Piezoelectric tensor calculations (including e₃₃)
- Full input/output files for each calculation
2.2 Firstrun/
Initial dataset (~53 materials) used for proxy selection and model training.
2.2.1 AlN/
Dataset for dopant screening in AlN.
Subdirectories:
- AlN/: pure AlN calculations
- dopingtest*/: individual dopant configurations
Files:
- .csv: summary of calculated properties
- proxies.csv: computed proxy features used in screening
- dope.py: dopant structure generation
- features.py, configfeatures.py: feature extraction scripts
- .sh: workflow scripts
2.2.2 generice33/
General workflow for computing the piezoelectric coefficient e₃₃.
Method:
- Structures are strained along the c-axis from −1% to +1%
- Atomic positions are relaxed at each strain
- Polarization is computed using the Berry phase method
- e₃₃ is obtained from the slope of polarization vs. strain
Key scripts:
- generate.py: applies strain and runs calculations
- gete33.py: extracts e₃₃ and uncertainty
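The slope-based extraction described above can be illustrated with a minimal sketch. This is not the code in gete33.py; it assumes an ordinary least-squares fit of polarization against c-axis strain, takes the standard error of the slope as the uncertainty, and assumes the Berry phase polarizations have already been placed on the same polarization branch. The function name `fit_slope` and all numerical values are made up for illustration.

```python
import math

def fit_slope(strains, polarizations):
    """Ordinary least-squares slope of polarization vs. strain, with its standard error."""
    n = len(strains)
    if n < 3:
        raise ValueError("need at least three strain points to estimate an uncertainty")
    mean_x = sum(strains) / n
    mean_y = sum(polarizations) / n
    sxx = sum((x - mean_x) ** 2 for x in strains)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(strains, polarizations))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (slope and intercept are fitted).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(strains, polarizations))
    slope_err = math.sqrt(ss_res / (n - 2) / sxx)
    return slope, slope_err

# Illustrative values only: strains from -1% to +1% and invented polarizations in C/m^2.
strains = [-0.01, -0.005, 0.0, 0.005, 0.01]
polarizations = [1.335, 1.3425, 1.35, 1.3575, 1.365]

e33, e33_err = fit_slope(strains, polarizations)
print(f"e33 = {e33:.3f} +/- {e33_err:.3f} C/m^2")
```

Because strain is dimensionless, the slope carries the same units as the polarization (C/m²).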
2.3 Secondrun/
Active learning workflow used for large-scale screening of doped wurtzite materials.
This directory contains the primary machine learning and high-throughput computational workflow used to screen more than 3000 candidate ternary solid solutions derived from nine wurtzite base materials.
Directories and workflow components:
- candidates/: untested candidate materials generated for screening
- dataset/: database of tested materials and calculated properties stored in .csv format
- newcalcs/: DFT calculations for newly selected candidate materials
- generic/: template calculation files and scripts for base materials, modified automatically for new doped systems
- piezotest/: materials selected for explicit e₃₃ piezoelectric response calculations
- tools/: workflow automation and machine learning scripts used throughout the active learning procedure
- high-throughput tools/: scripts and utilities used for automated calculation setup, batch job submission, file organization, workflow management, and automated data collection during large-scale screening calculations
- formation energies/: calculations and analysis used to evaluate thermodynamic stability and formation energies of candidate solid solutions
- formationenergies-fixcell/: fixed-cell formation energy calculations used to compare energetic stability while constraining lattice parameters for consistency across candidate materials
Files:
- .csv files are used for intermediate datasets, machine learning training data, candidate tracking, and calculated material properties
- ./tools/runall.sh: executes one full iteration of the active learning and screening workflow
Machine learning methods used:
- Linear regression
- LASSO
- Ridge regression
- Recursive feature elimination (RFE)
- Random forest
Workflow overview:
- Generate candidate materials
- Compute primitive atomic descriptors and material features
- Train machine learning models to predict lattice c/a ratio
- Select promising low c/a candidates
- Run DFT calculations for the selected materials
- Update dataset and repeat iterative screening
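The selection step of one iteration can be sketched as follows. This is an illustration, not the logic of materialselection.py: a one-descriptor least-squares line stands in for the multi-feature models listed above, and the function names, material names, and numbers are all hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b for a single descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x

def select_low_ca(tested, candidates, n_select):
    """Rank untested candidates by predicted c/a and return the n_select lowest.

    tested:     {name: (descriptor value, calculated c/a)}
    candidates: {name: descriptor value}
    """
    a, b = fit_line([d for d, _ in tested.values()], [r for _, r in tested.values()])
    # Skip anything already in the dataset so it is not recalculated.
    untested = {name: d for name, d in candidates.items() if name not in tested}
    ranked = sorted(untested, key=lambda name: a * untested[name] + b)
    return ranked[:n_select]

# Hypothetical names and numbers, for illustration only.
tested = {"mat_A": (1.0, 1.60), "mat_B": (2.0, 1.58), "mat_C": (3.0, 1.56)}
candidates = {"mat_B": 2.0, "mat_D": 5.0, "mat_E": 0.5, "mat_F": 4.0}

picked = select_low_ca(tested, candidates, n_select=2)
print(picked)
```

In a full iteration, the c/a ratios calculated for the picked materials would be appended to the tested set before the next fit.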
Python scripts in tools/:
- cleancandidates.py: removes previously studied materials from the candidate database
- finishprediction.py: organizes and transfers files required to complete a screening iteration
- generatecandidate.py: generates possible doped atomic configurations according to user-defined constraints
- generatefeature.py: converts material compositions and atomic descriptor databases into machine learning feature vectors
  Usage: python3 generatefeature.py (matlist.csv) (outputfeaturelist.csv)
- featureselection.py: identifies important features correlated with target material response using multiple machine learning methods
- pearson.py: generates Pearson correlation matrices for material descriptors
- materialselection.py: predicts c/a ratios for candidate materials using trained machine learning models and selects new materials for DFT calculations
- runnewmats.py: performs vc-relax calculations for newly selected materials using template inputs from the generic/ directory
- collectratios.py: collects calculated c/a ratios from completed calculations and appends them to the growing dataset
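As an illustration of what a Pearson correlation matrix over material descriptors contains, here is a minimal standard-library sketch. It is not the code in pearson.py; the descriptor names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant column has zero variance and would raise ZeroDivisionError here.
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations: matrix[a][b]."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical descriptor columns, one value per material.
columns = {
    "desc_1": [1.0, 2.0, 3.0, 4.0],
    "desc_2": [2.0, 4.0, 6.0, 8.0],
    "desc_3": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

Descriptor pairs with correlations near ±1 carry largely redundant information, which is one common reason to inspect such a matrix before feature selection.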
2.4 plots/
Data used to generate figures in the publication:
- crossvalidateplot/ → model validation (Figure 3)
- primaryfeatsel/ → feature importance (Figure 4)
- e33scatter/ → final results (Figure 5)
- proxies/ → proxy selection (Figure 2)
3. Data Generation Methods
All first-principles calculations were performed using Quantum ESPRESSO.
Key computational details:
- 2 × 2 × 1 wurtzite supercells (16 atoms total)
- Two cation sites substituted with dopants
- Polarization computed via the Berry phase method
- Piezoelectric coefficient e₃₃ obtained via finite differences
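To give a sense of the configuration space implied by the supercell setup, the sketch below enumerates the ways to pick two of the eight cation sites in a 16-atom 2 × 2 × 1 wurtzite supercell. It is only a counting illustration: it does not remove symmetry-equivalent pairs, the site indices are arbitrary labels, and it is not the logic of dope.py or generatecandidate.py.

```python
from itertools import combinations

# A 2x2x1 wurtzite supercell has 16 atoms, half of which are cations.
N_CATION_SITES = 8

def dopant_site_pairs(n_sites=N_CATION_SITES):
    """All unordered pairs of cation-site indices that could host the two dopants."""
    return list(combinations(range(n_sites), 2))

pairs = dopant_site_pairs()
print(len(pairs))  # 28
print(pairs[:3])
```

Substituting two of eight cation sites corresponds to a 25% dopant fraction on the cation sublattice.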
4. Data Processing and Machine Learning
- Primitive features derived from atomic properties:
- Atomic mass
- Electronegativity
- Atomic radius
- Preferred valence
- Periodic table row and column
- Elemental melting temperature
- Feature statistics:
- Mean
- Standard deviation
- Minimum
- Maximum
- Range
- Approximately 50 primitive descriptors were generated for each material
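The construction of descriptors from atomic properties and their statistics can be sketched as below. This is an assumption-laden illustration, not the code in generatefeature.py: it uses only two properties, takes statistics over every atom in the cell, and uses the population standard deviation, none of which is confirmed by this README. The lookup table holds standard atomic masses and Pauling electronegativities, and the composition is chosen purely as an example.

```python
import statistics

# Small illustrative lookup table: atomic mass (u) and Pauling electronegativity.
ELEMENT_PROPS = {
    "Al": {"mass": 26.982, "electronegativity": 1.61},
    "Sc": {"mass": 44.956, "electronegativity": 1.36},
    "N": {"mass": 14.007, "electronegativity": 3.04},
}

def feature_vector(atoms):
    """Mean, standard deviation, minimum, maximum, and range of each property."""
    features = {}
    for prop in ("mass", "electronegativity"):
        values = [ELEMENT_PROPS[element][prop] for element in atoms]
        features[f"{prop}_mean"] = statistics.fmean(values)
        features[f"{prop}_std"] = statistics.pstdev(values)  # population std: an assumption
        features[f"{prop}_min"] = min(values)
        features[f"{prop}_max"] = max(values)
        features[f"{prop}_range"] = max(values) - min(values)
    return features

# Example 16-atom cell: 6 Al and 2 Sc on cation sites, 8 N on anion sites.
atoms = ["Al"] * 6 + ["Sc"] * 2 + ["N"] * 8

features = feature_vector(atoms)
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```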
Key insight:
- The lattice c/a ratio was identified as the most effective proxy for predicting enhanced piezoelectric response.
5. Variables and Data Formats
- .csv files contain tabular material data
- Each row corresponds to a material composition or DFT calculation
Typical columns include:
- Composition and dopant identity
- Lattice parameters (including c/a ratio)
- Piezoelectric response (e₃₃, units: C/m²)
- Formation energies
- Machine learning feature descriptors
- Predicted screening metrics
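A table of this kind can be read with Python's standard csv module, as in the sketch below. The column names (`material`, `c_over_a`, `e33`) and the rows are hypothetical; check the header row of each .csv file in the archive for the actual names.

```python
import csv
import io

# Hypothetical header and rows standing in for one of the dataset's .csv files.
sample = io.StringIO(
    "material,c_over_a,e33\n"
    "mat_A,1.60,1.5\n"
    "mat_B,1.55,\n"
    "mat_C,1.52,2.1\n"
)

rows = list(csv.DictReader(sample))

# Keep only rows where an explicit e33 value was recorded.
with_e33 = [row for row in rows if row["e33"] != ""]

# csv returns strings, so convert before comparing numerically.
lowest = min(rows, key=lambda row: float(row["c_over_a"]))

print([row["material"] for row in with_e33])
print(lowest["material"], lowest["c_over_a"])
```

To read a real file, replace the `io.StringIO` object with `open(path, newline="")`.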
6. Reuse Instructions
To reproduce the workflow:
- Install:
- Quantum ESPRESSO
- Python 3.x
- scikit-learn and standard scientific Python libraries
- Run:
- generice33/ for piezoelectric calculations
- Secondrun/tools/ for machine learning workflows
- formation energy workflows as described in the formation energies/ directories
- Follow iterative screening workflow:
- Generate candidates
- Compute material descriptors
- Train machine learning models
- Select promising candidates
- Run DFT calculations
- Update datasets and repeat screening
7. Notes and Caveats
- Some candidate materials may become structurally unstable and deviate from the wurtzite phase during relaxation
- Failed calculations are assigned high c/a values to exclude them from future candidate selection
- Batch scripts and submission files are HPC-dependent and may require modification for different computing environments
- Not all low c/a materials exhibit enhanced piezoelectric response
- Some workflows assume directory structures consistent with the included automation scripts
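The note on failed calculations can be illustrated with a short sketch: a failed relaxation receives a large placeholder c/a so that it sorts to the bottom of any low-c/a ranking. The sentinel value, function name, material names, and lattice parameters below are hypothetical, and this is not the code in collectratios.py.

```python
FAILED_CA = 99.0  # hypothetical sentinel; the value used in the dataset is not stated here

def collect_ratio(a, c, converged=True):
    """Return c/a, or the sentinel when the calculation failed or gave unusable values."""
    if not converged or a is None or c is None or a <= 0 or c <= 0:
        return FAILED_CA
    return c / a

# Illustrative lattice parameters in angstroms; mat_B stands for a failed calculation.
results = {
    "mat_A": collect_ratio(3.11, 4.98),
    "mat_B": collect_ratio(None, None, converged=False),
    "mat_C": collect_ratio(3.20, 4.96),
}

# Ascending sort puts the most promising (lowest c/a) first and failures last.
ranked = sorted(results, key=results.get)
print(ranked)
```

When analyzing the .csv files, such placeholder values should be filtered out before computing statistics or fitting models on c/a.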
8. Access Information
No additional external repositories are associated with this dataset.
9. Software Versions
- Quantum ESPRESSO
- Python 3.x
- scikit-learn for machine learning workflows
10. Contact Information
For questions regarding the dataset or workflow implementation, please contact the corresponding author of the associated publication.
