Decoratype-based materials informatics: Polaritype identification, convex hull DFT calculations, training data, and predicted compounds data
Data files
Dec 18, 2025 version files (1.95 GB total)
| File | Size |
|---|---|
| candidate_data_0_feat.csv | 8.52 MB |
| candidate_data_1_imputed.pkl | 14.11 MB |
| hull_calcs.tar.gz | 1.78 GB |
| known_data_compounds.json | 123.69 MB |
| known_data_prototypes.json | 22.27 MB |
| README.md | 23.63 KB |
| scaler_means.npy | 736 B |
| scaler_stds.npy | 736 B |
| scaler.joblib.gz | 2.98 KB |
| training_data.pkl | 2.16 MB |
Abstract
We introduce decoratypes as a structure taxonomy that classifies compounds based on site decorations of specific structural prototypes. Building on this foundation, a ferroelectric materials discovery framework is developed, integrating decoratypes with an active learning approach to accelerate exploration. In addition, six novel ferroelectric candidates are predicted, including three strain-activated ferroelectrics and three strain-activated hyperferroelectrics. These findings highlight the potential of the decoratype taxonomy to enhance our understanding of structure-driven material properties and facilitate the discovery of promising yet underexplored regions of chemical space. This repository contains density functional theory (DFT) convex hull calculations, materials data used to train the polaritype-based active learning model, and candidate compounds predicted by the recommender model.
This directory contains density functional theory (DFT) convex hull calculations, materials data used to train the polaritype-based recommender model, and candidate compounds predicted by the recommender model. Directory contents are described below.
Convex Hull DFT Data
hull_calcs.tar.gz
Compressed archive of DFT input and output files for convex hull stability analysis. Calculation directories are sorted into the following directory structure:
.
├── hull_calcs/
│   ├── comp_A4B2C/
│   │   ├── calc_A2B3C4_<identifier>--<AFLOW prototype designation>/
│   │   │   ├── relax
│   │   │   └── scf
│   │   └── calc_.../
│   ├── comp.../
│   └── incomplete_members/
│       ├── calc_AxBy_<identifier>--<AFLOW prototype designation>/
│       │   ├── relax
│       │   └── scf
│       └── calc_.../
The comp_A4B2C directories contain calculations for all ternaries in the chemical system of the candidate compound A4B2C. Directories prefixed with compknown_ contain calculations for known compounds in the polaritype of interest. Each calc_* directory contains one subdirectory for the ionic relaxation (relax) and one for the self-consistent field (SCF), or static, calculation (scf). The calc_* directories are labeled with the chemical formula, the structure identifier (or "candidate" for the predicted compounds), and the AFLOW prototype designation. Output files include std_isX_ibY_itZ.out, the VASP standard output for relaxation iteration Z run with ISIF = X and IBRION = Y. The incomplete_members directory collects all convex hull-relevant compounds with fewer than three elements, e.g. A3B2. When calculating the convex hull for each candidate compound, the appropriate unaries and binaries within that compound's chemical system were selected from this directory. Since prototyping/decoratyping was not important for the incomplete members, their prototype designations all appear as "None".
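The naming conventions above can be unpacked programmatically. The following is a minimal sketch, assuming that the formula itself contains no underscore and that the first `--` separates the identifier from the prototype designation; the helper names `parse_calc_dir` and `parse_std_out` are hypothetical and not part of the dataset.

```python
import re
from typing import NamedTuple, Optional


class CalcDir(NamedTuple):
    formula: str
    identifier: str            # database identifier, or "candidate"
    prototype: Optional[str]   # None when the label is the literal "None"


class StdOut(NamedTuple):
    isif: int
    ibrion: int
    iteration: int


def parse_calc_dir(name: str) -> CalcDir:
    """Split calc_<formula>_<identifier>--<prototype> into its parts."""
    name = name.rstrip("/")
    if not name.startswith("calc_"):
        raise ValueError(f"not a calc_* directory: {name!r}")
    # Assumption: the first "--" ends the identifier; the prototype
    # designation may itself contain underscores.
    head, sep, prototype = name[len("calc_"):].partition("--")
    # Assumption: the formula has no underscore, so the first "_" ends it.
    formula, _, identifier = head.partition("_")
    if not sep or not formula or not identifier:
        raise ValueError(f"unexpected directory name: {name!r}")
    return CalcDir(formula, identifier, None if prototype == "None" else prototype)


_STD_OUT = re.compile(r"std_is(-?\d+)_ib(-?\d+)_it(\d+)\.out")


def parse_std_out(filename: str) -> StdOut:
    """Read ISIF (X), IBRION (Y), and iteration (Z) from std_isX_ibY_itZ.out."""
    match = _STD_OUT.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected output file name: {filename!r}")
    return StdOut(*(int(group) for group in match.groups()))
```

For example, `parse_calc_dir("calc_A3B2_some-id--None")` yields a record whose `prototype` is `None`, matching the convention used for the incomplete members; the identifier `some-id` is a placeholder.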
Training Data (Materials Properties)
known_data_compounds.json
Augmented dataset containing materials properties for experimentally validated structures/compounds from the Materials Project and the ICSD. Also includes assigned polaritypes. This dataset contains element-resolved, site-resolved, and aggregate statistical descriptors of crystalline compounds. Column names follow consistent naming conventions that encode their meaning.
1. Aggregate Statistical Descriptors Across Elements
Columns of the form:
<statistic>_<property>
describe aggregate statistics computed across all distinct elements in the compound.
1.1 Supported Statistics
| Statistic | Meaning |
|---|---|
| minimum | Minimum value among the elements |
| maximum | Maximum value among the elements |
| mean | Arithmetic mean across the elements |
| range | Difference between maximum and minimum |
| avg_dev | Average absolute deviation from the mean |
| mode | Most frequently occurring value |
1.2 Supported Properties
| Property | Description | Units |
|---|---|---|
| AtomicWeight | Atomic weight | amu |
| Number | Atomic number | – |
| MendeleevNumber | Mendeleev number | – |
| Column | Periodic table group (column) | – |
| Row | Periodic table period (row) | – |
| CovalentRadius | Covalent radius | Å |
| Electronegativity | Pauling electronegativity | – |
| MeltingT | Melting temperature | K |
| GSvolume_pa | Ground-state atomic volume per atom | ų/atom |
| GSbandgap | Ground-state bandgap | eV |
| GSmagmom | Ground-state magnetic moment | μB |
| SpaceGroupNumber | Ground-state space group number | – |
| NValence | Total number of valence electrons | – |
| NUnfilled | Total number of unfilled valence states | – |
| NsValence, NpValence, NdValence, NfValence | Valence electrons in s/p/d/f orbitals | – |
| NsUnfilled, NpUnfilled, NdUnfilled, NfUnfilled | Unfilled valence states in s/p/d/f orbitals | – |
1.3 Example Interpretation
| Column | Meaning |
|---|---|
| mean_CovalentRadius | Mean covalent radius of all elements in the compound |
| maximum_NdValence | Maximum number of d-valence electrons among elements |
| avg_dev_Electronegativity | Average deviation of electronegativity values |
| range_GSbandgap | Range of elemental ground-state bandgaps |
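As an illustration of how the `<statistic>_<property>` columns relate to per-element values, here is a minimal sketch. It assumes each distinct element counts once (unweighted by stoichiometry), which is how the description above reads; the dataset's actual computation may differ. The function name `aggregate_descriptors` is hypothetical.

```python
from statistics import mean, mode


def aggregate_descriptors(prop, values):
    """Return {"<statistic>_<prop>": value} for one compound.

    `values` holds one elemental property value per distinct element.
    Assumption: unweighted statistics, one entry per distinct element.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one elemental value")
    mu = mean(values)
    stats = {
        "minimum": min(values),
        "maximum": max(values),
        "mean": mu,
        "range": max(values) - min(values),
        # Average absolute deviation from the mean.
        "avg_dev": mean([abs(v - mu) for v in values]),
        # On ties, statistics.mode (Python 3.8+) returns the first value
        # encountered; the dataset's tie-breaking rule is not documented here.
        "mode": mode(values),
    }
    return {f"{name}_{prop}": value for name, value in stats.items()}
```

Calling `aggregate_descriptors("CovalentRadius", [1.0, 2.0, 3.0])` with placeholder values produces keys such as `mean_CovalentRadius` and `avg_dev_CovalentRadius`, mirroring the column naming convention.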
2. Site-Resolved Elemental Descriptors (typeX_*)
Columns of the form:
typeX_<property>
describe properties of the element occupying Wyckoff site X, where:
| X | Site |
|---|---|
| 0 | Site B |
| 1 | Site C |
| 2 | Site A |
2.1 Supported Site Properties
These mirror the elemental properties listed in Section 1.2, but apply to a specific crystallographic site.
| Property | Description | Units |
|---|---|---|
| sym | Element symbol | – |
| oxi | Oxidation state | – |
| AtomicWeight | Atomic weight | amu |
| Number | Atomic number | – |
| MendeleevNumber | Mendeleev number | – |
| Column | Periodic table group | – |
| Row | Periodic table period | – |
| CovalentRadius | Covalent radius | Å |
| Electronegativity | Pauling electronegativity | – |
| MeltingT | Melting temperature | K |
| GSvolume_pa | Ground-state atomic volume per atom | ų/atom |
| GSbandgap | Ground-state bandgap | eV |
| GSmagmom | Ground-state magnetic moment | μB |
| SpaceGroupNumber | Ground-state space group number | – |
| first_ioniz | First ionization energy | eV |
| NValence, NUnfilled | Total valence / unfilled states | – |
| Ns*, Np*, Nd*, Nf* | Orbital-resolved valence / unfilled counts | – |
2.2 Example Interpretation
| Column | Meaning |
|---|---|
| type1_Electronegativity | Electronegativity of element at site C |
| type0_NdUnfilled | Unfilled d-valence states at site B |
| type2_first_ioniz | First ionization energy of element at site A |
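Because the index-to-site mapping (0 is B, 1 is C, 2 is A) is easy to misread, a small lookup helper can make column access less error-prone. This is a sketch only: the names `SITE_TO_INDEX`, `site_column`, and `site_view` are hypothetical, and the mapping is copied from the table above on the assumption that it applies to every row.

```python
# Mapping taken from the site table above (X -> site), inverted for lookup.
SITE_TO_INDEX = {"B": 0, "C": 1, "A": 2}


def site_column(site: str, prop: str) -> str:
    """Build the typeX_<property> column name for a site label."""
    return f"type{SITE_TO_INDEX[site]}_{prop}"


def site_view(row: dict, site: str) -> dict:
    """Collect all typeX_* entries of one row for a site, keyed by property."""
    prefix = f"type{SITE_TO_INDEX[site]}_"
    return {
        key[len(prefix):]: value
        for key, value in row.items()
        if key.startswith(prefix)
    }
```

For instance, `site_column("A", "first_ioniz")` returns `type2_first_ioniz`, the column listed in the example table.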
3. Multi-Site Vector Descriptors (type_*)
These columns store ordered vectors corresponding to Wyckoff sites, consistent with the typeX_* ordering.
| Column | Description |
|---|---|
| type_sym | Element symbols at each Wyckoff site |
| type_oxi | Oxidation states at each Wyckoff site |
| type_ion | Ionic polarity at each Wyckoff site |
| type_counts | Number of atoms at each Wyckoff site |
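Since the vector columns share the typeX_* ordering, position X of `type_sym` should name the same element as `typeX_sym`, and pairing `type_sym` with `type_counts` gives per-element atom counts. The sketch below assumes the vectors are stored as plain lists and that a row is available as a dictionary; both helper names are hypothetical.

```python
from collections import Counter


def composition_from_sites(type_sym, type_counts):
    """Sum atom counts per element across Wyckoff sites."""
    if len(type_sym) != len(type_counts):
        raise ValueError("type_sym and type_counts must have equal length")
    totals = Counter()
    for symbol, count in zip(type_sym, type_counts):
        totals[symbol] += count  # the same element may occupy several sites
    return dict(totals)


def site_vectors_aligned(row):
    """Check that type_sym[X] matches the scalar column typeX_sym."""
    return all(
        row.get(f"type{index}_sym") == symbol
        for index, symbol in enumerate(row["type_sym"])
    )
```

A row for which `site_vectors_aligned` returns `False` would indicate that the ordering assumption does not hold for that entry and should be inspected by hand.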
4. Composition & Structural Metadata
| Column | Description |
|---|---|
| formula | Chemical formula |
| anonymized_formula | Stoichiometry-only anonymized formula |
| species | List of distinct chemical species |
| n_elem | Number of distinct elements |
| pearson | Pearson symbol |
| sg_num | Space group number |
| struc_rep | Representative structure/compound serving as a unique key for the structural prototype; used to look up the prototype's member compounds |
| polaritype_idx | Polaritype index |
| in_dom_polaritype | Whether compound is in dominant polaritype |
| source | Data source: MP (Materials Project) or ICSD (Inorganic Crystal Structure Database) |
| ident | Database identifier |
| fname | Associated structure filename |
5. Electronic & Bonding Descriptors
| Column | Description | Units |
|---|---|---|
| band_gap | Electronic band gap | eV |
| band_center | Band center relative to vacuum | eV |
| e_above_hull | Energy above convex hull | eV/atom |
| avg_ionic_char | Average ionic bond character | – |
| max_ionic_char | Maximum ionic bond character | – |
| avg_anion_electron_affinity | Mean anion electron affinity | eV |
| cation_ratio | Fraction of cation species | – |
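To work with the columns described above, the JSON file first has to be turned into rows. The on-disk layout is not specified here, so the sketch below accepts two common table layouts (a list of row objects, or a column-oriented mapping of the form `{column: {index: value}}`); both are assumptions, and the helper names are hypothetical. `pandas.read_json` may be a convenient alternative if pandas is available.

```python
import json


def to_records(payload):
    """Normalize two common JSON table layouts into a list of row dicts."""
    if isinstance(payload, list):
        # Assumed layout 1: [{"formula": ..., "source": ...}, ...]
        return payload
    if isinstance(payload, dict):
        # Assumed layout 2: {"formula": {"0": ..., "1": ...}, ...}
        columns = list(payload)
        index = list(payload[columns[0]]) if columns else []
        return [{col: payload[col].get(i) for col in columns} for i in index]
    raise TypeError(f"unsupported JSON layout: {type(payload).__name__}")


def load_records(path):
    """Read a JSON table from disk and return its rows."""
    with open(path, encoding="utf-8") as handle:
        return to_records(json.load(handle))


def select(records, source=None, dominant_only=False):
    """Filter rows by data source and, optionally, dominant polaritype."""
    rows = records
    if source is not None:
        rows = [r for r in rows if r.get("source") == source]
    if dominant_only:
        rows = [r for r in rows if bool(r.get("in_dom_polaritype"))]
    return rows
```

A typical call would be `select(load_records("known_data_compounds.json"), source="MP", dominant_only=True)`, which keeps only Materials Project entries flagged as belonging to the dominant polaritype.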
known_data_prototypes.json
Augmented dataset containing structural prototype grouping and polaritype population information.
Column Descriptions
| Column Name | Description |
|---|---|
| struc_rep | representative structure/compound serving as a unique key for this structural prototype, used to find member compounds in the dataset described above |
| ntypes | number of unique sites in this structural prototype |
| natoms | number of atoms in the unit cell of the prototype |
| n_elem | number of distinct chemical elements present in the prototype stoichiometry |
| polaritype_counts | number of observed compounds in each polaritype (matches ordering of polaritype_idx) |
| polaritype_reps | list of lists of representations of equivalent polaritypes with +1/-1 representing cations/anions -- outer list indicates symmetrically distinct polaritypes and inner list indicates symmetrically equivalent polaritypes (matches ordering of polaritype_idx) |
| equivalent_decorations | list of lists of symmetrically equivalent decorations |
| nontrivial_polaritype_family | whether or not the prototype contains members in more than one polaritype |
| grouped_Wyckoff_positions | list of Wyckoff site descriptions (with species of structure representative) -- values in the type field correspond to indexing in all other type* columns |
| prototype_designation | AFLOW prototype designation |
| population | total number of compounds observed in this prototype |
| params_list | list of parameters which uniquely define a member of the structural prototype |
| n_free_param | number of parameters -- the length of the params_list |
| type_counts | number of species at each Wyckoff site, ordered to match other type columns |
| stoichiometry | reduced stoichiometric ratios defining the prototype composition, independent of specific element identities |
| space_group | space group number associated with the structural prototype |
| anonymized_formula | anonymized chemical formula encoding the prototype stoichiometry without element identities |
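The prototype columns can be cross-checked against each other and linked to the compound table through `struc_rep`. The sketch below assumes each prototype and compound is available as a dictionary keyed by column name, and it treats the most populated polaritype as a stand-in for the "dominant" one, which is an assumption rather than a documented definition. The helper names are hypothetical.

```python
def summarize_prototype(proto):
    """Derive simple population facts from polaritype_counts."""
    counts = list(proto["polaritype_counts"])
    occupied = [index for index, count in enumerate(counts) if count > 0]
    return {
        "struc_rep": proto["struc_rep"],
        "n_polaritypes": len(counts),
        # Mirrors the description of nontrivial_polaritype_family:
        # members observed in more than one polaritype.
        "nontrivial": len(occupied) > 1,
        # Assumption: "dominant" means the most populated polaritype.
        "most_populated_idx": (
            max(range(len(counts)), key=counts.__getitem__) if counts else None
        ),
    }


def members_of(proto, compounds):
    """Find compound rows that share the prototype's struc_rep key."""
    return [c for c in compounds if c.get("struc_rep") == proto["struc_rep"]]
```

Comparing the `nontrivial` flag from this summary with the stored `nontrivial_polaritype_family` value is one quick consistency check after loading the file.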
training_data.pkl
Cleaned, imputed, and scaled version of the known data, used to train the polaritype classifier. Serialized as a Python pickle object. Columns in this dataset are described in the sections above, with the exception of the scaled_* columns, which are the corresponding columns after being passed through the provided scaler.
scaler.joblib.gz
Scikit-learn StandardScaler object saved with joblib and used to scale the data in training_data.pkl. The *.joblib.gz file can be loaded directly, without manual decompression, by importing joblib (https://joblib.readthedocs.io/en/stable/) and running the following command:
scaler = joblib.load("path/to/scaler.joblib.gz")
scaler_means.npy
NumPy array storing the per-feature means applied by the scaler.
scaler_stds.npy
NumPy array storing the per-feature standard deviations applied by the scaler.
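The two arrays make it possible to reproduce the scaling without scikit-learn. A StandardScaler transforms each feature as (x - mean) / std, so a minimal sketch looks like the following. It assumes the arrays are ordered the same way as the features the scaler was fitted on, which is not stated explicitly here; the function names are hypothetical. The arrays themselves can be read with `numpy.load` and converted with `.tolist()`.

```python
def standardize(row, means, stds):
    """Scale one row of raw feature values: z = (x - mean) / std."""
    if not (len(row) == len(means) == len(stds)):
        raise ValueError("row, means, and stds must have equal length")
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def unstandardize(scaled, means, stds):
    """Invert the scaling: x = z * std + mean."""
    if not (len(scaled) == len(means) == len(stds)):
        raise ValueError("scaled, means, and stds must have equal length")
    return [z * s + m for z, m, s in zip(scaled, means, stds)]
```

Applying `unstandardize` to a scaled_* row should recover the unscaled values up to floating-point error, which is a simple way to confirm that the feature ordering assumption holds.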
Candidate Compounds
candidate_data_0_feat.csv
Featurized representation of candidate compounds evaluated by the ML discovery pipeline.
Column Descriptions
See the column descriptions in the sections above for the majority of the columns in this table. Additional columns are described below. The cousin modifier indicates that the property refers to the known structure (from known_data_compounds.json) most closely related to the candidate compound, i.e. the one with the maximum composition substitution probability (described below).
| Column Name | Description |
|---|---|
| known | whether the structure is already catalogued in the Materials Project (MP) or the Inorganic Crystal Structure Database (ICSD) |
| comp_sub_prob | composition substitution probability: product of ionic substitution probabilities (as defined in https://doi.org/10.1021/ic102031h) of the candidate compound relative to the cousin compound |
| cousin_formula | chemical formula of the cousin compound |
| cousin_type_sym | element symbols of species at each Wyckoff site in the cousin compound, ordered consistently with the type convention |
| cousin_type_oxi | oxidation states of species at each Wyckoff site in the cousin compound, ordered consistently with the type convention |
| cousin_e_above_hull | [eV/atom] energy above the convex hull of the cousin compound, indicating its thermodynamic stability |
| cousin_id | database identifier of the cousin compound |
| subs | list of elemental substitutions applied to transform the cousin compound into the candidate compound, e.g., [[Species Mn2-, Species O2-], [Species Bi2+, Species Eu2+]] indicates that where the candidate compound contains Mn2-, the cousin compound contains O2- and where the candidate compound contains Bi2+, the cousin compound contains Eu2+ |
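The relationship between `subs`, `comp_sub_prob`, and the choice of cousin can be sketched as follows. The pairwise probabilities come from the substitution model cited above and are not reproduced here; the lookup table `pair_prob`, its key format, and the function names are hypothetical stand-ins.

```python
from math import prod


def comp_sub_prob(subs, pair_prob):
    """Multiply pairwise ionic substitution probabilities.

    `subs` is a list of (candidate_species, cousin_species) pairs, in the
    same order as the subs column. `pair_prob` maps such a pair to its
    probability (hypothetical lookup; real values come from the cited model).
    """
    return prod(pair_prob[(cand, cousin)] for cand, cousin in subs)


def best_cousin(options, pair_prob):
    """Pick the cousin with the maximum composition substitution probability.

    `options` maps a cousin identifier to its list of substitution pairs.
    """
    return max(options, key=lambda key: comp_sub_prob(options[key], pair_prob))
```

With an empty substitution list, `comp_sub_prob` returns 1, which corresponds to a cousin that is chemically identical to the candidate.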
candidate_data_1_imputed.pkl
Preprocessed candidate dataset with missing values imputed and values scaled. Serialized as a Python pickle object. Columns in this dataset are described in the sections above, with the exception of the scaled_* columns, which are the corresponding columns after being passed through the provided scaler.
