Data from: Predictive design of crystallographic chiral separation
Data files
Aug 27, 2025 version files (39.04 GB total):
- 3dmolcsp_trained_models.tar.gz (15.45 GB)
- ChiralSeparationCleanedWithNoisyDataDisclosable.csv (948.16 KB)
- md-and-reps.tar.gz (11.42 GB)
- molecules.db (106.50 KB)
- ProspectiveExperimentalValidationCleaned.csv (128.70 KB)
- ProspectiveExperimentScreenWithPreds.csv (4.70 MB)
- README.md (14.43 KB)
- trained_models_v1.tar.gz (2.03 GB)
- trained_models_v2.tar.gz (10.13 GB)
Abstract
The efficient separation of chiral molecules is a fundamental challenge in the manufacture of pharmaceuticals and light-polarising materials. We developed an approach that combines machine learning with a physics-based representation to predict resolving agents for chiral molecules, using a transformer-based neural network. On historical data, our approach is 4-6 times more accurate than current practice. We further validate the model in a prospective experiment, where we use the model to design a resolution screen for six unseen racemates. We successfully resolved three of the six mixtures in a single round of experiments and obtained an overall 8-to-1 true positive to false negative ratio. Together with this study, we release a previously proprietary dataset of over 6,000 resolution experiments, the largest diastereomeric salt crystallisation dataset to date. More broadly, our approach and open crystallization data lay the foundation for accelerating and reducing the costs of chiral resolutions.
https://doi.org/10.5061/dryad.d2547d89c
Data supporting the manuscript Predictive design of crystallographic chiral separation. We provide the training data, the prospective screen designed with the model, and the raw experimental results of the prospective screen.
In addition, we provide the models used for the retrospective and prospective experiments, as well as the autoencoders used to compress the representations.
Description of the data and file structure
CSV Files
ChiralSeparationCleanedWithNoisyDataDisclosable.csv
This file contains the historical data used to train the machine learning models. Each row represents a single diastereomeric resolution experiment. The method for generating the exact cross-validation splits used in the manuscript can be found in the GitHub repository below.
- notebook_id: (String) The internal experiment identifier.
- solvent_1, solvent_2: (String) The chemical names of the solvents used.
- solvent_1_frac, solvent_2_frac: (Float) The volume fraction of the corresponding solvent in the mixture, ranging from 0.0 to 1.0.
- reactant: (String) The SMILES string representation of the racemate molecule being resolved.
- reactant_id: (String) A unique identifier for the reactant molecule.
- separator: (String) The SMILES string representation of the enantiopure resolving agent.
- separator_id: (String) A unique identifier for the resolving agent.
- mass_balance_solid: (Float) The mass fraction of the initial racemate that precipitated out of the solution as a solid salt.
- ee_in_solid: (Float) The enantiomeric excess (e.e.) of the precipitated solid, expressed as a fraction. A value of 1.0 represents an enantiopure solid.
- success: (Integer, Binary) A flag indicating if the resolution was considered successful for model training purposes. 1 = successful, 0 = unsuccessful.
- possible_outcome: (Integer, Binary) A data validation flag indicating if the recorded ee_in_solid and mass_balance_solid values are physically possible. 1 = possible. Experiments with impossible outcomes were excluded from model training.
- is_valid, failure, achiral: (Boolean) Legacy columns from the data cleaning process. For all data points in this file, is_valid is True, failure is False, and achiral is False.
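As a minimal sketch of how these columns might be read, the snippet below parses the file with Python's standard csv module and keeps only rows flagged as physically possible. The two sample rows are made up to mimic the documented layout (they are not taken from the dataset), only a subset of the columns is shown, and the assumption that blank numeric cells can occur is ours.

```python
import csv
import io

# Two MADE-UP rows that only mimic the documented columns (subset shown);
# they are not taken from the dataset.
SAMPLE_CSV = """notebook_id,solvent_1,solvent_2,solvent_1_frac,solvent_2_frac,mass_balance_solid,ee_in_solid,success,possible_outcome
EXP-A,ethanol,water,0.9,0.1,0.35,0.80,1,1
EXP-B,acetone,,1.0,0.0,0.90,0.95,0,0
"""

FLOAT_COLUMNS = ("solvent_1_frac", "solvent_2_frac", "mass_balance_solid", "ee_in_solid")
FLAG_COLUMNS = ("success", "possible_outcome")


def read_experiments(handle):
    """Parse CSV rows into dicts, converting the documented numeric columns."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for name in FLOAT_COLUMNS:
            # Assumption: a blank cell means "no value"; keep it as None.
            row[name] = float(raw[name]) if raw[name] != "" else None
        for name in FLAG_COLUMNS:
            # int(float(...)) tolerates flags written as "1" or "1.0".
            row[name] = int(float(raw[name]))
        rows.append(row)
    return rows


def physically_possible(rows):
    """Keep only experiments whose possible_outcome flag equals 1."""
    return [row for row in rows if row["possible_outcome"] == 1]


rows = read_experiments(io.StringIO(SAMPLE_CSV))
usable = physically_possible(rows)
print(len(rows), len(usable))
```

With the real file, the same function could be called as `read_experiments(open("ChiralSeparationCleanedWithNoisyDataDisclosable.csv", newline=""))`.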
ProspectiveExperimentScreenWithPreds.csv
This file contains the full set of model predictions for the virtual screen performed in this work. Each row represents a predicted outcome for a specific combination of a reactant, a resolving agent, and solvent conditions.
- reactant_PCAT_id: (String) Unique identifier for the reactant (racemate) molecule.
- reactant: (String) The SMILES string representation of the reactant molecule.
- separator_PCAT_id: (String) Unique identifier for the resolving agent (separator) molecule.
- separator: (String) The SMILES string representation of the resolving agent molecule.
- solvent_1, solvent_2: (String) The chemical names of the solvents used. A blank value indicates that only one solvent was used.
- solvent_1_frac, solvent_2_frac: (Float) The volume fraction of the corresponding solvent in the mixture, ranging from 0.0 to 1.0.
- reactant_id, separator_id, solvent_1_id, solvent_2_id: (Integer/String) Legacy numerical identifiers used during data processing. These are included for completeness, but have no meaning now.
- y_pred_0 to y_pred_4: (Float) The raw prediction score (probability of success) from each of the 5 individual models in the ensemble. Each model was trained on a different cross-validation fold of the training data.
- y_pred_mean: (Float) The mean of the five individual model predictions (y_pred_0 through y_pred_4). This value represents the final ensemble prediction and was used to rank the candidates.
- y_pred_std: (Float) The standard deviation of the five individual model predictions.
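The relationship between the per-fold scores and the ensemble columns can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the candidate rows are made up, and the source does not say whether y_pred_std uses the population or the sample convention, so the population form used here may need to be swapped for `statistics.stdev` to match the file.

```python
import statistics

FOLD_COLUMNS = [f"y_pred_{i}" for i in range(5)]


def ensemble_summary(row):
    """Return (mean, population std) of the five per-fold prediction scores."""
    scores = [float(row[name]) for name in FOLD_COLUMNS]
    return statistics.fmean(scores), statistics.pstdev(scores)


def rank_candidates(rows):
    """Sort candidates from highest to lowest ensemble mean."""
    return sorted(rows, key=lambda row: ensemble_summary(row)[0], reverse=True)


# MADE-UP candidates that only imitate the documented column names.
candidates = [
    {"separator_PCAT_id": "SEP-1", "y_pred_0": 0.2, "y_pred_1": 0.4,
     "y_pred_2": 0.3, "y_pred_3": 0.1, "y_pred_4": 0.5},
    {"separator_PCAT_id": "SEP-2", "y_pred_0": 0.9, "y_pred_1": 0.7,
     "y_pred_2": 0.8, "y_pred_3": 0.6, "y_pred_4": 1.0},
]

ranked = rank_candidates(candidates)
best_mean, best_std = ensemble_summary(ranked[0])
print(ranked[0]["separator_PCAT_id"], round(best_mean, 3), round(best_std, 3))
```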
ProspectiveExperimentalValidationCleaned.csv
This file contains the experimental results from the prospective screen, where conditions predicted to be either successful or unsuccessful were tested in the lab.
- Racemate_ID, Resolving_agent_ID: (String) Identifiers for the racemate and resolving agent used in these experiments. Note, these do not necessarily match other identifiers.
- Prediction: (String, Categorical) Indicates whether the model predicted this experimental condition to be successful (Predicted Good) or unsuccessful.
- DesignNo: (String) An identifier for the specific racemate system being tested (e.g., 'Design 1').
- Solvent: (String) The name of the pure solvent used for the experiment.
- Liquor Yield: (Float) The percentage (%) of the initial racemate mass that remained dissolved in the mother liquor.
- Liquor ee: (Float) The enantiomeric excess (%) of the racemate measured in the mother liquor.
- Isolated?: (String) A flag indicating if the precipitated solid was physically isolated and analyzed directly. If Isolated, it means a direct measurement of the solid phase was performed.
- Inferred Solids Yield: (Float) The yield (%) of the solid, as inferred from the mother liquor measurements (Liquor Yield, Liquor ee).
- Inferred Solids ee: (Float) The enantiomeric excess (%) of the solid, as inferred from the mother liquor measurements.
- mfrac: (Float) This is the primary result column for yield and should be used for analysis. This equals Inferred Solids Yield, unless Isolated? is Isolated, in which case a direct measurement of the solid overrides the value.
- ee: (Float) This is the primary result column for enantiomeric excess. This equals Inferred Solids ee, unless Isolated? is Isolated, in which case a direct measurement of the solid overrides the value.
- reactant, separator: (String) The achiral SMILES strings for the reactant and separator, used for internal validation and matching between different identifiers.
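The file does not state the formula behind the inferred solid values. One plausible reading is a simple mass balance, sketched below under explicit assumptions: the starting material is exactly racemic (ee = 0), all material ends up either in the liquor or in the solid, and the solid ee is reported as a magnitude (the solid is enriched in the opposite enantiomer to the liquor). The function name is hypothetical and the authors' actual calculation may differ.

```python
def infer_solid_from_liquor(liquor_yield_pct, liquor_ee_pct):
    """Infer solid yield (%) and solid ee (%) from liquor measurements.

    Assumed mass balance for a racemic starting material:
        liquor_yield * liquor_ee = solid_yield * solid_ee
    with the two phases enriched in opposite enantiomers.
    """
    solid_yield_pct = 100.0 - liquor_yield_pct
    if solid_yield_pct <= 0.0:
        # Nothing precipitated, so the solid ee is undefined.
        return 0.0, None
    solid_ee_pct = liquor_yield_pct * liquor_ee_pct / solid_yield_pct
    return solid_yield_pct, solid_ee_pct


# Example: 60 % of the racemate stays in solution with 40 % ee.
solid_yield, solid_ee = infer_solid_from_liquor(60.0, 40.0)
print(solid_yield, solid_ee)
```

Under this balance, an inferred solid ee above 100 % would indicate measurements that are inconsistent with the assumptions, for example because of analytical noise.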
Model and Data Archives (.tar.gz)
These compressed archives contain the trained models and molecular dynamics (MD) data. To decompress, use the command tar -xzvf [filename.tar.gz].
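Because the archives are large, it can be useful to inspect an archive before unpacking it. The sketch below uses Python's standard tarfile module to list the top-level entries without extracting anything; the helper name is ours, and the demo archive is a throwaway file whose member names are made up to imitate the kind of layout described in this section.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def top_level_entries(archive_path):
    """Return the sorted top-level names in a .tar.gz without extracting it."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # PurePosixPath drops a leading "./" so "./Direct/x" -> "Direct".
            parts = PurePosixPath(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)


# Demo on a throwaway archive with MADE-UP member names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tar.gz"
    with tarfile.open(demo_path, "w:gz") as archive:
        for name in ("Direct/regression/train_set_0.torch", "trained_ae.torch"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    demo_entries = top_level_entries(demo_path)

print(demo_entries)
```

Listing a gzip-compressed archive still requires reading the whole file, so this may take a while on the multi-gigabyte archives.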
trained_models_v1.tar.gz
- Purpose: Contains the initial set of trained models used for the prospective (virtual screen) experiments and the initial retrospective analyses described in the manuscript. Note: These models are considered obsolete and have been superseded by the v2 models for retrospective analysis. They are provided here for historical completeness.
- Contents: Upon decompression, the archive contains several subdirectories, each corresponding to a different modeling approach, and a file for the representation compressor.
  - DirectClassification/, DirectRegression/, DataAblationRegression/, Finetuned/: These directories contain models for different experiments (e.g., classification vs. regression). Each directory contains several .torch files (e.g., train_set_0.torch, train_set_1.torch, etc.), where each file is a model trained on a specific cross-validation fold.
  - pretrained_compressor.torch: The trained autoencoder model used to compress the atomic representations.
  Each .torch file within the subdirectories is a dictionary containing:
  - run_params: The hyperparameters required to initialize the model architecture.
  - state_dict: The trained model weights (the PyTorch state dictionary).
- Instructions: The models can be loaded and used with PyTorch. The general procedure is to load the file, use the run_params to instantiate the correct model class, and then load the state dictionary into the model.
Example loading script in Python:
import torch
from DiastereomericResolutions.main.network import (
    SeparationNetworkClassifier,
    SeparationNetworkRegressor,
)

DEVICE = 'cpu'  # or 'cuda'
model_path = 'DirectRegression/train_set_0.torch'
model_type = 'regression'  # or 'classification'

# Load the saved dictionary
save_dict = torch.load(model_path, map_location=torch.device(DEVICE))

# Initialize the model architecture with saved hyperparameters
if model_type == "regression":
    model = SeparationNetworkRegressor(**save_dict["run_params"]).to(DEVICE)
else:
    model = SeparationNetworkClassifier(**save_dict["run_params"]).to(DEVICE)

# Load the trained weights and switch to inference mode
state_dict = save_dict["state_dict"]
model.load_state_dict(state_dict)
model.eval()
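Since each model directory holds one file per cross-validation fold, a small helper for collecting them in fold order may be convenient before applying the loading steps above to each file. This is a hedged sketch: the helper name is hypothetical, and it relies only on the train_set_N.torch naming given in this README. The demo creates empty placeholder files rather than real checkpoints.

```python
import re
import tempfile
from pathlib import Path

FOLD_PATTERN = re.compile(r"train_set_(\d+)\.torch")


def fold_checkpoints(model_dir):
    """Map fold index -> checkpoint path, ordered numerically by fold index."""
    found = {}
    for path in Path(model_dir).glob("train_set_*.torch"):
        match = FOLD_PATTERN.fullmatch(path.name)
        if match:
            found[int(match.group(1))] = path
    return dict(sorted(found.items()))


# Demo with empty placeholder files (not real checkpoints).
with tempfile.TemporaryDirectory() as tmp:
    for fold in (10, 2, 0):
        (Path(tmp) / f"train_set_{fold}.torch").touch()
    demo_folds = list(fold_checkpoints(tmp))

print(demo_folds)  # [0, 2, 10]
```

Sorting on the parsed integer avoids the lexicographic ordering in which train_set_10 would come before train_set_2.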
trained_models_v2.tar.gz
- Purpose: Contains the set of models that were re-trained during the peer review process. These models were used for the final retrospective experiments presented in the manuscript.
- Contents: Upon decompression, the archive is organized by modeling strategy.
  - Direct/: Contains models trained from scratch directly on the full diastereomeric resolution dataset. This directory is further subdivided into classification/ and regression/ models.
  - FineTuning/: Contains models that were first pre-trained on the full dataset, and then finetuned on the less noisy (solid mass fraction above 20%) subset. This directory contains subdirectories for various fine-tuning strategies (e.g., classification_from_regression).
  - trained_ae.torch: The trained autoencoder used for all v2 experiments. A key difference from the v1 compressor is that this autoencoder was trained only on the diastereomeric resolution dataset itself, not a larger external dataset.
  - trained_ae_chk.torch: A training checkpoint file for the autoencoder, included for completeness. For inference, trained_ae.torch should be used.
  - logs/: Training log files.
  As with the v1 archive, the model directories contain .torch files for each cross-validation fold, which include both the model hyperparameters (run_params) and the trained weights (state_dict).
- Instructions: The loading procedure for these models is identical to that of the v1 models.
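The "less noisy" fine-tuning subset mentioned above can be approximated from the historical CSV with a one-line filter. The sketch assumes that "solid mass fraction" corresponds to the mass_balance_solid column (stored as a fraction) and that the 20% bound is strict; neither detail is confirmed here, and the exact subset used in the manuscript should be taken from the GitHub repository. The rows are made up.

```python
def less_noisy_subset(rows, min_solid_fraction=0.2):
    """Keep rows whose solid mass fraction exceeds the threshold."""
    return [
        row for row in rows
        if float(row["mass_balance_solid"]) > min_solid_fraction
    ]


# MADE-UP rows using the documented column name.
demo_rows = [
    {"notebook_id": "EXP-A", "mass_balance_solid": "0.35"},
    {"notebook_id": "EXP-B", "mass_balance_solid": "0.05"},
    {"notebook_id": "EXP-C", "mass_balance_solid": "0.20"},
]

kept_ids = [row["notebook_id"] for row in less_noisy_subset(demo_rows)]
print(kept_ids)
```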
3dmolcsp_trained_models.tar.gz
- Purpose: This archive contains models using the 3DMolCSP approach [1], which serve as a performance baseline against our proposed models. They are provided to allow for a direct comparison and to reproduce the benchmarking results presented in the manuscript.
- Contents: The archive contains several subdirectories, each corresponding to a different training strategy used for the baseline model (e.g., direct_classification, finetune_regression_from_regression). Each of these directories contains .pt files, with multiple seeds for each cross-validation training subset. These .pt files are PyTorch state dictionaries containing the trained model weights.
- Instructions: Important: These models use a different architecture from the v1 and v2 models and cannot be loaded using the code from the primary GitHub repository for this paper. To use these models, you must use the dedicated code available at the following repository:
https://github.com/RokasEl/3dmolcsp-diastereomeric-resolutions
That repository provides the necessary model class definitions and scripts to load these state dictionaries and reproduce the baseline results.
Raw Model Inputs and Molecule Database
This section describes the raw molecular dynamics (MD) data that serves as the direct input to our models' representation learning stage, and the database used to organize and identify these molecules.
md-and-reps.tar.gz
- Purpose: This archive contains the raw molecular dynamics (MD) trajectories and the processed atomic representations for each reactant-resolving agent pair. These representations are the direct inputs that are compressed by the autoencoder (trained_ae.torch) before being used by the main prediction models.
- Organization: The archive contains numerous subdirectories, each representing a unique simulation pair. The directory naming follows a specific pattern:
  - [reactant_id]-[separator_id](-rep): A simulation of the reactant and separator. The _id numbers are integer primary keys from the molecules.db database.
  - [reactant_id]-[separator_id]-mirrored(-rep): A simulation run with the enantiomer of the reactant molecule.
- Contents: Each subdirectory contains either:
  - frames.h5: An HDF5 file containing the raw, unprocessed MD trajectory data (atomic positions, velocities, etc.) for the pair.
  - rep.npz: (Contained in the -rep subfolders) A NumPy compressed archive containing the processed atomic representations derived from the MD trajectory. This file is the direct input for the data loading pipeline.
- Usage Note: The full data loading and processing pipeline that converts these raw files into model-ready tensors is intricate. For detailed implementation and usage, interested users should consult the data loader classes and scripts within the associated GitHub repository.
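For quick bookkeeping outside the full pipeline, the directory naming pattern described above can be parsed with a regular expression. This is a minimal sketch based only on the pattern stated in this README; the class and function names are hypothetical, and real directory names should be checked against it before relying on the result.

```python
import re
from typing import NamedTuple, Optional


class SimulationDir(NamedTuple):
    reactant_id: int
    separator_id: int
    mirrored: bool            # simulation used the reactant's enantiomer
    has_representation: bool  # "-rep" folder, expected to hold rep.npz


# [reactant_id]-[separator_id], optional "-mirrored", optional "-rep".
DIR_PATTERN = re.compile(r"(\d+)-(\d+)(-mirrored)?(-rep)?")


def parse_simulation_dir(name: str) -> Optional[SimulationDir]:
    """Parse a directory name, or return None if it does not fit the pattern."""
    match = DIR_PATTERN.fullmatch(name)
    if match is None:
        return None
    return SimulationDir(
        reactant_id=int(match.group(1)),
        separator_id=int(match.group(2)),
        mirrored=match.group(3) is not None,
        has_representation=match.group(4) is not None,
    )


print(parse_simulation_dir("12-34-mirrored-rep"))
```

The two integers can then be looked up in molecules.db to recover the corresponding SMILES strings.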
molecules.db
- Purpose: This is a simple SQLite database that provides a centralized mapping between molecule SMILES strings and the unique integer IDs used in the directory names of md-and-reps.tar.gz. It ensures that a molecule and its enantiomer are linked to a single, consistent ID.
- Schema: The database contains a single table named molecules with the following structure:
  - id: (INTEGER PRIMARY KEY) The unique numerical identifier for the molecule.
  - smiles: (TEXT UNIQUE) The RDKit-canonicalized SMILES string of the molecule.
  - alias: (TEXT UNIQUE) The SMILES string of the corresponding enantiomer.
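A lookup against this schema can be sketched with Python's built-in sqlite3 module. The example below builds an in-memory stand-in from the documented schema, since the real file is not assumed to be present; the inserted molecule (the two enantiomers of alanine) is purely illustrative and may not appear in molecules.db, and the helper name is hypothetical.

```python
import sqlite3


def find_molecule_id(connection, smiles):
    """Return the id whose smiles or alias equals the query, or None."""
    row = connection.execute(
        "SELECT id FROM molecules WHERE smiles = ? OR alias = ? LIMIT 1",
        (smiles, smiles),
    ).fetchone()
    return row[0] if row else None


# In-memory stand-in built from the documented schema (ILLUSTRATIVE data).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE molecules ("
    "id INTEGER PRIMARY KEY, smiles TEXT UNIQUE, alias TEXT UNIQUE)"
)
demo.execute(
    "INSERT INTO molecules (id, smiles, alias) VALUES (?, ?, ?)",
    (1, "C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O"),
)

id_from_smiles = find_molecule_id(demo, "C[C@H](N)C(=O)O")
id_from_alias = find_molecule_id(demo, "C[C@@H](N)C(=O)O")
id_missing = find_molecule_id(demo, "CCO")
demo.close()

print(id_from_smiles, id_from_alias, id_missing)
```

With the released file, the connection would instead be opened with `sqlite3.connect("molecules.db")`. Because matching is by exact string, a query SMILES would need to be canonicalized in the same way as the stored entries (RDKit, per the schema above).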
More information about the data can be found in the manuscript, and in this paper describing another recent data release.
Code/Software
Code related to the manuscript can be found at https://github.com/RokasEl/DiastereomericResolutions
[1] Hong, Y., Welch, C. J., Piras, P. & Tang, H. Enhanced Structure-Based Prediction of Chiral Stationary Phases for Chromatographic Enantioseparation from 3D Molecular Conformations. Anal. Chem. 96, 2351–2359 (2024).
High-Throughput Experimentation reactions were set up inside an INERT Inc. triple double sized glove box with O2 and H2O levels <20 ppm. Glass vials (0.7 mL, 8 x 30 mm) pre-equipped with stir bars were used for each reaction. The reactions were set up with the components and conditions described by each dataset entry at 0.04 mmol scale and 0.2 M final concentration with a 1:1 ratio of acid to base and 200 μL total volume. The reaction vials were sealed by crimp under the glove-box environment and placed in a metal Chemglass Optichem 96-well heating plate atop a general IKA stirrer heater plate with an external temperature probe to accurately and evenly control the plate temperature. The vials were heated for 1 hour at 80 °C before being cooled over 3 hours and left to stir for a further 15 hours at 25 °C. Solubility observations were made after the hour at 80 °C and after the 15-hour stir period at 25 °C. At the reaction end point, the vials were centrifuged to settle any precipitates, and the liquors were sampled and analysed by chiral SFC-MS for the determination of the liquor e.e. and the calculation of solids m.frac. and e.e.
For any enriched liquor hits, the liquor was manually pipetted away from the centrifuged solid and the solid re-slurried in fresh solvent (using the same solvent choice as the reaction, half volume, 100 μL) to rinse off residual liquor. The vial was centrifuged a second time and the liquor (100 μL) again removed from the centrifuged solid and combined with the original. Both the combined liquors and the residual solid were brought to an equal volume of 500 μL using MeOH to ensure solution before being analysed by chiral SFC-MS. Measurement of the crystallised salt product % enantiomeric enrichment (ee) from isolated solids superseded the product % ee calculated based on SFC-MS analysis of uncrystallised material in the liquors. iChem explorer (Reaction Analytics, US) and Virscidian Analytical Studio™ software were used for data analysis.
HPLC grade methanol, isopropanol, ammonium formate, ammonia (Fisher Scientific, Pittsburgh, PA, USA) and bulk grade carbon dioxide (AirGas West, Escondido, CA, USA) were used in this study. The CO2 was purified and pressurised to 1500 psig using a custom booster and purifier system from FLW, Inc. (Huntington Beach, CA, USA).
Analysis was performed using an Agilent 1260 SFC/MS system consisting of a binary pump, SFC control module, UV/DAD detector, and column compartment with an internal 6-position, 12-port valve, and 6120 MSD with an APCI source (Agilent, Inc., Santa Clara, CA, USA). A Gerstel MPS autosampler (Gerstel USA, MD, USA) was equipped with a 25 μL syringe and a 50 nL internal loop with a control method to vent CO2 from the loop prior to sample introduction. The effluent of the SFC was split to the MSD using a 3-way tee (Valco, Houston, TX, USA) and a 50 cm long, 50 μm i.d. PEEKsil capillary tubing (Trajan Scientific, NC, USA) located between the column outlet and BPR. All data were acquired using Agilent 64-bit ChemStation (Version C.01.10).
