Data from: Automation and machine learning drive rapid optimization of isoprenol production in Pseudomonas putida

Carruthers, David1; Kinnunen, Patrick1; Li, Yuerong1; Chen, Yan1; Gin, Jennifer1; Yunus, Ian1; Gaillard, William1; Tan, Stephen1; Adams, Paul1; Singh, Anup2; Sustarich, Jess3; Petzold, Christopher1; Mukhopadhyay, Aindrila1; Garcia Martin, Hector1; Lee, Taek Soon1

Published Aug 20, 2025 on Dryad. https://doi.org/10.5061/dryad.gtht76hzh

Data files

Aug 20, 2025 version files (30.04 MB total):

CRISPRi_automation_Pputida_proteomic_metadata.csv (333.29 KB)

CRISPRi_automation_Pputida_proteomic_Top3_peptide_quantification_method_data.csv (29.70 MB)

README.md (5.77 KB)

Abstract

Advances in genome engineering have improved our ability to perturb microbial metabolic networks, yet bioproduction campaigns often struggle with parsing complex metabolic datasets to efficiently enhance product titers. We address this challenge by coupling laboratory automation with machine learning to systematically optimize the production of isoprenol, a sustainable aviation fuel (SAF) precursor, in Pseudomonas putida. The simultaneous downregulation through CRISPR interference of combinations of up to four gene targets, guided by machine learning (ML), permitted us to increase isoprenol titer 5-fold in six consecutive DBTL cycles. Moreover, ML enabled us to swiftly explore a vast experimental design space of 800,000 possible combinations by strategically recommending approximately 400 priority constructs. High-throughput proteomics allowed us to validate CRISPRi downregulation and identify biological mechanisms driving production increases. Our work demonstrates that ML-driven automated DBTL cycles can rapidly enhance titers without specific biological knowledge, suggesting that it can be applied to any host, product, or pathway.

Dataset DOI: 10.5061/dryad.gtht76hzh

Description of the data and file structure

This dataset contains proteomic data key to the characterization and analyses of our CRISPRi study. It includes liquid chromatographic and mass spectrometric analysis of the proteomic samples of strains engineered for isoprenol production. High-throughput proteomics allowed us to validate CRISPRi downregulation and identify biological mechanisms driving production increases. In DBTL0, we used proteomics to determine whether an sgRNA downregulated its target gene, based on the following three criteria: 1) target abundance below 90% of the library mean for the target; 2) target abundance in the bottom quartile of target expression; and 3) isoprenol titer greater than 66.6 mg/L (approximately 40% of control isoprenol titer). In DBTL1-6, the proteomic abundance measurements were used to identify CRISPRi combinations in which all individual perturbations successfully interfered with their targets and to confirm dCas9 production.
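The three DBTL0 criteria can be read as a simple per-line filter. The following is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `passes_dbtl0_criteria` is hypothetical, the quartile cut point uses the default interpolation of Python's `statistics.quantiles` (the rule actually used in the study is not given here), and the abundance values are made up for illustration.

```python
from statistics import mean, quantiles

TITER_FLOOR_MG_L = 66.6  # titer threshold quoted in the text


def passes_dbtl0_criteria(abundance, titer, library_abundances):
    """Check one line against the three DBTL0 criteria for one target protein.

    abundance: target protein abundance measured in this line
    titer: isoprenol titer of this line, in mg/L
    library_abundances: abundance of the same target in every library line
    """
    library_mean = mean(library_abundances)
    # First quartile cut point; the interpolation rule is an assumption.
    first_quartile = quantiles(library_abundances, n=4)[0]
    return (
        abundance < 0.9 * library_mean      # criterion 1
        and abundance <= first_quartile     # criterion 2
        and titer > TITER_FLOOR_MG_L        # criterion 3
    )


# Made-up abundances of one target across eight library lines.
library = [10, 9, 11, 10, 12, 8, 10, 2]
print(passes_dbtl0_criteria(2, 100.0, library))  # strong knockdown, good titer
print(passes_dbtl0_criteria(8, 50.0, library))   # knockdown, but titer too low
```

All three conditions must hold together, so a line with a clear knockdown but a titer at or below 66.6 mg/L is still rejected.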

The dataset covers proteomic profiles from seven DBTL cycles (DBTL0 through DBTL6) and is suited to comparative studies and bioinformatics analysis. The raw mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD063733 (DBTL0), PXD063737 (DBTL1), PXD063738 (DBTL2), PXD063740 (DBTL3), PXD063743 (DBTL4), PXD063744 (DBTL5), and PXD063746 (DBTL6). DIA-NN, the peptide search program used to identify proteins, is freely available for download from https://github.com/vdemichev/DiaNN.

Files and variables

CRISPRi_automation_Pputida_proteomic_Top3_peptide_quantification_method_data.csv

This file contains proteomic data from Pseudomonas putida strains engineered to produce isoprenol. The protein accession IDs correspond to proteins identified by the DIA-NN peptide search program, and the quantitative values correspond to the Top3 peptide absolute protein quantification method as detailed in Ahrne et al. 2013 (DOI:10.1002/pmic.201300135), consisting of the average signal response of the three most intense tryptic peptides for each protein. The file also contains the isoprenol titer measured in mg/L for each line.
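To make the Top3 idea concrete, here is a minimal sketch assuming peptide-level intensities are available as plain lists per protein. It is an illustration, not the published pipeline: the function names are hypothetical, the input numbers are made up, and the conversion to a percentage (dividing each Top3 signal by the sum over all quantified proteins) is an assumption about how the normalization could be done.

```python
def top3_signal(peptide_intensities):
    """Average of the three most intense peptides, or None if fewer than three."""
    if len(peptide_intensities) < 3:
        return None
    top3 = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top3) / 3


def percent_of_proteome(peptides_by_protein):
    """Express each protein's Top3 signal as a percentage of the summed signal.

    Proteins without three peptides stay None. The normalization used here
    is an assumption made for this sketch.
    """
    signals = {p: top3_signal(v) for p, v in peptides_by_protein.items()}
    total = sum(s for s in signals.values() if s is not None)
    return {
        p: None if s is None or total == 0 else 100 * s / total
        for p, s in signals.items()
    }


# Made-up peptide intensities; "protC" has only two peptides.
example = {
    "protA": [30, 20, 10, 5],
    "protB": [60, 60, 60],
    "protC": [100, 1],
}
print(percent_of_proteome(example))
```

A protein with fewer than three peptides yields no value, which mirrors the blank fields in the data file.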

Description of the columns:

DBTL_Cycle: (int) The Design-Build-Test-Learn cycle number

Line_name: (str) The individual line name with replicate ID

Line: (str) The line name with the replicate ID removed

Replicate: (str) The replicate ID for the line

Isoprenol_titer: (float) The amount of isoprenol produced by the line as measured by GC-FID.

Isoprenol_titer_units: (str) Units of the Isoprenol_titer measurements.

All other column headers: UniProt accession IDs

All other table values: (float) Percentage of the proteome for the specific protein as calculated by the Top3 peptide absolute protein quantification method as detailed in Ahrne et al. 2013 (DOI:10.1002/pmic.201300135), consisting of the average signal response of the three most intense tryptic peptides for each protein. When the protein is not detected or is detected with fewer than three peptides, the field is left blank (nan).
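The column layout described above can be read with the Python standard library by treating the six named columns as annotations and every remaining column as a protein. This is a minimal sketch: the function names, the sample line names, and the accession-like headers `P00001` and `P00002` are placeholders invented for illustration, and blank fields are mapped to `nan` as the description suggests.

```python
import csv
import io
import math

# Annotation columns as listed in the column description.
ANNOTATION_COLUMNS = [
    "DBTL_Cycle", "Line_name", "Line", "Replicate",
    "Isoprenol_titer", "Isoprenol_titer_units",
]


def to_float(text):
    """Parse a numeric field, mapping blank fields to nan."""
    return float(text) if text.strip() else math.nan


def read_abundance_table(handle):
    """Return (protein_columns, rows) from an open CSV handle."""
    reader = csv.DictReader(handle)
    protein_columns = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
    rows = []
    for raw in reader:
        rows.append({
            "DBTL_Cycle": int(raw["DBTL_Cycle"]),
            "Line_name": raw["Line_name"],
            "Isoprenol_titer": to_float(raw["Isoprenol_titer"]),
            "proteins": {c: to_float(raw[c]) for c in protein_columns},
        })
    return protein_columns, rows


# Tiny made-up sample in the documented layout; not real data.
sample = io.StringIO(
    "DBTL_Cycle,Line_name,Line,Replicate,Isoprenol_titer,Isoprenol_titer_units,P00001,P00002\n"
    "0,line_A-R1,line_A,R1,150.0,mg/L,0.12,0.03\n"
    "0,line_A-R2,line_A,R2,160.5,mg/L,0.10,\n"
)
protein_columns, rows = read_abundance_table(sample)
print(protein_columns, len(rows))
```

For the real file, the same function could be called on `open(path, newline="")` in place of the in-memory sample.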

CRISPRi_automation_Pputida_proteomic_metadata.csv

This file contains the metadata for the proteomic experiments detailed in the CRISPRi_automation_Pputida_proteomic_Top3_peptide_quantification_method_data.csv file.

Description of the columns:

DBTL_Cycle: (int) The cycle number within a Design-Build-Test-Learn cycle framework.

Line_name: (str) A descriptive name for a specific experimental line or condition with the replicate ID

Line: (str) A numerical or coded identifier for an experimental line or condition.

Replicate_ID: (str) The replicate number for a given experimental condition. Indicates repeated measurements under identical conditions.

Culture_volume: (float) The volume of the culture used in the experiment.

Culture_format: (str) The format of the culture vessel.

Growth_temperature_Celsius: (float) The temperature at which the culture was grown, in degrees Celsius.

Shaking_speed_rpm: (int) The speed of shaking (revolutions per minute) used in the culture incubation.

Media: (str) The type of growth media used for the culture.

Carbon_source: (str) The primary carbon source used in the growth media.

Carbon_source_concentration: (float) The concentration of the carbon source in the media.

Carbon_source_concentration_units: (str) The units of concentration for the carbon source.

Inducer: (str) The type of inducer used to trigger gene expression (if applicable).

Inducer_concentration: (float) The concentration of the inducer used.

Inducer_concentration_units: (str) The units of concentration for the inducer.

Induction_time_point: (float) The time point at which induction was initiated.

Induction_time_point_units: (str) The units of time for the induction time point.

Assay_type: (str) The type of assay performed on the culture.

Assay_time_point: (float) The time point at which the assay was performed.

Assay_time_point_units: (str) The units of time for the assay time point.

Organism: (str) The organism used in the experiment.

Strain_ID: (str) A unique identifier for the specific strain of the organism used.

Genes_targeted_for_CRISPRi: (str) A list of genes targeted for CRISPR interference (CRISPRi) for each line in the experiment.
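Because both files carry DBTL_Cycle and Line_name columns, the metadata can be attached to each abundance row by matching on that pair. The sketch below assumes the pair identifies a row uniquely in the metadata file, which this description does not state; the function names, sample line names, and the gene name `geneA` are made up for illustration.

```python
import csv
import io


def row_key(row):
    """Join key: (DBTL_Cycle, Line_name). Uniqueness is an assumption."""
    return (row["DBTL_Cycle"].strip(), row["Line_name"].strip())


def join_on_line_name(data_handle, metadata_handle):
    """Attach the matching metadata row to every data row."""
    metadata = {row_key(r): r for r in csv.DictReader(metadata_handle)}
    joined = []
    for row in csv.DictReader(data_handle):
        meta = metadata.get(row_key(row))
        if meta is None:
            raise KeyError(f"no metadata row for {row_key(row)}")
        # Data values win if a column name appears in both files.
        joined.append({**meta, **row})
    return joined


# Made-up miniature versions of the two files.
data_csv = (
    "DBTL_Cycle,Line_name,Isoprenol_titer\n"
    "0,line_A-R1,150.0\n"
    "0,line_A-R2,160.5\n"
)
meta_csv = (
    "DBTL_Cycle,Line_name,Replicate_ID,Genes_targeted_for_CRISPRi\n"
    "0,line_A-R1,R1,geneA\n"
    "0,line_A-R2,R2,geneA\n"
)
joined = join_on_line_name(io.StringIO(data_csv), io.StringIO(meta_csv))
print(joined[0]["Genes_targeted_for_CRISPRi"], joined[0]["Isoprenol_titer"])
```

Values are kept as strings here; the replicate column is listed as Replicate in the data file and Replicate_ID in the metadata file, so both names survive the join.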

Access information

Other publicly accessible locations of the data:

The raw mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD063733 (DBTL0), PXD063737 (DBTL1), PXD063738 (DBTL2), PXD063740 (DBTL3), PXD063743 (DBTL4), PXD063744 (DBTL5), and PXD063746 (DBTL6).
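For scripted access, the cycle-to-accession pairs listed above can be kept in a small lookup table. This is only a convenience sketch; the names `PRIDE_ACCESSIONS` and `pride_accession` are hypothetical, and the identifiers are copied from the sentence above.

```python
# ProteomeXchange identifiers per DBTL cycle, as listed in the text.
PRIDE_ACCESSIONS = {
    0: "PXD063733",
    1: "PXD063737",
    2: "PXD063738",
    3: "PXD063740",
    4: "PXD063743",
    5: "PXD063744",
    6: "PXD063746",
}


def pride_accession(dbtl_cycle):
    """Return the PXD identifier for a DBTL cycle number (int or numeric str)."""
    cycle = int(dbtl_cycle)
    if cycle not in PRIDE_ACCESSIONS:
        raise ValueError(f"no deposited dataset listed for DBTL cycle {cycle}")
    return PRIDE_ACCESSIONS[cycle]


print(pride_accession(0))
```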