Data from: Automation and machine learning drive rapid optimization of isoprenol production in Pseudomonas putida
Data files
Aug 20, 2025 version files 30.04 MB
Abstract
Advances in genome engineering have improved our ability to perturb microbial metabolic networks, yet bioproduction campaigns often struggle with parsing complex metabolic datasets to efficiently enhance product titers. We address this challenge by coupling laboratory automation with machine learning to systematically optimize the production of isoprenol, a sustainable aviation fuel (SAF) precursor, in Pseudomonas putida. The simultaneous downregulation through CRISPR interference of combinations of up to four gene targets, guided by machine learning (ML), permitted us to increase isoprenol titer 5-fold in six consecutive DBTL cycles. Moreover, ML enabled us to swiftly explore a vast experimental design space of 800,000 possible combinations by strategically recommending approximately 400 priority constructs. High-throughput proteomics allowed us to validate CRISPRi downregulation and identify biological mechanisms driving production increases. Our work demonstrates that ML-driven automated DBTL cycles can rapidly enhance titers without specific biological knowledge, suggesting that it can be applied to any host, product, or pathway.
Dataset DOI: 10.5061/dryad.gtht76hzh
Description of the data and file structure
This dataset contains proteomic data key to the characterization and analyses of our CRISPRi study. It includes liquid chromatographic and mass spectrometric analysis of the proteomic samples of strains engineered for isoprenol production. High-throughput proteomics allowed us to validate CRISPRi downregulation and identify biological mechanisms driving production increases. In DBTL0, we used proteomics to determine whether an sgRNA downregulated its target gene based on the following three constraints: 1) target abundance below 90% of the library mean for the target; 2) target abundance in the bottom quartile of target expression; and 3) isoprenol titer greater than 66.6 mg/L (approximately 40% of control isoprenol titer). In DBTL1-6, the proteomics abundance measurements were used to identify CRISPRi combinations in which all individual perturbations successfully interfered with their target and to confirm dCas9 production.
The dataset presents diverse proteomic profiles and is ideal for comparative studies and bioinformatics analysis. The raw mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD063733 (DBTL0), PXD063737 (DBTL1), PXD063738 (DBTL2), PXD063740 (DBTL3), PXD063743 (DBTL4), PXD063744 (DBTL5), and PXD063746 (DBTL6). DIA-NN is freely available for download from https://github.com/vdemichev/DiaNN.
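The three DBTL0 constraints described above can be sketched in Python with pandas. This is a minimal illustration, not the authors' analysis code. It assumes that the library mean and the quartile are computed over all lines measured for one target protein and that all three constraints must hold at once. The function name and the example values are hypothetical.

```python
import pandas as pd

def sgrna_downregulates_target(abundance, titer, mean_fraction=0.90, min_titer=66.6):
    """Flag lines that satisfy all three DBTL0 constraints for one target protein.

    abundance: target protein abundance per line across the library (NaN if not detected)
    titer:     isoprenol titer in mg/L for the same lines (same index)
    Assumption: mean and quartile are taken over the whole library for this target.
    """
    below_mean = abundance < mean_fraction * abundance.mean()   # constraint 1
    bottom_quartile = abundance <= abundance.quantile(0.25)     # constraint 2
    enough_titer = titer > min_titer                            # constraint 3
    return below_mean & bottom_quartile & enough_titer

# Made-up values for eight lines, for illustration only.
abundance = pd.Series([1.0, 0.9, 1.1, 0.2, 0.3, 1.0, 1.2, 1.0])
titer = pd.Series([150.0, 160.0, 140.0, 100.0, 50.0, 155.0, 150.0, 160.0])
flags = sgrna_downregulates_target(abundance, titer)
```

In this toy example the fourth line passes all three constraints, while the fifth line has low target abundance but fails the titer threshold.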
Files and variables
CRISPRi_automation_Pputida_proteomic_Top3_peptide_quantification_method_data.csv
This file contains proteomic data from Pseudomonas putida strains engineered to produce isoprenol. The protein accession ID corresponds to proteins identified by the DIA-NN peptide search program, and the quantitative values correspond to the Top3 peptide absolute protein quantification method as detailed in Ahrne et al. 2013 (DOI:10.1002/pmic.201300135), consisting of the average signal response of the three most intense tryptic peptides for each protein. The file also contains the isoprenol titer measured in mg/L for each line.
Description of the columns:
DBTL_Cycle: (int) The Design-Build-Test-Learn cycle number
Line_name: (str) The individual line name with replicate ID
Line: (str) The line name with the replicate ID removed
Replicate: (str) The replicate ID for the line
Isoprenol_titer: (float) The amount of isoprenol produced by the line as measured by GC-FID.
Isoprenol_titer_units: (str) Units of the Isoprenol_titer measurements.
All other column headers: UniProt accession IDs
All other table values: (float): Percentage of the proteome for the specific protein as calculated by the Top3 peptide absolute protein quantification method as detailed in Ahrne et al. 2013 (DOI:10.1002/pmic.201300135), consisting of the average signal response of the three most intense tryptic peptides for each protein. When the protein is not detected or is detected with fewer than three peptides the field is left blank (nan).
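As a sketch of how this layout can be handled, the Python snippet below separates the six annotation columns listed above from the protein columns and summarizes detection. It reads a tiny in-memory sample instead of the real file. The accession-like headers PROT_A and PROT_B and all values are placeholders, and the helper name is hypothetical.

```python
import io
import pandas as pd

# The six annotation columns described above; every other column is a protein.
ANNOTATION_COLUMNS = ["DBTL_Cycle", "Line_name", "Line", "Replicate",
                      "Isoprenol_titer", "Isoprenol_titer_units"]

def split_quantification_table(df):
    """Return (annotations, proteins) as two DataFrames sharing the same index."""
    protein_columns = [c for c in df.columns if c not in ANNOTATION_COLUMNS]
    return df[ANNOTATION_COLUMNS], df[protein_columns]

# Placeholder rows that mimic the file layout; blank fields are read as NaN.
sample_csv = io.StringIO(
    "DBTL_Cycle,Line_name,Line,Replicate,Isoprenol_titer,Isoprenol_titer_units,PROT_A,PROT_B\n"
    "0,lineX-R1,lineX,R1,120.5,mg/L,0.5,\n"
    "0,lineX-R2,lineX,R2,118.0,mg/L,0.7,0.02\n"
)
table = pd.read_csv(sample_csv)
annotations, proteins = split_quantification_table(table)

# Fraction of samples in which each protein was quantified (non-blank).
detected_fraction = proteins.notna().mean()
# Replicate-averaged abundance per line; NaN values are skipped by mean().
mean_by_line = proteins.groupby(table["Line"]).mean()
```

To use the real data, pass the path of the CSV file to pd.read_csv in place of sample_csv.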
CRISPRi_automation_Pputida_proteomic_metadata.csv
This file contains the metadata for the proteomic experiments detailed in the CRISPRi_automation_Pputida_proteomic_Top3_peptide_quantification_method_data.csv file.
Description of the columns:
DBTL_Cycle: (int) The cycle number within a Design-Build-Test-Learn cycle framework.
Line_name: (str) A descriptive name for a specific experimental line or condition with the replicate ID
Line: (str) A numerical or coded identifier for an experimental line or condition.
Replicate_ID: (str) The replicate number for a given experimental condition. Indicates repeated measurements under identical conditions.
Culture_volume: (float) The volume of the culture used in the experiment.
Culture_format: (str) The format of the culture vessel.
Growth_temperature_Celsius: (float) The temperature at which the culture was grown, in degrees Celsius.
Shaking_speed_rpm: (int) The speed of shaking (revolutions per minute) used in the culture incubation.
Media: (str) The type of growth media used for the culture.
Carbon_source: (str) The primary carbon source used in the growth media.
Carbon_source_concentration: (float) The concentration of the carbon source in the media.
Carbon_source_concentration_units: (str) The units of concentration for the carbon source.
Inducer: (str) The type of inducer used to trigger gene expression (if applicable).
Inducer_concentration: (float) The concentration of the inducer used.
Inducer_concentration_units: (str) The units of concentration for the inducer.
Induction_time_point: (float) The time point at which induction was initiated.
Induction_time_point_units: (str) The units of time for the induction time point.
Assay_type: (str) The type of assay performed on the culture.
Assay_time_point: (float) The time point at which the assay was performed.
Assay_time_point_units: (str) The units of time for the assay time point.
Organism: (str) The organism used in the experiment.
Strain_ID: (str) A unique identifier for the specific strain of the organism used.
Genes_targeted_for_CRISPRi: (str) A list of genes targeted for CRISPR interference (CRISPRi) for each line in the experiment.
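The metadata can be attached to the quantification table through the columns the two files share. The Python sketch below assumes that DBTL_Cycle together with Line_name identifies one sample in both files, which should be checked against the real data. The frames, line names, and gene labels are made-up placeholders, and the helper name is hypothetical.

```python
import pandas as pd

# Placeholder stand-ins for the two CSV files (only a few columns shown).
quant = pd.DataFrame({
    "DBTL_Cycle": [1, 1, 1],
    "Line_name": ["lineA-R1", "lineA-R2", "lineB-R1"],
    "Isoprenol_titer": [200.0, 220.0, 150.0],
})
meta = pd.DataFrame({
    "DBTL_Cycle": [1, 1, 1],
    "Line_name": ["lineA-R1", "lineA-R2", "lineB-R1"],
    "Genes_targeted_for_CRISPRi": ["geneX", "geneX", "geneY"],
})

def attach_metadata(quant, meta, keys=("DBTL_Cycle", "Line_name")):
    """Left-join metadata onto the quantification table.

    Assumption: the key columns identify exactly one row in each file;
    validate="one_to_one" raises an error if that does not hold.
    """
    keys = list(keys)
    # Keep only metadata columns that are not already in the quantification table.
    extra = [c for c in meta.columns if c not in quant.columns]
    return quant.merge(meta[keys + extra], on=keys, how="left", validate="one_to_one")

merged = attach_metadata(quant, meta)
# Mean titer per CRISPRi target set, averaged over replicates.
mean_titer = merged.groupby("Genes_targeted_for_CRISPRi")["Isoprenol_titer"].mean()
```

The one-to-one validation is a deliberate choice here: a silent many-to-many join would duplicate rows and distort replicate averages.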
Access information
Other publicly accessible locations of the data:
- The raw mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD063733 (DBTL0), PXD063737 (DBTL1), PXD063738 (DBTL2), PXD063740 (DBTL3), PXD063743 (DBTL4), PXD063744 (DBTL5), and PXD063746 (DBTL6).
High-throughput proteomics data were generated to monitor the effects of CRISPRi-mediated gene knockdowns on protein expression levels across DBTL cycles DBTL0-DBTL6. The sample preparation protocol is detailed at Protocols.io dx.doi.org/10.17504/protocols.io.6qpvr6xjpvmk/v1. Protein was extracted from P. putida cell pellets using Qiagen P2 Lysis Buffer, precipitated with acetone, and digested with trypsin. Resulting tryptic peptides were analyzed using an Agilent 1290 UHPLC system coupled to a Thermo Scientific Orbitrap Exploris 480 mass spectrometer, employing data-independent acquisition (DIA) mode. The data processing protocol is detailed at Protocols.io dx.doi.org/10.17504/protocols.io.5qpvobk7xl4o/v2. DIA raw data were processed using DIA-NN software (library-free mode) against a database containing the P. putida KT2440 UniProt proteome, heterologous proteins, and common contaminants. Protein quantification was performed using the Top3 method (Ahrne et al. 2013 DOI:10.1002/pmic.201300135), averaging the signal response of the three most intense tryptic peptides for each protein. Data were filtered to a global false discovery rate (FDR) ≤ 0.01 at both precursor and protein group levels. The generated mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD063733 (DBTL0), PXD063737 (DBTL1), PXD063738 (DBTL2), PXD063740 (DBTL3), PXD063743 (DBTL4), PXD063744 (DBTL5), and PXD063746 (DBTL6). DIA-NN is freely available for download from https://github.com/vdemichev/DiaNN.
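The Top3 averaging step described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the input layout (one row per peptide with 'protein' and 'intensity' columns), the function names, and the values are hypothetical. Converting the Top3 signal to a percentage by dividing by the summed signal of all quantified proteins is an assumption about how the percent-of-proteome values could be derived.

```python
import numpy as np
import pandas as pd

def top3_protein_signal(peptides):
    """Average the three most intense peptide signals for each protein.

    peptides: DataFrame with 'protein' and 'intensity' columns for one sample
              (a hypothetical layout, not the DIA-NN report format).
    Proteins with fewer than three peptides get NaN, matching the blank
    fields in the quantification file.
    """
    def top3(intensities):
        if len(intensities) < 3:
            return np.nan
        return intensities.nlargest(3).mean()
    return peptides.groupby("protein")["intensity"].apply(top3)

def percent_of_proteome(signal):
    """Assumed normalization: each protein's share of the total Top3 signal."""
    return 100.0 * signal / signal.sum()  # sum() skips NaN entries

# Made-up peptide intensities for three placeholder proteins.
peptides = pd.DataFrame({
    "protein": ["P1", "P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3"],
    "intensity": [10.0, 30.0, 20.0, 5.0, 50.0, 60.0, 60.0, 60.0, 60.0],
})
signal = top3_protein_signal(peptides)
percent = percent_of_proteome(signal)
```

Here P1 is averaged over its three strongest peptides (30, 20, 10), P2 is left as NaN because only two peptides were observed, and P3 carries the remaining share of the total signal.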
