Rule-based deconstruction and reconstruction of diterpene libraries: Categorizing foundational patterns & unravelling the structural landscape
Data files
Sep 05, 2024 version files 180.28 MB
- README.md
- Supp.1b.TeroKit_Diterpene_v2.0.tsv
- Supp.1c.TeroKit_Triterpene_v2.0.tsv
- Supp.1d.TeroKit_Sesquiterpene_v2.0.tsv
- Supp.2a.DNP_Diterpene_Mining_v30.2_Updated_Skeleton_Backbone.tsv
- Supp.2b.TeroKit_Diterpene_v2.0_Updated_Skeleton_Backbone.tsv
- Supp.2c.TeroKit_Triterpene_v2.0_Updated_Skeleton_Backbone.tsv
- Supp.2d.TeroKit_Sesquiterpene_v2.0_Updated_Skeleton_Backbone.tsv
- Supp.2e.DNP_Diterpene_Skeleton_Summary.tsv
- Supp.2f.Terokit_Diterpene_Skeleton_Summary.tsv
- Supp.2g.Terokit_Diterpene_Skeleton_Abundance_Distribution.png
- Supp.2h.Terokit_Diterpene_Deconstructed_Skeleton_Carbon_Distribution.png
- Supp.2i.10X_Zoomed_Terokit_Diterpene_Deconstructed_Skeleton_Carbon_Distribution.png
- Supp.3.PCA_comparison_matrix.tsv
- Supp.4.Pickaxe_Carbocation_Source.pptx
- Supp.5a.step1_10_gen_Class_II.tar.gz
- Supp.5b.step2_10_gen_Class_I.tar.gz
- Supp.5c.step3_filtered_backbones_only.tar.gz
- Supp.5d.post_diTPS_cyclization_ruleset_exploration.tar.gz
- Supp.6.Cytoscape_diTPS_networks.tar.gz
- Supp.7.Backbone_IQV.tar.gz
- Supp.8.SpeciesPerFamily_count_V_Compound_count.png
Abstract
Terpenoids make up the largest class of specialized metabolites, with over 180,000 currently reported across all kingdoms of life. Their synthesis accentuates one of nature's most choreographed enzymatic and non-reversible chemistries, leading to an extensive range of structural functionality and diversity. Current terpenoid repositories provide a seemingly endless playground of information regarding structure, sourcing, and synthesis. Efforts here investigate entries for the 20-carbon diterpenoid variants and deconstruct the complex patterns into simple, categorical groups. This deconstruction approach reduces over 60,000 unique compound entries to fewer than 1,000 categorical structures. Furthermore, over 75% of all diversity can be represented by just 25 structures. Diterpenoid diversity was mapped at an atomic scale, across the total compound landscape, and distributed throughout the tree of life. Additionally, these core structures provide guidelines for predicting how this diversity first originates via the mechanisms catalyzed by diterpene synthases. Over 95% of diterpenoid structures rely on cyclization. Here, a reconstructive approach based on known biochemical rules is applied to model the birth of compound diversity. This computational synthesis validates previously identified reaction products and pathways, and enables the prediction of trajectories for synthesizing real and theoretical compounds. This deconstructive and reconstructive approach, applied to the diterpene landscape, provides a modular, flexible, and easy-to-use toolset for categorically simplifying otherwise complex or hidden patterns.
README: Rule-based deconstruction and reconstruction of diterpene libraries: Categorizing foundational patterns & unravelling the structural landscape
GENERAL INFORMATION
Title: Rule-Based Deconstruction and Reconstruction of the Diterpene Library: A Simulation
of Synthesis and Unravelling of Compound Structural Diversity
Authors: Davis Mathieu, Nicholas Schlecht, Marvin van Aalst, Kevin M. Shebek, Luke Busta,
Nicole Babineau, Oliver Ebenhöh, Björn Hamberger
Principal Investigator Contact Information
TERPENE BIOCHEMISTRY
Name: Björn Hamberger
Institution: Michigan State University
Address: East Lansing, MI USA
Email: hamberge@msu.edu
COMPUTATIONAL MODELLING
Name: Oliver Ebenhöh
Institution: Heinrich Heine Universität
Address: Düsseldorf, NRW, Germany
Email: oliver.ebenoeh@hhu
Data Collection Date: 2022-2024
Information about funding sources that supported the collection of the data: NSF (DEF
1737898); DOE (DE-AC02-05CH11231); GLBRC DOE (DE-SC0018409); extra TBD
SHARING/ACCESS INFORMATION
- Licenses/restrictions placed on the data: None
- Links to publications that cite or use this data:
TBD
- Links to other publicly accessible locations of the data: TeroKit database (http://terokit.qmclab.com/); Dictionary of Natural Products (https://dnp.chemnetbase.com); Pickaxe (https://github.com/tyo-nu/MINE-Database)
- Links/relationships to ancillary data sets: Author, Kevin M. Shebek, graciously provided support and collaboration leading to an authorship position in this work
- Was data derived from another source? YES
TeroKit Database for Sesquiterpenes, Diterpenes, and Triterpenes
Zeng, T., Liu, Z., Zhuang, J., Jiang, Y., He, W., Diao, H., Lv, N., Jian,
Y., Liang, D., Qiu, Y., Zhang, R., Zhang, F., Tang, X., & Wu, R. (2020).
TeroKit: A Database-Driven Web Server for Terpenome Research. Journal of Chemical Information and Modeling, 60(4), 2082–2090. https://doi.org/10.1021/acs.jcim.0c00141
Dictionary of Natural Products database
https://dnp.chemnetbase.com
Former Dictionary of Natural Products database mining
Johnson, S. R., Bhat, W. W., Bibik, J., Turmo, A., Hamberger, B.,
Evolutionary Mint Genomics Consortium, & Hamberger, B. (2019). A
database-driven approach identifies additional diterpene synthase
activities in the mint family (Lamiaceae). The Journal of Biological
Chemistry, 294(4), 1349–1362. https://doi.org/10.1074/jbc.RA118.006025
4. Recommended citation for this dataset:
TBD
DATA & FILE OVERVIEW
- Diterpene datasets extracted from other sources and used as input (citations above). The DNP dataset was extracted from the Dictionary of Natural Products by semi-automatically downloading the results of a search for "diterpen*". The other datasets all come from TeroKit and were extracted with the grep "TARGET_WORD" command, using "sesquiterpene", "diterpene", and "triterpene" as target words.
Supp.1a: DNP diterpene source (25066 entries);
(https://dnp.chemnetbase.com/chemical/ChemicalSearch.xhtml?dswid=4831):
*Chemical_Name: Reference name for entered compound
*Synonyms: Common name for compound; (empty cell "NaN" indicates no common
name exists for that structure)
*Molecular_Formula: Molecular formula...
*Molecular_Weight: molecular mass of compound
*SMILES: string representation of compound structure
*InChi: IUPAC International Chemical Identifier representation of compound (InChiKey)
*Type_of_Compound: Overarching Diterpene Class
*Biological_Source: Genus species name (when applicable;"NaN" indicates no
species information affiliated with the compound on the DNP)
*Smaller_Clade: Source organism Family (extracted from Biological_Source
when applicable; NaN indicates no Biological Source)
*Bigger_Clade: Source organism Phylum (extracted from Biological_Source
when applicable; NaN indicates no Biological Source)
*Use/Importance: Value in humanitarian application (i.e. pharmaceuticals
and pesticides), if reported (otherwise "NaN")
*Biological Use/Importance: References in relation to working function in
nature, if reported (otherwise "NaN")
Supp.1b: TeroKit diterpene source (40833 entries):
(http://terokit.qmclab.com/index.html)
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1666 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Diterpenoids" -- how compounds were extracted
Supp.1c: TeroKit triterpene source (45318 entries):
(http://terokit.qmclab.com/index.html)
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1386 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Triterpenoids" -- how compounds were extracted
Supp.1d: TeroKit sesquiterpene source (42097 entries):
(http://terokit.qmclab.com/index.html)
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1642 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Sesquiterpenoids" -- how compounds were extracted
#DATA STRUCTURE
Supp.1a.DNP_Diterpene_Mining_v30.2.csv
Supp.1b.TeroKit_Diterpene_v2.0.tsv
Supp.1c.TeroKit_Triterpene_v2.0.tsv
Supp.1d.TeroKit_Sesquiterpene_v2.0.tsv
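The keyword-based extraction described for the TeroKit sources can be approximated in Python. This is only a minimal sketch, not the command the authors ran: the function name `extract_rows`, the sample rows, and the choice to match against the `category` column (rather than the whole line, as grep would) are assumptions made for illustration.

```python
import csv
import io


def extract_rows(tsv_text, target_word, column="category"):
    """Return rows whose `column` contains `target_word` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    needle = target_word.lower()
    return [row for row in reader if needle in (row.get(column) or "").lower()]


# Made-up placeholder rows that reuse the Supp.1b column names.
SAMPLE = (
    "mol_id\tformula\tinchi\tsmiles\tcategory\n"
    "Unknown\tC20H32\tNA\tNA\tDiterpenoids\n"
    "T0001\tC15H24\tNA\tNA\tSesquiterpenoids\n"
    "T0002\tC30H50\tNA\tNA\tTriterpenoids\n"
)

diterpenes = extract_rows(SAMPLE, "diterpen")
print(len(diterpenes), diterpenes[0]["formula"])
```

Restricting the match to one column avoids hits where the keyword only appears in another field, at the cost of differing from a plain line-level grep.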
- Derivatized version(s) of the Supp.1 datasets. These were deconstructed by the Python script "Supp_Code.1.Terpenoid_Deconstruction.ipynb" to extract diterpene backbones from their decorated entries. Supp.2a-2d are the same as Supp.1a-1d but include additional columns describing the deconstructed structural derivatives. Supp.2e and Supp.2f summarize diterpene backbone abundance and show which structures occur more frequently than others. Supp.2g is a graphical representation of the number of compounds, their frequency, and their share of the dataset as a whole. Supp.2h is a graphical representation of the final carbon number (expected: 20) of all deconstructed skeletons, which was largely used for identifying outliers and testing deconstruction success. Supp.2i is a zoomed-in version of Supp.2h restricted to carbon numbers with fewer than 3000 entries (so that 20C does not dwarf everything else).
Supp.2a: DNP diterpene source (25066 entries); utf-8:
*Chemical_Name: Reference name for entered compound
*Synonyms: Common name for compound; (empty cell "NaN" indicates no common
name exists for that structure)
*Molecular_Formula: Molecular formula...
*Molecular_Weight: molecular mass of compound
*SMILES: string representation of compound structure
*InChi: IUPAC International Chemical Identifier representation of compound (InChiKey)
*Type_of_Compound: Overarching Diterpene Class
*Biological_Source: Genus species name (when applicable;"NaN" indicates no
species information affiliated with the compound on the DNP)
*Smaller_Clade: Source organism Family (extracted from Biological_Source
when applicable; NaN indicates no Biological Source)
*Bigger_Clade: Source organism Phylum (extracted from Biological_Source
when applicable; NaN indicates no Biological Source)
*Use/Importance: Value in humanitarian application (i.e. pharmaceuticals
and pesticides), if reported (otherwise "NaN")
*Biological Use/Importance: References in relation to working function in
nature, if reported (otherwise "NaN")
*Backbone: core diterpene structure from entry
*Skeleton: core diterpene structure from entry with all stereochemistry,
bond variation, and R-groups removed
*Carbon Number: number of carbons in the skeleton
Supp.2b: TeroKit diterpene source (40833 entries):
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1666 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Diterpenoids" -- how compounds were extracted
*Backbone: core diterpene structure from entry
*Skeleton: core diterpene structure from entry with all stereochemistry,
bond variation, and R-groups removed
*Carbon Number: number of carbons in the skeleton
Supp.2c: TeroKit triterpene source (45318 entries):
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1386 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Triterpenoids" -- how compounds were extracted
*Backbone: core diterpene structure from entry
*Skeleton: core diterpene structure from entry with all stereochemistry,
bond variation, and R-groups removed
*Carbon Number: number of carbons in the skeleton
Supp.2d: TeroKit sesquiterpene source (42097 entries):
*mol_id: TeroKit compound ID for identification from original source; in
some cases (1642 entries) this information is missing and marked
as "Unknown" in the bulk download entries. These IDs can still
be retrieved by searching the SMILES on TeroKit.
*formula: Molecular formula
*inchi: IUPAC International Chemical Identifier representation of compound (InChi)
*smiles: string representation of compound structure
*category: "Sesquiterpenoids" -- how compounds were extracted
*Backbone: core diterpene structure from entry
*Skeleton: core diterpene structure from entry with all stereochemistry,
bond variation, and R-groups removed
*Carbon Number: number of carbons in the skeleton
Supp.2e: DNP Diterpenes (20C only) Skeleton and their abundance (671 entries):
*frequency: Number of times a specific skeleton is deconstructed back to in
the DNP database set.
*SMILES: String representation of the structure
*Carbon Number: Number of carbons present in diterpene skeleton (this
dataset is filtered to only include those that have 20 carbons)
*index: The index of which compounds have that skeleton in dataset 2a
Supp.2f: TeroKit Diterpenes (20C only) Skeleton and their abundance (872 entries):
*frequency: Number of times a specific skeleton is deconstructed back to in
the TeroKit database set.
*SMILES: String representation of the structure
*Carbon Number: Number of carbons present in diterpene skeleton (this
dataset is filtered to only include those that have 20 carbons)
*index: The index of which compounds have that skeleton in dataset 2b
Supp.2g: png showing the distribution of diterpene abundance among the
dataset (TeroKit).
Supp.2h: png showing the number of carbons on final, identified diterpenes
Supp.2i: same png as Supp.2h but zoomed in so as not to dwarf columns with fewer than 1,000 entries
#DATA STRUCTURE
Supp.2a.DNP_Diterpene_Mining_v30.2_Updated_Skeleton_Backbone.tsv
Supp.2b.TeroKit_Diterpene_v2.0_Updated_Skeleton_Backbone.tsv
Supp.2c.TeroKit_Triterpene_v2.0_Updated_Skeleton_Backbone.tsv
Supp.2d.TeroKit_Sesquiterpene_v2.0_Updated_Skeleton_Backbone.tsv
Supp.2e.DNP_Diterpene_Skeleton_Summary.tsv
Supp.2f.Terokit_Diterpene_Skeleton_Summary.tsv
Supp.2g.Terokit_Diterpene_Skeleton_Abundance_Distribution.png
Supp.2h.Terokit_Diterpene_Deconstructed_Skeleton_Carbon_Distribution.png
Supp.2i.10X_Zoomed_Terokit_Diterpene_Deconstructed_Skeleton_Carbon_Distribution.png
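The kind of summary stored in Supp.2e and Supp.2f (frequency and row indices per skeleton) can be sketched with the standard library. This is a hedged illustration only: `skeleton_summary`, `coverage`, and the single-letter placeholder skeletons are hypothetical names and values, and the real summaries were produced by the authors' deconstruction notebook, not by this code.

```python
from collections import Counter


def skeleton_summary(skeletons):
    """Return (skeleton, frequency, row_indices) tuples, most frequent first."""
    counts = Counter(skeletons)
    index = {}
    for i, skeleton in enumerate(skeletons):
        index.setdefault(skeleton, []).append(i)
    return [(s, n, index[s]) for s, n in counts.most_common()]


def coverage(summary, top_n):
    """Fraction of all entries represented by the `top_n` most frequent skeletons."""
    total = sum(n for _, n, _ in summary)
    covered = sum(n for _, n, _ in summary[:top_n])
    return covered / total if total else 0.0


# Placeholder values standing in for the Skeleton column of Supp.2a or Supp.2b.
skeleton_column = ["A", "B", "A", "C", "A", "B"]
summary = skeleton_summary(skeleton_column)
print(summary[0], coverage(summary, 2))
```

A call such as `coverage(summary, 25)` on a real skeleton column would give the share of entries explained by the 25 most common skeletons.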
- Calculated similarity scores of every unique skeleton compared to every unique skeleton. These were calculated using a BitVector calculation from RDKit.
Supp.3: Similarity calculations for all 671 unique DNP skeletons compared to each
other. This yields a 671x671 comparison matrix, which was used to make both a PCA
and a heatmap.
#DATA STRUCTURE
Supp.3.PCA_comparison_matrix.tsv
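An all-against-all comparison matrix of this shape can be sketched as follows. The README states only that a BitVector calculation from RDKit was used, so the Tanimoto coefficient here is an assumption, as is representing each fingerprint as a Python set of on-bit positions; fingerprint generation itself is assumed to happen elsewhere.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def similarity_matrix(fingerprints):
    """Build the symmetric N x N matrix of pairwise similarities."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Placeholder fingerprints; real ones would be derived from skeleton structures.
fingerprints = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 4}]
matrix = similarity_matrix(fingerprints)
for row in matrix:
    print("\t".join(f"{value:.3f}" for value in row))
```

Only the upper triangle is computed and then mirrored, which halves the work for a 671x671 matrix.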
- A Microsoft PowerPoint document to record and visualize all SMARTS rules implemented in our TPS Pickaxe runs (77 rules). Each slide is listed in a tsv format with:
*name of rule
*precursor compound denotations
*SMARTS rule with input>>output
*product compound denotations
*shorthand name of rule and descriptor of intent
*visual representation of the rule
*when a rule is more outlandish, a citation for where the rule was acquired
All rules presented here are formatted to be entered as a custom ruleset in Pickaxe.py
#DATA STRUCTURE
Supp.4.Pickaxe_Carbocation_Source.pptx
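A ruleset file in the tsv layout described for the Pickaxe inputs (Name, Reactants, SMARTS, Products, Comments) can be sanity-checked before a run with a short script. This is a sketch under assumptions: `load_rules` is a hypothetical helper, the ";" separator for multiple reactants or products is assumed, and the sample rule is a placeholder that is not taken from the 77 rules in Supp.4.

```python
import csv
import io


def load_rules(tsv_text):
    """Parse a ruleset tsv and split each SMARTS rule into input and output."""
    rules = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        smarts = row["SMARTS"]
        if smarts.count(">>") != 1:
            raise ValueError(f"rule {row['Name']!r} must contain exactly one '>>'")
        pattern_in, pattern_out = smarts.split(">>")
        rules.append({
            "name": row["Name"],
            "reactants": row["Reactants"].split(";"),  # separator is an assumption
            "input": pattern_in,
            "output": pattern_out,
            "products": row["Products"].split(";"),
        })
    return rules


# Placeholder rule for illustration only.
SAMPLE_RULES = (
    "Name\tReactants\tSMARTS\tProducts\tComments\n"
    "demo_rule\tAny\t[C:1]=[C:2]>>[C:1]-[C:2]\tAny\tplaceholder\n"
)

rules = load_rules(SAMPLE_RULES)
print(rules[0]["input"], "->", rules[0]["output"])
```

Checking for exactly one `>>` catches malformed rows early, before they reach the rule-application step.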
- All new inputs for Pickaxe that were used here. Four iterations of Pickaxe were run in total. The first (Supp.5a) ran 10 generations of rule implementation, with GGDP as the starting compound, a custom Class II rule set, and an updated coreactants list. The outputs from this run were later graphed and are included here as well. The second iteration (Supp.5b) used all compounds generated by the first ruleset that do not contain Xenon (a rule to create carbocations used Xenon instead of C+), a custom set of Class I rules, and the same coreactants list as Supp.5a. Its outputs are included here as well: reactions, compounds, and any compound whose structure matched a diterpene skeleton. Supp.5c investigated compounds that had been identified as "targets" and traced back which compounds in total were involved in the synthesis of those products. Supp.5d used rules that shift carbons in ways other than diterpene synthase activity and was used to identify the rest of diterpene diversity.
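The hand-off between the first and second iterations, keeping only compounds that no longer carry the Xenon placeholder, could look roughly like this. It is a minimal sketch: `drop_xenon` and the compound records are hypothetical, and a plain substring test on the SMILES string is assumed to be sufficient because no other element symbol contains "Xe".

```python
def drop_xenon(compounds):
    """Keep (id, smiles) pairs whose SMILES has no Xe placeholder atom."""
    return [(cid, smiles) for cid, smiles in compounds if "Xe" not in smiles]


# Placeholder records; the SMILES strings are illustrative, not from Supp.5a.
compounds = [
    ("pkc_demo_1", "CC(C)=CCC[Xe]"),
    ("pkc_demo_2", "CC(C)=CCCC"),
    ("pkc_demo_3", "C1CCCCC1"),
]

resolved = drop_xenon(compounds)
print([cid for cid, _ in resolved])
```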
Supp.5a: Unfiltered Pickaxe inputs and output using Class II modelled rules on GGDP
*input csv (GGPP_only_precursor.csv): csv containing the precursor GGDP
>>id: trackable name for compound
>>smiles: string representation of compound names
*input_custom_ruleset (ClassII_diTPS_rules.tsv): Class II modelled diTPS
SMARTS ruleset
>>Name: Callable name for rule
>>Reactants: compounds to be called (for coreactants)
>>SMARTS: Modeled diterpene synthase reaction
>>Products: compounds to be produced (expected coreactant products)
>>Comments: Description of the rule
>>counts: Necessary header for Pickaxe but not used here
>>Uniprot: Necessary header for Pickaxe but not used here (determines
if SMARTS rule was extracted from UniProt)
*coreactant dependency input (metacyc_coreactants.tsv): Necessary, custom
dependency with carbocation coreactants
>>id: callable coreactant name for SMARTS rulesets
>>Name: common name for coreactants
>>smiles: string representation of coreactants
#OUTPUT FOLDER: (CLASSII_NOFILTER_OUTPUT)
* Pickaxe reaction file output (step1_10_gen_Class_II_reactions.tsv):
all reactions performed with unfiltered class II ruleset
>>ID: Identifier for reaction in question (pkr#####)
>>Name: Specific name for reaction, if applicable (NA)
>>ID Equation: Stoichiometry of reaction from compound ids
>>SMILES equation Rxn hash: Reaction with Simplified
Molecular Input Line Entry Specification for
affiliated compounds to the reaction ID Equation
>>Reaction Rules: Reaction Name keyed from SMARTS ruleset
* Pickaxe compound file output (step1_10_gen_Class_II_compounds.tsv):
all compounds generated with unfiltered class II ruleset
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/ Predicted
>>Generation: Earliest generation this compound occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International Chemical
Identifier
>>SMILES: string representation of compound names
Supp.5b: Unfiltered Pickaxe inputs and output using Class I modelled rules on GGDP
and Class II products
*input csv (10GenNoMacroCompoundsGGPP.csv): all resolved structures lacking
Xe produced from Supp.5a and GGDP
>>id: trackable name for compound
>>smiles: affiliated smiles for compound
*input_custom_ruleset (ClassI_diTPS_rules.tsv): Class I modelled diTPS
SMARTS ruleset
>>Name: Callable name for rule
>>Reactants: compounds to be called (for coreactants)
>>SMARTS: Modeled diterpene synthase reaction
>>Products: compounds to be produced (expected coreactant products)
>>Comments: Description of the rule
>>counts: Necessary header for Pickaxe but not used here
>>Uniprot: Necessary header for Pickaxe but not used here (determines
if SMARTS rule was extracted from UniProt)
*coreactant dependency input (metacyc_coreactants.tsv): Necessary, custom
dependency with carbocation coreactants
>>id: callable coreactant name for SMARTS rulesets
>>Name: common name for coreactants
>>smiles: affiliated smiles for coreactants
#OUTPUT FOLDER: (CLASSI_NOFILTER_OUTPUT)
* Pickaxe reaction file output (step2_10_gen_Class_I_reactions.tsv):
all reactions performed with unfiltered class I ruleset
>>ID: Identifier for reaction in question (pkr#####)
>>Name: Specific name for reaction, if applicable (NA)
>>ID Equation: Stoichiometry of reaction from compound ids
>>SMILES equation Rxn hash: Reaction with Simplified
Molecular Input Line Entry Specification for
affiliated compounds to the reaction ID Equation
>>Reaction Rules: Reaction Name keyed from SMARTS ruleset
* Pickaxe compound file output (step2_10_gen_Class_I_compounds.tsv):
all compounds generated with unfiltered class I ruleset
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/Predicted
>>Generation: Earliest generation this compound occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
* Skeletons to match (All_20_C_skeleton.tsv): Identified DNP
Skeletons for preliminary filtering steps
>>id: index for 20C skeletons identified in the DNP
>>SMILES: string representation of compound names
* Identified structures matching previously identified skeleton
(TARGETMATCH.tsv): Initial filtering step for identifying
which diterpenes were (un)successfully synthesized by the
model
>>Target: Which of the 20C skeletons from the DNP have
matching structures
>>Frequency: How many compounds have matching structures
>>Compounds: PKC compound ID name that matches Target
compounds
* Rules usage and abundance (Rules_Ratios.txt): Abundance table for
how often each rule was implemented in the model
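A rule abundance table like Rules_Ratios can be derived from the "Reaction Rules" column of a reactions file. The following is only a sketch of one way to do that: `rule_usage` is a hypothetical helper, the sample rows are placeholders, and it is assumed that several rule names in one cell are separated by ";".

```python
import csv
import io
from collections import Counter


def rule_usage(reactions_tsv):
    """Map each rule name to (count, share of all rule applications)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(reactions_tsv), delimiter="\t"):
        for rule in row["Reaction Rules"].split(";"):  # separator is an assumption
            rule = rule.strip()
            if rule:
                counts[rule] += 1
    total = sum(counts.values())
    return {rule: (n, n / total) for rule, n in counts.most_common()}


# Placeholder reactions; IDs and rule names are illustrative only.
SAMPLE_REACTIONS = (
    "ID\tReaction Rules\n"
    "pkr_demo_1\trule_A\n"
    "pkr_demo_2\trule_A;rule_B\n"
    "pkr_demo_3\trule_B\n"
    "pkr_demo_4\trule_A\n"
)

usage = rule_usage(SAMPLE_REACTIONS)
for rule, (count, share) in usage.items():
    print(f"{rule}\t{count}\t{share:.2f}")
```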
Supp.5c: Filtered Pickaxe inputs and output using modelled Class II and Class I
rules on GGDP to final products.
*Filtered Class II Pickaxe Run (PHASE1_CLASSII_PRUNED/)
*input csv (GGPP_only_precursor.csv): csv containing the precursor
GGDP
>>id: trackable name for compound
>>smiles: string representation of compound names
*input_custom_ruleset (ClassII_diTPS_rules.tsv): Class II modelled
diTPS SMARTS ruleset
>>Name: Callable name for rule
>>Reactants: compounds to be called (for coreactants)
>>SMARTS: Modeled diterpene synthase reaction
>>Products: compounds to be produced (expected coreactant
products)
>>Comments: Description of the rule
>>counts: Necessary header for Pickaxe but not used here
>>Uniprot: Necessary header for Pickaxe but not used here
(determines if SMARTS rule was extracted from UniProt)
*coreactant dependency input (metacyc_coreactants.tsv): Necessary,
custom dependency with carbocation coreactants
>>id: callable coreactant name for SMARTS rulesets
>>Name: common name for coreactants
>>smiles: string representation of coreactants
*Class II target intermediates
(ClassII_Target_Match_Intermediates.csv): Targets involved
in final synthesis of identified skeletons
>>id: compound id name (all capital for PKC#### here to
distinguish from this run and the second run)
>>smiles: string representation of compound names
#OUTPUT FOLDER: (CLASSII_TARGET_MATCH_OUTPUT/)
* Pickaxe reaction file output (reactions.tsv): all
reactions performed with filtered class II ruleset:
>>ID: Identifier for reaction in question (pkr#####)
>>Name: Specific name for reaction, if
applicable (NA)
>>ID Equation: Stoichiometry of reaction from
compound ids
>>SMILES equation Rxn hash: Reaction with
Simplified Molecular Input Line Entry
Specification for affiliated compounds to
the reaction ID Equation
>>Reaction Rules: Reaction Name keyed from SMARTS
ruleset
* Pickaxe reaction file output (reactions_Caps.tsv): all
reactions performed with filtered class II ruleset
but renamed to distinguish compounds and run from
the products of Class I reactions:
>>ID: Identifier for reaction in question (pkr#####)
>>Name: Specific name for reaction, if
applicable (NA)
>>ID Equation: Stoichiometry of reaction from
compound ids
>>SMILES equation Rxn hash: Reaction with
Simplified Molecular Input Line Entry
Specification for affiliated compounds to
the reaction ID Equation
>>Reaction Rules: Reaction Name keyed from SMARTS
ruleset
* Pickaxe compound file output (compounds.tsv): all
compounds generated with the filtered class II
ruleset leading to matched intermediates:
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
* Pickaxe compound file output (compounds_Caps.tsv): all
compounds generated with the filtered class II
ruleset leading to matched intermediates but
renamed to distinguish compounds and run from the
products of Class I reactions:
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
* Filtered input for Class I run
(Phase1End_Phase2StartCompounds.csv): all
intermediates involved in final skeleton formation
that were generated from Class II rules
>>id: trackable name for compound
>>smiles: string representation of compound names
*Filtered Class I Execution (PHASE2_CLASSI_PRUNED/)
*input csv (ClassII_Target_Match_Intermediates.csv): csv containing
all precursors involved in known diterpene synthesis,
generated from Class II products and GGDP
>>id: trackable name for compound
>>smiles: string representation of compound names
*input_custom_ruleset (ClassI_diTPS_rules.tsv): Class I modelled
diTPS SMARTS ruleset
>>Name: Callable name for rule
>>Reactants: compounds to be called (for coreactants)
>>SMARTS: Modeled diterpene synthase reaction
>>Products: compounds to be produced (expected coreactant
products)
>>Comments: Description of the rule
>>counts: Necessary header for Pickaxe but not used here
>>Uniprot: Necessary header for Pickaxe but not used here
(determines if SMARTS rule was extracted from UniProt)
*coreactant dependency input (metacyc_coreactants.tsv): Necessary,
custom dependency with carbocation coreactants
>>id: callable coreactant name for SMARTS rulesets
>>Name: common name for coreactants
>>smiles: string representation of coreactants
*Class I target intermediates
(ClassII_Target_Match_Intermediates.csv): Targets involved
in final synthesis of identified skeletons
>>id: compound id name (all capital for PKC#### here to
distinguish from this run and the second run)
>>smiles: string representation of compound names
*Expected target compounds (Final_Target_Hits.tsv): all compounds
that matched a skeleton from the unfiltered run for
filtering final structures to only reactions involved in
final synthesis
>>id: compound id name
>>smiles: string representation of compound names
#OUTPUT FOLDER: (CLASSI_TARGET_MATCH_OUTPUT/)
* Pickaxe reaction file output (reactions.tsv): all
reactions performed with filtered class I ruleset:
>>ID: Identifier for reaction in question (pkr#####)
>>Name: Specific name for reaction, if
applicable (NA)
>>ID Equation: Stoichiometry of reaction from
compound ids
>>SMILES equation Rxn hash: Reaction with
Simplified Molecular Input Line Entry
Specification for affiliated compounds to
the reaction ID Equation
>>Reaction Rules: Reaction Name keyed from SMARTS
ruleset
* Pickaxe compound file output (compounds.tsv): all
compounds generated with the filtered class I
ruleset leading to matched intermediates:
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
* Pickaxe compound file output without coreactants
(compounds_NoStartingP2.tsv): all compounds
generated with the filtered class I ruleset
leading to matched intermediates with coreactants
removed for graphing in cytoscape
>>ID: Name of molecule
>>Type: Identified as Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
Supp.5d: Diterpene synthases perform the first cyclization reactions for all cyclic
diterpenes. However, additional cyclization and skeleton-modifying rules also
modulate and further diversify skeletons. Here we explore additional rules that
manipulate skeletons to identify where and how this diversity originates.
*TeroKit and DNP Target Skeletons (20CSkeleton_Targets.tsv): skeletons
attempted to be matched after rules were implemented
>>id: compound id name
>>smiles: string representation of compound names
*Skeleton inputs (DiTPS_Synthesized_Skeletons_input.tsv): all diTPS
generated structures from carbocation specific rules (80 compounds)
>>id: compound id name
>>smiles: string representation of compound names
*Post cyclization SMARTS rules (Theoretical_Space_rules.tsv): Rules designed
to further modify diterpene skeletons to form known structures that
require additional modification like ring breakages and expansions
>>Name: Callable name for rule
>>Reactants: compounds to be called (for coreactants)
>>SMARTS: Modeled diterpene synthase reaction
>>Products: compounds to be produced (expected coreactant
products)
>>Comments: Description of the rule
>>counts: Necessary header for Pickaxe but not used here
>>Uniprot: Necessary header for Pickaxe but not used here
(determines if SMARTS rule was extracted from UniProt)
#OUTPUT DIRECTORY (DITPS_EXPANDED_RULES_OUTPUT/)
*compounds_break_ring_3gen.tsv: compounds created from 3
generations of rules that broke rings in all possible manners
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
*compounds_combined.tsv: compounds, not filtered, that were generated
from all rules in the ruleset that were predicted to change
diterpene skeletons in alternative ways. Used to determine
where structural changes originated
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
*compounds_form_ring_1gen.tsv: compounds generated from rules with
1 generation of rings allowed to form in all possible
manners. Computationally demanding
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
*compounds_ring_shift.tsv: generated compounds from rules that
collapsed methyl groups into rings and shifted neighboring
rings of 6:6 to 5:7 for example.
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of compound names
*compounds_side_chain_shift.tsv: Compounds generated from moving
long linear carbon side chains such as methyl and ethyl
groups to neighboring carbons
>>ID: Name of molecule (either common name or PKC#####
assigned)
>>Type: Identified as Coreactant/Starting Compound/
Predicted
>>Generation: Earliest generation this compound
occurred
>>Formula: Molecular Formula for compound
>>InChiKey: The InChiKey encoded IUPAC International
Chemical Identifier
>>SMILES: string representation of the compound structure
#DATA STRUCTURE
Supp.5a.step1_10_gen_Class_II.tar.gz
GGPP_only_precursor.csv
classII_diTPS_rules.tsv
metacyc_coreactants.tsv
CLASSII_NOFILITER_OUTPUT/
step1_10_gen_Class_II_reactions.tsv
step1_10_gen_Class_II_compounds.tsv
Supp.5b.step2_10_gen_Class_I.tar.gz
10GenNoMacroCompoundsGGPP.csv
ClassI_diTPS_rules.tsv
metacyc_coreactants.tsv
CLASSI_NOFILITER_OUTPUT/
step2_10_gen_Class_I_reactions.tsv
step2_10_gen_Class_I_compounds.tsv
All_20_C_skeleton.tsv
TARGETMATCH.tsv
Rules_Ratios.tsv
Supp.5c.step3_filtered_backbones_only.tar.gz
PHASE1_CLASSII_PRUNED/
ClassII_Target_Match_Intermediates.tsv
GGPP_only_precursor.csv
metacyc_coreactants.tsv
classII_diTPS_rules.tsv
CLASSII_TARGET_MATCH_OUTPUT/
compounds.tsv
compounds_Caps.tsv
reactions.tsv
reactions_Caps.tsv
Phase1End_Phase2StartCompounds.tsv
PHASE2_CLASSI_PRUNED/
ClassI_diTPS_rules.tsv
ClassII_Target_Match_Intermediates.csv
Final_Target_Hits.tsv
metacyc_coreactants.tsv
CLASSI_TARGET_MATCH_OUTPUT/
compounds.tsv
compounds_NoStartingP2.tsv
reactions.tsv
TARGETMATCH.tsv
Supp.5d.post_diTPS_cyclization_ruleset_exploration.tar.gz
20CSkeleton_Targets.tsv
DiTPS_Synthesized_Skeletons_input.tsv
Theoretical_Space_rules.tsv
DITPS_EXPANDED_RULES_OUTPUT/
compounds_break_ring_3gen.tsv
compounds_combined.tsv
compounds_form_ring_1gen.tsv
compounds_ring_shift.tsv
compounds_side_chain_shift.tsv
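The compound tables listed above share the columns ID, Type, Generation, Formula, InChiKey, and SMILES. The following is a minimal sketch of how such a table might be read and filtered. It assumes the files are tab-separated with a header row that uses exactly those column names and that predicted compounds carry the Type value "Predicted"; the function names and the in-memory sample rows (with "NA" placeholders) are hypothetical and are not taken from the dataset.

```python
import csv
import io


def load_compounds(handle):
    """Read a tab-separated compound table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def predicted_up_to(rows, max_generation):
    """Keep rows typed 'Predicted' whose earliest generation is <= max_generation."""
    return [
        row
        for row in rows
        if row["Type"] == "Predicted" and int(row["Generation"]) <= max_generation
    ]


# Hypothetical stand-in for a compounds table; values are placeholders only.
sample = (
    "ID\tType\tGeneration\tFormula\tInChiKey\tSMILES\n"
    "GGPP\tStarting Compound\t0\tNA\tNA\tNA\n"
    "PKC00001\tPredicted\t1\tNA\tNA\tNA\n"
    "PKC00002\tPredicted\t3\tNA\tNA\tNA\n"
)

rows = load_compounds(io.StringIO(sample))
early = predicted_up_to(rows, max_generation=1)
print([row["ID"] for row in early])
```

For a file on disk, the same functions could be called with `open(path, newline="")`, where `path` points to one of the extracted tables.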
- A network was created from Supp.5c using all filtered compounds involved in the synthesis of target molecules (those with matching skeleton structures; see "synthesized_backbones.xlsx"). These networks were further derivatized to generate Figures 2b-2d (png) in the work; the Cytoscape files provide the raw information for further derivatization. The Cytoscape-generated network includes the full network of filtered compounds generated by Pickaxe using diTPS-modeled Class II/Class I mechanisms from GGPP (Filtered_Network_Figure2b.cys; Figure 2b). The synthesis of individual compounds was also investigated, including kaurene (Kaurene_Synthesis_Figure2c.cys; Figure 2c) and taxadiene (Taxadiene_Synthesis_Figure2d.cys; Figure 2d). Additional compounds of interest are provided in Cytoscape_Exploration. Each individual network was built by starting from the final compound and tracking nodes/edges back to GGPP within the full filtered network; all other nodes/edges were hidden.
Supp.6.Cytoscape_diTPS_network/
*Filtered_Network_Figure2b.cys: Full set of reactions and compounds
generated by Supplemental 5c.
*Kaurene_Synthesis_Figure2c.cys: Isolated Reactions leading to Kaurene
*Taxadiene_Synthesis_Figure2d.cys: Isolated Reactions leading to Taxadiene
*Taxane_Related_Mechanisms_Supp9.cys: Isolated Reactions leading to taxane-
related structures that lack a predicted mechanism in
the literature (Abeotaxane, Harziane, Taxadiene, Atypical_taxane).
Cytoscape_Exploration/
*1_Taxadiene.cys: Isolated Reactions leading to Taxadiene
*2_Xenicane.cys: Isolated Reactions leading to Xenicane
*3_Fusicane.cys: Isolated Reactions leading to Fusicane
*4_Icetaxane.cys: Isolated Reactions leading to Icetaxane
*5_Sphenolobane.cys: Isolated Reactions leading to Sphenolobane
*6_Rearranged_Kaurene.cys: Isolated Reactions leading to a
Kaurene-like structure
*7_Gnaphalane_like.cys: Isolated Reactions leading to
Gnaphalane
*8_Abeotaxane.cys: Isolated Reactions leading to Abeotaxane
*9_Atypical_taxane_derivative.cys: Isolated Reactions leading to
the 4-membered ring structure of a taxane relative
*10_Haziane.cys: Isolated Reactions leading to Harziane
*11_5.5.5.5_Tetraquinane.cys: Isolated Reactions leading to
5,5,5,5-Tetraquinane
*12_Salvinomoxicanolide.cys: Isolated Reactions leading to the
Salvinomoxicanolide structure
*13_serrulatane_like.cys: Isolated Reactions leading to a
serrulatane-related structure
*synthesized_backbones.xlsx: Annotated compounds within the network
that matched a DNP/TeroKit identified skeleton.
>>ID: coordinated skeleton ID from DNP Supp.2e
>>Frequency Synthesized: number of times the model made
this product and it is present in the network
>>Frequency in Database: number of reported structures in
DNP with the same skeleton
>>SMILE: string representation of skeleton
>>Common Name: common name used in literature to represent
this structure
>>Known?: Annotations on where this has been reported
and whether it has mechanisms affiliated
>>Backbone Structure: Image of skeleton
#DATA STRUCTURE
Supp.6.Cytoscape_diTPS_networks.tar.gz
Filtered_Network_Figure2b.cys
Kaurene_Synthesis_Figure2c.cys
Taxadiene_Synthesis_Figure2d.cys
Taxane_Related_Mechanisms_Supp9.cys
Cytoscape_Exploration
1_Taxadiene.cys
2_Xenicane.cys
3_Fusicane.cys
4_Icetaxane.cys
5_Sphenolobane.cys
6_Rearranged_Kaurene.cys
7_Gnaphalane_like.cys
8_Abeotaxane.cys
9_Atypical_taxane_derivative.cys
10_Haziane.cys
11_5.5.5.5_Tetraquinane.cys
12_Salvinomoxicanolide.cys
13_serrulatane_like.cys
synthesized_backbones.xlsx
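The isolated synthesis networks in Supp.6 were obtained by tracking nodes/edges from a final compound back to the starting precursor and hiding everything else. The sketch below illustrates that idea on a plain directed edge list: it keeps only edges lying on some path from the precursor to the target. This is an illustrative reimplementation under stated assumptions, not the authors' Cytoscape workflow; the function names and the node labels in the example are hypothetical.

```python
from collections import defaultdict, deque


def reachable(start, adjacency):
    """Return every node reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def isolate_synthesis(edges, precursor, target):
    """Keep edges that lie on at least one directed path precursor -> target."""
    forward = defaultdict(set)   # substrate -> products
    backward = defaultdict(set)  # product -> substrates
    for substrate, product in edges:
        forward[substrate].add(product)
        backward[product].add(substrate)
    # A node is on such a path if the precursor reaches it and it reaches the target.
    keep = reachable(precursor, forward) & reachable(target, backward)
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Hypothetical toy network: one productive route, one dead end, one unrelated input.
edges = [
    ("GGPP", "int_1"),
    ("int_1", "int_2"),
    ("int_2", "target"),
    ("int_1", "dead_end"),
    ("other_start", "int_2"),
]
route = isolate_synthesis(edges, "GGPP", "target")
print(route)
```

Intersecting the forward and backward searches drops both dead-end products and intermediates that cannot be reached from the precursor.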
- Backbones sharing the same skeleton structure were overlaid by converting the compounds to SMILES strings and replacing each R-group with xenon (Xe). The overlay shows where a given structure varies, in a similar fashion to a multiple sequence alignment. This was done for the 20 most common diterpene skeletons in TeroKit, using all backbones of each skeleton to identify molecular "hotspots".
Supp.7: A folder containing information about the top 20 skeletons in TeroKit,
numbered in order of abundance. Each skeleton folder contains the reference SMILES
text used for aligning (carbon-only structure), a table of all connecting points
(edges) and their variability (calculated with the index of qualitative
variation), a node table indicating the variability in atomic decoration
(calculated with the index of qualitative variation), and a pseudo "sequence
alignment" of all SMILES for each backbone in that category, aligned on the same
atoms. Reference images were created in ChemDraw and a visualization of the
variability was created in Cytoscape. Listed below are the reference SMILES and
number of compounds for each folder, followed by an example folder.
Skeleton_1: CCC(C)CCC1C(C)CCC2C(C)(C)CCCC12C (4060)
Skeleton_2: CCC(C)CCC1(C)C(C)CCC2(C)C(C)CCCC12 (3372)
Skeleton_3: CC1CC23CCC4C(C)(C)CCCC4(C)C2CCC1C3 (3301)
Skeleton_4: CC1CCCC(C)CCC(C(C)C)CCC(C)CCC1 (2050)
Skeleton_5: CC(C)C1CCC2C(CCC3C(C)(C)CCCC23C)C1 (1854)
Skeleton_6: CC1CCCC2(C)CCC3C(C)CCC(CC12)C3(C)C (1350)
Skeleton_7: CCC1(C)CCC2C(CCC3C(C)(C)CCCC23C)C1 (1228)
Skeleton_8: CC1CCCC2(C)CCCC(C)C2CC(C(C)C)CC1 (1173)
Skeleton_9: CCC(C)CCCC(C)CCCC(C)CCCC(C)C (858)
Skeleton_10: CCC1CCC2C(CCC3C(C)(C)CCCC23C)C1C (844)
Skeleton_11: CC1CCCC(C)(C)CCC(C)CC2CC(C)CC2C1 (772)
Skeleton_12: CC1CC2CC(C)CC3C(C(C)CC4C3C4(C)C)C2C1 (634)
Skeleton_13: CC1CCC2C(CC(C)CC3CC(C)CC3C1)C2(C)C (581)
Skeleton_14: CC1CC2CC(C)CC2C2C(C)CC(C(C)C)CC2C1 (554)
Skeleton_15: CC1CCCC(C)CCC2(C)CCC(C(C)C)C2CC1 (516)
Skeleton_16: CC1CCCC(C)CC2C(C(C)C)CCC(C)C2CC1 (500)
Skeleton_17: CC1CC2(C)CC1CCC2C1(C)CCCC(C)(C)C1C (390)
Skeleton_18: CC1CC2CC3(CC(C)CC3C1)C(C)CC1C2C1(C)C (357)
Skeleton_19: CC1CC23CCC4C(CCC4(C)C)C(C)C2CCC1C3 (350)
Skeleton_20: CC(C)CCCC(C)C1CCC(C)C2CCC(C)CC21 (297)
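Each entry above follows the pattern `Skeleton_N: SMILES (count)`. As a small sketch, a line in this layout could be parsed and its carbon count checked as shown below. The regular expression and function names are hypothetical, and counting the character `C` is only valid for carbon-only skeleton SMILES like those listed here (it would miscount strings containing, for example, `Cl` or aromatic lowercase atoms).

```python
import re

# Matches lines such as "Skeleton_1: <SMILES> (4060)".
ENTRY = re.compile(r"^Skeleton_(\d+):\s+(\S+)\s+\((\d+)\)$")


def parse_skeleton_line(line):
    """Split one listing line into (rank, reference SMILES, compound count)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized skeleton line: {line!r}")
    rank, smiles, count = match.groups()
    return int(rank), smiles, int(count)


def carbon_count(smiles):
    """Count carbons in a carbon-only SMILES written with uppercase 'C'."""
    return smiles.count("C")


rank, smiles, count = parse_skeleton_line(
    "Skeleton_1: CCC(C)CCC1C(C)CCC2C(C)(C)CCCC12C (4060)"
)
print(rank, carbon_count(smiles), count)
```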
#EXAMPLE FOLDER WITHIN THE DATASET
Supp.7.Backbone_IQV/
Skeleton_1/
*1_consensus.txt: Skeleton SMILES (listed above as well)
*1_EDGE_IQV.tsv: Atom connecting points listed in order of
occurrence, compared to the reference sequence (so first C
is ref:0 and last C is ref:19). These report the IQV value
for each connecting edge (bond variation).
*1_EDGE_IQV.tsv_1.png: Generated structure from Edge and Node
variability table in Cytoscape. These structures are what
were used to create the figure for top diterpene diversity.
*1_MSA.txt: Raw/intermediate file with all SMILES aligned
*1_NODE_IQV.tsv: Variability of each letter character; where edge
focuses on connections, this focuses on individual
points/atoms.
*1_ref_image.jpg: Image of the reference structure in question;
created in ChemDraw.
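The edge and node tables report variability as an index of qualitative variation (IQV). A common textbook form is IQV = K/(K-1) * (1 - sum of squared category proportions), where K is the number of categories; it is 0 when every observation is identical and 1 when observations are spread evenly over the K categories. The sketch below applies that form to each column of an alignment. It is an assumption-laden illustration: the README does not state which K convention the authors used (here K defaults to the number of categories observed in the column), the function names are hypothetical, and the toy alignment is invented rather than taken from a `*_MSA.txt` file.

```python
from collections import Counter


def iqv(observations, n_categories=None):
    """Index of qualitative variation: K/(K-1) * (1 - sum(p_i^2))."""
    counts = Counter(observations)
    total = sum(counts.values())
    # Assumption: if K is not supplied, use the number of observed categories.
    k = n_categories if n_categories is not None else len(counts)
    if total == 0 or k < 2:
        return 0.0  # no variation can be expressed with fewer than two categories
    sum_squares = sum((n / total) ** 2 for n in counts.values())
    return (k / (k - 1)) * (1.0 - sum_squares)


def column_iqv(aligned, n_categories=None):
    """IQV for each position of equal-length aligned sequences."""
    return [iqv(column, n_categories) for column in zip(*aligned)]


# Invented two-position alignment: position 0 is invariant, position 1 varies.
toy_alignment = ["CO", "CN", "CO", "CN"]
print(column_iqv(toy_alignment))
```

Passing a fixed `n_categories` makes positions comparable when the set of possible states is known in advance.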
#DATA STRUCTURE
Supp.7.Backbone_IQV.tar.gz
Skeleton_1/
1_consensus.txt
1_EDGE_IQV.tsv
1_EDGE_IQV.tsv_1.png
1_MSA.txt
1_NODE_IQV.tsv
1_ref_image.jpg
Skeleton_2/
2_consensus.txt
2_EDGE_IQV.tsv
2_EDGE_IQV.tsv_2.png
2_MSA.txt
2_NODE_IQV.tsv
2_ref_image.jpg
Skeleton_3/
3_consensus.txt
3_EDGE_IQV.tsv
3_EDGE_IQV.tsv_3.png
3_MSA.txt
3_NODE_IQV.tsv
3_ref_image.jpg
Skeleton_4/
4_consensus.txt
4_EDGE_IQV.tsv
4_EDGE_IQV.tsv_4.png
4_MSA.txt
4_NODE_IQV.tsv
4_ref_image.jpg
Skeleton_5/
5_consensus.txt
5_EDGE_IQV.tsv
5_EDGE_IQV.tsv_5.png
5_MSA.txt
5_NODE_IQV.tsv
5_ref_image.jpg
Skeleton_6/
6_consensus.txt
6_EDGE_IQV.tsv
6_EDGE_IQV.tsv_6.png
6_MSA.txt
6_NODE_IQV.tsv
6_ref_image.jpg
Skeleton_7/
7_consensus.txt
7_EDGE_IQV.tsv
7_EDGE_IQV.tsv_7.png
7_MSA.txt
7_NODE_IQV.tsv
7_ref_image.jpg
Skeleton_8/
8_consensus.txt
8_EDGE_IQV.tsv
8_EDGE_IQV.tsv_8.png
8_MSA.txt
8_NODE_IQV.tsv
8_ref_image.jpg
Skeleton_9/
9_consensus.txt
9_EDGE_IQV.tsv
9_EDGE_IQV.tsv_9.png
9_MSA.txt
9_NODE_IQV.tsv
9_ref_image.jpg
Skeleton_10/
10_consensus.txt
10_EDGE_IQV.tsv
10_EDGE_IQV.tsv_10.png
10_MSA.txt
10_NODE_IQV.tsv
10_ref_image.jpg
Skeleton_11/
11_consensus.txt
11_EDGE_IQV.tsv
11_EDGE_IQV.tsv_11.png
11_MSA.txt
11_NODE_IQV.tsv
11_ref_image.jpg
Skeleton_12/
12_consensus.txt
12_EDGE_IQV.tsv
12_EDGE_IQV.tsv_12.png
12_MSA.txt
12_NODE_IQV.tsv
12_ref_image.jpg
Skeleton_13/
13_consensus.txt
13_EDGE_IQV.tsv
13_EDGE_IQV.tsv_13.png
13_MSA.txt
13_NODE_IQV.tsv
13_ref_image.jpg
Skeleton_14/
14_consensus.txt
14_EDGE_IQV.tsv
14_EDGE_IQV.tsv_14.png
14_MSA.txt
14_NODE_IQV.tsv
14_ref_image.jpg
Skeleton_15/
15_consensus.txt
15_EDGE_IQV.tsv
15_EDGE_IQV.tsv_15.png
15_MSA.txt
15_NODE_IQV.tsv
15_ref_image.jpg
Skeleton_16/
16_consensus.txt
16_EDGE_IQV.tsv
16_EDGE_IQV.tsv_16.png
16_MSA.txt
16_NODE_IQV.tsv
16_ref_image.jpg
Skeleton_17/
17_consensus.txt
17_EDGE_IQV.tsv
17_EDGE_IQV.tsv_17.png
17_MSA.txt
17_NODE_IQV.tsv
17_ref_image.jpg
Skeleton_18/
18_consensus.txt
18_EDGE_IQV.tsv
18_EDGE_IQV.tsv_18.png
18_MSA.txt
18_NODE_IQV.tsv
18_ref_image.jpg
Skeleton_19/
19_consensus.txt
19_EDGE_IQV.tsv
19_EDGE_IQV.tsv_19.png
19_MSA.txt
19_NODE_IQV.tsv
19_ref_image.jpg
Skeleton_20/
20_consensus.txt
20_EDGE_IQV.tsv
20_EDGE_IQV.tsv_20.png
20_MSA.txt
20_NODE_IQV.tsv
20_ref_image.jpg
- Visualization of diterpene abundance throughout phylogeny from perspectives other than those focused on in the paper. A figure and phylogeny within the paper are also produced in Supp_Code.6, if that is of interest. Supp.8 compares the number of compounds reported to the number of species within each family, to check whether any family has more reports only because it contains more species. This graphic is formatted as a ".png".
#DATA STRUCTURE
Supp.8.SpeciesPerFamily_count_V_Compound_count.png
- The full annotated schematic predicted by our Pickaxe model for the synthesis of Taxadiene, which parallels previously reported synthesis mechanisms, along with Taxane-related structures that previously had no reported mechanisms, including Abeotaxane, Harziane, and 3,11-Cyclotaxane. These mechanisms demonstrate the capacity for synthesis exclusively through diterpene synthase activity and carbocation cyclization/rearrangement reactions.
*Predicted_Taxane_Relative_Mechanisms: A figure demonstrating the mechanisms
predicted by Pickaxe for synthesis of Taxadiene, Abeotaxane, Harziane, and
3,11-Cyclotaxane. This visual was extracted from Supplemental 6 data
network and visualized with ChemDraw.
#DATA STRUCTURE
Supp.9.Predicted_TaxaneRelative_Mechanisms.jpg
######################################################################
Supplemental Code
NOTE: A DETAILED DESCRIPTION OF EACH PROGRAM CAN BE SEEN AT THE TOP
OF THAT PROGRAM UPON OPENING THE TEXT FILE (INCLUDING A DETAILED LIST OF INPUTS,
PURPOSE OF INDIVIDUAL FUNCTIONS, AND OUTPUTS). ONLY THE FUNCTION OF EACH PROGRAM IS LISTED HERE.
- Code for isolating Diterpene structures from entries within the DNP and TeroKit databases.
Supp_Code.1.Terpenoid_Deconstruction.ipynb
- Code for comparing skeleton structures identified within the DNP
Supp_Code.2.Skeleton_PCA.ipynb
- Code for modelling diterpene synthesis
3a. Pickaxe settings used for our runs
3b. Code for identifying targets that overlay with previously identified skeletons
within the database.
3c. Tool for converting a reaction file to an edge/node table for Cytoscape
Supp_Code.3a.Pickaxe_DM_NS.py
Supp_Code.3b.Match.py
Supp_Code.3c.Network_maker.py
- Code for predicting the dynamicity and final location of carbocations as they are shifted throughout molecular synthesis.
Supp_Code.4.Carbocation_Quench_Predictor.ipynb
- Code used for aligning SMILES backbones to one another and identifying atomic variability at each point
Supp_Code.5.Backbone_MSA.ipynb
- Script used to quantify the abundance of compounds reported for each phylogenetic family, with particular focus on plants and algae.
Supp_Code.6.Phylogenetic_Skeleton_Abundance_Heatmap.ipynb