Chemical informatics combined with Kendrick mass analysis to enhance annotation and identify pathways in soybean metabolomics
Data files
Jan 28, 2025 version files 495.64 MB
- README.md (3.76 KB)
- Soybean_Metabolomics_RAW_Data.zip (495.64 MB)
Abstract
Among abiotic stresses to agricultural crops, drought stress is the most prolific and has worldwide detrimental impacts. The soybean (Glycine max) is one of the most important sources of nutrition for both livestock and humans. Different plant introductions (PI) of soybeans have been found to have different drought tolerance levels. Here, two soybean lines, Pana (drought sensitive) and PI 567731 (drought tolerant), were selected to identify chemical compounds and pathways that could be targets for metabolomic analysis of the effects of abiotic stress. Extracts from the two lines were analyzed by direct infusion electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. The high mass resolution and accuracy of the method allow identification of ions from hundreds of different compounds in each cultivar. The exact m/z values of these species were filtered through SoyCyc and the Human Metabolome Database to identify possible molecular formulas of the ions. Next, the exact m/z values were converted into Kendrick masses, and their Kendrick mass defects (KMD) were computed and sorted from high to low KMD. This latter process assisted in identifying many additional molecular formulas and was particularly useful for identifying formulas whose mass difference corresponds to two hydrogen atoms. In this study, more than 460 ionic formulas were identified in Pana and more than 340 in PI 567731, with many of these formulas reported from soybean for the first time. Using the SoyCyc matches, the metabolic pathways from each cultivar were compared, providing lists of molecular targets available to profile the effects of abiotic stress on these soybean cultivars.
Key metabolites include chlorophylls, pheophytins, mono- and diacylglycerols, cycloeucalenone, squalene, and plastoquinones. The pathways involved include the anabolism and catabolism of chlorophyll, glycolipid desaturation, and biosynthesis of phytosterols, plant sterols, and carotenoids.
README: Chemical informatics combined with Kendrick mass analysis to enhance annotation and identify pathways in soybean metabolomics
https://doi.org/10.5061/dryad.np5hqc046
Description of the data and file structure
Leaves from the two cultivars (Pana and PI 567731) were collected from 20 plants per cultivar, flash frozen immediately after tissue collection, and transported to the University of Missouri. Afterward, they were placed on dry ice and shipped to the University at Buffalo. There, the specimens were stored in polycarbonate Petri dishes at -20 °C until extractions were performed. Flash-frozen leaves from multiple plants of each cultivar were pooled. Next, each group was individually macerated by hand for five minutes in methanol using a mortar and pestle. Particulates were removed by vacuum filtration. The samples were subsequently dried in a vacuum oven, and the dried residue was reconstituted in 2 mL of HPLC-grade methanol. These samples were diluted 50-fold prior to ESI FT-ICR analysis.
Files and variables
File: Soy_Research_Supplemental.zip
Description: Excel files of Supplementary Table S1: Ionic Formulas and KMD Analysis for Species in Pana Leaf Extracts.xlsx and Supplementary Table S2: Ionic Formulas and KMD Analysis for Species in PI567731 Leaf Extracts.xlsx; Word files of Supplementary Table S3: List of Compounds from Pana Leaf Extracts for Pathways Analysis and Supplementary Table S4: List of Compounds from PI567731 Leaf Extracts for Pathways Analysis.
File: Soybean_Metabolomics_RAW_Data.zip
Description: Folders containing the raw FT-ICR mass spectrometry data. Folder names for extracts from Pana leaves begin with "Pana", and folder names for extracts from PI567731 leaves begin with "PI731". The designation "OC" in the folder name refers to "old (aged) control" and is followed by the sample number. Three replicates were collected for each sample, in folders whose names end in 000001.d, 000002.d, and 000003.d. Each folder contains the raw time-domain data set (fid), the processed data file (analysis.baf), and a subfolder entitled "Bruker11052015.m", which contains the Bruker method and pulse programs used to acquire that data set.
Code/software
The data were processed using Bruker Data Analysis 4.0 software. The raw data can also be read with the open source OpenMS software package. To be read in OpenMS, Bruker data files (.d, fid, or .baf) must first be converted to the open .mzML format used for mass spectrometry data. This conversion can be done with the MSConvert tool in the open source ProteoWizard package (https://proteowizard.sourceforge.io). In MSConvert, add the Bruker data files to the list of files to convert, choose .mzML as the output format, and start the conversion; when it completes, the output .mzML files have the same root names as the input files. To load the converted files in the TOPPView application within OpenMS:
- Go to File > Open file.
- Choose a file from the file importer and click Open.
- Select options from the panel and click Ok. The recommended selections are Open as: new layer and Map view: 1D.
Data analysis functions can then be applied using the TOPPView application.
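For batch work, the conversion step can also be scripted. The following is a minimal sketch, assuming the ProteoWizard `msconvert` command-line tool is installed and on the PATH (reading Bruker vendor formats generally requires a ProteoWizard build with vendor library support). The folder name in the usage note is a hypothetical placeholder, not an actual folder from the archive, and the returned output path assumes msconvert's default naming from the input root name.

```python
import subprocess
from pathlib import Path


def msconvert_command(bruker_dir, out_dir):
    """Build the msconvert argument list for one Bruker .d folder.

    --mzML selects mzML output and -o sets the output directory.
    """
    return ["msconvert", str(bruker_dir), "--mzML", "-o", str(out_dir)]


def convert(bruker_dir, out_dir):
    """Run msconvert and return the expected path of the .mzML file.

    Assumption: msconvert names the output after the input root name,
    so "<name>.d" becomes "<name>.mzML" in out_dir.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(msconvert_command(bruker_dir, out_dir), check=True)
    return Path(out_dir) / (Path(bruker_dir).stem + ".mzML")
```

For example, `convert("Pana_example_000001.d", "mzml")` would run the conversion for one replicate folder and return `mzml/Pana_example_000001.mzML`, which can then be opened in TOPPView as described above.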
Data in Supplementary Tables S1 and S2 are prepared as Excel .xlsx files that can be read using the open source LibreOffice software package. Supplementary Tables S3 and S4 are prepared as Word .docx files that can also be read using the open source LibreOffice software package.
Access information
Other publicly accessible locations of the data:
- None
Data was derived from the following sources:
- None
Methods
Direct infusion ESI FT-ICR mass spectrometry was conducted using three replicates from each cultivar; the time-domain data was converted to m/z domain data prior to processing to identify features in the mass spectra.
Direct infusion ESI-FT-ICR data sets were processed as follows using Bruker Daltonics (Bremen, Germany) Data Analysis 4.0 software. The software was instructed to find all peaks with a signal-to-noise ratio > 3 to produce a peak list. Next, the peak list was deconvoluted so that isotopic envelopes were determined and each individual ionic species was assigned to its isotopic cluster. A threshold of 0.1% peak area relative to the most intense peak (m/z 1073.506 in each cultivar list, corresponding to ion C67H94NaN4O6) was used. The peak list was reduced to the monoisotopic peak of each isotopic cluster, and this was the m/z value used in compiling lists for each cultivar.
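Users who re-pick peaks from the converted files may want to reapply the two thresholds described above. The sketch below is illustrative only: it is not the Data Analysis 4.0 procedure, it omits the deconvolution step, and the peak record layout (`mz`, `area`, `snr` keys) is a hypothetical choice made for this example.

```python
def filter_peaks(peaks, min_snr=3.0, rel_threshold=0.001):
    """Keep peaks with S/N > min_snr and area above rel_threshold
    (0.1% by default) of the most intense remaining peak.

    peaks: list of dicts with "mz", "area", and "snr" keys
    (hypothetical layout, not a Bruker or OpenMS structure).
    """
    kept = [p for p in peaks if p["snr"] > min_snr]
    if not kept:
        return []
    base_area = max(p["area"] for p in kept)
    return [p for p in kept if p["area"] > rel_threshold * base_area]
```

Whether the relative threshold should be inclusive or exclusive is an assumption here; the text only states that a 0.1% threshold was used.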
After compilation of the m/z list for each cultivar, it was first passed through the SoyCyc database of metabolites (https://soycyc.soybase.org/); matches of protonated, sodiated, or potassiated ions to known metabolites within 3 ppm mass error were considered confirmation of the ionic formula. Each list was then filtered through HMDB to find matches to protonated, sodiated, or potassiated ions in the database. For endogenous compounds, the 3 ppm mass error was again used to constitute a match. For non-natural compounds, however, a stricter limit of 1 ppm was used to constitute a match between the database and the m/z list. To further annotate the m/z values with ionic formulas, each list was converted to the corresponding Kendrick masses and the KMD was calculated for each ion; ions were then sorted by KMD and plotted as nominal Kendrick mass vs. KMD to assist in assigning ionic formulas to those m/z values that did not yet have one. Final lists of ionic formulas from each cultivar were then recorded and compared.
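The ppm matching and Kendrick conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' workflow: it assumes singly charged positive ions, a CH2 Kendrick base, and the convention KMD = nominal Kendrick mass minus Kendrick mass with the nominal mass obtained by rounding (other conventions exist, and the text does not state which was used). The atomic masses are standard monoisotopic values, rounded.

```python
PROTON = 1.007276      # mass of H+ in Da
ELECTRON = 0.000549    # electron mass in Da

# m/z shift added to a neutral monoisotopic mass for each 1+ adduct
ADDUCT_SHIFT = {
    "[M+H]+": PROTON,
    "[M+Na]+": 22.989770 - ELECTRON,
    "[M+K]+": 38.963706 - ELECTRON,
}

CH2_EXACT = 14.01565   # exact mass of CH2
CH2_NOMINAL = 14.0     # nominal mass of CH2


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_adducts(observed_mz, neutral_mass, tol_ppm=3.0):
    """Return (adduct, ppm error) pairs within tol_ppm of observed_mz."""
    hits = []
    for name, shift in ADDUCT_SHIFT.items():
        err = ppm_error(observed_mz, neutral_mass + shift)
        if abs(err) <= tol_ppm:
            hits.append((name, err))
    return hits


def kendrick(mz):
    """Return (nominal Kendrick mass, KMD) on the CH2 scale.

    Assumed convention: KMD = nominal Kendrick mass - Kendrick mass.
    """
    km = mz * CH2_NOMINAL / CH2_EXACT
    nominal = round(km)
    return nominal, nominal - km
```

As a usage example with a theoretical value (not a number taken from the supplementary tables), squalene, C30H50, has a monoisotopic mass of about 410.39125 Da, so `match_adducts(433.3805, 410.39125)` returns only the `[M+Na]+` entry. Under this convention, ions that differ by CH2 share the same KMD, while ions that differ by two hydrogen atoms differ in KMD by about 0.0134, which is what makes such series easy to spot once the list is sorted by KMD.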
For those m/z values which matched entries in the SoyCyc database, an examination of the metabolic pathways involved was also performed to obtain context on how the cultivars might respond to drought at a molecular level. Note: the absence of an annotated peak in the list does not mean that metabolite is not present; rather, the metabolite is not detected with an abundance greater than 0.1% within the restrictive mass accuracy window employed. Metabolites from each cultivar identified in SoyCyc were the inputs into the Pathway Covering tool (https://pmn.plantcyc.org/cmpd-pwy-coverage.shtml) using a constant cost function; the tool then computed a minimal-cost set of metabolic pathways for Glycine max from each cultivar’s data set. For this analysis, Pathway Tools version 26.0 [42] was used employing data identified within the SoyCyc 10.0.2 database.