Mathematical chromatography deciphers the molecular fingerprints of dissolved organic matter
Cite this dataset
Wünsch, Urban J.; Hawkes, Jeffrey A. (2020). Mathematical chromatography deciphers the molecular fingerprints of dissolved organic matter [Dataset]. Dryad. https://doi.org/10.5061/dryad.nk98sf7pp
Abstract
High-resolution mass spectrometry (HRMS) elucidates the molecular composition of dissolved organic matter (DOM) through the unequivocal assignment of molecular formulas. When HRMS is used as a detector coupled to high-performance liquid chromatography (HPLC), the molecular fingerprints of DOM are further augmented. However, the identification of eluting compounds remains impossible when DOM chromatograms consist of unresolved humps. Here, we utilized the concept of mathematical chromatography to achieve information reduction and feature extraction. Parallel Factor Analysis (PARAFAC) was applied to a dataset describing the reverse-phase separation of DOM in headwater streams located in southeast Sweden. A dataset consisting of 1355 molecular formulas and 7178 mass spectra was reduced to five components that described 96.89% of the data. Each component summarized the distinct chromatographic elution of molecular formulas with different polarity. Component scores represented the abundance of the identified HPLC features in each sample. Using this chemometric approach allowed the identification of common patterns in HPLC–HRMS datasets by reducing thousands of mass spectra to only a few statistical components. Unlike in principal component analysis (PCA), components closely followed the analytical principles of HPLC–HRMS and therefore represented more realistic pools of DOM. This approach provides a wealth of new opportunities for unravelling the composition of complex mixtures in natural and engineered systems.
Methods
Dataset1.zip
 Samples were stored unfiltered in the dark at 4 °C for approximately five months after sampling.
 On the day of measurements, specific volumes of samples were transferred to 2 mL Eppendorf vials so that 11.25 µg carbon was present in each sample vial, while 2 mL of blanks were transferred.
 The water in samples and blanks was subsequently removed by vacuum evaporation at 45 °C, after which samples were reconstituted in 150 µL of 1 % (v/v) formic acid to a final concentration of 75 mg/L carbon.
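
The volumes in the steps above follow from simple unit arithmetic. The sketch below is illustrative only: the function names are hypothetical and the example DOC concentration of 15 mg/L is an assumed value, not one taken from the dataset.

```python
def sample_volume_ml(doc_mg_per_l, carbon_ug=11.25):
    """Volume of sample (mL) that contains `carbon_ug` of carbon.

    mg/L is numerically equal to µg/mL, so µg / (µg/mL) gives mL.
    """
    return carbon_ug / doc_mg_per_l


def reconstituted_conc_mg_per_l(carbon_ug=11.25, volume_ul=150.0):
    """Carbon concentration (mg/L) after reconstitution in `volume_ul`.

    µg/µL equals g/L, so multiplying by 1000 converts to mg/L.
    """
    return carbon_ug / volume_ul * 1000.0


# Hypothetical stream sample with 15 mg/L DOC
volume = sample_volume_ml(15.0)
final_conc = reconstituted_conc_mg_per_l()
```

With the stated 11.25 µg of carbon in 150 µL, the second function reproduces the final concentration of 75 mg/L given in the text.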

Reverse-phase chromatography separations were performed on an Agilent 1100 series instrument with an Agilent PLRP-S series column (150 × 1 mm, 3 µm bed size, 100 Å pore size). Eighty µL of sample was loaded at a flow rate of 100 µL min^{-1} in 0.1 % formic acid, 0.05 % ammonia, and 5 % acetonitrile. The elution of DOM was achieved through a stepwise increase in the concentration of solvent B (100 % acetonitrile) from zero initially, followed by 20 %, and ending in > 45 % solvent B.

Mass spectrometry detection was carried out with an Orbitrap LTQ-Velos-Pro (Thermo Scientific, Germany) with electrospray ionization (ESI, negative mode) as ion source. Transient ions were collected in the range of m/z 150–1000 at an instrumental resolving power set to 10^{5}. An external calibration with the manufacturer’s calibration mixture was followed by an internal calibration using six ubiquitous ions in the range of m/z 251–493.

Transients were filtered for noise by treating peaks with a mass defect of 0.6–0.8 as noise and removing all peaks with an intensity lower than the mean + 3 standard deviations of these peaks. Molecular formulas were assigned within the range C_{4–40}, H_{4–80}, O_{1–40}, N_{0–1}, S_{0–1} in the mass range m/z 170–700. Additionally, assignments were restricted to O/C 0–1, H/C 0.3–2, a double bond equivalent minus oxygen less than or equal to 10, and a mass defect of -0.1 to 0.3 (decimal after the nominal mass).
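
The noise filter and the formula constraints described above can be sketched as follows. This is a minimal illustration, not the authors' code: all function names are hypothetical, the element ranges are read as C 4–40, H 4–80, O 1–40, N 0–1 and S 0–1, the sample standard deviation is assumed, and the double bond equivalent is computed with the common expression DBE = C - H/2 + N/2 + 1. The mass-defect constraint on assignments is left out for brevity.

```python
import math
from statistics import mean, stdev


def mass_defect(mz):
    """Decimal part of m/z after the nominal (integer) mass."""
    return mz - math.floor(mz)


def noise_threshold(peaks, low=0.6, high=0.8, n_sd=3.0):
    """Mean + n_sd standard deviations of peaks in the noise window.

    `peaks` is a list of (mz, intensity) tuples from one transient and
    must contain at least two peaks in the mass-defect window.
    """
    noise = [inten for mz, inten in peaks if low <= mass_defect(mz) <= high]
    # Assumption: sample standard deviation (n - 1 in the denominator)
    return mean(noise) + n_sd * stdev(noise)


def remove_noise(peaks):
    """Drop every peak whose intensity is below the noise threshold."""
    limit = noise_threshold(peaks)
    return [(mz, inten) for mz, inten in peaks if inten >= limit]


def passes_constraints(c, h, o, n=0, s=0):
    """Check element ranges, O/C, H/C and DBE minus O for one formula."""
    in_range = (4 <= c <= 40 and 4 <= h <= 80 and 1 <= o <= 40
                and n in (0, 1) and s in (0, 1))
    dbe = c - h / 2 + n / 2 + 1  # assumed DBE definition
    return (in_range and 0 <= o / c <= 1 and 0.3 <= h / c <= 2
            and dbe - o <= 10)
```

For example, `passes_constraints(10, 12, 5)` accepts C10H12O5, while a highly unsaturated candidate such as C30H10O1 is rejected by the DBE minus O criterion.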

Formulas detected in process blanks were excluded from further analysis. Formulas were also removed from consideration in samples if the intensity did not exceed the noise + 10 standard deviations in at least 10 sequential transients at some point in the elution. This molecular formula assignment and data treatment yielded 2052 unique molecular formulas. Several sequential intensities (typically 3–4) were summed to a chromatographic resolution of 0.1 min to favour analyte signals over instrument noise and to reduce computational requirements.
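
The two operations in this paragraph, the sustained-signal criterion and the summation to 0.1 min resolution, might look like the sketch below. Function names and the binning convention (bins starting at multiples of 0.1 min) are assumptions for illustration.

```python
def has_sustained_signal(intensities, threshold, min_run=10):
    """True if `min_run` consecutive transients exceed `threshold`.

    `threshold` stands for the noise level + 10 standard deviations.
    """
    run = 0
    for value in intensities:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


def sum_into_bins(times_min, intensities, width_min=0.1):
    """Sum intensities of transients that fall into the same time bin.

    Returns a dict mapping the bin start time (min) to the summed
    intensity. Assumption: bins start at multiples of `width_min`.
    """
    bins = {}
    for t, value in zip(times_min, intensities):
        # Small epsilon guards against floating-point rounding at bin edges
        index = int(t / width_min + 1e-9)
        bins[index] = bins.get(index, 0.0) + value
    return {round(i * width_min, 1): total for i, total in sorted(bins.items())}
```

Summing rather than averaging keeps the result proportional to the total ion signal within each 0.1 min window, which matches the wording "summed" in the text.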

To yield a more quantitative dataset in subsequent analyses, the DOC normalization was reversed by accounting for the sample-specific volume that yielded the constant amount of carbon dissolved for chromatographic analysis. For statistical modelling, the retention time window of 5.0–22.9 min was selected, yielding a preliminary dataset size of 74 samples × 2052 molecular formulas × 180 retention times (Dataset1.zip).
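
As a rough illustration of this step, the sketch below builds the retention-time axis and shows one plausible reading of the reversed DOC normalization: dividing each intensity by the sample volume that supplied the constant carbon mass. The exact operation used by the authors is not stated, so the division and both function names are assumptions.

```python
def retention_grid(start_min=5.0, stop_min=22.9, step_min=0.1):
    """Retention times from start to stop (inclusive) in fixed steps."""
    n = round((stop_min - start_min) / step_min) + 1
    return [round(start_min + i * step_min, 1) for i in range(n)]


def reverse_doc_normalization(intensities, sample_volume_ml):
    """Express intensities per mL of original sample (assumed reading).

    Every vial held the same carbon mass, so a sample that needed a
    larger volume had a lower DOC concentration and is scaled down.
    """
    return [value / sample_volume_ml for value in intensities]


grid = retention_grid()
dataset_shape = (74, 2052, len(grid))  # samples x formulas x retention times
```

The grid has 180 points, which agrees with the dataset size quoted in the text.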
Dataset2.zip
 1. Dataset 1 was the source of dataset 2.
 2. All mass spectra were divided by a factor of 4.92 x 10^{7}.
 3. Masses that were detected in less than 10 % of measurements (including samples and retention times) were excluded from further analysis (N = 661, Dataset4.zip).
 4. An additional 36 molecular formulas (Dataset3.zip) were removed from the dataset due to noticeably unique chromatograms.
 5. Chromatographic sections with missing observations of at least 2 min (20 observations or more) were set to zero while leaving a gap of missing numbers of 0.7 min to each end of the section.
 6. Every 2^{nd} retention time (after t = 7 min) was excluded.
 7. All data above retention times of 22.2 min were excluded.
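
Two of the steps above, the zero-filling of long gaps and the thinning of the retention-time axis, can be sketched as follows. The function names are hypothetical, and two details are assumptions: thinning keeps the first point after 7 min and drops the next, and gaps at the edges of a chromatogram are treated like interior gaps.

```python
import math


def zero_fill_long_gaps(values, min_len=20, margin=7):
    """Set long runs of NaN to zero, leaving `margin` NaNs at each end.

    At 0.1 min resolution, 20 observations are 2 min and 7 are 0.7 min.
    """
    out = list(values)
    i = 0
    while i < len(out):
        if math.isnan(out[i]):
            j = i
            while j < len(out) and math.isnan(out[j]):
                j += 1  # j ends one past the last NaN of this run
            if j - i >= min_len:
                for k in range(i + margin, j - margin):
                    out[k] = 0.0
            i = j
        else:
            i += 1
    return out


def thin_retention_times(times_min, thin_after=7.0, cutoff=22.2):
    """Drop every 2nd time after `thin_after` and all times above `cutoff`."""
    kept = []
    late_count = 0
    for t in times_min:
        if t > cutoff + 1e-9:
            continue
        if t > thin_after + 1e-9:
            late_count += 1
            if late_count % 2 == 0:  # assumption: 2nd, 4th, ... are dropped
                continue
        kept.append(t)
    return kept


times = [round(5.0 + 0.1 * i, 1) for i in range(180)]  # 5.0 to 22.9 min
kept_times = thin_retention_times(times)
n_spectra = 74 * len(kept_times)
n_formulas = 2052 - 661 - 36
```

Under these assumptions the bookkeeping gives 97 retention times, 7178 mass spectra and 1355 molecular formulas, in line with the numbers stated in the abstract.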
Dataset3.zip
 Dataset3 contains outliers that were removed in step 4 of Dataset2.
Dataset4.zip
 Dataset4 contains rarely observed formulas that were removed in step 3 of Dataset2.
Dataset5.zip
To isolate groups of molecular formulas with identical chromatographic elution profiles, parallel factor analysis (PARAFAC) was utilized. All data processing and modelling was carried out using PLS_Toolbox (v8.61, Eigenvector Research Inc.) in MATLAB (v9.7, MathWorks Inc.). PARAFAC models were constrained to non-negativity in all modes and the convergence criterion was set to a relative change in fitting error between iterations of 10^{-12}. Each model was initialized 50 times with orthogonalized random numbers and only the least-squares solution was further inspected. Models with two to nine components were considered. A five-component model was validated. Dataset5 contains its properties and supporting geochemical sample parameters.
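
For readers without access to PLS_Toolbox, the idea of a non-negative PARAFAC model with several random starts can be illustrated with a short NumPy sketch. This is not the algorithm used for Dataset5: it swaps in simple multiplicative updates, uses plain (not orthogonalized) random starts, does not handle missing values, and reads the convergence criterion as a relative change of 10^{-12} in the sum of squared errors. All names are hypothetical.

```python
import numpy as np


def nonneg_parafac(X, rank, n_starts=5, n_iter=500, tol=1e-12, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] with A, B, C >= 0.

    Returns (sse, A, B, C) for the start with the lowest squared error.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12  # avoids division by zero in the updates
    best = None
    for _start in range(n_starts):
        A, B, C = (rng.random((n, rank)) for n in X.shape)
        prev = np.inf
        for _it in range(n_iter):
            # Multiplicative updates keep all factors non-negative
            A *= np.einsum('ijk,jr,kr->ir', X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
            B *= np.einsum('ijk,ir,kr->jr', X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
            C *= np.einsum('ijk,ir,jr->kr', X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
            sse = float(np.sum((X - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2))
            if np.isfinite(prev) and abs(prev - sse) <= tol * prev:
                break
            prev = sse
        if best is None or sse < best[0]:
            best = (sse, A, B, C)
    return best


def explained_percent(X, sse):
    """Percent of the (uncentered) sum of squares captured by the model."""
    return 100.0 * (1.0 - sse / float(np.sum(X ** 2)))
```

In this layout, the three modes would correspond to samples, molecular formulas and retention times. Keeping only the start with the lowest squared error mirrors the practice of inspecting only the least-squares solution.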
Dataset6.zip
Dataset6 contains the residual chromatograms (data minus model = dataset2 minus dataset5) for every sample and formula. To create one file, the residuals were unfolded into one large matrix. Each formula in every sample is accompanied by a tag that categorizes the residual chromatogram. The categories were assigned as follows (numbers correspond to the numbering scheme in the csv file):
(1) Underestimations are chromatograms in which more than 80% of residuals were positive.
(2) Overestimations are chromatograms in which more than 80% of residuals were negative.
(3) False positive abundances were identified by counting the cases in which PARAFAC estimated a nonzero chromatogram, but the data only contained zeros or missing observations.
(5) Residuals were classified as random when they did not fall into any other category, their absolute median was < 0.001, and the number of positive and negative residuals each accounted for between 40 and 60 % of the raw chromatograms (not counting zeros or missing observations).
(NaN) Residuals that did not fall into any of the above categories were left uncategorized.
The number "4" was not used in this listing.
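
The categories above could be assigned with a function along the following lines. It is a sketch under stated assumptions rather than the authors' implementation: the function name is hypothetical, the fractions of positive and negative residuals are computed over observations that are neither zero nor missing for all categories, and "absolute median" is read as the absolute value of the median residual.

```python
import math
from statistics import median


def classify_residuals(data, model):
    """Return 1, 2, 3, 5 or NaN for one formula in one sample.

    `data` and `model` are equally long chromatograms; missing data
    values are NaN.
    """
    observed = [(d, m) for d, m in zip(data, model)
                if not math.isnan(d) and d != 0]
    if not observed:
        # (3) model is nonzero although data hold only zeros or NaN
        return 3 if any(m > 0 for m in model) else math.nan
    residuals = [d - m for d, m in observed]  # data minus model
    n = len(residuals)
    positive = sum(r > 0 for r in residuals) / n
    negative = sum(r < 0 for r in residuals) / n
    if positive > 0.8:
        return 1  # underestimation
    if negative > 0.8:
        return 2  # overestimation
    if (abs(median(residuals)) < 0.001
            and 0.4 <= positive <= 0.6 and 0.4 <= negative <= 0.6):
        return 5  # random
    return math.nan  # uncategorized
```

For instance, `classify_residuals([1.0] * 5, [0.5] * 5)` returns 1 because the model underestimates every observation.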
Usage notes
Mathematical chromatography offers information reduction and feature extraction in complex liquid chromatography–mass spectrometry datasets.
All six datasets contain ReadMe files named "Readme  datasetX.txt" that provide detailed information for each dataset. Missing numbers are always indicated by "NaN", and the contents of rows and columns are always explained in the ReadMe files.
The data publication does not contain the MATLAB scripts or functions that were used to create datasets. However, with the information provided in the section "Methods", all datasets can be recreated. All data are provided as comma-separated files that can be read on any platform in any programming environment that is capable of reading *.csv files.
PARAFAC can be carried out using software provided free of charge in R (multiway package, https://cran.r-project.org/web/packages/multiway/multiway.pdf), in Python (TensorLy package, http://tensorly.org/stable/modules/generated/tensorly.decomposition.parafac.html), and in MATLAB (N-way package, http://models.life.ku.dk/nwaytoolbox).
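
As a minimal illustration of reading such a file with missing values coded as "NaN", the sketch below uses only the Python standard library. It assumes a purely numeric file without header rows; the actual layout of each file is described in its ReadMe, and the function name and example content are hypothetical.

```python
import csv
import io
import math


def read_numeric_csv(handle):
    """Read a comma-separated file of numbers into a list of rows.

    The string "NaN" is converted to a floating-point NaN by float().
    """
    return [[float(cell) for cell in record] for record in csv.reader(handle)]


# Stand-in for an opened file, e.g. open(path, newline="")
example = io.StringIO("1.0,NaN,3\n4,5,NaN\n")
rows = read_numeric_csv(example)
n_missing = sum(math.isnan(value) for row in rows for value in row)
```

Here `rows` holds two rows of three values and `n_missing` counts the two missing observations.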
Funding
Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning, Award: 201700743
Swedish Research Council, Award: 201804618
ÅForsk, Award: 19499