
Salmon Creek Organic Geochemistry Chemometric Data


Larsen, Laurel; Woelfle-Erskine, Cleo (2017), Salmon Creek Organic Geochemistry Chemometric Data, Dryad, Dataset,


This data file contains fluorescence indices and the results of a parallel factor analysis (PARAFAC) performed on samples collected within the Salmon Creek Watershed in Sonoma County, California. Please see Woelfle-Erskine et al. (2017) for site location details. The present dataset is provided in support of Larsen and Woelfle-Erskine (in review). The purpose of this dataset is to develop a fluorescent fingerprint for differentiating between surface water, groundwater, and hyporheic water.


Samples were collected with a peristaltic pump (Geotech Inc., Denver, CO). Both piezometer and surface-water samples collected for DOC analyses were filtered in the field through 0.2-micron polyethersulfone filters into acid-washed 40 mL amber vials and kept on ice during transport to the lab. Samples were analyzed for fluorescence within 3 days of collection. Isotope samples were collected in the same manner but were not filtered and were frozen until analysis. Radon samples were collected in pre-washed 2-L plastic bottles.

DOC fluorescence was analyzed on a Horiba Jobin Yvon Aqualog fluorometer, which records absorbance and fluorescence simultaneously, allowing for rigorous correction for the inner-filter effect. Samples were blank-subtracted, corrected for the inner-filter effect, and Raman-normalized within the Aqualog instrument software. First- and second-order Rayleigh scattering regions were removed, and a triangle of zeros was inserted in the region where the excitation wavelength is greater than the emission wavelength.
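The scatter handling described above can be illustrated with a minimal Python sketch. It is not the Aqualog or drEEM code. The function name `mask_eem`, the 10 nm half-width of the scatter bands, the use of NaN to mark removed cells, and the choice to let the zero triangle take precedence where the two regions overlap are all assumptions made for illustration; the source does not state these details.

```python
import math

def mask_eem(eem, ex_nm, em_nm, band_nm=10.0):
    """Return a masked copy of an EEM (rows = excitation, columns = emission).

    Cells within band_nm of the first-order (em = ex) or second-order
    (em = 2 * ex) Rayleigh lines are marked as missing (NaN). Cells where
    excitation exceeds emission are then set to zero.
    band_nm is a hypothetical half-width, not a value from the source.
    """
    out = [row[:] for row in eem]  # copy so the input is left untouched
    for i, ex in enumerate(ex_nm):
        for j, em in enumerate(em_nm):
            first_order = abs(em - ex) <= band_nm
            second_order = abs(em - 2 * ex) <= band_nm
            if first_order or second_order:
                out[i][j] = math.nan
            if ex > em:
                # Assumed precedence: the zero triangle overrides the NaN mask.
                out[i][j] = 0.0
    return out

# Toy example with two excitation and five emission wavelengths.
ex_nm = [300, 400]
em_nm = [290, 300, 450, 600, 800]
eem = [[1.0] * len(em_nm) for _ in ex_nm]
masked = mask_eem(eem, ex_nm, em_nm)
```

Marking scatter cells as missing rather than zero lets a later fitting step ignore them, whereas the zeros below the diagonal encode the physical expectation of no fluorescence at emission wavelengths shorter than the excitation wavelength.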

The entire library of samples collected from 2013 to 2015 (see distributional characteristics in Table 1) was used to develop a PARAFAC model for the study area. PARAFAC was implemented using the drEEM toolbox in Matlab. EEMs were trimmed to excitation wavelengths of 260-375 nm and emission wavelengths of 350-620 nm to remove regions that exhibited excessively high leverage on model fits and to avoid fitting noise. Each EEM was then normalized by its highest fluorescence value. Four outlier samples were identified based on their excessively high loadings and their EEMs and were removed from subsequent analysis. Two of the outliers had no apparent organic matter fluorescence, whereas the other two had anomalous spectral features. Models with between two and eight components were evaluated. With the five-component model, residuals resembled unstructured noise and were typically less than ten percent of the magnitude of the sample fluorescence. Models with up to five components met the Tucker’s congruence criterion for spectral equivalence in split halves for two alternative splits, whereas models with up to seven components met the criterion for only one of the two splits. Repeated model solutions with different initializations converged upon the same local minimum in eight out of ten trials with the five-component model, whereas the six-component model converged upon a more diverse set of local minima, for which at least two of the spectral loadings were noticeably different.
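Two of the steps above lend themselves to a short illustration: trimming and normalizing an EEM, and comparing spectral loadings from two split halves with Tucker's congruence coefficient, which for loading vectors x and y is sum(x*y) / sqrt(sum(x^2) * sum(y^2)). The following Python sketch is not the drEEM implementation. The function names are hypothetical, the EEM is assumed to contain no missing values and a nonzero peak, and the 0.95 cutoff is a commonly used value assumed here because the source does not state the threshold that was applied.

```python
import math

def trim_and_normalize(eem, ex_nm, em_nm,
                       ex_range=(260, 375), em_range=(350, 620)):
    """Trim an EEM (rows = excitation, columns = emission) to the given
    wavelength ranges, then divide by its highest remaining value.
    Assumes no missing values and a nonzero peak."""
    rows = [i for i, ex in enumerate(ex_nm) if ex_range[0] <= ex <= ex_range[1]]
    cols = [j for j, em in enumerate(em_nm) if em_range[0] <= em <= em_range[1]]
    trimmed = [[eem[i][j] for j in cols] for i in rows]
    peak = max(value for row in trimmed for value in row)
    return [[value / peak for value in row] for row in trimmed]

def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator

# Assumed cutoff; the source does not report the value used.
CONGRUENCE_THRESHOLD = 0.95

def spectra_match(x, y, threshold=CONGRUENCE_THRESHOLD):
    """True when two loadings are similar enough to count as equivalent."""
    return tucker_congruence(x, y) >= threshold

# Toy EEM: only the lower-right 2 x 2 block lies inside the trimmed ranges.
toy = trim_and_normalize([[9, 9, 9], [9, 2, 4], [9, 1, 3]],
                         ex_nm=[250, 300, 350], em_nm=[340, 400, 500])
```

Because the coefficient is insensitive to overall scaling, two loadings that differ only in magnitude still register as identical, which is why it suits a comparison of spectral shape between split halves.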

Because of the stability of its minimum, the reasonable appearance of its spectral loadings, and the unstructured, low-magnitude nature of its residuals, the five-component model was selected for further analysis. Component loadings were converted to Fmax values (which represent the fluorescence attributable to each component, in Raman units) through multiplication by the maximum excitation and emission loadings of the component.
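The Fmax conversion described above amounts to one multiplication per sample and component. A minimal Python sketch follows, assuming that "component loadings" refers to the per-sample loadings (scores) of a single component; the function name `fmax` and the example numbers are hypothetical and are not taken from the dataset.

```python
def fmax(sample_loadings, ex_loading, em_loading):
    """Convert one component's per-sample loadings to Fmax values.

    Each sample loading is multiplied by the component's maximum
    excitation loading and its maximum emission loading.
    """
    scale = max(ex_loading) * max(em_loading)
    return [score * scale for score in sample_loadings]

# Made-up loadings for a single component and two samples.
values = fmax(sample_loadings=[2.0, 0.5],
              ex_loading=[0.1, 0.5, 0.2],
              em_loading=[0.4, 0.3])
```

The same call would be repeated for each component of the model, using that component's own excitation and emission loadings.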

In addition to the development of the original PARAFAC model (hereafter referred to as the OP model), the PARAFAC model of Cory and McKnight (2005) (hereafter referred to as the CM model) was also applied to the set of collected EEMs. The CM model was developed from a library of samples collected from diverse environments, ranging from those with terrestrially dominated DOC to those with microbially dominated DOC.

We interpreted the PARAFAC results through further multivariate analyses. We performed a principal components analysis (PCA) to evaluate which fluorescent components and indices covaried with each other and were most strongly associated with different classes of samples (i.e., well, spring, piezometer, surface water). We also used a hierarchical clustering analysis to determine which groups of samples were most similar to each other with respect to the input fluorescent components and indices. The Fmax values from the CM and OP PARAFAC models, together with the fluorescence index (FI), humification index (HIX), and biological index (BIX), constituted the dependent variables for the PCA and hierarchical clustering analysis, both performed in JMP statistical software (SAS Institute, Inc., Cary, NC). The PCA was performed on the variable correlation matrix. The hierarchical clustering analysis was performed on standardized variables, using Ward’s minimum variance method.
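Both analyses above share one preprocessing idea: a PCA on the correlation matrix is equivalent to a PCA on variables that have been standardized to zero mean and unit variance, and the clustering was likewise run on standardized variables. The Python sketch below shows only that shared step and is not the JMP procedure. The function names are hypothetical, the use of the sample standard deviation (n - 1 denominator) is an assumption, and the eigendecomposition and Ward linkage themselves are left to a statistics package.

```python
import math

def zscore_columns(data):
    """Standardize each column of a samples-by-variables table to zero
    mean and unit sample standard deviation (n - 1 denominator, assumed)."""
    n, p = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    sds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / (n - 1))
           for k in range(p)]
    return [[(row[k] - means[k]) / sds[k] for k in range(p)] for row in data]

def correlation_matrix(data):
    """Correlation matrix computed as Z^T Z / (n - 1) from z-scored data."""
    z = zscore_columns(data)
    n, p = len(z), len(z[0])
    return [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(p)]
            for a in range(p)]

# Made-up table: three samples, three variables.
table = [[1.0, 2.0, 5.0],
         [2.0, 4.0, 3.0],
         [3.0, 6.0, 1.0]]
corr = correlation_matrix(table)
```

Standardizing first keeps variables with large numeric ranges, such as Fmax values in Raman units, from dominating unitless indices purely because of scale.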

Usage notes

Please see accompanying metadata.


National Science Foundation, Award: 1434309

Gordon and Betty Moore Foundation, Award: Data-driven Discovery Investigator Program


SW 38.355469, -123.029881
NW 38.379964, -122.987995
Bodega, CA, USA