Evaluating the use of Fourier transform Raman spectroscopy for pollen chemical characterization
Data files
Jul 23, 2025 version files 42.98 MB
- AppSpec_Muthreich_Data.zip (42.97 MB)
- README.md (7.52 KB)
Abstract
Vibrational spectroscopy is gaining popularity for understanding ecological and evolutionary patterns in plants, particularly in relation to the analysis of pollen grains. So far, Fourier transform infrared spectroscopy (FTIR) has been the main approach used to classify pollen grains based on chemical variations. However, FTIR may be less suitable for detecting differences in the pollen-grain exine, which is mainly composed of sporopollenin. In contrast, Raman spectroscopy has increased sensitivity for the main chemical components found within sporopollenins. We compare the classification performance and chemical information provided by FTIR and FT-Raman using a large dataset of Quercus L. pollen, comprising 5 species in 3 sections (section Cerris: Q. suber; section Ilex: Q. coccifera, Q. rotundifolia; section Quercus: Q. robur, Q. faginea). We use multiblock sparse partial least squares discriminant analysis (MB-SPLS-DA) to directly compare the two methods. Both FTIR and FT-Raman successfully classified Quercus pollen to section level (100% accuracy). At the species level, our models achieved ~90% accuracy for FT-Raman and FTIR separately and in the combined multiblock model. The multiblock results showed an increased number of sporopollenin peaks observed in FT-Raman spectra as compared to FTIR. These peaks are also of higher importance for classification. Results also showed differences in the types of vibrations that are of diagnostic value for the two methods. CH2 deformations are more important in FT-Raman, while C-O-C, C-O, and C=O stretches are more important for FTIR-based identification of pollen. These vibrations are indicators of carbohydrates, proteins, and lipids. FT-Raman provides equally successful diagnostic potential to FTIR, but uses more chemical information based on variations in sporopollenin chemistry than FTIR.
We suggest that the combined analysis of FTIR and FT-Raman using multiblock analysis has great potential for classification.
Dataset overview
A description and explanation of the methodology are included in the published article: Muthreich F, Tafintseva V, Zimmermann B, Kohler A, Vila-Viçosa CM, Seddon AWR. Evaluating the Use of Fourier Transform Raman Spectroscopy for Pollen Chemical Characterization. Applied Spectroscopy. 2025;0(0). doi:10.1177/00037028251334405
Dependencies
Analysis and article were written in R version 4.3.1.
Packages needed for data processing, analysis, and plotting are listed in 'R/packages.R'.
Alternatively, a renv snapshot (R package renv) is provided, which can be loaded with 'renv::restore()'.
The manuscript was built with bookdown and knitted to PDF. Knitting requires pandoc and a LaTeX distribution in addition to rmarkdown and bookdown; the required installations differ depending on the operating system. The latest knitted versions of the manuscript and this file are provided as '.pdf'.
Data folder and file structure
The main folder containing the data is 'AppSpec_Muthreich_Data.zip'.
The data folder has two subfolders: one for input files and one for output files generated during analyses.
data
|-- input
| |-- data_metainfo.csv
| |-- node_1.csv
| |-- node_1_ftir.csv
| |-- node_1_raman.csv
| |-- node_2.csv
| |-- node_2_ftir.csv
| |-- node_3.csv
| |-- node_3_raman.csv
| |-- Quercus_ATR_RAW.csv
| |-- Raman_Portugal.csv
| |-- scores.csv
|-- output
File descriptions:
data_metainfo.csv
Columns contained in this file.
- ID: ID code
- SID: machine sample id
- Sample.Name: Sample name from analysis machine
- SRep: Sample replicate; same number indicates same sample
- MRep: Measurement replicate
- Section: Taxonomic Quercus section
- Genus: Genus
- Species: Species
- Sub_Spec: taxonomic subspecies
- Latitude: Sampled tree location latitude
- Longitude: Sampled tree location longitude
- Group: subgroup sample location
- Location: Sample location description
- Date: date tree was sampled
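Because rows that share an SRep value come from the same sample, it can help to group measurement replicates by SRep before any further handling. The following is a minimal Python sketch of that grouping, not the authors' code (the archived analysis is written in R). The rows in `SAMPLE` are hypothetical and only mimic a subset of the documented column names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows that only mimic the documented column names;
# they are NOT taken from data_metainfo.csv.
SAMPLE = """ID,SRep,MRep,Section,Species
a1,1,1,Ilex,coccifera
a2,1,2,Ilex,coccifera
a3,2,1,Cerris,suber
a4,2,2,Cerris,suber
a5,3,1,Quercus,robur
"""


def group_by_sample(handle):
    """Map each SRep value to the list of measurement IDs that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["SRep"]].append(row["ID"])
    return dict(groups)


groups = group_by_sample(io.StringIO(SAMPLE))
```

To apply the same function to the real file, pass an opened handle to 'data/input/data_metainfo.csv' in place of the in-memory sample.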
node_1.csv
File contains the classification results for the different preprocessing parameters at the first classification node (classification to Quercus section).
Classification metric is recall.
- First 4 rows are the FTIR-only models, trained with the spectral processing parameters given in the corresponding columns
- Second 4 rows are the Raman-only models
- Rest of the table contains the different permutations of the multiblock models, where both spectral datasets were used
- Columns 1-4 describe which preprocessing parameters were used in the FTIR dataset for the corresponding row model
- Columns 5-8 describe which preprocessing parameters were used in the FT-Raman dataset for the corresponding row model
Explanation of column names:
- short: short spectral range 700-1900 cm^-1
- long: extended spectral range 600-3200 cm^-1
- 11: window size parameter of the Savitzky-Golay processing method
- 29: window size parameter of the Savitzky-Golay processing method
- Cerris: Quercus Section Cerris recall results
- Ilex: Quercus Section Ilex recall results
- Quercus: Quercus Section Quercus recall results
- Accuracy: weighted Accuracy of all Quercus Section recall results
- Ave_Acc: average Accuracy of all Quercus Section recall results
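The relationship between the per-class recall columns and the two summary columns can be illustrated with a small Python sketch. It assumes that "weighted" means recalls weighted by class size (equivalent to correct predictions divided by all predictions) and that "average" is the unweighted mean of the recalls; the README does not spell out either definition. The confusion counts are hypothetical.

```python
def recall_per_class(confusion):
    """confusion[true][predicted] = count; returns {class: recall}."""
    return {
        true: row.get(true, 0) / sum(row.values())
        for true, row in confusion.items()
    }


def weighted_accuracy(confusion):
    """Class-size-weighted mean of recalls, i.e. correct / total (assumed definition)."""
    correct = sum(row.get(true, 0) for true, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def average_accuracy(confusion):
    """Unweighted mean of the per-class recalls (assumed definition)."""
    recalls = recall_per_class(confusion)
    return sum(recalls.values()) / len(recalls)


# Hypothetical counts, not results from node_1.csv.
CONFUSION = {
    "Cerris": {"Cerris": 8, "Ilex": 2, "Quercus": 0},
    "Ilex": {"Cerris": 0, "Ilex": 20, "Quercus": 0},
    "Quercus": {"Cerris": 1, "Ilex": 0, "Quercus": 19},
}
```

With unequal class sizes the two summaries differ, which is presumably why both are reported.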
node_1_ftir.csv
File contains the loadings of the FTIR model for the first classification node for the entire spectral region 700-3200 cm^-1.
First Column is Quercus section
Rest of the columns are the wavenumbers in cm^-1
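Given this layout (a label column followed by one column per wavenumber), the wavenumbers with the largest absolute loadings can be picked out per row. The Python sketch below assumes one row per section and numeric wavenumber headers; the values in `SAMPLE` are hypothetical, and ranking by absolute loading is only one simple way to inspect the table.

```python
import csv
import io

# Hypothetical loadings table in the documented layout; not real values.
SAMPLE = """Section,700,702,704,706
Cerris,0.01,-0.30,0.05,0.12
Ilex,-0.20,0.02,0.25,-0.01
"""


def top_wavenumbers(handle, k=2):
    """Return, per row label, the k wavenumbers with the largest |loading|."""
    reader = csv.reader(handle)
    header = next(reader)
    wavenumbers = [float(w) for w in header[1:]]
    result = {}
    for row in reader:
        label = row[0]
        values = [float(v) for v in row[1:]]
        ranked = sorted(
            zip(wavenumbers, values), key=lambda pair: abs(pair[1]), reverse=True
        )
        result[label] = [w for w, _ in ranked[:k]]
    return result


top = top_wavenumbers(io.StringIO(SAMPLE), k=2)
```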
node_1_ftir_2.csv
File contains the loadings of the FTIR model for the first classification node for the reduced/short spectral region 700-1900 cm^-1.
First Column is Quercus section
Rest of the columns are the wavenumbers in cm^-1
node_1_raman.csv
File contains the loadings of the Raman model for the first classification node for the entire spectral region 700-3200 cm^-1.
First Column is Quercus section
Rest of the columns are the wavenumbers in cm^-1
node_1_raman_2.csv
File contains the loadings of the Raman model for the first classification node for the reduced/short spectral region 700-1900 cm^-1.
First Column is Quercus section
Rest of the columns are the wavenumbers in cm^-1
node_2.csv
File contains the classification results for the different preprocessing parameters at the second classification node (classification to species within Quercus section Ilex).
Classification metric is recall per species and overall accuracy.
- First 4 rows are the FTIR-only models
- Second 4 rows are the Raman-only models
- Rest of the table contains the different permutations of the multiblock models, where both spectral datasets were used
- Columns 1-4 describe which preprocessing parameters were used in the FTIR dataset for the corresponding row model
- Columns 5-8 describe which preprocessing parameters were used in the FT-Raman dataset for the corresponding row model
Explanation of column names:
- short: short spectral range 700-1900 cm^-1
- long: extended spectral range 600-3200 cm^-1
- 11: window size parameter of the Savitzky-Golay processing method
- 29: window size parameter of the Savitzky-Golay processing method
- coc: Quercus species Quercus coccifera recall results
- rot: Quercus species Quercus rotundifolia recall results
- Accuracy: weighted Accuracy of both Quercus species recall results
- Ave_Acc: average Accuracy of both Quercus species recall results
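To illustrate what the Savitzky-Golay window size controls, the Python sketch below smooths a signal with a moving local polynomial fit. The README states only the window sizes (11 and 29); the polynomial degree (2 here), the absence of a derivative step, and evenly spaced points are all assumptions of this sketch, so it should not be read as the preprocessing used in the article.

```python
def savgol_weights(window):
    """Closed-form smoothing weights for a quadratic Savitzky-Golay filter."""
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be odd and at least 3")
    half = window // 2
    denom = 4 * window * (window**2 - 4)
    return [
        3 * (3 * window**2 - 7 - 20 * i**2) / denom
        for i in range(-half, half + 1)
    ]


def savgol_smooth(values, window):
    """Smooth interior points; the first and last window // 2 points are left unchanged."""
    half = window // 2
    weights = savgol_weights(window)
    out = list(values)
    for centre in range(half, len(values) - half):
        segment = values[centre - half : centre + half + 1]
        out[centre] = sum(w * v for w, v in zip(weights, segment))
    return out


# A larger window averages over more neighbouring points and smooths more strongly.
signal = [float(x * x) for x in range(40)]
smoothed_11 = savgol_smooth(signal, 11)
smoothed_29 = savgol_smooth(signal, 29)
```

A quadratic input is reproduced exactly at interior points, which is a convenient sanity check for the weights.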
node_2_ftir.csv
File contains the loadings of the FTIR model for the second classification node.
First Column is Quercus species name
Rest of the columns are the wavenumbers in cm^-1
node_3.csv
File contains the classification results for the different preprocessing parameters at the third classification node (classification to species within Quercus section Quercus: Q. faginea and Q. robur).
Classification metric is recall per species and overall accuracy.
- First 4 rows are the FTIR-only models
- Second 4 rows are the Raman-only models
- Rest of the table contains the different permutations of the multiblock models, where both spectral datasets were used
- Columns 1-4 describe which preprocessing parameters were used in the FTIR dataset for the corresponding row model
- Columns 5-8 describe which preprocessing parameters were used in the FT-Raman dataset for the corresponding row model
Explanation of column names:
- short: short spectral range 700-1900 cm^-1
- long: extended spectral range 600-3200 cm^-1
- 11: window size parameter of the Savitzky-Golay processing method
- 29: window size parameter of the Savitzky-Golay processing method
- fag: Quercus species Quercus faginea recall results
- rob: Quercus species Quercus robur recall results
- Accuracy: weighted Accuracy of both Quercus species recall results
- Ave_Acc: average Accuracy of both Quercus species recall results
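The row layout shared by the node result tables (4 FTIR-only rows, 4 Raman-only rows, then the multiblock permutations) is consistent with a grid of two spectral ranges by two window sizes per method. The Python sketch below enumerates such a grid. The ordering within each block and the total of 16 multiblock combinations are assumptions inferred from that description, not read from the files.

```python
from itertools import product

RANGES = ("short", "long")  # spectral range options named in the README
WINDOWS = (11, 29)          # Savitzky-Golay window sizes named in the README


def preprocessing_grid():
    """List (model type, FTIR settings, Raman settings) for every combination."""
    single = list(product(RANGES, WINDOWS))
    rows = [("FTIR", ftir, None) for ftir in single]
    rows += [("Raman", None, raman) for raman in single]
    rows += [("Multiblock", ftir, raman) for ftir, raman in product(single, single)]
    return rows


grid = preprocessing_grid()
```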
node_3_raman.csv
File contains the loadings of the Raman model for the third classification node.
First Column is Quercus species name
Rest of the columns are the wavenumbers in cm^-1
Quercus_ATR_RAW.csv
File with the raw FTIR spectra dataset.
Columns are wavenumbers and rows are samples.
Raman_Portugal.csv
File with the raw Raman spectra dataset.
Columns are wavenumbers and rows are samples.
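A table in this orientation can be cropped to a spectral range by selecting columns whose header falls inside the range, for example the short range of 700-1900 cm^-1. The Python sketch below assumes that the first column holds a sample identifier and that the remaining headers parse as numbers; neither is confirmed here for the raw files, and the values in `SAMPLE` are hypothetical.

```python
import csv
import io

# Hypothetical spectra with rows as samples and columns as wavenumbers.
SAMPLE = """sample,600,700,1300,1900,3200
s1,0.1,0.2,0.3,0.4,0.5
s2,0.5,0.4,0.3,0.2,0.1
"""


def crop_spectra(handle, low, high):
    """Keep only the columns whose wavenumber lies in [low, high]."""
    reader = csv.reader(handle)
    header = next(reader)
    keep = [
        index
        for index, name in enumerate(header[1:], start=1)
        if low <= float(name) <= high
    ]
    wavenumbers = [float(header[index]) for index in keep]
    spectra = {row[0]: [float(row[index]) for index in keep] for row in reader}
    return wavenumbers, spectra


wavenumbers, spectra = crop_spectra(io.StringIO(SAMPLE), 700, 1900)
```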
scores.csv
File contains the latent variables or component scores for each sample used in the figures in the paper.
Contains 4 columns:
- sample name
- LV1: First latent variable
- LV2: Second latent variable
- scores: type of score, one of:
  - global: global model latent variable score
  - fitr: FTIR block model latent variable score
  - raman: Raman block model latent variable score