Proteomic spectra of epipelagic copepods


Peters, Janna; Laakmann, Silke; Renz, Jasmin (2023), Proteomic spectra of epipelagic copepods, Dryad, Dataset,


We analyzed the robustness of species identification based on proteomic composition with respect to data processing and intraspecific variability, the specificity and sensitivity of species markers, and the discriminatory power of proteomic fingerprinting and its sensitivity to phylogenetic distance. Our analysis is based on MALDI-TOF MS data from 32 marine copepod species from 13 regions (North and Central Atlantic and adjacent seas). A random forest (RF) model correctly classified all specimens to species level with only small sensitivity to data processing, demonstrating the strong robustness of the method. Compounds with high specificity showed low sensitivity, i.e., identification was based on complex pattern differences rather than on the presence of single markers.


Specimens were derived from ethanol samples (> 96 %) that were collected during diverse monitoring programs or field campaigns, and from a copepod culture. Age and storage conditions varied between samples; from 2018 onwards, samples were stored at 4°C. Adult female copepods were identified morphologically to species level and stored in ethanol at 4°C until further processing. In total, 752 specimens from 32 species and 13 different regions were used for proteomic fingerprinting analyses. Proteomic profiles were determined for all 752 specimens. For small copepods (< 2 mm) the whole specimen, and for larger copepods a piece of the cephalosome, was briefly dried at room temperature and kept in an Eppendorf tube. Depending on sample size, 5–10 µl of matrix solution (α-cyano-4-hydroxycinnamic acid as a saturated solution in 50% acetonitrile, 47.5% LC-MS grade water, and 2.5% trifluoroacetic acid) was added. After at least 10 min of extraction, 1.2 µl of each sample was spotted onto the target plate in 2–3 replicates.

Protein mass spectra were measured from 2 to 20 kDa using a linear-mode MALDI-TOF system (Microflex LT/SH, Bruker Daltonics). Peak intensities were analyzed during random measurement in the range between 2 and 20 kDa using a centroid peak detection algorithm, a signal-to-noise threshold of 2, and a minimum intensity threshold of 400, with a peak resolution higher than 400 for mass spectra evaluation. The Proteins/Oligonucleotide method was employed for fuzzy control with a maximal resolution 10 times above the threshold. For each sample, 240 satisfactory shots were summed up. Spectra in the range of 2–20 kDa were processed with the R packages MALDIquant and MALDIquantForeign using square root transformation, Savitzky-Golay smoothing with a half window size of 10, baseline removal by the statistics-sensitive non-linear iterative peak-clipping (SNIP) algorithm, and normalization by setting the total ion current to 1. Normalized spectra of technical replicates were averaged.
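
Three of the spectrum processing steps named above (square root transformation, total ion current normalization, and averaging of technical replicates) can be illustrated with a minimal Python sketch. This is not the MALDIquant code used for the dataset: smoothing and baseline removal are omitted, the total ion current is approximated as a plain sum of intensities on a shared m/z grid, and the function names and toy values are hypothetical.

```python
import math


def sqrt_transform(intensities):
    """Variance-stabilising square root transformation of raw intensities."""
    return [math.sqrt(max(x, 0.0)) for x in intensities]


def tic_normalize(intensities):
    """Scale a spectrum so that its intensities sum to 1.

    Assumption: the total ion current is taken as the plain sum of
    intensities, a simplification of what a dedicated package may compute.
    """
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [x / total for x in intensities]


def average_replicates(replicates):
    """Point-wise mean of technical replicates measured on the same m/z grid."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


# Hypothetical toy data: two technical replicates of one specimen,
# four m/z points each.
replicates = [[100.0, 400.0, 900.0, 0.0], [0.0, 100.0, 400.0, 400.0]]
normalized = [tic_normalize(sqrt_transform(r)) for r in replicates]
mean_spectrum = average_replicates(normalized)
```

Because each replicate is normalized before averaging, the averaged spectrum also sums to 1, so replicates with different absolute signal strength contribute equally.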

Optimal peak detection parameters were derived by varying the signal-to-noise ratio (SNR) threshold for peak identification and the half window size (HWS) of peak picking, both in the range of 3–15, with the species classification success of the random forest model (see below) as the target variable. We also determined the proportion of closely spaced peaks after binning (i.e., peaks with a mass difference < 6 Da), which risk being mis-assigned to a specific m/z value. The highest classification success was reached with an SNR of 4 and an HWS of 3; however, the proportion of closely spaced peaks was still high at these settings and decreased strongly at SNR and HWS values of 8. These peak detection parameters (SNR = 8, HWS = 8) were therefore applied to the final data set. Picked peaks were repeatedly binned to compensate for small variations in m/z values between measurements until the intensity matrix reached a stable peak number (tolerance of 2000 ppm, strict approach). All signals below the SNR threshold were set to zero in the final peak matrix. For all further analyses, peak intensities were Hellinger transformed using the R package vegan.
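
The quantities in this paragraph can be made concrete with a short Python sketch. It shows the Hellinger transformation of one row of a peak matrix, the absolute width of a 2000 ppm tolerance at a given m/z, and one possible way to count closely spaced peaks. The counting rule (a peak is "close" if its nearest neighbour is less than 6 Da away) is an assumption, as are all function names and example values; the dataset itself was processed in R.

```python
import math


def hellinger(row):
    """Hellinger transformation: square root of each intensity's share of the row total."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    return [math.sqrt(x / total) for x in row]


def ppm_window(mz, ppm=2000.0):
    """Absolute m/z deviation (in Da) corresponding to a relative tolerance in ppm."""
    return mz * ppm / 1e6


def close_peak_fraction(peak_mz, min_gap=6.0):
    """Fraction of peaks whose nearest neighbour is less than min_gap Da away.

    Assumption: this is one plausible reading of "closely spaced peaks".
    """
    mz = sorted(peak_mz)
    if len(mz) < 2:
        return 0.0
    close = 0
    for i, m in enumerate(mz):
        left = m - mz[i - 1] if i > 0 else math.inf
        right = mz[i + 1] - m if i < len(mz) - 1 else math.inf
        if min(left, right) < min_gap:
            close += 1
    return close / len(mz)


# Hypothetical binned peak positions (m/z) and one row of the intensity matrix.
peaks = [2000.0, 2004.0, 3500.0, 7000.0, 7005.0, 12000.0]
row = [0.0, 1.0, 4.0, 4.0, 0.0, 16.0]

transformed = hellinger(row)
fraction_close = close_peak_fraction(peaks)
window_at_3kda = ppm_window(3000.0)
```

A relative tolerance grows with mass: 2000 ppm corresponds to 4 Da at m/z 2000 and 40 Da at m/z 20000, which is why an absolute spacing criterion is useful as a separate check on the binned peaks.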

Species classification was performed with a random forest (RF) model using the R package randomForest, with 2000 trees and the square root of the peak number as the number of randomly sampled variables at each split. To avoid overrepresentation of the most abundant species, the number of sub-sampled specimens per species in each of the 2000 decision trees was limited to the abundance of the least abundant species (N = 4). Species with fewer than four specimens were excluded, as we observed a strong increase in the out-of-bag error with smaller sample size. The RF classification model was thus applied to all specimens of the remaining 27 species.
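
The balanced sampling scheme described above can be sketched in plain Python as follows. The sketch covers only the bookkeeping for a single tree: excluding species with fewer than four specimens, drawing four specimens per remaining species, and choosing the number of peaks tried at each split. It does not train a forest. Drawing with replacement is an assumption (the text does not state it), and all names and toy labels are hypothetical.

```python
import math
import random
from collections import defaultdict


def filter_species(labels, min_specimens=4):
    """Return indices of specimens whose species has at least min_specimens members."""
    counts = defaultdict(int)
    for species in labels:
        counts[species] += 1
    return [i for i, species in enumerate(labels) if counts[species] >= min_specimens]


def balanced_sample(labels, indices, per_species=4, rng=random):
    """Draw per_species specimens from every species for one tree.

    Assumption: specimens are drawn with replacement.
    """
    by_species = defaultdict(list)
    for i in indices:
        by_species[labels[i]].append(i)
    sample = []
    for species in sorted(by_species):
        sample.extend(rng.choices(by_species[species], k=per_species))
    return sample


def mtry(n_peaks):
    """Number of peaks tried at each split: floor of the square root of the peak count."""
    return math.isqrt(n_peaks)


# Hypothetical labels: species A has 6 specimens, B has 4,
# and C has only 2 and is therefore excluded.
labels = ["A"] * 6 + ["B"] * 4 + ["C"] * 2
kept = filter_species(labels)
rng = random.Random(1)
in_bag = balanced_sample(labels, kept, per_species=4, rng=rng)
in_bag_set = set(in_bag)
out_of_bag = [i for i in kept if i not in in_bag_set]
```

Specimens that are not drawn for a given tree form its out-of-bag set, which is what the out-of-bag error is computed on; with equal draws per species, abundant species contribute more out-of-bag specimens but no more training specimens than rare ones.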


Deutsche Forschungsgemeinschaft, Award: RE2808/3-1/2