Data from: A universal tool for marine metazoan species identification – Towards best practices in proteomic fingerprinting
Data files
Jan 10, 2024 version files 114.31 KB
Abstract
Proteomic fingerprinting using MALDI-TOF mass spectrometry is a well-established tool for identifying microorganisms and has shown promising results for identification of animal species, particularly disease vectors and marine organisms. However, few studies have tested species identification across different orders and classes. In this study, we collected data from 1,246 specimens and 198 species to test species identification in a diverse dataset. We also evaluated different specimen preparation and data processing approaches for machine learning and developed a workflow to optimize classification using random forest. Our results showed high success rates of over 90%, but we also found that the size of the reference library affects classification error. Additionally, we demonstrated the ability of the method to differentiate marine cryptic-species complexes and to distinguish sexes within species.
README: Data from: Proteomic fingerprinting as universal tool for species identification in marine metazoans on the road to best practices
The data relates to the publication Proteomic fingerprinting as universal tool for species identification in marine metazoans on the road to best practices in Nature Scientific Reports.
DOI:10.1038/s41598-024-51235-z
This dataset contains a spreadsheet file called ReferenceTable.csv and a .zip file containing raw Bruker mass spectrometry files for 1246 measurements from a variety of marine species. For each specimen, one folder is included in the .zip file. These folders are named according to the specimen names in the spreadsheet file. Furthermore, the spreadsheet file contains information on the taxonomic classification of each specimen.
The .zip file contains a folder named Dryad in which folders for the 1246 specimens are deposited. These in turn contain folders for each specimen. The "specimen folders" are built according to the default Bruker flex container system and contain one folder for each spot that was measured for a specimen. Each "spot folder" consists of folders for each technical replicate measurement on the respective spot. Within these folders, the actual data produced by the instrument can be found.
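The nested layout described above (specimen folder, then spot folder, then technical replicate folder) can be walked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and it assumes that the directory passed as root directly contains the specimen folders. If the unpacked archive has an additional nesting level, root has to point one level deeper.

```python
from pathlib import Path


def list_measurements(root):
    """Return (specimen, spot, replicate) folder-name triples found under root.

    Assumes the layout root/<specimen>/<spot>/<replicate>/ described in the README.
    """
    triples = []
    for specimen in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for spot in sorted(p for p in specimen.iterdir() if p.is_dir()):
            for replicate in sorted(p for p in spot.iterdir() if p.is_dir()):
                triples.append((specimen.name, spot.name, replicate.name))
    return triples


def replicates_per_specimen(triples):
    """Count how many technical replicate folders belong to each specimen."""
    counts = {}
    for specimen, _spot, _replicate in triples:
        counts[specimen] = counts.get(specimen, 0) + 1
    return counts
```

Comparing the keys of replicates_per_specimen(...) with the specimen names in ReferenceTable.csv is one way to check that the archive was unpacked completely.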
Data can be analyzed using Bruker proprietary software such as Bruker Flex analysis or Bruker Biotyper.
Alternatively, data can be analyzed in R using the R packages MALDIquant and MALDIquantForeign, following the vignette that can be found via:
https://cran.r-project.org/web/packages/MALDIquant/index.html
Additionally, the dataset contains three .R files. These are the R scripts as they were used in the referenced study.
The file Workflow.R contains a script for default data processing in R to produce a data matrix that can be used for further analyses.
The file Batch_script_Classification_posthoc_parallel.R contains a script for testing classification by discarding a specimen from the data and then having it classified using the remaining data. A post-hoc test is then applied to the classification. The script could be modified for classification at higher taxonomic levels. This script was specifically created for parallel processing on a larger server.
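The leave-one-out idea behind this script can be illustrated independently of the R code. The following Python sketch is not the authors' script: it uses a simple nearest-centroid classifier as a stand-in for the classifier used in the study, all function names are hypothetical, and the post-hoc test is left out because its details are not given in this README.

```python
import math


def nearest_centroid_predict(train_rows, train_labels, query):
    """Stand-in classifier (not random forest): label of the closest class mean."""
    sums, counts = {}, {}
    for row, label in zip(train_rows, train_labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = math.dist(centroid, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out(rows, labels, predict=nearest_centroid_predict):
    """Hold out each specimen in turn and classify it from the remaining ones.

    Returns a list of (true_label, predicted_label) pairs.
    """
    results = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        results.append((labels[i], predict(train_rows, train_labels, rows[i])))
    return results
```

Here rows stands for the rows of a specimen-by-peak data matrix and labels for the species names; any other classifier with the same call signature can be passed as predict.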
The file Data optimization.R is a script to vary the baseline iterations as well as halfWindowSize and the signal-to-noise ratio during peak detection. At the end, the script creates a random forest model and saves the OOB error. The script is designed for parallel processing on a larger server and allows several variations to be analyzed at once.
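Such a parameter sweep amounts to building every combination of the varied settings and keeping the one with the lowest OOB error. The sketch below shows only this bookkeeping in Python. The function names, dictionary keys and example values are assumptions for illustration, and the OOB errors would have to come from models fitted elsewhere, for example by the R script itself.

```python
from itertools import product


def parameter_grid(baseline_iterations, half_window_sizes, snr_values):
    """Build every combination of the three processing parameters."""
    return [
        {"iterations": it, "halfWindowSize": hws, "SNR": snr}
        for it, hws, snr in product(baseline_iterations, half_window_sizes, snr_values)
    ]


def best_setting(settings, oob_errors):
    """Return the setting with the lowest OOB error (first one wins on ties)."""
    if len(settings) != len(oob_errors):
        raise ValueError("one OOB error is needed per setting")
    index = min(range(len(settings)), key=lambda i: oob_errors[i])
    return settings[index]


# Illustrative values only; they are not taken from the study.
grid = parameter_grid([50, 100], [10, 20], [2, 5])
```

Because the combinations are independent of each other, each entry of grid can be handed to a separate worker, which is why this kind of analysis suits parallel processing.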
Methods
Tissue for measurements was taken mainly from the marine organisms tissue bank of the Senckenberg am Meer, German Centre for Marine Biodiversity Research, which was established using samples from numerous studies (Knebelsberger and Thiel, 2014; Knebelsberger et al., 2014; Markert et al., 2014; Gebhardt and Knebelsberger, 2015; Raupach et al., 2015; Barco et al., 2016; Laakmann et al., 2016; Rossel et al., 2020b) (see supplementary table S1 for accession numbers) on North Sea metazoans. The material from this collection was taken from specimens processed for COI-barcoding to create reference libraries for a variety of marine animal groups. During this process, tissue samples of the respective specimens were stored in ethanol at -80°C. Tissue samples were available for Bivalvia (muscle, 18 species), Cephalopoda (muscle from arm, 12 species), Gastropoda (muscle from foot, 24 species), Polyplacophora (muscle from foot, 2 species), Ascidiacea (tissue, 1 species), Teleostei (muscle, 67 species), Elasmobranchii (muscle, 7 species), Malacostraca (muscle from foot or chelae, 39 species), Thecostraca (muscle from foot, 1 species), Pycnogonida (leg fragment, 1 species), Asteroidea (tube feet, 10 species), Ophiuroidea (tissue from arm, 10 species) and Echinoidea (tissue from the base of the tubercle, 6 species) (198 species, 1,246 specimens in total).
Sample preparation
The basic protocol of sample preparation was the same for all analyzed tissue samples. A very small tissue fragment (< 1 mm³) was incubated for 5 minutes in α-cyano-4-hydroxycinnamic acid (HCCA) as a saturated solution in 50% acetonitrile, 47.5% molecular grade water and 2.5% trifluoroacetic acid. Tissue from the crustacean Cancer pagurus Linnaeus, 1758, the fish Clupea harengus Linnaeus, 1758, the cephalopod Eledone cirrhosa (Lamarck, 1798) and the echinoderm Stichastrella rosea (O.F. Müller, 1776) was used to find an optimal tissue to HCCA matrix ratio. Tissue was weighed on a METTLER TOLEDO XS3DU micro-balance and the amount of matrix was adjusted to tissue weight to obtain the desired ratios ranging from 0.012 µg µl⁻¹ to 200 µg µl⁻¹. After incubation, 1.5 µl of the solution was transferred to each of 10 spots on a target plate. Mass spectra were measured with a Microflex LT/SH System (Bruker Daltonics) using method MBTAuto. Peak evaluation was carried out in a mass range between 2,000 and 10,000 Dalton (Da) using a centroid peak detection algorithm, a signal to noise threshold of 2 and a minimum intensity threshold of 600. To create a sum spectrum, 160 satisfactory shots were summed up.
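Adjusting the matrix amount to the tissue weight is simple arithmetic once the ratio is read as tissue mass per matrix volume, as its unit (µg per µl) suggests. The short Python sketch below states this relation. The function name and the example numbers are hypothetical and not taken from the study.

```python
def matrix_volume_ul(tissue_mass_ug, ratio_ug_per_ul):
    """Matrix volume (µl) needed to reach a given tissue-to-matrix ratio.

    Assumes ratio = tissue mass (µg) / matrix volume (µl).
    """
    if tissue_mass_ug <= 0 or ratio_ug_per_ul <= 0:
        raise ValueError("mass and ratio must be positive")
    return tissue_mass_ug / ratio_ug_per_ul


# Illustrative numbers only: 50 µg of tissue at a ratio of 2 µg per µl.
example_volume = matrix_volume_ul(50, 2)
```

Under this reading, a higher ratio means less matrix for the same piece of tissue.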
Based on observations from this initial test, a rapidly applicable protocol was developed that does not require weighing each tissue sample. Matrix volume was added to tissue samples depending on tissue volume, i.e. tissue samples were always completely covered by HCCA matrix with a small layer (ca. 1 mm) of supernatant. Samples were incubated for 5 minutes and 1.5 µl of the solution was transferred to a single spot on a target plate for measurement. Each spot was measured two to three times.
Usage notes
Data can be analyzed using Bruker proprietary software such as Bruker Flex analysis or Bruker Biotyper.
Alternatively, data can be analyzed in R using the R packages MALDIquant and MALDIquantForeign, following the vignette that can be found via:
https://cran.r-project.org/web/packages/MALDIquant/index.html