MODID: Multispectral oral disease image dataset with segmentation

Chand, Sneha 1 ; Namasivayam, Karthik2 ; Dave, Janak3 ; SP, Preejith3 ; Sadaksharam, Jayachandran2 ; Sivaprakasam, Mohanasankar1

Research facility: Healthcare Technology Innovation Centre

Published Aug 22, 2024 on Dryad. https://doi.org/10.5061/dryad.nvx0k6dxw

Data files

Aug 22, 2024 version files 4.72 GB

masks.zip (298.52 KB)
MODID_DESCRIPTOR.xlsx (49.10 MB)
processed_data.zip (1.89 GB)
README.md (6.30 KB)
rgb.zip (49.74 MB)
unprocessed_data-20240116T152505Z-001.zip (807.14 MB)
unprocessed_data-20240116T152505Z-002.zip (1.04 GB)
unprocessed_data-20240116T152505Z-003.zip (888.52 MB)

Abstract

In oral imaging spectroscopy, tissue data is often collected from resected samples, which may not accurately reflect true tissue signatures due to preservation chemicals. To overcome this limitation, we aim to capture in-vivo spectral signatures for oral diseases. Our dataset comprises spectral images acquired in 16 bands, ranging from 460 nm to 600 nm, with a resolution of 270x510 pixels. We collected these images from 91 volunteers, including 15 healthy individuals and 78 patients with various oral health conditions. Detailed patient history records are also included for each case. This publicly available multispectral dataset holds great potential for advancing spectroscopy diagnosis. By leveraging artificial intelligence in conjunction with this robust spectral image repository, more accurate and insightful outcomes can be achieved in the analysis and diagnosis of oral diseases. Access to this comprehensive oral health dataset contributes to enhancing our understanding of tissue spectroscopy variability and facilitates further research in the field of disease analysis.

README: MODID: Multispectral oral disease image dataset with segmentation

The dataset consists of 244 samples from 91 patients. Each sample includes both unprocessed and processed multispectral files. Each datacube consists of 16 bands in the visible range, stacked to form the complete cube.
spectral range = 460 nm to 600 nm
bands = 16

band wavelengths (in nm) = { 460.480990, 465.215098, 474.119400, 482.578653, 492.599330, 504.367465, 512.469275, 521.987657, 533.930667, 541.203169, 551.996143, 559.719978, 570.133100, 579.729623, 585.387574, 595.018809 }
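
As a convenience, the band list above can be kept as an array and used to look up the band closest to a wavelength of interest. This is a minimal sketch: the names `BAND_WAVELENGTHS_NM` and `nearest_band` are hypothetical, and it assumes the bands in each datacube are stored in the same order as the list above.

```python
import numpy as np

# Band wavelengths in nm, copied from the README list above.
# Assumption: band index k in a datacube corresponds to entry k here.
BAND_WAVELENGTHS_NM = np.array([
    460.480990, 465.215098, 474.119400, 482.578653,
    492.599330, 504.367465, 512.469275, 521.987657,
    533.930667, 541.203169, 551.996143, 559.719978,
    570.133100, 579.729623, 585.387574, 595.018809,
])


def nearest_band(wavelength_nm):
    """Return the zero-based index of the band closest to wavelength_nm."""
    return int(np.argmin(np.abs(BAND_WAVELENGTHS_NM - wavelength_nm)))


# Gap between neighbouring bands, in nm (15 values for 16 bands).
band_spacing_nm = np.diff(BAND_WAVELENGTHS_NM)
```

The spacing between adjacent bands is uneven (about 4.7 nm to 11.9 nm), so plotting spectra against `BAND_WAVELENGTHS_NM` rather than against the band index avoids distorting the curve.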

Description of the Data

I. Unprocessed Data

The unprocessed data folder comprises:

Mosaiced raw data capturing the full sensor size at 2048 × 1088 resolution, along with its corresponding context file, given as (image.raw, acquisition_description.xml, and image.raw.xml).
The context file encompasses the following components:
1. Calibration File: This file is specific to the camera used and is produced during manufacturing to ensure accurate spectral data. It comes with the camera and is supplied as an input to the IMEC DataCapture software before data collection begins.
2. Black Reference File: This file (named dark_reference in the file structure) is used to compensate for dark noise during data capture.
3. White Reference File: This file is used to generate corrected data spectra and is crucial for the accuracy of the spectral data. It accounts for fixed parameters such as the light source and the distances between the light source, the camera, and the object under test. Because the setup was consistent throughout data collection, any of the provided white reference files can be used with the data. White reference captures were taken at regular intervals as part of quality assurance to ensure the reliability and consistency of the spectral measurements.
4. Optical Setup File: This file contains details of the optical configuration used during data capture, such as the focal length of the camera, which are needed to obtain accurate spectral data.
5. Context Description File: This file includes comprehensive information about the camera system, such as system ID, data type, format, and other details necessary for understanding and processing the captured data.
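
To illustrate how the black and white references are typically combined, the sketch below applies the common flat-field formula, reflectance = (raw - dark) / (white - dark). This is an assumption about the general technique, not a description of what the IMEC software does internally; the function name and the `eps` guard are hypothetical, and all three inputs are assumed to be arrays of the same shape.

```python
import numpy as np


def flat_field_correct(raw, dark, white, eps=1e-6):
    """Estimate relative reflectance from raw, dark, and white captures.

    Assumed formula: (raw - dark) / (white - dark), computed per pixel.
    eps guards against division by zero where white is not brighter
    than dark.
    """
    raw = np.asarray(raw, dtype=np.float64)
    dark = np.asarray(dark, dtype=np.float64)
    white = np.asarray(white, dtype=np.float64)
    denominator = np.maximum(white - dark, eps)
    reflectance = (raw - dark) / denominator
    # Negative values can only come from noise, so clamp them to zero.
    return np.clip(reflectance, 0.0, None)
```

With this formula a pixel equal to the dark reference maps to 0 and a pixel equal to the white reference maps to 1.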

II. Processed Data

In contrast to the unprocessed data, which includes raw data files along with their corresponding calibration files, the processed data undergoes a different treatment. The calibration files are applied to the raw data during export, resulting in a demosaiced spectral data cube featuring 16 bands spanning the range of 460 nm to 600 nm. This processed dataset exhibits a spectral resolution of 10-15 nm and a spatial resolution of 270x510 pixels.
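
The idea of demosaicing can be sketched as follows, assuming a repeating 4 × 4 filter pattern (suggested by "SSM4x4" in the calibration file name) and a raw frame already loaded as a 2-D array of 1088 rows by 2048 columns. The function name is hypothetical, the band order inside each tile is an assumption, and the spectral correction applied during the actual export is not modelled here.

```python
import numpy as np


def unpack_mosaic(frame, pattern=4):
    """Rearrange a mosaiced 2-D frame into a (rows, cols, bands) cube.

    Each pattern x pattern tile is assumed to hold one pixel per band,
    so band k = r * pattern + c is taken from frame[r::pattern, c::pattern].
    """
    frame = np.asarray(frame)
    rows, cols = frame.shape
    if rows % pattern or cols % pattern:
        raise ValueError("frame size must be a multiple of the pattern size")
    tiles = frame.reshape(rows // pattern, pattern, cols // pattern, pattern)
    # Axes (tile_row, in_row, tile_col, in_col) -> (tile_row, tile_col, in_row, in_col)
    tiles = tiles.transpose(0, 2, 1, 3)
    return tiles.reshape(rows // pattern, cols // pattern, pattern * pattern)
```

For a 1088 × 2048 frame this yields a 272 × 512 × 16 cube, slightly larger than the stated 270x510, so the export presumably crops or resamples the border.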

Multispectral images often come with .hdr (header) files that contain vital metadata and parameters. However, to access the actual pixel values or spectral data, a separate .raw file is required. The .hdr file serves as a descriptor, outlining the structure of the data, while the .raw file holds the binary pixel values. This pairing of a .hdr file with a .raw file is known as the ENVI format, and together the two files fully describe the multispectral image. Any ENVI-compatible software can be used to open the data. A Python script included with the dataset also provides an example of loading the data.
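
For readers who want to see how the two files fit together, here is a minimal NumPy-only sketch of an ENVI reader. It is not the script shipped with the dataset, and the function names are hypothetical. It assumes the header uses the standard ENVI fields (samples, lines, bands, data type, interleave, byte order, header offset) and supports only the common data type codes listed in the table. In practice, a library such as Spectral Python (`spectral.envi.open`) is the more usual route.

```python
import re
from pathlib import Path

import numpy as np

# Standard ENVI "data type" codes mapped to NumPy type strings (subset).
ENVI_DTYPES = {1: "u1", 2: "i2", 3: "i4", 4: "f4", 5: "f8", 12: "u2", 13: "u4"}


def read_envi_header(hdr_path):
    """Parse 'key = value' pairs; values in braces may span several lines."""
    text = Path(hdr_path).read_text(errors="replace")
    pairs = re.findall(
        r"^[ \t]*([^=\n]+?)[ \t]*=[ \t]*(\{.*?\}|[^\n]*)",
        text,
        flags=re.M | re.S,
    )
    return {key.strip().lower(): value.strip() for key, value in pairs}


def load_envi_cube(hdr_path, raw_path):
    """Return the image as an array shaped (lines, samples, bands)."""
    header = read_envi_header(hdr_path)
    lines = int(header["lines"])
    samples = int(header["samples"])
    bands = int(header["bands"])
    byte_order = "<" if int(header.get("byte order", 0)) == 0 else ">"
    dtype = np.dtype(byte_order + ENVI_DTYPES[int(header["data type"])])
    data = np.fromfile(
        raw_path,
        dtype=dtype,
        count=lines * samples * bands,
        offset=int(header.get("header offset", 0)),
    )
    interleave = header.get("interleave", "bsq").lower()
    if interleave == "bsq":  # band sequential: (bands, lines, samples)
        return data.reshape(bands, lines, samples).transpose(1, 2, 0)
    if interleave == "bil":  # band interleaved by line: (lines, bands, samples)
        return data.reshape(lines, bands, samples).transpose(0, 2, 1)
    if interleave == "bip":  # band interleaved by pixel: (lines, samples, bands)
        return data.reshape(lines, samples, bands)
    raise ValueError(f"unknown interleave: {interleave!r}")
```

Calling `load_envi_cube("image_number.hdr", "image_number.raw")` on a processed sample should then return a cube whose last axis has 16 entries, one per band.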

RGB | Mask

In addition to the processed .hdr and .raw files, the dataset also includes RGB images and a binary mask for the samples. These RGB images provide a visual representation of the data, combining the spectral bands to create a color image. The binary mask, on the other hand, identifies specific regions of interest within the sample. Together, these RGB images and binary masks complement the multispectral data, enabling visual interpretation and further analysis of the dataset.

Description of the file structure

Under the main folder MODID there are four folders named unprocessed, processed, rgb, and mask, along with a patient information Excel sheet named MODID_DESCRIPTOR.xlsx and a readme file.

|- Data/
|  |- Processed/
|  |  |- image_number.hdr
|  |  |- image_number.raw
|  |- Unprocessed/
|  |  |- image_number (image.raw , acquisition_description.xml and image.raw.xml)
|  |  |- context
|  |  |  |- calibration_file (CMV2K-SSM4x4-460_600-15.7.15.13.xml)
|  |  |  |- white_reference (white_reference.raw and white_reference.xml)
|  |  |  |- dark_reference (dark_reference.raw and dark_reference.xml)
|  |  |  |- optical_setup (optical_setup.xml)
|  |  |  |- context_description.xml
|- rgb/
|  |- image_number.png
|- mask/
|  |- image_number.png
|- code/
|  |- Band_image_generation.py
|  |- spectrum_generation.py

MODID_DESCRIPTOR.xlsx

The Excel spreadsheet MODID_DESCRIPTOR.xlsx contains two sheets, named Image ID and Patient data.

The Image ID sheet records the names of the image samples and which patient each image belongs to.

Each image sample is associated with a specific Patient ID in the Image ID column.
Additionally, the Image Number column indicates the number of samples collected for each patient.

The Patient data sheet provides patient information such as the Patient ID (which links to the Image ID sheet), Gender, Age range, and patient habits such as smoking. The sheet also includes each patient's diagnosis and displays RGB images of the samples collected from that patient.

Code

The code uses key libraries for spectral data generation, such as NumPy, Matplotlib, and Spectral.

There are two files:

Band_image_generation.py - This file contains the code to read a multispectral datacube from the MODID dataset and display the spectral image for each band.
spectrum_generation.py - This file generates the mean spectrum over the region of interest. The region of interest is taken from the mask PNG and overlaid on the processed data, so that spectral values are extracted only for the area of interest. These values are then averaged over the masked pixels, band by band, to produce a spectrum.
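
The masking and averaging step can be sketched as below. This is not the dataset's spectrum_generation.py; the function name is hypothetical, the cube is assumed to be shaped (rows, cols, bands), and any non-zero mask pixel is assumed to mark the region of interest.

```python
import numpy as np


def mean_spectrum(cube, mask):
    """Average each band of cube over the pixels selected by mask.

    cube: array shaped (rows, cols, bands).
    mask: array shaped (rows, cols); a multi-channel mask is reduced to
          its first channel. Non-zero pixels are treated as the region
          of interest.
    """
    cube = np.asarray(cube, dtype=np.float64)
    mask = np.asarray(mask)
    if mask.ndim == 3:
        mask = mask[..., 0]
    roi = mask > 0
    if roi.shape != cube.shape[:2]:
        raise ValueError("mask and cube must have the same rows and cols")
    if not roi.any():
        raise ValueError("mask selects no pixels")
    # Boolean indexing gives (n_pixels, bands); average over the pixels.
    return cube[roi].mean(axis=0)
```

A mask PNG can be read into an array with, for example, `matplotlib.image.imread`, and the resulting 16-value spectrum plotted against the band wavelengths.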

Software

HSI Mosaic software from IMEC (camera sensor manufacturer) was used for data acquisition.
Label Studio software was used for the annotation process and mask generation.