Data from: Deep Learning Unlocks X-ray Microtomography Segmentation of Multiclass Microdamage in Heterogeneous Materials

Authors:
Kopp, Reed, Massachusetts Institute of Technology* (https://orcid.org/0000-0002-2260-2374)
Joseph, Joshua, Massachusetts Institute of Technology
Ni, Xinchen, Massachusetts Institute of Technology
Roy, Nicholas, Massachusetts Institute of Technology
Wardle, Brian L., Massachusetts Institute of Technology

Contact: rkopp@alum.mit.edu, wardle@mit.edu

Accepted paper online date (corresponds to data): November 20, 2021 (DOI: 10.1002/adma.202107817)

Citation: Kopp, Reed et al. (2021), Data from: Deep Learning Unlocks X-ray Microtomography Segmentation of Multiclass Microdamage in Heterogeneous Materials, Dryad, Dataset, https://doi.org/10.5061/dryad.ffbg79cwb

Keywords: Materials engineering, machine learning, 3D multiclass damage, composite materials, computed tomography, deep learning, heterogeneous materials, machine vision, material characterization, synchrotron radiation computed tomography

This README describes the background and usage of a published database of thirty high-resolution synchrotron radiation computed tomography (SRCT) scans of advanced composites, wherein each scan is represented by a zipped file containing both raw tomographic images (TIFF) and corresponding trained-human annotations of polymer damage (TIFF). The in situ scans are different load steps associated with six advanced composite specimens: four Thick-laminae specimens and two Thin-laminae specimens. These scans were used to develop deep learning datasets for semantic segmentation via the high-level instructions outlined below. Additional background and usage details can be found in the corresponding Advanced Materials paper (DOI: 10.1002/adma.202107817).

Each of the 30 scans is represented by a zipped file with the following directory structure:

[Scan ID] (zipped file)
--->"Images" (directory containing a stack of 8-bit grayscale TIFF images, comprising a raw 3D tomographic image)
--->"Annotations" (directory containing a stack of 8-bit grayscale TIFF images, comprising the trained-human-generated damage labels (ground truth) corresponding to the raw 3D tomographic image of the same scan ID)

The scan IDs/names follow a basic convention that is partly described in Figure 2 of the corresponding Advanced Materials paper:

[Composite Type]_[Specimen ID]_[Load Step ID]_[NUMBER1]_[NUMBER2]_[NUMBER3]

where NUMBER1 is the number of pixels along the volumetric image's x-direction (aligned with the 0-degree fibers and the loading direction), NUMBER2 is the number of pixels along the y-direction (aligned with the 90-degree fibers), and NUMBER3 is the number of pixels along the z-direction (through the stack). These three numbers define the size of the volumetric image, which is identical for the "Images" and "Annotations" directories of each scan.

The gray value (GV) key for the annotated TIFF stacks inside each "Annotations" directory is as follows:

--> 0 (black): background
--> 100 (dark gray): 0-degree laminae damage ("Class 1" in Advanced Materials paper)
--> 175 (gray): ±45-degree laminae damage ("Class 2" in Advanced Materials paper)
--> 250 (white): 90-degree laminae damage ("Class 3" in Advanced Materials paper)

An easily accessible, publicly available image viewer/analysis software is Fiji (ImageJ). After unzipping a scan, drag its "Images" or "Annotations" folder into ImageJ, then select Image > Stacks > Orthogonal Views to display all three orthogonal planes.
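The stacks can also be inspected programmatically. Below is a minimal sketch, not taken from the paper's pipeline, assuming the tifffile and numpy Python packages and a hypothetical local path to one unzipped scan; it tallies voxels per annotation gray value using the key above.

import glob
import os

import numpy as np
import tifffile

# Hypothetical local path to one unzipped scan from this database.
scan_dir = "ThickLaminae_Specimen-1_Load-a_2560_1800_2160"

# GV key for the "Annotations" stacks, as listed above.
gv_key = {0: "background",
          100: "0-degree laminae damage (Class 1)",
          175: "+/-45-degree laminae damage (Class 2)",
          250: "90-degree laminae damage (Class 3)"}

# Tally voxels per gray value slice by slice; full volumes are several
# gigabytes, so avoid holding an entire stack in memory at once.
counts = {}
slice_paths = sorted(glob.glob(os.path.join(scan_dir, "Annotations", "*.tif*")))
for path in slice_paths:
    values, n = np.unique(tifffile.imread(path), return_counts=True)
    for gv, c in zip(values, n):
        counts[int(gv)] = counts.get(int(gv), 0) + int(c)

for gv in sorted(counts):
    print(f"GV {gv:3d} ({gv_key.get(gv, 'unexpected value')}): {counts[gv]} voxels")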
After the data has been downloaded and each zipped file has been unzipped, this collective "database" can be used to generate datasets for deep learning of semantic segmentation of polymer damage in composite laminates, as imaged by SRCT. The semantic segmentation code repository developed by the authors of this dataset and of the corresponding Advanced Materials paper is available open-source at https://github.com/mit-quest/necstlab-damage-segmentation. This GitHub repository provides code for a deep learning pipeline, ranging from raw data ingestion to trained machine inference. Note that by downloading and unzipping this database, step 1 (raw data ingestion) of the GitHub repository pipeline has already been completed and can thus be skipped. Step 2 then involves creating semantic segmentation datasets from this database using configuration files (an example is provided later), which are then used in machine training, validation, and testing, as well as prediction threshold training. Further explanation is reported in the Advanced Materials paper. A full set of instructions for using the code repository, which can be run with local or virtual computing resources, is provided with the repository itself. Some highlights to note:

Sample dataset configuration files are located in the code repository. For example, the dataset configuration file "dataset-esrf16_segV1_study1_CropCase5_TrnValCaseB.yaml" corresponds to the "5-INL" dataset in the Advanced Materials paper (see also the paper's Supporting Information), and similarly, the dataset configuration file in the code repository marked as "5A" corresponds to the "5-IND" dataset in the paper. Note that the scan IDs in the code repository differ from the names used here, which follow the simplified convention of Figure 2 in the paper, though the underlying image data is the same.

Additionally, note that the dataset configuration file includes an option to either include or exclude the background as an independent class to be learned (i.e., true positive classification of background is enabled when it is included as a separate class to be learned). The configuration files used to generate data for the Advanced Materials paper exclude explicit learning of the background class (see SI), so class indices in the paper are incremented by 1 relative to the configuration files: Class 0 in a configuration file corresponds to Class 1 in the paper, Class 1 to Class 2, and so on. The reason is that the paper treats background implicitly as Class 0, in line with other literature, so as not to confuse the reader. For coding purposes, however, the dataset "masks" (generated as binary output from step 2 of the code pipeline) assign a value of 1 only to damage of a given class (one-hot), with background always having a value of 0; background is therefore not a separate, explicitly learned class, as illustrated in the sketch below. In the paper, the 3-class model corresponds to the three damage classes being explicitly learned.
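To make the one-hot mask convention concrete, below is a minimal sketch assuming numpy. The function name annotation_to_masks is hypothetical (it is not part of the code repository); the GV-to-class mapping mirrors the class_annotation_mapping block of the sample configuration file shown next.

import numpy as np

# Configuration-file class indices; the paper's class indices are these
# values plus one, since the paper reserves Class 0 for background.
class_annotation_gvs = {0: [100],   # '0-degree_damage'  -> paper Class 1
                        1: [175],   # '45-degree_damage' -> paper Class 2
                        2: [250]}   # '90-degree_damage' -> paper Class 3

def annotation_to_masks(annotation_slice):
    """Convert one 8-bit annotation slice into per-class binary (one-hot) masks."""
    masks = np.zeros((len(class_annotation_gvs),) + annotation_slice.shape,
                     dtype=np.uint8)
    for class_idx, gvs in class_annotation_gvs.items():
        masks[class_idx] = np.isin(annotation_slice, gvs)
    # Background remains 0 in every channel, i.e., it is never learned
    # as its own class under this convention.
    return masks

# Example on a tiny synthetic slice: GV 175 maps only to channel 1.
demo = np.array([[0, 100], [175, 250]], dtype=np.uint8)
print(annotation_to_masks(demo))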
Finally, to aid in using this database with step 2 of our GitHub code repository (i.e., generating datasets by downsampling this database and sampling smaller sub-crops from the relatively large raw tomograms), below are the contents of a sample dataset configuration file corresponding to "5-INL" in the associated Advanced Materials paper.

dataset-5INL.yaml:

dataset_config:
  dataset_split:
    train: [
      "ThickLaminae_Specimen-1_Load-b_2560_1800_2160",
      "ThickLaminae_Specimen-1_Load-d_2560_1800_2160",
      "ThickLaminae_Specimen-1_Load-f_2540_1720_2160",
      "ThickLaminae_Specimen-2_Load-a_2560_1750_2160",
      "ThickLaminae_Specimen-2_Load-c_2560_1750_2160",
      "ThickLaminae_Specimen-2_Load-e_2560_1750_2160",
      "ThickLaminae_Specimen-3_Load-a_2560_1800_2160",
      "ThickLaminae_Specimen-3_Load-c_2560_1800_2160",
      "ThinLaminae_Specimen-1_Load-b_2434_1547_2159",
      "ThinLaminae_Specimen-1_Load-d_2496_1563_2159"
    ]
    validation: [
      "ThickLaminae_Specimen-1_Load-a_2560_1800_2160",
      "ThickLaminae_Specimen-1_Load-c_2560_1800_2160",
      "ThickLaminae_Specimen-1_Load-e_2560_1800_2160",
      "ThickLaminae_Specimen-1_Load-g_2560_1800_2160",
      "ThickLaminae_Specimen-2_Load-b_2560_1750_2160",
      "ThickLaminae_Specimen-2_Load-d_2560_1750_2160",
      "ThickLaminae_Specimen-3_Load-b_2560_1800_2160",
      "ThickLaminae_Specimen-3_Load-d_2560_1800_2160",
      "ThinLaminae_Specimen-1_Load-a_2508_1551_2159",
      "ThinLaminae_Specimen-1_Load-c_2484_1524_2159",
      "ThinLaminae_Specimen-1_Load-e_2433_1533_2159"
    ]
    test: [
      "ThickLaminae_Specimen-4_Load-a_2560_1750_2160",
      "ThickLaminae_Specimen-4_Load-b_2560_1750_2160",
      "ThickLaminae_Specimen-4_Load-c_2560_1750_2160",
      "ThickLaminae_Specimen-4_Load-d_2560_1750_2160",
      "ThinLaminae_Specimen-2_Load-a_2349_1578_2159",
      "ThinLaminae_Specimen-2_Load-b_2293_1581_2159",
      "ThinLaminae_Specimen-2_Load-c_2292_1566_2159",
      "ThinLaminae_Specimen-2_Load-d_2334_1578_2159",
      "ThinLaminae_Specimen-2_Load-e_2316_1542_2159"
    ]
  stack_downsampling:
    type: 'linear'
    number_of_images: 500
    num_skip_beg_slices: 50
    num_skip_end_slices: 50
  target_size: [512, 512]  # width, height
  image_cropping:
    type: 'class'
    num_pos_per_class: 1  # if 'class' selected, number of random class-positive crops (of target size) per image, >0
    num_neg_per_class: 1  # if 'class' selected, number of random class-negative crops (of target size) per image, >=0
    min_num_class_pos_px:  # if 'class' selected, minimum number of class-positive pixels required in a given class-positive crop, >0
      class_0_pos_px: 5  # '0-degree_damage'
      class_1_pos_px: 5  # '45-degree_damage'
      class_2_pos_px: 5  # '90-degree_damage'
  class_annotation_mapping:
    class_0_annotation_GVs: [100]  # '0-degree_damage'
    class_1_annotation_GVs: [175]  # '45-degree_damage'
    class_2_annotation_GVs: [250]  # '90-degree_damage'
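As a final usage note, the sketch below, assuming the PyYAML package and hypothetical local paths, parses a configuration file like the one above and checks that every scan ID in its splits exists as an unzipped directory before the pipeline's step 2 is run. It is an illustrative pre-flight check, not part of the code repository.

import os
import yaml  # assumes the PyYAML package

# Hypothetical local paths: the configuration above saved to disk, and the
# directory containing the unzipped scan folders of this database.
config_path = "dataset-5INL.yaml"
database_root = "."

with open(config_path) as f:
    config = yaml.safe_load(f)["dataset_config"]

# Verify that every scan ID named in the train/validation/test splits is
# present as an unzipped directory.
for split, scan_ids in config["dataset_split"].items():
    missing = [s for s in scan_ids
               if not os.path.isdir(os.path.join(database_root, s))]
    print(f"{split}: {len(scan_ids)} scans listed, {len(missing)} missing")
    for scan_id in missing:
        print(f"  not found: {scan_id}")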