Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?
Abstract
Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation. However, there has been limited exploration of optimal training set sizes, including whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and shape measurements for the internal structure poses a greater challenge than for the external structure, owing to low contrast between the different materials and increased geometric complexity. These results provide novel insight into optimal training set sizes for precise image segmentation of diverse traits and highlight the potential of data augmentation for enhancing multivariate feature extraction from 3D images.
README: Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?
https://doi.org/10.5061/dryad.1rn8pk12f
All computer code and final raw data used for this research are stored on GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived in the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.
These files are the additional primary data used in each analysis: CT image files, manual segmentation files (used for training or analysis), inputs and outputs for the shape analysis, and an example .h5 file which can be used to practice AI segmentation.
Description of the data and file structure
The primary data is arranged into the following:
- Image_Files.zip: Foraminiferal CT data used in the analysis.
- Internal_and_External_Structures.zip: The manual segmentation data corresponding to the image files, which can be used to train the AI networks.
- External_Shape_Comparison.zip: Inputs and outputs for the external shape analysis. In the output folder, Atlas_Momentas.txt and Atlas_ControlPoints.txt can be used to re-create the PCA plots. The data.csv file gives the details of the specimens. The kpca.csv and eigenvalues.csv files are the final results produced by the Toussaint pipeline. The residual files are not used.
- Internal_Shape_Comparison.zip: Inputs and outputs for the internal shape analysis. In the output folder, Atlas_Momentas.txt and Atlas_ControlPoints.txt can be used to re-create the PCA plots. The data.csv file gives the details of the specimens. The kpca.csv and eigenvalues.csv files are the final results produced by the Toussaint pipeline. The residual files are not used.
- Processed_Mesh_Files.zip: Mesh .ply files used for the shape analysis.
- Aug_T5__AI_20.h5: The highest-accuracy network file, which can be used to practice AI segmentation.
Sharing/Access information
All data related to the paper "How many specimens make a sufficient training set for automated 3D feature extraction?" can be found here or on the GitHub page noted above.
Code/Software
All code and scripts are in Python or R and can be found on the corresponding GitHub page. For the shape data comparison, follow the details outlined at https://gitlab.com/ntoussaint/landmark-free-morphometry.
Methods
Data collection
50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13].
The samples containing these individuals and species spanned 5.65 million years ago (Ma) to 2.85 Ma [14] and were collected from the Ceara Rise in the Equatorial Atlantic region at Ocean Drilling Program (ODP) Site 925, which comprised Hole 925B (4°12.248'N, 43°29.349'W), Hole 925C (4°12.256'N, 43°29.349'W), and Hole 925D (4°12.260'N, 43°29.363'W). See Curry et al. [15] for more details. This group was chosen to provide inter- and intraspecific variation, and to provide contemporary data to test how morphological distinctiveness maps to taxonomic hypotheses [16].
The non-destructive imaging of both internal and external structures of the foraminifera was conducted at the µ-VIS X-ray Imaging Centre, University of Southampton, UK, using a Zeiss Xradia 510 Versa X-ray tomography scanner. Employing a rotational target system, the scanner operated at a voltage of 110 kV and a power of 10 W. Projections were reconstructed using Zeiss Xradia software, resulting in 16-bit greyscale .tiff stacks characterised by a voxel size of 1.75 μm and average dimensions of 992 x 1015 pixels for each 2D slice.
Generation of training sets
We extracted the external calcite and internal cavity spaces from the micro-CT scans of the 50 individuals using manual segmentation within Dragonfly v. 2021.3 (Object Research Systems, Canada). This step took approximately 480 minutes per specimen (24,000 minutes total) and involved the manual labelling of 11,947 2D images. Segmentation data for each specimen were exported as multi-label (3 labels: external, internal, and background) 8-bit multipage .tiff stacks and paired with the original CT image data to allow for training (see figure 2).
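For orientation, the sketch below shows how such an image/label pair can be loaded and checked in Python with the tifffile library. It is illustrative only: the filenames are hypothetical, and the archived scripts on GitHub are the authoritative version.

```python
# Minimal sketch: load one CT stack and its manual segmentation and
# confirm they align slice-for-slice. Filenames are hypothetical.
import numpy as np
import tifffile

ct_stack = tifffile.imread("st049_bl1_fo2_ct.tif")    # 16-bit greyscale stack
labels = tifffile.imread("st049_bl1_fo2_labels.tif")  # 8-bit multi-label stack

assert ct_stack.shape == labels.shape, "image and label stacks must match"
print(ct_stack.dtype, labels.dtype)  # expected: uint16 uint8
print(np.unique(labels))             # three labels: background, external, internal
```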
The 50 specimens were categorised into three distinct groups (electronic supplementary material, table S1): 20 training image stacks, 10 validation image stacks, and 20 test image stacks. From the training image category, we generated six distinct training sets, varying in size from 1 to 20 specimens (see table 1). These were used to assess the impact of training set size on segmentation accuracy, as determined through a comparative analysis against the validation set (see Section 2.3).
From the initial six training sets, we created six additional training sets through data augmentation using the NumPy library [17] in Python. This augmentation method was chosen for its simplicity and accessibility to researchers with limited computational expertise, as it can be easily implemented with a straightforward batch script. The augmentation entailed rotating the original images five times (the maximum number permitted using this method), effectively producing six distinct 3D orientations per specimen for each of the original training sets (see figure 3). The augmented training sets comprised between 6 and 120 .tiff stacks (see table 1).
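The sketch below illustrates one way to implement this with NumPy. The exact choice of the five rotations is an assumption, but the pattern (the original stack plus five 90-degree rotations, applied identically to images and labels) matches the description above.

```python
# Illustrative NumPy rotation augmentation: six orientations per specimen.
# The specific axes/angles are an assumption for illustration.
import numpy as np

def augment_by_rotation(volume):
    """Return the original 3D stack plus five 90-degree rotations."""
    return [
        volume,                              # original orientation
        np.rot90(volume, k=1, axes=(1, 2)),  # 90 degrees in the slice plane
        np.rot90(volume, k=2, axes=(1, 2)),  # 180 degrees in the slice plane
        np.rot90(volume, k=3, axes=(1, 2)),  # 270 degrees in the slice plane
        np.rot90(volume, k=1, axes=(0, 1)),  # 90 degrees about a second axis
        np.rot90(volume, k=1, axes=(0, 2)),  # 90 degrees about the third axis
    ]

# The same function must be applied to the CT stack and its label stack
# so that image/label pairs remain aligned after augmentation.
```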
Training the neural networks
CNNs were trained using the offline version of Biomedisa, which utilises a 3D U-Net architecture [18], the primary model employed for image segmentation [19], and is optimised using Keras with a TensorFlow backend. We used patches of size 64 x 64 x 64 voxels, which were then scaled to a size of 256 x 256 x 256 voxels. This scaling was performed to improve the network's ability to capture spatial features and mitigate potential information loss during training. We trained three networks for each of the training sets to check the extent of stochastic variation in the results [20].
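As a rough sketch of the patch-based setup (not Biomedisa's internal code), a 256 x 256 x 256 volume tiled into overlapping 64 x 64 x 64 patches with the 32-voxel stride described in the next paragraph yields 343 patches:

```python
# Illustrative tiling of a scaled volume into overlapping cubic patches.
import numpy as np

def extract_patches(volume, patch=64, stride=32):
    """Yield overlapping patch x patch x patch blocks from a 3D array."""
    z, y, x = volume.shape
    for k in range(0, z - patch + 1, stride):
        for j in range(0, y - patch + 1, stride):
            for i in range(0, x - patch + 1, stride):
                yield volume[k:k + patch, j:j + patch, i:i + patch]

volume = np.zeros((256, 256, 256), dtype=np.float32)
print(sum(1 for _ in extract_patches(volume)))  # 7 * 7 * 7 = 343 patches
```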
To train our models in Biomedisa, we used stochastic gradient descent with a learning rate of 0.01, a decay of 1 × 10⁻⁶, momentum of 0.9, and Nesterov momentum enabled. A stride size of 32 pixels and a batch size of 24 samples per epoch were used alongside an automated cropping feature, which has been demonstrated to enhance accuracy [21]. Each network was trained on a Tesla V100S-PCIE-32GB graphics card with 30,989 MB of available memory. All analyses and training procedures were conducted on the High-Performance Computing (HPC) system at the Natural History Museum, London.
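In Keras terms, the stated optimiser settings correspond to the configuration below. Biomedisa sets this up internally, so this mirrors the hyperparameters rather than reproducing the pipeline's own code; the `decay` argument follows the older Keras optimiser API.

```python
# Stated hyperparameters expressed with the (older) Keras optimiser API.
from tensorflow import keras

sgd = keras.optimizers.SGD(
    learning_rate=0.01,  # learning rate
    decay=1e-6,          # per-update learning-rate decay (legacy API argument)
    momentum=0.9,        # classical momentum
    nesterov=True,       # Nesterov momentum enabled
)
```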
To measure network accuracy, we used the Dice similarity coefficient (Dice score), a metric commonly used in biomedical image segmentation studies [22, 23]. The Dice score quantifies the overlap between two segmentations (twice the size of their intersection divided by the sum of their sizes), providing a value between 0 (no overlap) and 1 (perfect match).
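A minimal implementation for two integer label volumes might look as follows; the label coding is an assumption, and this is not Biomedisa's own metric code.

```python
# Dice similarity for one label: 2|A ∩ B| / (|A| + |B|).
import numpy as np

def dice_score(pred, truth, label):
    pred_mask = (pred == label)
    truth_mask = (truth == label)
    denom = pred_mask.sum() + truth_mask.sum()
    if denom == 0:
        return 1.0  # label absent from both volumes: perfect agreement
    return 2.0 * np.logical_and(pred_mask, truth_mask).sum() / denom
```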
We conducted experiments to evaluate the potential efficiency gains of an early stopping mechanism within Biomedisa. After testing a variety of epoch limits, we opted for an early stopping criterion of 25 epochs, the lowest value at which all models trained correctly for every training set; that is, if the Dice score does not increase within a 25-epoch window, the optimal network is selected and training is terminated. To gauge the impact of early stopping on network accuracy, we compared the results obtained from the original six training sets under early stopping to those obtained from a full run of 200 epochs.
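The behaviour described corresponds to a standard Keras early-stopping callback, sketched below; Biomedisa exposes this through its own options, and the monitored metric name here is assumed.

```python
# Keras-style early stopping equivalent to the 25-epoch criterion above.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_dice",         # hypothetical validation Dice metric name
    mode="max",                 # stop when the Dice score stops increasing
    patience=25,                # the 25-epoch window described above
    restore_best_weights=True,  # keep the optimal network
)
# model.fit(..., epochs=200, callbacks=[early_stop])
```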
Evaluation of feature extraction
We used the median accuracy network from each of the 12 training sets to produce segmentation data for the external and internal structures of the 20 test specimens. The median accuracy was selected as it provides a more robust estimate of performance by ensuring that outliers had less impact on the overall result. We then compared the volumetric and shape measurements from the manual data to those from each training set. The volumetric measurements were total volume (comprising both external and internal volumes) and percentage calcite (calculated as the ratio of external volume to internal volume, multiplied by 100).
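Both measurements follow directly from voxel counts in a predicted label volume, as in the sketch below; the label coding of 1 = external, 2 = internal is an assumption.

```python
# Volumetric measurements from a label volume, per the definitions above.
import numpy as np

VOXEL_VOLUME = 1.75 ** 3  # cubic micrometres, from the 1.75 um voxel size

def volumetric_measurements(labels):
    external = np.count_nonzero(labels == 1) * VOXEL_VOLUME
    internal = np.count_nonzero(labels == 2) * VOXEL_VOLUME
    total_volume = external + internal
    percent_calcite = external / internal * 100.0  # as defined in the text
    return total_volume, percent_calcite
```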
To compare shape, mesh data for the external and internal structures were generated from the segmentation data of the 12 training sets and the manual data. Meshes were decimated to 50,000 faces and smoothed before being scaled and aligned using Python and Generalized Procrustes Surface Analysis (GPSA) [24], respectively. Shape was then analysed using the landmark-free morphometry pipeline outlined by Toussaint et al. [25]. We used a kernel width of 0.1 mm and a noise parameter of 1.0 for the shape analyses of both the external and internal data, with a Keops kernel (PyKeops; https://pypi.org/project/pykeops/), which performs better with large data [25]. The analyses were run for 150 iterations with an initial step size of 0.01. The manually generated mesh for the individual st049_bl1_fo2 was used as the atlas for both the external and internal shape comparisons.
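For the decimation and smoothing step, a hedged sketch using Open3D is shown below; the archived scripts on GitHub are the authoritative implementation, and the smoothing method and iteration count here are assumptions.

```python
# Illustrative mesh preparation: decimate to 50,000 faces, then smooth.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("st049_bl1_fo2.ply")  # hypothetical path
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=50000)
mesh = mesh.filter_smooth_taubin(number_of_iterations=10)  # iterations assumed
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("st049_bl1_fo2_prepared.ply", mesh)
```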