
Sashimi: A toolkit for facilitating high-throughput organismal image segmentation using deep learning

Cite this dataset

Schwartz, Shawn; Alfaro, Michael (2021). Sashimi: A toolkit for facilitating high-throughput organismal image segmentation using deep learning [Dataset]. Dryad.


1. Digitized specimens are an indispensable resource for rapidly acquiring big datasets and typically must be preprocessed prior to conducting analyses. One crucial image preprocessing step in any image analysis workflow is image segmentation, or the ability to clearly contrast the foreground target from the background noise in an image. This procedure is typically done manually, creating a potential bottleneck for efforts to quantify biodiversity from image databases. Image segmentation meta-algorithms using deep learning provide an opportunity to relax this bottleneck. However, the most accessible pre-trained convolutional neural networks (CNNs) have been trained on a small fraction of biodiversity, thus limiting their utility.

2. We trained a deep learning model to automatically segment target fish from images with both standardized and complex, noisy backgrounds. We then assessed the performance of our deep learning model using qualitative visual inspection and quantitative image segmentation metrics of pixel overlap between reference segmentation masks generated manually by experts and those automatically predicted by our model.

3. Visual inspection revealed that our model segmented fishes with high precision and relatively few artifacts. These results suggest that the meta-algorithm (Mask R-CNN), on which our current fish segmentation model relies, is well-suited for generating high-fidelity segmented specimen images across a variety of background contexts at a rapid pace.

4. We present Sashimi, a user-friendly command line toolkit to facilitate rapid, automated high-throughput image segmentation of digitized organisms. Sashimi is accessible to non-programmers and does not require experience with deep learning to use. The flexibility of Mask R-CNN allows users to generate a segmentation model for use on diverse animal and plant images using transfer learning with training datasets as small as a few hundred images. To help grow the taxonomic scope of images that can be recognized, Sashimi also includes a central database for sharing and distributing custom-trained segmentation models of other unrepresented organisms. Lastly, Sashimi includes auxiliary image preprocessing functions useful for some popular downstream color pattern analysis workflows, as well as a simple script to aid users in qualitatively and quantitatively assessing segmentation model performance for complementary sets of automatically and manually segmented images.


Materials and Methods section directly quoted from the published manuscript (Schwartz & Alfaro, in press, Methods in Ecology and Evolution):

2. Materials and Methods

The Sashimi toolkit is freely available via GitHub.

2.1. Mask R-CNN Architecture

Our software implements the Mask R-CNN architecture (Abdulla 2017; He et al. 2017), an extension of the Faster R-CNN (Ren et al. 2017) algorithm for generating regions of interest. Mask R-CNN not only detects a target object in an image, but also rapidly detects the pixel-level target region of interest, outputting a high-resolution segmentation contour reflecting the specific boundaries of the location of the target object within the image.

2.2. Model Training Dataset Acquisition

Our dataset comprises 910 images, sampled across seven phenotypically disparate reef fish families, randomly divided into training and validation sets (n_train = 720, n_validation = 190; approximately 80% train, 20% validation). We acquired standardized digitized specimens from J.E. Randall’s fish images (N = 747; n_train = 598, n_validation = 149) distributed through the Bishop Museum and more naturalistic images with noisy backgrounds (N = 163; n_train = 122, n_validation = 41) from iNaturalist. Examples of the types of images included in model training are presented in Fig. 1.
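The random division described above can be sketched as follows. This is a minimal illustration, not the authors' script: the image IDs are placeholders, the seed is arbitrary, and it assumes the two sources were pooled before splitting (the text does not say whether the split was pooled or done per source).

```python
import random


def split_dataset(image_ids, n_validation, seed=0):
    """Shuffle image IDs and hold out n_validation of them for validation."""
    ids = list(image_ids)
    if not 0 <= n_validation <= len(ids):
        raise ValueError("n_validation must be between 0 and the dataset size")
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    return ids[n_validation:], ids[:n_validation]


# Placeholder IDs (hypothetical names); real IDs would be image file names.
randall = [f"randall_{i:04d}" for i in range(747)]
inaturalist = [f"inat_{i:04d}" for i in range(163)]

train_ids, validation_ids = split_dataset(randall + inaturalist, n_validation=190)
validation_fraction = len(validation_ids) / (len(train_ids) + len(validation_ids))
```

With 910 pooled images and 190 held out, the validation fraction is about 0.21, consistent with the approximate 80/20 division reported above.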

2.3. Model Training Procedure

We first used the VGG Image Annotator Version 1.0.6 (Dutta, Gupta & Zisserman 2016) to manually annotate pixel coordinates to create precise polygonal mask contours directly around the fish body boundary (i.e., where the foreground pixels of the target fish body meet those of the background). We intentionally assigned all segmentation masks for each image a class label name corresponding to the general biological name of the organism (e.g., “fish”). Given that our intention is to build broad, organism-specific models one-by-one, we suggest building organism-specific training sets where all segmentation contours across images are labelled with the same name (e.g., “whale”). We then used these coordinates to train a model using transfer learning (Razavian et al. 2014) with the COCO pre-trained weights (Lin et al. 2014), a ResNet-101 (a CNN with 101 layers; He et al. 2016) and a Feature Pyramid Network (a generic feature extractor for detecting objects across scales; Lin et al. 2017) backbone. Despite COCO not containing any images or segmentation mask annotations of marine organisms, we opted to use the pre-trained COCO model weights to help make our custom fish segmentation model generalizable for broader recognition and segmentation of a phenotypically diverse gamut of fish images – similar to the Gray et al. (2019) implementation of Mask R-CNN for automating cetacean species identification and length estimation.
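To illustrate how annotated polygon coordinates become a pixel-level mask, the sketch below rasterizes one polygon with an even-odd ray-casting test. It is an assumption-laden example rather than the toolkit's code: the `region` dictionary imitates the polygon layout of a VGG Image Annotator export (`all_points_x` / `all_points_y`), the `"name": "fish"` label key is hypothetical, and a real pipeline would normally use an image library instead of pure Python loops.

```python
def polygon_to_mask(xs, ys, height, width):
    """Rasterize a polygon into a height x width binary mask (lists of 0/1).

    A pixel is foreground when its center lies inside the polygon,
    decided with the even-odd ray-casting rule.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("a polygon needs at least three (x, y) vertices")
    n = len(xs)
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        py = row + 0.5  # pixel-center y coordinate
        for col in range(width):
            px = col + 0.5  # pixel-center x coordinate
            inside = False
            j = n - 1
            for i in range(n):
                # Does the edge from vertex j to vertex i cross the line y = py?
                if (ys[i] > py) != (ys[j] > py):
                    # x coordinate at which the edge crosses that line
                    x_cross = xs[i] + (py - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i])
                    if px < x_cross:
                        inside = not inside
                j = i
            mask[row][col] = 1 if inside else 0
    return mask


# Hypothetical annotation record for one rectangular "fish" contour.
region = {
    "shape_attributes": {
        "name": "polygon",
        "all_points_x": [1, 5, 5, 1],
        "all_points_y": [1, 1, 4, 4],
    },
    "region_attributes": {"name": "fish"},  # assumed label key
}
shape = region["shape_attributes"]
mask = polygon_to_mask(shape["all_points_x"], shape["all_points_y"], height=6, width=7)
foreground_pixels = sum(map(sum, mask))
```

Because every contour carries the same class label, a single foreground class is enough; the mask only needs to distinguish fish pixels from background pixels.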

We based training on Matterport’s open-source implementation of Mask R-CNN (Abdulla 2017) using a desktop computer equipped with a GeForce RTX 2080 GPU. We trained our model for 160 epochs over three stages. Stage one (epochs 1-40) trained the network heads, stage two (epochs 41-120) fine-tuned ResNet-101 layers stage-four and up, and stage three (epochs 121-160) fine-tuned all layers. Training stages one and two used a learning rate of .001, while stage three used a learning rate of .0001. All training stages had a weight decay of .0001, learning momentum of .9, and used image augmentation by flipping 50% of the images in the left-right orientation to increase the robustness of the neural network. Model training took approximately eight hours to complete.
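The three-stage schedule above can be written down as a small configuration table. This is a sketch for illustration only: the `Stage` class and `stage_for_epoch` helper are hypothetical names, and the `layers` strings (`"heads"`, `"4+"`, `"all"`) are assumed to correspond to the layer selectors accepted by the Matterport implementation's training call; check that library's documentation before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    description: str
    layers: str          # assumed layer selector for the training call
    last_epoch: int      # cumulative epoch at which the stage ends
    learning_rate: float


# Values taken from the training description above.
SCHEDULE = (
    Stage("network heads", "heads", 40, 1e-3),
    Stage("ResNet-101 stage four and up", "4+", 120, 1e-3),
    Stage("all layers", "all", 160, 1e-4),
)
WEIGHT_DECAY = 1e-4
LEARNING_MOMENTUM = 0.9
FLIP_LEFT_RIGHT_PROBABILITY = 0.5  # fraction of images flipped for augmentation


def stage_for_epoch(epoch):
    """Return the stage that a 1-based epoch number belongs to."""
    if epoch < 1:
        raise ValueError("epochs are numbered from 1")
    for stage in SCHEDULE:
        if epoch <= stage.last_epoch:
            return stage
    raise ValueError(f"epoch {epoch} is beyond the {SCHEDULE[-1].last_epoch}-epoch schedule")


# Number of epochs spent in each stage: 40, 80, and 40.
epochs_per_stage = [
    stage.last_epoch - (SCHEDULE[i - 1].last_epoch if i else 0)
    for i, stage in enumerate(SCHEDULE)
]
```

Storing the cumulative end epoch for each stage keeps the stage boundaries (40, 120, 160) in one place, so a stage can be resumed or extended by changing a single number.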

2.4. Automated Segmentation Pipeline

The Sashimi command line interface allows users to automatically extract and segment target images in common image formats. Sashimi supports the extraction of multiple targets from a single image; however, the analysis pipeline described here focused on images of single specimens in lateral view, a common use case for color pattern analysis. Within Sashimi, users can specify the path to their image folder for batch processing, save images with a transparent background, assess segmentation accuracy, and train new organism-specific segmentation models. The full instructions and options are provided on the GitHub repository.
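Saving a segmented image with a transparent background amounts to using the predicted mask as an alpha channel. The sketch below shows that idea on a tiny in-memory image; it is not Sashimi's implementation, and the function name `cut_out` and the nested-list image representation are illustrative assumptions.

```python
def cut_out(rgb_rows, mask_rows):
    """Combine an RGB image and a binary mask into RGBA pixels.

    Foreground pixels (mask value 1) keep their color and become fully
    opaque; background pixels become fully transparent.
    """
    if len(rgb_rows) != len(mask_rows):
        raise ValueError("image and mask must have the same height")
    rgba_rows = []
    for rgb_row, mask_row in zip(rgb_rows, mask_rows):
        if len(rgb_row) != len(mask_row):
            raise ValueError("image and mask must have the same width")
        rgba_rows.append([
            (r, g, b, 255) if keep else (0, 0, 0, 0)
            for (r, g, b), keep in zip(rgb_row, mask_row)
        ])
    return rgba_rows


# A 2 x 3 toy image: orange "fish" pixels on a blue background.
image = [
    [(0, 0, 200), (255, 140, 0), (0, 0, 200)],
    [(255, 140, 0), (255, 140, 0), (0, 0, 200)],
]
fish_mask = [
    [0, 1, 0],
    [1, 1, 0],
]
segmented = cut_out(image, fish_mask)
```

The resulting RGBA rows could then be written to a format that supports transparency, such as PNG, using any image library.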

2.5. Sashimi Online Model Repository

We constructed a website to serve as a repository for the fish segmentation model (presented here) and future, community-generated organismal segmentation models. We aim to inspire other biologists interested in automated segmentation to create pre-trained models for their organism(s) of interest and share them through the Sashimi online database for the rest of the community to use and build upon. All models will be open-source and available to download, and users can submit requests to share new models, which will be evaluated before becoming publicly available.

2.6. Evaluating Fish Segmentation Model Efficacy

2.6.1. Qualitative Image Segmentation Evaluation

We qualitatively assessed the performance of the current fish segmentation model by visually inspecting segmented outputs and reporting the visible strong and weak characteristics of these outputs.

2.6.2. Quantitative Image Segmentation Evaluation Metrics

We evaluated the performance of our fish image segmentation model using four common metrics for assessing semantic segmentation accuracy: pixel accuracy, Eq. 1; mean accuracy, Eq. 2; mean intersection over union (IoU), Eq. 3; and frequency weighted IoU, Eq. 4 (Long, Shelhamer & Darrell 2015). The IoU approach is commonly used for instance segmentation tasks, with values greater than 50% generally indicative of good detection (He et al. 2017; Gray et al. 2019). Here, we let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, $t_i = \sum_j n_{ij}$ be the total number of pixels of class $i$, and $n_{cl}$ be the number of different classes. We computed each metric using the reference (‘ground truth’) segmentation contours (images we manually annotated with high precision) and the predicted segmentation masks from our custom-trained model (Fig. 2) on our randomly selected validation image dataset, which included 41 images of fish in naturalistic and noisy backgrounds from iNaturalist and 149 standardized fish images from J.E. Randall’s collection. We also report segmentation metric results for a test dataset of 60 novel images in the online Supporting Information. Additionally, we compared the results of a color pattern analysis workflow from an earlier study (Alfaro et al. 2019) using manual and automatically segmented images (see Supporting Information for Background, Method, and Results).
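The four metrics can be computed from a confusion matrix of pixel counts. The sketch below follows the definitions in Long, Shelhamer & Darrell (2015), writing n[i][j] for the number of pixels of class i predicted as class j and t[i] for the total number of pixels of class i. It is an illustrative implementation, not the evaluation script shipped with the toolkit; the function names are hypothetical, and as a stated simplification, classes absent from the reference mask are skipped when averaging.

```python
def confusion_matrix(reference, predicted, n_classes):
    """Count pixels: n[i][j] = pixels of reference class i predicted as class j."""
    n = [[0] * n_classes for _ in range(n_classes)]
    for ref_row, pred_row in zip(reference, predicted):
        for i, j in zip(ref_row, pred_row):
            n[i][j] += 1
    return n


def segmentation_metrics(n):
    """Pixel accuracy, mean accuracy, mean IoU, and frequency weighted IoU."""
    n_cl = len(n)
    t = [sum(row) for row in n]  # t[i]: reference pixels of class i
    predicted_as = [sum(n[j][i] for j in range(n_cl)) for i in range(n_cl)]
    total = sum(t)
    if total == 0:
        raise ValueError("confusion matrix contains no pixels")
    present = [i for i in range(n_cl) if t[i] > 0]  # skip absent classes
    # IoU_i = n_ii / (t_i + sum_j n_ji - n_ii)
    iou = {i: n[i][i] / (t[i] + predicted_as[i] - n[i][i]) for i in present}
    return {
        "pixel_accuracy": sum(n[i][i] for i in range(n_cl)) / total,
        "mean_accuracy": sum(n[i][i] / t[i] for i in present) / len(present),
        "mean_iou": sum(iou.values()) / len(present),
        "frequency_weighted_iou": sum(t[i] * iou[i] for i in present) / total,
    }


# Toy masks with two classes: 0 = background, 1 = fish.
reference_mask = [[0, 0, 1, 1],
                  [0, 1, 1, 1]]
predicted_mask = [[0, 0, 1, 1],
                  [0, 0, 1, 1]]
metrics = segmentation_metrics(confusion_matrix(reference_mask, predicted_mask, 2))
```

For these toy masks one fish pixel is missed, giving a pixel accuracy of 0.875, a mean accuracy of 0.9, a mean IoU of 0.775, and a frequency weighted IoU of 0.78125.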

2.7. Statistics

All statistics were performed using JASP (Version 0.11.1; Love et al. 2019). We ran a 2 (Source: iNaturalist, Randall) × 4 (Metric: pixel accuracy, mean accuracy, mean IoU, frequency weighted IoU) repeated-measures analysis of variance (ANOVA) to test for differences in image segmentation accuracy between manually annotated (reference) and Mask R-CNN generated segmentation mask contours for validation images from iNaturalist (complex backgrounds) and J.E. Randall’s collection (relatively uniform backgrounds).