Statistical filtering to aid in the classification of phytoplankton: The effects of image library size and phytoplankton shape

Ackerman, Josef; Farrow, Christopher

Published Mar 19, 2026 on Dryad. https://doi.org/10.5061/dryad.prr4xgz0f

Data files

Mar 19, 2026 version files (24.25 KB total)

Data-from-Farrow-Ackerman-Statistical-Filtering.xlsx (19.43 KB)

README.md (4.82 KB)

Abstract

The demand for image classification methods has increased due to technological advancements that enable more intensive phytoplankton monitoring. Regardless of whether the methods are based on statistical or machine learning algorithms, algal taxa may be misidentified in taxonomically diverse samples, in which phytoplankton morphology and image traits can be variable. We evaluated the statistical filtering performance for two approaches to image library development, which we applied independently to seven commonly occurring algal shapes. To do so, we used the statistical filter in the image processing software of an imaging flow cytometer (FlowCAM) and previously classified samples. One statistical filtering approach used a small selection of images (5-15 images of a target taxon) from the same sample being filtered (i.e., intrinsic), and the other used a larger selection of images (30-80 images of a target taxon) compiled from different samples. Filter accuracy, precision, and recall varied with the type of image library, image library size, and target taxon. The largest image libraries offered high recall (>90%) but low accuracy and precision with both image library building approaches. For the largest image libraries, accuracy and precision were higher for the intrinsic method (>90% and 72-97%) than the compiled method (>40% and 10-20% for most taxa, respectively). Statistical filtering performance was higher for larger, solitary-celled taxa with relatively uniform features (e.g., Gyrosigma) than for small-celled colonial species with more complex or variable shapes (e.g., mucilaginous colonial cyanobacteria and Scenedesmus). Results indicate that statistical filtering can be used to augment manual sample classification.

Dataset DOI: 10.5061/dryad.prr4xgz0f

Description of the data and file structure

Images of phytoplankton samples collected from the nearshore of Nottawasaga Bay, Lake Huron, in 2015 were acquired using a FlowCAM VS-4 (Yokogawa Fluid Imaging Technologies, Inc.) imaging flow cytometer, run in trigger mode with the 10X objective. We statistically filtered the phytoplankton image dataset using the built-in "like selected particles (statistical)" functions of VisualSpreadsheet software (versions 3.4.5 and 5.9.074, Yokogawa Fluid Imaging Technologies, Inc.). The data in the provided Data-from-Farrow-Ackerman-Statistical-Filtering.xlsx file are the raw statistical filtering outputs for the compiled library method and the intrinsic library method, which are needed to calculate the following performance metrics: accuracy, precision, and recall. The calculation of the performance metrics and further details of the statistical filtering procedure are described in the original article.

Worksheet/Tab 1: Compiled-Libraries_Filter

Column name	Description	Data format
Library Size	The number of library images used to filter a sample.	Number
Station	The sampling station the sample was collected from.	Category
Date	The date the sample was collected in the format (Year-Month-Day).	Date
Taxon	The algal taxon targeted by the statistical filter.	Category
Shape	The shape description of the corresponding algal taxon.	Category
NumSelected (TP+FP)	The number of images selected by the statistical filter (i.e., the sum of true positives and false positives).	Number
CorrSelected (TP)	The number of images correctly selected by the statistical filter for the targeted algal taxon (i.e., true positives).	Number
ActualCount (TP+FN)	The number of images belonging to the target algal taxon in the sample (i.e., the sum of true positives and false negatives).	Number
ErrSelected (FP)	The number of images incorrectly selected by the statistical filter (i.e., false positives).	Number
TotalCount (TP+TN+FP+FN)	The total number of images in the sample (i.e., the sum of true positives, true negatives, false positives, and false negatives).	Number
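
The count columns above are sufficient to derive accuracy, precision, and recall for each row. The following is a minimal sketch that assumes the standard confusion-matrix definitions of these metrics; the original article remains the authority on how the authors calculated them. The function name `filter_metrics` and the example counts are hypothetical and are not taken from the dataset.

```python
def filter_metrics(num_selected, corr_selected, actual_count, total_count):
    """Return (accuracy, precision, recall) for one worksheet row.

    Assumes the standard definitions:
      accuracy  = (TP + TN) / (TP + TN + FP + FN)
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
    """
    tp = corr_selected                 # CorrSelected (TP)
    fp = num_selected - corr_selected  # same quantity as ErrSelected (FP)
    fn = actual_count - corr_selected  # target images the filter missed
    tn = total_count - tp - fp - fn    # everything else in the sample

    nan = float("nan")  # returned when a denominator is zero
    accuracy = (tp + tn) / total_count if total_count else nan
    precision = tp / num_selected if num_selected else nan
    recall = tp / actual_count if actual_count else nan
    return accuracy, precision, recall


# Hypothetical counts for illustration only; not values from the dataset.
acc, prec, rec = filter_metrics(
    num_selected=50, corr_selected=40, actual_count=45, total_count=1000
)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Because true negatives are not stored directly, the sketch recovers them by subtracting the other three cells from TotalCount.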

Worksheet/Tab 2: Intrinsic-Libraries_Filter

Column name	Description	Units	Data format
Station	The sampling station the sample was collected from.	-	Category
Date	The date the sample was collected in the format (Year-Month-Day).	-	Date
NumInFilter	The number of images selected to perform the statistical filtering operation.	-	Number
Taxon	The algal taxon targeted by the statistical filter.	-	Category
Shape	The shape description of the corresponding algal taxon.	-	Category
NumSelected (TP+FP)	The number of images selected by the statistical filter (i.e., the sum of true positives and false positives).	-	Number
CorrSelected (TP)	The number of images correctly selected by the statistical filter for the targeted algal taxon (i.e., true positives).	-	Number
ActualCount (TP+FN)	The number of images belonging to the target algal taxon in the sample (i.e., the sum of true positives and false negatives).	-	Number
ErrSelected (FP)	The number of images incorrectly selected by the statistical filter (i.e., false positives).	-	Number
TotalCount (TP+TN+FP+FN)	The total number of images in the sample (i.e., the sum of true positives, true negatives, false positives, and false negatives).	-	Number
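
The column definitions in both worksheets imply a few arithmetic relationships that can serve as a quick sanity check after loading the data. The sketch below assumes each row has been read into a dictionary keyed by the column names exactly as listed above (the actual spreadsheet headers may differ); `check_row` and the example rows are hypothetical.

```python
def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    selected = row["NumSelected (TP+FP)"]
    correct = row["CorrSelected (TP)"]
    actual = row["ActualCount (TP+FN)"]
    wrong = row["ErrSelected (FP)"]
    total = row["TotalCount (TP+TN+FP+FN)"]

    problems = []
    # Selected images are either true positives or false positives.
    if selected != correct + wrong:
        problems.append("NumSelected != CorrSelected + ErrSelected")
    # True positives cannot exceed the number of target images present.
    if correct > actual:
        problems.append("CorrSelected exceeds ActualCount")
    # TP + FP + FN cannot exceed the sample total (TN would be negative).
    if selected + (actual - correct) > total:
        problems.append("TP + FP + FN exceeds TotalCount")
    return problems


# Hypothetical rows for illustration only; not values from the dataset.
good_row = {
    "NumSelected (TP+FP)": 50,
    "CorrSelected (TP)": 40,
    "ActualCount (TP+FN)": 45,
    "ErrSelected (FP)": 10,
    "TotalCount (TP+TN+FP+FN)": 1000,
}
bad_row = dict(good_row, **{"ErrSelected (FP)": 12})

print(check_row(good_row))
print(check_row(bad_row))
```

A row that fails any of these checks is worth comparing against the original article before metrics are calculated from it.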