Statistical filtering to aid in the classification of phytoplankton: The effects of image library size and phytoplankton shape
Data files
Mar 19, 2026 version files (24.25 KB)
- Data-from-Farrow-Ackerman-Statistical-Filtering.xlsx (19.43 KB)
- README.md (4.82 KB)
Abstract
The demand for image classification methods has increased due to technological advancements that enable more intensive phytoplankton monitoring. Regardless of whether the methods are based on statistical or machine learning algorithms, algal taxa may be misidentified in taxonomically diverse samples, in which phytoplankton morphology and image traits can be variable. We evaluated statistical filtering performance for two approaches to image library development, which we applied independently to seven commonly occurring algal shapes. To do so, we used the statistical filter in the image processing software of an imaging flow cytometer (FlowCAM) and previously classified samples. One statistical filtering approach used a small selection of images (5-15 images of a target taxon) from the same sample being filtered (i.e., intrinsic), and the other used a larger selection of images (30-80 images of a target taxon) compiled from different samples (i.e., compiled). Filter accuracy, precision, and recall varied with the type of image library, image library size, and target taxon. The largest image libraries offered high recall (>90%) but low accuracy and precision with both image library building approaches. For the largest image libraries, accuracy and precision were higher for the intrinsic method (>90% and 72-97%, respectively) than for the compiled method (>40% and 10-20%, respectively, for most taxa). Statistical filtering performance was higher for larger, solitary-celled taxa with relatively uniform features (e.g., Gyrosigma) than for small-celled colonial species with more complex or variable shapes (e.g., mucilaginous colonial cyanobacteria and Scenedesmus). Results indicate that statistical filtering can be used to augment manual sample classification.
Dataset DOI: 10.5061/dryad.prr4xgz0f
Description of the data and file structure
Phytoplankton samples collected from the nearshore of Nottawasaga Bay, Lake Huron, in 2015 were imaged using a FlowCAM vs-4 (Yokogawa Fluid Imaging Technologies, Inc.) imaging flow cytometer, run in trigger mode with the 10X objective. We statistically filtered the phytoplankton image dataset using the built-in "like selected particles (statistical)" function of VisualSpreadsheet software (versions 3.4.5 and 5.9.074, Yokogawa Fluid Imaging Technologies, Inc.). The provided Data-from-Farrow-Ackerman-Statistical-Filtering.xlsx file contains the raw statistical filtering outputs for the compiled library method and the intrinsic library method, which are needed to calculate the following performance metrics: accuracy, precision, and recall. The calculation of the performance metrics and further details of the statistical filtering procedure are described in the original article.
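For orientation, the three metrics can be written in terms of the confusion-matrix counts that the worksheet column names refer to (TP, FP, FN, TN). The block below assumes the standard definitions of accuracy, precision, and recall; the original article remains the authoritative source for how the metrics were actually computed. TN is not stored as its own column, so the last line shows how it would follow from the listed columns under that assumption.

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{\text{CorrSelected}}{\text{NumSelected}} \\[4pt]
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{\text{CorrSelected}}{\text{ActualCount}} \\[4pt]
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{CorrSelected} + TN}{\text{TotalCount}} \\[4pt]
TN &= \text{TotalCount} - \text{NumSelected} - \text{ActualCount} + \text{CorrSelected}
\end{aligned}
```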
Worksheet/Tab 1: Compiled-Libraries_Filter
| Column name | Description | Data format |
|---|---|---|
| Library Size | The number of library images used to filter a sample. | Number |
| Station | The sampling station the sample was collected from. | Category |
| Date | The date the sample was collected in the format (Year-Month-Day). | Date |
| Taxon | The algal taxon targeted by the statistical filter. | Category |
| Shape | The shape description of the corresponding algal taxon. | Category |
| NumSelected (TP+FP) | The number of images selected by the statistical filter (i.e., the sum of true positives and false positives). | Number |
| CorrSelected (TP) | The number of images correctly selected by the statistical filter for the targeted algal taxon (i.e., true positives). | Number |
| ActualCount (TP+FN) | The number of images belonging to the target algal taxon in the sample (i.e., the sum of true positives and false negatives). | Number |
| ErrSelected (FP) | The number of images incorrectly selected by the statistical filter (i.e., false positives). | Number |
| TotalCount (TP+TN+FP+FN) | The total number of images in the sample (i.e., the sum of true positives, true negatives, false positives, and false negatives). | Number |
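As an illustration of how the count columns above could be turned into the performance metrics, here is a minimal Python sketch. It assumes the standard confusion-matrix definitions of accuracy, precision, and recall (the article describes the authors' actual calculation). The function name `filter_metrics` and the example counts are hypothetical and are not taken from the dataset.

```python
def filter_metrics(num_selected, corr_selected, actual_count, total_count):
    """Return (accuracy, precision, recall) as fractions between 0 and 1.

    Arguments correspond to the worksheet columns NumSelected (TP+FP),
    CorrSelected (TP), ActualCount (TP+FN), and TotalCount (TP+TN+FP+FN).
    Standard metric definitions are assumed here.
    """
    tp = corr_selected
    fp = num_selected - corr_selected   # images selected in error
    fn = actual_count - corr_selected   # target images the filter missed
    tn = total_count - tp - fp - fn     # non-target images left unselected

    nan = float("nan")                  # returned when a denominator is zero
    precision = tp / num_selected if num_selected else nan
    recall = tp / actual_count if actual_count else nan
    accuracy = (tp + tn) / total_count if total_count else nan
    return accuracy, precision, recall


# Hypothetical counts for one filtered sample (not from the dataset).
acc, prec, rec = filter_metrics(
    num_selected=50, corr_selected=40, actual_count=44, total_count=1000
)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Returning NaN when a denominator is zero (for example, when the filter selects no images) is a design choice of this sketch; the article may treat such cases differently.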
Worksheet/Tab 2: Intrinsic-Libraries_Filter
| Column name | Description | Units | Data format |
|---|---|---|---|
| Station | The sampling station the sample was collected from. | - | Category |
| Date | The date the sample was collected in the format (Year-Month-Day). | - | Date |
| NumInFilter | The number of images selected to perform the statistical filtering operation. | - | Number |
| Taxon | The algal taxon targeted by the statistical filter. | - | Category |
| Shape | The shape description of the corresponding algal taxon. | - | Category |
| NumSelected (TP+FP) | The number of images selected by the statistical filter (i.e., the sum of true positives and false positives). | - | Number |
| CorrSelected (TP) | The number of images correctly selected by the statistical filter for the targeted algal taxon (i.e., true positives). | - | Number |
| ActualCount (TP+FN) | The number of images belonging to the target algal taxon in the sample (i.e., the sum of true positives and false negatives). | - | Number |
| ErrSelected (FP) | The number of images incorrectly selected by the statistical filter (i.e., false positives). | - | Number |
| TotalCount (TP+TN+FP+FN) | The total number of images in the sample (i.e., the sum of true positives, true negatives, false positives, and false negatives). | - | Number |
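Because several count columns in both worksheets are defined in terms of one another, a quick consistency check can be useful before any further analysis. The Python sketch below is one possible way to do this for a single row held as a dictionary. It assumes the spreadsheet headers match the column names listed in the tables exactly. The function name `check_row` and the example rows are hypothetical.

```python
def check_row(row):
    """Return a list of inconsistencies in one worksheet row (empty if none).

    `row` maps column names, assumed to match the README tables exactly,
    to integer counts.
    """
    tp = row["CorrSelected (TP)"]
    fp = row["ErrSelected (FP)"]
    selected = row["NumSelected (TP+FP)"]
    actual = row["ActualCount (TP+FN)"]
    total = row["TotalCount (TP+TN+FP+FN)"]

    problems = []
    # Selected images should split into correct and erroneous selections.
    if tp + fp != selected:
        problems.append("NumSelected != CorrSelected + ErrSelected")
    # The filter cannot correctly select more target images than exist.
    if tp > actual:
        problems.append("CorrSelected exceeds ActualCount")
    # TP + FP + FN must fit within the sample, leaving TN >= 0.
    fn = actual - tp
    if tp + fp + fn > total:
        problems.append("TP + FP + FN exceeds TotalCount")
    return problems


# Hypothetical rows (not from the dataset): one consistent, one not.
good = {
    "CorrSelected (TP)": 40,
    "ErrSelected (FP)": 10,
    "NumSelected (TP+FP)": 50,
    "ActualCount (TP+FN)": 44,
    "TotalCount (TP+TN+FP+FN)": 1000,
}
bad = dict(good, **{"ErrSelected (FP)": 12})

print(check_row(good))
print(check_row(bad))
```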
