Dryad

Data from: Impact of regularization methods and outlier removal on unsupervised sample classification

Abstract

Background: High-content assays (HCAs) have difficulty distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and to the large number and heterogeneity of the descriptors evaluated. The aim here was to determine whether preprocessing operations affected the reproducibility of class assignments of experimental data. Batch effects that could affect reproducibility, namely signal/noise ratio, instrumental conditions, and segmentation, were controlled or eliminated. The remaining batch effects arose from variations in materials, personnel, and cell culture environments.

Methods: Five trials were done using the same protocol. In each, one sample was treated with the same chemical mixture and another was treated with solvent vehicle alone. The means and population distributions of the treated (EXP) and control (CON) samples were indistinguishable within each trial. The datasets were therefore ideal for studying whether data preprocessing generated false positive or false negative results. Descriptors’ values were measured directly from images. Latent factor analysis was used to calculate unobservable variables. One of these variables turned out to represent an identifiable and interpretable type of protrusion, filopodia. This variable accounted for the fourth greatest variability in more comprehensive datasets, and so it was termed factor 4. Factor 4 values were used to test reproducibility of the EXP and CON statistics.

Results: Descriptor values are typically preprocessed by autoscaling, normalization, regularization, or z-scoring. If datasets were regularized within each of the five trials, significant differences were found among both repeated CON and repeated EXP samples. The mean of Trial 3 CON differed significantly from all other CON samples. Differences among the CON samples disappeared when the datasets were regularized to a comprehensive database. Among repeated EXPs, however, regularization to more comprehensive databases had little effect. Classification of samples with respect to one another within each trial yielded the same patterns after regularization to any comprehensive database. If regularization was done using datasets derived by different protocols, the differences introduced into the classification pattern were slight. They amounted to raising differences that had been marginal to the level of statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of a single sample’s contents were removed from most trials. Removing outliers based on the overall within-trial distributions caused type I and type II errors.
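The two regularization choices compared above can be illustrated with a minimal sketch. It assumes that "regularization" means z-scoring against a reference dataset, which the abstract lists as one of several preprocessing options. The function name `zscore`, the array names, and all numbers are hypothetical; the data are synthetic and are not the deposited data.

```python
import numpy as np

def zscore(values, reference):
    """Z-score `values` using the mean and sample standard deviation of `reference`."""
    values = np.asarray(values, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return (values - reference.mean()) / reference.std(ddof=1)

# Synthetic stand-in data (NOT the deposited data): descriptor values for one
# trial's CON and EXP samples, plus a larger hypothetical comprehensive database.
rng = np.random.default_rng(0)
con = rng.normal(10.0, 2.0, size=200)
exp = rng.normal(10.0, 2.0, size=200)
database = rng.normal(9.0, 3.0, size=5000)

trial = np.concatenate([con, exp])

# Within-trial regularization: reference statistics come from this trial only.
con_within = zscore(con, trial)
exp_within = zscore(exp, trial)

# Database regularization: reference statistics come from the larger pool.
con_db = zscore(con, database)
exp_db = zscore(exp, database)
```

Within a single trial the two references differ only by a shift and a positive rescaling, so the ordering of CON and EXP is the same under either choice. What changes is the reference against which means from different trials are compared: within-trial z-scoring re-centers every trial on its own mean, whereas a shared database gives all trials a common origin and scale.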

Conclusions: The results suggest that regularization is best done using a comprehensive database rather than a dataset restricted to the trial. This optimizes the repeatability of the mean values. However, repeatability is not a reliable measure of assay quality. Technical factors that vary from one assay to another, combined with small sample sizes and skewed distributions of descriptors’ values, may account for non-repeatability. Classification patterns are unaffected by such variations, despite irreducible technical factors such as differences in materials, personnel, and non-repeatable environments. The current results are based on real-world data and suggest that class assignments of repeated samples may be a good indicator of assay quality.