Data from: Impact of regularization methods and outlier removal on unsupervised sample classification

Name: Impact of regularization methods and outlier removal on unsupervised sample classification
Creator: Carol Heckman

Published May 19, 2026 on Dryad. https://doi.org/10.5061/dryad.gxd25481z

Data files

May 19, 2026 version files: 2.20 MB

File	Size
individually_regularized_trial-by-trial.zip	242.11 KB
matrix_800x102_plus_descriptive.csv	1.09 MB
README.md	9.31 KB
regularized_to_1510_cells_of_Trials1-5.zip	237.62 KB
regularized_to_2623_cells_same_protocol.zip	370.35 KB
regularized_to_448_controls.zip	91.83 KB
regularized_to_light_microscopy_DB.zip	35.40 KB
regularized_to_transfected_cells.zip	115.17 KB

Abstract

Background: High-content assays (HCAs) have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the great number and heterogeneity of descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Batch effects that could affect reproducibility, i.e., signal/noise ratio, instrumental conditions, and segmentation, were controlled or eliminated. The remaining batch effects arose from variations in materials, personnel, and cell culture environments.

Methods: Five trials were done using the same protocol. In each, one sample was treated with the same chemical mixture and another was treated with solvent vehicle alone. The means and population distributions of the treated (EXP) and control (CON) samples were indistinguishable within each trial. The datasets were therefore ideal for studying whether false positive or false negative results were generated by data preprocessing. Descriptors’ values were measured directly from images. Latent factor analysis was used to calculate unobservable variables. One of these variables corresponded to an identifiable and interpretable type of protrusion known as filopodia. It accounted for the fourth greatest variability in more comprehensive datasets, and so it was termed factor 4. Factor 4 values were used to test reproducibility of the EXP and CON statistics.

Results: Descriptor values are typically preprocessed by autoscaling, normalization, regularization, or z-scoring. If datasets were regularized within each of the five trials, significant differences were found among both repeated CON and repeated EXP samples. The mean of Trial 3 CON differed significantly from all other CON samples. Differences among the CON samples disappeared when the datasets were regularized to a comprehensive database. Among repeated EXPs, however, regularization to more comprehensive databases had little effect. Classification of samples with respect to one another within each trial yielded the same patterns after regularization to any comprehensive database. If regularization was done using datasets derived by different protocols, the differences introduced into the classification pattern were slight. They amounted to elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of a single sample’s contents were removed from most trials. Removing outliers based on the overall within-trial distributions caused type I and type II errors.

Conclusions: The results suggest that regularization is best done using a comprehensive database rather than a dataset restricted to the trial. This optimizes the repeatability of the mean values. However, repeatability is not a reliable measure of assay quality. Technical factors that vary from one assay to another, combined with small sample sizes and skewed distributions of descriptors’ values, may account for non-repeatability. Classification patterns are unaffected by such variations, despite irreducible technical factors such as differences in materials, personnel, and non-repeatable environments. The current results are based on real-world data and suggest that class assignments of repeated samples may be a good indicator of assay quality.

Dataset DOI: 10.5061/dryad.gxd25481z

Description of the data and file structure

In every trial, one sample was treated with solvent vehicle alone, and another was treated for 10 hours with phorbol 12-myristate 13-acetate (PMA) and lysophosphatidic acid (LPA). A sample consisted of silhouettes of 27-36 cells of a treated or control group. Under these conditions, PMA and LPA had no effect on the means or distribution of the variable analyzed.

The steps of data acquisition were: 1) acquire images of single cells by scanning electron microscopy, 2) trace the boundary, and 3) compute a total of 33 dimensionless descriptors’ values. An unobservable variable was computed from a linear combination of descriptors’ weighted values, using latent factor analysis in SAS (Statistical Analysis System). The weights for the descriptors were derived from a database of 800 cells that varied in tissue origins and differentiated states. These data, called the light microscopy database, were mean-centered and scaled by the standard deviation before computing the scoring coefficients. This process, variously known as autoscaling, normalization, regularization, StandardScaler, or z-scoring, is typically incorporated into workflows for handling microscopic images.
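The regularization and scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAS workflow used to produce the deposited files: the arrays are synthetic stand-ins, the scoring coefficients are random placeholders, the function names `regularize` and `factor_score` are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np


def regularize(values, reference):
    """Z-score `values` with the per-descriptor mean and SD of `reference`."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)  # sample SD; assumed, not confirmed by the source
    return (values - mean) / std


def factor_score(z_values, coefficients):
    """Latent variable as a linear combination of weighted, regularized descriptors."""
    return z_values @ coefficients


rng = np.random.default_rng(0)
# Synthetic stand-ins: an 800-cell reference database and one 30-cell sample,
# each with 33 descriptors. Real values would come from the deposited files.
database = rng.normal(loc=5.0, scale=2.0, size=(800, 33))
trial = rng.normal(loc=5.5, scale=2.0, size=(30, 33))
coeffs = rng.normal(size=33)  # placeholder scoring coefficients (hypothetical)

z_within = regularize(trial, trial)   # regularized to the trial itself
z_db = regularize(trial, database)    # regularized to the reference database
scores = factor_score(z_db, coeffs)   # one latent-factor value per cell
```

Regularizing a dataset to itself forces every descriptor mean to zero, whereas regularizing to an external database preserves offsets between the dataset and that database. This is the distinction the six archives explore.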

The aim of this research was to learn whether and how preprocessing operations affected the relationship between CON and EXP samples within a trial. When the differences between the means and distributions of control (CON) and PMA- and LPA-treated (EXP) samples were evaluated, class assignments proved reproducible regardless of the database used for preprocessing. Although the statistical classification of cells was unaffected by regularization to different databases, outlier removal was always deleterious. The outcomes of the analysis indicated that in vitro assays yielded highly reproducible results.

Archive: individually_regularized_trial-by-trial.zip

Archive: regularized_to_1510_cells_of_Trials1-5.zip

Archive: regularized_to_448_controls.zip

Additional untreated cells for the 448 Dataset: Surya Amarachintha, Marilyn L. Cayer, Carol Heckman

Archive: regularized_to_2623_cells_same_protocol.zip

Additional cells for the 2623 Dataset: Jason M. Urban, Mita Varghese, Andrew Roholt, Carol Heckman

Archive: regularized_to_light_microscopy_DB.zip

Archive: regularized_to_transfected_cells.zip

Cells for the transfection database: John G. Demuth, Santosh Malwade, Tamera Wales, Carol Heckman

The datasets show results from six different approaches to regularizing descriptor values

Folder: individually_regularized_trial_by_trial

Sub-folder: Trial 1- cells of Trial 1 are regularized to the mean and standard deviation of all 521 values

Sub-folder: Trial 2- cells of Trial 2 are regularized to the mean and standard deviation of all 254 values

Sub-folder: Trial 3- cells of Trial 3 are regularized to the mean and standard deviation of all 171 values

Sub-folder: Trial 4- cells of Trial 4 are regularized to the mean and standard deviation of all 270 values

Sub-folder: Trial 5- cells of Trial 5 are regularized to the mean and standard deviation of all 295 values

Trial 1 comprises 16 samples each treated in a different way. The means and standard deviations of the dataset are provided in 521_popMstddev and the primary variables’ values in 521CH_17193entries. The primary values are dimensionless.

Files:

alls1-test.stat

alls2-test.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)

alls3-test.stat

alls4-test.stat

alls5-test.stat

alls6-test.stat

alls7-test.stat

alls8-test.stat

alls9-test.stat

alls10-test.stat (treated with vehicle only)

alls11-test.stat

alls12-test.stat

alls13-test.stat

alls14-test.stat

alls15-test.stat

alls16-test.stat

Trial 2 comprises 8 samples each treated in a different way. The means and standard deviations of the dataset are provided in 254_popMstddev and the primary variables’ values in 254CH_8382entries. The primary values are dimensionless.

Files:

mita1A_julie.stat

mita1B_julie.stat

mita3A_julie.stat

mita3B_julie.stat

mita4A_julie.stat

mita4B_julie.stat

mita5control_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)

mita6control_julie.stat (treated with vehicle only)

Trial 3 comprises 5 samples each treated in a different way. The means and standard deviations of the dataset are provided in 171_popMstddev and the primary variables’ values in 171CH_5643entries. The primary values are dimensionless.

Files:

new_9G_julie.stat

new_10G_julie.stat

new_11G_julie.stat

new_13G_julie.stat (treated with vehicle only)

new_15G_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)

Trial 4 comprises 8 samples each treated in a different way. The means and standard deviations of the dataset are provided in 270_popMstddev and the primary variables’ values in 270CH_8910entries. The primary values are dimensionless.

Files:

new_1G_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)

new_2G_julie.stat

new_3G_julie.stat

new_4G_julie.stat

new_5G_julie.stat

new_6G_julie.stat

new_7G_julie.stat

new_8G_julie.stat (treated with vehicle only)

Trial 5 comprises 9 samples each treated in a different way. The means and standard deviations of the dataset are provided in 295_popMstddev and the primary variables’ values in 295CH_9735entries. The primary values are dimensionless.

Files:

new_LK1_julie.stat (treated with vehicle only)

new_LK2_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)

new_LK3_julie.stat

new_LK4_julie.stat

new_LK5_julie.stat

new_LK6_julie.stat

new_LK7_julie.stat

new_LK8_julie.stat

new_LK9_julie.stat

295_popMstddev

295CH_9735entries
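The entry counts embedded in the primary-value file names appear to equal the number of cells in each trial multiplied by the 33 descriptors. That reading is an inference from the file names, not something the README states. A short check, with the hypothetical helper `check_entry_counts`:

```python
N_DESCRIPTORS = 33  # descriptors computed per cell, per the data description

# trial number -> (cells in trial, entry count taken from the file name)
TRIAL_COUNTS = {
    1: (521, 17193),   # 521CH_17193entries
    2: (254, 8382),    # 254CH_8382entries
    3: (171, 5643),    # 171CH_5643entries
    4: (270, 8910),    # 270CH_8910entries
    5: (295, 9735),    # 295CH_9735entries
}


def check_entry_counts(counts, n_descriptors=N_DESCRIPTORS):
    """Return, per trial, whether cells * descriptors matches the named entry count."""
    return {
        trial: cells * n_descriptors == entries
        for trial, (cells, entries) in counts.items()
    }


result = check_entry_counts(TRIAL_COUNTS)
```

If the assumption holds, each primary-value file can be reshaped into a cells-by-33 matrix before regularization.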

NOTE: the only difference in subsequent datasets is the database used for regularization.

Folder: regularized_to_1510_cells_of_Trials1-5

Trials 1-5 cells regularized to the mean and standard deviation of all 1510 cells from Trials 1-5

Folder: regularized_to_448_controls

Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of cells treated with vehicle alone

Folder: regularized_to_2623_cells_same_protocol

Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of cells treated by the same protocol

Folder: regularized_to_light_microscopy_DB

Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of the 800-cell database (DB) used to develop low-dimensional descriptors.

Folder: regularized_to_transfected_cells

Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of 652 cells transfected with plasmids

File: matrix_800x102_plus_descriptive.csv

Description:

ID (cell number)

TIME (days in culture for the four cell types)

1000 W, preneoplastic cell line of Fisher rat trachea

2C1, nontumorigenic cell line of Fisher rat trachea

IAR20 PC1, preneoplastic cell line of BD-VI rat liver

BP3, highly malignant cell line of Fisher rat trachea

X1-X102 (primary variables)

CELL (code for cell line, HEC20= IAR20 PC1, HEC10=1000W, 2C= 2C1, HECBP=BP3)

FACTORS1-13 (a sample output of the lower dimensional scoring coefficients if parameters are constrained to 13 factors)
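As a sketch of how the CELL codes listed above could be decoded when reading the matrix file, the snippet below maps each code to its cell line and counts cells per line. It assumes that `ID`, `TIME`, and `CELL` appear as literal column headers in the CSV, which has not been verified against the file; the sample rows are synthetic, and `count_cell_lines` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Code-to-cell-line mapping taken from the CELL entry in the description.
CELL_LINES = {
    "HEC20": "IAR20 PC1",
    "HEC10": "1000 W",
    "2C": "2C1",
    "HECBP": "BP3",
}


def count_cell_lines(csv_text):
    """Count rows per cell line, decoding the CELL column (header name assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        CELL_LINES.get(row["CELL"].strip(), "unknown") for row in reader
    )


# Synthetic rows for illustration only; the real file has X1-X102 and factor columns.
sample = "ID,TIME,X1,CELL\n1,3,0.12,HEC20\n2,3,0.40,HEC10\n3,5,0.33,HEC20\n"
counts = count_cell_lines(sample)
```

For the deposited file, the same function could be given the text of matrix_800x102_plus_descriptive.csv once the actual header names are confirmed.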

Methods

Datasets were generated as previously described [1-6] and as summarized in the Abstract. Samples were from a total of five trials.

Trial 1: Jason M. Urban, Jessica Weber, Carol Heckman

Phorbol 12-myristate 13-acetate (PMA) and 1-oleoyl-sn-glycero-3-phosphate (lysophosphatidic acid, LPA) were purchased from LC Laboratories (Woburn, MA) and Fluka (Buchs, Germany), respectively. Trial 1 sample 10 was untreated.

Trial 1
Time (hours) treated	0.5 hours	2 hours	5 hours	10 hours	15 hours
PMA (2 nM) + wortmannin (60 nM)	sample 14	sample 11	sample 7	sample 1	sample 4
PMA (2 nM) + LPA (1.6 µM)	sample 15	sample 12	sample 8	sample 2	sample 5
PMA (2 nM)	sample 16	sample 13	sample 9	sample 3	sample 6
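The Trial 1 table can be turned into a lookup from file name to treatment. The sketch below assumes that the number in `allsN-test.stat` is the sample number N. That assumption is consistent with the annotations on alls2-test.stat and alls10-test.stat in the file list, but it is not stated explicitly. The helper name `describe_trial1_file` is hypothetical.

```python
import re

TIMES_H = [0.5, 2, 5, 10, 15]  # treatment durations in hours (table columns)

# Treatment -> sample numbers, in the same column order as TIMES_H.
TRIAL1_TABLE = {
    "PMA (2 nM) + wortmannin (60 nM)": [14, 11, 7, 1, 4],
    "PMA (2 nM) + LPA (1.6 µM)": [15, 12, 8, 2, 5],
    "PMA (2 nM)": [16, 13, 9, 3, 6],
}

SAMPLE_INFO = {
    sample: (treatment, hours)
    for treatment, samples in TRIAL1_TABLE.items()
    for sample, hours in zip(samples, TIMES_H)
}
SAMPLE_INFO[10] = ("untreated (vehicle only)", None)  # Trial 1 control


def describe_trial1_file(filename):
    """Return (treatment, hours) for a Trial 1 file name such as 'alls2-test.stat'."""
    match = re.fullmatch(r"alls(\d+)-test\.stat", filename)
    if match is None:
        raise ValueError(f"not a Trial 1 file name: {filename!r}")
    return SAMPLE_INFO[int(match.group(1))]


exp_info = describe_trial1_file("alls2-test.stat")
con_info = describe_trial1_file("alls10-test.stat")
```

Here `exp_info` and `con_info` describe the EXP and CON samples of Trial 1 that are compared across regularization schemes.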

Trial 2: Mita Varghese, Carol Heckman

PMA was purchased from LC Laboratories (Woburn, MA) or Sigma-Aldrich (St. Louis, MO) and 1-oleoyl-sn-glycero-3-phosphate from Fluka (Buchs, Germany) or Sigma-Aldrich. Trial 2 sample 8 was untreated.

Trial 2
1a	1b	3a	3b	4a	4b	5
blebbistatin, 2 hours	blebbistatin, 2 hours	altenusin, 2 hours	altenusin, 2 hours	H1152, 2 hours	H1152, 2 hours	treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours
0.8 µM	0.2 µM	680 µM	170 µM	20 µM	5 µM

Trial 3: Andrew Roholt, Mita Varghese, Marilyn Cayer, Carol Heckman

In Trials 3-5, PMA, LPA, blebbistatin, colchicine, and cytochalasin D (CD) were purchased from Sigma-Aldrich. Altenusin (SPC 16524) and H1152 were purchased from Alexis Chemicals (Lausen, Switzerland). In Trial 3, the medium was made with 10% fetal bovine serum (Atlanta Biologicals, GA), and sample 4 was untreated.

Trial 3
sample 1	sample 2	sample 3	sample 5
same as sample 5, treated with 0.3 µM CD at hour 4	same as sample 5, treated with 0.3 µM CD at hour 6	same as sample 5, treated with 0.3 µM CD at hour 8	treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours

Trial 4: Andrew Roholt, Mita Varghese, Marilyn Cayer, Carol Heckman

Medium was made with 10% fetal bovine serum (Hyclone, UT). Sample 8 was untreated.

Trial 4
sample 1	sample 2	sample 3	sample 4	sample 5	sample 6	sample 7
treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours	same as sample 1, treated with 0.3 µM CD at 4 hours	same as sample 1, treated with 0.3 µM CD at 6 hours	10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 4 hours	same as sample 1, treated with 0.3 µM CD at 8 hours	10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 6 hours	10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 8 hours

Trial 5: Mita Varghese, Lee-Key Tey, Marilyn Cayer, Carol Heckman. Sample 1 was untreated.

Trial 5
sample 2	sample 3	sample 4	sample 5	sample 6	sample 7	sample 8	sample 9
treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours	treated with 1 µM colchicine at 8 hours (no PMA or LPA)	same as sample 2, treated with 1 µM colchicine at 8 hours	treated with 0.3 µM CD at 8 hours (no PMA or LPA)	same as sample 2, treated with 0.3 µM CD at 8 hours	same as sample 2, treated with 0.3 µM CD at 6 hours	same as sample 2, treated with 0.3 µM CD at 4 hours	same as sample 2, treated with 0.3 µM CD at 2 hours

1. Heckman CA, Campbell AE, Wetzel B: Characteristic shape and surface changes in epithelial transformation. Experimental Cell Research 1987, 169:127-148.

2. Heckman CA, Jamasbi RJ: Describing shape dynamics in transformed rat cells through latent factors. Experimental Cell Research 1999, 246(1):69-82.

3. Heckman CA, Urban JM, Cayer M, Li Y, Boudreau N, Barnes J, Plummer HK, III, Hall C, Kozma R, Lim L: Novel p21-activated kinase-dependent protrusions characteristically formed at the edge of transformed cells. Experimental Cell Research 2004, 295:432-447.

4. Heckman CA, DeMuth JG, Deters D, Malwade SR, Cayer ML, Monfries C, Mamais A: Relationship of p21-activated kinase (PAK) and filopodia to persistence and oncogenic transformation. Journal of Cellular Physiology 2009, 220:576-585.

5. Varghese M: To Be or Not To Be a Protrusion: Unraveling the Determinants of Protrusion Formation. Bowling Green State University; 2011.

6. Amarachintha SP, Ryan KJ, Cayer M, Boudreau NS, Johnson NM, Heckman CA: Effect of Cdc42 domains on filopodia sensing, cell orientation, and haptotaxis. Cellular Signalling 2015, 27(3):683-693.
