Data from: Impact of regularization methods and outlier removal on unsupervised sample classification
Data files
May 19, 2026 version files 2.20 MB
individually_regularized_trial-by-trial.zip, 242.11 KB
matrix_800x102_plus_descriptive.csv, 1.09 MB
README.md, 9.31 KB
regularized_to_1510_cells_of_Trials1-5.zip, 237.62 KB
regularized_to_2623_cells_same_protocol.zip, 370.35 KB
regularized_to_448_controls.zip, 91.83 KB
regularized_to_light_microscopy_DB.zip, 35.40 KB
regularized_to_transfected_cells.zip, 115.17 KB
Abstract
Background: High-content assays (HCAs) have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the great number and heterogeneity of descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Batch effects that could affect reproducibility, i.e., signal/noise ratio, instrumental conditions, and segmentation, were controlled or eliminated. The remaining batch effects arose from variations in materials, personnel, and cell culture environments.
Methods: Five trials were done using the same protocol. In each, one sample was treated with the same chemical mixture and another was treated with solvent vehicle alone. The means and population distributions of the treated (EXP) and control (CON) samples were indistinguishable within each trial. The datasets were therefore ideal for studying whether false positive or false negative results were introduced by data preprocessing. Descriptors’ values were measured directly from images. Latent factor analysis was used to calculate unobservable variables. One of these variables corresponded to an identifiable and interpretable protrusion known as filopodia. It accounted for the fourth greatest variability in more comprehensive datasets, and so it was termed factor 4. Factor 4 values were used to test reproducibility of the EXP and CON statistics.
Results: Descriptor values are typically preprocessed by a process of autoscaling, normalization, regularization, or z-scoring. If datasets were regularized within each of the five trials, significant differences were found among both repeated CON and repeated EXP samples. The mean of Trial 3 CON differed significantly from all other CON samples. Differences among the CON samples disappeared when the datasets were regularized to a comprehensive database. Among repeated EXPs, however, regularization to more comprehensive databases had little effect. Classification of samples with respect to one another within each trial yielded the same patterns after regularization to any comprehensive database. If regularization was done using datasets derived by different protocols, the differences introduced into the classification pattern were slight. They amounted to elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of a single sample’s contents were removed from most trials. Removing outliers based on the overall within-trial distributions caused type I and type II errors.
Conclusions: The results suggest that regularization is best done using a comprehensive database rather than a dataset restricted to the trial. This optimizes the repeatability of the mean values. However, repeatability is not a reliable measure of assay quality. Technical factors that vary from one assay to another, combined with small sample sizes and skewed distributions of descriptors’ values, may account for non-repeatability. Classification patterns are unaffected by such variations, despite irreducible technical factors such as differences in materials, personnel, and non-repeatable environments. The current results are based on real-world data and suggest that class assignments of repeated samples may be a good indicator of assay quality.
Dataset DOI: 10.5061/dryad.gxd25481z
Description of the data and file structure
In every trial, one sample was treated with solvent vehicle alone, and another was treated for 10 hours with phorbol 12-myristate 13-acetate (PMA) and lysophosphatidic acid (LPA). A sample consisted of silhouettes of 27-36 cells of a treated or control group. Under these conditions, PMA and LPA had no effect on the means or distribution of the variable analyzed.
The steps of data acquisition were: 1) acquire images of single cells by scanning electron microscopy, 2) trace the boundary, and 3) compute a total of 33 dimensionless descriptors’ values. An unobservable variable was computed from a linear combination of descriptors’ weighted values, using latent factor analysis in SAS (Statistical Analysis System). The weights for the descriptors were derived from a database of 800 cells that varied in tissue origins and differentiated states. These data, called the light microscopy database, were mean-centered and scaled by the standard deviation before computing the scoring coefficients. This process, variously known as autoscaling, normalization, regularization, StandardScaler, or z-scoring, is typically incorporated into workflows for handling microscopic images.
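The regularization and scoring steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the SAS procedure used to produce the dataset. The function names, the toy arrays, the scoring weights, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np

def regularize(values, reference):
    """Z-score `values` with the column means and SDs of `reference`."""
    mean = reference.mean(axis=0)
    # Sample SD (ddof=1) is an assumption; the source does not state which SD was used.
    std = reference.std(axis=0, ddof=1)
    return (values - mean) / std

def factor_score(z_values, weights):
    """Weighted linear combination of regularized descriptors, one score per cell."""
    return z_values @ weights

# Toy data (hypothetical): 4 reference cells and 2 trial cells, 3 descriptors each.
reference = np.array([[1.0, 10.0, 0.2],
                      [2.0, 12.0, 0.4],
                      [3.0, 14.0, 0.6],
                      [4.0, 16.0, 0.8]])
trial = np.array([[2.5, 13.0, 0.5],   # equals the reference mean in every descriptor
                  [4.0, 10.0, 0.2]])
weights = np.array([0.5, -0.25, 0.1])  # hypothetical scoring coefficients

z = regularize(trial, reference)
scores = factor_score(z, weights)
print(scores)
```

Passing a different `reference` array to the same function corresponds to regularizing the same trial cells to a different database, which is the only thing that changes between the archives listed below.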
The aim of this research was to learn how and whether preprocessing operations affected the relationship between CON and EXP samples within a trial. When the differences between the means and distributions of control (CON) and PMA- and LPA-treated (EXP) samples were evaluated, however, it appeared that class assignments were reproducible despite differences in the database used for preprocessing. Although the statistical classification of cells was unaffected by regularization to different databases, outlier removal was always deleterious. The outcomes of the analysis indicated that in vitro assays yielded highly reproducible results.
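The sketch below illustrates one way in which outlier removal based on the pooled within-trial distribution can fall unevenly on a single sample. The |z| > 2 cutoff, the function name, and the toy values are assumptions for illustration only; the outlier definitions actually tested are not specified in this description.

```python
import numpy as np

def fraction_removed_per_sample(values, labels, cutoff=2.0):
    """Flag outliers against the pooled mean and SD, then report the
    fraction of each sample that would be removed."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    z = (values - values.mean()) / values.std(ddof=1)
    is_outlier = np.abs(z) > cutoff  # hypothetical cutoff
    return {str(name): float(is_outlier[labels == name].mean())
            for name in np.unique(labels)}

# Toy factor values (hypothetical): one skewed value sits in the EXP sample.
values = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, 3.0]
labels = ["CON"] * 5 + ["EXP"] * 5

removed = fraction_removed_per_sample(values, labels)
print(removed)
```

In this toy case every removed cell comes from one sample, so that sample's mean shifts while the other is untouched. With small samples and skewed distributions, a shift of this kind can change the outcome of a comparison between the two.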
Archive: individually_regularized_trial-by-trial.zip
Archive: regularized_to_1510_cells_of_Trials1-5.zip
Archive: regularized_to_448_controls.zip
Additional untreated cells for the 448 Dataset: Surya Amarachintha, Marilyn L. Cayer, Carol Heckman
Archive: regularized_to_2623_cells_same_protocol.zip
Additional cells for the 2623 Dataset: Jason M. Urban, Mita Varghese, Andrew Roholt, Carol Heckman
Archive: regularized_to_light_microscopy_DB.zip
Archive: regularized_to_transfected_cells.zip
Cells for the transfection database: John G. Demuth, Santosh Malwade, Tamera Wales, Carol Heckman
The datasets show results from six different approaches to regularizing descriptor values.
Folder: individually_regularized_trial_by_trial
Sub-folder: Trial 1- cells of Trial 1 are regularized to the mean and standard deviation of all 521 values
Sub-folder: Trial 2- cells of Trial 2 are regularized to the mean and standard deviation of all 254 values
Sub-folder: Trial 3- cells of Trial 3 are regularized to the mean and standard deviation of all 171 values
Sub-folder: Trial 4- cells of Trial 4 are regularized to the mean and standard deviation of all 270 values
Sub-folder: Trial 5- cells of Trial 5 are regularized to the mean and standard deviation of all 295 values
Trial 1 comprises 16 samples each treated in a different way. The means and standard deviations of the dataset are provided in 521_popMstddev and the primary variables’ values in 521CH_17193entries. The primary values are dimensionless.
Files:
alls1-test.stat
alls2-test.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)
alls3-test.stat
alls4-test.stat
alls5-test.stat
alls6-test.stat
alls7-test.stat
alls8-test.stat
alls9-test.stat
alls10-test.stat (treated with vehicle only)
alls11-test.stat
alls12-test.stat
alls13-test.stat
alls14-test.stat
alls15-test.stat
alls16-test.stat
Trial 2 comprises 8 samples each treated in a different way. The means and standard deviations of the dataset are provided in 254_popMstddev and the primary variables’ values in 254CH_8382entries. The primary values are dimensionless.
Files:
mita1A_julie.stat
mita1B_julie.stat
mita3A_julie.stat
mita3B_julie.stat
mita4A_julie.stat
mita4B_julie.stat
mita5control_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)
mita6control_julie.stat (treated with vehicle only)
Trial 3 comprises 5 samples each treated in a different way. The means and standard deviations of the dataset are provided in 171_popMstddev and the primary variables’ values in 171CH_5643entries. The primary values are dimensionless.
Files:
new_9G_julie.stat
new_10G_julie.stat
new_11G_julie.stat
new_13G_julie.stat (treated with vehicle only)
new_15G_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)
Trial 4 comprises 8 samples each treated in a different way. The means and standard deviations of the dataset are provided in 270_popMstddev and the primary variables’ values in 270CH_8910entries. The primary values are dimensionless.
Files:
new_1G_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)
new_2G_julie.stat
new_3G_julie.stat
new_4G_julie.stat
new_5G_julie.stat
new_6G_julie.stat
new_7G_julie.stat
new_8G_julie.stat (treated with vehicle only)
Trial 5 comprises 9 samples each treated in a different way. The means and standard deviations of the dataset are provided in 295_popMstddev and the primary variables’ values in 295CH_9735entries. The primary values are dimensionless.
Files:
new_LK1_julie.stat (treated with vehicle only)
new_LK2_julie.stat (treated 10 hours with PMA (2 nM) + LPA (1.6 µM), see Methods)
new_LK3_julie.stat
new_LK4_julie.stat
new_LK5_julie.stat
new_LK6_julie.stat
new_LK7_julie.stat
new_LK8_julie.stat
new_LK9_julie.stat
295_popMstddev
295CH_9735entries
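The entry counts embedded in the primary-value file names appear to equal the number of cells in the trial multiplied by the 33 descriptors computed per cell. This reading is inferred from the numbers rather than stated in the file descriptions, and the helper name below is hypothetical. A short check:

```python
import re

DESCRIPTORS_PER_CELL = 33  # from the data acquisition description

# Cell counts and primary-value file names as listed for each trial.
primary_value_files = {
    "Trial 1": (521, "521CH_17193entries"),
    "Trial 2": (254, "254CH_8382entries"),
    "Trial 3": (171, "171CH_5643entries"),
    "Trial 4": (270, "270CH_8910entries"),
    "Trial 5": (295, "295CH_9735entries"),
}

def entries_from_name(filename):
    """Extract the entry count from a name such as '521CH_17193entries'."""
    match = re.search(r"_(\d+)entries$", filename)
    return int(match.group(1))

checks = {
    trial: cells * DESCRIPTORS_PER_CELL == entries_from_name(name)
    for trial, (cells, name) in primary_value_files.items()
}
print(checks)
```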
NOTE: the only difference in subsequent datasets is the database used for regularization.
Folder: regularized_to_1510_cells_of_Trials1-5
Trials 1-5 cells regularized to the mean and standard deviation of all 1510 cells from Trials 1-5
Folder: regularized_to_448_controls
Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of cells treated with vehicle alone
Folder: regularized_to_2623_cells_same_protocol
Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of cells treated by the same protocol
Folder: regularized_to_light_microscopy_DB
Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of the 800-cell database (DB) used to develop low-dimensional descriptors.
Folder: regularized_to_transfected_cells
Sub-folders: Trials 1-5 cells regularized to the mean and standard deviation of 652 cells transfected with plasmids
File: matrix_800x102_plus_descriptive.csv
Description:
ID (cell number)
TIME (days in culture for the four cell types)
1000 W, preneoplastic cell line of Fisher rat trachea
2C1, nontumorigenic cell line of Fisher rat trachea
IAR20 PC1, preneoplastic cell line of BD-VI rat liver
BP3, highly malignant cell line of Fisher rat trachea
X1-X102 (primary variables)
CELL (code for cell line, HEC20= IAR20 PC1, HEC10=1000W, 2C= 2C1, HECBP=BP3)
FACTORS1-13 (a sample output of the lower dimensional scoring coefficients if parameters are constrained to 13 factors)
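A minimal sketch of reading a file with this layout, using only the Python standard library. The exact header spellings (`ID`, `TIME`, `X1` ... `X102`, `CELL`) are assumed from the description above and should be checked against the actual file; the function name and the toy rows are hypothetical.

```python
import csv
import io
import re
from collections import Counter

def read_matrix(handle):
    """Read rows as dictionaries and return them with the primary-variable
    columns (X1, X2, ...) sorted in numeric order."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    x_cols = sorted(
        (c for c in reader.fieldnames if re.fullmatch(r"X\d+", c)),
        key=lambda c: int(c[1:]),
    )
    return rows, x_cols

# Toy stand-in for the real file (hypothetical values, only two X columns).
toy_csv = io.StringIO(
    "ID,TIME,X1,X2,CELL\n"
    "1,3,0.12,1.5,HEC20\n"
    "2,3,0.10,1.7,HEC20\n"
    "3,5,0.30,1.1,2C\n"
)

rows, x_cols = read_matrix(toy_csv)
cells_per_line = Counter(row["CELL"] for row in rows)
print(x_cols, dict(cells_per_line))
```

To apply the same function to the archived file, the `io.StringIO` object would be replaced by `open("matrix_800x102_plus_descriptive.csv", newline="")`.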
Datasets were generated as previously described [1-6] and as summarized in the Abstract. Samples were from a total of five trials.
Trial 1: Jason M. Urban, Jessica Weber, Carol Heckman
Phorbol 12-myristate 13-acetate (PMA) and 1-oleoyl-sn-glycero-3-phosphate (lysophosphatidic acid, LPA) were purchased from LC Laboratories (Woburn, MA) and Fluka (Buchs, Germany), respectively. Trial 1 sample 10 was untreated.
| Trial 1 | |||||
|---|---|---|---|---|---|
| Time (hours) treated | 0.5 hours | 2 hours | 5 hours | 10 hours | 15 hours |
| PMA (2 nM) + wortmannin (60 nM) | sample 14 | sample 11 | sample 7 | sample 1 | sample 4 |
| PMA (2 nM) + LPA (1.6 µM) | sample 15 | sample 12 | sample 8 | sample 2 | sample 5 |
| PMA (2 nM) | sample 16 | sample 13 | sample 9 | sample 3 | sample 6 |
Trial 2: Mita Varghese, Carol Heckman
PMA was purchased from LC Laboratories (Woburn, MA) or Sigma-Aldrich (St. Louis, MO) and 1-oleoyl-sn-glycero-3-phosphate from Fluka (Buchs, Germany) or Sigma-Aldrich. Trial 2 sample 8 was untreated.
| Trial 2 | ||||||
|---|---|---|---|---|---|---|
| 1a | 1b | 3a | 3b | 4a | 4b | 5 |
| blebbistatin, 2 hours | blebbistatin, 2 hours | altenusin, 2 hours | altenusin, 2 hours | H1152, 2 hours | H1152, 2 hours | treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours |
| 0.8 µM | 0.2 µM | 680 µM | 170 µM | 20 µM | 5 µM | |
Trial 3: Andrew Roholt, Mita Varghese, Marilyn Cayer, Carol Heckman
In Trials 3-5, PMA, LPA, blebbistatin, colchicine, and cytochalasin D (CD) were purchased from Sigma-Aldrich. Altenusin (SPC 16524) and H1152 were purchased from Alexis Chemicals (Lausen, Switzerland). In Trial 3, the medium was made with 10% fetal bovine serum (Atlanta Biologicals, GA), and sample 4 was untreated.
| Trial 3 | |||
|---|---|---|---|
| sample 1 | sample 2 | sample 3 | sample 5 |
| same as sample 5, treated with 0.3 µM CD at hour 4 | same as sample 5, treated with 0.3 µM CD at hour 6 | same as sample 5, treated with 0.3 µM CD at hour 8 | treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours |
Trial 4: Andrew Roholt, Mita Varghese, Marilyn Cayer, Carol Heckman
Medium was made with 10% fetal bovine serum (Hyclone, UT). Sample 8 was untreated.
| Trial 4 | ||||||
|---|---|---|---|---|---|---|
| sample 1 | sample 2 | sample 3 | sample 4 | sample 5 | sample 6 | sample 7 |
| treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours | same as sample 1, treated with 0.3 µM CD at 4 hours | same as sample 1, treated with 0.3 µM CD at 6 hours | 10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 4 hours | same as sample 1, treated with 0.3 µM CD at 8 hours | 10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 6 hours | 10 h PMA (2 nM) + A23187 (100 nM), treated with 0.3 µM CD at 8 hours |
Trial 5: Mita Varghese, Lee-Key Tey, Marilyn Cayer, Carol Heckman. Sample 1 was untreated.
| Trial 5 | |||||||
|---|---|---|---|---|---|---|---|
| sample 2 | sample 3 | sample 4 | sample 5 | sample 6 | sample 7 | sample 8 | sample 9 |
| treated with PMA (2 nM) + LPA (1.6 µM) for 10 hours | treated with 1 µM colchicine at 8 hours (no PMA or LPA) | same as sample 2, treated with 1 µM colchicine at 8 hours | treated with 0.3 µM CD at 8 hours (no PMA or LPA) | same as sample 2, treated with 0.3 µM CD at 8 hours | same as sample 2, treated with 0.3 µM CD at 6 hours | same as sample 2, treated with 0.3 µM CD at 4 hours | same as sample 2, treated with 0.3 µM CD at 2 hours |
1. Heckman CA, Campbell AE, Wetzel B: Characteristic shape and surface changes in epithelial transformation. Experimental Cell Research 1987, 169:127-148.
2. Heckman CA, Jamasbi RJ: Describing shape dynamics in transformed rat cells through latent factors. Experimental Cell Research 1999, 246(1):69-82.
3. Heckman CA, Urban JM, Cayer M, Li Y, Boudreau N, Barnes J, Plummer HK, III, Hall C, Kozma R, Lim L: Novel p21-activated kinase-dependent protrusions characteristically formed at the edge of transformed cells. Experimental Cell Research 2004, 295:432-447.
4. Heckman CA, DeMuth JG, Deters D, Malwade SR, Cayer ML, Monfries C, Mamais A: Relationship of p21-activated kinase (PAK) and filopodia to persistence and oncogenic transformation. Journal of Cellular Physiology 2009, 220:576-585.
5. Varghese M: To Be or Not To Be a Protrusion: Unraveling the Determinants of Protrusion Formation. Bowling Green State University; 2011.
6. Amarachintha SP, Ryan KJ, Cayer M, Boudreau NS, Johnson NM, Heckman CA: Effect of Cdc42 domains on filopodia sensing, cell orientation, and haptotaxis. Cellular Signalling 2015, 27(3):683-693.
