Code from: Testing for normality in regression models: mistakes abound (but may not matter)
Data files
May 17, 2025 version files (12.28 KB)
- Normality_Code.zip (11.14 KB)
- README.md (1.14 KB)
Abstract
This study examines the misuse of normality tests in linear regression within ecology and biology, focusing on common misconceptions. A bibliometric review found that over 70% of ecology papers and 90% of biology papers incorrectly applied normality tests to raw data instead of model residuals. To assess the impact of this error, we simulated datasets with normal, interval, and skewed distributions across various sample and effect sizes. We compared statistical power between two approaches: testing the whole dataset for normality (incorrect) versus testing model residuals (correct) to determine whether to use a parametric (t-test) or nonparametric (Mann-Whitney U test) method. Our results showed minimal differences in statistical power between the approaches, even when normality was incorrectly tested on raw data. However, when residuals violated the normality assumption, using the Mann-Whitney U test increased statistical power by 3–4%. Overall, the study suggests that, while correctly testing residuals for normality enhances model performance, the impact of testing raw data is negligible in terms of power loss, especially with large sample sizes. The findings highlight the need for more awareness around proper statistical practices, especially in evaluating the assumptions of linear models.
https://doi.org/10.5061/dryad.sqv9s4nd0
Description of the data and file structure
The data files include those required to reproduce the analysis in "It’s OK Not to be Normal: Usage of Normality Tests in Linear Models" by S.R. Midway and J.W. White.
Files and variables
File: Normality_Code.zip
Description: Unzips to 5 files, all R scripts.
- "interval_sims.R", "lognormal_sims.R", and "normal_sims.R" generate the data used in the study, each based on its respective distribution.
- "normality_comp.R" reproduces the comparison of different tests of normality.
- "workflows_power.R" reproduces the 3 analytical decisions in the manuscript.
Code/software
All code is included in the attached files. The files are R scripts that can be run in the free software R; the libraries each script requires are specified within that script.
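For orientation, the workflow comparison described in the abstract can be sketched in a few lines of base R. This is a minimal illustration and not the code shipped in Normality_Code.zip: the function names `choose_test_p` and `estimate_power`, the use of the Shapiro-Wilk test as the normality check, the additive group shift, and all sample sizes, effect sizes, and simulation counts are assumptions made here for the example.

```r
# Hypothetical sketch; not the authors' scripts. All settings are illustrative.
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Return the p-value of a two-group comparison, choosing a t-test or a
# Mann-Whitney U test from a Shapiro-Wilk check (assumed normality test)
# applied to either the pooled raw data or the model residuals.
choose_test_p <- function(y, group, check = c("residuals", "raw"), alpha = 0.05) {
  check <- match.arg(check)
  # Values handed to the normality test: raw response or lm() residuals
  x <- if (check == "raw") y else residuals(lm(y ~ group))
  normal_ok <- shapiro.test(x)$p.value > alpha
  if (normal_ok) {
    t.test(y ~ group, var.equal = TRUE)$p.value
  } else {
    wilcox.test(y ~ group)$p.value
  }
}

# Estimate power as the share of simulated datasets with p < alpha.
# Data are lognormal (skewed) with an additive shift for the second group.
estimate_power <- function(n_per_group, effect, check, n_sims = 200, alpha = 0.05) {
  group <- factor(rep(c("a", "b"), each = n_per_group))
  p <- replicate(n_sims, {
    y <- rlnorm(2 * n_per_group, meanlog = 0, sdlog = 1) + effect * (group == "b")
    choose_test_p(y, group, check = check, alpha = alpha)
  })
  mean(p < alpha)
}

power_raw   <- estimate_power(20, effect = 1, check = "raw")
power_resid <- estimate_power(20, effect = 1, check = "residuals")
print(c(raw = power_raw, residuals = power_resid))
```

Changing `rlnorm()` to another generator, or varying `n_per_group` and `effect`, gives a rough feel for how the two workflows compare; the archived scripts should be used to reproduce the reported results.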
Access information
Other publicly accessible locations of the data:
- None.
Data was derived from the following sources:
- All data is simulated.
