Data from: When correcting for regression to the mean is worse than no correction at all
Data files
Mar 23, 2026 version files (233.79 KB total)
- README.md (5.06 KB)
- Repository_Folders.zip (228.73 KB)
Abstract
This repository contains the source code and data for our study on the statistical pitfalls of correcting for Regression to the Mean (RTM). In biological research, observed changes between initial and final measurements are often negatively correlated with initial values. While researchers frequently apply statistical corrections to remove this artifact, we demonstrate through a structural modeling framework that these corrections can introduce more bias than they remove if the underlying causal model is not properly specified. Using simulations of blood pressure systems and empirical analyses of lizard heat tolerance and bird telomere attrition, we show that standard adjustments (e.g., Berry et al. 1984) can create spurious biological trends. We conclude that valid RTM correction requires an explicit causal model—specifically, distinguishing between stable between-subject variance and transient measurement or biological noise—rather than the application of generic statistical formulas.
José F. Fontanari (1) and Mauro Santos (2, 3)
(1) Instituto de Física de São Carlos, Universidade de São Paulo, São Carlos, SP, Brazil
(2) Departament de Genètica i de Microbiologia, Grup de Genòmica, Bioinformàtica i Biologia Evolutiva (GBBE), Universitat Autònoma de Barcelona, Spain
(3) cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Lisboa, Portugal
Corresponding author; e-mail: mauro.santos@uab.es
Repository Organization
The repository is organized into ten folders, each corresponding to a specific figure or analysis in the manuscript.
Note: Each folder contains a dedicated README.md file with detailed information regarding specific variable mappings, column descriptions for the data files, and precise compilation instructions for that analysis.
Data Dictionary (General Structure)
While specific details are provided within each folder, the data files generally follow these formats:
- Empirical/Bivariate Data: Two-column space-delimited files where Column 1 is the initial measurement (x1) and Column 2 is the final measurement (x2).
- Simulation Results: Multi-column files (3–5 columns) representing different statistical estimators (Crude, Berry, Blomqvist, and True).
- Histogram/Profile Data: Two-column files where Column 1 is the bin value (e.g., Slope or β) and Column 2 is the density or log-likelihood value.
File: Repository_Folders.zip
Each folder includes a README file that provides a description of all files within the folder.
Folder Name // Description
Fig1_Structural_Null // Simulation of the blood pressure model under a null biological effect.
Fig2_Theoretical_Analysis // Deterministic calculations of expected slope bias across noise gradients.
Fig3_Sampling_Distributions // Monte Carlo comparison of four different slope estimators.
Fig4_Bootstrap_Testing // Bootstrap procedure for testing the null structural hypothesis.
Fig5_A_carolinensis_Empirical // Application of the model to lizard heat tolerance plasticity.
Fig6_Telomere_Analysis // Comparative adjustments (Berry vs. Blomqvist) for telomere data.
Fig7_Telomere_Bootstrap // Bootstrap distributions for empirical telomere attrition slopes.
FigS1_MSE_Analysis // Large-scale simulation of Mean Squared Error and numerical instability.
FigS2_Profile_Likelihood // Likelihood analysis under varying assumptions of repeatability (R).
FigS3_Unknown_R_Analysis // Illustration of parameter non-identifiability and likelihood plateaus.
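As a minimal sketch of how the two-column empirical format from the data dictionary might be read outside Fortran, the Python snippet below parses space-delimited x1/x2 pairs and computes the uncorrected ("crude") slope of the change x2 - x1 on the initial value x1. It is not taken from the repository: the function names `read_pairs` and `crude_slope` and the sample values are hypothetical, and the parser assumes the files have no header row. By ordinary least squares algebra this slope equals r·(s2/s1) - 1, where r is the correlation between x1 and x2 and s1, s2 are their standard deviations, which is why it tends to be negative even without any biological effect.

```python
from statistics import mean


def read_pairs(text):
    """Parse whitespace-delimited rows 'x1 x2' into two lists of floats.

    Assumption: no header row; lines with fewer than two fields are skipped.
    """
    x1, x2 = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        x1.append(float(fields[0]))
        x2.append(float(fields[1]))
    return x1, x2


def crude_slope(x1, x2):
    """OLS slope of the change (x2 - x1) regressed on the initial value x1."""
    change = [b - a for a, b in zip(x1, x2)]
    m1, mc = mean(x1), mean(change)
    sxx = sum((a - m1) ** 2 for a in x1)
    sxc = sum((a - m1) * (c - mc) for a, c in zip(x1, change))
    return sxc / sxx


# Hypothetical sample data in the documented layout (not from the repository).
sample = """\
1.0 2.0
2.0 2.5
3.0 3.0
4.0 3.5
"""

x1, x2 = read_pairs(sample)
print(crude_slope(x1, x2))
```

To apply it to a real file, pass the file's contents (for example via `open(path).read()`) to `read_pairs`. The adjusted estimators discussed in the manuscript are deliberately not reproduced here; their exact definitions are given in the per-folder README files and source code.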
Software & Hardware Requirements
- Compiler: GNU Fortran (gfortran)
- Version Used: GNU Fortran (Homebrew GCC 15.1.0) 15.1.0
- Language: Fortran 90/95
- Testing Environment: The programs were compiled and run on macOS (Apple Silicon), Linux (Ubuntu), and Windows (via MinGW). No external libraries are required beyond the standard gfortran runtime library.
Data Sources
The empirical analyses utilize datasets from the following sources:
- Lizard (Anolis) Data:
- Telomere (Blue Tit) Data:
Usage Instructions
To reproduce an analysis:
1. Navigate to the folder corresponding to the specific figure.
2. Consult the internal README.md for variable definitions.
3. Compile the source code using gfortran [filename].f90 -o [outputname].
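For readers who want to script step 3 across a folder, the following Python sketch wraps the same `gfortran [filename].f90 -o [outputname]` pattern. It is an illustration only and rests on assumptions: that each `.f90` file in a folder is a standalone program, that the executable should take the source file's base name, and that `gfortran` is on the PATH. The helper names `compile_command` and `compile_folder` and the file name `model.f90` are hypothetical; the per-folder README.md remains the authority on the exact compilation steps.

```python
import shutil
import subprocess
from pathlib import Path


def compile_command(source: Path) -> list[str]:
    """Build the gfortran command for one source file.

    Mirrors the README pattern: gfortran [filename].f90 -o [outputname],
    with the output named after the source file (an assumption).
    """
    return ["gfortran", source.name, "-o", source.stem]


def compile_folder(folder: Path) -> list[Path]:
    """Compile every .f90 file in `folder`, treating each as a standalone program."""
    if shutil.which("gfortran") is None:
        raise RuntimeError("gfortran was not found on PATH")
    built = []
    for source in sorted(folder.glob("*.f90")):
        # Run inside the folder so relative data-file paths behave as expected.
        subprocess.run(compile_command(source), check=True, cwd=folder)
        built.append(folder / source.stem)
    return built


# Hypothetical file name, used only to show the command that would be run.
print(compile_command(Path("Fig1_Structural_Null/model.f90")))
```

Calling `compile_folder(Path("Fig1_Structural_Null"))` would then attempt to build each program in that folder. On Windows with MinGW, the resulting executable normally carries an `.exe` suffix, so the returned paths would need adjusting there.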
