Data from: When correcting for regression to the mean is worse than no correction at all

Fontanari, José F.; Santos, Mauro

Published Mar 23, 2026 on Dryad. https://doi.org/10.5061/dryad.r4xgxd2s8

Data files

Mar 23, 2026 version files: 233.79 KB

README.md (5.06 KB)

Repository_Folders.zip (228.73 KB)

Abstract

This repository contains the source code and data for our study on the statistical pitfalls of correcting for Regression to the Mean (RTM). In biological research, observed changes between initial and final measurements are often negatively correlated with initial values. While researchers frequently apply statistical corrections to remove this artifact, we demonstrate through a structural modeling framework that these corrections can introduce more bias than they remove if the underlying causal model is not properly specified. Using simulations of blood pressure systems and empirical analyses of lizard heat tolerance and bird telomere attrition, we show that standard adjustments (e.g., Berry et al. 1984) can create spurious biological trends. We conclude that valid RTM correction requires an explicit causal model—specifically, distinguishing between stable between-subject variance and transient measurement or biological noise—rather than the application of generic statistical formulas.

José F. Fontanari¹ and Mauro Santos²,³

¹Instituto de Física de São Carlos, Universidade de São Paulo, São Carlos, SP, Brazil

²Departament de Genètica i de Microbiologia, Grup de Genòmica, Bioinformàtica i Biologia Evolutiva (GBBE), Universitat Autònoma de Barcelona, Spain

³cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Lisboa, Portugal

Corresponding author; e-mail: mauro.santos@uab.es

Study Summary

Using simulations of blood pressure systems and empirical analyses of lizard heat tolerance and bird telomere attrition, we show that standard adjustments (e.g., Berry et al. 1984) can create spurious biological trends. We conclude that valid RTM correction requires an explicit causal model, specifically one that distinguishes between stable between-subject variance and transient measurement or biological noise, rather than the application of generic statistical formulas.
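The distinction between stable between-subject variance and transient noise can be illustrated with a small simulation. The sketch below is not the repository's Fortran code and does not use the manuscript's blood pressure parameters; the function names, the standard deviations, and the sample size are all hypothetical. It assumes each subject has a fixed true value and that both measurements add independent noise with the same variance, so there is no real change between the two measurements. Under those assumptions the slope of the observed change on the initial value is expected to be -(1 - R), where R is taken here to be the usual repeatability ratio (between-subject variance divided by total variance), which may be parameterized differently in the manuscript.

```python
import random
from statistics import mean


def simulate_null(n, sd_between, sd_noise, seed=0):
    """Draw n (x1, x2) pairs with a stable true value and independent noise.

    Hypothetical null model: no true change between measurements.
    """
    rng = random.Random(seed)
    x1, x2 = [], []
    for _ in range(n):
        true_value = rng.gauss(0.0, sd_between)  # stable subject-level value
        x1.append(true_value + rng.gauss(0.0, sd_noise))  # initial measurement
        x2.append(true_value + rng.gauss(0.0, sd_noise))  # final measurement
    return x1, x2


def slope_of_change(x1, x2):
    """Ordinary least-squares slope of the change (x2 - x1) on x1."""
    change = [b - a for a, b in zip(x1, x2)]
    mean_x1, mean_change = mean(x1), mean(change)
    sxy = sum((a - mean_x1) * (c - mean_change) for a, c in zip(x1, change))
    sxx = sum((a - mean_x1) ** 2 for a in x1)
    return sxy / sxx


# Illustrative values only (assumptions, not taken from the manuscript).
sd_between, sd_noise = 2.0, 1.0
repeatability = sd_between**2 / (sd_between**2 + sd_noise**2)

x1, x2 = simulate_null(50_000, sd_between, sd_noise)
print("observed slope:", slope_of_change(x1, x2))
print("expected under the null:", -(1.0 - repeatability))
```

The negative slope appears even though no subject truly changes, which is the artifact that the correction methods compared in this repository attempt to remove.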

Repository Organization

The repository is organized into ten folders, each corresponding to a specific figure or analysis in the manuscript.

Note: Each folder contains a dedicated README.md file with detailed information regarding specific variable mappings, column descriptions for the data files, and precise compilation instructions for that analysis.

File: Repository_Folders.zip

Each folder includes a README file that provides a description of all files within the folder.

Folder Name // Description

Fig1_Structural_Null // Simulation of the blood pressure model under a null biological effect.

Fig2_Theoretical_Analysis // Deterministic calculations of expected slope bias across noise gradients.

Fig3_Sampling_Distributions // Monte Carlo comparison of four different slope estimators.

Fig4_Bootstrap_Testing // Bootstrap procedure for testing the null structural hypothesis.

Fig5_A_carolinensis_Empirical // Application of the model to lizard heat tolerance plasticity.

Fig6_Telomere_Analysis // Comparative adjustments (Berry vs. Blomqvist) for telomere data.

Fig7_Telomere_Bootstrap // Bootstrap distributions for empirical telomere attrition slopes.

FigS1_MSE_Analysis // Large-scale simulation of Mean Squared Error and numerical instability.

FigS2_Profile_Likelihood // Likelihood analysis under varying assumptions of repeatability (R).

FigS3_Unknown_R_Analysis // Illustration of parameter non-identifiability and likelihood plateaus.

Data Dictionary (General Structure)

While specific details are provided within each folder, the data files generally follow these formats:

Empirical/Bivariate Data: Two-column space-delimited files where Column 1 is the initial measurement (x1) and Column 2 is the final measurement (x2).

Simulation Results: Multi-column files (3–5 columns) representing different statistical estimators (Crude, Berry, Blomqvist, and True).

Histogram/Profile Data: Two-column files where Column 1 is the bin value (e.g., Slope or β) and Column 2 is the density or log-likelihood value.
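As a rough illustration of how the two-column Empirical/Bivariate files could be read outside Fortran, the sketch below parses whitespace-delimited rows into x1 and x2 and computes the uncorrected slope of the change (x2 - x1) on x1. The function names and the inline sample rows are hypothetical and are not data from the repository. The sketch assumes the files contain no header row, and it assumes that the "Crude" estimator corresponds to this uncorrected slope; the per-folder README.md files are the authoritative description.

```python
from statistics import mean


def parse_pairs(lines):
    """Parse two-column, whitespace-delimited rows into (x1, x2) lists."""
    x1, x2 = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        x1.append(float(fields[0]))  # Column 1: initial measurement
        x2.append(float(fields[1]))  # Column 2: final measurement
    return x1, x2


def crude_slope(x1, x2):
    """Uncorrected least-squares slope of the change (x2 - x1) on x1."""
    change = [b - a for a, b in zip(x1, x2)]
    mean_x1, mean_change = mean(x1), mean(change)
    sxy = sum((a - mean_x1) * (c - mean_change) for a, c in zip(x1, change))
    sxx = sum((a - mean_x1) ** 2 for a in x1)
    return sxy / sxx


# Made-up rows for illustration; with a real file, pass an open file
# object instead, e.g. parse_pairs(open(path)) for a hypothetical path.
sample_rows = ["1.0 2.0", "2.0 2.5", "3.0 3.0"]
x1, x2 = parse_pairs(sample_rows)
print(crude_slope(x1, x2))  # -0.5 for these made-up rows
```

The same parser also works for the Histogram/Profile files, since they share the two-column layout, although the columns then hold bin values and densities rather than paired measurements.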

Software & Hardware Requirements

Compiler: GNU Fortran (gfortran)

Version Used: GNU Fortran (Homebrew GCC 15.1.0) 15.1.0

Language: Fortran 90/95

Testing Environment: The programs were compiled and executed on macOS (Apple Silicon), Linux (Ubuntu), and Windows (via MinGW). No external libraries are required beyond the standard gfortran runtime library.

Data Sources

The empirical analyses utilize datasets from the following sources:

Lizard (Anolis) Data: https://doi.org/10.6084/m9.figshare.c.4955858.v4

Telomere (Blue Tit) Data: https://doi.org/10.5061/dryad.m073h5q

Usage Instructions

To reproduce an analysis:

1. Navigate to the folder corresponding to the specific figure.

2. Consult the internal README.md for variable definitions.

3. Compile the source code using gfortran [filename].f90 -o [outputname].