Overcoming the pitfalls of categorizing continuous variables in ecology and evolutionary biology

Cite this dataset

Beltran, Roxanne; Tarwater, Corey (2023). Overcoming the pitfalls of categorizing continuous variables in ecology and evolutionary biology [Dataset]. Dryad. https://doi.org/10.5061/dryad.5x69p8d9r

Abstract

  1. Many metrics in biological research – from body size to life history timing to environmental metrics – are measured continuously (e.g., body size in grams) but analyzed as categories (e.g., large versus small). The pitfalls of categorization are well-recognized in statistics, but many scientists in the fields of ecology, evolution, and behavior may not be aware of this literature. These fields lack a review of common examples and feasible solutions to avoid the hazards of categorizing continuous data. 
  2. Our goal was to summarize current practices of categorizing continuous predictors in ecology and evolutionary biology and provide guidance for overcoming those pitfalls. We conducted a mini-review of 72 recent publications in six popular journals to quantify the prevalence of categorization. We then summarized commonly categorized metrics and simulated a dataset to demonstrate the drawbacks of categorization using common metrics and realistic examples from ecology and evolutionary biology. 
  3. We show that categorizing continuous variables is common (31% of publications reviewed), especially in the animal behavior field, and underscore that predictor variables – including abiotic, morphological, physiological, behavioral, and demographic metrics – can and should be collected and analyzed continuously. Our analysis of the simulated field dataset demonstrates how categorizing continuous variables can lower statistical power and change interpretation, especially when arbitrary breakpoints are used. Finally, we provide recommendations on how to keep variables continuous throughout the entire scientific process. 
  4. Together, these pieces comprise an actionable guide to increasing statistical power and facilitating large synthesis studies by simply leaving continuous variables alone. Overcoming the pitfalls of categorizing continuous variables will allow ecologists and evolutionary biologists to continue making trustworthy conclusions about natural processes, along with predictions about their responses to climate change and other environmental contexts. We hope that this manuscript and its associated code will provide a useful lab practical for students and teachers to develop programming skills including data simulation, plotting, and model comparisons, as well as research skills including reporting and interpretation.

README: Overcoming the pitfalls of categorizing continuous variables in ecology and evolutionary biology

https://doi.org/10.5061/dryad.5x69p8d9r

We simulated data to quantify the detrimental impact of categorizing continuous variables using various statistical breakpoints and sample sizes (details below). To give the example biological relevance, we created a dataset that illustrates the complexity of life history theory and climate change impacts, and that contains a predictor variable that is frequently categorized (Table 2): reproductive timing in one year and its effect on body size in the following year. A reasonable research question would be: How does timing of reproduction in year t influence body mass at the start of the breeding season in year t+1? For illustrative purposes, let’s say we collected data from individually banded penguins in Antarctica. Based on the mechanistic relationships between seasonally available sea ice and food availability, we hypothesize that late reproductive timing could negatively impact the ability of penguins to grow larger before the next breeding season. Let’s say we wander around the penguin colony recording the initiation date of the first nest of each banded penguin (reproductive timing, measured as “day of year”, continuous), and then return to Antarctica the following year to weigh those same penguins using a platform scale (body mass, kilograms, continuous). The data have a Gamma distribution with a long tail because there are only a few very small breeding penguins. With these data, we describe simulations that answer two questions: how does the relationship between reproductive timing and body mass change if (1) the reproductive timing data are categorized using different breakpoints, and (2) the dataset contains different sample sizes?
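
The simulate-then-categorize idea described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library; it is not the archived R code. The day-of-year range, the slope, the Gamma shape, the day-310 cutoff, and the function names `simulate_penguins` and `categorize` are all hypothetical choices made for this example.

```python
import bisect
import math
import random
import statistics


def simulate_penguins(n, seed=1):
    """Return n (repro_time, body_size) pairs. All parameter values are hypothetical."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        repro_time = rng.uniform(300, 340)  # day of year (illustrative range)
        # Log-scale relationship: later timing lowers the expected body mass (kg).
        mean_mass = math.exp(1.6 - 0.005 * (repro_time - 300))
        shape = 50.0  # hypothetical Gamma shape parameter
        # gammavariate(alpha, beta) has mean alpha * beta, so the mean is mean_mass.
        body_size = rng.gammavariate(shape, mean_mass / shape)
        data.append((repro_time, body_size))
    return data


def categorize(values, breakpoints, labels):
    """Bin values using ascending breakpoints; a value equal to a breakpoint joins the lower bin."""
    if len(labels) != len(breakpoints) + 1:
        raise ValueError("need exactly one more label than breakpoints")
    return [labels[bisect.bisect_left(breakpoints, v)] for v in values]


data = simulate_penguins(200)
times = [t for t, _ in data]

# Three ways of cutting the same continuous predictor.
median_split = categorize(times, [statistics.median(times)], ["early", "late"])
terciles = categorize(times, statistics.quantiles(times, n=3), ["early", "mid", "late"])
arbitrary_split = categorize(times, [310], ["early", "late"])  # arbitrary cutoff at day 310

for name, groups in (("median", median_split), ("terciles", terciles), ("day 310", arbitrary_split)):
    print(name, {label: groups.count(label) for label in sorted(set(groups))})
```

The same penguins land in different groups depending on the breakpoint, which is why group comparisons can change with the cutoff while the continuous predictor stays the same.
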
We used R version 4.3.1 (R Core Team, 2022) for all analyses, including the glm() function (Gamma, link = "log") for model fitting, the AIC() function for Akaike Information Criterion model comparisons (Akaike, 1998), the emmeans() function to calculate 95% confidence intervals (±1.96*SE, back-transformed from the log scale), and the cld() function to create compact letter displays. We calculated R2 using McFadden's pseudo-R2 (McFadden, 1973).
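
For readers who want to see the arithmetic behind these summary statistics, the sketch below writes out the standard formulas in Python. It assumes the log-likelihoods, estimates, and standard errors have already been obtained from fitted models; the function names and the numbers in the example calls are hypothetical and are not results from this dataset.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower values indicate more support."""
    return 2 * n_params - 2 * log_likelihood


def mcfadden_pseudo_r2(log_likelihood_model, log_likelihood_null):
    """McFadden's pseudo-R2: 1 - ln(L_model) / ln(L_null), with the null being intercept-only."""
    return 1 - log_likelihood_model / log_likelihood_null


def back_transformed_ci(estimate_log, se_log, z=1.96):
    """Build the interval on the log scale, then exponentiate back to the response scale.

    Returns (lower, estimate, upper). The interval is asymmetric around the
    estimate on the response scale because exp() is nonlinear.
    """
    return (
        math.exp(estimate_log - z * se_log),
        math.exp(estimate_log),
        math.exp(estimate_log + z * se_log),
    )


# Hypothetical inputs, for illustration only.
print(aic(log_likelihood=-100.0, n_params=3))
print(mcfadden_pseudo_r2(log_likelihood_model=-90.0, log_likelihood_null=-100.0))
print(back_transformed_ci(estimate_log=1.5, se_log=0.05))
```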

Description of the data and file structure

The code files simulate the data: BodySize (body mass, in kilograms) and ReproTime (reproductive timing, as day of year).

Code/Software

Code to reproduce data, analyses, and figures, to be opened in R. The "CatVsCont Fake Data Categorization Methodsv7.R" file simulates datasets to test for the influence of different categorization methods on results. The "CatVsCont Fake Data Sample Sizev7.R" file simulates datasets to test for the influence of sample size on results.
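
To give a feel for what a sample-size simulation of this kind involves, here is a minimal Python sketch using only the standard library. It does not reproduce the archived R scripts. It assumes a hand-written Gamma log-link fit (iteratively reweighted least squares with one predictor), a normal-approximation test (|estimate/SE| > 1.96), and hypothetical values for the slope, Gamma shape, timing range, and number of replicates; `fit_gamma_log` and `power_by_sample_size` are names invented for this example.

```python
import math
import random
import statistics


def fit_gamma_log(x, y, iterations=10):
    """Fit log(E[y]) = b0 + b1 * x by iteratively reweighted least squares.

    With a Gamma response and a log link the IRLS weights are all equal, so
    each iteration is an ordinary least-squares fit to the working response.
    Returns (b0, b1, standard error of b1).
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    eta = [math.log(yi) for yi in y]  # start from the observed values
    b0 = b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]  # working response
        z_bar = sum(z) / n
        b1 = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
        b0 = z_bar - b1 * x_bar
        eta = [b0 + b1 * xi for xi in x]
    mu = [math.exp(e) for e in eta]
    # Dispersion from the Pearson statistic on n - 2 residual degrees of freedom.
    phi = sum(((yi - m) / m) ** 2 for yi, m in zip(y, mu)) / (n - 2)
    return b0, b1, math.sqrt(phi / sxx)


def power_by_sample_size(sample_sizes, replicates=100, slope=-0.005, seed=1):
    """Share of simulated datasets in which the timing effect is detected.

    Each dataset is analyzed twice: with timing kept continuous, and with
    timing reduced to an early/late median split. All values are hypothetical.
    """
    rng = random.Random(seed)
    shape = 50.0  # hypothetical Gamma shape parameter
    results = {}
    for n in sample_sizes:
        hits = {"continuous": 0, "median_split": 0}
        for _ in range(replicates):
            times = [rng.uniform(0, 40) for _ in range(n)]  # days after the first nest
            mass = [rng.gammavariate(shape, math.exp(1.6 + slope * t) / shape) for t in times]
            cutoff = statistics.median(times)
            late = [1.0 if t > cutoff else 0.0 for t in times]
            for key, predictor in (("continuous", times), ("median_split", late)):
                _, b1, se = fit_gamma_log(predictor, mass)
                if se > 0 and abs(b1 / se) > 1.96:
                    hits[key] += 1
        results[n] = {key: count / replicates for key, count in hits.items()}
    return results


power = power_by_sample_size([20, 50, 100])
for n, shares in power.items():
    print(n, shares)
```

Because both analyses run on the same simulated datasets, any difference in detection rate between the two columns comes from the categorization alone rather than from sampling differences.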

Funding

National Science Foundation