Dryad

specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

Data files

Nov 04, 2025 version files 241.40 MB


Abstract

Developing species distribution models (SDMs) requires high-quality species occurrence records. These records come from varied sources with different sampling procedures and are often archived in open-access databases, which makes automated data quality checks essential. Temporal, geographic, and taxonomic quality checks are usually conducted in SDM workflows, but checking for records that are distant in environmental space, i.e., outliers, is often ignored. Here, we present specleanr, an R package that contains 20 outlier detection methods (ODMs) that can be ensembled to identify potential outliers in environmental predictors. These methods are categorized into (i) species-specific ecological range, (ii) univariate, and (iii) multivariate ODMs. All potential outliers flagged by the different methods are pooled to identify absolute outliers (records flagged by multiple methods). The local regression (LOESS) method is then used to automatically set a threshold that optimally identifies the absolute outliers. Records can also be clustered into poor, fair, moderate, very strong, and perfect outliers, and non-outliers, based on each record's likelihood of being a potential outlier, which allows expert assessment. We demonstrated the approach on 15 fish species from the Danube River Basin, including native, alien, threatened, and common species. We fitted SDMs using bioclimatic and hydromorphological parameters. We compared the model Area Under the Curve (AUC) before and after outlier removal using three scenarios: (1) the LOESS method, (2) removing very strong outliers, and (3) removing perfect outliers. The results showed a significant improvement in the model AUC, with generally small to moderate effect sizes, after outlier removal. specleanr is generalizable across taxonomic groups, data types, ecological realms, and geographic regions. Beyond SDMs, it can also be used in general data analysis where outlier detection is essential.
We provide vignettes to support use of the package. specleanr offers a user-friendly and reproducible approach for handling outliers in biogeographical modeling and general data analysis workflows.
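
The pooling idea described in the abstract (flag records with several methods, then keep those flagged by many of them) can be illustrated with a minimal sketch. This is not the specleanr API and is written in Python rather than R: the function names, the three univariate methods chosen (z-score, interquartile range, median absolute deviation), their cutoffs, the fixed 0.6 agreement threshold, and the sample temperatures are all assumptions made for illustration. In particular, the fixed threshold stands in for the LOESS-based threshold selection, which is not reproduced here.

```python
from statistics import mean, median, quantiles, stdev


def zscore_flags(values, cutoff=2.5):
    """Flag values more than `cutoff` sample standard deviations from the mean."""
    mu, sd = mean(values), stdev(values)
    return [abs(v - mu) / sd > cutoff for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def mad_flags(values, cutoff=3.0):
    """Flag values whose scaled distance from the median exceeds `cutoff`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: this method cannot rank records.
        return [False] * len(values)
    return [abs(v - med) / (1.4826 * mad) > cutoff for v in values]


def pool_flags(values, methods):
    """Return, per record, the share of methods that flagged it."""
    flag_sets = [method(values) for method in methods]
    return [
        sum(flags[i] for flags in flag_sets) / len(methods)
        for i in range(len(values))
    ]


def absolute_outliers(values, methods, threshold=0.6):
    """Return indices of records flagged by at least `threshold` of the methods.

    The fixed threshold is an illustrative assumption, not the LOESS-based
    threshold described in the abstract.
    """
    shares = pool_flags(values, methods)
    return [i for i, share in enumerate(shares) if share >= threshold]


# Hypothetical annual mean temperatures (degrees C) at ten occurrence records.
temperatures = [11.2, 11.8, 12.1, 12.4, 12.6, 12.9, 13.1, 13.4, 13.8, 29.5]
METHODS = [zscore_flags, iqr_flags, mad_flags]

print(absolute_outliers(temperatures, METHODS))  # [9]
```

Here only the last record is flagged by all three methods, so it is the single record returned. Working with the share of methods in agreement, rather than any one method's verdict, is also what would make a graded classification of records possible.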