specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

Basooma, Anthony1,2,3; Schmidt-Kloiber, Astrid1; Domisch, Sami4; Torres-Cambas, Yusdiel4; Smederevac-Lalić, Marija5; Bremerich, Vanessa4; Tschikof, Martin1; Meulenbroek, Paul1; Funk, Andrea1; Hein, Thomas1; Borgwardt, Florian1

Research facility: BOKU University

Published Nov 04, 2025 on Dryad. https://doi.org/10.5061/dryad.6m905qgd7

Data files

Nov 04, 2025 version files 241.40 MB

data_links.docx (14.81 KB)
modeloutput.RData (120.63 MB)
modeloutput2.RData (120.51 MB)
README.md (5.23 KB)
sdm_functions.R (7.47 KB)
sdmodeling.R (16.41 KB)
speciec_model_predictions_output.zip (219.22 KB)

Abstract

Developing species distribution models (SDMs) requires high-quality species occurrence records. These records, stemming from varying sources with different sampling procedures, are often archived in open-access databases, which makes automated data quality checks essential. Temporal, geographic, and taxonomic quality checks are usually conducted in SDM workflows, but checking for records distant in environmental space, i.e., outliers, is often ignored. Here, we present specleanr, an R package that contains 20 outlier detection methods (ODMs) that can be ensembled to identify potential outliers in environmental predictors. These methods are categorized into (i) species-specific ecological range, (ii) univariate, and (iii) multivariate ODMs. All potential outliers flagged by the different methods are pooled to identify absolute outliers (records flagged by multiple methods). The local regression (LOESS) method is then used to automatically set a threshold that optimally identifies the absolute outliers. Records can also be clustered into poor, fair, moderate, very strong, and perfect outliers, and non-outliers, based on each record's likelihood of being a potential outlier, which supports expert assessment. We demonstrated the approach on 15 fish species from the Danube River Basin, including native, alien, threatened, and common species. We fitted SDMs using bioclimatic and hydromorphological parameters. We compared the model Area Under the Curve (AUC) before and after outlier removal using three scenarios: (1) the LOESS method, (2) removing very strong outliers, and (3) removing perfect outliers. The results showed a significant improvement in the model AUC with generally small to moderate effect sizes after outlier removal. specleanr is generalizable across taxonomic groups, data types, ecological realms, and geographic regions. Beyond SDM, it can also be broadly used in general data analysis where outlier detection is essential. We provide vignettes to support use of the package. specleanr offers a user-friendly and reproducible approach for handling outliers in biogeographical modeling and general data analysis workflows.
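
The pooling idea described in the abstract can be illustrated with a minimal base-R sketch. It is not the specleanr API: the three detector functions, the toy predictor `temp`, and the fixed cut-off `threshold` are all assumptions made for illustration, and the package itself derives its threshold with LOESS rather than a fixed value.

```r
# Illustrative only: base-R stand-ins, not specleanr functions.
# Hypothetical predictor: 50 plausible values plus two planted extremes.
temp <- c(seq(10, 14, length.out = 50), 30, -5)

# Three simple univariate detectors, each returning one logical flag per record.
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}
flag_zscore <- function(x, cutoff = 3) abs(x - mean(x)) / sd(x) > cutoff
flag_mad    <- function(x, cutoff = 3) abs(x - median(x)) / mad(x) > cutoff

# Pool the flags: one row per record, one column per method.
flags <- cbind(iqr = flag_iqr(temp), zscore = flag_zscore(temp), mad = flag_mad(temp))

# Share of methods that flag each record.
share <- rowMeans(flags)

# Hypothetical fixed cut-off (assumption); records flagged by at least
# this share of methods are treated as absolute outliers.
threshold <- 0.6
absolute_outliers <- which(share >= threshold)
absolute_outliers
```

Records flagged by only one method stay below the cut-off, which is what makes the pooled decision more conservative than any single detector.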

Dataset DOI: 10.5061/dryad.6m905qgd7

Description of the data and file structure

The files include species occurrences from the Global Biodiversity Information Facility (GBIF). Refer to the data links file (data_links.docx) to access the original data.
Environmental data were retrieved from CHELSA and Hydrography90m: the bioclimatic variables BIO1 to BIO19 from CHELSA, and cti, orderstrahler, slopecurvdw_cel, accumulation, spi, sti, and subcatchment from Hydrography90m. The data links file lists the URLs of the original datasets.
Model outputs were packaged after model implementation as modeloutput.RData and modeloutput2.RData.
The SDM functions are implemented in sdm_functions.R.
sdmodeling.R is the script that processes all files.
Species predictions are archived in speciec_model_predictions_output.zip.

Files and variables

File: sdmodeling.R

Description: Model implementation file. All analysis codes are archived in this file.

File: speciec_model_predictions_output.zip

Description: Contains the species prediction output for each of the 15 species tested with this package. The following parameters were computed.

1. criteria

The metric used to select the optimal model or threshold. Here max(se+sp), the maximum of the sum of sensitivity and specificity, was used to select the optimal model.

2. threshold

The cut-off value used to convert predicted probabilities into binary outcomes (presence/absence or yes/no).

3. sensitivity (true positive rate)

Proportion of actual positives correctly identified by the model.

4. specificity (true negative rate)

Proportion of actual negatives correctly identified.

5. TSS (True Skill Statistic)

Combines sensitivity and specificity (sensitivity + specificity - 1) to evaluate model performance.

6. MCC (Matthews Correlation Coefficient)

Measures the quality of binary classifications considering all four confusion matrix categories (TP, TN, FP, FN).

7. F1

Harmonic mean of precision (PPV) and recall (sensitivity). Balances false positives and false negatives.

8. Kappa

Measures agreement between observed and predicted values, correcting for chance agreement. Range: -1 to 1, higher values = better agreement.

9. NMI (Normalized Mutual Information)

Measures similarity between predicted and actual class distributions. Used in clustering or classification.

10. phi (Φ coefficient)

Correlation coefficient for binary variables (similar to MCC). Measures the association between predicted and actual classes.

11. ppv (Positive Predictive Value / Precision)

Proportion of positive predictions that are correct.

12. npv (Negative Predictive Value)

Proportion of negative predictions that are correct.

13. ccr (Correct Classification Rate / Accuracy)

Proportion of all correct predictions.

14. mcr (Misclassification Rate)

Proportion of incorrect predictions.

15. omission

Fraction of actual positives missed by the model (false negatives).

16. commission

Fraction of false positives predicted by the model.

17. prevalence

Proportion of actual positives in the dataset.

18. auc (Area Under the Curve)

Area under the ROC curve (plot of sensitivity vs 1-specificity). Measures overall model discrimination ability. Range: 0.5 = random, 1 = perfect.

19. deviance

Often used in statistical models (e.g., GLMs) as a measure of model fit. Lower deviance → better fit.

Model scenarios

20. runs

The number of model iterations or replications used to evaluate stability.

21. modelused

Indicates which algorithm or model was applied, including BIOCLIM, SVM, and Random Forest.

22. scenario

The context or dataset under which the model is evaluated (strong outliers removed, perfect outliers removed, or the LOESS method).
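
As a worked illustration of several metrics listed above, the following base-R sketch picks a threshold by max(se+sp) and derives the threshold-based statistics from the resulting confusion matrix. The observations and predicted probabilities are made-up values, the helper `confusion()` is hypothetical, and the formulas are the standard textbook definitions; the archived outputs may have been computed with different conventions, so treat this as a sketch rather than the authors' procedure. NMI and commission are left out because their definitions vary between implementations.

```r
# Hypothetical observed presences (1) / absences (0) and predicted probabilities.
obs  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred <- c(0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1)

# Confusion-matrix counts at a given threshold (hypothetical helper).
confusion <- function(obs, pred, th) {
  yes <- pred >= th
  c(TP = sum(yes & obs == 1),  FP = sum(yes & obs == 0),
    TN = sum(!yes & obs == 0), FN = sum(!yes & obs == 1)) + 0  # "+ 0" coerces counts to double
}

# criteria: choose the threshold that maximises sensitivity + specificity.
cands <- sort(unique(pred))
se_sp <- sapply(cands, function(th) {
  m <- confusion(obs, pred, th)
  m[["TP"]] / (m[["TP"]] + m[["FN"]]) + m[["TN"]] / (m[["TN"]] + m[["FP"]])
})
threshold <- cands[which.max(se_sp)]

m  <- confusion(obs, pred, threshold)
TP <- m[["TP"]]; FP <- m[["FP"]]; TN <- m[["TN"]]; FN <- m[["FN"]]
n  <- TP + FP + TN + FN

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)
ccr         <- (TP + TN) / n
p_chance    <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2

metrics <- c(
  threshold   = threshold,
  sensitivity = sensitivity,
  specificity = specificity,
  TSS         = sensitivity + specificity - 1,
  MCC         = (TP * TN - FP * FN) /
                (sqrt(TP + FP) * sqrt(TP + FN) * sqrt(TN + FP) * sqrt(TN + FN)),
  F1          = 2 * ppv * sensitivity / (ppv + sensitivity),
  Kappa       = (ccr - p_chance) / (1 - p_chance),
  ppv         = ppv,
  npv         = npv,
  ccr         = ccr,
  mcr         = 1 - ccr,
  omission    = FN / (TP + FN),
  prevalence  = (TP + FN) / n
)

# auc: rank-based (Mann-Whitney) estimate, independent of any threshold.
n1  <- sum(obs == 1)
n0  <- sum(obs == 0)
auc <- (sum(rank(pred)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(metrics, auc = auc), 3)
```

For binary data the phi coefficient equals MCC, which is why it is not computed separately here.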

File: sdm_functions.R

Description: Model functions to compile models.

File: modeloutput.RData

Description: Model prediction output files packaged.

File: modeloutput2.RData

Description: Model prediction output files packaged.
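
Because the two .RData files are large and the names of the objects they contain are not listed here, it can help to load them into a separate environment and inspect what they hold before using them. The sketch below shows one way to do this; the helper `inspect_rdata()` is hypothetical, and the demonstration uses a small temporary file so that it runs without the dataset.

```r
# Hypothetical helper: list the objects stored in an .RData file
# without overwriting anything in the global environment.
inspect_rdata <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the names of the restored objects
  data.frame(
    object = loaded,
    class  = vapply(loaded, function(nm) class(get(nm, envir = env))[1],
                    character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a small temporary file (stand-in for the real outputs).
tmp <- tempfile(fileext = ".RData")
demo_a <- 1:3
demo_b <- data.frame(x = 1)
save(demo_a, demo_b, file = tmp)

summary_tbl <- inspect_rdata(tmp)
summary_tbl

# With the dataset downloaded to the working directory (assumption):
# inspect_rdata("modeloutput.RData")
# inspect_rdata("modeloutput2.RData")
```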

File: data_links.docx

Description: Contains URL links to the original data, which can be loaded in the R scripts to conduct the analysis.

Code/software

The package was developed in R.

Access information

Other publicly accessible locations of the data:

Figshare: 10.6084/m9.figshare.29126783
Zenodo: https://doi.org/10.5281/zenodo.17076781

Data was derived from the following sources:

GBIF data

GBIF.org (26 October 2024) GBIF Occurrence Download https://doi.org/10.15468/dl.9x7vqv

CHELSA data

https://envicloud.wsl.ch/#/?bucket=https://os.zhdk.cloud.switch.ch/chelsav2/&prefix=GLOBAL/climatologies/2011-2040/IPSL-CM6A-LR/

Hydrography90m data

https://public.igb-berlin.de/index.php/s/agciopgzXjWswF4/download