specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows
Data files
Nov 04, 2025 version files 241.40 MB
- data_links.docx (14.81 KB)
- modeloutput.RData (120.63 MB)
- modeloutput2.RData (120.51 MB)
- README.md (5.23 KB)
- sdm_functions.R (7.47 KB)
- sdmodeling.R (16.41 KB)
- speciec_model_predictions_output.zip (219.22 KB)
Abstract
Dataset DOI: 10.5061/dryad.6m905qgd7
Description of the data and file structure
- The files include species occurrences from the Global Biodiversity Information Facility. Refer to the data links file to access the original data.
- Environmental data were retrieved from CHELSA and Hydrography90m. These comprise the bioclimatic variables BIO1 to BIO19 from CHELSA and cti, orderstrahler, slopecurvdw_cel, accumulation, spi, sti, and subcatchment from Hydrography90m. The data links file gives the URLs of the original datasets.
- Model outputs are the results packaged after model implementation, stored in modeloutput.RData and modeloutput2.RData.
- The SDM functions are implemented in sdm_functions.R.
- sdmodeling.R is the script that processes all files.
- Species predictions are archived in speciec_model_predictions_output.zip.
Files and variables
File: sdmodeling.R
Description: Model implementation file. All analysis code is archived in this file.
File: speciec_model_predictions_output.zip
Description: Contains the species prediction output for each of the 15 species tested with this package. The following parameters were computed.
1. criteria
The criterion used to select the optimal model or threshold. Here max(se+sp), the maximum sum of sensitivity and specificity, was used.
2. threshold
The cut-off value used to convert predicted probabilities into binary outcomes (presence/absence or yes/no).
3. sensitivity (true positive rate)
Proportion of actual positives correctly identified by the model.
4. specificity (true negative rate)
Proportion of actual negatives correctly identified.
5. TSS (True Skill Statistic)
Combines sensitivity and specificity (sensitivity + specificity - 1) to evaluate model performance.
6. MCC (Matthews Correlation Coefficient)
Measures the quality of binary classifications considering all four confusion matrix categories (TP, TN, FP, FN).
7. F1
Harmonic mean of precision (PPV) and recall (sensitivity). Balances false positives and false negatives.
8. Kappa
Measures agreement between observed and predicted values, correcting for chance agreement. Range: -1 to 1, higher values = better agreement.
9. NMI (Normalized Mutual Information)
Measures similarity between predicted and actual class distributions. Used in clustering or classification.
10. phi (Φ coefficient)
Correlation coefficient for binary variables (similar to MCC). Measures the association between predicted and actual classes.
11. ppv (Positive Predictive Value / Precision)
Proportion of positive predictions that are correct.
12. npv (Negative Predictive Value)
Proportion of negative predictions that are correct.
13. ccr (Correct Classification Rate / Accuracy)
Proportion of all correct predictions.
14. mcr (Misclassification Rate)
Proportion of incorrect predictions.
15. omission
Fraction of actual positives missed by the model (false negatives).
16. commission
Fraction of false positives predicted by the model.
17. prevalence
Proportion of actual positives in the dataset.
18. auc (Area Under the Curve)
Area under the ROC curve (plot of sensitivity vs 1-specificity). Measures overall model discrimination ability. Range: 0.5 = random, 1 = perfect.
19. deviance
Often used in statistical models (e.g., GLMs) as a measure of model fit. Lower deviance → better fit.
Model scenarios
20. runs
The number of model iterations or replications used to evaluate stability.
21. modelused
Indicates which algorithm or model was applied, including BIOCLIM, SVM, and Random Forest.
22. scenario
The context or dataset under which the model is evaluated (strong outliers removed, perfect outliers removed, and using the loess method).
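Several of the parameters above follow directly from a 2x2 confusion matrix. The sketch below is illustrative only: `confusion_metrics`, `best_threshold`, `rank_auc`, and the toy data are hypothetical names and values, not taken from sdm_functions.R, and the formulas are the common textbook definitions (omission as the false negative rate, commission as the false positive rate), which may differ from the convention used to produce the archived values.

```r
# Hypothetical helper: confusion-matrix metrics at a single threshold.
# `obs` is observed presence (1) / absence (0); `prob` is predicted suitability.
confusion_metrics <- function(obs, prob, threshold) {
  pred <- as.integer(prob >= threshold)
  # Counts are stored as doubles so the products below cannot overflow
  tp <- as.numeric(sum(pred == 1 & obs == 1))
  tn <- as.numeric(sum(pred == 0 & obs == 0))
  fp <- as.numeric(sum(pred == 1 & obs == 0))
  fn <- as.numeric(sum(pred == 0 & obs == 1))
  n  <- tp + tn + fp + fn

  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  ppv <- tp / (tp + fp)
  npv <- tn / (tn + fn)
  ccr <- (tp + tn) / n
  # Agreement expected by chance, used by Cohen's kappa
  pe <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2

  c(threshold = threshold,
    sensitivity = sensitivity,
    specificity = specificity,
    TSS = sensitivity + specificity - 1,
    MCC = (tp * tn - fp * fn) /
      sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    F1 = 2 * ppv * sensitivity / (ppv + sensitivity),
    Kappa = (ccr - pe) / (1 - pe),
    ppv = ppv, npv = npv,
    ccr = ccr, mcr = 1 - ccr,
    omission = fn / (tp + fn),      # assumed: false negative rate
    commission = fp / (fp + tn),    # assumed: false positive rate
    prevalence = (tp + fn) / n)
}

# Hypothetical helper: pick the threshold that maximizes se + sp,
# trying every distinct predicted value as a candidate cut-off.
best_threshold <- function(obs, prob, candidates = sort(unique(prob))) {
  scores <- vapply(candidates, function(t) {
    m <- confusion_metrics(obs, prob, t)
    m[["sensitivity"]] + m[["specificity"]]
  }, numeric(1))
  candidates[which.max(scores)]
}

# Hypothetical helper: rank-based (Mann-Whitney) estimate of the AUC.
rank_auc <- function(obs, prob) {
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(rank(prob)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (made up for illustration): five presences, five absences
obs  <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
prob <- c(0.9, 0.8, 0.7, 0.55, 0.45, 0.6, 0.4, 0.2, 0.1, 0.05)

best <- best_threshold(obs, prob)
m <- confusion_metrics(obs, prob, best)
print(round(m, 3))
print(rank_auc(obs, prob))
```

On this toy data the max(se+sp) rule selects a threshold of 0.45. Extreme thresholds can produce NaN for ppv or npv when a predicted class is empty, so real data may need a guard for that case.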
File: sdm_functions.R
Description: Model functions to compile models.
File: modeloutput.RData
Description: Packaged model prediction outputs.
File: modeloutput2.RData
Description: Packaged model prediction outputs.
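Because each .RData file is roughly 120 MB and the names of the objects it contains are not listed here, it can help to load it into a separate environment and inspect it before use. The following is a minimal sketch; `load_archive` is a hypothetical helper, and the file is assumed to sit in the working directory.

```r
# Hypothetical helper: load an .RData file into its own environment so the
# archived objects do not overwrite anything in the global workspace.
load_archive <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env(parent = emptyenv())
  loaded <- load(path, envir = env)  # load() returns the loaded object names
  message("Loaded objects: ", paste(loaded, collapse = ", "))
  env
}

# Usage, assuming modeloutput.RData has been downloaded to the working directory
if (file.exists("modeloutput.RData")) {
  out <- load_archive("modeloutput.RData")
  print(ls(out))  # names of the archived objects
}
```

Individual objects can then be retrieved with `get("name", envir = out)` once their names are known.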
File: data_links.docx
Description: Contains URL links to the original data, which can be loaded in the R scripts to conduct the analysis.
Code/software
The package was developed in R.
Access information
Other publicly accessible locations of the data:
- Figshare: 10.6084/m9.figshare.29126783
- Zenodo: https://doi.org/10.5281/zenodo.17076781
Data was derived from the following sources:
- GBIF data: GBIF.org (26 October 2024) GBIF Occurrence Download https://doi.org/10.15468/dl.9x7vqv
- CHELSA data
- Hydrography90m data: https://public.igb-berlin.de/index.php/s/agciopgzXjWswF4/download
