Calibration of probability predictions from machine-learning and statistical models
Dormann, Carsten (2020), Calibration of probability predictions from machine-learning and statistical models, Dryad, Dataset, https://doi.org/10.5061/dryad.xksn02vbq
This data set describes the occurrence (yes/no) of a bird, the Southern Whiteface (Aphelocephala leucopsis), in Australia. A suite of environmental variables is provided, which are used in the paper to illustrate a statistical problem. The data are meant to allow reproduction of the analysis in this paper. They are not intended for actual ecological analysis. The data are provided as an .Rdata file, i.e. as an R dataset (described technically here: https://www.loc.gov/preservation/digital/formats/fdd/fdd000470.shtml).
Here is the paper's abstract:
Aim: Predictions from statistical models may be uncalibrated, meaning that the predicted values do not have the nominal coverage probability. This is most easily seen with probability predictions in machine-learning classification, including the common species occurrence probabilities. Here, a predicted probability of, say, 0.7 should indicate that out of 100 cases with these environmental conditions, and hence the same predicted probability, the species should be present in 70 and absent in 30.
Innovation: A simple calibration plot shows that this is not necessarily the case, particularly not for over-fitted models or algorithms that use non-likelihood target functions. As a consequence, “raw” predictions from such models could easily be off by 0.2, are unsuitable for averaging across model types, and the resulting maps may hence be substantially distorted. The solution, a flexible calibration regression, is simple and can be applied whenever deviations are observed.
Conclusion: “Raw”, uncalibrated probability predictions should be calibrated before interpreting or averaging them in a probabilistic way.
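To make the idea in the abstract concrete, here is a minimal sketch in Python using only the standard library. It does not use this data set and is not the code from the paper. It simulates over-confident predictions (an assumption standing in for an over-fitted model), compares mean predicted probability with observed frequency in probability bins, and then recalibrates with a logistic regression of the observations on the logit of the raw predictions. This linear-logit form is a simplified stand-in for the flexible calibration regression mentioned in the abstract. All function names, the sample size, and the distortion factor of 2.5 are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 and 1
    return math.log(p / (1.0 - p))


def calibration_table(pred, obs, n_bins=5):
    """Per equal-width bin: (mean prediction, observed frequency, count)."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(pred, obs):
        i = min(int(p * n_bins), n_bins - 1)
        sums[i][0] += p
        sums[i][1] += y
        sums[i][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]


def max_gap(table):
    """Largest absolute difference between prediction and frequency."""
    return max(abs(mean_pred - freq) for mean_pred, freq, _ in table)


def fit_calibration(pred, obs, n_iter=25):
    """Fit obs ~ sigmoid(a + b * logit(pred)) by Newton's method."""
    x = [logit(p) for p in pred]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, obs):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w                 # entries of the 2 x 2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Simulated data (assumption): true occurrence probability is sigmoid(t),
# while the "model" reports the over-confident value sigmoid(2.5 * t).
rng = random.Random(1)
true_logit = [rng.uniform(-2.0, 2.0) for _ in range(5000)]
obs = [1 if rng.random() < sigmoid(t) else 0 for t in true_logit]
raw = [sigmoid(2.5 * t) for t in true_logit]

a, b = fit_calibration(raw, obs)
calibrated = [sigmoid(a + b * logit(p)) for p in raw]

gap_raw = max_gap(calibration_table(raw, obs))
gap_cal = max_gap(calibration_table(calibrated, obs))
print("intercept, slope:", a, b)
print("largest bin gap, raw vs calibrated:", gap_raw, gap_cal)
```

In this simulation a slope below 1 signals over-confident raw predictions. With real data, the calibration regression would normally be fitted on hold-out data rather than on the data used to train the model, and a flexible smoother could replace the linear term.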
The bird distribution data are from BirdLife International (http://datazone.birdlife.org/species/requestdis).
The climate data are from WorldClim (https://www.worldclim.org/).
The land cover data are from GLC 2000 (https://forobs.jrc.ec.europa.eu/products/glc2000/glc2000.php).
All data were reprojected to the Lambert Equal Area projection and gridded to 100 x 100 km cells.
This data set is a proof-of-principle data set, and its data are not meant for ecological analysis!
Neither the climate nor the land cover data were selected to be the most sensible for this species. The species, Southern Whiteface, was chosen at random and is not representative in any way. The purpose of this data set is to illustrate a statistical problem arising in a typical species distribution analysis.
The sole purpose of providing this data set is to make the analysis reproducible.