Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: a retrospective cohort study

Cite this dataset

Møller, Jens Kjølseth (2021). Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: a retrospective cohort study [Dataset]. Dryad.


Objective: The aim of the current study was to develop two predictive models, using data from the index admission as well as historic data on a patient, to predict the development of UTI at the time of entry to the hospital. 

Methods: A retrospective cohort analysis of approx. 300,000 adult admissions in a Danish region was performed. We developed models for UTI prediction with five machine-learning algorithms using demographic information, laboratory results, data on antibiotic treatment, past medical history (ICD-10 codes), and clinical data obtained by transforming unstructured narrative text in electronic medical records into structured data with natural language processing.

Results: The five machine-learning algorithms were evaluated using three performance measures: average squared error, cumulative lift, and area under the ROC curve (ROC-index). The ROC-index ranged from 0.82 to 0.84 for the entry model (T = 0 hours after admission).

Conclusion: The study is a proof of concept that a machine-learning model can serve as an early warning system to identify patients at risk of acquiring urinary tract infections during admission. The entry model achieves a high ROC-index, indicating sufficient sensitivity and specificity, which may make the model instrumental in individualized prevention of UTI in hospitalized patients. The favored machine-learning methodology is Decision Trees, because it gives the most transparent results and thereby supports clinical understanding and implementation of the model.


Retrospective cohort analysis of consecutive patient admissions in a Danish region over a 16-month period beginning January 2017. Data were obtained from the four public somatic hospital trusts in the Region of Southern Denmark and comprised 301,932 adult admissions. Data were extracted from a regional data warehouse containing 1) a copy of EMR notes from the four hospital trusts, covering a population of 1.2 million inhabitants; 2) administrative data on all hospital admissions; and 3) positive culture results from the four clinical microbiology laboratories in the region. The unique civil registration number used for all health contacts in Denmark enables linkage between the various public healthcare registries and the construction of a patient's medical history.

Data types (predictor variables) include dates and place of admission, demographics, historical diagnosis codes, data from an automatic electronic infection registry, and trigger-based text mining of the hospital electronic medical records (EMR). Risk factor variables were chosen based on a review of the literature. The target variable is any urinary tract infection (UTI) detected during admission. Urinary tract infection is predominantly diagnosed based on a combination of clinical features, culture of significant amounts of bacterial pathogens in urine, and relevant antimicrobial treatment. Information about UTIs is copied from HAIR, the automated electronic hospital infection monitoring system used in the Region of Southern Denmark.

The primary goal of our modelling approach is to find patterns and relations in the data between the dependent variable (UTI) and the independent variables (predictor variables or risk factors). For UTI prediction, we use five different machine-learning algorithms: Neural Networks, Gradient Boosting, Regression, Decision Tree 3 Way Split, and Decision Tree. All models use the binary target Any UTI (HA-UTI or CA-UTI): yes (= 1) versus no (= 0). The development data set was partitioned into 60% for training and 40% for validation. Assessment and comparison of models are based on average squared error, ROC-index, and cumulative lift. SAS Enterprise Miner 14.3 was used for model development. Variable selection was done using automated variable selection procedures, such as Stepwise Selection for Regression Models and Log-Worth for Decision Trees. For Neural Networks and Gradient Boosting, auto-tuning options were used.
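The partitioning and the three assessment measures named above can be illustrated with a small sketch. This is not the authors' code: the study used SAS Enterprise Miner 14.3, whereas the following is plain Python written for illustration. The function names, the simple random (non-stratified) split, the lift depth of 20%, and the toy labels and scores are all assumptions.

```python
import random


def split_60_40(rows, seed=0):
    """Shuffle rows and return (training, validation) in a 60/40 split.

    Assumption: a simple random split; the original tool may partition differently.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def average_squared_error(y_true, y_prob):
    """Mean of (observed 0/1 outcome - predicted probability) squared."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_index(y_true, y_prob):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def cumulative_lift(y_true, y_prob, depth=0.2):
    """Event rate among the top `depth` fraction of cases ranked by score,
    divided by the overall event rate."""
    ranked = sorted(zip(y_prob, y_true), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * depth))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(y_true) / len(y_true)
    return top_rate / overall_rate


# Toy data (hypothetical): 1 = Any UTI, 0 = no UTI, with made-up model scores.
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.2, 0.35, 0.05, 0.6]

train, valid = split_60_40(range(10))
print(len(train), len(valid))
print(average_squared_error(y_true, y_prob))
print(roc_index(y_true, y_prob))
print(cumulative_lift(y_true, y_prob, depth=0.2))
```

A lower average squared error and a higher ROC-index and cumulative lift indicate a better model. In practice the measures would be computed on the validation partition only.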

Usage notes

The variable admission_hospital_text (hospital id) is missing for 240 of the 301,932 admissions.
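Users who want to confirm the number of missing hospital ids before analysis could count them as in the sketch below. Only the column name `admission_hospital_text` comes from this record. The `admission_id` column, the sample rows, and the set of tokens treated as missing are hypothetical, since the file layout and missing-value encoding are not described here.

```python
import csv
import io

# Assumption: missing values appear as empty fields or one of these tokens.
MISSING_TOKENS = {"", "NA", "NULL"}


def count_missing(rows, column):
    """Return (missing, total) for `column` over an iterable of dict rows."""
    missing = 0
    total = 0
    for row in rows:
        total += 1
        value = (row.get(column) or "").strip()
        if value in MISSING_TOKENS:
            missing += 1
    return missing, total


# Hypothetical in-memory sample standing in for the real data file.
sample = io.StringIO(
    "admission_id,admission_hospital_text\n"
    "1,Hospital A\n"
    "2,\n"
    "3,Hospital B\n"
    "4,NA\n"
)
missing, total = count_missing(csv.DictReader(sample), "admission_hospital_text")
print(f"{missing} of {total} admissions lack a hospital id")
```

For the real data, the `io.StringIO` sample would be replaced by an opened file handle passed to `csv.DictReader`.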


Region of Southern Denmark Research Foundation, Award: 17/15659
