Descriptive statistics for: Applying a transformer architecture to intraoperative temporal dynamics improves the prediction of postoperative delirium
Data files
Jan 06, 2025 version files (164.28 KB total)
- README.md (5.69 KB)
- supplementary_data__1.xlsx (19.05 KB)
- supplementary_data__2.xlsx (16.17 KB)
- supplementary_data__3.xlsx (10.82 KB)
- supplementary_data__4.xlsx (67.59 KB)
- supplementary_data__5.xlsx (10.49 KB)
- supplementary_data__6.xlsx (9.98 KB)
- supplementary_data__8.xlsx (13.88 KB)
- supplementary_data__9.xlsx (10.61 KB)
Abstract
Background. Patients who experience postoperative delirium (POD) are at higher risk of poor outcomes such as dementia or death. Previous machine learning models predicting POD have mostly relied on time-aggregated features. We aimed to assess the potential of temporal patterns in clinical parameters during surgery for predicting POD.
Methods. Long short-term memory (LSTM) and transformer models, directly consuming time series, were compared to multi-layer perceptrons (MLPs) trained on time-aggregated features. We also fitted hybrid models, fusing either LSTM or transformer models with MLPs. Univariate Spearman's rank correlations and linear mixed-effects models established the importance of individual features, which we compared to the transformers' attention weights.
Results. The best performance was achieved by a transformer architecture ingesting 30 minutes of intraoperative parameter sequences. Systolic invasive blood pressure and administered opioids were the most important input variables, in line with the univariate feature importances.
Conclusion. Intraoperative temporal dynamics of clinical parameters, exploited by a transformer architecture named TRAPOD, are critical for the accurate prediction of POD.
README: Applying a transformer architecture to intraoperative temporal dynamics improves the prediction of postoperative delirium
https://doi.org/10.5061/dryad.bvq83bkhv
This dataset contains the supplementary files for the above-mentioned study, including table information that could not be displayed in the Appendix. It provides descriptive statistics about the raw data on which the prediction models were trained. We could not share raw patient data due to privacy concerns, but we provide comprehensive summary statistics in Tables 1 and 4 so that our results can be reproduced. Clinical context is provided in Tables 3 and 5, which describe clinical codes and data encodings. Further results regarding model performance and univariate feature importance are provided in Tables 8 and 9.
Table Overview
- Feature Descriptions
- Missingness Information
- Clinical Codings
- Train and Test Split Statistics
- Baseline Model Descriptions
- Hyperparameter Sets
- (TRIPOD Guidelines, not uploaded)
- Spearman's Rank Coefficients
- Evaluation Metrics
Table Descriptions and Use (supplementary_data__1 to supplementary_data__9)
- Features that were extracted from clinical information systems. We selected 148 of the 197 extracted features based on their availability during the intraoperative phase. To reproduce the analyses shown in the manuscript, a clinical institution should extract the related variables as described in this list, encode features as binary or numeric, and apply the defined valid ranges to clean the data.
- Missingness per feature. We calculated the fraction of patients with no recorded longitudinal data at all for a given feature (fraction of empty time series, FETS). In addition, we calculated the sparsity per time series and feature as the fraction of missing time points out of all time points; these values were averaged across patients. To use this information at another institution, FETS and sparsity should be calculated as described (a minimal computational sketch follows after this list).
- ICD and OPS code encodings for procedures and diagnoses. Because the data were collected retrospectively, comorbidities and additional symptoms or diagnoses could not be encoded directly but had to be extracted from the ICD documentation. The table shows the corresponding coding system, the variable we used as model input, and the codes assigned to the specific item. OPS codes are not an international standard but describe surgical procedures in the German health system. External researchers may use their own ICD procedure codes where applicable.
- Descriptive statistics of features for the initial train and test split, per sampling interval. These give an overview of the raw input data that were pre-processed via our extraction pipeline, including cleansing and data-wrangling steps. Summary statistics are reported as mean, standard deviation, minimum, maximum, and the three quartiles. When extracting similar data, researchers should keep these distributions in mind. We could not share the raw input data directly due to data privacy regulations.
- Feature encodings for baseline models. Numeric feature values were median-aggregated over the intraoperative phase. We included comprehensive descriptions of the baseline model parameters and the pre-processing steps undertaken for data ingest. The mappings to clinical coding systems and the clinical domains are displayed alongside the author information for further use. Please also refer to the GitHub repository, where you can find implemented and encapsulated reference models.
- Hyperparameter sets evaluated via 3x3-fold nested cross-validation. The model variants are listed together with the associated configuration files containing model parameters. We trained and tuned our models across different observation windows, which are also listed, as is the sampling interval included in the grid search. These entries correspond directly to the training procedure outlined in the appendix (a generic nested cross-validation sketch follows after this list).
- Spearman's correlation coefficients with FDR-corrected p-values, computed on mean values aggregated over time windows within the intraoperative phase [T_begin, T_end]. Each feature is listed alongside its Spearman statistics. The p-value refers to the null hypothesis of a zero coefficient. Because FDR correction was applied, we also report the alpha level used to assess whether a result was significant (see the sketch after this list).
- Evaluation metrics on the 1000x bootstrapped test set for all model variants and observation windows. Precision is calculated at 0.8 recall. Sensitivities and specificities are retrieved at the threshold where their sum is maximized. Random classification levels are 0.5 for AUROC and 0.9 for AUPRC, respectively. Results are reported as mean (95% CI); a corresponding sketch follows after this list.
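For readers who want to recompute the missingness measures (Table 2) on their own extracts, the following is a minimal sketch, assuming each patient's data are already sampled onto the intraoperative grid (e.g. every 3 or 5 minutes) with NaN marking missing values; the data layout and function name are illustrative, not the released pipeline.

```python
import numpy as np
import pandas as pd

def missingness_summary(sampled: dict[str, pd.DataFrame], feature: str) -> dict:
    """FETS and mean sparsity for one feature.

    `sampled` maps a patient id to a DataFrame whose rows are that patient's
    intraoperative sampling grid (e.g. every 3 or 5 minutes) and whose columns
    are features, with NaN where nothing was recorded. This layout is an
    assumption for illustration, not the released pipeline format.
    """
    empty_flags, sparsities = [], []
    for frame in sampled.values():
        col = frame[feature]
        is_empty = bool(col.isna().all())     # no longitudinal data at all -> empty time series
        empty_flags.append(is_empty)
        if not is_empty:
            # sparsity: fraction of missing time points out of all time points
            sparsities.append(col.isna().mean())
    return {
        "FETS": float(np.mean(empty_flags)),  # fraction of patients with an empty series
        "mean_sparsity": float(np.mean(sparsities)) if sparsities else float("nan"),
    }
```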
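The hyperparameter search in Table 6 used 3x3-fold nested cross-validation with a grid search. The sketch below illustrates that scheme with scikit-learn on synthetic data; the logistic-regression stand-in and its grid are placeholders for the LSTM/transformer/MLP configurations actually tuned in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (placeholder for the study's feature matrix).
X, y = make_classification(n_samples=300, n_features=20, weights=[0.9], random_state=0)

# 3x3-fold nested cross-validation: the inner loop tunes hyperparameters via
# grid search, the outer loop estimates generalization performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # placeholder grid

search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested-CV AUROC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```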
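The univariate statistics in Table 8 can, in principle, be recomputed with SciPy and statsmodels as sketched below; the inputs (per-patient window means and the binary POD label) and the Benjamini-Hochberg choice of FDR method are our assumptions based on the description above.

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def univariate_spearman(features: dict[str, np.ndarray], label: np.ndarray, alpha: float = 0.05):
    """Spearman's rank correlation per feature with FDR (Benjamini-Hochberg) correction.

    features: feature name -> per-patient mean over an intraoperative window
    label:    binary POD outcome per patient
    """
    names = list(features)
    corr, pvals = [], []
    for name in names:
        rho, p = spearmanr(features[name], label, nan_policy="omit")
        corr.append(rho)
        pvals.append(p)

    # FDR correction across all tested features
    reject, pvals_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")

    return [
        {
            "feature": name,
            "corr_coef": float(c),
            "abs_corr_coef": float(abs(c)),
            "p_value": float(p),
            "significant": bool(sig),
        }
        for name, c, p, sig in zip(names, corr, pvals_adj, reject)
    ]
```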
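The bootstrapped evaluation behind Table 9 (1000 resamples of the test set, AUROC/AUPRC, sensitivity and specificity at the threshold maximizing their sum, precision at 0.8 recall) could be reproduced roughly as follows; details such as taking the precision at the largest threshold that still reaches 0.8 recall are our reading, not the released code.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

def bootstrap_metrics(y_true, y_score, n_boot: int = 1000, seed: int = 0) -> dict:
    """Bootstrapped test-set metrics reported as mean and 95% CI (illustrative)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    records = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():                          # skip resamples with a single class
            continue

        # Threshold maximizing sensitivity + specificity (equivalently tpr - fpr).
        fpr, tpr, _ = roc_curve(yt, ys)
        j = int(np.argmax(tpr - fpr))

        # Precision at the largest threshold that still reaches 0.8 recall.
        precision, recall, _ = precision_recall_curve(yt, ys)
        prec_at_08 = precision[recall >= 0.8][-1]

        records.append({
            "AUROC": roc_auc_score(yt, ys),
            "AUPRC": average_precision_score(yt, ys),
            "Sensitivity": tpr[j],
            "Specificity": 1.0 - fpr[j],
            "Precision@0.8recall": prec_at_08,
        })

    summary = {}
    for key in records[0]:
        vals = np.array([r[key] for r in records], dtype=float)
        low, high = np.percentile(vals, [2.5, 97.5])
        summary[key] = {"mean": float(vals.mean()), "95%CI": (float(low), float(high))}
    return summary
```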
Table Columns
- feature, varname, domain, time_variant, unit, valid_range, binary, selected
- Feature, Sparsity (3 min. sampling), Sparsity (5 min. sampling), Fraction of empty time series (FETS)
- Code System, Variable Name, Codes
- Summary statistics such as mean, std, and percentiles for all train and test features
- Author, Features, Encodings
- model, config, observation_window, sampling
- (Selection / Topic, Checklist Item, Heading / Section)
- feature, corr_coef, abs_corr_coef, p-value, metric, window, alpha_corr
- Model, Observation Window [min.], AUROC, AUPRC, Sensitivity, Specificity, Precision, F1-Score
Sharing/Access
The code for reproducing the above statistics can be found in the public GitHub repository.
Methods
We identified promising features through a literature review and found 197 potential features in the clinical information systems (CIS) across the three hospital sites of our center (see Table B.3 in Supplement B). We selected 148 of these 197 variables based on their availability for at least 1% of patients. Thus, we investigated the influence of rare as well as highly (100%) available features. Details on feature availability (and missingness) are provided in Table B.4 in Supplement B. Table 3 summarizes the feature encoding process. Feature values were considered either time-static, not changing over the intraoperative phase, or time-dynamic, fluctuating during the surgery.

In addition to the 148 selected features, we derived four composite features that combined 1. non-invasive and invasive mean blood pressure, 2. set and measured fraction of inspired oxygen (FiO2), 3. invasive and spontaneous urine output, and 4. set and measured positive end-expiratory pressure (PEEP). The single feature vectors were simply concatenated for these pooled measures before sampling with an interval of, e.g., three minutes. We introduced the four composite features to increase data availability for these variables, as they depict the same physiological attributes, such as blood pressure. By keeping the original single vectors in our feature set, we could still differentiate, e.g., between spontaneous and mechanical ventilation. For 19 medications, the cumulative sum of administered volumes or amounts over time was calculated.

In addition to these derived variables, we encoded data availability with binary missingness indicators for 67 features, assigning 1 if a value was missing and 0 otherwise [79]. Binary missingness indicators were included for the following clinical domains: EEG (5 features), inputs (19 features), outputs (3 features), laboratory values (8 features), scores (4 features), vital signs (12 features), respiratory signals (8 features), demographics (5 features, excluding gender), and the four composite features. For other domains, like medical history, we could not differentiate between a missing measurement (variable not present) and a true negative (variable encoding a negative result); thus, no binary missingness indicators were added there. A total of 238 features were included in our final feature set (see Table B.5 in Supplement B).
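As a concrete illustration of the three types of derived features described above (composite features pooled before sampling, cumulative medication sums, and binary missingness indicators), the following is a minimal pandas sketch; the column and feature names, the 3-minute tolerance, and the long-format input layout are assumptions for illustration only, not the original extraction pipeline.

```python
import pandas as pd

def derive_features(records: pd.DataFrame, grid: pd.DatetimeIndex) -> pd.DataFrame:
    """Illustrative derivation of composite, cumulative-sum, and missingness features.

    `records` is assumed to be a long-format table for a single surgery with
    columns 'feature', 'timestamp', 'value'; the feature names below are
    placeholders, not the identifiers used in the study.
    """
    def series(name: str) -> pd.Series:
        sel = records.loc[records["feature"] == name, ["timestamp", "value"]]
        # average duplicate timestamps so reindexing stays well defined
        return sel.groupby("timestamp")["value"].mean().sort_index()

    out = pd.DataFrame(index=grid)
    tol = pd.Timedelta("3min")  # illustrative sampling interval

    # 1. Composite feature: concatenate (pool) invasive and non-invasive mean
    #    blood pressure before sampling onto the common grid.
    pooled = pd.concat([series("map_invasive"), series("map_noninvasive")])
    pooled = pooled.groupby(level=0).mean().sort_index()
    out["map_composite"] = pooled.reindex(grid, method="nearest", tolerance=tol)

    # 2. Medication: cumulative sum of administered amounts over time,
    #    carried forward onto the sampling grid.
    opioid = series("sufentanil_dose").cumsum()
    out["sufentanil_cumulative"] = opioid.reindex(grid, method="ffill").fillna(0.0)

    # 3. Binary missingness indicator: 1 if the value is missing at a time
    #    point, 0 otherwise.
    heart_rate = series("heart_rate").reindex(grid, method="nearest", tolerance=tol)
    out["heart_rate"] = heart_rate
    out["heart_rate_missing"] = heart_rate.isna().astype(int)

    return out
```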