Annotated dataset of clinical notes for predicting social determinants of mental health in opioid use disorder using a Human-in-the-Loop Large Language Model Interaction for Annotation (HLLIA) framework
Data files
Mar 06, 2026 version files: 163.76 KB
model1_2636.csv: 31.64 KB
model2_2636.csv: 31.64 KB
model3_2636.csv: 94.93 KB
README.md: 5.56 KB
Abstract
This dataset comprises annotations for 2,636 deidentified discharge summaries from the MIMIC-IV-Note database, labeled for 13 Social Determinants of Mental Health (SDOMH) relevant to Opioid Use Disorder (OUD). The dataset was created to support natural language processing (NLP) and machine learning research aimed at identifying social factors influencing OUD outcomes. Using a Human-in-the-Loop Large Language Model Interaction for Annotation (HLLIA) framework, initial SDOMH labels were generated by GPT-3.5/4 and subsequently refined through expert review, partial-correlation–based validation, and iterative consensus refinement to ensure label consistency and reliability. Each record includes: (1) a subject ID, (2) binary indicators for OUD presence (Hierarchy 1) and SDOMH presence (Hierarchy 2), and (3) thirteen binary columns representing specific determinants such as Social Detachment, Financial Uncertainty, Housing Instability, Substance Misuse, Violence, and Suicide Mortality (Hierarchy 3).
The dataset enables hierarchical, multi-label classification of SDOMHs and serves as training data for transformer-based models such as the Multilevel Hierarchical Clinical-Longformer Embeddings (MHCLE) algorithm. Potential reuse includes applications in social and behavioral health informatics, causal inference, clinical decision support, and bias-aware LLM annotation studies.
Dataset DOI: 10.5061/dryad.d51c5b0h7
Description of the data and file structure
The dataset was developed as part of an experimental study designed to improve the automated identification of Social Determinants of Mental Health (SDOMH) among patients with Opioid Use Disorder (OUD) using unstructured clinical notes. The data originates from the MIMIC-IV-Note database, which contains over 330,000 deidentified discharge summaries from Beth Israel Deaconess Medical Center (BIDMC). From this source, 2,636 discharge summaries were selected to balance computational feasibility with analytical depth.
The experimental effort focused on creating a high-quality annotated corpus for training and evaluating transformer-based NLP models. To achieve this, a Human-in-the-Loop Large Language Model Interaction for Annotation (HLLIA) framework was implemented. In this framework, Large Language Models (GPT-3.5 and GPT-4) produced initial SDOMH labels, which were subsequently refined and validated by four expert annotators specializing in mental health research.
Each clinical note was annotated across three hierarchical levels:
H1: Presence or absence of Opioid Use Disorder.
H2: Presence or absence of any SDOMH factors.
H3: Thirteen binary indicators representing specific SDOMH categories such as Social Detachment, Financial Uncertainty, Housing Instability, Substance Misuse, and Suicide Mortality.
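The three levels suggest a nesting between H2 and H3 that can be sanity-checked before training. The following is a minimal sketch, assuming that the H2 flag should equal 1 exactly when at least one of the thirteen H3 indicators is 1. That rule is an assumption drawn from the level descriptions, not a documented guarantee of the dataset, and the function name and toy labels are hypothetical.

```python
from typing import Dict, List


def find_h2_h3_mismatches(h2: Dict[str, int], h3: Dict[str, List[int]]) -> List[str]:
    """Return subject IDs whose H2 flag disagrees with any(H3 indicators).

    ASSUMPTION: H2 == 1 if and only if at least one H3 indicator is 1.
    The dataset description does not state this rule explicitly.
    """
    mismatches = []
    for subject_id, flag in h2.items():
        indicators = h3.get(subject_id)
        if indicators is None:
            continue  # no H3 row for this subject, so nothing to compare
        expected = 1 if any(indicators) else 0
        if flag != expected:
            mismatches.append(subject_id)
    return mismatches


# Toy labels, made up for illustration (not taken from the dataset).
h2_toy = {"s1": 1, "s2": 0, "s3": 1}
h3_toy = {
    "s1": [0, 1] + [0] * 11,  # one determinant present
    "s2": [0] * 13,           # no determinant present
    "s3": [0] * 13,           # H2 says present, H3 says none
}
print(find_h2_h3_mismatches(h2_toy, h3_toy))
```

Any subject IDs returned are candidates for manual review rather than confirmed labeling errors, since the annotation process may allow H2 to be set on evidence that does not map to one of the thirteen categories.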
The annotated dataset supported the training of the Multilevel Hierarchical Clinical-Longformer Embeddings (MHCLE) model, enabling structured prediction of SDOMHs from unstructured text. These efforts collectively aimed to advance computational methods for understanding how social factors of mental health contribute to OUD progression and to create a reusable benchmark for future research in clinical NLP and health informatics.
Data files
model1_2636.csv: The first column originally held the clinical note text, which has been removed because it contains sensitive information and replaced with subject IDs. The remaining column is a binary label indicating whether opioid use disorder is present (1) or absent (0).
model2_2636.csv: The first column originally held the clinical note text, which has been removed because it contains sensitive information and replaced with subject IDs. The remaining column is a binary label indicating whether any social determinant of mental health (SDOMH) is present (1) or absent (0) in the text.
model3_2636.csv: The first column originally held the clinical note text, which has been removed because it contains sensitive information and replaced with subject IDs. The remaining columns are multi-label annotations indicating which of the 13 SDOMH are present (1) or absent (0). The 13 SDOMH columns are, in order: Socially Detached, Health Care Handover, Obstacles to Medical Care, Financial Uncertainty, Residential Instability, Nutritional Shortage, Violence, Judicial Obstacles, Substance Misuse, Mental Disturbance Symptoms, Acute Pain, Medical Disability, and Suicide Mortality.
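The three files can be combined into one table keyed on the subject ID. The sketch below uses pandas and makes several assumptions that should be checked against the actual files: each CSV has a header row, the first column is the subject ID, the label columns appear in the order listed above, and each subject ID occurs once per file. The column names `subject_id`, `oud`, and `sdomh_any` and the helper functions are hypothetical choices for illustration.

```python
import io

import pandas as pd

# The 13 SDOMH categories, in the column order given in the file description.
SDOMH_LABELS = [
    "Socially Detached", "Health Care Handover", "Obstacles to Medical Care",
    "Financial Uncertainty", "Residential Instability", "Nutritional Shortage",
    "Violence", "Judicial Obstacles", "Substance Misuse",
    "Mental Disturbance Symptoms", "Acute Pain", "Medical Disability",
    "Suicide Mortality",
]


def load_labels(source, label_names):
    """Read one label file: first column = subject ID, rest = binary labels.

    ASSUMPTION: the file has a header row, which is replaced by position.
    """
    df = pd.read_csv(source)
    expected = 1 + len(label_names)
    if df.shape[1] != expected:
        raise ValueError(f"expected {expected} columns, got {df.shape[1]}")
    df.columns = ["subject_id"] + list(label_names)
    return df


def merge_levels(src1, src2, src3):
    """Join the H1, H2, and H3 label files on subject_id."""
    h1 = load_labels(src1, ["oud"])
    h2 = load_labels(src2, ["sdomh_any"])
    h3 = load_labels(src3, SDOMH_LABELS)
    # validate="one_to_one" raises MergeError if an ID repeats in either table.
    merged = h1.merge(h2, on="subject_id", how="inner", validate="one_to_one")
    return merged.merge(h3, on="subject_id", how="inner", validate="one_to_one")


# Toy stand-ins for the three files (made-up IDs and labels).
toy1 = io.StringIO("id,label\n101,1\n102,0\n")
toy2 = io.StringIO("id,label\n101,1\n102,0\n")
toy3 = io.StringIO(
    "id," + ",".join(f"c{i}" for i in range(13)) + "\n"
    + "101," + ",".join(["1"] + ["0"] * 12) + "\n"
    + "102," + ",".join(["0"] * 13) + "\n"
)
merged = merge_levels(toy1, toy2, toy3)
print(merged.shape)
```

With the real data, the call would be `merge_levels("model1_2636.csv", "model2_2636.csv", "model3_2636.csv")`. If the merge raises because subject IDs repeat (for example, several notes per patient), the rows would need to be aligned by position instead, assuming the three files share the same row order.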
Code/Software
Python (3.9+). Used for preprocessing, annotation orchestration, modeling, and evaluation.
Common Python libraries (open-source):
pandas (tabular I/O and cleaning)
numpy (vectorized ops)
NLTK (tokenization, lemmatization; download punkt, stopwords, wordnet)
PyTorch (deep learning backend; CPU/GPU)
transformers and tokenizers (loading Clinical-Longformer, ClinicalBERT, etc.)
scikit-learn (metrics/utilities)
matplotlib (optional figures)
GPU (optional): Any CUDA-capable NVIDIA GPU (e.g., A100 as used in the study) accelerates training/inference; CPU works for small tests.
OpenAI API (optional): Only needed if you wish to replicate HLLIA’s LLM-assisted labeling step (GPT-3.5/4) rather than using the already-provided labels.
The dataset itself (subject IDs and labels) can be viewed with any text editor or spreadsheet tool (e.g., VS Code, Excel, LibreOffice). No proprietary software is required.
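As an illustration of how scikit-learn can be used for the metrics step on the 13-column H3 labels, the following minimal sketch computes micro-averaged F1, macro-averaged F1, and Hamming loss for a multi-label prediction matrix. These are common choices for multi-label evaluation and are not necessarily the metrics reported in the study. The function name and toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss


def multilabel_report(y_true, y_pred):
    """Summarize multi-label predictions (rows = notes, columns = labels)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        # Micro: pool true/false positives over all labels, then compute F1.
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
        # Macro: compute F1 per label, then take the unweighted mean.
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        # Fraction of individual label cells that are wrong.
        "hamming_loss": hamming_loss(y_true, y_pred),
    }


# Toy example: 4 notes x 3 labels (made-up values, not from the dataset).
y_true_toy = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_pred_toy = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
report = multilabel_report(y_true_toy, y_pred_toy)
print(report)
```

Macro F1 weights rare determinants as heavily as common ones, so it can diverge sharply from micro F1 when label prevalence is imbalanced. Reporting both makes that gap visible.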
Access Information
Data was derived from the following sources:
https://physionet.org/content/mimiciv/3.1/
Follow these steps to obtain access:
Create a PhysioNet Account
Go to https://physionet.org. Click 'Sign Up' and register with your institutional or academic email. Verify your email address - this is required before proceeding.
Go to the MIMIC-IV-Note page on PhysioNet and click 'Request Access'.
Sign the Data Use Agreement (DUA) and upload your CITI completion report.
Download or access the data: Once your request is approved, navigate back to the MIMIC-IV-Note project page.
You can either download the ZIP or CSV files directly (de-identified free-text notes) or use Google BigQuery / AWS cloud access for large-scale querying.
Human subjects data
Please note that these are deidentified clinical notes from the MIMIC-IV datasets, which are publicly accessible. We hold PhysioNet credentialed access and adhere to its Data Use Agreement (DUA).
