Antibiotic Resistance Microbiology Dataset (ARMD)
Data files
Jan 23, 2025 version files (24.40 GB total):
- implied_susceptibility_rules.csv (2.94 KB)
- microbiology_cultures_adi_scores.csv (48.02 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (1.21 GB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (1.26 GB)
- microbiology_cultures_cohort.csv (257.01 MB)
- microbiology_cultures_comorbidity.csv (19.67 GB)
- microbiology_cultures_demographics.csv (34.78 MB)
- microbiology_cultures_implied_susceptibility.csv (160.70 MB)
- microbiology_cultures_labs.csv (448.78 MB)
- microbiology_cultures_microbial_resistance.csv (238.30 MB)
- microbiology_cultures_nursing_home_visits.csv (462.73 KB)
- microbiology_cultures_prior_med.csv (153.36 MB)
- microbiology_cultures_priorprocedures.csv (129.99 MB)
- microbiology_cultures_vitals.csv (741.90 MB)
- microbiology_cultures_ward_info.csv (49.89 MB)
- README.md (14.86 KB)
Abstract
The Antibiotic Resistance Microbiology Dataset (ARMD) is a structured and de-identified resource developed using electronic health records (EHR) from Stanford Healthcare. It provides a comprehensive overview of microbiological cultures including urine, respiratory, and blood cultures. This dataset includes 283,715 unique adult patients and features detailed information on culture results, identified organisms, antibiotic susceptibility, and associated demographic and clinical data. The dataset was meticulously constructed through a multi-step process designed to enhance data quality and relevance. By enabling the study of antimicrobial resistance patterns and supporting antimicrobial stewardship efforts, ARMD offers a valuable resource for researchers and clinicians seeking to improve the management of infectious diseases and combat the growing threat of antimicrobial resistance.
README: Antibiotic Resistance Microbiology Dataset (ARMD)
Overview
The Antibiotic Resistance Microbiology Dataset (ARMD) is a structured, de-identified resource created using electronic health records (EHR) from Stanford Healthcare. It is designed to support research on antimicrobial resistance (AMR) and antimicrobial stewardship by providing detailed insights into culture positivity, organism identification, and antibiotic susceptibility.
This dataset includes core microbiological data along with related clinical features, making it a valuable resource for studying AMR patterns, predicting resistance trends, and developing antimicrobial stewardship strategies.
Description of the data and file structure
The dataset is organized into multiple files, each focusing on specific aspects of antimicrobial resistance, treatment practices, and patient outcomes. Each table can be linked using the unique combination of anon_id, pat_enc_csn_id_coded, order_proc_id_coded, and order_time_jittered_utc, which serves as a composite key for consistent tracking of individual culture tests.
The dataset consists of the following files, each representing distinct but interrelated data components:
- microbiology_cultures_cohort.csv: Core dataset containing information on microbiological cultures, including culture type, identified organism, antibiotic tested, and susceptibility results.
- microbiology_cultures_ward_info.csv: Contains ward-level details on where cultures were collected (e.g., ICU, ER, inpatient, outpatient).
- microbiology_cultures_prior_med.csv: Tracks prior medication exposure relative to culture orders, including medication names and timing details.
- microbiology_cultures_microbial_resistance.csv: Provides microbial resistance data, capturing timelines for resistance confirmation relative to culture orders.
- microbiology_cultures_demographics.csv: Includes patient demographic details, such as age (grouped into bins, e.g., "18–24", "25–34") and anonymized gender (represented as binary values, 0 and 1, without further specification).
- microbiology_cultures_labs.csv: Contains laboratory test results, including WBC, hemoglobin, creatinine, and other key metrics, along with summary statistics and first/last recorded values.
- microbiology_cultures_vitals.csv: Provides vital sign measurements, including heart rate, blood pressure, and temperature, with summary statistics and first/last recorded values.
- microbiology_cultures_antibiotic_class_exposure.csv: Tracks prior exposure to antibiotic classes and timeframes relative to culture orders.
- microbiology_cultures_antibiotic_subtype_exposure.csv: Details prior exposure to specific antibiotic subtypes with timing information.
- microbiology_cultures_comorbidity.csv: Captures comorbidity data based on standardized indices, including timing relative to culture orders.
- microbiology_cultures_priorprocedures.csv: Lists prior medical procedures with timing details relative to culture orders.
- microbiology_cultures_adi_scores.csv: Includes Area Deprivation Index (ADI) scores for socioeconomic analysis.
- microbiology_cultures_nursing_home_visits.csv: Tracks nursing home visits relative to culture orders, capturing timing data to study potential impacts on resistance patterns.
- microbiology_cultures_implied_susceptibility.csv: Provides inferred susceptibility relationships between antibiotics based on logical rules.
- implied_susceptibility_rules.csv: Documents the rules used to infer susceptibility relationships, including organisms, antibiotics, and corresponding logic.
Data file details
1. microbiology_cultures_cohort.csv
- Description: Contains primary information about microbiological cultures, including culture type, organism identified, and antibiotic susceptibility.
- Key Columns:
- culture_description: Description of the culture (e.g., Urine, Blood).
- was_positive: Indicates whether the culture was positive for an organism.
- organism: Identified microorganism.
- antibiotic: Antimicrobial agent tested.
- susceptibility: Categorized susceptibility result (e.g., Susceptible, Resistant, Intermediate).
2. microbiology_cultures_ward_info.csv
- Description: Provides information about the ward setting where cultures were collected (e.g., ICU, ER).
- Key Columns:
- hosp_ward_IP: Indicates if the culture was taken in an inpatient ward.
- hosp_ward_OP: Indicates if the culture was taken in an outpatient setting.
- hosp_ward_ER: Indicates if the culture was taken in the emergency department.
- hosp_ward_ICU: Indicates if the culture was taken in the ICU.
3. microbiology_cultures_prior_med.csv
- Description: Details prior medication exposure for patients relative to the culture order.
- Key Columns:
- medication_name: Generic name of the medication.
- medication_time_to_culturetime: Days between medication start and culture order.
- medication_category: Category of the medication.
4. microbiology_cultures_microbial_resistance.csv
- Description: Contains data on microbial resistance and the timeline of resistance development.
- Key Columns:
- organism: Microorganism identified.
- antibiotic: Antimicrobial agent tested.
- resistant_time_to_culturetime: Days between resistance confirmation and culture order.
5. microbiology_cultures_demographics.csv
- Description: Includes patient demographic information at the time of the culture order.
- Key Columns:
- age: Patient age, grouped into bins (e.g., 18–24, 25–34) to prevent identification through exact ages. Patients aged 89 years or older are grouped into a single category (90+).
- gender: Gender is encoded as binary values (0 or 1) without any indication of which value corresponds to male or female.
6. microbiology_cultures_labs.csv
- Description: Includes detailed laboratory results recorded within specific time windows relative to culture orders, including summary statistics (median, Q25, Q75) for key laboratory metrics.
- Key Columns:
- Period_Day: Window frame representing the number of days from the culture order during which lab measurements were recorded.
- Laboratory Metrics (summary statistics: Q25, Median, Q75):
- White blood cell count (wbc), neutrophils (neutrophils), lymphocytes (lymphocytes).
- Hemoglobin (hgb), platelets (plt), sodium (na), bicarbonate (hco3), blood urea nitrogen (bun), creatinine (cr).
- Lactate (lactate), procalcitonin (procalcitonin).
- Laboratory Metrics (first and last recorded values):
- first_wbc, last_wbc, first_hgb, last_hgb, first_cr, last_cr, etc.
- Units of Measurement:
Feature | Unit of Measurement | Feature | Unit of Measurement |
---|---|---|---|
Heart Rate | Beats/minute | Neutrophils | % |
Respiratory Rate | Breaths/minute | Lymphocytes | % |
Temperature | Fahrenheit | Platelets | k/ul |
Systolic Blood Pressure | mmHg | Sodium | mmol/l |
Diastolic Blood Pressure | mmHg | HCO3 | mEq/l |
White Blood Cell Count | k/ul | Blood Urea Nitrogen | mg/dl |
Hemoglobin | mg/dl | Creatinine | mg/dl |
Lactate | mmol/l | Procalcitonin | ng/ml |
7. microbiology_cultures_vitals.csv
- Description: Contains detailed vital sign measurements near the time of the culture order, including summary statistics (median, Q25, Q75) and the first and last recorded values.
- Key Columns:
- Vital Signs (summary statistics: Q25, Median, Q75):
- Heart rate (heartrate), respiratory rate (resprate), body temperature (temp), systolic blood pressure (sysbp), diastolic blood pressure (diasbp).
- Vital Signs (first and last recorded values):
- first_heartrate, last_heartrate, first_resprate, last_resprate, first_temp, last_temp, first_sysbp, last_sysbp, first_diasbp, last_diasbp.
8. microbiology_cultures_antibiotic_class_exposure.csv
- Description: Tracks prior exposure to antibiotic classes.
- Key Columns:
- antibiotic_class: The antibiotic class.
- time_to_culturetime: Days between antibiotic exposure and culture order.
9. microbiology_cultures_antibiotic_subtype_exposure.csv
- Description: Details prior exposure to antibiotic subclasses.
- Key Columns:
- antibiotic_subtype: The antibiotic subclass.
- time_to_culturetime: Days between exposure and culture order.
10. microbiology_cultures_comorbidity.csv
- Description: Contains detailed information on comorbidities, including components derived from AHRQ CCSR diagnosis categories and the Elixhauser comorbidity index, with timing relative to the culture order.
- Key Columns:
- comorbidity_component: Specific component of either the AHRQ CCSR diagnosis or Elixhauser comorbidity index.
- comorbidity_component_start_days_culture: Number of days between the start of the component and the culture order.
- comorbidity_component_end_days_culture: Number of days between the end of the component and the culture order. NULL values indicate the component is still active.
11. microbiology_cultures_priorprocedures.csv
- Description: Lists procedures performed on patients before culture orders.
- Key Columns:
- procedure_name: Name of the procedure (e.g., Central Venous Catheter, Mechanical Ventilation).
- procedure_time_to_culturetime: Days between procedure and culture order.
12. microbiology_cultures_adi_scores.csv
- Description: The Area Deprivation Index (ADI) data, designed for 9-digit ZIP codes, was obtained from the Neighborhood Atlas and mapped to cohort ZIP codes to provide socioeconomic context. Missing or invalid ADI scores (e.g., P, U, NA) were addressed as follows:
- 5-Digit ZIP Code Imputation: For records with only 5-digit ZIP codes, missing ADI scores were replaced with the average ADI score calculated from 9-digit ZIP codes sharing the same first 5 digits.
- Final Dataset: Imputed ADI values were included to ensure data completeness for analysis.
- Key Columns:
- adi_score: ADI score representing socioeconomic disadvantage.
- adi_state_rank: State-level rank for the ADI score.
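The 5-digit imputation described for this file can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the sample ZIP codes and scores, and the representation of ZIP codes as unhyphenated strings are all assumptions. Falling back to the 5-digit average for a 9-digit ZIP with an invalid code (such as P) is also an assumption, since the description only mentions records with 5-digit ZIP codes.

```python
from collections import defaultdict
from statistics import mean


def mean_adi_by_zip5(adi_by_zip9):
    """Average the valid numeric ADI scores of 9-digit ZIPs per 5-digit prefix."""
    scores = defaultdict(list)
    for zip9, score in adi_by_zip9.items():
        # Non-numeric codes such as "P", "U", or "NA" are treated as invalid.
        if isinstance(score, (int, float)):
            scores[zip9[:5]].append(score)
    return {zip5: mean(values) for zip5, values in scores.items()}


def impute_adi(zip_code, adi_by_zip9, zip5_means):
    """Return the direct ADI score if valid, else the 5-digit average, else None."""
    score = adi_by_zip9.get(zip_code)
    if isinstance(score, (int, float)):
        return score
    return zip5_means.get(zip_code[:5])


# Invented sample data for illustration only.
sample_adi = {
    "943051234": 10,
    "943055678": 20,
    "943059999": "P",  # invalid code, ignored when averaging
    "100010001": 50,
}
zip5_means = mean_adi_by_zip9 = mean_adi_by_zip5(sample_adi)

direct_score = impute_adi("943051234", sample_adi, zip5_means)
imputed_score = impute_adi("94305", sample_adi, zip5_means)
unknown_score = impute_adi("00000", sample_adi, zip5_means)
```

In this sketch a ZIP code with no valid 9-digit neighbors stays missing (`None`) rather than receiving a made-up value.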
13. microbiology_cultures_nursing_home_visits.csv
- Description: Tracks nursing home visits relative to the culture order date to analyze their potential impact on microbiological cultures and antimicrobial resistance patterns.
- Key Columns:
- nursing_home_visit_culture: Number of days between the nursing home visit and the culture order.
14. microbiology_cultures_implied_susceptibility.csv
- Description: Determines the implied susceptibility of organisms to antibiotics in cases where direct susceptibility testing has not been performed. The susceptibility is inferred based on established guidelines and rules. These rules are documented in the file implied_susceptibility_rules.csv.
- Key Columns:
- Implied_susceptibility: The implied susceptibility of the organism to an antibiotic.
15. implied_susceptibility_rules.csv
- Description: This file defines the rules used to infer susceptibility relationships between antibiotics. It provides the logic and rationale behind the derivation of implied susceptibility values.
- Key Columns:
- Organism: The organism for which the susceptibility rules apply.
- Antibiotic: The antibiotic for which the susceptibility is inferred.
- Rule: The logic or condition used to determine the inferred susceptibility.
Linking Across Files
The files are linked using:
- anon_id: De-identified patient identifier.
- pat_enc_csn_id_coded: Patient encounter identifier.
- order_proc_id_coded: Unique culture order identifier.
- order_time_jittered_utc: Jittered timestamp.
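A join on these four columns might look like the sketch below, which uses only the Python standard library. The helper names (`composite_key`, `left_join`), the sample rows, and the timestamp and organism formats are invented for illustration; only the four key column names and `hosp_ward_ICU` come from this README. The sketch also assumes at most one right-hand row per key.

```python
import csv
import io

# The four linking columns named in this README.
KEY_COLUMNS = (
    "anon_id",
    "pat_enc_csn_id_coded",
    "order_proc_id_coded",
    "order_time_jittered_utc",
)


def composite_key(row):
    """Return the four linking fields of a row as a hashable tuple."""
    return tuple(row[column] for column in KEY_COLUMNS)


def left_join(left_rows, right_rows):
    """Attach right-hand columns to every left row that shares the same key.

    Assumes one right-hand row per key; later duplicates overwrite earlier ones.
    Left rows without a match are kept unchanged.
    """
    right_by_key = {composite_key(row): row for row in right_rows}
    return [{**right_by_key.get(composite_key(row), {}), **row} for row in left_rows]


# Invented sample data standing in for two of the CSV files.
cohort_csv = """anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered_utc,organism
A1,100,9001,2020-01-05 10:00:00,ESCHERICHIA COLI
A2,200,9002,2021-03-09 08:30:00,null
"""
ward_csv = """anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered_utc,hosp_ward_ICU
A1,100,9001,2020-01-05 10:00:00,1
"""

cohort_rows = list(csv.DictReader(io.StringIO(cohort_csv)))
ward_rows = list(csv.DictReader(io.StringIO(ward_csv)))
linked = left_join(cohort_rows, ward_rows)
```

For the larger files (the comorbidity table is close to 20 GB), building the lookup dictionary from the smaller table and streaming the larger one row by row keeps memory use bounded.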
De-identification
To ensure compliance with HIPAA guidelines and protect patient privacy:
- Patient IDs are anonymized, and all identifiable information is removed.
- Ages are grouped into bins (e.g., "18–24", "25–34") to prevent exact identification.
- Gender is anonymized as binary values (0 and 1) without further specification.
- Temporal data (e.g., culture collection times) are adjusted using jittering to maintain meaningful relationships without revealing exact dates.
Sharing/Access Information
- This dataset is publicly available on Dryad for research purposes.
- Users are encouraged to contact the dataset creators for support or further clarification if needed.
Code/Software
- No scripts are included in the dataset. However, users may contact the authors if they require guidance on using the data or developing analytical workflows.
- Analysis software such as Python, R, or statistical platforms like SPSS or SAS can be used to interpret and analyze the dataset.
Usage notes
This dataset is ideal for researchers and clinicians aiming to:
- Analyze trends in antimicrobial resistance.
- Develop predictive models for empirical antibiotic selection.
- Study the impact of comorbidities, nursing home visits, and clinical variables on resistance patterns.
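As a starting point for the first use case, the sketch below computes the fraction of resistant results per organism and antibiotic pair from rows shaped like microbiology_cultures_cohort.csv. The column names come from this README; the sample rows and the choice of denominator (only Susceptible, Intermediate, and Resistant results count as interpretable) are assumptions made for illustration.

```python
from collections import Counter

# Results counted in the denominator; this choice is an assumption.
INTERPRETABLE = {"Susceptible", "Intermediate", "Resistant"}


def resistance_rates(rows):
    """Return {(organism, antibiotic): fraction of interpretable results that are Resistant}."""
    tested = Counter()
    resistant = Counter()
    for row in rows:
        if row["susceptibility"] not in INTERPRETABLE:
            continue  # skip NULL, Inconclusive, Synergism, and similar results
        pair = (row["organism"], row["antibiotic"])
        tested[pair] += 1
        if row["susceptibility"] == "Resistant":
            resistant[pair] += 1
    return {pair: resistant[pair] / tested[pair] for pair in tested}


# Invented sample rows for illustration only.
sample_rows = [
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Resistant"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Susceptible"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Susceptible"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Inconclusive"},
    {"organism": "KLEBSIELLA PNEUMONIAE", "antibiotic": "Ceftriaxone", "susceptibility": "Resistant"},
]
rates = resistance_rates(sample_rows)
```

Because one patient can contribute several cultures, a real analysis would likely also decide how to handle repeated isolates before interpreting these fractions.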
Handling missing data
- Empty cells in the dataset are represented as "null" to ensure clarity.
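When loading the CSV files, it may be convenient to convert the "null" marker into a native missing value. The following is a minimal sketch using the standard library; the function name and sample rows are invented, and treating empty strings and uppercase "NULL" as missing as well is an assumption.

```python
import csv
import io

# Markers treated as missing; only "null" is documented, the others are assumed.
NULL_MARKERS = {"", "null", "NULL"}


def read_rows_with_none(text_stream):
    """Yield CSV rows as dicts, replacing null markers with None."""
    for row in csv.DictReader(text_stream):
        yield {
            column: None if isinstance(value, str) and value.strip() in NULL_MARKERS else value
            for column, value in row.items()
        }


# Invented sample data for illustration only.
sample_csv = (
    "anon_id,organism,susceptibility\n"
    "A1,ESCHERICHIA COLI,Resistant\n"
    "A2,null,null\n"
)
rows = list(read_rows_with_none(io.StringIO(sample_csv)))
```

For a file on disk, the same generator can be fed an open file handle, for example `open(path, newline="")`, so rows are processed one at a time.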
Ethical considerations
- This dataset is shared under ethical guidelines and does not contain any personally identifiable information.
Acknowledgments
This dataset was developed with support from Stanford University and funded by a National Institutes of Health (NIH) R01 grant. The authors acknowledge all contributors who assisted in creating and validating this dataset.
Citation
When using this dataset, please cite:
Nateghi Haredasht, F., et al. Antibiotic Resistance Microbiology Dataset (ARMD). Stanford Healthcare, 2024.
and the dataset
Nateghi Haredasht, Fateme; Amrollahi, Fatemeh; Maddali, Manoj et al. (2025). Antibiotic Resistance Microbiology Dataset (ARMD) [Dataset]. Dryad. https://doi.org/10.5061/dryad.jq2bvq8kp
Contact
For questions, clarifications, or further support, please contact:
Fateme Nateghi Haredasht, PhD
Stanford University
fnateghi@stanford.edu
Methods
Antimicrobial resistance (AMR) represents a pressing global health challenge, exacerbated by the overuse and misuse of antibiotics. Efforts to mitigate AMR require high-quality datasets to analyze trends in microbial susceptibility, guide clinical decision-making, and inform stewardship programs. EHRs are a rich source of real-world data that can be leveraged to study antimicrobial use and resistance patterns. However, constructing meaningful datasets from EHR data requires rigorous curation and preprocessing to ensure accuracy, relevance, and usability.
ARMD aims to facilitate research in antimicrobial stewardship, with applications in identifying resistance patterns, evaluating treatment practices, and informing public health interventions. By leveraging de-identified EHR data from Stanford Healthcare, this dataset provides a unique opportunity to generate insights that can help improve infectious disease management and curb the spread of AMR. The dataset includes detailed information on culture positivity, organism identification, and antibiotic susceptibility across 55 antibiotics.
By supporting the development of algorithms for predicting susceptibility and selecting effective treatments, ARMD offers researchers the tools to optimize empiric antibiotic therapy while minimizing the overuse of broad-spectrum antibiotics. Additionally, ARMD enables the exploration of broader questions in causal inference and policy learning by leveraging antibiotic susceptibility testing as a proxy for counterfactual outcomes under different treatments. With its large cohort of over 283,000 adult patients and a diverse set of microbiological cultures, this dataset supports a range of research applications, from evaluating resistance patterns to improving clinical guidelines for antimicrobial stewardship.
Cohort Selection
The ARMD was created using de-identified EHR data from Stanford Healthcare to address this need. This dataset provides a comprehensive overview of microbiological cultures from adult patients (≥18 years old) and includes key clinical data points relevant to studying antimicrobial resistance. The cohort construction involved the following features and processes:
- Culture Types: Microbiological cultures were included, specifically urine, respiratory, and blood cultures.
- Temporal Adjustment: The timing of culture orders was adjusted for data privacy through jittering, ensuring patient confidentiality while retaining meaningful temporal relationships.
- Culture Positivity: Each culture is flagged as either positive or negative, indicating whether an organism was identified. Cultures flagged as negative are represented by a null value in the susceptibility field.
- Organism Identification and Susceptibility: For positive cultures, the identified organism and its antibiotic susceptibility are recorded. Susceptibility values were categorized using the following logic:
  - NULL: The original susceptibility was NULL, indicating the culture was not positive (e.g., no growth).
  - Susceptible: Includes values such as Susceptible, Negative, or Not Detected.
  - Resistant: Includes values such as Resistant, Non Susceptible, Detected, or Positive.
  - Intermediate: Includes values such as Intermediate or Susceptible - Dose Dependent.
  - Inconclusive: Includes values such as No Interpretation, Not done, Inconclusive, or See Comment.
  - Synergism: Includes values such as Synergy and No Synergy.
- Antibiotic Standardization: Antibiotic names were cleaned and standardized to the generic form for consistency in analysis, allowing for accurate comparisons across records.
- Antibiotic Susceptibility: Detailed susceptibility data is available for 55 different antibiotics, providing a robust framework for analyzing antimicrobial resistance patterns.
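The susceptibility categorization listed above can be expressed as a lookup table. The sketch below is illustrative only: the raw values are the examples named in the text (which says "values such as", so the real mapping is likely larger), and the case-insensitive matching, the function name, and the "Unmapped" fallback label are assumptions.

```python
# Example raw values from the Methods text, keyed in lowercase.
CATEGORY_BY_RAW_VALUE = {
    "susceptible": "Susceptible",
    "negative": "Susceptible",
    "not detected": "Susceptible",
    "resistant": "Resistant",
    "non susceptible": "Resistant",
    "detected": "Resistant",
    "positive": "Resistant",
    "intermediate": "Intermediate",
    "susceptible - dose dependent": "Intermediate",
    "no interpretation": "Inconclusive",
    "not done": "Inconclusive",
    "inconclusive": "Inconclusive",
    "see comment": "Inconclusive",
    "synergy": "Synergism",
    "no synergy": "Synergism",
}

# Hypothetical label for raw values not covered by the examples above.
UNMAPPED = "Unmapped"


def categorize_susceptibility(raw_value):
    """Map a raw susceptibility string to its category; None stays None."""
    if raw_value is None:
        return None  # NULL: the culture was not positive
    # Lowercase and collapse whitespace before the lookup (an assumption).
    normalized = " ".join(raw_value.lower().split())
    return CATEGORY_BY_RAW_VALUE.get(normalized, UNMAPPED)


examples = [
    categorize_susceptibility(value)
    for value in ("Non Susceptible", "Susceptible - Dose Dependent", "No Synergy", None)
]
```

Returning an explicit fallback label instead of raising makes it easy to count how many raw values fall outside the documented examples.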
The cohort was generated through a systematic, multi-step process to ensure high-quality data:
- Filtering for Clinical Relevance: Microbiological cultures associated with significant clinical outcomes were selected to focus on cases with actionable insights.
- Adult Patient Restriction: The dataset was limited to adult patients (≥18 years old) using demographic data.
- Exclusion Criteria: Patients with prior microbiological cultures within two weeks before the current culture were excluded to avoid overlapping data and ensure distinct clinical events.
- Identification of Culture Positivity: Positivity was determined based on the presence of susceptibility results in the corresponding records.
This rigorous cohort selection process ensures that the ARMD dataset is well-suited for research on antimicrobial resistance, supporting clinical and epidemiological studies aimed at improving antimicrobial stewardship and treatment outcomes.
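One possible reading of the two-week exclusion criterion is sketched below. It is not the authors' pipeline: the function name and sample dates are invented, and several details are assumptions, namely that the rule is applied per culture rather than per patient, that each culture is compared with the same patient's immediately preceding culture (whether or not that one was kept), and that a gap of exactly 14 days counts as "within two weeks".

```python
from datetime import datetime, timedelta

# "Within two weeks" is interpreted as a gap of 14 days or less (an assumption).
EXCLUSION_WINDOW = timedelta(days=14)


def drop_cultures_with_recent_prior(cultures):
    """Keep cultures whose preceding culture for the same patient is more than 14 days earlier.

    cultures: iterable of (anon_id, order_time) tuples.
    Returns the kept tuples sorted by patient and time.
    """
    kept = []
    previous_time_by_patient = {}
    for anon_id, order_time in sorted(cultures):
        previous_time = previous_time_by_patient.get(anon_id)
        if previous_time is None or order_time - previous_time > EXCLUSION_WINDOW:
            kept.append((anon_id, order_time))
        # Track every culture, kept or not, as the reference for the next one.
        previous_time_by_patient[anon_id] = order_time
    return kept


# Invented sample data for illustration only.
sample_cultures = [
    ("A1", datetime(2020, 1, 1)),
    ("A1", datetime(2020, 1, 10)),  # 9 days after the previous culture: dropped
    ("A1", datetime(2020, 2, 1)),   # 22 days after the previous culture: kept
    ("A2", datetime(2020, 3, 1)),
]
kept_cultures = drop_cultures_with_recent_prior(sample_cultures)
```

Comparing against the last kept culture instead of the last culture of any kind would give different results for long runs of closely spaced orders, so the choice should be checked against the published cohort.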
Implied susceptibility
The Implied Susceptibility table is a derived dataset created to provide inferred insights into antibiotic susceptibility patterns based on predefined relationships between antibiotics. This table captures cases where susceptibility to one antibiotic can imply susceptibility or resistance to another, based on established microbiological and pharmacological principles. The table is designed to enhance the interpretability of susceptibility data by incorporating implied relationships between antibiotics, which can be critical for guiding clinical decision-making and understanding resistance patterns. Additionally, we share the rules applied to derive these implied relationships, providing transparency and enabling researchers to understand and reproduce the logic behind the inferred data.
De-Identification
To ensure patient privacy and comply with data-sharing policies, the ARMD employs the following de-identification measures:
- Unique Identifiers: Each patient and culture order is assigned a unique, randomly generated identifier (anon_id and order_proc_id_coded). These identifiers are consistent across the dataset and allow linkage between associated data elements while preserving anonymity.
- Temporal De-Identification: Dates and times are not included in their original format. Instead, all timestamps (e.g., order_time_jittered_utc) are jittered randomly to maintain temporal relationships without revealing exact times. The jittering process ensures the dataset retains analytical utility while removing direct identifiers.
- Age Censoring: To further ensure anonymity, patient ages are categorized into predefined age bins (e.g., 18–24, 25–34, etc.), with all patients aged 89 or older grouped into a single category (90+). This approach prevents the re-identification of individuals based on age outliers.
- Gender Encoding: Gender is recorded as binary values (0 or 1) without defining which value corresponds to male or female, eliminating any interpretative bias and enhancing privacy.
- Exclusion of Direct Identifiers: No direct patient identifiers (e.g., names, medical record numbers) are included in the dataset. All demographic and clinical details are provided in a de-identified format.
Ethical Approval and Patient Consent
This study was approved by the Stanford University Institutional Review Board (IRB) under eProtocol #70466. The IRB determined the study involved minimal risk, and patient consent was waived due to the use of de-identified retrospective data.