Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource for studying antimicrobial resistance using electronic health records
Data files
Jan 23, 2025 version files 24.40 GB
- implied_susceptibility_rules.csv (2.94 KB)
- microbiology_cultures_adi_scores.csv (48.02 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (1.21 GB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (1.26 GB)
- microbiology_cultures_cohort.csv (257.01 MB)
- microbiology_cultures_comorbidity.csv (19.67 GB)
- microbiology_cultures_demographics.csv (34.78 MB)
- microbiology_cultures_implied_susceptibility.csv (160.70 MB)
- microbiology_cultures_labs.csv (448.78 MB)
- microbiology_cultures_microbial_resistance.csv (238.30 MB)
- microbiology_cultures_nursing_home_visits.csv (462.73 KB)
- microbiology_cultures_prior_med.csv (153.36 MB)
- microbiology_cultures_priorprocedures.csv (129.99 MB)
- microbiology_cultures_vitals.csv (741.90 MB)
- microbiology_cultures_ward_info.csv (49.89 MB)
- README.md (14.86 KB)
Mar 04, 2025 version files 24.48 GB
- implied_susceptibility_rules.csv (2.94 KB)
- microbiology_cultures_adi_scores.csv (48.02 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (1.21 GB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (1.26 GB)
- microbiology_cultures_cohort.csv (257.01 MB)
- microbiology_cultures_comorbidity.csv (19.67 GB)
- microbiology_cultures_demographics.csv (34.78 MB)
- microbiology_cultures_implied_susceptibility.csv (160.70 MB)
- microbiology_cultures_labs.csv (448.78 MB)
- microbiology_cultures_microbial_resistance.csv (238.30 MB)
- microbiology_cultures_nursing_home_visits.csv (462.73 KB)
- microbiology_cultures_prior_infecting_organism.csv (81.56 MB)
- microbiology_cultures_prior_med.csv (153.36 MB)
- microbiology_cultures_priorprocedures.csv (129.99 MB)
- microbiology_cultures_vitals.csv (741.90 MB)
- microbiology_cultures_ward_info.csv (49.89 MB)
- README.md (11.34 KB)
Mar 04, 2025 version files 24.48 GB
- implied_susceptibility_rules.csv (2.94 KB)
- microbiology_cultures_adi_scores.csv (48.02 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (1.21 GB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (1.26 GB)
- microbiology_cultures_cohort.csv (257.01 MB)
- microbiology_cultures_comorbidity.csv (19.67 GB)
- microbiology_cultures_demographics.csv (34.78 MB)
- microbiology_cultures_implied_susceptibility.csv (160.70 MB)
- microbiology_cultures_labs.csv (448.78 MB)
- microbiology_cultures_microbial_resistance.csv (238.30 MB)
- microbiology_cultures_nursing_home_visits.csv (462.73 KB)
- microbiology_cultures_prior_infecting_organism.csv (81.56 MB)
- microbiology_cultures_prior_med.csv (153.36 MB)
- microbiology_cultures_priorprocedures.csv (129.99 MB)
- microbiology_cultures_vitals.csv (741.90 MB)
- microbiology_cultures_ward_info.csv (49.89 MB)
- README.md (11.77 KB)
Apr 11, 2025 version files 23 GB
- implied_susceptibility_rules.csv (2.94 KB)
- microbiology_culture_prior_infecting_organism.csv (81.53 MB)
- microbiology_cultures_adi_scores.csv (48.02 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (540.30 MB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (563.21 MB)
- microbiology_cultures_cohort.csv (257.01 MB)
- microbiology_cultures_comorbidity.csv (19.67 GB)
- microbiology_cultures_demographics.csv (34.78 MB)
- microbiology_cultures_implied_susceptibility.csv (160.70 MB)
- microbiology_cultures_labs.csv (448.78 MB)
- microbiology_cultures_microbial_resistance.csv (210.86 MB)
- microbiology_cultures_nursing_home_visits.csv (462.73 KB)
- microbiology_cultures_prior_med.csv (71.57 MB)
- microbiology_cultures_priorprocedures.csv (126.23 MB)
- microbiology_cultures_vitals.csv (741.90 MB)
- microbiology_cultures_ward_info.csv (49.89 MB)
- README.md (12.42 KB)
Abstract
The Antibiotic Resistance Microbiology Dataset (ARMD) is a structured and de-identified resource developed using electronic health records (EHR) from Stanford Healthcare. It provides a comprehensive overview of microbiological cultures including urine, respiratory, and blood cultures. This dataset includes 283,715 unique adult patients and features detailed information on culture results, identified organisms, antibiotic susceptibility, and associated demographic and clinical data. The dataset was meticulously constructed through a multi-step process designed to enhance data quality and relevance. By enabling the study of antimicrobial resistance patterns and supporting antimicrobial stewardship efforts, ARMD offers a valuable resource for researchers and clinicians seeking to improve the management of infectious diseases and combat the growing threat of antimicrobial resistance.
Background
Antimicrobial resistance (AMR) represents a pressing global health challenge, exacerbated by the overuse and misuse of antibiotics. Efforts to mitigate AMR require high-quality datasets to analyze trends in microbial susceptibility, guide clinical decision-making, and inform stewardship programs. Electronic health records (EHR) are a rich source of real-world data that can be leveraged to study antimicrobial use and resistance patterns. However, constructing meaningful datasets from EHR data requires rigorous curation and preprocessing to ensure accuracy, relevance, and usability. ARMD aims to facilitate research in antimicrobial stewardship, with applications in identifying resistance patterns, evaluating treatment practices, and informing public health interventions. By leveraging de-identified EHR data from Stanford Healthcare, this dataset provides a unique opportunity to generate insights that can help improve infectious disease management and curb the spread of AMR. The dataset includes detailed information on culture positivity, organism identification, and antibiotic susceptibility across 55 antibiotics. By supporting the development of algorithms for predicting susceptibility and selecting effective treatments, ARMD offers researchers the tools to optimize empiric antibiotic therapy while minimizing the overuse of broad-spectrum antibiotics. Additionally, ARMD enables the exploration of broader questions in causal inference and policy learning by leveraging antibiotic susceptibility testing as a proxy for counterfactual outcomes under different treatments. With its large cohort of over 283,000 adult patients and a diverse set of microbiological cultures, this dataset supports a range of research applications, from evaluating resistance patterns to improving clinical guidelines for antimicrobial stewardship.
Description of the data and file structure
Data Files
1. microbiology_cultures_cohort.csv
Description: Contains primary information about microbiological cultures, including culture type, organism identified, and antibiotic susceptibility.
Key Columns:
- culture_description: Description of the culture (e.g., Urine, Blood).
- was_positive: Indicates whether the culture was positive for an organism.
- organism: Identified microorganism.
- antibiotic: Antimicrobial agent tested.
- susceptibility: Categorized susceptibility result (e.g., Susceptible, Resistant, Intermediate).
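As a rough illustration of how these columns can be used together, the sketch below computes the fraction of resistant results per organism and antibiotic pair. It uses only the Python standard library and a tiny made-up sample; the organism and antibiotic spellings, the encoding of was_positive, and the function name resistance_rates are assumptions, not values taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Results that count as a definite call for the denominator.
DEFINITE_RESULTS = {"Susceptible", "Resistant", "Intermediate"}


def resistance_rates(rows):
    """Return {(organism, antibiotic): fraction of definite results that are Resistant}."""
    tested = defaultdict(int)
    resistant = defaultdict(int)
    for row in rows:
        result = row["susceptibility"]
        if result not in DEFINITE_RESULTS:
            continue  # skips null, Inconclusive, Synergism, etc.
        key = (row["organism"], row["antibiotic"])
        tested[key] += 1
        if result == "Resistant":
            resistant[key] += 1
    return {key: resistant[key] / tested[key] for key in tested}


# Made-up rows in the documented column layout (not real ARMD data).
SAMPLE = """culture_description,was_positive,organism,antibiotic,susceptibility
URINE,1,ESCHERICHIA COLI,Ciprofloxacin,Resistant
URINE,1,ESCHERICHIA COLI,Ciprofloxacin,Susceptible
BLOOD,1,ESCHERICHIA COLI,Ciprofloxacin,Susceptible
URINE,0,null,null,null
"""

rates = resistance_rates(csv.DictReader(io.StringIO(SAMPLE)))
print(rates)
```

For the real file, the same function could be fed `csv.DictReader(open("microbiology_cultures_cohort.csv", newline=""))`, which streams rows rather than loading the whole file.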
2. microbiology_cultures_ward_info.csv
Description: Provides information about the ward setting where cultures were collected (e.g., ICU, ER).
Key Columns:
- hosp_ward_IP: Indicates if the culture was taken in an inpatient ward.
- hosp_ward_OP: Indicates if the culture was taken in an outpatient setting.
- hosp_ward_ER: Indicates if the culture was taken in the emergency department.
- hosp_ward_ICU: Indicates if the culture was taken in the ICU.
3. microbiology_cultures_prior_med.csv
Description: Details prior medication exposure for patients relative to the culture order.
Key Columns:
- medication_name: Generic name of the medication.
- medication_time_to_culturetime: Days between medication start and culture order.
- medication_category: Category of the medication.
4. microbiology_cultures_microbial_resistance.csv
Description: Contains data on microbial resistance and the timeline of resistance development.
Key Columns:
- organism: Microorganism identified.
- antibiotic: Antimicrobial agent tested.
- resistant_time_to_culturetime: Days between resistance confirmation and culture order.
5. microbiology_cultures_demographics.csv
Description: Includes patient demographic information at the time of the culture order.
Key Columns:
- age: Patient age categorized into bins (e.g., 18–24, 25–34, etc.). Patients aged 89 years or older are grouped into a single category (90+).
- gender: Gender is encoded as binary values (0 or 1) without indicating which value corresponds to male or female.
6. microbiology_cultures_labs.csv
Description: Includes detailed laboratory results recorded within specific time windows relative to culture orders, including summary statistics (median, Q25, Q75) for key laboratory metrics.
Key Columns:
- Period_Day: Window frame representing the number of days from the culture order during which lab measurements were recorded.
- Laboratory Metrics (summary statistics: Q25, Median, Q75):
  - White blood cell count (wbc), neutrophils (neutrophils), lymphocytes (lymphocytes).
  - Hemoglobin (hgb), platelets (plt), sodium (na), bicarbonate (hco3), blood urea nitrogen (bun), creatinine (cr).
  - Lactate (lactate), procalcitonin (procalcitonin).
- Laboratory Metrics (first and last recorded values): first_wbc, last_wbc, first_hgb, last_hgb, first_cr, last_cr, etc.
7. microbiology_cultures_vitals.csv
Description: Contains detailed vital sign measurements near the time of the culture order, including summary statistics and first and last recorded values.
Key Columns:
- Vital Signs (summary statistics: Q25, Median, Q75): heart rate (heartrate), respiratory rate (resprate), body temperature (temp), systolic blood pressure (sysbp), diastolic blood pressure (diasbp).
- Vital Signs (first and last recorded values): first_heartrate, last_heartrate, first_resprate, last_resprate, first_temp, last_temp, first_sysbp, last_sysbp, first_diasbp, last_diasbp.
8. microbiology_cultures_antibiotic_class_exposure.csv
Description: Tracks prior exposure to antibiotic classes.
Key Columns:
- antibiotic_class: The antibiotic class.
- time_to_culturetime: Days between antibiotic exposure and culture order.
9. microbiology_cultures_antibiotic_subtype_exposure.csv
Description: Details prior exposure to antibiotic subclasses.
Key Columns:
- antibiotic_subtype: The antibiotic subclass.
- time_to_culturetime: Days between exposure and culture order.
10. microbiology_culture_prior_infecting_organism.csv
Description: Contains data on prior infecting organisms identified in previous microbiological cultures for each patient.
Key Columns:
- prior_organism: Indicates the presence of a prior infection caused by this organism.
- prior_infecting_organism_days_to_culture: Days between the previously recorded infection and the culture order.
11. microbiology_cultures_comorbidity.csv
Description: Contains detailed information on comorbidities.
Key Columns:
- comorbidity_component: Specific component of either the AHRQ CCSR diagnosis or Elixhauser comorbidity index.
- comorbidity_component_start_days_culture: Days between the start of the component and the culture order.
- comorbidity_component_end_days_culture: Days between the end of the component and the culture order (NULL indicates ongoing conditions).
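Because the comorbidity file is by far the largest in the dataset (19.67 GB), it may be more practical to stream it row by row than to load it into memory. The sketch below counts ongoing conditions (no recorded end day) per component. The sample rows, the component names, and the exact on-disk spelling of the null marker are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Assumption: a missing end day appears as an empty cell or as "null"/"NULL".
NULL_TOKENS = {"", "null"}


def count_ongoing_components(file_obj):
    """Stream the comorbidity rows and count those with no recorded end day."""
    ongoing = Counter()
    for row in csv.DictReader(file_obj):
        end_day = row["comorbidity_component_end_days_culture"].strip().lower()
        if end_day in NULL_TOKENS:
            ongoing[row["comorbidity_component"]] += 1
    return ongoing


# Made-up rows in the documented column layout (not real ARMD data).
SAMPLE = """anon_id,comorbidity_component,comorbidity_component_start_days_culture,comorbidity_component_end_days_culture
P1,Diabetes,400,null
P1,Hypertension,900,120
P2,Diabetes,35,NULL
"""

ongoing = count_ongoing_components(io.StringIO(SAMPLE))
print(ongoing.most_common())
```

With the real file, passing `open("microbiology_cultures_comorbidity.csv", newline="")` instead of the in-memory sample keeps memory use roughly constant regardless of file size.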
12. microbiology_cultures_priorprocedures.csv
Description: Lists procedures performed on patients before culture orders.
Key Columns:
- procedure_name: Name of the procedure (e.g., Central Venous Catheter, Mechanical Ventilation).
- procedure_time_to_culturetime: Days between procedure and culture order.
13. microbiology_cultures_adi_scores.csv
Description: Contains Area Deprivation Index (ADI) data mapped to cohort ZIP codes.
Key Columns:
- adi_score: ADI score representing socioeconomic disadvantage.
- adi_state_rank: State-level rank for the ADI score.
14. microbiology_cultures_nursing_home_visits.csv
Description: Tracks nursing home visits relative to the culture order date.
Key Columns:
- nursing_home_visit_culture: Days between the nursing home visit and the culture order.
15. microbiology_cultures_implied_susceptibility.csv
Description: Determines the implied susceptibility of organisms to antibiotics.
Key Columns:
- Implied_susceptibility: The inferred susceptibility of organisms to an antibiotic.
16. implied_susceptibility_rules.csv
Description: Defines the rules used to infer susceptibility relationships between antibiotics.
Key Columns:
- Organism: The organism for which the susceptibility rules apply.
- Antibiotic: The antibiotic for which susceptibility is inferred.
- Rule: The logic or condition used to determine inferred susceptibility.
Linking Across Files
The files are linked using:
- anon_id: De-identified patient identifier.
- pat_enc_csn_id_coded: Patient encounter identifier.
- order_proc_id_coded: Unique culture order identifier.
- order_time_jittered_utc: Jittered timestamp.
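A minimal sketch of linking two files on these identifiers is shown below, using a dictionary index and the standard library only. It assumes that both files carry anon_id and order_proc_id_coded and that the right-hand file has at most one row per key (as one would expect for demographics); neither assumption is confirmed by this documentation, and left_join is a hypothetical helper name.

```python
import csv
import io


def left_join(left_rows, right_rows, keys=("anon_id", "order_proc_id_coded")):
    """Attach the matching right-hand row (if any) to every left-hand row."""
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        joined.append({**match, **row})  # left-hand values win on name clashes
    return joined


# Made-up rows in the documented column layout (not real ARMD data).
COHORT = """anon_id,order_proc_id_coded,organism
P1,1001,ESCHERICHIA COLI
P2,1002,KLEBSIELLA PNEUMONIAE
"""
DEMOGRAPHICS = """anon_id,order_proc_id_coded,age,gender
P1,1001,25–34,0
"""

merged = left_join(
    csv.DictReader(io.StringIO(COHORT)),
    csv.DictReader(io.StringIO(DEMOGRAPHICS)),
)
for row in merged:
    print(row)
```

Rows without a match keep only their own columns, so downstream code should check for missing keys rather than assume every culture has a linked record.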
Change Log
April 11, 2025
- Removed Time_0 measurements (clinical events recorded on the same day as the culture order) to prevent temporal data leakage and ensure correct sequencing of clinical information.
- Updated the following files to exclude Time_0 measurements:
  - microbiology_cultures_antibiotic_class_exposure.csv
  - microbiology_cultures_antibiotic_subtype_exposure.csv
  - microbiology_cultures_microbial_resistance.csv
  - microbiology_culture_prior_infecting_organism.csv
  - microbiology_cultures_prior_med.csv
  - microbiology_cultures_priorprocedures.csv
- Rationale: Time_0 measurements may reflect information that was not available before the culture order and can introduce data leakage in temporal or predictive modeling tasks. Removing them ensures the dataset accurately reflects prior patient history and improves its validity for modeling and analysis.
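Users working with an earlier version of these files may want to apply a comparable filter themselves. The sketch below drops rows whose day offset is exactly zero. It assumes that Time_0 corresponds to a value of 0 in the file's time-to-culture column and that missing offsets appear as empty cells or "null"; the sample rows and the helper name drop_time_zero are hypothetical.

```python
def drop_time_zero(rows, time_column):
    """Keep rows whose day offset is missing or non-zero; drop same-day (0) rows."""
    kept = []
    for row in rows:
        value = (row.get(time_column) or "").strip()
        is_missing = value == "" or value.lower() == "null"
        if not is_missing and float(value) == 0:
            continue  # same-day event: possible temporal leakage
        kept.append(row)
    return kept


# Made-up rows (placeholder medication names, not real ARMD data).
sample = [
    {"medication_name": "drug_a", "medication_time_to_culturetime": "0"},
    {"medication_name": "drug_b", "medication_time_to_culturetime": "12"},
    {"medication_name": "drug_c", "medication_time_to_culturetime": "null"},
]

filtered = drop_time_zero(sample, "medication_time_to_culturetime")
print([row["medication_name"] for row in filtered])
```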
March 4, 2025
- Added microbiology_culture_prior_infecting_organism.csv to track prior infections by organisms.
- Updated documentation to reflect new data file inclusion.
De-Identification
To ensure compliance with HIPAA guidelines and protect patient privacy:
- Patient IDs are anonymized, and all identifiable information is removed.
- Ages are grouped into bins (e.g., “18–24”, “25–34”) to prevent exact identification.
- Gender is anonymized as binary values (0 and 1) without further specification.
- Temporal data (e.g., culture collection times) are adjusted using jittering to maintain meaningful relationships without revealing exact dates.
Sharing/Access Information
- This dataset is publicly available on Dryad for research purposes.
- Users are encouraged to contact the dataset creators for support or further clarification if needed.
Code/Software
- No scripts are included in the dataset. However, users may contact the authors if they require guidance on using the data or developing analytical workflows.
- Analysis software such as Python, R, or statistical platforms like SPSS or SAS can be used to interpret and analyze the dataset.
Usage Notes
This dataset is ideal for researchers and clinicians aiming to:
- Analyze trends in antimicrobial resistance.
- Develop predictive models for empirical antibiotic selection.
- Study the impact of comorbidities, nursing home visits, and clinical variables on resistance patterns.
Handling Missing Data
- Empty cells in the dataset are represented as “null” to ensure clarity.
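When reading the files, the literal text "null" can be converted to a proper missing value so that it is not mistaken for data. The sketch below does this with the standard library. Matching the token case-insensitively is an assumption (the documentation writes both "null" and NULL), and the sample rows are made up.

```python
import csv
import io


def read_rows(file_obj, null_token="null"):
    """Yield CSV rows as dicts, mapping the null token (any case) to None."""
    for row in csv.DictReader(file_obj):
        yield {
            column: None
            if value is None or value.strip().lower() == null_token
            else value
            for column, value in row.items()
        }


# Made-up rows using documented lab column names (not real ARMD data).
SAMPLE = """anon_id,first_wbc,last_wbc
P1,7.2,null
P2,NULL,11.5
"""

rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows)
```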
Ethical Considerations
- This study was approved by the Stanford University Institutional Review Board (IRB) under eProtocol #70466. The IRB determined the study involves minimal risk, and patient consent was waived due to the use of de-identified retrospective data.
Acknowledgments
This dataset was developed with support from Stanford University and funded by the National Institutes of Health (NIH) R01 grant. The authors acknowledge all contributors who assisted in creating and validating this dataset.
Citation
When using this dataset, please cite:
Nateghi Haredasht, F., et al. Antibiotic Resistance Microbiology Dataset (ARMD). Stanford Healthcare, 2025. arXiv:2503.07664.
Contact
For questions, clarifications, or further support, please contact:
Fateme Nateghi Haredasht, PhD
Stanford University
fnateghi@stanford.edu
Cohort Selection
ARMD was created from de-identified EHR data from Stanford Healthcare to address the need for curated, research-ready data on antimicrobial resistance. The dataset provides microbiological cultures from adult patients (≥18 years old) and includes key clinical data points relevant to studying antimicrobial resistance. The cohort construction involved the following features and processes:
- Culture Types: Microbiological cultures were included, specifically urine, respiratory, and blood cultures.
- Temporal Adjustment: The timing of culture orders was adjusted for data privacy through jittering, ensuring patient confidentiality while retaining meaningful temporal relationships.
- Culture Positivity: Each culture is flagged as either positive or negative, indicating whether an organism was identified. Cultures flagged as negative are represented by a null value in the susceptibility field.
- Organism Identification and Susceptibility: For positive cultures, the identified organism and its antibiotic susceptibility are recorded. Susceptibility values were categorized using the following logic:
  - NULL: The original susceptibility was NULL, indicating the culture was not positive (e.g., no growth).
  - Susceptible: Includes values such as Susceptible, Negative, or Not Detected.
  - Resistant: Includes values such as Resistant, Non Susceptible, Detected, or Positive.
  - Intermediate: Includes values such as Intermediate or Susceptible - Dose Dependent.
  - Inconclusive: Includes values such as No Interpretation, Not done, Inconclusive, or See Comment.
  - Synergism: Includes values such as Synergy and No Synergy.
- Antibiotic Standardization: Antibiotic names were cleaned and standardized to the generic form for consistency in analysis, allowing for accurate comparisons across records.
- Antibiotic Susceptibility: Detailed susceptibility data is available for 55 different antibiotics, providing a robust framework for analyzing antimicrobial resistance patterns.
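The susceptibility categorization described in the list above can be sketched as a lookup table. The raw values are the examples named in the text, which says "such as", so the real mapping is probably longer. Case-insensitive matching and the "Unmapped" fallback label are assumptions added for illustration, not part of the dataset.

```python
# Raw value (lower-cased) -> category, using only the examples listed in the text.
RAW_TO_CATEGORY = {
    "susceptible": "Susceptible",
    "negative": "Susceptible",
    "not detected": "Susceptible",
    "resistant": "Resistant",
    "non susceptible": "Resistant",
    "detected": "Resistant",
    "positive": "Resistant",
    "intermediate": "Intermediate",
    "susceptible - dose dependent": "Intermediate",
    "no interpretation": "Inconclusive",
    "not done": "Inconclusive",
    "inconclusive": "Inconclusive",
    "see comment": "Inconclusive",
    "synergy": "Synergism",
    "no synergy": "Synergism",
}


def categorize(raw):
    """Map a raw susceptibility string to its category; None stands for NULL."""
    if raw is None or raw.strip().lower() in ("", "null"):
        return None  # culture was not positive
    # "Unmapped" is a placeholder for raw values not listed in the documentation.
    return RAW_TO_CATEGORY.get(raw.strip().lower(), "Unmapped")


for value in ["Non Susceptible", "Susceptible - Dose Dependent", "No Synergy", None]:
    print(value, "->", categorize(value))
```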
The cohort was generated through a systematic, multi-step process to ensure high-quality data:
- Filtering for Clinical Relevance: Microbiological cultures associated with significant clinical outcomes were selected to focus on cases with actionable insights.
- Adult Patient Restriction: The dataset was limited to adult patients (≥18 years old) using demographic data.
- Exclusion Criteria: Patients with prior microbiological cultures within two weeks before the current culture were excluded to avoid overlapping data and ensure distinct clinical events.
- Identification of Culture Positivity: Positivity was determined based on the presence of susceptibility results in the corresponding records.
This rigorous cohort selection process ensures that the ARMD dataset is well-suited for research on antimicrobial resistance, supporting clinical and epidemiological studies aimed at improving antimicrobial stewardship and treatment outcomes.
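One way to read the two-week exclusion criterion is sketched below: a culture is dropped when the same patient has any earlier culture within the preceding 14 days. This is an interpretation, not the authors' code. Applying the rule per culture, treating exactly 14 days as "within two weeks", and comparing against every earlier culture (kept or not) are all assumptions.

```python
from datetime import datetime, timedelta


def exclude_recent_repeats(cultures, window=timedelta(days=14)):
    """Keep (anon_id, order_time) pairs with no earlier culture inside the window."""
    kept = []
    previous_time = {}
    # Sorting by patient then time means the closest earlier culture is the last one seen.
    for anon_id, order_time in sorted(cultures):
        last = previous_time.get(anon_id)
        if last is None or order_time - last > window:
            kept.append((anon_id, order_time))
        previous_time[anon_id] = order_time
    return kept


# Made-up culture orders (not real ARMD data).
start = datetime(2020, 1, 1)
cultures = [
    ("P1", start),
    ("P1", start + timedelta(days=5)),   # 5 days after the first: dropped
    ("P1", start + timedelta(days=30)),  # 25 days after the previous: kept
    ("P1", start + timedelta(days=40)),  # 10 days after the previous: dropped
    ("P2", start),
]

kept = exclude_recent_repeats(cultures)
for anon_id, order_time in kept:
    print(anon_id, order_time.date())
```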
Implied susceptibility
The Implied Susceptibility table is a derived dataset created to provide inferred insights into antibiotic susceptibility patterns based on predefined relationships between antibiotics. This table captures cases where susceptibility to one antibiotic can imply susceptibility or resistance to another, based on established microbiological and pharmacological principles. The table is designed to enhance the interpretability of susceptibility data by incorporating implied relationships between antibiotics, which can be critical for guiding clinical decision-making and understanding resistance patterns. Additionally, we share the rules applied to derive these implied relationships, providing transparency and enabling researchers to understand and reproduce the logic behind the inferred data.
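To make the idea concrete, the sketch below applies "if result X for one antibiotic, then result Y for another" rules to a single isolate. The rule structure, the field names, and the placeholder organism and drug names are all hypothetical. The actual encoding of the Rule column in implied_susceptibility_rules.csv should be checked in the file itself.

```python
def apply_implied_rules(organism, observed, rules):
    """Return {antibiotic: implied result} for antibiotics that were not tested directly.

    observed maps antibiotic name -> measured susceptibility for one isolate.
    """
    implied = {}
    for rule in rules:
        if rule["organism"] != organism:
            continue
        if observed.get(rule["if_antibiotic"]) != rule["if_result"]:
            continue
        target = rule["then_antibiotic"]
        if target not in observed:  # never override a measured result
            implied.setdefault(target, rule["then_result"])
    return implied


# Hypothetical rules with placeholder names (not taken from the ARMD rules file).
RULES = [
    {"organism": "ORGANISM_X", "if_antibiotic": "DRUG_A", "if_result": "Susceptible",
     "then_antibiotic": "DRUG_B", "then_result": "Susceptible"},
    {"organism": "ORGANISM_X", "if_antibiotic": "DRUG_C", "if_result": "Resistant",
     "then_antibiotic": "DRUG_D", "then_result": "Resistant"},
]

observed = {"DRUG_A": "Susceptible", "DRUG_C": "Susceptible"}
implied = apply_implied_rules("ORGANISM_X", observed, RULES)
print(implied)
```

Keeping measured and implied results in separate dictionaries makes it easy to report which values were inferred rather than tested.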
De-Identification
To ensure patient privacy and comply with data-sharing policies, the ARMD employs the following de-identification measures:
- Unique Identifiers:
  - Each patient and culture order is assigned a unique, randomly generated identifier (anon_id and order_proc_id_coded). These identifiers are consistent across the dataset and allow linkage between associated data elements while preserving anonymity.
- Temporal De-Identification:
  - Dates and times are not included in their original format. Instead, all timestamps (e.g., order_time_jittered_utc) are jittered randomly to maintain temporal relationships without revealing exact times.
  - The jittering process ensures the dataset retains analytical utility while removing direct identifiers.
- Age Censoring:
  - To further ensure anonymity, patient ages are categorized into predefined age bins (e.g., 18–24, 25–34, etc.), with all patients aged 89 or older grouped into a single category (90+). This approach prevents re-identification of individuals based on age outliers.
- Gender Encoding:
  - Gender is recorded as binary values (0 or 1) without defining which value corresponds to male or female, eliminating any interpretative bias and enhancing privacy.
- Exclusion of Direct Identifiers:
  - No direct patient identifiers (e.g., names, medical record numbers) are included in the dataset.
  - All demographic and clinical details are provided in a de-identified format.
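The text above does not describe the jittering algorithm. One common approach that preserves temporal relationships is to shift all of a patient's timestamps by a single random offset, which is what the sketch below illustrates. The per-patient offset, the 30-day range, and the function name jitter_by_patient are assumptions and should not be read as the method used for ARMD.

```python
import random
from datetime import datetime, timedelta


def jitter_by_patient(events, max_shift_days=30, seed=0):
    """Shift every timestamp of a patient by one random whole-day offset.

    events is a list of (anon_id, timestamp) pairs. Intervals between a
    patient's own events are unchanged because they share the same offset.
    """
    rng = random.Random(seed)
    offsets = {}
    jittered = []
    for anon_id, timestamp in events:
        if anon_id not in offsets:
            shift = rng.randint(-max_shift_days, max_shift_days)
            offsets[anon_id] = timedelta(days=shift)
        jittered.append((anon_id, timestamp + offsets[anon_id]))
    return jittered


# Made-up events (not real ARMD data).
events = [
    ("P1", datetime(2020, 3, 1, 8, 0)),
    ("P1", datetime(2020, 3, 4, 9, 30)),
    ("P2", datetime(2020, 3, 1, 8, 0)),
]

jittered = jitter_by_patient(events)
print(jittered[1][1] - jittered[0][1])  # same interval as in the original events
```

Under this scheme, day differences such as the "days to culture" columns can be computed from shifted timestamps without exposing the true dates.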
Ethical Approval and Patient Consent
This study was approved by the Stanford University Institutional Review Board (IRB) under eProtocol #70466. The IRB determined the study involves minimal risk, and patient consent was waived due to the use of de-identified retrospective data.