Antimicrobial Resistance Microbiological Dataset (ARMD-UTSW): A deidentified collection of electronic health records, from a quaternary, academic medical center, for antimicrobial resistance research
Data files
Sep 26, 2025 version files 4.45 GB
- microbiology_cultures_adi_scores.csv (46.83 MB)
- microbiology_cultures_antibiotic_class_exposure.csv (655.83 MB)
- microbiology_cultures_antibiotic_subtype_exposure.csv (685.04 MB)
- microbiology_cultures_cohort.csv (202.29 MB)
- microbiology_cultures_comorbidity.csv (1.30 GB)
- microbiology_cultures_demographics.csv (32.34 MB)
- microbiology_cultures_labs.csv (462.23 MB)
- microbiology_cultures_microbial_resistance.csv (302.40 MB)
- microbiology_cultures_nursing_home_visits.csv (44.73 MB)
- microbiology_cultures_prior_infecting_organism.csv (49.20 MB)
- microbiology_cultures_prior_med.csv (551.61 MB)
- microbiology_cultures_prior_procedures.csv (69.78 MB)
- microbiology_cultures_ward_info.csv (47.78 MB)
- README.md (10.28 KB)
Abstract
Antibiotic resistance is a global public health emergency, but high-quality, real-world, EHR-based data sets that can be used for antibiotic resistance research are limited. We have developed the Antibiotic Resistance Microbiology Dataset: UTSW (ARMD: UTSW), which includes microbiological culture testing and susceptibility results for 237,258 patients at the University of Texas Southwestern Medical Center (UTSW) from 2005 to 2025. Longitudinal demographics, prior medical histories, medications, procedures, and clinical information such as testing locations and recent laboratory values are also incorporated into the deidentified data set. With the standardization of data values and careful deidentification of the ARMD: UTSW data set, researchers globally will be able to improve patient outcomes, increase awareness, and add to the collective knowledge regarding antibiotic resistance.
Dataset DOI: 10.5061/dryad.0rxwdbsd5
Description of the data and file structure
Our dataset comprises longitudinal electronic health records from the University of Texas Southwestern Medical Center. This collection includes deidentified urine, respiratory, and blood-based microbiological culture results and susceptibilities from a cohort of adult patients (≥18 years old) regardless of culture positivity. Additional data include prior medical histories such as comorbidities, socioeconomic indicators, prior medications, prior infections, and prior medical procedures. The dataset includes 13 .csv files that can be used for future analyses such as machine learning predictive models. All identifying personal information has been removed or anonymized according to the deidentification methods described below.
All files are joined by the following identifiers:
- anon_id: the deidentified patient number
- pat_enc_csn_id_coded: the deidentified patient encounter number
- order_proc_id_coded: the deidentified unique culture order number
- order_time_jittered: the jittered timestamp for the unique culture order
All missing values have been replaced with “Null” values.
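As a minimal sketch of how these identifiers and the "Null" convention might be handled, the Python example below joins two small in-memory tables on the shared keys and converts "Null" strings to `None`. The sample rows, the timestamp format, and the helper names `read_rows` and `join_on_keys` are made up for illustration. Only the column names come from this README, and real files would be opened from disk rather than from `io.StringIO`.

```python
import csv
import io

# Identifier columns shared by the files (from the list above).
KEYS = ("anon_id", "pat_enc_csn_id_coded", "order_proc_id_coded", "order_time_jittered")


def read_rows(handle):
    """Read CSV rows as dicts, turning the literal string "Null" into None."""
    return [
        {col: (None if val == "Null" else val) for col, val in row.items()}
        for row in csv.DictReader(handle)
    ]


def join_on_keys(left, right, keys=KEYS):
    """Inner-join two lists of row dicts on the shared identifier columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


# Made-up rows for illustration only; not taken from the dataset.
cohort_csv = io.StringIO(
    "anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered,organism\n"
    "AB123,10001,21001,2015-03-01 08:00:00,Null\n"
    "CD456,10002,21002,2016-07-12 14:30:00,ESCHERICHIA COLI\n"
)
ward_csv = io.StringIO(
    "anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered,hosp_ward_ICU\n"
    "AB123,10001,21001,2015-03-01 08:00:00,1\n"
    "CD456,10002,21002,2016-07-12 14:30:00,0\n"
)

cohort = read_rows(cohort_csv)
ward = read_rows(ward_csv)
merged = join_on_keys(cohort, ward)
```

Joining on all four identifiers is the conservative choice here. Whether a smaller subset such as `order_proc_id_coded` alone is sufficient for a given pair of files should be checked against the data.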
Deidentification Methods
The data set is deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) and the National Institute of Standards and Technology (NIST) Safe Harbor regulations.
- All identifying patient information has been removed from the data set.
- Any patient, encounter, or culture order identification numbers have been anonymized. The anon_id was created by giving each patient identification number a random pair of letters accompanied by a random set of numbers. The pat_enc_csn_id_coded and order_proc_id_coded values were created by giving each encounter number or culture order a prefix (10 for pat_enc_csn_id_coded and 21 for order_proc_id_coded) accompanied by a serialized value based on the initial randomization of the accompanying anon_id. This allows each patient's encounter and culture order identifiers to be linked to the appropriate patient while remaining anonymized.
- Demographic information has been removed with the exception of age and gender. Age has been bucketed into age ranges such as 18-24, 25-34, etc. and gender (male or female) has been adapted to a binary indicator of 0 or 1 without indication of which gender is represented by 0 or 1.
- Dates and times have been offset (jittered) by a random number of days within a two-month period while keeping consistent temporal relationships.
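The date jittering described in the last item can be illustrated with a short Python sketch. It is not the procedure used to build the dataset: it assumes one random offset per patient, reads the two-month period as ±30 days, and uses hypothetical names (`make_offsets`, `jitter`) and made-up timestamps. The point is only that shifting all of a patient's timestamps by the same offset leaves the intervals between events unchanged.

```python
import random
from datetime import datetime, timedelta


def make_offsets(patient_ids, max_days=30, seed=0):
    """Assign each patient one random day offset in [-max_days, max_days].

    The +/-30 day window is an assumed reading of "a two-month period".
    """
    rng = random.Random(seed)
    return {
        pid: timedelta(days=rng.randint(-max_days, max_days))
        for pid in patient_ids
    }


def jitter(timestamp, patient_id, offsets):
    """Shift a timestamp by the offset assigned to its patient."""
    return timestamp + offsets[patient_id]


# Made-up patient IDs and event times for illustration only.
offsets = make_offsets(["AB123", "CD456"])
order_time = datetime(2015, 3, 1, 8, 0)
prior_med_time = datetime(2015, 2, 20, 8, 0)

gap_before = order_time - prior_med_time
gap_after = (
    jitter(order_time, "AB123", offsets)
    - jitter(prior_med_time, "AB123", offsets)
)
```

Because `gap_before` and `gap_after` are equal, columns expressed as days relative to the culture order remain meaningful after jittering.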
Files and variables
- microbiology_cultures_adi_scores.csv
- Area Deprivation Index (ADI) information for both national and state ranks, mapped from patients' nine-digit zip codes where available and from five-digit zip codes otherwise.
- Key Columns:
- adi_score: ADI percentile score at the national level; higher values indicate higher levels of deprivation
- adi_state_rank: ADI decile rankings at the state (Texas, USA) level; higher values indicate higher levels of deprivation
- microbiology_cultures_antibiotic_class_exposure.csv
- Prior exposure to various antibiotic classes
- Key Columns:
- medication_name: Name of antibiotic
- antibiotic_class: Class of antibiotic
- time_to_culturetime: Days between culture order and antibiotic exposure
- microbiology_cultures_antibiotic_subtype_exposure.csv
- Prior exposure to antibiotic subclasses
- Key Columns:
- medication_name: Name of antibiotic
- antibiotic_subtype: Subclass of antibiotic
- antibiotic_subtype_category: Categorical name of antibiotic subclass
- medication_time_to_culturetime: Days between culture order and antibiotic exposure
- microbiology_cultures_cohort.csv
- The comprehensive listing of culture types, organisms, and susceptibility testing results
- Key Columns:
- culture_description: The type of culture ordered (Blood, Respiratory, or Urine)
- was_positive: Binary indicator of a positive culture result (0 or 1)
- organism: Name of organism
- antibiotic: Name of antibiotic tested
- susceptibility: Susceptibility results such as susceptible, intermediate, or resistant per antibiotic
- microbiology_cultures_comorbidity.csv
- Comorbidity information for each patient in the cohort
- Key Columns:
- comorbidity_component: Comorbidity component type based on either the AHRQ CCSR diagnosis or Elixhauser Comorbidity Index
- comorbidity_component_start_days_culture: Days between the start of comorbidity and the culture order
- comorbidity_component_end_days_culture: Days between the end of the comorbidity and the culture order. “Null” indicates an ongoing comorbidity, and negative values indicate comorbidities that ended after the culture order.
- microbiology_cultures_demographics.csv
- Basic demographic information for each patient in the cohort
- Key columns:
- age: The age range of the patient with values such as 18-24, 25-34, and above 90
- gender: Binary indicator of 0 or 1 to indicate a particular gender. The indicators are purposely not identified for deidentification purposes.
- microbiology_cultures_labs.csv
- Laboratory results recorded for the following:
- wbc: White blood cell count
- neutrophils: Neutrophils
- lymphocytes: Lymphocytes
- hgb: Hemoglobin
- plt: Platelets
- na: Sodium
- hco3: Bicarbonate
- bun: Blood urea nitrogen
- procalcitonin: Procalcitonin
- Key columns:
- Period_Day: Days between the culture order and the time range in which the laboratory values were measured
- Q25_xx: Lower quartile laboratory values for listed measurements
- median_xx: Median laboratory values for listed measurements
- Q75_xx: Upper quartile laboratory values for listed measurements
- first_xx: First laboratory value for listed measurements
- last_xx: Last laboratory value for listed measurements
- microbiology_cultures_microbial_resistance.csv
- Microbial resistance information for each culture ordered
- Key columns:
- organism: Name of organism
- antibiotic: Name of antibiotic
- Resistant_time_to_culturetime: Days between the culture order and the susceptibility result
- microbiology_cultures_nursing_home_visits.csv
- Information about visits to nursing facilities prior to the date of culture order
- Key Columns:
- nursing_home_visit_culture: Days between the culture order and nursing facility visit
- microbiology_cultures_prior_infecting_organism.csv
- Information on identified organisms prior to the culture order
- Key columns:
- prior_organism: Name of organism
- prior_infecting_organism_days_to_culture: Days between the culture order and prior infection
- microbiology_cultures_prior_med.csv
- Information regarding medications used prior to the culture order
- Key columns:
- medication_name: Name of prior medication used
- medication_time_to_culturetime: Days between the culture order and prior medication start
- medication_category: Name of medication category
- microbiology_cultures_prior_procedures.csv
- Information on procedures such as mechanical ventilation, parenteral nutrition, surgical procedures, dialysis, or catheterization (central venous and urethral) prior to culture order
- Key columns:
- procedure_description: Name of procedure
- procedure_time_to_culturetime: Days between the prior procedure and the culture order
- microbiology_cultures_ward_info.csv
- Location of the ward in which the culture was taken such as the intensive care unit, the emergency department, an inpatient setting, or an outpatient setting
- Key columns:
- hosp_ward_IP: Binary indicator of the culture location in an inpatient setting
- hosp_ward_OP: Binary indicator of the culture location in an outpatient setting
- hosp_ward_ER: Binary indicator of the culture location in the emergency department
- hosp_ward_ICU: Binary indicator of the culture location in the intensive care unit
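As one example of how the cohort file's columns might be used, the Python sketch below computes the share of resistant results for each organism and antibiotic pair. The rows are made up, the function name `resistance_fraction` is hypothetical, and the exact spelling and casing of the susceptibility values in the real file is an assumption, which is why the comparison is case-insensitive.

```python
from collections import Counter


def resistance_fraction(rows):
    """Return {(organism, antibiotic): resistant share} over rows with a result."""
    tested = Counter()
    resistant = Counter()
    for row in rows:
        result = row["susceptibility"]
        # Skip rows without a susceptibility result (missing values are "Null").
        if result is None or result == "Null":
            continue
        key = (row["organism"], row["antibiotic"])
        tested[key] += 1
        if result.strip().lower() == "resistant":
            resistant[key] += 1
    return {key: resistant[key] / tested[key] for key in tested}


# Made-up rows for illustration only; not taken from the dataset.
rows = [
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Resistant"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Susceptible"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Susceptible"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Intermediate"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Null"},
    {"organism": "KLEBSIELLA PNEUMONIAE", "antibiotic": "Ceftriaxone", "susceptibility": "resistant"},
]

fractions = resistance_fraction(rows)
```

Here intermediate results count as tested but not resistant. Analyses that group intermediate with resistant would need to change that one comparison.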
Code/software
- No special software is needed to access the dataset. The data files are all .csv (comma-separated values) files, which can be used with standard statistical software such as SAS, R, and Python.
- No additional scripts or software are provided with the dataset.
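Because several of the files are large (the comorbidity file is 1.30 GB), it may help to read them row by row instead of loading them whole. The following is a minimal Python sketch using only the standard library. The helper name `tally_column` and the sample rows are illustrative and are not part of the dataset.

```python
import csv
import io
from collections import Counter


def tally_column(handle, column):
    """Count the values of one column while streaming the file row by row."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row[column]] += 1
    return counts


# Made-up rows for illustration only. For a real file, pass an open handle:
#   with open("microbiology_cultures_cohort.csv", newline="") as f:
#       counts = tally_column(f, "culture_description")
sample = io.StringIO(
    "anon_id,culture_description,was_positive\n"
    "AB123,Urine,1\n"
    "CD456,Blood,0\n"
    "EF789,Urine,0\n"
)
counts = tally_column(sample, "culture_description")
```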
Access information
Other publicly accessible locations of the data:
- No other publicly accessible locations are available to download the dataset
Data was derived from the following sources:
- Data was sourced from the Epic-based electronic health records system of the University of Texas Southwestern Medical Center in Dallas, Texas, USA.
Human subjects data
This study was approved by the IRB committee at the University of Texas Southwestern Medical Center (#STU-2023-0583). Patient consent was not required because all data are deidentified and the study is secondary use in nature.
The data set is deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) and the National Institute of Standards and Technology (NIST) Safe Harbor regulations.
• All identifying patient information has been removed from the data set.
• Any patient, encounter, or culture order identification numbers have been anonymized.
• Demographic information has been removed with the exception of age and gender. Age has been bucketed into age ranges such as 18-24, 25-34, etc. and gender (male or female) has been adapted to a binary indicator of 0 or 1 without indication of which gender is represented by 0 or 1.
• Dates and times have been offset (jittered) by a random number of days within a two-month period while keeping consistent temporal relationships.
Our Antibiotic Resistance Microbiology Dataset: UTSW (ARMD: UTSW) data set comprises a longitudinal collection of Epic-based EHR data from the University of Texas Southwestern Medical Center (UTSW) for adults (≥18 years old) from 2005 to 2025. It includes deidentified microbiological laboratory results for urine, blood, and respiratory cultures. Also included are patient demographics, comorbidities, socioeconomic factors via the area deprivation index, and prior exposure to antibiotics and procedures. All data was collected from UTSW’s Epic Clarity database via Microsoft’s T-SQL based SQL Server Management Studio.
The raw data was then transformed into standardized values to assist with future research uses. This includes standardizing gender into two deidentified values of “0” or “1” (Null was used for missing gender data), bucketing patient ages at the time when the culture was taken into age ranges such as 18-24, 25-34, etc., and standardizing medication names into generic names for consistency. Additionally, susceptibility results have been standardized to values of “susceptible”, “intermediate”, “resistant”, “synergism”, and “inconclusive” to account for the various reporting means from different laboratories. Culture positivity was also standardized to a binary indicator of “0” or “1” based on the inclusion of susceptibility results for positive cultures. In addition, we accounted for patients with an active infection that might have multiple cultures taken in a short time period by excluding patients with prior microbiological cultures within the two weeks before the encounter.
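The two-week exclusion rule can be sketched in Python as follows. This is one possible reading and not the extraction logic actually used: it works at the level of individual cultures, treats a gap of exactly 14 days as inside the window, and compares each culture against the patient's most recent earlier culture whether or not that culture was itself kept. The function name `drop_recent_repeats` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def drop_recent_repeats(cultures, window=timedelta(days=14)):
    """Keep cultures with no earlier culture for the same patient inside the window.

    cultures: iterable of (anon_id, order_time) tuples.
    """
    last_seen = {}
    kept = []
    # Sort by patient, then time, so the previous row is the most recent prior culture.
    for anon_id, order_time in sorted(cultures, key=lambda c: (c[0], c[1])):
        previous = last_seen.get(anon_id)
        if previous is None or order_time - previous > window:
            kept.append((anon_id, order_time))
        # Track every culture, kept or not, as a possible "prior culture".
        last_seen[anon_id] = order_time
    return kept


# Made-up culture orders for illustration only.
cultures = [
    ("AB123", datetime(2015, 1, 1)),
    ("AB123", datetime(2015, 1, 5)),   # 4 days after the previous culture
    ("AB123", datetime(2015, 1, 10)),  # 5 days after the previous culture
    ("AB123", datetime(2015, 2, 1)),   # 22 days after the previous culture
    ("CD456", datetime(2015, 1, 3)),
]
kept = drop_recent_repeats(cultures)
```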
Deidentification was completed according to Safe Harbor regulations. All patient identifiers were either not included or were anonymized. Examples of this include the anonymization of patient identification numbers such as patient ID numbers, encounter ID numbers, and culture order ID numbers through a randomization process for each while keeping continuity between all individual patient records. While patient zip codes were used to identify values for the area deprivation index, they were removed prior to the final data set. As mentioned previously, patient ages were aggregated into age ranges and gender has been concealed with either “0”, “1”, or “Null”. All date and time information has been shifted with a randomly assigned offset for each patient. This allows for consistent offset values while still being able to account for possible temporal relationships within the dataset.
