Antimicrobial Resistance Microbiological Dataset (ARMD-ECUH): A deidentified collection of electronic health records from a rural academic health system for antimicrobial resistance research
Data files
Nov 10, 2025 version files 3.94 GB
- 00_microbiology_cultures_cohort.csv (313.53 MB)
- 01_microbiology_cultures_adi_scores.csv (51.98 MB)
- 010_microbiology_cultures_prior_med.csv (194.66 MB)
- 011_microbiology_cultures_prior_procedures.csv (91.40 MB)
- 012_microbiology_cultures_ward_info.csv (52.13 MB)
- 013_microbiology_cultures_vitals.csv (122.55 MB)
- 02_microbiology_cultures_antibiotic_class_exposure.csv (227.75 MB)
- 03_microbiology_cultures_antibiotic_subtype_exposure.csv (251.85 MB)
- 04_microbiology_cultures_comorbidity.csv (1.62 GB)
- 05_microbiology_cultures_demographics.csv (31.19 MB)
- 06_microbiology_cultures_labs.csv (621.81 MB)
- 07_microbiology_cultures_microbial_resistance.csv (257.09 MB)
- 08_microbiology_cultures_nursing_home_visits.csv (47.47 MB)
- 09_microbiology_cultures_prior_infecting_organism.csv (58.13 MB)
- README.md (11.44 KB)
Abstract
As antimicrobial resistance becomes an increasingly pressing public health issue, high-quality, real-world data sets based on electronic health records remain scarce. To help remedy this, we have developed the Antimicrobial Resistance Microbiological Dataset: East Carolina University Health (ARMD-ECUH), which includes microbiological culture and susceptibility results for 261,217 patients from ECU Health from 2015 to 2025. The inclusion of longitudinal data such as patient demographics, prior medical histories, medications, and procedures adds to the value of the data set. Clinically relevant data, such as the locations where the cultures were collected, recent laboratory values, and vitals taken during the respective encounters, are also included. The deidentified ARMD-ECUH data set, with standardized data values to minimize the need for data transformations, is intended to help researchers across the globe improve patient outcomes and increase awareness and understanding of antimicrobial resistance.
Dataset DOI: 10.5061/dryad.7sqv9s55x
Description of the data and file structure
Data Description
Our data set comprises longitudinal electronic health records from ECU Health. It includes deidentified urine, respiratory, and blood microbiological culture results and susceptibilities from a cohort of adult patients (≥18 years old), regardless of culture positivity. The data set also includes prior medical history, such as comorbidities, prior medications, prior infections, and prior medical procedures, along with socioeconomic indicators. Vitals were collected for the encounter in which the culture was taken. The data set consists of 14 .csv files that can be used for future analyses. All identifying personal information has been removed or anonymized according to the deidentification methods described below.
All files are joined by the following identifiers:
- anon_id: the deidentified patient number
- pat_enc_csn_id_coded: the deidentified patient encounter number
- order_proc_id_coded: the deidentified unique culture order number
- order_time_jittered: the jittered timestamp for the unique culture order
All missing values have been replaced with “Null” values.
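As a minimal sketch of how the shared identifiers might be used, the Python below joins rows from two of the files on the four key columns and converts the literal “Null” marker to `None`. It uses only the standard library. The helper names (`read_rows`, `join_on_keys`), the sample rows, the ID values, and the timestamp format are all hypothetical and shown only for illustration.

```python
import csv
import io

# The four identifier columns shared by the data files (from the list above).
KEY_COLUMNS = (
    "anon_id",
    "pat_enc_csn_id_coded",
    "order_proc_id_coded",
    "order_time_jittered",
)


def read_rows(handle):
    """Read CSV rows as dicts, mapping the literal "Null" marker to None."""
    return [
        {name: (None if value == "Null" else value) for name, value in row.items()}
        for row in csv.DictReader(handle)
    ]


def join_on_keys(left_rows, right_rows, keys=KEY_COLUMNS):
    """Left-join two row lists on the shared identifier columns."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left_rows:
        # Left rows with no match are kept without the right-hand columns.
        for match in index.get(tuple(row[k] for k in keys), [None]):
            merged_row = dict(row)
            if match is not None:
                merged_row.update(match)
            joined.append(merged_row)
    return joined


# Hypothetical sample rows; real files would be opened with open(path, newline="").
cohort_csv = io.StringIO(
    "anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered,organism\n"
    "AB123,3001,4001,2020-01-05 10:00:00,Null\n"
)
ward_csv = io.StringIO(
    "anon_id,pat_enc_csn_id_coded,order_proc_id_coded,order_time_jittered,hosp_ward_ICU\n"
    "AB123,3001,4001,2020-01-05 10:00:00,1\n"
)
merged = join_on_keys(read_rows(cohort_csv), read_rows(ward_csv))
```

Because several of the files are hundreds of megabytes, a dataframe library or a database may be more practical for full-size joins; the key columns would be the same.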
Deidentification Methods
The data set is deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) and the National Institute of Standards and Technology (NIST) Safe Harbor regulations.
- All identifying patient information has been removed from the data set.
- Any patient, encounter, or culture order identification numbers have been anonymized. The anon_id was created by assigning each patient an identification number consisting of a random pair of letters accompanied by a random set of numbers. The pat_enc_csn_id_coded and order_proc_id_coded values were created by giving each encounter number or culture order a prefix (3 for pat_enc_csn_id_coded and 4 for order_proc_id_coded), accompanied by a serialized value based on the initial randomization of the accompanying anon_id. This allows each patient’s encounter and culture order identifiers to be linked to the appropriate patient while remaining anonymized.
- Demographic information has been removed except for age and gender. Age has been bucketed into age ranges such as 18-24, 25-34, etc., and gender (male or female) has been adapted to a binary indicator of 0 or 1 without indication of which gender is represented by 0 or 1.
- Dates and times have been offset (jittered) by a random number of days within two months while keeping consistent temporal relationships.
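To illustrate why a per-patient date offset keeps temporal relationships consistent, here is a minimal Python sketch. It is not the procedure used to build the data set: the function names are hypothetical, and the ±60-day range is only an assumed reading of "within two months".

```python
import random
from datetime import datetime, timedelta


def assign_offsets(patient_ids, max_days=60, seed=None):
    """Draw one random day offset per patient (assumed range: +/- max_days)."""
    rng = random.Random(seed)
    return {pid: timedelta(days=rng.randint(-max_days, max_days)) for pid in patient_ids}


def jitter(events, offsets):
    """Shift every (patient_id, timestamp) event by that patient's offset."""
    return [(pid, ts + offsets[pid]) for pid, ts in events]


# Hypothetical events for one made-up patient.
events = [
    ("AB123", datetime(2020, 1, 5, 10, 0)),  # culture order
    ("AB123", datetime(2020, 1, 2, 8, 30)),  # earlier antibiotic exposure
]
offsets = assign_offsets({pid for pid, _ in events}, seed=0)
shifted = jitter(events, offsets)

# The gap between a patient's events is unchanged by the shift.
original_gap = events[0][1] - events[1][1]
shifted_gap = shifted[0][1] - shifted[1][1]
```

Since every timestamp for a given patient moves by the same amount, day-difference columns such as those in the files described below are unaffected, while absolute calendar dates no longer match the original records.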
Files and variables
- 00_microbiology_cultures_cohort.csv
  - The comprehensive listing of culture types, organisms, and susceptibility testing results
  - Key columns:
    - ordering_mode: The mode of the culture order (Inpatient or Outpatient)
    - culture_description: The type of culture ordered (Blood, Respiratory, or Urine)
    - was_positive: Binary indicator of a positive culture result (0 or 1)
    - organism: Name of organism
    - antibiotic: Name of antibiotic tested
    - susceptibility: Susceptibility results such as susceptible, intermediate, or resistant per antibiotic
- 01_microbiology_cultures_adi_scores.csv
  - Area Deprivation Index (ADI) information for both national and state ranks, mapped to patient nine-digit zip codes (five-digit where nine-digit codes were unavailable)
  - Key columns:
    - adi_score: ADI percentile score at the national level; higher values indicate higher levels of deprivation
    - adi_state_rank: ADI decile rankings at the state (North Carolina, USA) level; higher values indicate higher levels of deprivation
- 02_microbiology_cultures_antibiotic_class_exposure.csv
  - Prior exposure to various antibiotic classes
  - Key columns:
    - medication_name: Name of antibiotic
    - antibiotic_class: Class of antibiotic
    - time_to_culturetime: Days between culture order and antibiotic exposure
- 03_microbiology_cultures_antibiotic_subtype_exposure.csv
  - Prior exposure to antibiotic subclasses
  - Key columns:
    - medication_name: Name of antibiotic
    - antibiotic_subtype: Subclass of antibiotic
    - antibiotic_subtype_category: Categorical name of antibiotic subclass
    - medication_time_to_culturetime: Days between culture order and antibiotic exposure
- 04_microbiology_cultures_comorbidity.csv
  - Comorbidity information for each patient in the cohort
  - Key columns:
    - comorbidity_component: Comorbidity component type based on either the AHRQ CCSR diagnosis or Elixhauser Comorbidity Index
    - comorbidity_component_start_days_culture: Days between the start of the comorbidity and the culture order
    - comorbidity_component_end_days_culture: Days between the end of the comorbidity and the culture order, with “Null” indicating an ongoing comorbidity and negative values indicating comorbidities that ended after the culture order
- 05_microbiology_cultures_demographics.csv
  - Basic demographic information for each patient in the cohort
  - Key columns:
    - age: The age range of the patient with values such as 18-24, 25-34, and above 90
    - gender: Binary indicator of 0 or 1 to indicate a particular gender. The indicators are purposely not identified for deidentification purposes.
- 06_microbiology_cultures_labs.csv
  - Laboratory results recorded for the following:
    - wbc: White blood cell count (k/μL)
    - neutrophils: Neutrophils (%)
    - lymphocytes: Lymphocytes (%)
    - hgb: Hemoglobin (mg/dL)
    - plt: Platelets (k/μL)
    - na: Sodium (mmol/L)
    - hco3: Bicarbonate (mmol/L)
    - bun: Blood urea nitrogen (mg/dL)
    - procalcitonin: Procalcitonin (ng/mL)
  - Key columns:
    - Period_Day: Days between the culture order and the time range in which the laboratory values were measured
    - Q25_xx: Lower quartile laboratory values for listed measurements
    - median_xx: Median laboratory values for listed measurements
    - Q75_xx: Upper quartile laboratory values for listed measurements
    - first_xx: First laboratory value for listed measurements
    - last_xx: Last laboratory value for listed measurements
- 07_microbiology_cultures_microbial_resistance.csv
  - Microbial resistance information for each culture ordered
  - Key columns:
    - organism: Name of organism
    - antibiotic: Name of antibiotic
    - Resistant_time_to_culturetime: Days between the culture order and the susceptibility result
- 08_microbiology_cultures_nursing_home_visits.csv
  - Information about visits to nursing facilities prior to the date of the culture order
  - Key columns:
    - nursing_home_visit_culture: Days between the culture order and nursing facility visit
- 09_microbiology_cultures_prior_infecting_organism.csv
  - Information on identified organisms prior to the culture order
  - Key columns:
    - prior_organism: Name of organism
    - prior_infecting_organism_days_to_culture: Days between the culture order and prior infection
- 010_microbiology_cultures_prior_med.csv
  - Information regarding medications used prior to the culture order
  - Key columns:
    - medication_name: Name of prior medication used
    - medication_time_to_culturetime: Days between the culture order and prior medication start
    - medication_category: Name of medication category
- 011_microbiology_cultures_prior_procedures.csv
  - Information on procedures such as mechanical ventilation, parenteral nutrition, surgical procedures, dialysis, or catheterization (central venous and urethral) prior to culture order
  - Key columns:
    - procedure_description: Name of procedure
    - procedure_time_to_culturetime: Days between the prior procedure and the culture order
- 012_microbiology_cultures_ward_info.csv
  - Location of the ward in which the culture was taken, such as the intensive care unit, the emergency department, an inpatient setting, or an outpatient setting
  - Key columns:
    - hosp_ward_IP: Binary indicator of the culture location in an inpatient setting
    - hosp_ward_OP: Binary indicator of the culture location in an outpatient setting
    - hosp_ward_ER: Binary indicator of the culture location in the emergency department
    - hosp_ward_ICU: Binary indicator of the culture location in the intensive care unit
- 013_microbiology_cultures_vitals.csv
  - Vitals recorded for the following:
    - heartrate: Heart Rate (Pulse)
    - resprate: Respiratory Rate
    - temp: Temperature in Fahrenheit
    - sysbp: Systolic Blood Pressure
    - diasbp: Diastolic Blood Pressure
  - Key columns:
    - Period_Day: Days between the culture order and the time range in which the vitals were measured
    - Q25_xx: Lower quartile vital sign values for listed measurements
    - median_xx: Median vital sign values for listed measurements
    - Q75_xx: Upper quartile vital sign values for listed measurements
    - first_xx: First vital sign value for listed measurements
    - last_xx: Last vital sign value for listed measurements
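As one possible use of the cohort file's key columns, the sketch below computes the fraction of tested isolates reported as resistant for each organism and antibiotic pair. The function name and the sample rows are hypothetical, and the exact spelling and casing of organism, antibiotic, and susceptibility values in the real file may differ from what is shown here.

```python
from collections import defaultdict


def resistance_rates(rows):
    """Return {(organism, antibiotic): fraction resistant} over tested rows."""
    counts = defaultdict(lambda: [0, 0])  # [resistant, total tested]
    for row in rows:
        result = row.get("susceptibility")
        if result in (None, "Null"):
            continue  # no susceptibility result recorded for this row
        key = (row["organism"], row["antibiotic"])
        counts[key][1] += 1
        if result.strip().lower() == "resistant":
            counts[key][0] += 1
    return {key: resistant / total for key, (resistant, total) in counts.items()}


# Hypothetical rows shaped like 00_microbiology_cultures_cohort.csv.
rows = [
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Resistant"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Susceptible"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Ciprofloxacin", "susceptibility": "Null"},
    {"organism": "ESCHERICHIA COLI", "antibiotic": "Nitrofurantoin", "susceptibility": "Susceptible"},
]
rates = resistance_rates(rows)
```

In this sketch, intermediate and other non-resistant results count toward the denominator, and repeat cultures from the same patient are not deduplicated; both are analysis choices that a real study would need to make explicitly.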
Code/software
- No special software is needed to access the data set. The data files are all .csv (comma-separated values) files, which can be read by standard statistical software such as SAS, R, and Python.
- No additional scripts or software are provided with the data set.
Access information
Sharing/Access
- No other publicly accessible locations are available to download the data set.
- The data were sourced from the Epic-based electronic health records system of ECU Health in Greenville, NC, USA.
Human subjects data
This study was approved by the IRB committee at East Carolina University (UMCIRB 24-001121). Patient consent was not required because all data are deidentified and intended for secondary use.
Our Antimicrobial Resistance Microbiological Dataset - East Carolina University Health (ARMD-ECUH) data set is a longitudinal collection of Epic-based EHR data for adults (≥18 years old) in the ECU Health (ECUH) health system from 2015 to 2025 (prior to date jittering for deidentification purposes). The data set includes deidentified microbiological laboratory results for blood, urine, and respiratory cultures. Encounter-based vitals, patient demographics, comorbidities, socioeconomic factors quantified by the Area Deprivation Index (ADI), and prior exposures to medications and procedures are included in the data set. All data were collected via a Microsoft Fabric-supported data warehouse, which contains daily updates from ECUH’s Epic Clarity database. The data were queried using Spark SQL in Fabric notebooks.
Following methods similar to those described for ARMD and ARMD-UTSW, all raw data were standardized to assist with future research applications. Gender was coded into two anonymized values of “0” or “1” (Null was used for missing gender data), and patient ages at the time the culture was taken were divided into age ranges such as 18-24, 25-34, 35-44, etc. Medication names were standardized to generic names. To account for the various reporting methods used by different microbiology laboratories, susceptibility results were consolidated to the values of “susceptible”, “intermediate”, “resistant”, “synergism”, and “inconclusive.” Binary indicators of “0” or “1” designate culture positivity, as determined by the inclusion of susceptibility results for positive cultures. Patients with possible active infections, identified by having multiple cultures within the two weeks preceding the encounter, were excluded from the data set.
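The age standardization can be sketched as a lookup over bucket lower bounds. This is an illustration only, not the code used to build the data set: the text names the 18-24, 25-34, and 35-44 ranges and a top "above 90" value, so the intermediate edges, the 85-89 bucket, and the treatment of age 90 itself are assumptions, as is the function name.

```python
import bisect

# Lower bound of each bucket. Only 18-24, 25-34, 35-44 and a top "above 90"
# value are named in the text; the edges in between are assumed.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65, 75, 85, 90]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85-89", "above 90"]


def age_bucket(age):
    """Map an age in whole years to its range label; None if missing or under 18."""
    if age is None or age < 18:
        return None
    # bisect_right counts how many lower bounds are <= age.
    return LABELS[bisect.bisect_right(LOWER_BOUNDS, age) - 1]


youngest = age_bucket(18)
middle = age_bucket(34)
oldest = age_bucket(95)
```

Because the released files contain only the range labels, a lookup like this is mainly useful for aligning external data to the same buckets before comparison.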
Safe Harbor regulations were utilized for the deidentification process, including removing or anonymizing all patient identifiers. This includes identifiers such as patient ID numbers, encounter ID numbers, and culture order ID numbers, which were anonymized through a process that produced randomized identification numbers consistent between all individual patient records. Socioeconomic factors were identified using the ADI, which requires the use of patient zip codes. Zip codes were removed after the identification of the ADI values. All datetime information has been offset by a randomly assigned number of days per patient, while keeping possible temporal relationships intact. Age and gender for patients were deidentified using the methods described previously.
