This README describes the contents of the data files accompanying the paper
"Concept Co-occurrence From Millions of Clinical Narratives" by
Finlayson et al. (2014). It is intended to be read alongside the
Data Records section of the paper.

Contact: Nigam Shah, nigam@stanford.edu

====================================================================
OVERVIEW
====================================================================

There are 4 main data folders:

    Data File Folder 1 -- Co-Frequency Counts
    Data File Folder 2 -- Singleton Frequency Counts
    Data File Folder 3 -- Term and Concept ID Mappings
    Data File Folder 4 -- Processing Scripts

Each folder is archived using the tar command.
Each file is compressed using the gzip command.
A complete manifest is listed at the bottom of this README.

To EXTRACT a folder archive, use the tar command:

    %> tar -xvf foldername.tar

To UNZIP a file, use the GNU gunzip command:

    %> gunzip filename.txt.gz

====================================================================
FILE MANIFEST
====================================================================

Data File Folder 1: (co-frequency counts)

  FORMAT: tab-delimited; col 1 & 2 = term ID, col 3 = freq. count

    Term-level, per-bin agg., 1-day, 7-day ... inf-day windows:
      cofreqs_terms_perBin_1d.txt.gz
      cofreqs_terms_perBin_7d.txt.gz
      cofreqs_terms_perBin_30d.txt.gz
      cofreqs_terms_perBin_90d.txt.gz
      cofreqs_terms_perBin_180d.txt.gz
      cofreqs_terms_perBin_365d.txt.gz
      cofreqs_terms_perBin_alld.txt.gz

    Term-level, per-patient agg., 1-day, 7-day ... inf-day windows:
      cofreqs_terms_perPat_1d.txt.gz
      cofreqs_terms_perPat_7d.txt.gz
      cofreqs_terms_perPat_30d.txt.gz
      cofreqs_terms_perPat_90d.txt.gz
      cofreqs_terms_perPat_180d.txt.gz
      cofreqs_terms_perPat_365d.txt.gz
      cofreqs_terms_perPat_alld.txt.gz

  FORMAT: tab-delimited; col 1 & 2 = concept ID, col 3 = freq. count

    Concept-level, per-bin agg., 1-day, 7-day ... inf-day windows:
      cofreqs_concepts_perBin_1d.txt.gz
      cofreqs_concepts_perBin_7d.txt.gz
      cofreqs_concepts_perBin_30d.txt.gz
      cofreqs_concepts_perBin_90d.txt.gz
      cofreqs_concepts_perBin_180d.txt.gz
      cofreqs_concepts_perBin_365d.txt.gz
      cofreqs_concepts_perBin_alld.txt.gz

    Concept-level, per-patient agg., 1-day, 7-day ... inf-day windows:
      cofreqs_concepts_perPat_1d.txt.gz
      cofreqs_concepts_perPat_7d.txt.gz
      cofreqs_concepts_perPat_30d.txt.gz
      cofreqs_concepts_perPat_90d.txt.gz
      cofreqs_concepts_perPat_180d.txt.gz
      cofreqs_concepts_perPat_365d.txt.gz
      cofreqs_concepts_perPat_alld.txt.gz

Data File Folder 2: (singleton counts)

  FORMAT: tab-delimited; col 1 = term ID, col 2 = freq. count

    Term-level, per-bin agg., 1-day, 7-day ... inf-day windows:
      singlets_terms_perBin_1d.txt.gz
      singlets_terms_perBin_7d.txt.gz
      singlets_terms_perBin_30d.txt.gz
      singlets_terms_perBin_90d.txt.gz
      singlets_terms_perBin_180d.txt.gz
      singlets_terms_perBin_365d.txt.gz
      singlets_terms_perBin_alld.txt.gz

    Term-level, per-patient agg. (time window independent):
      singlets_terms_perPat.txt.gz

  FORMAT: tab-delimited; col 1 = concept ID, col 2 = freq. count

    Concept-level, per-bin agg., 1-day, 7-day ... inf-day windows:
      singlets_concepts_perBin_1d.txt.gz
      singlets_concepts_perBin_7d.txt.gz
      singlets_concepts_perBin_30d.txt.gz
      singlets_concepts_perBin_90d.txt.gz
      singlets_concepts_perBin_180d.txt.gz
      singlets_concepts_perBin_365d.txt.gz
      singlets_concepts_perBin_alld.txt.gz

    Concept-level, per-patient agg. (time window independent):
      singlets_concepts_perPat.txt.gz

Data File Folder 3: (term and concept mappings)

  FORMAT: tab-delimited; col 1 = term ID, col 2 = string
    1_term_ID_to_string.txt.gz     - mapping dict., term ID to strings

  FORMAT: tab-delimited; col 1 = concept ID, col 2 = string
    2a_concept_ID_to_string.txt.gz - mapping, concept ID to strings

  FORMAT: tab-delimited; col 1 = concept ID, col 2 = UMLS CUI
    2b_concept_ID_to_CUI.txt.gz    - mapping, concept ID to CUI

  FORMAT: tab-delimited; col 1 = term ID, col 2 = concept ID
    3_term_ID_to_concept_ID.txt.gz - mapping, term ID to concept ID

  FORMAT: one column of strings
    4_stopwords.txt.gz             - list of strings excluded from counts

Data File Folder 4: (processing scripts)

    decode_cofreqs.py  - transforms files in folder 1 to readable form
    decode_singlets.py - transforms files in folder 2 to readable form
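
====================================================================
EXAMPLE: READING A CO-FREQUENCY FILE (ILLUSTRATIVE SKETCH)
====================================================================

The formats listed in the manifest (three tab-delimited columns for
co-frequency files, two for the ID-to-string mappings) can be read
without decompressing to disk. The Python sketch below is NOT the
decode_cofreqs.py script shipped in Folder 4, whose interface is not
described in this README. The function names are hypothetical, and the
code assumes that the files are UTF-8 encoded, have no header row, and
that an ID appearing more than once in a mapping file should keep the
first string seen.

```python
import gzip


def load_id_to_string(mapping_path):
    """Load a two-column, tab-delimited, gzip-compressed mapping (ID -> string).

    Assumption: no header row; if an ID repeats, the first string is kept.
    """
    mapping = {}
    with gzip.open(mapping_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            mapping.setdefault(parts[0], parts[1])
    return mapping


def iter_cofreqs(cofreq_path):
    """Yield (id_1, id_2, count) tuples from a co-frequency file.

    Assumption: col 1 & 2 are IDs and col 3 is an integer count,
    as stated in the manifest; there is no header row.
    """
    with gzip.open(cofreq_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            yield parts[0], parts[1], int(parts[2])


def decode_cofreqs(cofreq_path, mapping_path):
    """Yield (string_1, string_2, count), replacing IDs with their strings.

    IDs missing from the mapping are passed through unchanged.
    """
    id_to_string = load_id_to_string(mapping_path)
    for id_1, id_2, count in iter_cofreqs(cofreq_path):
        yield id_to_string.get(id_1, id_1), id_to_string.get(id_2, id_2), count
```

Under these assumptions, iterating over
decode_cofreqs("cofreqs_terms_perBin_1d.txt.gz", "1_term_ID_to_string.txt.gz")
would yield one readable (term, term, count) tuple per input row. The
rows are streamed rather than loaded at once, since the co-frequency
files may be large.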