Gene-language models are whole genome representation learners
Data files
Feb 28, 2024 version files 52.27 MB
- narms_2017_and_2022_genespace.dir.zarr.zip
- narms_metadata_2017.csv
- narms_metadata_2022.csv
- narms_phenotypes_2017.csv
- README.md
Abstract
The language of genetic code embodies a complex grammar and rich syntax of interacting molecular elements. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from inherent semantic structure. We present a gene-based language model that generates whole-genome vector representations from a population of 16 disease-causing bacterial species by leveraging natural contrastive characteristics between individuals. To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets for use in optimization. Using this foundational objective, we trained a Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population structure characteristics, like delineations across serotypes and host specificity preferences. Their vector quantities encoded the relevant functional information necessary to achieve state-of-the-art genomic supervised prediction accuracy in 11 out of 12 antibiotic resistance phenotypes.
README: Gene-language models are whole genome representation learners
This directory holds the cleaned phenotype and metadata tables for the 2017 and updated 2022 NARMS populations, together with the genespace Zarr file.
[Metadata]
Two metadata files, narms_metadata_2017.csv and narms_metadata_2022.csv, contain various qualifiers for each sample belonging to the NARMS population used in the study. For example, the 'serotype' column gives each sample's serotype designation. The index (first column) is the original accession ID for the corresponding Sequence Read Archive entry.
[Phenotypes]
A phenotype file (narms_phenotypes_2017.csv) contains the collection of phenotypic qualities for each sample belonging to the NARMS population used in the study. Virtually all phenotype columns report the minimum inhibitory concentration (MIC) of a particular drug as a measure of resistance. For example, 'mic_ampicillin' holds the ampicillin MIC value for each sample. The index (first column) is the original accession ID for the corresponding Sequence Read Archive entry.
[Genespace]
A Zarr file in DirectoryStore format, used by the model to construct the differential AB genespace for both the 2017 and 2022 datasets. The 'ids' z-group contains the sample identifiers that correspond to the metadata and phenotypes. The 'gene_presence_absence_matrix' z-group contains the binary presence and absence array for genes. 'gene_names' contains the gene identifiers that correspond to each column in 'gene_presence_absence_matrix'. 'database_names' contains the databases that were queried for each identified gene. The Zarr package is required to interface with this file.
Detailed column descriptions for the metadata and phenotype files, technical notes for each dataset, and a Python usage example for the .zarr file are given below:
[Metadata]
A comma-delimited metadata file containing various qualifiers for each sample belonging to the NARMS population used in the study.
-Columns for narms_metadata_2017.csv
accession_id: The original accession ID for the corresponding Sequence Read Archive.
narms_sample_id: Internal CDC NARMS sample ID.
narms_isolate_id: Internal CDC NARMS isolate ID.
received_date: The date that the meat sample was received.
sellby_date: The sell-by date of the meat that the sample was extracted from.
acquisition_date: The date that the sample was first acquired.
is_organic: Indicates if the sample meat was listed as 'organic' or not.
is_labled_antibiotic_free: Indicates if the sample meat was listed as being antibiotic free or not.
serotype: Serotype of the bacterial isolate. This is usually based on what the submitting laboratory reported, but may be updated if identification is performed at CDC and a different serotype result is obtained.
species: The species of the isolated bacteria.
country_of_origin: The origin country for the meat source.
state: The origin state of the meat source.
collected_month: The month that the sample was collected.
collected_year: The year that the sample was collected.
agency: The collecting agency.
host_species: The host organism of the isolate.
meat_type: The type of retail meat.
meat_cut: The type of cut (e.g. ground, cut, etc.).
source: The original source meat.
source_species_info: Optional additional information to contextualize the source species.
testing_plate: The type of antimicrobial agent dilution panel used.
aminoglycoside_resist_genes: Genes identified within the isolate that may confer resistance to aminoglycosides.
beta-lactam_resist_genes: Genes identified within the isolate that may confer resistance to beta-lactams.
glycopeptide_resist_genes: Genes identified within the isolate that may confer resistance to glycopeptides.
macrolide_resist_genes: Genes identified within the isolate that may confer resistance to macrolides.
quaternary_ammonium_resist_genes: Genes identified within the isolate that may confer resistance to quaternary ammonium compounds.
quinolone_resist_genes: Genes identified within the isolate that may confer resistance to quinolones.
sulfonamide_resist_genes: Genes identified within the isolate that may confer resistance to sulfonamides.
tetracycline_resist_genes: Genes identified within the isolate that may confer resistance to tetracyclines.
trimethoprim_resist_genes: Genes identified within the isolate that may confer resistance to trimethoprim.
phenicol_resist_genes: Genes identified within the isolate that may confer resistance to phenicols.
other_classes_resist_genes: Genes identified within the isolate that may confer resistance to other antibiotics.
Note 1: Because these fields were recorded live in the field by many distributed collectors, some cells may be empty.
Note 2: For resist_gene columns, an empty cell indicates no annotation matches for that resistance category.
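Note 2 can be applied directly when reading the file. The following is a minimal sketch using only the Python standard library; the two-row CSV excerpt, its accession IDs, and the function name summarize_resist_columns are hypothetical and only illustrate the empty-cell convention, not the real file contents.

```python
import csv
import io

# Hypothetical two-row excerpt; accession IDs and gene names are made up.
SAMPLE_CSV = """accession_id,serotype,tetracycline_resist_genes,quinolone_resist_genes
SRR0000001,Typhimurium,tet(A),
SRR0000002,,,
"""

def summarize_resist_columns(rows):
    """Count, per *_resist_genes column, the samples with at least one annotation."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column.endswith("_resist_genes"):
                # Per Note 2, an empty cell means no annotation match.
                has_match = bool((value or "").strip())
                counts[column] = counts.get(column, 0) + (1 if has_match else 0)
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
resist_counts = summarize_resist_columns(rows)
print(resist_counts)
```

To run this against the real table, the io.StringIO object would be replaced by an open file handle for narms_metadata_2017.csv.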
-Columns for narms_metadata_2022.csv
accession_id: The original accession ID for the corresponding Sequence Read Archive.
narms_specimen_id: Internal CDC NARMS sample ID used as the unique identification number of the specimen.
narms_wgs_id: Unique ID assigned by PulseNet for an assembled whole-genome sequence. Used to identify sequences uploaded to NCBI.
is_ast_approved: If yes (true), the isolate was collected for routine surveillance purposes and underwent antimicrobial susceptibility testing (AST) at CDC and AST results are approved by CDC NARMS. If no (false), the isolate was not collected for routine surveillance purposes, or did not undergo antimicrobial susceptibility testing (AST) at CDC, or AST results are not approved by CDC NARMS.
is_wgs_approved: If yes (true), the isolate underwent whole genome sequencing (WGS), the WGS data was screened for resistance genes, and results are approved by CDC NARMS. If no (false), the isolate did not undergo whole genome sequencing (WGS), or WGS data was not screened for resistance genes, or the results are not approved by CDC NARMS.
species: Species of the bacterial isolate. This is usually based on what the submitting laboratory reported, but may be updated if identification is performed at CDC and a different species result is obtained. Species is confirmed at CDC for all Campylobacter submitted for NARMS surveillance.
serotype: Serotype of the bacterial isolate. This is usually based on what the submitting laboratory reported, but may be updated if identification is performed at CDC and a different serotype result is obtained.
collected_year: The calendar year for the isolate (i.e. the calendar year is assigned based on the specimen collection date).
region: HHS (Health & Human Services) Region is the geographic location for the state that submitted the isolate to NARMS.
age_group: Age category (in years: 0-4, 5-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80+)
specimen_source: Type of specimen from which the isolate was obtained (i.e. blood or stool)
resistance_pattern: Antibiotics to which the isolate was found to be resistant; otherwise, will read "No resistance detected"
resistance_determinants: Genes or mutations known to predict resistance found through whole genome sequencing. If none were found, will read "No determinants detected". If whole genome sequencing was not performed on the isolate, will read "Not sequenced". If whole genome sequencing was performed on the isolate, but it has not yet been analyzed for the presence of resistance genes, will read "Not analyzed".
predictive_resistance_pattern: Predicted resistance pattern based on resistance determinants. If no determinants were found, will read "No determinants detected". If whole genome sequencing was not performed on the isolate, will read "Not sequenced". If whole genome sequencing was performed on the isolate, but it has not yet been analyzed for the presence of resistance genes, will read "Not analyzed".
lost_resistance_on_retest: If yes (true), the isolate was found to have lost resistance, through some mechanism, when retested.
Note 1: Because these fields were recorded live in the field by many distributed collectors, some cells may be empty.
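The sentinel strings described for resistance_determinants and predictive_resistance_pattern can be separated from real determinant lists before analysis. The sketch below assumes the cells match the quoted strings exactly (case and spacing included); the status labels and the function name determinant_status are hypothetical.

```python
# Sentinel strings quoted from the column descriptions above.
SENTINELS = {
    "No determinants detected": "none_detected",
    "Not sequenced": "not_sequenced",
    "Not analyzed": "not_analyzed",
}

def determinant_status(cell):
    """Map a resistance_determinants cell to a coarse status label."""
    text = (cell or "").strip()
    if not text:
        # Note 1: some cells may simply be empty.
        return "missing"
    # Anything that is not a sentinel is treated as a determinant list.
    return SENTINELS.get(text, "has_determinants")

# Hypothetical cell values for illustration only.
examples = ["No determinants detected", "Not sequenced", "Not analyzed", "", "blaTEM-1"]
statuses = [determinant_status(value) for value in examples]
print(statuses)
```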
[Phenotypes]
A comma-delimited phenotype file containing the collection of phenotypic qualities for each sample belonging to the NARMS population used in the study. Virtually all phenotype columns report the minimum inhibitory concentration (MIC) of a particular drug as a measure of resistance.
-Columns for narms_phenotypes_2017.csv
accession_id: The original accession ID for the corresponding Sequence Read Archive.
mic_amoxicillin-clavulanic_acid: The minimum inhibitory concentration for amoxicillin-clavulanic acid.
mic_amikacin: The minimum inhibitory concentration for amikacin.
mic_ampicillin: The minimum inhibitory concentration for ampicillin.
mic_apramycin: The minimum inhibitory concentration for apramycin.
mic_aztreonam: The minimum inhibitory concentration for aztreonam.
mic_ceftriaxone: The minimum inhibitory concentration for ceftriaxone.
mic_azithromycin: The minimum inhibitory concentration for azithromycin.
mic_benzalkonium-chloride: The minimum inhibitory concentration for benzalkonium chloride.
mic_ceftazidime: The minimum inhibitory concentration for ceftazidime.
mic_cephalothin: The minimum inhibitory concentration for cephalothin.
mic_chloramphenicol: The minimum inhibitory concentration for chloramphenicol.
mic_ciprofloxacin: The minimum inhibitory concentration for ciprofloxacin.
mic_clindamycin: The minimum inhibitory concentration for clindamycin.
mic_trimethoprim-sulfamethoxazole: The minimum inhibitory concentration for trimethoprim-sulfamethoxazole.
mic_cefotaxime: The minimum inhibitory concentration for cefotaxime.
mic_daptomycin: The minimum inhibitory concentration for daptomycin.
mic_doxycycline: The minimum inhibitory concentration for doxycycline.
mic_erythromycin: The minimum inhibitory concentration for erythromycin.
mic_cefepime: The minimum inhibitory concentration for cefepime.
mic_florfenicol: The minimum inhibitory concentration for florfenicol.
mic_sulfisoxazole: The minimum inhibitory concentration for sulfisoxazole.
mic_cefoxitin: The minimum inhibitory concentration for cefoxitin.
mic_gentamicin: The minimum inhibitory concentration for gentamicin.
mic_imipenem: The minimum inhibitory concentration for imipenem.
mic_kanamycin: The minimum inhibitory concentration for kanamycin.
mic_lincomycin: The minimum inhibitory concentration for lincomycin.
mic_linezolid: The minimum inhibitory concentration for linezolid.
mic_meropenem: The minimum inhibitory concentration for meropenem.
mic_nalidixic_acid: The minimum inhibitory concentration for nalidixic acid.
mic_nitrofurantoin: The minimum inhibitory concentration for nitrofurantoin.
mic_penicillin: The minimum inhibitory concentration for penicillin.
mic_piperacillin-tazobactam: The minimum inhibitory concentration for piperacillin-tazobactam.
mic_quinupristin-dalfopristin: The minimum inhibitory concentration for quinupristin-dalfopristin.
mic_sulfamethoxazole: The minimum inhibitory concentration for sulfamethoxazole.
mic_streptomycin: The minimum inhibitory concentration for streptomycin.
mic_telithromcyin: The minimum inhibitory concentration for telithromycin.
mic_tetracycline: The minimum inhibitory concentration for tetracycline.
mic_tigecycline: The minimum inhibitory concentration for tigecycline.
mic_ceftiofur: The minimum inhibitory concentration for ceftiofur.
mic_tylosin: The minimum inhibitory concentration for tylosin.
mic_vancomycin: The minimum inhibitory concentration for vancomycin.
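Because the metadata and phenotype tables share accession_id as their first column, they can be joined on it. The following is a minimal standard-library sketch; the CSV excerpts, accession IDs, and MIC values are made up, and MIC cells are deliberately left as strings because the exact value format in the real file is not assumed here.

```python
import csv
import io

# Hypothetical excerpts; IDs and values are made up for illustration only.
METADATA_CSV = """accession_id,serotype
SRR0000001,Typhimurium
SRR0000002,Enteritidis
"""
PHENOTYPES_CSV = """accession_id,mic_ampicillin
SRR0000001,32
SRR0000003,1
"""

def index_by_accession(text):
    """Read CSV text into a dict keyed by the accession_id column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["accession_id"]: row for row in reader}

metadata = index_by_accession(METADATA_CSV)
phenotypes = index_by_accession(PHENOTYPES_CSV)

# Inner join: keep only accessions present in both tables.
joined = {
    accession: {**metadata[accession], **phenotypes[accession]}
    for accession in metadata.keys() & phenotypes.keys()
}
print(joined)
```

An inner join is used so that samples lacking either metadata or phenotypes are dropped rather than padded with empty values; a different join may suit other analyses.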
[Genespace]
A heterogeneous file object used by the Transformer model to construct the differential AB genespace. To allow for real-time training, the Zarr data format (https://zarr.readthedocs.io/en/stable/) is used. Specifically, this file is a zarr.storage file object in DirectoryStore format. While the directory can be inspected and traversed with standard Linux utilities (cd, ls, etc.), it is recommended that the official Zarr interface be used for any interactions instead.
Usage example with Python and zarr:
# Import the mandatory zarr package
import zarr
# Open the store read-only. The path is an assumption: it presumes the .zip archive
# was extracted into the working directory under this name.
z_root = zarr.open('narms_2017_and_2022_genespace.dir.zarr', mode='r')
# To view the sample identifiers (accession IDs) that correspond to the metadata and phenotypes:
print(z_root['genotypes']['ids'][:])
# To view the gene identifiers that correspond to each column in 'gene_presence_absence_matrix':
print(z_root['genotypes']['gene_names'][:])
# To view the databases that were queried for each identified gene:
print(z_root['genotypes']['database_names'][:])
# To view the binary presence and absence array for genes, load the 'gene_presence_absence_matrix' z-group:
print(z_root['genotypes']['gene_presence_absence_matrix'][:])
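Once the arrays are loaded, a gene column can be mapped back to sample identifiers. The sketch below uses small in-memory lists as hypothetical stand-ins for 'ids', 'gene_names', and 'gene_presence_absence_matrix'. It assumes that each matrix row corresponds to the sample at the same position in 'ids', which the description above implies but does not state outright. The function name carriers_of and the gene names are made up.

```python
def carriers_of(gene, ids, gene_names, matrix):
    """Return the sample IDs whose row has a 1 in the column for `gene`."""
    # Columns of the matrix follow the order of gene_names.
    column = list(gene_names).index(gene)
    # Assumption: row i of the matrix belongs to ids[i].
    return [sample for sample, row in zip(ids, matrix) if row[column] == 1]

# Hypothetical stand-ins for the arrays read from the Zarr store.
ids = ["SRR0000001", "SRR0000002", "SRR0000003"]
gene_names = ["geneA", "geneB"]
matrix = [
    [1, 0],
    [0, 1],
    [1, 1],
]

gene_a_carriers = carriers_of("geneA", ids, gene_names, matrix)
print(gene_a_carriers)
```

The returned identifiers can then be used to look up rows in the metadata or phenotype tables by accession ID.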