FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling
Data files
Jul 16, 2024 version files 1.33 GB
- bp_terms.pkl (274.89 KB)
- cc_terms.pkl (39.44 KB)
- go_descriptions1.4.txt (2.71 MB)
- go1.4-basic.obo (31.13 MB)
- interpro_domain_dataset.csv (824.76 MB)
- mf_terms.pkl (107.80 KB)
- mol_instruction_catalytic_activity.json (64.90 MB)
- mol_instruction_domain_motif.json (50 MB)
- mol_instruction_general_function.json (127.40 MB)
- mol_instruction_protein_function.json (154.65 MB)
- README.md (2.60 KB)
- test_exp_prompt_bp.csv (1.48 MB)
- test_exp_prompt_cc.csv (1.22 MB)
- test_exp_prompt_mf.csv (976.07 KB)
- train_exp_prompt_bp.csv (28.50 MB)
- train_exp_prompt_cc.csv (23.29 MB)
- train_exp_prompt_mf.csv (18.79 MB)
- val_exp_prompt_bp.csv (1.50 MB)
- val_exp_prompt_cc.csv (1.24 MB)
- val_exp_prompt_mf.csv (970.53 KB)
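The prompt and term files in this listing follow a regular naming pattern (`{split}_exp_prompt_{ontology}.csv` and `{ontology}_terms.pkl`). The sketch below only builds those file names; it does not open the files, because their columns and contents are not described here. The helper names are hypothetical, and reading `bp`, `cc`, and `mf` as the three GO namespaces (biological process, cellular component, molecular function) is an assumption based on standard GO usage.

```python
from itertools import product

# Splits and ontology suffixes as they appear in the file listing.
SPLITS = ("train", "val", "test")
# Assumed meaning: bp = biological process, cc = cellular component,
# mf = molecular function (standard GO namespaces; not stated on this page).
ONTOLOGIES = ("bp", "cc", "mf")


def prompt_file(split: str, ontology: str) -> str:
    """Return the prompt CSV name for a split and ontology (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return f"{split}_exp_prompt_{ontology}.csv"


def terms_file(ontology: str) -> str:
    """Return the pickled term-list name for an ontology (hypothetical helper)."""
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return f"{ontology}_terms.pkl"


def all_prompt_files() -> list[str]:
    """List all nine prompt CSV names, ordered by split and then ontology."""
    return [prompt_file(s, o) for s, o in product(SPLITS, ONTOLOGIES)]


if __name__ == "__main__":
    for name in all_prompt_files():
        print(name)
```

Once the files are downloaded, these names can be joined to a local directory path and inspected with a CSV or pickle reader; README.md in the listing is the place to check for the actual schema.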
Abstract
Assigning accurate property labels to proteins, such as functional terms and catalytic activity, is challenging, especially for proteins without homologs and for “tail labels” with few known examples. Unlike previous methods, which focused mainly on protein sequence features, we use a pretrained large language model to capture the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. The model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels at understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and on in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, such as taxonomy information, which improves both its predictive performance and its explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation.
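To make the idea of contrastively linking protein and text representations concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is an illustration of the general technique, not FAPM's confirmed training objective: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions introduced for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(protein_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss; pair i is (protein_embs[i], text_embs[i]).

    temperature=0.07 is a hypothetical default, not a value from the source.
    """
    p = [normalize(v) for v in protein_embs]
    t = [normalize(v) for v in text_embs]
    # Similarity of every protein to every text, scaled by the temperature.
    logits = [[dot(pi, tj) / temperature for tj in t] for pi in p]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the protein-to-text and text-to-protein directions.
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


if __name__ == "__main__":
    proteins = [[1.0, 0.0], [0.0, 1.0]]       # toy protein embeddings
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]  # each text aligned with its protein
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]  # pairs deliberately mismatched
    print("matched:", symmetric_contrastive_loss(proteins, matched_texts))
    print("swapped:", symmetric_contrastive_loss(proteins, swapped_texts))
```

The loss is small when each protein embedding is most similar to its own label text and large when the pairs are mismatched, which is the behavior a contrastive objective rewards during training.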