Data from: Protein Set Transformer: A protein-based genome language model to power high diversity viromics
Data files
Sep 19, 2024 version files 214.56 GB
- aai.tar.gz (40.81 GB)
- esm-large_protein_embeddings.tar.gz (30.58 GB)
- esm-small_protein_embeddings.tar.gz (15.26 GB)
- fasta.tar.gz (4.69 GB)
- genome_clusters.tar.gz (121.03 MB)
- genome_embeddings.tar.gz (4.74 GB)
- genslm_protein_embeddings.tar.gz (23.10 GB)
- host_prediction.tar.gz (1.24 GB)
- protein_clusters.tar.gz (11.68 GB)
- pst-large_protein_embeddings.tar.gz (60.20 GB)
- pst-small_protein_embeddings.tar.gz (19.22 GB)
- README.md (19.62 KB)
- supplementary_data.tar.gz (786.27 MB)
- supplementary_tables.zip (183.57 MB)
- trained_models.tar.gz (1.95 GB)
Jun 18, 2025 version files 235.10 GB
- esm_embeddings.tar.gz (45.46 GB)
- fasta.tar.gz (4.77 GB)
- foldseek_databases.tar.gz (1.20 GB)
- genome_clusters.tar.gz (4.97 MB)
- genslm_ORF_embeddings.h5 (23.32 GB)
- host_prediction.tar.gz (2.34 GB)
- IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5 (20.58 GB)
- IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5 (10.19 GB)
- IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5 (30.94 GB)
- IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5 (10.18 GB)
- MGnify_test_set_PST-TL-P__large_protein_embeddings.h5 (401.41 MB)
- MGnify_test_set_PST-TL-P__small_protein_embeddings.h5 (200.45 MB)
- MGnify_test_set_PST-TL-T__large_protein_embeddings.h5 (622.79 MB)
- MGnify_test_set_PST-TL-T__small_protein_embeddings.h5 (200.48 MB)
- other_genome_embeddings.tar.gz (8.46 GB)
- protein_clusters.tar.gz (197.29 MB)
- PST_training_set_PST-TL-P__large_protein_embeddings.h5 (18.28 GB)
- PST_training_set_PST-TL-P__small_protein_embeddings.h5 (9.04 GB)
- PST_training_set_PST-TL-T__large_protein_embeddings.h5 (27.94 GB)
- PST_training_set_PST-TL-T__small_protein_embeddings.h5 (9.05 GB)
- PST-MLM.tar.gz (4.35 GB)
- PST-TL_genome_embeddings.tar.gz (2.61 GB)
- PST-TL-P__large.ckpt.gz (221.45 MB)
- PST-TL-P__small.ckpt.gz (56.06 MB)
- PST-TL-T__large.ckpt.gz (1.90 GB)
- PST-TL-T__small.ckpt.gz (56.13 MB)
- README.md (26.66 KB)
- supplementary_data.tar.gz (2.32 GB)
- supplementary_tables.zip (214.79 MB)
Sep 30, 2025 version files 235.14 GB
- esm_embeddings.tar.gz (45.46 GB)
- fasta.tar.gz (4.77 GB)
- foldseek_databases.tar.gz (1.20 GB)
- genome_clusters.tar.gz (4.97 MB)
- genslm_ORF_embeddings.h5 (23.32 GB)
- host_prediction.tar.gz (2.34 GB)
- IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5 (20.58 GB)
- IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5 (10.19 GB)
- IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5 (30.94 GB)
- IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5 (10.18 GB)
- MGnify_test_set_PST-TL-P__large_protein_embeddings.h5 (401.41 MB)
- MGnify_test_set_PST-TL-P__small_protein_embeddings.h5 (200.45 MB)
- MGnify_test_set_PST-TL-T__large_protein_embeddings.h5 (622.79 MB)
- MGnify_test_set_PST-TL-T__small_protein_embeddings.h5 (200.48 MB)
- other_genome_embeddings.tar.gz (8.46 GB)
- protein_clusters.tar.gz (197.29 MB)
- PST_training_set_PST-TL-P__large_protein_embeddings.h5 (18.28 GB)
- PST_training_set_PST-TL-P__small_protein_embeddings.h5 (9.04 GB)
- PST_training_set_PST-TL-T__large_protein_embeddings.h5 (27.94 GB)
- PST_training_set_PST-TL-T__small_protein_embeddings.h5 (9.05 GB)
- PST-MLM.tar.gz (4.35 GB)
- PST-TL_genome_embeddings.tar.gz (2.61 GB)
- PST-TL-P__large.ckpt.gz (221.45 MB)
- PST-TL-P__small.ckpt.gz (56.06 MB)
- PST-TL-T__large.ckpt.gz (1.90 GB)
- PST-TL-T__small.ckpt.gz (56.13 MB)
- README.md (28.70 KB)
- source_data.tar.gz (2.35 GB)
- supplementary_data.tar.gz (219.26 MB)
- supplementary_tables.tar.gz (3.12 KB)
Dec 18, 2025 version files 241.97 GB
- esm_embeddings.tar.gz (46.64 GB)
- fasta.tar.gz (4.77 GB)
- foldseek_databases.tar.gz (1.20 GB)
- genome_clusters.tar.gz (4.87 MB)
- genslm_ORF_embeddings.h5 (23.32 GB)
- host_prediction.tar.gz (2.35 GB)
- IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5 (23.03 GB)
- IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5 (11.60 GB)
- IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5 (35.68 GB)
- IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5 (11.52 GB)
- MGnify_test_set_PST-TL-P__large_protein_embeddings.h5 (422.15 MB)
- MGnify_test_set_PST-TL-P__small_protein_embeddings.h5 (210.96 MB)
- MGnify_test_set_PST-TL-T__large_protein_embeddings.h5 (669.26 MB)
- MGnify_test_set_PST-TL-T__small_protein_embeddings.h5 (211.03 MB)
- other_genome_embeddings.tar.gz (5.11 GB)
- protein_clusters.tar.gz (192.28 MB)
- PST_training_set_PST-TL-P__large_protein_embeddings.h5 (18.28 GB)
- PST_training_set_PST-TL-P__small_protein_embeddings.h5 (9.04 GB)
- PST_training_set_PST-TL-T__large_protein_embeddings.h5 (27.94 GB)
- PST_training_set_PST-TL-T__small_protein_embeddings.h5 (9.05 GB)
- PST-MLM.tar.gz (4.35 GB)
- PST-TL_genome_embeddings.tar.gz (1.58 GB)
- PST-TL-P__large.ckpt.gz (221.45 MB)
- PST-TL-P__small.ckpt.gz (56.06 MB)
- PST-TL-T__large.ckpt.gz (1.90 GB)
- PST-TL-T__small.ckpt.gz (56.13 MB)
- README.md (23.94 KB)
- source_data.tar.gz (2.35 GB)
- supplementary_data.tar.gz (219.26 MB)
- supplementary_tables.tar.gz (3.12 KB)
Abstract
Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.
Genomes used to create this dataset are publicly available, and all data within this dataset were generated by the study. See the manuscript for details.
Changes after Sep 19, 2024:
- Added
  - foldseek_databases.tar.gz: Precomputed foldseek 3Di databases for each test dataset
  - PST-TL-P__small.ckpt.gz: New pretrained model checkpoint for the model trained with triplet loss and tuned with protein diversity groups
  - PST-TL-P__large.ckpt.gz: New pretrained model checkpoint for the model trained with triplet loss and tuned with protein diversity groups
  - PST-MLM.tar.gz: New pretrained model checkpoints for models trained with masked language modeling loss
  - PST_training_set_PST-TL-P__large_protein_embeddings.h5
  - IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5
  - MGnify_test_set_PST-TL-P__large_protein_embeddings.h5
  - PST_training_set_PST-TL-P__small_protein_embeddings.h5
  - IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5
  - MGnify_test_set_PST-TL-P__small_protein_embeddings.h5
- Changed
  - esm-large_protein_embeddings.tar.gz
    - Now part of esm_embeddings.tar.gz
    - Includes MGnify test set
  - esm-small_protein_embeddings.tar.gz
    - Now part of esm_embeddings.tar.gz
    - Includes MGnify test set
  - fasta.tar.gz
    - Includes MGnify test set
  - genome_clusters.tar.gz
    - Includes MGnify test set
  - genome_embeddings.tar.gz
    - Split into different files:
      - PST-TL_genome_embeddings.tar.gz contains all PST-TL genome embeddings for each dataset
      - other_genome_embeddings.tar.gz contains all others
  - genslm_protein_embeddings.tar.gz
    - Converted into a single .h5 file called genslm_ORF_embeddings.h5
    - Includes MGnify test set
  - host_prediction.tar.gz
    - Knowledge graphs were reconstructed using a different vector similarity search method
    - Retrained models using new genome embeddings
  - protein_clusters.tar.gz
    - Includes MGnify test set
  - pst-large_protein_embeddings.tar.gz
    - Split into dataset-specific files for easier access to each dataset:
      - PST_training_set_PST-TL-T__large_protein_embeddings.h5
      - IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5
      - MGnify_test_set_PST-TL-T__large_protein_embeddings.h5
  - pst-small_protein_embeddings.tar.gz
    - Split into dataset-specific files for easier access to each dataset:
      - PST_training_set_PST-TL-T__small_protein_embeddings.h5
      - IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5
      - MGnify_test_set_PST-TL-T__small_protein_embeddings.h5
  - supplementary_data.tar.gz
    - Most figures were modified, so all supplementary datasets changed to reflect changes in the manuscript
  - supplementary_tables.zip
    - Added 3 new supplementary tables and included the MGnify test dataset in the existing tables
  - trained_models.tar.gz
    - Split into separate files for easier access to individual models:
      - PST-TL-T__small.ckpt.gz
      - PST-TL-T__large.ckpt.gz
- Removed
  - aai.tar.gz
    - These were originally raw protein-protein alignments for the IMGVRv4 dataset
    - These have been summarized in supplementary_data.tar.gz
    - But the raw alignments had to be removed to free up storage
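The changelog above notes that the trained models are now distributed as individual `.ckpt.gz` files rather than one archive. As a minimal sketch, assuming these are ordinary gzip-compressed files, the checkpoints could be decompressed with the Python standard library alone. The helper name `gunzip` and the chunk size are illustrative choices, not part of the dataset or the PST codebase.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dest_dir: str = ".") -> Path:
    """Stream-decompress a .gz file into dest_dir and return the output path.

    Hypothetical helper: assumes the file is plain gzip, as the
    ".ckpt.gz" extension suggests.
    """
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    # Path.stem drops only the trailing ".gz", so "x.ckpt.gz" becomes "x.ckpt".
    out_path = Path(dest_dir) / src_path.stem
    with gzip.open(src_path, "rb") as fin, open(out_path, "wb") as fout:
        # Copy in 1 MiB chunks so large checkpoints never sit fully in memory.
        shutil.copyfileobj(fin, fout, 1024 * 1024)
    return out_path
```

For example, `gunzip("PST-TL-T__small.ckpt.gz")` would write `PST-TL-T__small.ckpt` to the current directory. How the resulting checkpoint is loaded depends on the PST software and is not covered here.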
Changes after Jun 18, 2025:
- Added
  - supplementary_data.tar.gz now refers to the Supplementary Data files as described in the manuscript. These are tables that are too large to be submitted as tables.
- Changed
  - source_data.tar.gz now refers to the source data used for the figures.
    - Each file has been renamed to refer to the specific figure panel.

  Old file                                          New file
  supplementary_tables/supplementary_table_1.tsv    supplementary_data/supplementary_data_1.tsv
  supplementary_tables/supplementary_table_2.tsv    supplementary_data/supplementary_data_2.tsv
  supplementary_tables/supplementary_table_3.tsv    supplementary_tables/supplementary_table_1.tsv
  supplementary_tables/supplementary_table_4.tsv    supplementary_tables/supplementary_table_2.tsv
  supplementary_tables/supplementary_table_5.tsv    supplementary_data/supplementary_data_3.tsv
  supplementary_tables/supplementary_table_6.tsv    supplementary_data/supplementary_data_4.tsv
  supplementary_tables/supplementary_table_7.tsv    supplementary_tables/supplementary_table_5.tsv
  supplementary_tables/supplementary_table_8.tsv    supplementary_data/supplementary_data_5.tsv
  supplementary_tables/supplementary_table_9.tsv    supplementary_tables/supplementary_table_6.tsv
  supplementary_tables/supplementary_table_10.tsv   supplementary_tables/supplementary_table_3.tsv
  supplementary_tables/supplementary_table_11.tsv   supplementary_tables/supplementary_table_4.tsv
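The old-to-new file names listed above can be captured as a lookup table, which may help when updating scripts written against the Jun 18, 2025 version. This is a minimal sketch: the paths are transcribed from the rename table, while the names `RENAMED` and `new_path` are hypothetical.

```python
# Old -> new relative paths, transcribed from the rename table above.
RENAMED = {
    "supplementary_tables/supplementary_table_1.tsv": "supplementary_data/supplementary_data_1.tsv",
    "supplementary_tables/supplementary_table_2.tsv": "supplementary_data/supplementary_data_2.tsv",
    "supplementary_tables/supplementary_table_3.tsv": "supplementary_tables/supplementary_table_1.tsv",
    "supplementary_tables/supplementary_table_4.tsv": "supplementary_tables/supplementary_table_2.tsv",
    "supplementary_tables/supplementary_table_5.tsv": "supplementary_data/supplementary_data_3.tsv",
    "supplementary_tables/supplementary_table_6.tsv": "supplementary_data/supplementary_data_4.tsv",
    "supplementary_tables/supplementary_table_7.tsv": "supplementary_tables/supplementary_table_5.tsv",
    "supplementary_tables/supplementary_table_8.tsv": "supplementary_data/supplementary_data_5.tsv",
    "supplementary_tables/supplementary_table_9.tsv": "supplementary_tables/supplementary_table_6.tsv",
    "supplementary_tables/supplementary_table_10.tsv": "supplementary_tables/supplementary_table_3.tsv",
    "supplementary_tables/supplementary_table_11.tsv": "supplementary_tables/supplementary_table_4.tsv",
}


def new_path(old: str) -> str:
    """Translate an old relative path to its new name (hypothetical helper).

    Paths that were not renamed are returned unchanged.
    """
    return RENAMED.get(old, old)
```

Several new names reuse old ones (the old `supplementary_table_3.tsv` becomes the new `supplementary_table_1.tsv`, for instance), so the mapping should be applied exactly once per reference and never chained.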
Changes after Sep 30, 2025:
- Changed
  - The following files were reuploaded since there was an issue with the ordering of the embeddings:
    - IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5
    - IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5
    - IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5
    - IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5
    - MGnify_test_set_PST-TL-T__large_protein_embeddings.h5
    - MGnify_test_set_PST-TL-T__small_protein_embeddings.h5
    - MGnify_test_set_PST-TL-P__large_protein_embeddings.h5
    - MGnify_test_set_PST-TL-P__small_protein_embeddings.h5
    - esm_embeddings.tar.gz
    - PST-TL_genome_embeddings.tar.gz
    - genome_clusters.tar.gz
    - protein_clusters.tar.gz
    - other_genome_embeddings.tar.gz
    - host_prediction.tar.gz
  - The format of all data has not changed, just the actual content to reflect the order of genomes/proteins in Supplementary Tables 1 & 2.
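Anyone who downloaded the reuploaded files before the fix may want to find out which local copies are affected. The sketch below flags files in a directory whose names appear in the reupload list and whose modification time predates a cutoff. It rests on two assumptions that are not stated in the source: that the corrected files are the ones in the Dec 18, 2025 version, and that a file's modification time reflects when it was downloaded. `REUPLOADED`, `CUTOFF`, and `stale_copies` are hypothetical names.

```python
from datetime import datetime, timezone
from pathlib import Path

# File names transcribed from the reupload list above.
REUPLOADED = frozenset({
    "IMGVRv4_test_set_PST-TL-T__large_protein_embeddings.h5",
    "IMGVRv4_test_set_PST-TL-T__small_protein_embeddings.h5",
    "IMGVRv4_test_set_PST-TL-P__large_protein_embeddings.h5",
    "IMGVRv4_test_set_PST-TL-P__small_protein_embeddings.h5",
    "MGnify_test_set_PST-TL-T__large_protein_embeddings.h5",
    "MGnify_test_set_PST-TL-T__small_protein_embeddings.h5",
    "MGnify_test_set_PST-TL-P__large_protein_embeddings.h5",
    "MGnify_test_set_PST-TL-P__small_protein_embeddings.h5",
    "esm_embeddings.tar.gz",
    "PST-TL_genome_embeddings.tar.gz",
    "genome_clusters.tar.gz",
    "protein_clusters.tar.gz",
    "other_genome_embeddings.tar.gz",
    "host_prediction.tar.gz",
})

# Assumption: corrected files were published with the Dec 18, 2025 version.
CUTOFF = datetime(2025, 12, 18, tzinfo=timezone.utc)


def stale_copies(directory: str) -> list[str]:
    """Return reuploaded file names in `directory` last modified before CUTOFF."""
    stale = []
    for path in sorted(Path(directory).iterdir()):
        if path.name not in REUPLOADED:
            continue  # file was not part of the reupload
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if modified < CUTOFF:
            stale.append(path.name)
    return stale
```

Modification times can be changed by copying or syncing tools, so a flagged (or unflagged) file is only a hint. Comparing file sizes against the current version listing is a useful second check.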
- Martin, Cody; Gitter, Anthony; Anantharaman, Karthik (2024), Protein Set Transformer: A protein-based genome language model to power high diversity viromics, Posted-content, https://doi.org/10.1101/2024.07.26.605391
