A dataset for predicting protein-protein interactions in humans

Zhang, Jing1; Humphrey, Ian R.2; Pei, Jimin1; Kim, Jinuk3; Choi, Chulwon3; Yuan, Rongqing1; Durham, Jesse1; Liu, Siqi1; Choi, Hee-Jung3; Baek, Minkyung3; Baker, David2; Cong, Qian1

Published Sep 16, 2025 on Dryad. https://doi.org/10.5061/dryad.15dv41p84

Data files

Sep 16, 2025 version files 98.79 GB

Abstract

Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resulting dataset includes omicMSA alignments, training data (domain-domain and protein-protein interactions), high-confidence predicted pairs, oligomeric assemblies inferred from predicted and known interactions, novel components predicted for known complexes, structural models (PDB format), contact probabilities from AlphaFold and RoseTTAFold2-PPI, and DCA scores.

Dataset DOI: 10.5061/dryad.15dv41p84

Description of the data and file structure

protein_omicMSAs.tar.gz (17 GB)

These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named “mask,” to indicate the alignment quality at each position. In this “mask,” an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of each sequence in the header, following the format: [genus]:[family]:[order]:[class]:[phylum]. Please note that because we assemble these sequences by aligning draft genomes or genomic reads to human proteins, insertions present in other species relative to the human sequence are often missed. Similarly, gaps in the MSAs may not represent deletions relative to the human sequence; they could result from alignment failures or incompleteness in the genomic dataset.
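
The mask convention described above can be applied with a few lines of Python. The following is a minimal sketch, not the code used in our pipeline. It assumes that the mask record's header is literally "mask", that the mask has one character per position of the human (query) sequence, and that lowercase insertions do not occupy a mask position. The example alignment and the accession name ACCESSION_1 are made up for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str]


def read_a3m(text: str) -> List[Record]:
    """Parse A3M-like text into (header, sequence) pairs, joining wrapped lines."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def apply_mask(records: List[Record]) -> List[Record]:
    """Keep only the columns marked '*' in the leading mask record.

    Assumption: the first record is the mask. Lowercase insertions are
    removed first so that every sequence has one character per query position.
    """
    mask_header, mask = records[0]
    if "mask" not in mask_header.lower():
        raise ValueError("first record does not look like a mask")
    keep = [i for i, flag in enumerate(mask) if flag == "*"]
    filtered: List[Record] = []
    for header, seq in records[1:]:
        columns = "".join(c for c in seq if not c.islower())
        if len(columns) != len(mask):
            raise ValueError(f"{header}: length does not match the mask")
        filtered.append((header, "".join(columns[i] for i in keep)))
    return filtered


# Hypothetical toy alignment: position 3 is low quality, 't' is an insertion.
EXAMPLE = """>mask
**-*
>query
MKTA
>ACCESSION_1
MRt-A
"""

high_quality = apply_mask(read_a3m(EXAMPLE))
```

For a real file, the text would first be extracted from protein_omicMSAs.tar.gz and passed to read_a3m in the same way.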

segment_omicMSAs.tar.gz (13 GB)

Building on the omicMSAs for full-length human proteins (see description above), these MSAs correspond to segments of human proteins used in our work. We split larger proteins into multiple segments and excluded the “low-quality” positions from these segments. The definition of each segment can be found in segment_def. These MSAs are in an A3M-like format. Insertions relative to the human (query) sequence are represented as lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also provide the taxonomic information for each sequence in the header, following the format: [genus]:[family]:[order]:[class]:[phylum]. Please note that because we assemble these sequences by aligning draft genomes or genomic reads to human proteins, insertions present in other species relative to the human sequence will often be missed. Similarly, gaps in the MSAs may not represent deletions relative to the human sequence; they could result from alignment failures or incompleteness of the genomic dataset.
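
The taxonomic field in each header can be split into its five ranks, for example to subsample an MSA by class or phylum. The sketch below assumes that the accession is the first whitespace-separated token of the header and that the taxonomy is a later token containing exactly four colons. The exact header layout should be checked against the files, and the accession ACCESSION_1 is a placeholder.

```python
from typing import NamedTuple, Optional, Tuple


class Lineage(NamedTuple):
    genus: str
    family: str
    order: str
    tax_class: str
    phylum: str


def parse_header(header: str) -> Tuple[str, Optional[Lineage]]:
    """Split a header into (accession, lineage).

    Assumed layout (hypothetical): "<accession> <genus>:<family>:<order>:<class>:<phylum>".
    Returns None for the lineage if no token has exactly five colon-separated fields.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty header")
    accession = tokens[0]
    for token in tokens[1:]:
        fields = token.split(":")
        if len(fields) == 5:
            return accession, Lineage(*fields)
    return accession, None


accession, lineage = parse_header(
    ">ACCESSION_1 Pan:Hominidae:Primates:Mammalia:Chordata"
)
```

Records such as the query, which may carry no taxonomy, return None for the lineage instead of raising an error.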

benchmarks.tar.gz (286 KB)

Two files are included. The file “positives_and_negatives.tsv” contains the positive and negative controls used to benchmark different methods. The file “pairs_partitioned_by_interface_sizes” contains additional positive controls derived from PDB complexes, partitioned into different categories based on interface size, which correlates with binding affinity.

final_predictions.tar.gz (14 MB)

Two final sets of predicted PPIs generated in our study are included, along with additional metadata. “final_predictions_80.tsv” includes all predictions obtained at an expected precision of 80%, while “final_predictions_90.tsv” includes predictions obtained at an expected precision of 90%. The latter is a subset of the former.
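
The subset relationship between the two files can be checked after download. This is a minimal sketch that assumes each file is tab-separated, has one header row, and lists the two protein identifiers in its first two columns. The column layout is an assumption and should be verified against the files. Pairs are treated as unordered.

```python
import csv
import io
from typing import FrozenSet, Set, TextIO


def read_pairs(handle: TextIO, has_header: bool = True) -> Set[FrozenSet[str]]:
    """Read unordered protein pairs from the first two tab-separated columns."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    return {frozenset(row[:2]) for row in reader if len(row) >= 2}


# Toy stand-ins for final_predictions_80.tsv and final_predictions_90.tsv.
toy_80 = io.StringIO("protein1\tprotein2\nA\tB\nC\tD\n")
toy_90 = io.StringIO("protein1\tprotein2\nB\tA\n")

pairs_80 = read_pairs(toy_80)
pairs_90 = read_pairs(toy_90)
is_subset = pairs_90 <= pairs_80
```

With the real files, the io.StringIO objects would be replaced by open("final_predictions_80.tsv", newline="") and the corresponding call for the 90% file.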

best_models.tar.gz (67 GB)

Description: The best 3D structural models for 29,246 out of 29,257 predicted PPIs are provided here (selected based on the presence of consistently predicted inter-protein contacts across multiple models). A 3D model is considered confident only if it contains inter-protein contacts (distance < 6 Å) with AlphaFold2 interaction probabilities above 0.5. We were unable to obtain such 3D models for 11 predicted PPIs.
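
The confidence criterion above can be illustrated as follows. This sketch takes an inter-protein distance matrix and a matching probability matrix, both as nested lists of shape L1 × L2, and returns the residue pairs that satisfy both thresholds. How the distances are measured (for example, which atoms are used) is not specified here, so the inputs are left to the user. The function name and the toy values are illustrative only.

```python
from typing import List, Sequence, Tuple


def confident_contacts(
    distances: Sequence[Sequence[float]],
    probabilities: Sequence[Sequence[float]],
    max_distance: float = 6.0,
    min_probability: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return (i, j) index pairs with distance < 6 Å and probability > 0.5.

    i indexes residues of the first segment, j residues of the second.
    """
    contacts: List[Tuple[int, int]] = []
    for i, (dist_row, prob_row) in enumerate(zip(distances, probabilities)):
        for j, (dist, prob) in enumerate(zip(dist_row, prob_row)):
            if dist < max_distance and prob > min_probability:
                contacts.append((i, j))
    return contacts


# Toy 2 x 2 example (values are made up).
toy_distances = [[4.2, 9.0], [5.9, 5.0]]
toy_probabilities = [[0.90, 0.95], [0.40, 0.70]]

contacts = confident_contacts(toy_distances, toy_probabilities)
model_is_confident = len(contacts) > 0
```

In this toy case, pair (0, 1) fails the distance threshold and pair (1, 0) fails the probability threshold, so only two contacts remain.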

Our PPI screening and structural modeling were performed using segments rather than full-length protein sequences. These segments exclude 70% of residues in intrinsically disordered regions (which make up 25% of all residues) that have low-quality MSAs, and they are short enough to fit within GPU memory. Each segment contains one or more domains, and the relative orientation of different segments within a protein is largely flexible. The definition of each segment can be found here.

Each protein pair may be modeled using multiple segment pairs, and the results are organized into folders named after the protein pair. The model files are named using the format: segment1__segment2__model, where the model can be:

"AF": a model built using the AlphaFold2 model 3
"AF1-5": five models built by ColabFold using the AlphaFold2 network
"AFMM1-5": five models built by ColabFold using the AlphaFold-Multimer network

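The naming scheme above can be parsed programmatically. The sketch below splits a file name on the double underscore and maps the model label to the method that produced it. It reads "AF1-5" and "AFMM1-5" as the labels AF1 to AF5 and AFMM1 to AFMM5, which is our interpretation of the list above, and it assumes that segment names do not themselves contain a double underscore. The segment names segA and segB are placeholders.

```python
import re
from pathlib import Path
from typing import NamedTuple


class ModelFile(NamedTuple):
    segment1: str
    segment2: str
    model: str
    method: str


def parse_model_file(filename: str) -> ModelFile:
    """Parse 'segment1__segment2__model.<ext>' into its components."""
    parts = Path(filename).stem.split("__")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {filename}")
    segment1, segment2, model = parts
    if model == "AF":
        method = "AlphaFold2 model 3"
    elif re.fullmatch(r"AFMM[1-5]", model):
        method = "ColabFold, AlphaFold-Multimer network"
    elif re.fullmatch(r"AF[1-5]", model):
        method = "ColabFold, AlphaFold2 network"
    else:
        raise ValueError(f"unknown model label: {model}")
    return ModelFile(segment1, segment2, model, method)


example = parse_model_file("segA__segB__AFMM3.pdb")
```

The same function works for the *.pdb and *.npz files of a segment pair, since only the extension differs.
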
We provide three files for each segment pair:

*.pdb: the predicted 3D structure
*.npz: a matrix of shape L1 × L2, where L1 and L2 are the lengths of the first and second segments, respectively

AF_scores.gz (32 MB)

Interaction probabilities between 3.4 million protein pairs were predicted by AlphaFold2 based on omicMSAs (excluding low-quality positions). We applied AlphaFold2 only to protein pairs that showed high RF2-PPI interaction probabilities or had prior experimental evidence supporting their interactions. Each protein pair has a single probability value, representing the maximum interaction probability among all residue pairs between the two proteins. We also indicate the source of each pair, defined as follows:

DNS: de novo screen of pairs with shared subcellular localization
DNU: de novo screen of pairs involving proteins with unknown subcellular localization
PPI: pairs from PPI databases—BioGRID, STRING (physical), and UniProt
STR: genetically interacting pairs from STRING
NEG: negative controls used for accuracy estimation

RF2-PPI_scores.gz (444 MB)

Interaction probabilities between 47 million protein pairs were predicted by RF2-PPI based on omicMSAs (excluding low-quality positions). We applied RF2-PPI only to protein pairs that exhibited high DCA scores or had prior experimental evidence supporting their interactions. Each protein pair has one probability value, representing the maximum interaction probability (first column) among all residue pairs between the two proteins. We also indicate the source of each pair (second column), defined as follows:

DNS: de novo screen of pairs with shared subcellular localization
DNU: de novo screen of pairs involving proteins with unknown subcellular localization
PPI: pairs from PPI databases—BioGRID, STRING (physical), and UniProt
STR: genetically interacting pairs from STRING
NEG: negative controls used for accuracy estimation
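
Because the score files are large, it is usually preferable to stream them instead of loading them into memory. The sketch below counts pairs above a probability cutoff for each source code. The column positions are passed as arguments because the exact layout (in particular, where the protein identifiers sit relative to the probability and source columns) should be checked against the file. The toy rows and the 0.5 cutoff are made up for illustration.

```python
import gzip
import io
from collections import Counter
from typing import Iterable

SOURCE_LABELS = {
    "DNS": "de novo screen, shared subcellular localization",
    "DNU": "de novo screen, unknown subcellular localization",
    "PPI": "PPI databases (BioGRID, STRING physical, UniProt)",
    "STR": "genetically interacting pairs from STRING",
    "NEG": "negative controls",
}


def count_by_source(
    lines: Iterable[str], threshold: float, prob_col: int, source_col: int
) -> Counter:
    """Count rows with probability >= threshold, grouped by source code."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= max(prob_col, source_col):
            continue  # skip short or empty lines
        try:
            probability = float(fields[prob_col])
        except ValueError:
            continue  # skip a possible header row
        source = fields[source_col]
        if source in SOURCE_LABELS and probability >= threshold:
            counts[source] += 1
    return counts


# Toy gzip-compressed data with a hypothetical "pair probability source" layout.
toy_bytes = gzip.compress(
    b"A_B 0.91 DNS\nC_D 0.12 NEG\nE_F 0.75 PPI\nG_H 0.88 DNS\n"
)
with gzip.open(io.BytesIO(toy_bytes), "rt") as handle:
    toy_counts = count_by_source(handle, threshold=0.5, prob_col=1, source_col=2)
```

For the deposited files, gzip.open("RF2-PPI_scores.gz", "rt") or gzip.open("AF_scores.gz", "rt") would replace the in-memory example.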

DCA_scores.gz (778 MB)

Coevolution between 189 million protein pairs was evaluated using direct coupling analysis (DCA) on omicMSAs (excluding low-quality positions), followed by Average Product Correction (APC). Each protein pair has a single score, representing the maximum score among all residue pairs between the two proteins.
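
For readers unfamiliar with APC, the following sketch shows the textbook form of the correction on a small symmetric score matrix, followed by the maximum over inter-protein residue pairs. It subtracts from each raw score the product of its row and column means divided by the overall mean. This illustrates the general technique under simple assumptions (means taken over the full matrix, diagonal included) and is not necessarily identical in detail to our implementation. The toy matrix is made up.

```python
from typing import List

Matrix = List[List[float]]


def average_product_correction(raw: Matrix) -> Matrix:
    """Apply APC: corrected[i][j] = raw[i][j] - row_mean[i] * col_mean[j] / overall_mean."""
    n = len(raw)
    row_means = [sum(row) / n for row in raw]
    col_means = [sum(raw[i][j] for i in range(n)) / n for j in range(n)]
    overall = sum(row_means) / n
    if overall == 0:
        raise ValueError("overall mean is zero; APC is undefined")
    return [
        [raw[i][j] - row_means[i] * col_means[j] / overall for j in range(n)]
        for i in range(n)
    ]


def max_interprotein_score(corrected: Matrix, len_first: int) -> float:
    """Maximum corrected score between residues of protein 1 and protein 2.

    Rows and columns 0..len_first-1 belong to the first protein; the rest
    belong to the second protein in the concatenated alignment.
    """
    n = len(corrected)
    return max(
        corrected[i][j] for i in range(len_first) for j in range(len_first, n)
    )


# Toy concatenated alignment: two residues from protein 1, one from protein 2.
toy_raw = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
toy_corrected = average_product_correction(toy_raw)
pair_score = max_interprotein_score(toy_corrected, len_first=2)
```

Only the block of scores between the two proteins contributes to the per-pair value, so intra-protein coevolution does not affect the reported score.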

screened_pairs.tar.gz (336 MB)

Human protein pairs (among the 19,528 proteins included in our PPI screen) from various sources are provided in four files: (1) “PPI_database_pairs” contains candidate PPIs gathered from UniProt, BioGRID, and STRING physical interactions; (2) “STRING_genetic_pairs” includes genetically associated pairs from STRING genetic interactions; (3) “same_locality_pairs” lists protein pairs that share subcellular locality as annotated by UniProt keywords; and (4) “unknown_locality_pairs” includes pairs involving proteins without known subcellular locality.

RF2-PPI.pt (310 MB)

Trained weights for RoseTTAFold2-PPI, which can be used with our code deposited at: https://github.com/CongLabCode/RoseTTAFold2-PPI

Access information

Other publicly accessible locations of the data are provided below. Large data files (such as the PPI and DDI datasets used for training, which cannot be deposited in Dryad due to file size limitations) are available at the following websites.