Dryad

A dataset for predicting protein-protein interactions in humans

Data files

Sep 16, 2025 version files: 98.79 GB

Abstract

Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resulting dataset includes omicsMSA alignments, training data (domain-domain and protein-protein interactions), high-confidence predicted pairs, oligomeric assemblies inferred from predicted and known interactions, novel components predicted for known complexes, structural models (PDB format), contact probabilities from AlphaFold and RoseTTAFold2-PPI, and DCA scores.
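
As a rough sanity check on the headline numbers in the abstract, the short Python sketch below works through the arithmetic they imply. It uses only figures quoted above (17,849 predictions, 90% estimated precision, 3,631 novel interactions, around 190 million screened pairs). The assumption that the screen covered all unordered pairs of a protein set is ours, not a statement from the dataset description, and the helper name `proteins_for_pairs` is hypothetical.

```python
from math import comb, isqrt

# Figures quoted in the abstract.
PREDICTED_PPIS = 17_849
ESTIMATED_PRECISION = 0.90
NOVEL_PPIS = 3_631
SCREENED_PAIRS = 190_000_000  # "around 190 million", so only approximate

# Precision = true positives / all predictions, so the expected number of
# correct predictions is precision * predictions.
expected_true = PREDICTED_PPIS * ESTIMATED_PRECISION
expected_false = PREDICTED_PPIS - expected_true

# Share of the high-confidence set not previously detected experimentally.
novel_fraction = NOVEL_PPIS / PREDICTED_PPIS


def proteins_for_pairs(pairs: int) -> int:
    """Smallest n such that n choose 2 >= pairs.

    ASSUMPTION (ours): the screen was an all-vs-all comparison of unordered
    pairs. The abstract does not state how the pairs were enumerated.
    """
    n = (1 + isqrt(1 + 8 * pairs)) // 2  # approximate root of n(n-1)/2 = pairs
    while comb(n, 2) < pairs:
        n += 1
    return n


if __name__ == "__main__":
    print(f"Expected correct predictions: about {expected_true:.0f}")
    print(f"Expected false positives:     about {expected_false:.0f}")
    print(f"Novel share of predictions:   {novel_fraction:.1%}")
    print(f"Proteins implied by the pair count: about {proteins_for_pairs(SCREENED_PAIRS)}")
```

Under the all-vs-all assumption, the pair count corresponds to a set of somewhat under 20,000 proteins; treat that as an order-of-magnitude check rather than a description of the actual screening set.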