Dryad

A dataset for predicting protein-protein interactions in humans

Data files

Sep 16, 2025 version files 98.79 GB


Abstract

Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resulting dataset includes omicsMSA alignments, training data (domain-domain and protein-protein interactions), high-confidence predicted pairs, oligomeric assemblies inferred from predicted and known interactions, novel components predicted for known complexes, structural models (PDB format), contact probabilities from AlphaFold and RoseTTAFold2-PPI, and DCA scores.
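
The headline figures in the abstract can be related to one another with some simple arithmetic. The sketch below is only an illustration, not part of the dataset or the authors' pipeline: the constant and function names are hypothetical, the screened-pair count is rounded to the "around 190 million" quoted above, and the estimated 90% precision is applied to the full set of predictions as a rough expectation.

```python
# Figures quoted in the abstract (names here are illustrative, not from the dataset).
N_PREDICTED = 17_849               # high-confidence predicted PPIs
N_NOVEL = 3_631                    # predictions not previously detected experimentally
ESTIMATED_PRECISION = 0.90         # estimated precision of the high-confidence set
N_SCREENED_APPROX = 190_000_000    # "around 190 million" screened pairs (rounded)


def summarize_headline_numbers() -> dict:
    """Derive a few back-of-the-envelope quantities from the abstract's figures."""
    return {
        # Expected number of correct predictions if precision is exactly 90%.
        "expected_true_positives": ESTIMATED_PRECISION * N_PREDICTED,
        # Share of the high-confidence set with no prior experimental detection.
        "novel_fraction": N_NOVEL / N_PREDICTED,
        # Fraction of screened pairs that ended up in the high-confidence set.
        "high_confidence_rate": N_PREDICTED / N_SCREENED_APPROX,
    }


if __name__ == "__main__":
    for name, value in summarize_headline_numbers().items():
        print(f"{name}: {value:.6g}")
```

Under these assumptions, roughly 16,000 of the predictions would be expected to be correct, about one in five predictions is new relative to experimental methods, and on the order of one in ten thousand screened pairs is called as a high-confidence interaction. The precision estimate describes the set as a whole; it need not hold equally for the novel subset.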