Structural ontogeny of protein-protein interactions
Abstract
Natural protein binding sites are often the most “druggable” sites on proteins, while alternative protein surfaces can be difficult targets. To explore the structural basis of this phenomenon, we used synthetic coevolution to engineer new interactions between naïve surfaces, simulating the de novo formation of protein complexes. We isolated seven distinct structural families of protein Z-domain complexes and found that synthetic complexes explore multiple shallow energy wells through ratchet-like docking modes, while complexes co-evolved from a natural binding surface converged in a deep energy well with a relatively fixed docking geometry. Epistasis analysis using machine learning to estimate fitness landscapes extracted “seed” contacts emerging from silent surfaces between binding partners that anchored the earliest stages of encounter complex formation. These data suggest why natural binding sites attract binders: alternative surfaces have a shallow energy landscape that disfavors tight binding, likely due to evolutionary counter-selection. Our findings have implications for understanding druggable versus undruggable surfaces.
This dataset contains the sequencing data obtained from the coevolution libraries in the study titled "Structural Ontogeny of Protein-Protein Interactions."
The dataset provides information on the Z-A and Z-B protein pairs along with their respective read counts across multiple rounds of yeast display selection.
Dataset Contents
- NGS Data: Parsed sequences for each Z-A and Z-B pair, along with their corresponding read counts.
Data Processing and Analysis
All data processing and analysis procedures performed on the dataset are described in detail in the accompanying research paper.
Note that we used filtered data for downstream coevolution analysis and machine learning as described in the paper.
Dataset Structure
Each file is named by its selection round (e.g., naive, R1, R2).
File Format
The processed data files are in tabular format with the following columns:
- Z-A Sequence
- Z-B Sequence
- Read Counts
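As an illustration of how the three columns above might be read, here is a minimal Python sketch. It is not the processing pipeline used in the paper. It assumes a tab-delimited text file with an optional header row; the delimiter, the function names (`read_pair_counts`, `to_frequencies`), and the example rows are all hypothetical, so adjust them to match the actual files.

```python
import csv
import io


def read_pair_counts(handle, delimiter="\t"):
    """Read (Z-A sequence, Z-B sequence, read count) rows into a dict.

    Assumption: columns appear in the order listed in this README and are
    separated by `delimiter` (tab here; change it if the files differ).
    """
    counts = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        z_a, z_b, raw = row[0].strip(), row[1].strip(), row[2].strip()
        if not raw.isdigit():
            continue  # skip a header row, if one is present
        pair = (z_a, z_b)
        counts[pair] = counts.get(pair, 0) + int(raw)
    return counts


def to_frequencies(counts):
    """Convert raw read counts to the fraction of total reads per pair."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Made-up placeholder rows for demonstration only (not real sequences or counts).
example = (
    "Z-A Sequence\tZ-B Sequence\tRead Counts\n"
    "AAAA\tCCCC\t30\n"
    "AAAA\tDDDD\t10\n"
)
counts = read_pair_counts(io.StringIO(example))
freqs = to_frequencies(counts)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` example, for instance `open(path, newline="")` with `path` pointing at one selection-round file. Normalizing to frequencies makes rounds with different sequencing depths easier to compare.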
Contact Information
For any inquiries or further information, please contact:
- K. Christopher Garcia (kcgarcia@stanford.edu)
- Aerin Yang (aryang8825@gmail.com)
