Data for: Accurate sequence-to-affinity models for SH2 domains from multi-round peptide binding assays coupled with free-energy regression

Gagoski, Dejan; Rube, H. Tomas; Rastogi, Chaitanya; Melo, Lucas A. N.; Li, Xiaoting; Voleti, Rashmi; Shah, Neel H.; Bussemaker, Harmen

Published Aug 01, 2025 on Dryad. https://doi.org/10.5061/dryad.msbcc2g7w

Data files

Files in the Aug 01, 2025 version (2.05 GB total):

configuration.zip (63.03 KB)
countTables.zip (147.72 MB)
README.md (1.49 KB)
S1_Datasets.tsv (5.30 KB)
S2_ProBound_models.tsv (140.73 KB)
sequencingReads.zip (1.90 GB)

Abstract

Short linear peptide motifs play important roles in cell signaling. They can act as modification sites for enzymes and as recognition sites for peptide binding domains. SH2 domains bind specifically to tyrosine-phosphorylated proteins, with the affinity of the interaction depending strongly on the flanking sequence. Quantifying this sequence specificity is critical for deciphering phosphotyrosine-dependent signaling networks. In recent years, protein display technologies and deep sequencing have allowed researchers to profile SH2 domain binding across thousands of candidate ligands. Here, we present a concerted experimental and computational strategy that improves the predictive power of SH2 specificity profiling. Through multi-round affinity selection and deep sequencing with large randomized phosphopeptide libraries, we produce suitable data to train an additive binding free energy model that covers the full theoretical ligand sequence space. Our models can be used to predict signaling network connectivity and the impact of missense variants in phosphoproteins on SH2 binding.

This repository contains the following data:

S1_Datasets.tsv - Supplemental table listing the sequencing datasets.
S2_ProBound_models.tsv - Supplemental table listing all ProBound models and the settings that were used.
sequencingReads.zip/
- <libraryName>.fastq.gz - The sequencing reads
countTables.zip/
- <countTableId>.<experimentName>.tsv.gz - File tabulating the number of occurrences of each observed sequence in the input and binding-selected libraries.
- <countTableId>.<experimentName>.folds.gz - File assigning each sequence to one of 10 folds.
configuration.zip/
- config.<fitID>.builder.json - Configuration builder file (used by ProBound to build the full configuration file)
- config.<fitID>.json - Configuration file used as input to ProBound
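Before fitting, it can help to look at a count table directly. The helper below is a minimal sketch, not part of the repository: the function name `peek_count_table` is hypothetical, and it assumes only that the `.tsv.gz` files are ordinary gzip-compressed, tab-separated text, as the file extension suggests. It makes no assumption about the column layout.

```shell
# Print the first few rows of a gzip-compressed count table.
# Hypothetical helper; usage: peek_count_table <file.tsv.gz> [rows]
peek_count_table() {
    local file="$1"
    local rows="${2:-5}"   # default to 5 rows if no count is given
    if [ ! -f "$file" ]; then
        echo "No such file: $file" >&2
        return 1
    fi
    # gzip -dc streams the decompressed table to stdout without
    # writing an uncompressed copy to disk.
    gzip -dc "$file" | head -n "$rows"
}

# Example, after unpacking countTables.zip (substitute a real file name):
# peek_count_table countTables/<countTableId>.<experimentName>.tsv.gz 3
```

Streaming through `head` keeps the check fast even for large tables, since only the first rows are ever decompressed.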

To build the sequence recognition models, first download and install ProBound, then set the environment variable PROBOUND_DIR to the installation directory:

export PROBOUND_DIR="/path/to/ProBound"
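A quick sanity check can catch a missing or mistyped path before any Java command is run. The sketch below is illustrative only: `check_probound_dir` is a hypothetical helper name, and it assumes the jar sits at `$PROBOUND_DIR/ProBound.jar`, which is the location used by the commands in this README.

```shell
# Verify that PROBOUND_DIR is set and points at a directory
# containing ProBound.jar. Hypothetical helper, not part of ProBound.
check_probound_dir() {
    if [ -z "${PROBOUND_DIR:-}" ]; then
        echo "PROBOUND_DIR is not set" >&2
        return 1
    fi
    if [ ! -f "$PROBOUND_DIR/ProBound.jar" ]; then
        echo "ProBound.jar not found in $PROBOUND_DIR" >&2
        return 1
    fi
    echo "Using $PROBOUND_DIR/ProBound.jar"
}

# Example:
# check_probound_dir || exit 1
```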

Next, build the configuration file using:

fitID=21868
java -jar "$PROBOUND_DIR/ProBound.jar" -b -c "configuration/config.$fitID.builder.json" > "configuration/config.$fitID.json"

Finally, run ProBound:

cd configuration/
java -jar "$PROBOUND_DIR/ProBound.jar" -c "config.$fitID.json"
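The two steps above handle a single fit ID. To process every builder file in the archive, they could be wrapped in a loop along the following lines. This is a sketch under stated assumptions: the helper names `fit_id_from_builder` and `run_all_fits` and the `DRY_RUN` switch are hypothetical, the archive is assumed to unpack into a `configuration/` directory, and the only ProBound options used are the `-b` and `-c` flags shown above. By default the loop only prints the commands, so they can be reviewed before anything is run.

```shell
# Extract the fit ID from a builder file name,
# e.g. config.21868.builder.json -> 21868.
fit_id_from_builder() {
    local name
    name=$(basename "$1")
    name=${name#config.}            # strip the leading "config."
    echo "${name%.builder.json}"    # strip the trailing ".builder.json"
}

# Build and fit every builder file in a directory (default: configuration).
# With DRY_RUN=1 (the default) the commands are printed, not executed.
run_all_fits() {
    local dir="${1:-configuration}"
    local builder fitID
    for builder in "$dir"/config.*.builder.json; do
        [ -e "$builder" ] || continue   # the glob matched nothing
        fitID=$(fit_id_from_builder "$builder")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "java -jar \$PROBOUND_DIR/ProBound.jar -b -c $builder > $dir/config.$fitID.json"
            echo "(cd $dir && java -jar \$PROBOUND_DIR/ProBound.jar -c config.$fitID.json)"
        else
            java -jar "$PROBOUND_DIR/ProBound.jar" -b -c "$builder" > "$dir/config.$fitID.json"
            (cd "$dir" && java -jar "$PROBOUND_DIR/ProBound.jar" -c "config.$fitID.json")
        fi
    done
}

# Example: review the commands first, then run them.
# run_all_fits configuration
# DRY_RUN=0 run_all_fits configuration
```

The fit step runs in a subshell so that the `cd` does not change the working directory of the calling shell between iterations.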


