Data from: Proximity-labeling proteomics reveals remodeled interactomes and altered localization of pathogenic SHP2 variants
Data files
Mar 10, 2025 version files 2.11 GB
- count_matrices.zip (7.76 KB)
- microscopy.zip (1.28 GB)
- raw_sequencing.zip (727.06 MB)
- README.md (7.97 KB)
- scores.zip (5.06 MB)
- scripts.zip (252.82 KB)
- translated_reads.zip (101.54 MB)
Abstract
This data repository contains data associated with a study characterizing how missense mutations in PTPN11, the gene encoding the protein tyrosine phosphatase SHP2, alter the protein's interactions and localization. SHP2 has a catalytic domain that dephosphorylates proteins and two phosphotyrosine-binding SH2 domains. One part of this repository contains data that were used to map the sequence recognition profiles of the two wild-type SH2 domains of SHP2. These sequence recognition profiles were used to predict binding sites for each SH2 domain across the human proteome, and they are juxtaposed with the proteomics data presented in the parent study. The other part of the repository contains microscopy data examining the mitochondrial localization of wild-type SHP2.
https://doi.org/10.5061/dryad.wstqjq2xv
Description of the data and file structure
There are two data types in this repository: deep sequencing data and associated analysis files, and microscopy data.
The deep sequencing data were generated using a high-throughput method for profiling the sequence specificities of SH2 domains. This method has been documented extensively in two published manuscripts:
https://doi.org/10.7554/elife.82345
https://doi.org/10.1073/pnas.2407159121
The overall approach uses degenerate, genetically encoded peptide libraries with the structure X5-Y-X5, where X is any of the 20 canonical amino acids and Y is tyrosine. The libraries (approximately 1 million random sequences) are displayed on the surface of E. coli cells, then enzymatically phosphorylated using a mixture of tyrosine kinases. SH2 domains immobilized on magnetic beads are then used to enrich the phosphorylated cells displaying optimal peptide sequences for the particular SH2 domain. These cells are isolated, and the DNA encoding the peptides is amplified by PCR and subjected to deep sequencing. An input library that has not been selected by any SH2 domain is also sequenced.
Next, the sequenced libraries are translated from DNA to peptide sequences. The frequency of each amino acid at each position in the selected library is counted and normalized to that in the input library. These normalized matrices are used to calculate a binding score for every documented tyrosine phosphorylation site in the PhosphoSitePlus database (https://www.phosphosite.org/homeAction.action). The scoring method is described in detail here: https://doi.org/10.7554/elife.82345
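The normalization and scoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual code: the log2 ratio, the pseudocount, the summed score, and the names enrichment_matrix and score_peptide are all assumptions made here for clarity. The exact formula is the one given in the cited eLife paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enrichment_matrix(selected_counts, input_counts, pseudocount=1.0):
    """Return log2(selected frequency / input frequency) per position.

    Each argument is a list with one dict per peptide position, mapping
    amino acid -> raw count. The pseudocount (an assumption of this sketch)
    avoids division by zero for residues that were never observed.
    """
    matrix = []
    for sel_col, inp_col in zip(selected_counts, input_counts):
        sel_total = sum(sel_col.get(aa, 0) + pseudocount for aa in AMINO_ACIDS)
        inp_total = sum(inp_col.get(aa, 0) + pseudocount for aa in AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            sel_freq = (sel_col.get(aa, 0) + pseudocount) / sel_total
            inp_freq = (inp_col.get(aa, 0) + pseudocount) / inp_total
            column[aa] = math.log2(sel_freq / inp_freq)
        matrix.append(column)
    return matrix


def score_peptide(peptide, matrix):
    """Sum the position-specific enrichment values along one peptide."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix columns")
    return sum(matrix[pos][aa] for pos, aa in enumerate(peptide))
```

Under these assumptions, a residue that is equally frequent in the selected and input libraries contributes zero to the score, so the fixed central tyrosine does not change the ranking of peptides.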
The raw fastq files, the translated sequence files, count matrices, and calculated scores, along with the scripts to generate these scores, are provided in this data repository. The analysis pipeline is also extensively documented here: https://doi.org/10.7554/elife.82345
The confocal microscopy data were acquired with a 60x oil-immersion objective (Nikon, MRD71670) on a confocal spinning disk microscope (Andor Dragonfly) coupled to a Nikon Ti-2 inverted epifluorescence microscope with automated stage control, the Nikon Perfect Focus System, and a Zyla PLUS 4.2-megapixel USB3 camera. Illumination was provided by 100 mW 405 nm, 50 mW 488 nm, 50 mW 561 nm, and 140 mW 640 nm solid-state lasers. All hardware was controlled using Andor Fusion software. All images were acquired at 100% laser power with a 100 ms exposure time. Images were acquired as 11 z-slices at 2.0-μm intervals (total scan depth 20 μm). The images are in .ims format and can be opened in software such as ImageJ/Fiji. A metadata file for the microscopy data is also provided.
Files and variables
File: raw_sequencing.zip
Description: This compressed directory contains all of the raw FASTQ files (forward and reverse paired-end reads) from two separate sequencing experiments, separated into subfolders labeled 210812 and 210831. There is also an Excel spreadsheet (Barcode-Sample Mapping.xlsx) that describes which files correspond to which sample in the peptide display experiment. Note that the 210812 data were collected on a NextSeq instrument, so there are data files for 4 lanes per barcode (L001, L002, L003, L004). The 210831 data were collected on a MiSeq instrument, so there are data files for only 1 lane per barcode (L001). These paired-end reads can be merged, trimmed, and translated using scripts in this GitHub repository: https://github.com/nshahlab/2022_Li-et-al_peptide-display
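Because the NextSeq run produces four lane files per barcode and read direction, it can help to group the files by sample before merging. The sketch below shows one way to do this. It assumes the files follow the common Illumina naming pattern <sample>_L00<lane>_R<read>_001.fastq(.gz); the pattern, the example file names, and the function name group_by_sample are assumptions, so check them against the actual file names and Barcode-Sample Mapping.xlsx.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>_L00<lane>_R<read>_001.fastq(.gz)
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq(\.gz)?$"
)


def group_by_sample(filenames):
    """Group FASTQ file names by sample, keeping lanes in sorted order.

    Returns {sample: {"R1": [...], "R2": [...]}}. Names that do not match
    the assumed pattern (for example the Excel mapping file) are skipped.
    """
    groups = defaultdict(lambda: {"R1": [], "R2": []})
    for name in sorted(filenames):
        match = FASTQ_NAME.match(name)
        if match is None:
            continue
        groups[match["sample"]]["R" + match["read"]].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
example = [
    "bc01_S1_L002_R1_001.fastq.gz",
    "bc01_S1_L001_R1_001.fastq.gz",
    "bc01_S1_L001_R2_001.fastq.gz",
    "bc01_S1_L002_R2_001.fastq.gz",
    "Barcode-Sample Mapping.xlsx",
]
grouped = group_by_sample(example)
```

With the lists in hand, the lane files for each read direction can be processed together, while the MiSeq samples simply yield one file per read direction.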
File: scripts.zip
Description: This compressed directory contains two scripts and one input file.
- AA-frequency-nostop.py is a Python script that reads one of the translated sequence read files in the translated_reads directory and calculates a position-specific count matrix (11 columns, one per peptide position, by 21 rows, one per amino acid plus a stop codon). Note that this script omits all sequences that lack a central "Y" or contain one or more stop codons ("*").
- score_peptide_nostop.py is a Python script that takes two output files from the prior script (two count matrices, one from selection by an SH2 domain and one from the input/unselected library) and uses them to calculate a normalized scoring matrix. The script then reads a list of sequences as input (e.g. pTyr-phosphosites.txt) and outputs that same list with unnormalized and normalized scores for each peptide.
- pTyr-phosphosites.txt is an input file that contains approximately 40,000 human tyrosine phosphorylation site sequences, which we scored for SH2 binding in this study. It can be used as an input file for score_peptide_nostop.py.
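The counting step performed by AA-frequency-nostop.py can be illustrated with a short sketch. This is not the script itself: the function names, the row order, and the assumption that each FASTA record holds one unwrapped 11-residue sequence are all hypothetical. Only the filtering rule (skip sequences without a central "Y" or with a stop codon) and the 21 x 11 matrix shape come from the description above.

```python
from collections import Counter

ROWS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus the stop symbol
PEPTIDE_LENGTH = 11
CENTER = 5  # zero-based index of the central tyrosine in X5-Y-X5


def read_fasta_sequences(lines):
    """Yield sequence lines from FASTA text, skipping headers and blanks.

    Assumes one unwrapped sequence per record (an assumption of this sketch).
    """
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line


def count_matrix(sequences):
    """Return {residue: [count at position 0, ..., count at position 10]}."""
    columns = [Counter() for _ in range(PEPTIDE_LENGTH)]
    for seq in sequences:
        # Skip sequences of the wrong length, without a central Y,
        # or containing a stop codon.
        if len(seq) != PEPTIDE_LENGTH or seq[CENTER] != "Y" or "*" in seq:
            continue
        for pos, aa in enumerate(seq):
            columns[pos][aa] += 1
    return {aa: [columns[pos][aa] for pos in range(PEPTIDE_LENGTH)] for aa in ROWS}


# Toy input: only the first and last records pass the filter.
fasta_text = """>read1
AAAAAYAAAAA
>read2
AAAAAFAAAAA
>read3
AAAA*YAAAAA
>read4
CCCCCYCCCCC
"""
matrix = count_matrix(read_fasta_sequences(fasta_text.splitlines()))
```

Because sequences containing a stop codon are filtered out, the "*" row stays at zero in this sketch; it is kept only so the output has the 21 rows described above.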
File: translated_reads.zip
Description: This compressed directory contains six FASTA files corresponding to the translated reads from six deep sequencing samples. Each file name contains the date of the sequencing run and the selection condition ("NSH2" or "CSH2" for either SH2 domain from wild-type SHP2, or "input" for an unselected sample).
File: count_matrices.zip
Description: This compressed directory contains six count matrices generated using the AA-frequency-nostop.py script and the input fasta files from the translated_reads directory.
File: scores.zip
Description: This compressed directory contains the unnormalized and normalized binding scores for all of the peptide sequences in pTyr-phosphosites.txt, generated using the count matrices in the count_matrices directory. Scores for two replicates with the SHP2 N-SH2 and C-SH2 domains are calculated. These replicate scores were averaged for all of the analyses reported in the study.
File: microscopy.zip
Description: This compressed directory contains confocal microscopy data analyzing SHP2 localization in U2OS cells. The confocal microscopy data were acquired with a 60x oil-immersion objective (Nikon, MRD71670) on a confocal spinning disk microscope (Andor Dragonfly) coupled to a Nikon Ti-2 inverted epifluorescence microscope with automated stage control, the Nikon Perfect Focus System, and a Zyla PLUS 4.2-megapixel USB3 camera. Illumination was provided by 100 mW 405 nm, 50 mW 488 nm, 50 mW 561 nm, and 140 mW 640 nm solid-state lasers. All hardware was controlled using Andor Fusion software. All images were acquired at 100% laser power with a 100 ms exposure time. Images were acquired as 11 z-slices at 2.0-μm intervals (total scan depth 20 μm). The image file, Untransfected U2-OS cells (SHP2, Tom20, Phalloidin, DAPI).ims, is in .ims format and can be opened in software such as ImageJ/Fiji. A metadata file for the microscopy data is also provided: Untransfected U2-OS cells (metadata).txt.
Access information
Other publicly accessible locations of the data:
- Processed and interpreted forms of the data are available in this preprint: https://www.biorxiv.org/content/10.1101/2025.02.26.640373v1
Data was derived from the following sources:
- The list of phosphorylation sites used for this analysis is derived from the PhosphoSitePlus database: https://www.phosphosite.org/homeAction.action