Data from: Deep mutational scanning of the multi-domain phosphatase SHP2 reveals mechanisms of regulation and pathogenicity
Data files
May 18, 2024 version files 69.19 GB
- 4dgp_E76K_1_2500f.nc (4.93 GB)
- 4dgp_E76K_2_2500f.nc (4.93 GB)
- 4dgp_E76K_3_2500f.nc (4.93 GB)
- 4dgp_wt_1_2500f.nc (4.93 GB)
- 4dgp_wt_2_2500f.nc (4.93 GB)
- 4dgp_wt_3_2500f.nc (4.93 GB)
- 6crf_E76K_1_2500f.nc (4.96 GB)
- 6crf_E76K_2_2500f.nc (4.96 GB)
- 6crf_E76K_3_2500f.nc (4.96 GB)
- 6crf_wt_1_2500f.nc (4.95 GB)
- 6crf_wt_2_2500f.nc (4.95 GB)
- 6crf_wt_3_2500f.nc (4.95 GB)
- parameter_files.zip (6.95 MB)
- PDB-format-MD-trajectories.zip (437.69 MB)
- README.md (4.73 KB)
- SHP2-FL_cSrc-KD.zip (3.36 GB)
- SHP2-FL_vSrc-FL.zip (4.36 GB)
- SHP2-PTP_cSrc-KD.zip (886.23 MB)
- SHP2-PTP_vSrc-FL.zip (788.87 MB)
Jul 11, 2024 version files 95.84 GB
- 4dgp_E76K_1_2500f.nc (4.93 GB)
- 4dgp_E76K_2_2500f.nc (4.93 GB)
- 4dgp_E76K_3_2500f.nc (4.93 GB)
- 4dgp_wt_1_2500f.nc (4.93 GB)
- 4dgp_wt_2_2500f.nc (4.93 GB)
- 4dgp_wt_3_2500f.nc (4.93 GB)
- 6crf_E76K_1_2500f.nc (4.96 GB)
- 6crf_E76K_2_2500f.nc (4.96 GB)
- 6crf_E76K_3_2500f.nc (4.96 GB)
- 6crf_wt_1_2500f.nc (4.95 GB)
- 6crf_wt_2_2500f.nc (4.95 GB)
- 6crf_wt_3_2500f.nc (4.95 GB)
- parameter_files.zip (10.13 MB)
- PDB-format-MD-trajectories.zip (656.59 MB)
- README.md (5.54 KB)
- shp2_af2_E76K_1_2500f.nc (4.40 GB)
- shp2_af2_E76K_2_2500f.nc (4.40 GB)
- shp2_af2_E76K_3_2500f.nc (4.40 GB)
- shp2_af2_wt_1_2500f.nc (4.41 GB)
- shp2_af2_wt_2_2500f.nc (4.41 GB)
- shp2_af2_wt_3_2500f.nc (4.41 GB)
- SHP2-FL_cSrc-KD.zip (3.36 GB)
- SHP2-FL_vSrc-FL.zip (4.36 GB)
- SHP2-PTP_cSrc-KD.zip (886.23 MB)
- SHP2-PTP_vSrc-FL.zip (788.87 MB)
Jan 07, 2025 version files 95.84 GB
- 4dgp_E76K_1_2500f.nc (4.93 GB)
- 4dgp_E76K_2_2500f.nc (4.93 GB)
- 4dgp_E76K_3_2500f.nc (4.93 GB)
- 4dgp_wt_1_2500f.nc (4.93 GB)
- 4dgp_wt_2_2500f.nc (4.93 GB)
- 4dgp_wt_3_2500f.nc (4.93 GB)
- 6crf_E76K_1_2500f.nc (4.96 GB)
- 6crf_E76K_2_2500f.nc (4.96 GB)
- 6crf_E76K_3_2500f.nc (4.96 GB)
- 6crf_wt_1_2500f.nc (4.95 GB)
- 6crf_wt_2_2500f.nc (4.95 GB)
- 6crf_wt_3_2500f.nc (4.95 GB)
- parameter_files.zip (10.13 MB)
- PDB-format-MD-trajectories.zip (656.59 MB)
- README.md (5.54 KB)
- shp2_af2_E76K_1_2500f.nc (4.40 GB)
- shp2_af2_E76K_2_2500f.nc (4.40 GB)
- shp2_af2_E76K_3_2500f.nc (4.40 GB)
- shp2_af2_wt_1_2500f.nc (4.41 GB)
- shp2_af2_wt_2_2500f.nc (4.41 GB)
- shp2_af2_wt_3_2500f.nc (4.41 GB)
- SHP2-FL_cSrc-KD.zip (3.36 GB)
- SHP2-FL_vSrc-FL.zip (4.36 GB)
- SHP2-PTP_cSrc-KD.zip (886.23 MB)
- SHP2-PTP_vSrc-FL.zip (788.87 MB)
Mar 25, 2025 version files 95.89 GB
- 4dgp_E76K_1_2500f.nc (4.93 GB)
- 4dgp_E76K_2_2500f.nc (4.93 GB)
- 4dgp_E76K_3_2500f.nc (4.93 GB)
- 4dgp_wt_1_2500f.nc (4.93 GB)
- 4dgp_wt_2_2500f.nc (4.93 GB)
- 4dgp_wt_3_2500f.nc (4.93 GB)
- 6crf_E76K_1_2500f.nc (4.96 GB)
- 6crf_E76K_2_2500f.nc (4.96 GB)
- 6crf_E76K_3_2500f.nc (4.96 GB)
- 6crf_wt_1_2500f.nc (4.95 GB)
- 6crf_wt_2_2500f.nc (4.95 GB)
- 6crf_wt_3_2500f.nc (4.95 GB)
- AlphaFold2-model.zip (14.94 MB)
- MD-starting-structures.zip (34.10 MB)
- parameter_files.zip (10.13 MB)
- PDB-format-MD-trajectories.zip (656.59 MB)
- README.md (6.65 KB)
- shp2_af2_E76K_1_2500f.nc (4.40 GB)
- shp2_af2_E76K_2_2500f.nc (4.40 GB)
- shp2_af2_E76K_3_2500f.nc (4.40 GB)
- shp2_af2_wt_1_2500f.nc (4.41 GB)
- shp2_af2_wt_2_2500f.nc (4.41 GB)
- shp2_af2_wt_3_2500f.nc (4.41 GB)
- SHP2-FL_cSrc-KD.zip (3.36 GB)
- SHP2-FL_vSrc-FL.zip (4.36 GB)
- SHP2-PTP_cSrc-KD.zip (886.23 MB)
- SHP2-PTP_vSrc-FL.zip (788.87 MB)
Abstract
Multi-domain enzymes can be regulated both by inter-domain interactions and structural features intrinsic to the catalytic domain. The tyrosine phosphatase SHP2 is a quintessential example of a multi-domain protein that is regulated by inter-domain interactions. This enzyme has a protein tyrosine phosphatase (PTP) domain and two phosphotyrosine-recognition domains (N-SH2 and C-SH2) that regulate phosphatase activity through autoinhibitory interactions. SHP2 is canonically activated by phosphoprotein binding to the SH2 domains, which causes large interdomain rearrangements, but autoinhibition is also disrupted by disease-associated mutations. Many details of the SHP2 activation are still unclear, the structure of the active state remains elusive, and hundreds of human variants of SHP2 have not been functionally characterized. Here, we perform scanning mutagenesis on both full-length SHP2 and its isolated PTP domain to examine mutational effects on inter-domain regulation and catalytic activity. Our experiments provide a comprehensive map of SHP2 mutational sensitivity, both in the presence and absence of interdomain regulation. Coupled with molecular dynamics simulations, our investigation reveals novel structural features that govern the stability of the autoinhibited and active states of SHP2. Our analysis also identifies key residues beyond the SHP2 active site that control PTP domain dynamics and intrinsic catalytic activity. This work expands our understanding of SHP2 regulation and provides new insights into SHP2 pathogenicity.
https://doi.org/10.5061/dryad.83bk3jb18
Associated Preprint:
https://doi.org/10.1101/2024.05.13.593907
Dates of data collection:
March 2023 to May 2024
Overview
This dataset contains data from structural and mutational analysis of the signaling enzyme SHP2. The data are clustered into three groups, based on the type of experiment/analysis that generated them. The details of these experiments can be found in the associated preprint. Briefly:
(1) We conducted deep mutational scanning experiments in which we constructed 15 DNA libraries encoding mutations across the SHP2 gene, subjected these to selection for SHP2 function in yeast, and then analyzed the DNA before and after selection by deep sequencing. The data associated with this experiment are FASTQ-format Illumina deep sequencing files. These are uploaded in 4 zipped folders (SHP2-FL_vSrc-FL.zip, SHP2-FL_cSrc-KD.zip, SHP2-PTP_vSrc-FL.zip, SHP2-PTP_cSrc-KD.zip), corresponding to 4 different versions of the experiment.
(2) We used AlphaFold2 via a ColabFold notebook to generate a model of SHP2 residues 1-529. Default settings were used, except that relaxation of the models was enabled. All of the output files from this modeling session are in the compressed folder named “AlphaFold2-model.zip”.
(3) The remainder of the data in this dataset comes from molecular dynamics (MD) simulations run using the Amber software package. We ran 18 simulations in total, corresponding to 3 replicates each for 6 different systems. The corresponding data include 18 raw trajectory files (.nc), 6 parameter/topology files (1 per system, in the compressed folder named “parameter_files.zip”), and 18 PDB-format processed trajectory files (in the compressed folder named “PDB-format-MD-trajectories.zip”). Additionally, the solvated starting structures for each system can be found in the compressed folder named “MD-starting-structures.zip”.
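As a quick check that a download is complete, the 18 raw trajectory file names can be regenerated from the 3 starting structures, 2 sequences, and 3 replicates. This is a minimal sketch based only on the .nc file names listed for this dataset; the function names and the download directory are hypothetical.

```python
from itertools import product
from pathlib import Path

# Name components taken from the .nc files listed in this dataset.
STARTS = ("4dgp", "6crf", "shp2_af2")   # starting conformation
VARIANTS = ("wt", "E76K")               # SHP2 sequence
REPLICATES = (1, 2, 3)


def expected_trajectories():
    """Return the 18 expected raw trajectory file names."""
    return [
        f"{start}_{variant}_{rep}_2500f.nc"
        for start, variant, rep in product(STARTS, VARIANTS, REPLICATES)
    ]


def missing_trajectories(directory):
    """List expected trajectory files that are absent from `directory`."""
    folder = Path(directory)
    return [name for name in expected_trajectories()
            if not (folder / name).is_file()]
```

For example, `missing_trajectories("downloads")` (with a hypothetical `downloads` folder) returns an empty list once all 18 files are present.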
Data structure and analysis
Deep sequencing data
Raw, unpaired deep sequencing data files (.fastq) are split into four compressed folders. Each folder contains all of the replicate data for deep mutational scans of either SHP2-FL or SHP2-PTP in the presence of either vSrc-FL or cSrc-KD. Each folder has subfolders containing the sequencing data for each individual tile (15 tiles for the SHP2-FL experiments and 7 tiles for the SHP2-PTP experiments). Within each tile subfolder, there should be sets of “unselected” and “selected” files with the same date prefix. Files sharing a date prefix can be analyzed together to generate the enrichment scores for all mutants in that tile for one replicate of selection. We process the data in three steps: merging paired reads, trimming adapters, and counting variants. The scripts for these processing steps, and instructions on how to use them, can be found here: https://github.com/nshahlab/2024_Jiang-et-al_SHP2-DMS
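To illustrate how variant counts from one matched pair of “unselected” and “selected” samples might be turned into enrichment scores, here is a minimal sketch. The formula (log10 of the selected/unselected ratio, normalized to wild type) and the rule of skipping variants with zero counts are assumptions made for illustration; the scripts in the GitHub repository above define the actual procedure. The function name and the dictionary layout are hypothetical.

```python
import math


def enrichment_scores(unselected, selected, wt_key="WT"):
    """Compute wild-type-normalized log10 enrichment per variant.

    `unselected` and `selected` map variant names to read counts.
    Assumed formula (illustrative only):
        log10((sel_v / unsel_v) / (sel_wt / unsel_wt))
    Variants with a zero or missing count in either sample are skipped.
    """
    wt_ratio = selected[wt_key] / unselected[wt_key]
    scores = {}
    for variant, unsel_count in unselected.items():
        sel_count = selected.get(variant, 0)
        if unsel_count <= 0 or sel_count <= 0:
            continue  # ratio is undefined without reads in both samples
        scores[variant] = math.log10((sel_count / unsel_count) / wt_ratio)
    return scores
```

Under this definition, the wild-type score is 0 by construction, and a variant that is enriched tenfold relative to wild type scores 1.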
AlphaFold2 data
All of the standard output files from the AlphaFold2 ColabFold run can be found in “AlphaFold2-model.zip”. These include the configuration files, pLDDT and coverage data, and both unrelaxed and relaxed models. A second copy of the top relaxed model, which was used for molecular dynamics simulations, is provided with the file name “SHP2WTtrunc_6b232.pdb”.
Molecular dynamics data
PDB-format molecular dynamics trajectories (.pdb) are all in one compressed folder (“PDB-format-MD-trajectories.zip”). These 18 files have descriptive names and correspond to three replicates of each of the following simulated systems:
(1) SHP2 wild-type sequence starting from the conformation seen in PDB code 4DGP
(2) SHP2 wild-type sequence starting from the conformation seen in PDB code 6CRF
(3) SHP2 E76K sequence starting from the conformation seen in PDB code 4DGP
(4) SHP2 E76K sequence starting from the conformation seen in PDB code 6CRF
(5) SHP2 wild-type sequence starting from a model generated by AlphaFold2 in the open conformation
(6) SHP2 E76K sequence starting from a model generated by AlphaFold2 in the open conformation
In each of these files, water molecules and salt ions have been removed, so only the protein atoms are present. The files contain snapshots sampled every 10 ns from 0-2500 ns.
For users who wish to process and analyze the trajectories from the raw data, we provide the full trajectory files (.nc) containing all atoms in each simulation, sampled every 1 ns from 0-2500 ns. Finer sampling would result in excessive file sizes. 18 trajectory files are provided, corresponding to the same 18 trajectories described above for the PDB-format files. To process these files, we use the CPPTRAJ program within AmberTools (https://ambermd.org/AmberTools.php). For each system, there is a corresponding parameter/topology file (.prmtop) that CPPTRAJ requires as an input alongside the trajectory file (.nc).
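As an illustration of how a raw trajectory and its topology might be combined in CPPTRAJ, the sketch below writes a short CPPTRAJ input script that keeps every 10th frame, strips solvent and ions, and writes a PDB-format trajectory. It is not the authors' processing script. The residue names in the strip mask (WAT, Na+, Cl-) are common Amber defaults and are assumed here, and the .prmtop and output file names in the usage note are hypothetical, so check them against the contents of “parameter_files.zip”.

```python
def cpptraj_strip_script(prmtop, trajectory, output_pdb, offset=10):
    """Build a CPPTRAJ input script as a string.

    The script loads the topology, reads every `offset`-th frame of the
    trajectory, re-images the system, removes water and ions (mask names
    are assumed, not taken from the dataset), and writes a PDB trajectory.
    """
    lines = [
        f"parm {prmtop}",
        f"trajin {trajectory} 1 last {offset}",  # start, stop, offset
        "autoimage",
        "strip :WAT,Na+,Cl-",                    # assumed residue names
        f"trajout {output_pdb} pdb",
        "run",
    ]
    return "\n".join(lines) + "\n"


def write_script(path, **kwargs):
    """Write the generated script to `path` for use with `cpptraj -i`."""
    with open(path, "w") as handle:
        handle.write(cpptraj_strip_script(**kwargs))
```

A script written with, for example, `write_script("strip.in", prmtop="4dgp_wt.prmtop", trajectory="4dgp_wt_1_2500f.nc", output_pdb="4dgp_wt_1.pdb")` could then be run as `cpptraj -i strip.in`.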
Finally, we have also provided the starting structure models in both .pdb and .inpcrd formats, alongside the parameter/topology file (.prmtop), in “MD-starting-structures.zip”.
Sharing/Access information
This repository will be accessible from the corresponding manuscript. Any additional data associated with this study is summarized in the manuscript and other forms of data can be obtained by directly contacting the authors.
Code/Software
Details on how to process and analyze the sequencing data can be found in the associated manuscript. Python scripts used to process the data can be found here: https://github.com/nshahlab/2024_Jiang-et-al_SHP2-DMS
Versions
The original version of this dataset, published on May 18, 2024, did not include any MD trajectories or associated files for simulations started from the AlphaFold2 open conformation. On July 11, 2024, data for these 6 MD trajectories were added as follows: 2 new parameter files (.prmtop) were added to the zipped directory “parameter_files.zip”; 6 new PDB-format trajectories were added to the zipped directory “PDB-format-MD-trajectories.zip”; and 6 new trajectory files (.nc) were uploaded. All of these new files have the prefix “shp2_af2”. On March 20, 2025, the MD starting structure files and the raw output files from the AlphaFold2 modeling were added to the dataset.
The molecular dynamics data were generated using the Amber Molecular Dynamics Package, as described in the associated manuscript. Data were processed using the CPPTRAJ program within AmberTools. The AlphaFold2 model for SHP2 was generated using ColabFold with the default settings (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). Deep sequencing data are the result of high-throughput peptide display screens, conducted as described in the manuscript. Data were generated using an Illumina MiSeq or NextSeq instrument. Data were processed in three steps: (1) FLASh (https://ccb.jhu.edu/software/FLASH/) was used for paired-end read merging, (2) CutAdapt (https://cutadapt.readthedocs.io/en/stable/) was used to trim flanking sequences, and (3) trimmed sequences were translated and counted using in-house Python scripts (https://github.com/nshahlab/2024_Jiang-et-al_SHP2-DMS).
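The translate-and-count step (3) can be sketched as follows. This is a generic illustration using the standard genetic code, not the in-house scripts linked above; the function names are hypothetical, and it assumes the trimmed reads are in frame, contain only A/C/G/T, and have lengths that are multiples of three (other reads are skipped).

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_translated(reads):
    """Count protein sequences encoded by an iterable of trimmed reads."""
    counts = Counter()
    for read in reads:
        read = read.strip().upper()
        if not read or len(read) % 3 or set(read) - set(_BASES):
            continue  # skip out-of-frame or ambiguous reads
        counts[translate(read)] += 1
    return counts
```

Counts produced this way for matched unselected and selected samples would be the input to an enrichment calculation.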