Data from: Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins
Data files
Sep 21, 2023 version, 1.25 GB in total
- DataS1.zip (7.96 MB)
- DataS2.zip (489.17 KB)
- DataS3.zip (15.13 MB)
- DataS4.zip (4.60 MB)
- DataS5.zip (25.11 MB)
- DataS6.zip (1.13 GB)
- FiguresS.zip (64.75 MB)
- FileS1.xlsx (3.03 MB)
- FileS2.pdf (119.44 KB)
- README.md (32.43 KB)
Abstract
Protein-protein interactions drive many cellular processes. Some protein interactions are directed by Src homology 3 (SH3) domains that bind proline-rich motifs on other proteins. The evolution of the binding specificity of SH3 domains is not completely understood, particularly following gene duplication. Paralogous genes accumulate mutations that can modify protein functions and, for SH3 domains, their binding preferences. Here, we examined how the binding of the SH3 domains of two paralogous yeast type I myosins, Myo3 and Myo5, evolved following duplication. We found that the paralogs have subtly different SH3-dependent interaction profiles. However, by swapping SH3 domains between the paralogs and characterizing the SH3 domains freed from their protein context, we find that few of the differences in interactions, if any, depend on the SH3 domains themselves. We used ancestral sequence reconstruction to resurrect the pre-duplication SH3 domains and examined, moving back in time, how the binding preference changed. Although the closest ancestor of the two domains had a very similar binding preference as the extant ones, older ancestral domains displayed a gradual loss of interaction with the modern interaction partners when inserted in the extant paralogs. Molecular docking and experimental characterization of the free ancestral domains showed that their affinity with the proline motifs is likely not the cause for this loss of binding. Taken together, our results suggest that the SH3 and its host protein could create intramolecular or allosteric interactions essential for the SH3-dependent PPIs, making domains not functionally equivalent even when they have the same binding specificity.
This README file was generated on 2023-09-19 by Pascale Lemieux.
- Title of Dataset: Data from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins
- Author Information
A. Principal Investigator Contact Information
Name: Christian Landry
Institution: Université Laval, Québec CA
Email: christian.landry@bio.ulaval.ca
B. Associate or Co-investigator Contact Information
Name: Pascale Lemieux
Institution: Université Laval, Québec, CA
Email: pascale.lemieux.4@ulaval.ca
- Date of data collection (single date, range, approximate date): 2020-2023
- Information about funding sources that supported the collection of the data: Canadian Institutes of Health Research (CIHR) Foundation grant 387697 and a HFSP grant (RGP0034/2018) to CRL
SHARING/ACCESS INFORMATION
- Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
- Links to publications that cite or use the data:
Lemieux, P., Bradley, D., Dubé, A. K., Dionne, U. & Landry, C. R. Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins. bioRxiv 2023.03.09.531510 (2023) doi:10.1101/2023.03.09.531510
- Links to other publicly accessible locations of the data: None
- Links/relationships to ancillary data sets: None
- Was data derived from another source? YES
A. If yes, list source(s): EnsemblCompara (June 2021, June 2023); https://useast.ensembl.org/info/data/index.html
- Recommended citation for this dataset:
Lemieux, P., Bradley, D., Dubé, A. K., Landry, C.R., (2023). Data from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins. Dryad Digital Repository. doi:10.5061/dryad.sj3tx968m
DATA & FILE OVERVIEW
This dataset contains the Supplementary Data for the pre-print titled ‘Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins’.
- Folder List:
A) Figures S : Supplementary figures and their captions (Figures S1-S10) from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins
B) File S1 : Supplementary Tables (Tables S1-S12)
C) Data S1 : Raw and processed data from 3 different DHFR PCA experiments, which quantify PPI strength in vivo
D) Data S2 : Protein sequences, multiple sequence alignments and phylogenetic tree of the fungal orthologs of the paralogs MYO3 & MYO5. Ancestral sequence reconstruction probabilities of their SH3 domains are also included.
E) Data S3 : AlphaFold2 structure prediction output for the ancestral SH3 domains, in the paralog context or alone. AlphaFold Multimer output for the SH3 domains and their surrounding regions.
F) Data S4 : Structures of the SH3 domains and peptides, and configuration files used for the molecular docking with Haddock2.4.
G) Data S5 : Motif conservation analysis results and the orthology information used for Figure S5A
H) Data S6 : EVcouplings output for Myo3 and Myo5 proteins
- Relationship between files, if important: None
- Additional related data collected that was not included in the current data package: None
- Are there multiple versions of the dataset? No
A. If yes, name of file(s) that was updated: NA
i. Why was the file updated? NA
ii. When was the file updated? NA
Description of the data and file structure
Figures S
All of the supplementary Figures from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins.
Figure files are named according to their identifier (S1 to S10).
The captions of the supplementary figures are available in the file Caption_FigureS.pfd
File S1
File S1 contains all the Supplementary tables (Table S1 to S12) from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins.
FileS1.xlsx is divided into 12 sheets, each showing one supplementary table.
Tables S1 to S3 contain the results of 3 different DHFR PCA experiments. Table S1 contains the results of the screen testing the Myo3 & Myo5 interactome, the SH3-dependency of those interactions, and the influence of different SH3 domains on those interactions. Table S2 contains the results of the screen testing whether deleting the predicted Myo3 and Myo5 SH3 binding motifs from the preys impacts the Bait-Prey PPIs. Table S3 contains the results of the screen testing the effect of the protein context on the PPIs of the SH3 domains, using strains expressing the SH3 domains without the rest of the protein. The expression level of the SH3 domains is modulated by the amount of estradiol in the media.
Table S1 - PCA paralog
Number of rows : 8149
Number of variables : 17
Prey.Systematic_name : Gene systematic name of the protein tagged by the DHFR[3] fragment
Prey.Standard_name : Gene standard name of the protein tagged by the DHFR[3] fragment
Bait.Standard_name : Gene standard name of the protein tagged by the DHFR[12] fragment
sh3_sequence : SH3 domain inserted in the Bait protein at the extant SH3 locus
plate_number : Media plate number
tech_rep : technical replicate number (2 per Bait & sh3_sequence combination)
row : row on the plate (1536 array) of the colony
column : column on the plate (1536 array) of the colony
n.bio_rep : number of replicates per Bait, sh3_sequence and Prey combination
area : area of the colony computed by pyphe from Data S1
adjust_area : area of the colony + 1
log2_area : log2 transformation of adjust_area
plate_max : maximum colony size per plate
plate_min : minimum colony size per plate
PPI_score : Protein-protein interaction (PPI) score of each colony
med.PPI_score : median PPI score of biological replicates
SH3_dep : if the Bait-Prey PPI is dependent on the SH3 domain (SH3-dependent), according to Wilcoxon test corrected with Benjamini-Hochberg (TRUE, FALSE)
optSH3_dif : if the Bait-Prey PPI is affected by the use of the codon optimized sequence of the extant SH3 in the bait, according to Wilcoxon test corrected with Benjamini-Hochberg (TRUE, FALSE)
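The derived columns adjust_area, log2_area and med.PPI_score follow directly from the definitions above. A minimal Python sketch, assuming rows are held as dictionaries keyed by the variable names; the helper names (`add_area_columns`, `median_ppi_score`), the prey label and the values are hypothetical and not taken from Table S1. The formula for PPI_score itself is not given in this README, so it is treated as an input:

```python
import math
from collections import defaultdict
from statistics import median


def add_area_columns(rows):
    """Add adjust_area (area + 1) and log2_area (log2 of adjust_area) to each row."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["adjust_area"] = row["area"] + 1
        new_row["log2_area"] = math.log2(new_row["adjust_area"])
        out.append(new_row)
    return out


def median_ppi_score(rows):
    """Median PPI_score per (Bait, sh3_sequence, Prey) combination."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["Bait.Standard_name"], row["sh3_sequence"], row["Prey.Systematic_name"])
        groups[key].append(row["PPI_score"])
    return {key: median(values) for key, values in groups.items()}


# Illustrative values only (hypothetical prey label and scores).
example = [
    {"Bait.Standard_name": "MYO3", "sh3_sequence": "extantMyo3",
     "Prey.Systematic_name": "PREY_A", "area": 1023, "PPI_score": 0.4},
    {"Bait.Standard_name": "MYO3", "sh3_sequence": "extantMyo3",
     "Prey.Systematic_name": "PREY_A", "area": 511, "PPI_score": 0.5},
    {"Bait.Standard_name": "MYO3", "sh3_sequence": "extantMyo3",
     "Prey.Systematic_name": "PREY_A", "area": 0, "PPI_score": 0.9},
]
transformed = add_area_columns(example)
medians = median_ppi_score(example)
```

Adding 1 before taking the logarithm keeps undetected colonies (area 0) defined on the log2 scale.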
Table S2 - PCA motifΔ prey
Number of rows : 7328
Number of variables : 16
Same variables present in Table S1 - PCA paralog except :
motif_deletion : if the predicted binding motif has been replaced by the stuffer on each Prey
Table S3 - PCA free SH3
Number of rows : 11835
Number of variables : 16
Same variables present in Table S1 & Table S2 except :
sh3_domain : SH3 domain inserted in the genetic locus (GAL) under the control of the estradiol concentration
DHFR12_tag : Position of the DHFR[12] fragment tag relative to the SH3 domain (C , N)
[estradiol] : estradiol concentration in the media in nM
Table S4 - PCA liquid validation
DHFR PCA validation results in liquid media. Table S4 contains the output metrics of the R package Growthcurver [12], together with the strain annotations and the normalized PPI score (PPI_score_gc) listed below.
Number of rows : 448
Number of variables : 15
plate_well : position of the culture in the 96-well plates
k : carrying capacity
n0 : population size at the beginning of the growth curve
r : growth rate of the population
t_mid : time at which the population density reaches 0.5k
t_gen : fastest possible generation time
auc_l : area under the logistic curve
auc_e : area under the empirical curve
sigma : measure of the goodness of fit of the parameters of the logistic equation for the data
note : if the curve does not fit the data, a warning is written here
Prey.Systematic_name : Gene systematic name of the protein tagged by the DHFR[3] fragment
sh3_sequence : SH3 domain inserted in the Bait protein at the extant SH3 locus
Bait.Standard_name : Gene standard name of the protein tagged by the DHFR[12] fragment
strain : Combination of sh3_sequence and Bait.Standard_name
PPI_score_gc : Normalized PPI score obtained in liquid culture (auc_l - (n0*60))
See https://cran.r-project.org/web/packages/growthcurver/vignettes/Growthcurver-vignette.html for further description of Growthcurver output metrics.
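To make the relationship between these metrics concrete, here is a small Python sketch of the logistic model described in the Growthcurver vignette, together with the PPI_score_gc formula given above. The function names are illustrative, and the meaning of the constant 60 is not stated in this README, so it is reproduced as written:

```python
import math


def logistic_population(t, k, n0, r):
    """Logistic curve: N(t) = k / (1 + ((k - n0) / n0) * exp(-r * t))."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))


def t_mid(k, n0, r):
    """Time at which the population reaches k / 2."""
    return math.log((k - n0) / n0) / r


def t_gen(r):
    """Fastest possible generation time, ln(2) / r."""
    return math.log(2) / r


def ppi_score_gc(auc_l, n0):
    """PPI_score_gc as defined in Table S4: auc_l - (n0 * 60)."""
    return auc_l - (n0 * 60)


# Illustrative parameters, not taken from Table S4.
k, n0, r = 1.0, 0.01, 0.5
half_density = logistic_population(t_mid(k, n0, r), k, n0, r)
```

With these toy parameters, evaluating the curve at t_mid returns half of the carrying capacity, which matches the description of t_mid above.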
Table S5 - Motif prediction
Using the experimentally resolved specificity matrix of the SH3 domains from Myo3 and Myo5 [13], we scanned the SH3-dependent prey protein sequences to identify the best match to the specificity matrix. The motif identified is the predicted motif that is bound by the SH3 domain of either Myo3 or Myo5 protein.
Number of rows : 54
Number of variables : 7
Prey.Systematic_name : Gene systematic name of the protein scanned by the specificity matrix
Prey.Standard_name : Gene standard name of the protein scanned by the specificity matrix
motif_Start-End : positions of the start and end (relative to the first amino acid of the protein) of the best match motif to the specificity matrix
best_match : amino acid sequence of the best match motif to the specificity matrix
Max_MSS : Score of the best match motif to the specificity matrix
p-value : Confidence indication of the motif prediction
Paralog : SH3 Specificity matrix used to do the scan (Myo3 or Myo5)
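The scan described for Table S5 can be pictured as sliding a window along the prey sequence and scoring each window against the specificity matrix. The Python sketch below is only an illustration under stated assumptions: the matrix is a toy 3-position matrix over a 3-letter alphabet, and the score is rescaled as (score - min) / (max - min), which is one common definition of a matrix similarity score and may differ from the exact scoring used for Max_MSS:

```python
def window_score(window, pwm):
    """Sum the matrix weight of each residue at its position in the window."""
    return sum(column[residue] for residue, column in zip(window, pwm))


def matrix_similarity_score(window, pwm):
    """Rescale the window score to [0, 1] using the matrix minimum and maximum."""
    min_score = sum(min(column.values()) for column in pwm)
    max_score = sum(max(column.values()) for column in pwm)
    return (window_score(window, pwm) - min_score) / (max_score - min_score)


def best_match(sequence, pwm):
    """Return (start, end, motif, score) of the best window, 1-based and inclusive."""
    width = len(pwm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = matrix_similarity_score(window, pwm)
        if best is None or score > best[3]:
            best = (offset + 1, offset + width, window, score)
    return best


# Toy matrix (hypothetical weights), one dictionary per motif position.
TOY_PWM = [
    {"A": 0.1, "K": 0.1, "P": 0.8},
    {"A": 0.2, "K": 0.2, "P": 0.6},
    {"A": 0.2, "K": 0.7, "P": 0.1},
]

toy_hit = best_match("AKPPKA", TOY_PWM)
```

The returned start and end positions are counted from the first amino acid of the sequence, in the same spirit as the motif_Start-End column.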
Table S6 - Cytometry
Cytometry measurements of the expression of Myo3 and Myo5 variants tagged with Green Fluorescent Protein (GFP). The variants contain different SH3 domains at the position of the extant SH3 domain in both paralogs. The cytometry measures whether the paralog variants have different levels of expression in vivo. For each replicate, we aggregated the measurements of 3000 individual cells.
Number of rows : 75
Number of variables : 6
GFP_tagged_protein : Paralog tagged by the GFP
sh3_sequence : SH3 domain inserted in the paralog, creating the paralog variant
bio_rep : identifier of the replicate (1,2,3) for each paralog variant
cyto_rawfile : name of the cytometry file containing the raw data
med_rep : median of the 3000 measurements
mean_rep : mean of the 3000 measurements
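As a small illustration of how med_rep and mean_rep summarize one replicate, the sketch below aggregates a list of per-cell fluorescence values. The function name and the values are hypothetical; a real replicate would hold 3000 measurements read from the file named in cyto_rawfile:

```python
from statistics import mean, median


def summarize_replicate(cell_values):
    """Collapse the per-cell measurements of one replicate into med_rep and mean_rep."""
    return {"med_rep": median(cell_values), "mean_rep": mean(cell_values)}


# Three made-up cell measurements standing in for the 3000 cells of a replicate.
summary = summarize_replicate([1.0, 2.0, 6.0])
```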
Table S7 - Molecular docking
To explain the in vivo binding of the SH3 domains, we performed molecular docking of the predicted binding motifs (Table S5) to the SH3 domains of interest. The experimental structures of the extant Myo3 and Myo5 SH3 domains were retrieved (Data S4) for the docking. We also used the AlphaFold2 prediction of the AncD SH3 domain (Data S3). Table S7 contains the processed data from Data S4. The molecular docking of one motif-SH3 pair generates 400 structures, which are divided into multiple clusters. The clusters are then ranked by the Haddock2.4 algorithm (i.e. 1 is the most reliable cluster). We then selected the 10 best structures of each cluster (again ranked by the Haddock2.4 algorithm).
Number of rows : 544
Number of variables : 7
SH3.Standard_name : Identifier of the SH3 domain used for the docking (AncD, Myo3, Myo5)
Motif.Standard_name : Gene standard name of the protein on which the motif is predicted (Table S5)
Motif.Systematic_name : Gene systematic name of the protein on which the motif is predicted (Table S5)
cluster_rank : Rank of the cluster
cluster_median_Interaction_energy(kcal/mol) : median of binding energy between the SH3 and the motif for the 10 best structures of each cluster
target_SH3 : SH3 predicted to bind the motif
exp_validated : if the motif was experimentally validated by the PCA experiment motifΔ prey (TRUE and FALSE)
Table S8 - Oligonucleotides
Tables S8 and S9 list the synthesized DNA material used to construct the yeast strains.
Number of rows : 120
Number of variables : 4
Name : Systematic name of the oligonucleotide
Sequence : Nucleotide sequence
Description : Detailed description of the oligonucleotide
Purpose : General purpose of the oligonucleotides
Table S9 - SH3 DNA sequences
Number of rows : 8
Number of variables : 2
SH3 : Identifier of the codon optimized SH3 domains
codon optimized sequence : nucleotide sequence of the codon optimized SH3 domain
Table S10 - Baits
Tables S10 and S11 contain the description of the Bait and Prey yeast strains used in the paper related to this dataset.
Number of rows : 44
Number of variables : 5
Systematic.Bait_name : Gene systematic name of the protein tagged by the DHFR[12] fragment
Standard.Bait_name : Gene standard name of the protein tagged by the DHFR[12] fragment
Reference : Publication which produced the strains
Strain background : Genetic background
SH3 variant : SH3 domain introduced in the Bait protein at the position of the extant SH3 domain
Table S11 - Preys
Number of rows : 305
Number of variables : 5
Systematic.Prey_name : Gene systematic name of the protein tagged by the DHFR[3] fragment
Standard.Prey_name : Gene standard name of the protein tagged by the DHFR[3] fragment
Reference : Publication which produced the strains
Strain background : Genetic background
Additionnal Informations : Specification regarding the strain construction
Table S12 - Reagent table
Table containing reagents, strains, tools and databases used in this study.
Reagent type species, or resource : General class of reagent, strain or resource
Designation : Specific name of the reagent, strain or resource
Source or reference: Reagent supplier or publication which produced the strains or resource
Identifiers : Catalog number, public repository or RRID of the reagent, strain or resource
Additional information : Specification regarding the detailed list of strains or DNA sequences
File S2
File S2 contains the detailed protocol for yeast strain construction and for PCA DHFR experiment and analysis.
Data S1
Raw data obtained from the analysis of microbial growth with pyphe [1].
The directories complete_paralog_RD, motif_confirmation_RD and free_SH3_RD each contain one DHFR PCA experiment. DHFR PCA uses multiple selection steps to allow the detection of protein-protein interactions in yeast strains.
Each experiment contains raw data files for different selection steps. The selection steps present in each experiment directory are the second diploid selection (S2_diploid directories) and the second methotrexate selection (MTX2 directories). The directory complete_paralog_RD/DHFR3_array/ contains the selection step for the large array of prey strains.
In each selection step directory, there is one CSV file per plate analysed (final timepoint of the selection step).
Here is the folder structure:
/complete_paralog_RD/ # DHFR PCA raw data from the experiment testing paralog variant baits against 296 wt preys
    /DHFR3_array/ # prey array, growth verification of the Yeast Protein Interaction collection strains
    /MTX2/ # growth on MTX at final timepoint
    /S2_diploid/ # growth verification of the diploid strains before replication on MTX medium
/motif_confirmation_RD/ # DHFR PCA raw data from the experiment testing baits against preys with the predicted binding motif deleted (Table S2)
    /MTX2/ # growth on MTX at final timepoint
    /S2_diploid/ # growth verification of the diploid strains before replication on MTX medium
/free_SH3_RD/ # DHFR PCA raw data from the experiment testing the SH3 domains expressed without the rest of the protein (Table S3)
    /MTX2/ # growth on MTX at final timepoint
    /S2_diploid/ # growth verification of the diploid strains before replication on MTX medium
Each csv file contains 7 variables and a number of rows corresponding to the number of colonies detected by pyphe on the selection media.
Variable list as defined in pyphe documentation [1]:
area : single colony area
centroid : coordinates of the center of the colony
mean_intensity : overall intensity (an estimator that reflects thickness as well as area)
perimeter : perimeter of the colony
row : row of the colony
column : column of the colony
circularity : circularity of the colony
CSV files are analyzed with the corresponding PCA_analysis_X.R script available in the Github repository cited in the paper.
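The authors' analysis is done with the R scripts mentioned above. For readers who only want a quick look at the per-plate files, the following Python sketch gathers every CSV of one selection step directory into a single list of rows. It assumes the column headers match the variable list above (in particular that an `area` column exists); the helper name `load_plate_csvs` and the added `source_file` key are hypothetical, and the demonstration writes a tiny made-up plate file to a temporary directory:

```python
import csv
import tempfile
from pathlib import Path


def load_plate_csvs(directory):
    """Read every per-plate CSV in a directory, tagging each row with its file name."""
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # keeps track of the plate of origin
                row["area"] = float(row["area"])  # colony area as a number
                rows.append(row)
    return rows


# Demonstration on a made-up plate file (values are not from Data S1).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "plate_1.csv").write_text("area,row,column\n250.0,1,1\n0.0,1,2\n")
    demo_rows = load_plate_csvs(tmp)
```

On the real data, the same function could be pointed at a directory such as complete_paralog_RD/MTX2 after unzipping DataS1.zip.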
Data S2
Data S2 contains 7 files with the sequence data used to reconstruct the ancestral SH3 domains and to visualize the sequence comparisons in the Supplementary Figures. The fungal orthologs of MYO3 and MYO5 were retrieved from EnsemblCompara [2], and SH3 domains were defined according to the SMART database v8 [3]. Each file is obtained after one step of the ancestral sequence reconstruction pipeline.
1 - Fungal orthologs protein sequences retrieved by EnsemblCompara [2] as orthologs of Myo3 and Myo5 yeast proteins
ortholog_sequences.fa
2 - Multiple Sequence Alignment generated by MAFFT L-INS-i v7.453 of the fungal orthologs.
ortholog_MSA.fa
3 - Trimmed Multiple Sequence Alignment of the fungal orthologs used for the tree construction
ortholog_MSA_trim.fa
4 - Phylogenetic tree of the fungal orthologs generated by IQ-TREE2 [5]
ortholog_phylogeny_newick.txt
5 - Posterior probabilities of the reconstruction for all positions in the SH3 domains at each node of the phylogenetic tree, produced by FastML v3.11 [4]
Ancestral_MaxMarginalProb_Char_Indel.txt
6 - Multiple Sequence Alignment of the orthologs present in the WGD clade
ortholog_dup_MSA.afa
7 - Protein sequence of the ancestral and extant SH3 domains tested in vivo (Figure S1E)
myo3.myo5.sh3sequence.fasta
Data S3
We used each ancestral sequence for protein structure prediction.
AlphaFold2 (AF) structure prediction outputs for each of the SH3 domains are present in the directories AncA, AncB, AncC, AncD, extantMyo3 and extantMyo5. Except for AncB, the predictions were performed with AlphaFold Colab (February 2022).
The names of the folders correspond to the names of the SH3 domains described in the paper related to this repository.
See the AlphaFold Colab Notebook documentation [6] for the description of all the files.
superimposition.cxs is the file used for visualization of the superimposed predicted SH3 structures. It can be opened with ChimeraX [7].
Further protein structure predictions were performed using the full-length protein sequence modified by replacing the extant SH3 domain with an ancestral one, creating chimeras. These results can be found in the directory paralog_variants. For each prediction, there is one file with the protein sequence submitted to AlphaFold2 (Anc*Myo*.fasta) and one file with the best-ranked structure (ranked_0_AncMyo.pdb).
We then used AlphaFold Multimer to predict how the binding of SH3 domains surrounded by disordered regions could be affected by those regions. We created chimeras by inserting AncD in place of the extant SH3 domains and submitted the AncD chimeras with one extant sequence to AlphaFold Multimer. The sequences used as input are available in the directory /sh3_regions/. The outputs of AlphaFold Multimer are present in the subdirectories AncDmyo3_extantMyo5, AncDmyo5_extantMyo3 and extantMyo3_extantMyo5, named according to the sequences used as input. In each of those directories, you can find the AlphaFold Multimer top-5 models (ranked_[0-4].pdb) and a ChimeraX file used for visualization of the predictions (*.cxs).
AncB structure was predicted with a later version of AlphaFold Colab Notebook (December 2022) which explains the different format of the output. It contains only the first ranked model (selected_prediction.pdb) and the predicted aligned error of this model (predicted_aligned_error.json).
Here is the folder structure:
/AncA/
/AncB/
    /prediction/
        predicted_aligned_error.json
        selected_prediction.pdb
/AncC/
/AncD/
/extantMyo3/
/extantMyo5/
/paralog_variants/
/sh3_regions/
    /AncDmyo3_extantMyo5/
    /AncDmyo5_extantMyo3/
    /extantMyo3_extantMyo5/
    *.fasta # sequence files
superimposition.cxs
Data S4
All input and output data of the molecular docking of the SH3 domains with multiple peptides.
First, the peptide structures were generated by AlphaFold2 [11] and the output files are present in the AF2_motif_structure directory. There is one subdirectory per peptide used for the structure predictions. Each subdirectory is named after the systematic name of the gene (Y*) encoding the protein that contains the peptide, followed by the paralog predicted to bind the peptide (both, MYO3 or MYO5). The binding motifs are listed in Table S5. Each subdirectory contains the AlphaFold2 output files of the peptide structure prediction. See the AlphaFold documentation for a more detailed description of the output files [6, 11].
Then, the SH3 domain structures were prepared for the docking. We used the AncD AlphaFold prediction and the experimental structures of the Myo3 (PDB:1RUW) and Myo5 (PDB:1YP5) SH3 domains. We followed the best practices guide of Haddock2.4 [8] to prepare the SH3 domain structures for the docking. The docking-ready structures are present in the directory structures_sh3, named according to the SH3 domain.
We used Ambiguous Interaction Restraints (AIRS) [8] to identify key positions on the SH3 domains that we expect the peptides to bind. The choice of these positions is based on previous experimental work on the PDB structure 2P4R identifying which positions of a SH3 domain bind to a proline peptide. The files defining the AIRS for each SH3 domain are present in the directory AIRS_files.
Variables :
resid : amino acid residue identifier relative to the first position of a protein segment
segid : protein segment identifier (A or B, for docking with 2 segments)
Example: to indicate that position 35 on segment A should bind to residue 2, 3 or 4 of segment B, the *_AIRS file has the following structure :
assign ( resid 35 and segid A)
(
( resid 2 and segid B)
or
( resid 3 and segid B)
or
( resid 4 and segid B)
)
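For readers who want to write a restraint in the same layout, the Python sketch below rebuilds the example above from a residue on one segment and a list of candidate partner residues on the other. The function name `format_air` is hypothetical, and the output mirrors only what the example shows: no further terms are emitted because none appear in the example, so the files in AIRS_files remain the reference for the exact format used in the docking.

```python
def format_air(active_resid, active_segid, partner_resids, partner_segid):
    """Write one restraint in the layout of the *_AIRS example."""
    lines = [f"assign ( resid {active_resid} and segid {active_segid})", "("]
    for index, resid in enumerate(partner_resids):
        if index > 0:
            lines.append("or")  # alternatives are separated by 'or'
        lines.append(f"( resid {resid} and segid {partner_segid})")
    lines.append(")")
    return "\n".join(lines)


# Position 35 of segment A may contact residue 2, 3 or 4 of segment B.
restraint = format_air(35, "A", [2, 3, 4], "B")
print(restraint)
```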
Lastly, to run the molecular docking with Haddock2.4 [8], we modified the configuration files with the optimal parameters for peptide docking (https://www.bonvinlab.org/software/bpg/peptides/). There is one configuration file per SH3 domain in the directory run.csn_files. Each file is named according to the SH3 domain used in the specific molecular docking. For more detailed information on the run.cns files, please refer to the Haddock2.4 documentation (https://www.bonvinlab.org/software/haddock2.4/manual/).
400 structures were generated per molecular docking and divided into clusters according to the Root Mean Square Deviation computed by the Haddock2.4 algorithm. We then used FoldX v4 [9] on the 10 best structures of each cluster to perform energy minimization 10 times (‘RepairPDB’ [9]) and to compute the interaction energy between the docked peptide and the SH3 domain (‘AnalyseComplex’ [9]).
The file Docking_RD.csv contains the raw data obtained from the consecutive steps of molecular docking, energy minimization by FoldX v4 [9] and interaction energy computation by FoldX v4.
Docking_RD.csv
Number of rows : 4110
Variable list :
Pdb : name of the pdb file containing the SH3-peptide structure
Interaction_Energy : Interaction energy computed by FoldX v4[9]
cluster: Cluster ID defined by Haddock2.4[8]
rank : Rank of the cluster based on Haddock2.4 algorithm[8]
preys : peptide ID (Gene systematic name, paralog predicted to bind to this peptide)
SH3 : SH3 domain used for this specific docking
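Table S7 reports a median interaction energy per cluster, which can be derived from rows shaped like those of Docking_RD.csv. The Python sketch below groups rows by SH3 domain, peptide and cluster and takes the median of Interaction_Energy. It assumes that the rows passed in are already restricted to the structures to be summarized; the peptide label and the energies are made up, and the authors' exact processing is in their R scripts:

```python
from collections import defaultdict
from statistics import median


def cluster_median_energy(rows):
    """Median Interaction_Energy per (SH3, preys, cluster) group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["SH3"], row["preys"], row["cluster"])
        groups[key].append(float(row["Interaction_Energy"]))
    return {key: median(values) for key, values in groups.items()}


# Made-up rows using the Docking_RD.csv column names.
toy_rows = [
    {"SH3": "Myo3", "preys": "PEPTIDE_A", "cluster": "1", "Interaction_Energy": "-10.0"},
    {"SH3": "Myo3", "preys": "PEPTIDE_A", "cluster": "1", "Interaction_Energy": "-12.0"},
    {"SH3": "Myo3", "preys": "PEPTIDE_A", "cluster": "1", "Interaction_Energy": "-8.0"},
    {"SH3": "Myo3", "preys": "PEPTIDE_A", "cluster": "2", "Interaction_Energy": "-3.0"},
]
medians_by_cluster = cluster_median_energy(toy_rows)
```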
Here is the folder structure:
/AF2_motif_structure/ # peptide structures used as input
    /Y*_seq/
        # AlphaFold2 output
/AIRS_files/ # Ambiguous restraint files used as input
    /AncC_AIRS.txt
    /myo3_AIRS.txt
    /myo5_AIRS.txt
/run.csn_files/ # Haddock2.4 configuration files used as input
    /run_AncC_test.csn
    /run_myo3_test.csn
    /run_myo5_test.csn
/structures_sh3/ # SH3 domain structures used as input
    /AncC_clean.pdb
    /myo3_clean.pdb
    /myo5_clean.pdb
Docking_RD.csv # Molecular docking and energy minimization results for AncD, Myo3 and Myo5 SH3 domains
Data S5
Input orthology information for Figure S5 and motif conservation results for both paralogous SH3 domains.
We used orthology data to predict, for the extant interaction partners of Myo3 and Myo5, whether there was also an interaction between the ancestral proteins.
We retrieved orthology information for all yeast proteins from EnsemblCompara (June 2021) and compared it to the species included in the phylogeny used for the ancestral sequence reconstruction. Since EnsemblCompara frequently updates its dataset, the dataset used is available in the file Compara.103.protein_default.homologies.tsv.
Compara.103.protein_default.homologies.tsv
Number of rows : 753493
Variables list :
gene_stable_id : Gene systematic name
protein_stable_id : Systematic name of the protein coding gene
species : Species expressing the protein
identity : Sequence identity of the protein to the homologous protein
homology_type : type of gene homology (one2one, one2many, many2many, etc.)
homology_gene_stable_id : Homologous Gene systematic name
homology_protein_stable_id : Homologous Systematic name of the protein coding gene
homology_species : Species expressing the homologous protein
homology_identity : Sequence identity of the homologous protein to the protein
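A table with these columns can be filtered for one gene and one homology type in a few lines. The Python sketch below reads tab-separated text and returns the matching homologs; the gene identifiers and species names are placeholders, and the homology type is matched as a substring because the exact labels stored in the file may be longer than the short forms listed above:

```python
import csv
import io


def homologs_of(handle, gene_id, homology_type="one2one"):
    """Return (homology_species, homology_gene_stable_id) pairs for one gene."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["gene_stable_id"] == gene_id and homology_type in row["homology_type"]:
            hits.append((row["homology_species"], row["homology_gene_stable_id"]))
    return hits


# Made-up excerpt with a subset of the columns listed above.
toy_tsv = (
    "gene_stable_id\tspecies\thomology_type\thomology_gene_stable_id\thomology_species\n"
    "GENE_A\tspecies_x\tone2one\tGENE_A1\tspecies_y\n"
    "GENE_A\tspecies_x\tone2many\tGENE_A2\tspecies_z\n"
    "GENE_B\tspecies_x\tone2one\tGENE_B1\tspecies_y\n"
)
one_to_one = homologs_of(io.StringIO(toy_tsv), "GENE_A")
```

For the real file, an open file handle on Compara.103.protein_default.homologies.tsv would replace the `io.StringIO` object.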
We also used a more recent EnsemblCompara version (June 2023) to perform a binding motif conservation prediction on the orthologs of 7 yeast proteins (Bzz1, Lsb3, Myo5, Osh2, Pkh2, Sla1, Ste20). We performed a sequence alignment with the orthologs of the 7 proteins to map the binding motif on the orthologs according to the motif position on the yeast proteins. We also mapped the disordered region surrounding the predicted motifs and scored the motifs according to the specificity matrices of the Myo3 and Myo5 SH3 domains. The output of this analysis is one file for each of the 7 yeast proteins. There is one folder per specificity matrix that the motifs were tested against (one for Myo3 and one for Myo5).
Number of rows is equal to the number of orthologs retrieved for each yeast protein.
Variable list :
Species : Species taxonomic name
Type : Orthology type (one2one, one2many, many2many, etc.)
Taxonomy : Reference clade of the species
Best_match : predicted binding motif with the best match to the specificity matrix
Max_MSS : Score of the motif against the specificity matrix in Best_match
Finally, the file orthology_myosins_partners.csv contains the filtered version of EnsemblCompara dataset (June 2021) to summarize the orthology information for the interaction partners identified in the related paper. Orthology information is one of one2one, one2many, many2many, other_paralog or within_species_paralog.
Number of variables : 28
Number of rows : 177
No orthology information available : NA
Variable list:
label : Species name
Sla1 : Orthology information for Sla1 protein in each species of label column
Cdc10 : Orthology information for Cdc10 protein in each species of label column
Syp1 : Orthology information for Syp1 protein in each species of label column
…
Sla2 : Orthology information for Sla2 protein in each species of label column
Pkh2 : Orthology information for Pkh2 protein in each species of label column
Las17 : Orthology information for Las17 protein in each species of label column
Here is the folder structure:
/motif_conservation_results_Myo3/
    *_ortho_mss_myo3_pwm.csv
/motif_conservation_results_Myo5/
    *_ortho_mss_myo5_pwm.csv
Compara.103.protein_default.homologies.tsv
orthology_myosins_partners.csv
Data S6
Output of EVcouplings v0.2 using the Myo3 protein sequence as input. EVcouplings [10] is a bioinformatic tool that predicts coevolution between positions in a protein.
The first step of EVcouplings is to create a sequence alignment with homologs of the query sequence, here the Myo3 protein sequence (align directory). The second step is to compare the sequence-level signal to the structural information and compute the coupling scores (compare and couplings directories, respectively). Finally, EVcouplings predicts how mutations can affect the structure of the query protein (directories fold and mutate).
The most important output files are briefly described below; please refer to the EVcouplings documentation for an exhaustive description of the output files (https://github.com/debbiemarkslab/EVcouplings/blob/v0.2/notebooks/output_files_tutorial.ipynb) [10]. Most files are named according to the job name ‘MYO3_YEAST_1-1272’, which has as input the full-length sequence of the protein Myo3, from position 1 to 1272.
MYO3_YEAST_1-1272.a2m : Final alignment file used to compute evolutionary couplings.
“Lowercase columns indicate columns with too many gaps that did not meet the minimum column coverage threshold, which means they were excluded from the EC calculation. Lowercase columns have a ‘.’ for gap character by convention. Additionally, sequences which did not fulfill the the minimum sequence coverage threshold (i.e. fragments with too many gaps) have been removed from this alignment.” [10]
MYO3_YEAST_1-1272_statistics_summary.csv : Contains information on properties of the final alignment [10]
MYO3_YEAST_1-1272_frequencies.csv : Contains the frequency of each character in each position of the alignment.[10]
MYO3_YEAST_1-1272_identities.csv : Contains the sequence identity of each sequence in the alignment to the target sequence.[10]
MYO3_YEAST_1-1272_*_ECs.pdf : Contact map files displaying the evolutionary couplings together with experimental structure contacts. [10]
MYO3_YEAST_1-1272_ECs.txt : Contains the space-delimited raw output of the evolutionary couplings calculation from plmc. [10]
MYO3_YEAST_1-1272_CouplingScores.csv : Contains the evolutionary couplings sorted according to score, and the probability that a pair represents significant coupling rather than background noise. [10]
MYO3_YEAST_1-1272_enrichment.csv : Measures how strongly individual residues are coupled by summing the coupling scores of the pairs involving each residue, and then normalizing by the average level of coupling. [10]
MYO3_YEAST_1-1272.model : Contains model parameters inferred by plmc [10]
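The lowercase and '.' convention quoted above for the .a2m file means that the columns actually used for the couplings can be recovered by keeping only uppercase residues and '-' gaps. A minimal Python sketch of that idea, using a made-up two-sequence alignment and hypothetical helper names (`read_a2m`, `match_columns`):

```python
def read_a2m(text):
    """Parse A2M or FASTA formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def match_columns(sequence):
    """Keep uppercase residues and '-' gaps; drop lowercase residues and '.' gaps."""
    return "".join(char for char in sequence if char.isupper() or char == "-")


# Made-up alignment: the third column is lowercase, so it was excluded from the EC calculation.
toy_a2m = ">query\nMKtL-\n>homolog\nMR.LA\n"
included = {name: match_columns(seq) for name, seq in read_a2m(toy_a2m).items()}
```

After filtering, every sequence has the same length, equal to the number of alignment columns that passed the coverage threshold.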
Here is the general folder structure :
/evCouplings_MYO3_YEAST/
    /align/
        MYO3_YEAST_1-1272.a2m
        MYO3_YEAST_1-1272_statistics_summary.csv
        MYO3_YEAST_1-1272_frequencies.csv
        MYO3_YEAST_1-1272_identities.csv
    /compare/
        MYO3_YEAST_1-1272_*_ECs.pdf
    /couplings/
        MYO3_YEAST_1-1272_CouplingScores.csv
        MYO3_YEAST_1-1272_enrichment.csv
        MYO3_YEAST_1-1272_ECs.txt
        MYO3_YEAST_1-1272.model
    /fold/
    /mutate/
Code/Software
All the scripts used for the analyses of the raw data present in this dataset are available in the GitHub repository (https://github.com/Landrylab/Lemieux_et_al2023) cited in the paper.
References
[1] Kamrad, S., Rodriguez-Lopez, M., Cotobal, C., Correia-Melo, C., Ralser M., Bahler J. (2020). Pyphe, a python toolbox for assessing microbial growth and cell viability in high-throughput colony screens. eLife 9:e55160.
[2] Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. 2009. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19:327–335.
[3] Letunic I, Khedkar S, Bork P. 2021. SMART: recent updates, new developments and status in 2020. Nucleic Acids Res. 49:D458–D460.
[4] Pupko T, Pe’er I, Shamir R, Graur D. 2000. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. 17:890–896.
[5] Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37:1530–1534.
[6] Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all. Nature Methods, 2022.
[7] Pettersen, E. F. et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. 30, 70–82 (2021).
[8] Dominguez C, Boelens R, Bonvin AMJJ. 2003. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125:1731–1737.
[9] Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. 2005. The FoldX web server: an online force field. Nucleic Acids Res. 33:W382–W388.
[10] Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).
[11] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
[12] Sprouffske, K. & Wagner, A. Growthcurver: an R package for obtaining interpretable metrics from microbial growth curves. BMC Bioinformatics 17, 172 (2016).
[13] Tonikian, R. et al. Bayesian modeling of the yeast SH3 domain interactome predicts spatiotemporal dynamics of endocytosis proteins. PLoS Biol. 7, e1000218 (2009).
The data published in this dataset was collected by multiple methods. Among the methods used are DHFR Protein-fragment Complementation Assay, cytometry, ancestral sequence reconstruction with IQ-TREE and FastML, protein structure prediction with AlphaFold2 and AlphaFold Multimer, molecular docking with Haddock2.4, orthology analysis and coevolution predictions with EVCouplings. See the README.md file and the method section of the paper Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins for more details.
File S1 : Tables S1 - S12
File S2 : Detailed protocols
FiguresS : Figures S1 - S10
DataS1 : DHFR PCA results
DataS2 : Phylogeny and sequence alignment
DataS3 : AlphaFold results
DataS4 : Molecular docking input and output files
DataS5: Orthology input and motif conservation results
DataS6: EVCouplings output
Please refer to Lemieux et al. 2023 for details on the data collection and transformation.
All files can be opened with either R, a text editor, Excel or ChimeraX.