Alignment-based protein mutational landscape prediction: doing more with less
Data files
Sep 29, 2023 version files (28.33 GB):
- CSV_HumanProteome.tgz
- CSV_ProteinGym.tgz
- HumanProteome_GEMME.tgz
- ProteinGym_assessment.tgz
- README.md
Feb 01, 2024 version files (28.33 GB):
- CSV_HumanProteome.tgz
- CSV_ProteinGym.tgz
- HumanProteome_GEMME.tgz
- ProteinGym_assessment.tgz
- README.md
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
README: Alignment-based protein mutational landscape prediction: doing more with less.
This dataset contains the data and tools associated with Alignment-based protein mutational landscape prediction: doing more with less, Abakarova et al., Genome Biology and Evolution, 2023. doi: https://doi.org/10.1093/gbe/evad201.
Description of the data and file structure
We provide the community with data associated with our assessment of four different multiple sequence alignment (MSA) resources and protocols, as well as the complete single-mutational landscape of the human proteome predicted by combining the MSA protocol implemented in ColabFold and the variant effect predictor GEMME.
- ProteinGym_assessment.tgz contains the data and scripts associated with our assessment of the four different MSA generation protocols (ColabFold, ProteinGym, ProteinNet, Pfam) against the ProteinGym substitution benchmark.
This archive is organised as follows:
- ColabFold (CF): GEMME predictions computed from ColabFold MSAs. This folder contains 75 subdirectories with protein names. Each of these subdirectories contains the predicted single-point mutational landscape for the corresponding protein in one to three versions: *_normPred_evolCombi.txt, *_normPred_evolCombi_Pfam.txt and *_normPred_evolCombi_PrNet.txt. The first one corresponds to the original output of GEMME. In the two others, the positions not covered by Pfam or ProteinNet are replaced by NAs. These versions are useful for comparing performance between protocols. Additionally, the CF folder contains 3 other subdirectories: MultipleMutants, including predictions of multiple mutants for 11 proteins (*_normPred_evolCombi.txt, with or without *_normPred_evolCombi_PfamVersion.txt); CF_noFilter, including GEMME predictions based on the CF alignments without the filter (4 *_normPred_evolCombi.txt files); and align_CF_merged, containing 75 CF MSAs in FASTA format.
- ProteinGym (PG): GEMME predictions computed from ProteinGym MSAs. This folder contains 75 subdirectories with protein names. Each of these subdirectories contains the predicted single-point mutational landscape for the corresponding protein in one to three versions: *_normPred_evolCombi.txt, *_normPred_evolCombi_Pfam.txt and *_normPred_evolCombi_PrNet.txt. The first one corresponds to the original output of GEMME. In the two others, the positions not covered by Pfam or ProteinNet are replaced by NAs. These versions are useful for comparing performance between protocols. Additionally, the folder contains a MultipleMutants subdirectory, including predictions of multiple mutants for 11 proteins (*_normPred_evolCombi.txt, with or without *_normPred_evolCombi_PfamVersion.txt).
- ProteinNet (PN): GEMME predictions computed from ProteinNet MSAs. This folder contains 41 subdirectories with protein names corresponding to the proteins covered by Pfam. Each of these subdirectories contains the predicted single-point mutational landscape for the corresponding protein, namely PDBid_normPred_evolCombi_PrNet.txt (PDBid is the Protein Data Bank identifier of the protein covered by ProteinNet). To match the original length of the query sequence in the ProteinGym benchmark, we added extra columns filled with NAs. This version is useful for evaluation. Additionally, a MultipleMutants subdirectory is provided.
- Pfam: GEMME predictions computed from Pfam MSAs. This folder contains 39 subdirectories with protein names corresponding to the proteins covered by Pfam. Each of these subdirectories contains the predicted single-point mutational landscape for the corresponding protein, namely *_normPred_evolCombi_fullPfam.txt. To match the original length of the query sequence in the ProteinGym benchmark, we added extra columns filled with NAs. Additionally, a MultipleMutants subdirectory and a readme file describing the protocol are provided.
- Mutations_of_interest: input files for GEMME specifying the multiple mutants
- DMS_experiments: 87 Deep Mutational Scanning assays retrieved from the ProteinGym substitution benchmark, https://github.com/OATML-Markslab/ProteinGym
- script: useful scripts for analysing the data (.py, .ipynb, .R), see more details below in the section Code/Software.
- CSV_ProteinGym.tgz contains CSV files summarizing the data listed above and includes:
- ID_UniProt_PDB.csv provides a UniProt_ID (second column) to PDB_ID (third column) mapping for the proteins (names in first column) from the ProteinGym substitution benchmark,
- Neff_ProteinGym_ColabFold.csv compares the MSA depth (as measured by the Neff metric, see Methods section in the manuscript) between CF and PG protocols, where UniProt_ID gives the protein UniProt identifier, MSA_filename the MSA filename in ProteinGym, shortName the short name of the protein, theta the threshold used for computing the MSA depth, CF_Neff the ColabFold MSA depth, CF_Nseqs the ColabFold MSA number of sequences, MSA_len the length of the query sequence in the MSA, PG_num_seqs the ProteinGym MSA number of sequences, PG_perc_cov the ProteinGym MSA percentage of positions covered, PG_Neff the ProteinGym MSA depth, PG_Neff_cat the ProteinGym MSA category as defined in ProteinGym benchmark,
- Pfam_potocol.csv provides the Pfam domain composition of each query protein, where UniProt_ID gives the protein UniProt identifier, shortName the short name of the protein, pfamID the Pfam identifier(s) of the Pfam domains contained in the protein, pfamStart the residue index(ices) where the Pfam domain(s) start(s), pfamEnd the residue index(ices) where the Pfam domain(s) end(s), avail whether the MSA was available from the Pfam website,
- Spearman_Pfam.csv, Spearman_ProteinGymMSA.csv and Spearman_ProteinNet.csv contain the performance of the three protocols (Pfam, PG and PN) alongside that of CF, measured by the Spearman rank correlation coefficient, where DMS_filename gives the filename of the experimental data, UniProt_ID the protein UniProt identifier, GEMME_Tranception_spearman the Spearman rank correlation for GEMME combined with the ProteinGym MSA, GEMME_CF_spearman the Spearman rank correlation for GEMME combined with the ColabFold MSA, Tranception_spearman the Spearman rank correlation for Tranception, Tranception_no_retrieval_score the Spearman rank correlation for Tranception without retrieval, MSA_Neff_l_Tranception the depth of the ProteinGym MSA, MSA_Neff_L_category the depth category of the ProteinGym MSA, Multiple_mutants whether there are multiple mutants, GEMME_CF_Uniref_Spearman the Spearman rank correlation for GEMME combined with the ColabFold MSA generated with Uniref only, GEMME_CF_speamanNoFilter the Spearman rank correlation for GEMME combined with the ColabFold MSA when the filter is removed, CF_noFilter whether the filter was removed,
- Summary4Protocols.csv indicates whether the predictions for each of the 72 proteins have been successfully calculated by the four protocols and whether the experimental measurements contain multiple mutants, where ProtName gives the protein name, UniProt_ID the protein UniProt identifier, containsMultiple whether it has multiple mutants, isTreatedbyProteinGym whether it has a ProteinGym MSA, isTreatedbyColabFold whether it has a ColabFold MSA, isTreatedbyProteinNet whether it has a ProteinNet MSA, PDB_id the PDB identifier (if applicable), isTreatedbyPfam whether it has a Pfam MSA, nbPfam the number of Pfam domains,
- Tranception_data.csv is a table provided by https://github.com/OATML-Markslab/ProteinGym giving detailed information about the benchmark. A detailed description of it can be found at that URL.
- HumanProteome_GEMME.tgz contains predictions for the entire human proteome, obtained by running GEMME on MSAs generated by ColabFold. The archive contains 20 586 folders, each named after a UniProt_ID and containing the predictions for that protein. The detailed description of a folder can be found in CSV_Human/ReadmePredictions.
- CSV_HumanProteome.tgz is an archive containing annotations for the predictions over the human proteome. It contains:
- ReadmePredictions: file describing the content of each protein prediction folder,
- HumanProteome_analyse.csv: summary table of the human proteome predictions, whose detailed description is provided in ReadmeCSV,
- ReadmeCSV: description of the file HumanProteome_analyse.csv,
- UP000005640_9606.fasta: the human proteome in FASTA format (20 586 sequences).
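As an illustration of how the Spearman_*.csv tables described above might be used, the following Python 3 sketch averages the per-assay difference between two columns. The column names GEMME_CF_spearman and GEMME_Tranception_spearman come from the description above; the sample rows, the helper name mean_spearman_gain and the assumption that missing values are stored as non-numeric strings such as NA are ours and are not taken from the archive.

```python
import csv
import io
from statistics import mean

# Made-up rows for illustration only; real values live in the Spearman_*.csv files.
SAMPLE_CSV = """DMS_filename,UniProt_ID,GEMME_Tranception_spearman,GEMME_CF_spearman
assay_A.csv,PROT_A,0.40,0.45
assay_B.csv,PROT_B,0.50,0.48
assay_C.csv,PROT_C,0.30,NA
"""


def mean_spearman_gain(handle,
                       baseline="GEMME_Tranception_spearman",
                       candidate="GEMME_CF_spearman"):
    """Return (n_assays, mean of candidate - baseline).

    Rows where either value cannot be parsed as a float (e.g. 'NA' or an
    empty cell) are skipped. A wrong column name raises KeyError.
    """
    diffs = []
    for row in csv.DictReader(handle):
        try:
            diffs.append(float(row[candidate]) - float(row[baseline]))
        except (TypeError, ValueError):
            continue  # hypothetical NA encoding: skip unparseable values
    if not diffs:
        raise ValueError("no usable rows")
    return len(diffs), mean(diffs)


n_assays, gain = mean_spearman_gain(io.StringIO(SAMPLE_CSV))
print(f"{n_assays} assays, mean difference = {gain:+.3f}")
```

To run it on the real data, replace the in-memory sample with an opened file, for example `open("Spearman_ProteinGymMSA.csv", newline="")`, after checking the actual header and missing-value encoding of that file.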
Sharing/Access information
- All predictions were computed using the alignment-based variant effect predictor GEMME, available at: http://www.lcqb.upmc.fr/GEMME.
- The experimental deep mutational scan measurements and the ProteinGym MSAs come from the ProteinGym substitution benchmark available at: https://github.com/OATML-Markslab/ProteinGym.
- The ColabFold MSAs were generated locally following the instructions and downloading data from https://colabfold.mmseqs.com.
- The Pfam MSAs were retrieved from the Pfam website, now accessible at: http://pfam-legacy.xfam.org/.
- The ProteinNet MSAs were downloaded upon request to the authors, see https://github.com/aqlaboratory/proteinnet/blob/master/docs/raw_data.md for more information.
- The MSAs associated with the predictions over the whole human proteome are available upon request. Please email us to request access.
Code/Software
We provide a set of scripts used to analyse the data and produce the figures. They are located in the script subfolder of ProteinGym_assessment.tgz:
- analInfluenceMSA.R is written in R and contains a set of functions for computing summary statistics and plotting the results,
- Calculate_Spearman_correlation_DMS.py is written in Python and contains a set of functions for computing the Spearman rank correlations,
- make_plots.ipynb is a Jupyter Notebook written in Python and contains a set of functions for plotting the results,
- code4Neff.tgz is an archive containing R and bash scripts for computing the MSA depths:
- runAll.sh is a bash script launching the script computeMat.R on a set of MSAs,
- computeMat.R is an R script computing all pairwise Hamming distances between the sequences contained in the input MSA,
- runAllNeff.sh is a bash script launching the script computeNeff.R on a set of MSAs,
- computeNeff.R is an R script computing the depth of an MSA, given its distance matrix and a threshold as input.
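The two-step depth computation described above (pairwise distances, then a thresholded count) can be sketched in Python 3 as follows. This is not a port of computeMat.R or computeNeff.R: it assumes the commonly used definition in which each sequence is weighted by the inverse of the number of sequences lying within a normalised Hamming distance theta of it, and the exact definition used in the study is the one given in the Methods section of the manuscript. The function names and the toy alignment are hypothetical.

```python
def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of normalised Hamming distances (zero diagonal)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = hamming_fraction(seqs[i], seqs[j])
    return dist


def neff(dist, theta):
    """Assumed definition: sum over sequences of 1 / (number of neighbours closer than theta).

    Each sequence counts as its own neighbour (distance 0), so theta must be > 0.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    total = 0.0
    for row in dist:
        neighbours = sum(1 for d in row if d < theta)
        total += 1.0 / neighbours
    return total


# Toy alignment (made up): two near-identical sequences and one outlier.
toy_msa = ["ACDE", "ACDF", "WYWY"]
toy_dist = distance_matrix(toy_msa)
print(neff(toy_dist, theta=0.3))  # the two close sequences share one unit of weight
print(neff(toy_dist, theta=0.2))  # stricter threshold: every sequence counts fully
```

Splitting the distance matrix from the depth calculation mirrors the layout of the provided scripts and lets several thresholds be tried without recomputing the pairwise distances, which is the quadratic step.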
The R scripts were run using R version 4.2.2. The Python scripts were run using Python 2.7.16.
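For readers who want to check a correlation by hand, here is a minimal, dependency-free Python 3 sketch of a Spearman rank correlation that ignores positions marked NA, as happens when predictions are masked to Pfam or ProteinNet coverage. It is not the code of Calculate_Spearman_correlation_DMS.py; the function names, the use of None to represent NA and the toy scores are assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman_skip_na(predicted, measured):
    """Spearman correlation over pairs where neither value is None (our stand-in for NA)."""
    pairs = [(p, m) for p, m in zip(predicted, measured)
             if p is not None and m is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    pred, meas = zip(*pairs)
    return pearson(average_ranks(pred), average_ranks(meas))


# Made-up scores: the second variant is masked (NA) in the prediction.
toy_predicted = [-3.0, None, -1.0, -0.5, -2.0]
toy_measured = [0.1, 0.9, 0.7, 0.8, 0.3]
rho = spearman_skip_na(toy_predicted, toy_measured)
print(f"Spearman rho over complete pairs: {rho:.3f}")
```

Computing the correlation only over complete pairs is what makes the masked prediction files comparable across protocols, since every protocol is then scored on the same set of positions.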