Dryad

Protein structure alignments and structural similarity score code for: Benchmarking methods of protein structure alignment

Cite this dataset

Sykes, Janan (2021). Protein structure alignments and structural similarity score code for: Benchmarking methods of protein structure alignment [Dataset]. Dryad. https://doi.org/10.5061/dryad.c59zw3r4v

Abstract

The function of a protein is primarily determined by its structure and amino acid sequence. Many biological questions of interest rely on accurately determining the group of structures to which the domains of a protein belong (often referred to as their "fold"); this can be done through alignment and comparison of protein structures. This fundamental task underpins predicting function, identifying homology, and testing hypotheses about the way "protein-space" is organised. Dozens of different methods for Protein Structure Alignment (PSA) have been proposed that use a wide range of techniques. The aim of this study is to determine the ability of PSA methods to identify pairs of protein domains known to share differing levels of structural similarity, and to assess their utility for clustering domains from several different folds into known groups.

We present the results of a comprehensive investigation into eighteen PSA methods; to our knowledge this is the largest piece of independent research on this topic. Overall, SP-AlignNS (nonsequential) was found to be the best method for classification, and also one of the best performing methods for clustering.

Where possible, each method was split into the algorithm used to find the optimal alignment and the score used to assess similarity. This allowed us to largely separate the algorithm from the score it maximises, and thus to assess the effectiveness of each independently of the other. Surprisingly, we found that some hybrids of mismatched scores and algorithms performed better than either of the native methods at classification and, in some cases, at clustering as well. It is hoped that this investigation and the accompanying discussion will be useful for researchers selecting or designing methods to align protein structures.

Methods

Alignments: The classification folder of this dataset contains alignments of 500 randomly selected 'pivot' domains to protein domains from the same family, same superfamily, same fold, same class and different classes. These are generally named using the numbers 0-499 and the letters f, s, fo, c and d (for decoy). The alignments were performed with eighteen different protein structure alignment methods, one folder per method. The exception is Fr-TM-Align, which was run several times on the same protein pairs, each time with a different length normalisation method.
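As an illustration of the naming scheme described above, the following minimal sketch splits a label into its pivot number and relationship level. It assumes a label is the pivot number immediately followed by the letter code (for example "17fo"); the actual file names in the dataset may carry extensions or other parts, and `parse_label` and the long level names are hypothetical, not part of the dataset's code.

```python
import re

# Letter codes as listed in the text; the long names are our own wording.
LEVELS = {
    "f": "same family",
    "s": "same superfamily",
    "fo": "same fold",
    "c": "same class",
    "d": "decoy",
}

# Assumed label shape: pivot number directly followed by the code, e.g. "17fo".
# "fo" is listed before "f" so the two-letter code is tried first.
_LABEL = re.compile(r"^(\d+)(fo|f|s|c|d)$")


def parse_label(label):
    """Return (pivot_number, level_name) for a label such as '17fo'."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    pivot = int(match.group(1))
    if not 0 <= pivot <= 499:
        raise ValueError(f"pivot number out of range 0-499: {pivot}")
    return pivot, LEVELS[match.group(2)]


if __name__ == "__main__":
    for example in ("0f", "17fo", "499d"):
        print(example, parse_label(example))
```

If the real file names differ from this assumed shape, only the regular expression should need adjusting.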

Contained in the clustering folder of this dataset are all vs all comparisons of fifty proteins performed by the same eighteen alignment methods. These proteins were selected to be from ten different folds, with five representatives from each.
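To give a sense of the size of this clustering set, the sketch below counts the unordered pairs among fifty domains drawn from ten folds with five representatives each. The domain identifiers are placeholders (fold index, representative index), not the real domain names, and the sketch assumes each pair is counted once and self-comparisons are excluded, which may not match how the all vs all files are laid out.

```python
from itertools import combinations

N_FOLDS = 10   # number of folds, from the text
PER_FOLD = 5   # representatives per fold, from the text

# Placeholder identifiers: (fold index, representative index).
domains = [(fold, rep) for fold in range(N_FOLDS) for rep in range(PER_FOLD)]


def pair_counts(items):
    """Count unordered pairs that share a fold and pairs that do not."""
    same_fold = different_fold = 0
    for a, b in combinations(items, 2):
        if a[0] == b[0]:
            same_fold += 1
        else:
            different_fold += 1
    return same_fold, different_fold


if __name__ == "__main__":
    same, different = pair_counts(domains)
    print("domains:", len(domains))
    print("same-fold pairs:", same)
    print("different-fold pairs:", different)
```

Under these assumptions there are 1225 unordered pairs, of which 100 are within a fold; a clustering method is expected to group those 100 together.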

Code: The code folder contains Python modules that calculate the structural similarity scores associated with the different original methods. Each module requires a .pdb file containing the two aligned proteins and a .fasta file showing which residues are aligned to which.
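The sketch below shows one way the residue correspondence could be read from such a .fasta file. It assumes the file holds two gapped sequences of equal length with "-" as the gap character, which is a common convention but is not confirmed by the dataset description; `read_fasta` and `aligned_pairs` are hypothetical helpers and are not the modules shipped in the code folder.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data before first header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_pairs(seq_a, seq_b, gap="-"):
    """Return (i, j) index pairs for columns where both sequences have a residue.

    Indices are 0-based positions within each ungapped sequence.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("gapped sequences must have equal length")
    pairs = []
    i = j = 0
    for a, b in zip(seq_a, seq_b):
        if a != gap and b != gap:
            pairs.append((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


if __name__ == "__main__":
    # Made-up example alignment, not taken from the dataset.
    example = ">domain_a\nAC-DE\n>domain_b\nA-GDE\n"
    (_, a), (_, b) = read_fasta(example)
    print(aligned_pairs(a, b))
```

The resulting index pairs could then be matched to residues in the .pdb file to compute a distance-based similarity score over the aligned positions.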

Usage notes

The identities of all proteins in the alignments uploaded can be found in the supplementary material of "Benchmarking Methods of Protein Structure Alignment".

Funding

SET Research Training Program (RTP) Stipend
