The function of a protein is primarily determined by its structure and amino acid sequence. Many biological questions of interest rely on accurately determining the group of structures to which the domains of a protein belong (often referred to as their 'fold'); this can be done through alignment and comparison of protein structures. This fundamental task underpins function prediction, identification of homology, and the testing of hypotheses about the way "protein-space" is organised. Dozens of methods for Protein Structure Alignment (PSA) have been proposed, using a wide range of techniques. The aim of this study is to determine the ability of PSA methods to identify pairs of protein domains known to share differing levels of structural similarity, and to assess their utility for clustering domains from several different folds into known groups.

We present the results of a comprehensive investigation into eighteen PSA methods; to our knowledge this is the largest piece of independent research on this topic. Overall, SP-AlignNS (nonsequential) was found to be the best method for classification, and also one of the best performing methods for clustering.

Where possible, methods were split into the algorithm used to find the optimal alignment and the score used to assess similarity. This allowed us to largely separate each algorithm from the score it maximises and thus to assess the effectiveness of each independently. Surprisingly, we found that some hybrids of mismatched scores and algorithms performed better than either of the native methods at classification and, in some cases, at clustering as well. We hope that this investigation and the accompanying discussion will be useful to researchers selecting or designing methods to align protein structures.

Alignments: The classification folder of this dataset contains alignments of 500 randomly selected 'pivot' domains to protein domains from the same family, same superfamily, same fold, same class and different classes. These are generally named using the numbers 0-499 and the letters f, s, fo, c and d (for decoy). The alignments were performed with eighteen different protein structure alignment methods, as can be seen from the folders. The exception is Fr-TM-Align, which was run on the same protein pairs several times, each time with a different length normalisation method.
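The naming convention above can be illustrated with a short sketch. The exact file-name layout is not specified here, so the pattern below (pivot number followed directly by the level letters, e.g. `17fo`) is a hypothetical assumption, as are the names `LEVELS`, `NAME_RE` and `parse_name`. The mapping of f, s and c to family, superfamily and class is inferred from the order of the description; only d (decoy) is stated explicitly.

```python
import re

# Assumed meaning of each letter code, inferred from the dataset description.
LEVELS = {
    "f": "same family",
    "s": "same superfamily",
    "fo": "same fold",
    "c": "same class",
    "d": "different class (decoy)",
}

# Hypothetical pattern: pivot number followed directly by the level letters.
# The real files may use separators or extensions; adjust as needed.
NAME_RE = re.compile(r"^(\d+)(fo|f|s|c|d)$")


def parse_name(stem):
    """Return (pivot_number, level_description), or None if the name does not fit."""
    match = NAME_RE.match(stem)
    if match is None:
        return None
    pivot = int(match.group(1))
    if not 0 <= pivot <= 499:  # pivots are numbered 0-499
        return None
    return pivot, LEVELS[match.group(2)]
```

Names that do not fit the assumed pattern return `None` rather than raising, since the description says only that files are "generally" named this way.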

The clustering folder of this dataset contains all-vs-all comparisons of fifty proteins performed by the same eighteen alignment methods. The proteins were selected from ten different folds, with five representatives of each.
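As a rough illustration of how an all-vs-all score matrix might be checked against known fold labels, the sketch below computes the fraction of proteins whose highest-scoring partner shares their fold. This is a simple stand-in measure and not necessarily the evaluation used in the study. The function name, the toy scores and the labels are all hypothetical, and the sketch assumes that higher scores mean greater similarity.

```python
def nearest_neighbour_agreement(scores, labels):
    """Fraction of items whose highest-scoring other item shares their label.

    scores: square list of lists, scores[i][j] = similarity of items i and j
            (assumed: higher means more similar).
    labels: list of group labels, one per item.
    """
    n = len(labels)
    hits = 0
    for i in range(n):
        # Best-scoring partner of item i, excluding the self-comparison.
        best_j = max((j for j in range(n) if j != i), key=lambda j: scores[i][j])
        if labels[best_j] == labels[i]:
            hits += 1
    return hits / n


# Toy data: four proteins from two hypothetical folds.
toy_scores = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
]
toy_labels = ["fold_1", "fold_1", "fold_2", "fold_2"]
agreement = nearest_neighbour_agreement(toy_scores, toy_labels)
```

For distance-like scores, where lower means more similar, `max` would need to be replaced with `min`.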

Code: The code folder contains Python modules that calculate the structural similarity scores associated with the different original methods. Each requires a .pdb file containing the two aligned proteins and a .fasta file showing which residues are aligned to which.
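To show the kind of information a .fasta alignment file carries, here is a minimal sketch that turns two gapped sequence rows into pairs of aligned residue indices. It assumes a conventional column-wise layout with `-` as the gap character; the files in this dataset may be laid out differently, particularly for nonsequential alignments. The helper names `read_fasta` and `aligned_pairs` are illustrative and are not taken from the modules in the code folder.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_pairs(row_a, row_b, gap="-"):
    """Return (i, j) pairs of residues placed in the same alignment column.

    i and j are 0-based positions in the ungapped sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = []
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.append((i, j))
        # Advance each residue counter only past real residues.
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


# Toy alignment with made-up sequences.
example = """>domain_a
MK-LVE
>domain_b
MKALV-
"""
(_, row_a), (_, row_b) = read_fasta(example)
pairs = aligned_pairs(row_a, row_b)
```

A similarity score would then be computed over the coordinates of these paired residues in the accompanying .pdb file.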

The identities of all proteins in the alignments uploaded can be found in the supplementary material of "Benchmarking Methods of Protein Structure Alignment".

Protein structure alignments and structural similarity score code for: Benchmarking methods of protein structure alignment
