Sequence-structure-function relationships in class I MHC: a local frustration perspective
Data files
Apr 28, 2020 version files 991.07 MB
Abstract
Class I Major Histocompatibility Complex (MHC) binds short antigenic peptides with the help of the Peptide Loading Complex (PLC), and presents them to T-cell Receptors (TCRs) of cytotoxic T-cells and Killer-cell Immunoglobulin-like Receptors (KIRs) of Natural Killer (NK) cells. With more than 10000 alleles, the Human Leukocyte Antigen (HLA) chain of MHC is the most polymorphic protein in humans. This allelic diversity provides a wide coverage of peptide sequence space, yet does not affect the three-dimensional structure of the complex. Moreover, TCRs mostly interact with the peptide-MHC complex (pMHC) in a common diagonal binding mode, and the KIR-pMHC interaction is allele-dependent. With the aim of establishing a framework for understanding the relationships between polymorphism (sequence), structure (conserved fold) and function (protein interactions) of the MHC, we performed a local frustration analysis on pMHC homology models covering 1436 HLA I alleles. An analysis of local frustration profiles indicated that (1) variations in the MHC fold are unlikely because of minimally frustrated and relatively conserved residues within the HLA peptide-binding groove, (2) high-frustration patches on HLA helices are either involved in or near the interaction sites of MHC with the TCR, KIR, or Tapasin of the PLC, and (3) peptide ligands mainly stabilize the F-pocket of the HLA binding groove.
Methods
Data collection for records_matureHLA.fasta: The sequences contained in this fasta file were obtained from the IMGT/HLA dataset. Only HLA binding groove residues are included (residues 1-180).
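As a quick sanity check on the stated residue range, the sketch below reads a FASTA file and lists entries whose length differs from 180. It is a minimal illustration using only the Python standard library; the function names and the toy headers and sequences are hypothetical, and the actual header format of records_matureHLA.fasta is not assumed.

```python
import io

GROOVE_LENGTH = 180  # residues 1-180, as stated for records_matureHLA.fasta


def read_fasta(handle):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequences may be wrapped over several lines; collect the pieces.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def wrong_length(records, expected=GROOVE_LENGTH):
    """Return headers whose sequence length differs from the expected length."""
    return [name for name, seq in records.items() if len(seq) != expected]


# Toy input standing in for the real file (headers and sequences are made up).
demo = io.StringIO(
    ">allele_1\n" + "A" * 100 + "\n" + "C" * 80 + "\n"
    ">allele_2\n" + "G" * 179 + "\n"
)
demo_records = read_fasta(demo)
demo_flagged = wrong_length(demo_records)
```

With the real file, the same functions could be applied to an open file handle, for example `read_fasta(open("records_matureHLA.fasta"))`. Flagged entries would only indicate sequences worth inspecting, not necessarily errors.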
Data collection for data_frame_SRFI.csv: This table mainly contains single-residue frustration index (SRFI) data from pMHC structures covering 1436 HLA Class I alleles, each in complex with 3-10 nonamer peptides. For each allele, 3-10 high-affinity peptide ligands were predicted using netMHCpan 3.0, and homology models were then created using Modeller v9.19. Local frustration analysis was carried out using the stand-alone version of frustratometer2 (https://github.com/gonzaparra/frustratometer2). The column names are as obtained from frustratometer2. The FrstIndex column contains the single-residue frustration index values. The SASA, RSASA, Peptide and Allele columns contain the position-specific Solvent Accessible Surface Area (SASA), relative SASA, peptide sequence and allele name, respectively.
Data collection for df_SF_R_20200428.csv: This table includes a "reduced" version of data_frame_SRFI.csv, with the following columns:
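One way to work with the FrstIndex column is to bin each residue into a frustration class and count the classes per allele. The sketch below does this with the standard library only. The cutoffs (-1.0 and 0.58) are assumptions chosen for illustration and should be checked against the frustratometer2 documentation before use; the function names and the toy rows are hypothetical, and only the Allele, Peptide and FrstIndex column names are taken from the description above.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: illustrative cutoffs, not values stated in this dataset description.
HIGH_CUTOFF = -1.0     # at or below this value: "highly" frustrated
MINIMAL_CUTOFF = 0.58  # at or above this value: "minimally" frustrated


def classify(frst_index, high=HIGH_CUTOFF, minimal=MINIMAL_CUTOFF):
    """Bin a single-residue frustration index into one of three classes."""
    if frst_index <= high:
        return "highly"
    if frst_index >= minimal:
        return "minimally"
    return "neutral"


def class_counts(rows):
    """Count frustration classes per allele from rows with Allele and FrstIndex keys."""
    counts = {}
    for row in rows:
        label = classify(float(row["FrstIndex"]))
        counts.setdefault(row["Allele"], Counter())[label] += 1
    return counts


# Toy table standing in for data_frame_SRFI.csv (all values are made up).
demo_csv = io.StringIO(
    "Allele,Peptide,FrstIndex\n"
    "allele_1,AAAAAAAAA,-1.5\n"
    "allele_1,AAAAAAAAA,0.9\n"
    "allele_1,AAAAAAAAA,0.1\n"
    "allele_2,CCCCCCCCC,0.7\n"
)
demo_counts = class_counts(csv.DictReader(demo_csv))
```

Keeping the cutoffs as keyword arguments of `classify` makes it straightforward to swap in whichever thresholds the analysis actually requires.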
Allele: specific HLA I allele name,
Sequence: binding-groove sequence
Chain: the chain on which the respective position is located
Res: position of residue within the sequence
AA: one-letter amino-acid code
ChainRes: a concatenated string including Chain and Res fields.
SASA, RSASA: as described above
FI_mean, FI_mean_sd, FI_median: Mean and median SRFI values calculated for each position using data_frame_SRFI.csv. FI_mean_sd denotes the standard deviation of the mean SRFI values.
rvET: Position-specific real-value Evolutionary Trace scores.
FI_median_diff: The difference in median SRFI values upon peptide-binding.
Locus: Gene Locus (A, B or C)
Core Allele: True, if the allele is among the core alleles reported by Robinson et al. (2017) (https://doi.org/10.1371/journal.pgen.1006862). False, otherwise.
Pocket: The peptide-binding pocket in which the respective residue is located. None if the residue is not included in any pocket.
Interface: True, if the residue is a protein-protein interface residue within the MHC. False, otherwise.
SS: Secondary-structure assignment
Domain: The structural domain in which the respective residue is located.
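The per-position summary columns listed above (FI_mean, FI_mean_sd, FI_median) can be approximated from the per-peptide table roughly as follows. This is a sketch under stated assumptions, not the procedure actually used to build df_SF_R_20200428.csv: it assumes that positions in data_frame_SRFI.csv can be keyed by a ChainRes-like field, and it reads FI_mean_sd as the sample standard deviation of FrstIndex across the peptides of one allele. The function name and toy rows are hypothetical.

```python
import statistics
from collections import defaultdict


def summarize_positions(rows, position_key="ChainRes"):
    """Aggregate FrstIndex over peptides for every (Allele, position) pair.

    ASSUMPTION: `position_key` names a column that identifies the residue
    position; the exact frustratometer2 column name may differ.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["Allele"], row[position_key])].append(float(row["FrstIndex"]))

    summary = {}
    for key, values in grouped.items():
        summary[key] = {
            "FI_mean": statistics.mean(values),
            # The sample standard deviation needs at least two values.
            "FI_mean_sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "FI_median": statistics.median(values),
        }
    return summary


# Toy rows: one position of one allele modelled with three peptides (made-up values).
demo_rows = [
    {"Allele": "allele_1", "ChainRes": "A_7", "Peptide": "AAAAAAAAA", "FrstIndex": "1.0"},
    {"Allele": "allele_1", "ChainRes": "A_7", "Peptide": "CCCCCCCCC", "FrstIndex": "2.0"},
    {"Allele": "allele_1", "ChainRes": "A_7", "Peptide": "DDDDDDDDD", "FrstIndex": "3.0"},
    {"Allele": "allele_2", "ChainRes": "A_7", "Peptide": "EEEEEEEEE", "FrstIndex": "0.5"},
]
demo_summary = summarize_positions(demo_rows)
```

Grouping on the (Allele, position) pair keeps the per-allele resolution of the reduced table; alleles modelled with a single peptide get an undefined (NaN) spread rather than a misleading zero.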
Usage notes
For most alleles, 10 peptide ligands are included. For some alleles, however, netMHCpan 3.0 classified fewer than 10 of the 25000 random nonamer sequences as strong binders. For these alleles, only the peptides classified as strong binders were used in homology models, which resulted in fewer than 10 peptide ligands.
It should be sufficient to use only the data in df_SF_R_20200428.csv to reproduce the SRFI-based figures. Note that some of the data, such as solvent-accessible surface areas and secondary-structure assignments, were not used in the final publication.
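To see which alleles are affected, the number of distinct peptides per allele can be counted from the Allele and Peptide columns. The sketch below is a minimal standard-library illustration on made-up rows; the function names are hypothetical, and with the real table the rows could come from `csv.DictReader` over data_frame_SRFI.csv.

```python
from collections import defaultdict


def peptides_per_allele(rows):
    """Return the number of distinct peptides observed for each allele."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["Allele"]].add(row["Peptide"])
    return {allele: len(peptides) for allele, peptides in seen.items()}


def alleles_below(counts, expected=10):
    """Return, in sorted order, the alleles with fewer than `expected` peptides."""
    return sorted(allele for allele, n in counts.items() if n < expected)


# Toy rows: allele_1 has 10 distinct peptides, allele_2 only 3 (all made up).
# Each peptide appears twice to mimic one row per residue position.
letters = "ACDEFGHIKL"
demo_rows = []
for letter in letters:
    demo_rows += [{"Allele": "allele_1", "Peptide": "A" * 8 + letter}] * 2
for letter in letters[:3]:
    demo_rows += [{"Allele": "allele_2", "Peptide": "C" * 8 + letter}] * 2

demo_counts = peptides_per_allele(demo_rows)
demo_short = alleles_below(demo_counts)
```

Using a set per allele means that repeated rows for the same peptide (one per residue position) are counted only once.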