The availability of highly parallelized immunoassays has renewed interest in the discovery of serology-based biomarkers for infectious diseases. Protein and peptide microarrays now provide a high-throughput platform for immunological screening of potential antigens and B-cell epitopes. However, there is still a need to prioritize relevant probes when designing these arrays. In this work we describe a computational method called APRANK (Antigenic Protein and Peptide Ranker), which integrates multiple molecular features to prioritize antigenic targets starting from a given pathogen proteome. These features include subcellular localization, presence of repetitive motifs, natively disordered regions, secondary structure, transmembrane spans and predicted interaction with the immune system. We applied this method to the prioritization of potential diagnostic antigens and peptides in a number of pathogen proteomes and human diseases: Borrelia burgdorferi (Lyme disease), Brucella melitensis (Brucellosis), Coxiella burnetii (Q fever), Escherichia coli (Gastroenteritis), Francisella tularensis (Tularemia), Leishmania braziliensis (Leishmaniasis), Leptospira interrogans (Leptospirosis), Mycobacterium leprae (Leprosy), Mycobacterium tuberculosis (Tuberculosis), Plasmodium falciparum (Malaria), Porphyromonas gingivalis (Periodontal disease), Staphylococcus aureus (Bacteremia), Streptococcus pyogenes (Group A Streptococcal infections), Toxoplasma gondii (Toxoplasmosis) and Trypanosoma cruzi (Chagas disease). After training a linear regression model, the method achieves good to excellent performance on most species, measured by the enrichment of validated antigens at the top of the ranking. An unbiased validation using independent data sets shows that APRANK is successful in predicting antigenicity for all pathogen species tested. We make APRANK available to facilitate the identification of novel diagnostic antigens in infectious diseases.

A curated dataset of known or validated antigens was obtained for each of the 15 human pathogens listed (bacteria and eukaryotes). All other proteins encoded in these annotated genomes were considered non-antigenic or lacking any prior evidence of antigenicity. Using these data, a number of protein features were calculated or predicted with a bioinformatics pipeline (described in the manuscript). To create and train a generalized linear model, we first created 15 individual training sets (one per species), each containing 3000 proteins (with balanced positive and negative training examples). A merged training set containing data from all species was then used to train the protein model. A similar approach was followed to create and train a model for peptides (epitopes); in this case, we created 15 individual training sets, each containing a balanced set of 100,000 peptides.
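
The construction of balanced per-species training sets and their merge can be illustrated with a minimal Python sketch. This is not the APRANK implementation: the function names are hypothetical, and sampling with replacement is an assumption made here so that a small set of validated antigens can still fill half of each training set. The actual balancing procedure is the one described in the manuscript.

```python
import random


def balanced_sample(positives, negatives, total, seed=0):
    """Draw `total` labeled examples, half positive and half negative.

    Assumption: sampling is done with replacement, so a short list of
    validated antigens can still supply its half of the training set.
    """
    rng = random.Random(seed)
    half = total // 2
    pos = [(item, 1) for item in rng.choices(positives, k=half)]
    neg = [(item, 0) for item in rng.choices(negatives, k=total - half)]
    sample = pos + neg
    rng.shuffle(sample)
    return sample


def merged_training_set(per_species, total_per_species, seed=0):
    """Build one balanced set per species and concatenate them.

    `per_species` maps a species name to a (positives, negatives) pair.
    Returns a list of (species, item, label) tuples.
    """
    merged = []
    for i, (species, (positives, negatives)) in enumerate(sorted(per_species.items())):
        sample = balanced_sample(positives, negatives, total_per_species, seed + i)
        merged.extend((species, item, label) for item, label in sample)
    return merged
```

With 15 species and `total_per_species=3000`, this layout would yield a merged protein training set of 45,000 labeled examples, half of them positive.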

These files contain:

R data structures that can be loaded into R. They contain generalized linear models derived from curated (validated) antigens from 15 different human pathogens. Most users of these files will probably want to use our APRANK software (Antigenic Protein and Peptide Ranker, https://github.com/trypanosomatics/aprank), a pipeline that ranks the proteins and peptides of a complete proteome by predicted antigenicity.
APRANK antigenicity scores for all 15 organisms analyzed in this work. Scores are provided for all proteins in the proteomes of these 15 human pathogens, and for the top-scoring peptides (score > 0.7, and not less than 1% of the total peptides).
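
The peptide cutoff can be read in more than one way. The Python sketch below implements one plausible reading, offered as an assumption rather than as the rule actually used to build these files: keep every peptide scoring above 0.7, and if those amount to less than 1% of all peptides, extend the selection down the ranking until 1% is reached. The function name and parameters are hypothetical.

```python
import math


def select_top_peptides(scores, threshold=0.7, min_fraction=0.01):
    """Return (peptide, score) pairs ranked from highest to lowest score.

    Assumed rule: keep all peptides with score > threshold, but never
    fewer than `min_fraction` of the total number of peptides.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    above = sum(1 for _, score in ranked if score > threshold)
    floor = math.ceil(min_fraction * len(ranked))
    keep = max(above, floor)
    return ranked[:keep]
```

Under this reading, a proteome with few high-scoring peptides still contributes its best 1%, while a proteome with many peptides above 0.7 contributes all of them.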

Regression models generated by APRANK (computational prioritization of antigenic proteins and peptides from complete pathogen proteomes)
