T-SCAPE: T-cell immunogenicity scoring via cross-domain aided predictive engine
Abstract
T-cell immunogenicity is a critical determinant of safety and efficacy for protein therapeutics and vaccines, but prediction is hampered by data scarcity. We present T-SCAPE, a multi-domain deep learning framework that uses adversarial domain adaptation to integrate diverse immunologically relevant data sources, including MHC presentation, peptide-MHC binding affinity, TCR-pMHC interaction, source organism information, and T-cell activation. Validated through rigorous leakage-controlled benchmarks, T-SCAPE shows exceptional performance in predicting T-cell activation for specific peptide-MHC pairs. Remarkably, it also accurately predicts the anti-drug antibody-inducing potential of therapeutic antibodies without MHC inputs, a success attributed to its biologically grounded pretraining. Confirmed by extensive case studies and ablation studies, T-SCAPE’s flexible architecture also supports broader tasks like molecular binding prediction. Its robust performance highlights its potential to advance the development of safer and more effective biologics.
Dataset DOI: 10.5061/dryad.s7h44j1k7
1. Description of the Dataset
This dataset is the official training, validation, and benchmarking repository for T-SCAPE, a deep learning model that predicts T-cell immunogenicity. File names and column headers refer to the model as TITANiAN. The dataset compiles T-cell receptor (TCR) sequences, epitope sequences, and MHC alleles from multiple public immunology databases.
This repository contains all data files used for training and benchmarking. The source code is hosted separately due to licensing requirements (see Section 4).
2. File Structure and Contents
The dataset is organized into three zipped archives.
A. train.zip
Contains the primary datasets used for model training.
- TITANiAN_pretrain_train.csv: Dataset used for self-supervised pre-training.
- TITANiAN_finetune_train.csv: Dataset used for supervised fine-tuning.
B. valid.zip
Contains datasets used for model validation.
- TITANiAN_pretrain_valid.csv: Validation set for the pre-training phase.
- TITANiAN_finetune_valid.csv: Validation set for the fine-tuning phase.
C. test.zip (Benchmark Datasets)
Contains independent test sets used to benchmark TITANiAN.
- TITANiAN_IM_*.csv: Immunogenicity benchmarks (ADA, Neoantigen, Infectious diseases).
- TITANiAN_MHC_*.csv: MHC binding affinity benchmarks (Class I and II).
- TITANiAN_TCR_*.csv: TCR specificity benchmarks (Activation, Zero-shot).
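The archives can be read without unpacking them to disk. The following is a minimal sketch using only the Python standard library. It assumes the CSV files are UTF-8 encoded and makes no assumption about where they sit inside each archive (this README does not specify either), so it lists the members first. The helper names `list_csv_members` and `read_csv_from_zip` are illustrative and not part of the T-SCAPE code base.

```python
import csv
import io
import zipfile


def list_csv_members(archive_path):
    """Return the names of all .csv members in a zip archive, sorted."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


def read_csv_from_zip(archive_path, member_name):
    """Return the rows of one CSV member as a list of dicts (all values are strings)."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as raw:
            # Assumption: UTF-8 text; "utf-8-sig" also tolerates a leading BOM.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))
```

For example, `list_csv_members("train.zip")` should return the member paths of the two training files, and passing one of those paths to `read_csv_from_zip` yields one dict per row keyed by the column headers described in Section 3.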
3. Variable (Column) Information
The following tables describe all variables (column headers) found across the CSV files in this dataset.
3.1. Common Identifiers and Biological Sequences
| Column Header | Description | Data Type |
|---|---|---|
| peptide / Epitope | Amino acid sequence of the target antigen/epitope. | String (IUPAC AA) |
| Peptide length | Length of the peptide sequence. | Integer |
| CDR3b | Amino acid sequence of the TCR beta chain CDR3 region. | String (IUPAC AA) |
| mhc / Allele | Major Histocompatibility Complex allele name (e.g., A0201). | String |
| pseudo | Pseudo-sequence representation of the MHC allele amino acid sequence. | String (IUPAC AA) |
| Antibody | Name/ID of the antibody (specific to TITANiAN_IM_ADA.csv). | String |
| VH | Variable heavy chain amino acid sequence (antibody). | String (IUPAC AA) |
| VL | Variable light chain amino acid sequence (antibody). | String (IUPAC AA) |
| Reference / IEDB reference | Literature or database reference ID (e.g., DOI or IEDB ID). | String |
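A quick sanity check on the sequence columns can catch parsing problems early. The sketch below assumes that sequences use only the 20 standard one-letter IUPAC amino acid codes. This README does not say whether ambiguity codes such as X occur, so rows flagged here deserve inspection, not automatic removal. The function names are illustrative, and the default column names are taken from the table above.

```python
# The 20 standard amino acids in IUPAC one-letter code.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_standard_sequence(sequence):
    """True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AA


def check_row(row, peptide_key="peptide", length_key="Peptide length"):
    """Return a list of problems found in one row (a dict of column -> string)."""
    problems = []
    peptide = (row.get(peptide_key) or "").strip()
    if not is_standard_sequence(peptide):
        problems.append(f"non-standard residues in {peptide_key!r}: {peptide!r}")
    length = (row.get(length_key) or "").strip()
    # The length column is optional; compare only when it is present.
    if length and int(float(length)) != len(peptide):
        problems.append(
            f"{length_key!r} is {length}, but the peptide has {len(peptide)} residues"
        )
    return problems
```

Files that use the `Epitope` header instead of `peptide` can be checked by passing `peptide_key="Epitope"`.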
3.2. Target Labels and Experimental Measurements
| Column Header | Description | Interpretation Key / Unit |
|---|---|---|
| label | Binary classification label for immunogenicity or binding. | 0: Negative; 1: Positive |
| task | Identifier for the specific sub-task or data partition. | Integer ID |
| Measurement type | Type of experimental assay used. | binary, ic50 |
| Immunogenicity | Measured Anti-Drug Antibody (ADA) levels. | 0-100: Normalized ADA value |
3.3. Model Predictions (TITANiAN & Benchmarks)
| Column Header / Pattern | Description | Interpretation Key |
|---|---|---|
| TITANiAN-* | Prediction score generated by the TITANiAN model. | 0.0 - 1.0: Probability (Higher = More Immunogenic) |
| TITANiAN-*-baseline | Prediction score from the baseline version of our model. | 0.0 - 1.0 |
| TITANiAN-*-ablation_... | Prediction scores from ablation studies. | 0.0 - 1.0 |
| NetMHCpan * / NetMHCIIpan * | Predictions from NetMHC family tools (Class I & II). | Score / Rank |
| SMM / SMM-align / SMMPMBEC | Predictions from SMM-based tools. | Score / IC50 / Rank |
| NN-align / ANN * | Predictions from Neural Network alignment tools. | Score / Affinity |
| MHCflurry * | Predictions from MHCflurry. | Affinity / Score |
| MHCnuggets * | Predictions from MHCnuggets. | Score / Affinity |
| PickPocket, ARB, Tepitope | Predictions from older MHC binding tools. | Score |
| MixMHCpred-*, PRIME-* | Predictions from MixMHCpred and PRIME tools. | Score / Rank |
| BigMHC_*, TransPHLA | Predictions from BigMHC and TransPHLA. | Score |
| HLAthena_* | Predictions from HLAthena. | Score / Rank |
| DLpTCR, ERGO2, PanPep, pMTnet | Predictions from TCR-peptide binding tools. | Score |
| AbNatiV, T20, Z-score | Antibody-specific immunogenicity scores. | Score / Metric |
| IgReconstruct, AbLSTM | Antibody-specific prediction models. | Score |
| Hu-mAb, MG Score | Humanization and humanness scores for antibodies. | Score |
| Germline content | Percentage of germline sequence identity. | Percentage / Score |
| OASis identity (*) | Sequence identity metrics from OASis (loose, strict, etc.). | Score / Percentage |
| Non-human MHCII binders (*) | Predicted non-human MHCII binding sites (loose, strict, etc.). | Count / Score |
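Because the benchmark files store each tool's predictions next to the `label` column, tools can be compared directly from the CSV rows. The sketch below computes the area under the ROC curve (AUROC) with the rank-sum identity, using only the standard library. AUROC is one common choice; this README does not state which metrics the authors report. The code assumes higher scores mean "more positive", which holds for the TITANiAN probability columns but not for IC50 or rank columns, whose values should be negated first. The function names and any column name passed as `score_key` are illustrative.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values; scores: sequence of floats (higher = more positive).
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the end of the current run of tied scores.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def column_auroc(rows, score_key, label_key="label"):
    """AUROC of one prediction column, skipping rows where either cell is empty."""
    labels, scores = [], []
    for row in rows:
        label, score = row.get(label_key), row.get(score_key)
        if label in (None, "") or score in (None, ""):
            continue
        labels.append(int(float(label)))
        scores.append(float(score))
    return auroc(labels, scores)
```

Here `rows` is a list of dicts as produced by `csv.DictReader`, and `score_key` is the exact header of the prediction column to evaluate.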
3.4. Metadata
| Column Header | Description |
|---|---|
| Date | Date of data entry or retrieval. |
| rand | Random seed or randomization index used for splitting. |
| Species | Source species (e.g., Human, Mouse). |
4. Access Information and Software
- GitHub Repository: https://github.com/seoklab/T-SCAPE
- Webserver: https://galaxy.seoklab.org/design/t-scape/
Note on Code Availability: The source code and model weights for T-SCAPE are available at the GitHub repository linked above. Due to a pending patent, the software is not included in this Dryad data package.
5. Data Sources
Data was derived from the following public databases:
IEDB, VDJdb, McPAS-TCR, ImmuneCODE, TBAdb, 10x Genomics, OAS, UniProt, PRIME, MHCBN, BigMHC, BioPhi, PanPep
