Predicting T cell receptor (TCR) specificity based on sequence is challenging because TCRs of similar sequence can recognize entirely different antigens, whereas TCRs of different sequence can recognize the same antigens. Here, we present a system that integrates high-throughput yeast display with fine-tuned protein language models (pLMs) to generate deep Peptide Recognition Profiles (PRPs) for individual TCRs, each detailing binding against millions of peptides. We provide detailed PRPs for a panel of HLA-B*27:05-restricted TCRs from patients with ankylosing spondylitis and acute anterior uveitis that almost exclusively recognize peptides through CDR3β. pLMs trained on these PRPs outperform AlphaFold3 and tFold-TCR in predicting T cell activation. We discover and validate novel candidate autoantigens, demonstrate that model generalization to new TCRs correlates with functional distance (PRP divergence) rather than sequence similarity, and introduce a model-intrinsic uncertainty metric to quantify prediction confidence. This system and its associated PRP datasets offer a scalable approach to mapping TCR recognition, accelerating antigen discovery, and guiding TCR engineering.

This dataset contains the processed Next-Generation Sequencing (NGS) data for the study titled "Deep peptide recognition profiling decodes TCR specificity and enables disease-relevant antigen discovery."

The data include peptide counts from yeast display libraries before selection (Naive) and after each of four rounds of selection (R1-R4) against various T cell receptors (TCRs).

Dataset Contents

The dataset consists of 23 files in tabular format, categorized into Naive libraries and TCR-specific selection rounds.

Naive Libraries (2 files)
These files represent the baseline peptide distribution of the yeast display libraries before any selection.

Naive_P2&P8.csv: Library with fixed anchors at positions P2 and P8.
Naive_P2&P9.csv: Library with fixed anchors at positions P2 and P9.

TCR Selection Data (21 files)
These files contain peptide sequences and their corresponding read counts after multiple rounds of selection (typically R1–R4).

A. Patient-derived TCRs (16 files)
Selection data for TCRs identified from patients with ankylosing spondylitis (AS) or acute anterior uveitis (AAU).

Fixed 2&9 Anchor: TCR4.2_R1-R4.csv, TCR4.3_R1-R4.csv, TCR4.4_R1-R4.csv.

Fixed 2&8 Anchor: All other patient TCR files (e.g., TCR19.2_R1-R4.csv, TCR26.2_R1-R4.csv).

B. Engineered TCRs (5 files)
Selection data for the engineered variants (C1-C5) of TCR19.2, as described in Figure 5 of the manuscript. The parental TCR19.2 selection data are among the patient-derived TCR files.

Files: TCR19.2C1_R1-R4.csv, TCR19.2C2_R1-R4.csv, TCR19.2C3_R1-R4.csv, TCR19.2C4_R1-R4.csv, TCR19.2C5_R1-R4.csv.

Library Anchor: These selections used the 2&8 library.
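
The file-to-library assignment above can be expressed as a small lookup. The following is a minimal sketch in Python, assuming selection file names of the form TCRname_R1-R4.csv as listed in this README. The function and constant names are hypothetical and are not part of the dataset.

```python
# Hypothetical helper; names are illustrative and not part of the dataset.
# Per this README, only TCR4.2, TCR4.3 and TCR4.4 were selected on the
# 2&9 library. All other TCR files, including the engineered 19.2
# variants, used the 2&8 library.
LIBRARY_2_9_TCRS = {"TCR4.2", "TCR4.3", "TCR4.4"}


def library_anchor(file_name: str) -> str:
    """Return '2&9' or '2&8' for a selection file such as 'TCR19.2_R1-R4.csv'."""
    tcr_name = file_name.split("_R1-R4")[0]
    if not tcr_name.startswith("TCR"):
        raise ValueError(f"not a TCR selection file: {file_name!r}")
    return "2&9" if tcr_name in LIBRARY_2_9_TCRS else "2&8"


print(library_anchor("TCR4.3_R1-R4.csv"))     # 2&9
print(library_anchor("TCR19.2C1_R1-R4.csv"))  # 2&8
```

Knowing the anchor positions matters when comparing a selection file against a naive library, since each selection should be paired with the naive file that has the same fixed anchors.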

Data Processing and Analysis

All data processing and analysis procedures performed on the dataset are described in detail in the accompanying research paper. Note that we used filtered data for downstream analysis and machine learning as described in the paper.

Dataset Structure

Files are named according to the following patterns:

HLA-B27_Naive_position_fixed (e.g., HLA-B27_Naive_P2_P9_fixed.csv, HLA-B27_Naive_P2_P8_fixed.csv)
TCR name_R1-R4 (e.g., TCR19.2_R1-R4.csv)
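
As an illustration of the two naming patterns, the sketch below classifies a file name with regular expressions. It assumes the names match the examples given here exactly. The function name and the returned dictionary keys are hypothetical choices made for this example.

```python
import re
from typing import Optional

# Patterns inferred from the example names in this README (an assumption).
NAIVE_RE = re.compile(r"^HLA-B27_Naive_P(\d+)_P(\d+)_fixed\.csv$")
SELECTION_RE = re.compile(r"^TCR(?P<name>.+)_R1-R4\.csv$")


def parse_file_name(file_name: str) -> Optional[dict]:
    """Classify a file name as a naive library or a TCR selection file."""
    match = NAIVE_RE.match(file_name)
    if match:
        anchors = (int(match.group(1)), int(match.group(2)))
        return {"kind": "naive", "anchors": anchors}
    match = SELECTION_RE.match(file_name)
    if match:
        return {"kind": "selection", "tcr": match.group("name")}
    return None  # name follows neither pattern


print(parse_file_name("HLA-B27_Naive_P2_P9_fixed.csv"))
print(parse_file_name("TCR19.2_R1-R4.csv"))
```

File names that follow neither pattern return None, so a caller can skip or report them rather than guessing.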

File Format

The processed data files are in tabular format with the following columns:

Naive Libraries (2 files)
(1) Peptide sequences
(2) Peptide counts
TCR Selection Data (21 files)
(1) Peptide sequences
(2) Peptide counts (Round1)
(3) Peptide counts (Round2)
(4) Peptide counts (Round3)
(5) Peptide counts (Round4)
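
A minimal sketch of reading a TCR selection file and converting raw counts to per-round frequencies is shown below. It uses only the Python standard library. It assumes comma-separated files with the five columns in the order listed above and a header row in the first line; neither the header names nor the presence of a header is confirmed by this README, so adjust `has_header` as needed. The function names are hypothetical, and the toy input is made up for illustration. This is not the filtering or analysis procedure used in the paper.

```python
import csv
import io
from typing import Dict, List, TextIO

N_ROUNDS = 4  # R1-R4, per the column list above


def read_selection_table(handle: TextIO, has_header: bool = True) -> Dict[str, List[int]]:
    """Read peptide -> [R1, R2, R3, R4] counts from a comma-separated table.

    Assumption: columns are ordered as in this README and the first row
    is a header. Pass has_header=False if the files have no header.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    counts: Dict[str, List[int]] = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        if len(row) < 1 + N_ROUNDS:
            raise ValueError(f"expected {1 + N_ROUNDS} columns, got {len(row)}")
        counts[row[0]] = [int(value) for value in row[1:1 + N_ROUNDS]]
    return counts


def round_frequencies(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Divide each count by the total reads of its round (0.0 if a round is empty)."""
    totals = [sum(c[i] for c in counts.values()) for i in range(N_ROUNDS)]
    return {
        peptide: [c[i] / totals[i] if totals[i] else 0.0 for i in range(N_ROUNDS)]
        for peptide, c in counts.items()
    }


# Toy input with made-up peptides and counts, for illustration only.
toy = io.StringIO("peptide,R1,R2,R3,R4\nAAAAAAAAA,1,2,3,4\nGGGGGGGGG,3,2,1,0\n")
toy_counts = read_selection_table(toy)
toy_freqs = round_frequencies(toy_counts)
print(toy_freqs["AAAAAAAAA"])  # [0.25, 0.5, 0.75, 1.0]
```

For a real file, open it with `open(path, newline="")` and pass the handle to `read_selection_table`. Normalizing by round total makes rounds with different sequencing depth comparable.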

Contact Information

For any inquiries or further information, please contact:

K. Christopher Garcia (kcgarcia@stanford.edu)
Nan Wang (nanwang1@stanford.edu)

Data from: Deep peptide recognition profiling decodes TCR specificity and enables disease-associated antigen discovery
