Rheumatoid arthritis (RA) is frequently preceded by an at-risk phase of disease marked by presence of anti-citrullinated protein antibodies (ACPA) but absence of clinically detectable synovitis. Preemptive treatment of at-risk individuals (ARI) could prevent or delay future tissue damage but immunobiology of this state is unclear. Using integrative multiomics, we longitudinally profiled ARI, where one-third of participants developed clinical RA on study, and found evidence of systemic inflammation, broad activation of naïve T and B cell subsets, proinflammatory-skewing of effector B cells, expansion of memory T cells with an activated Tfh-like signature despite no appreciable elevation in circulating ACPA levels. CD4 T-cell gene signatures from ARI who went on to develop clinical RA reflected those of RA patients that responded to Abatacept, implicating these cell states in RA pathogenesis.

Description of the data and file structure

Data files that support analyses shown in He, Glass et al. 2025 (Science Translational Medicine). Data were derived from measurements on peripheral blood mononuclear cells (PBMCs) from anti-citrullinated protein antibody positive (ACPA+) individuals at risk for developing rheumatoid arthritis (RA), and ACPA- controls. Detailed methods are contained in the paper.

Datafile_S1

Cross-sectional comparison of transcriptome profiles between ACPA+ at-risk individuals (ARI) and ACPA- controls (HC1). Results calculated using DESeq2 on pseudobulked scRNA-seq data.

cell type: immune cell type
gene: gene name
baseMean: average of the normalized count values, dividing by size factors, taken over all samples
log2FoldChange: fold change between ARI and HC1 transcripts, on a log2 scale; positive values indicate higher expression in ARI
lfcSE: standard error estimate for the log2 fold change estimate
stat: value of the test statistic for the gene or transcript
pvalue: nominal P values were calculated using the Wald test
FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported

Datafile_S2

Average Spectra gene program scores across samples and PBMC cell types. Cross-sectional analysis comparing ACPA+ at-risk individuals (ARI), ACPA+ individuals with early RA (ERA) and ACPA- controls (HC1)

Sample ID: unique sample identifier
AIFI Level 1 Cell Type: immune cell type; derived from first hierarchical level of cell type identification from Allen Institute for Immunology (AIFI) immune cell atlas. The first hierarchical level is the least granular.
Spectra Factor: gene module from Spectra paper (GET REF)
mean Score: mean of Spectra score
nonzero percent: percent of cells that have a non-zero Spectra score
Batch: scRNA-seq batch samples were run in
BMI: body mass index; ND = not determined
Sex: sex
age: participant age at time of sample collection
status: grouping - ARI, ERA, HC1

Datafile_S3

Cross-sectional comparison of differential PBMC transcripts between ACPA+ converters (CONV), ACPA+ non-converters (NONC) and ACPA- controls (HC1). Results calculated using DESeq2 on pseudobulked scRNA-seq data. Data was from baseline samples.

Cell Type: immune cell type
contrast: specific comparison being made between groups
gene: gene name
baseMean: average of the normalized count values, dividing by size factors, taken over all samples
log2FoldChange: fold change between ARI and HC1 transcripts, on a log2 scale; positive values indicate higher expression in 2nd group in the contrast column
lfcSE: standard error estimate for the log2 fold change estimate
stat: value of the test statistic for the gene or transcript
pvalue: nominal P values were calculated using the Wald test
FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported

Datafile_S4

Intra-donor coefficient of variation (CV) in ACPA+ converters (CONV) compared with longitudinal healthy control samples (HC2). All input data were derived from longitudinal measurements. Data are reported by gene per cell type.

Cell Type: immune cell type
gene: gene name
Mean CV in HC2: mean coefficient of variation in HC2 group
Mean CV in ARI (Converters): mean coefficient of variation in CONV group
Ratio of CV (Converters/HC2): ratio of mean CV values from CONV over HC2 groups
Direction: indicates whether mean CV is higher in converters (ARI to clinical RA) or HC2

In a separate group of columns, we provide statistical results for the above CV calculations. These calculations are per cell type and are separated by a blank column from the calculated CV values.

Cell type: immune cell type
Estimate: mean of the differences in paired transcript expression between CONV and HC2
Statistic: t-statistic from paired t test
P value: nominal P values were calculated using paired t tests (see Method)
Parameter: degrees of freedom
Conf.low: lower bound of the 95% confidence interval for the mean difference
Conf.high: upper bound of the 95% confidence interval for the mean difference
Method: method for calculating P values
Alternative: alternative hypothesis (two-sided)
Direction: annotation indicating whether CV values were higher in CONV or HC2
FDR: adjusted P values (Benjamini-Hochberg procedure)

Datafile_S5

Differentially expressed genes over time in ACPA+ converters (CONV). Calculated using linear mixed effect models as details in Supplementary Methods.

Cell type: immune cell type
Gene: gene name
Term: model coefficient
Effect size: change in transcript expression over time (per day); positive values indicate increased transcript expression as converters progress to clinical RA onset
std.error: standard error for the effect size
p.value: nominal P values calculated by linear mixed effect models
FDR: P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported

Datafile_S6

Differentially expressed genes in paired samples comparing last pre-symptomatic visit with clinical RA diagnosis visit. Paired pre-post analysis, as detailed in Supplementary Methods.

cell type: immune cell type
gene: gene name
baseMean: average of the normalized count values, dividing by size factors, taken over all samples
log2FoldChange: fold change between last pre-symptomatic visit vs. clinical RA diagnosis visit; positive values indicate higher transcript expression at clinical RA diagnosis.
lfcSE: standard error estimate for the log2 fold change estimate
stat: value of the test statistic for the gene or transcript
pvalue: nominal P values were calculated using the Wald test
FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported

Datafile_S7

Differentially accessible chromatin regions from TEA-seq cross-sectional data comparing a subset of ACPA+ at-risk individuals (ARI) and ACPA- healthy controls (HC2). Details provided in Supplementary Methods. Only peaks with FDR < 0.2 are included in this table.

chromosome: chromosome number
start: starting genome coordinates for peak; annotations based on UCSC hg38 reference genome
end: ending genome coordinates for peak; annotations based on UCSC hg38 reference genome
width: size of peak (basepairs; bp)
Tile: concatenated chromosome-start-end coordinates (i.e. location of peak)
CellPopulation: immune cell type
Foreground: testing group (ARI)
Background: reference group (HC2)
P value: nominal P value using zero-inflated Wilcoxon test from MOCHA (Rachid Zaim 2024 Nature Communications)
FDR: multiple test correction from MOCHA
Log2 Fold Change (Continuous): Fold change in accessibility on log2 scale; Foreground/Background; Not calculated when one group had no detectable peak
tileType: annotated peak location; regions within 2000bp upstream and 100bp downstream of transcriptional start site (TSS) were considered promoter regions; intragenic = within gene; distal = outside gene and promoter
Gene: gene name

Datafile_S8

Differentially expressed genes from responders and non-responders before and after TNF inhibitor therapy. Re-analyzed PBMC RNA-seq data from Farutin et al. 2019 Arthritis Res. Ther.

gene: gene name
baseMean: average of the normalized count values, dividing by size factors, taken over all samples
log2FoldChange: fold change between last pre-symptomatic visit vs. clinical RA diagnosis visit; positive values indicate higher transcript expression at clinical RA diagnosis.
lfcSE: standard error estimate for the log2 fold change estimate
stat: value of the test statistic for the gene or transcript
pvalue: nominal P values were calculated using the Wald test
q-values: P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
treatment: disease-modifying anti-rheumatic drug therapy patients were taking
group: responder or non-responder to TNF therapy

Datafile_S9

Individual-level data for experiments where n < 20. To comply with figure numbering in the published article, data are provided in a single excel file with tabs specifying individual figures.

Data supporting Fig 1D-E
- Assay: protein name
- Sample ID: unique sample identifier
- Cluster: k-means cluster each sample belongs to (see Fig 1C in main article)
- NPX: normalized protein expression, as measured by Olink; log2 scale
- Status: group each sample belongs to (i.e. converter, non-converter, etc.)
- Concentration: concentration of each protein in plasma, as measured by Mesoscale Discovery (MSD) assays
- Units: units for concentration
Data supporting Fig 2F, S5G and 2I
- Sample ID: unique sample identifier
- Subject ID: participant identifier
- Sex: sex
- Status: group each sample belongs to (preclinical RA, clinical RA)
- Cell type: immune cell type
- Gene: gene name
- Normalized Expression: gene count, normalized using variance stabilizing normalization (VST); see Supplemental Methods for details
- Counts: RNA counts for cell type listed
- Total CD14 Monocyte Counts: RNA counts for CD14 monocytes
- Frequency within CD14 Monocytes: frequency of cell type within CD14 monocytes (counts/CD14mono counts)
Data supporting fig S6I
- Subject ID: participant identifier
- 10X Sequencing Technology: indicates which 10X scRNA method was used (3' or 5' sequencing)
- IGHG3 Expression: germline transcript (GLT) expression
- Ig Isotyping Method: indicates whether RNA-based (this paper) or BCR-seq-based; see Supplemental Methods for further details
Data supporting fig S8 (TEA-seq data from Thomson 2023 Nature Immunology)
- Cluster: RNA cluster
- gene: gene name (scRNA-seq)
- mean_expression: mean expression of transcript (left) or protein (right)
- percent_expressing: number of cells in corresponding cluster expressing that gene (left) or protein (right)
- Surface Protein: protein name (CITE panel)
Data supporting Fig 6 and fig S12
- ATAC clusters: cluster from ATAC-seq portion of TEA-seq
- counts: number of cells for each cluster in each sample
- frequency: frequency of each cluster in total CD4 T cells for each sample
- cohort: groups (at-risk individuals/ARI or healthy control/HC2)
- Sample ID: unique sample identifier
- Subject ID: participant identifier
- CLR: centre log ratio-transformed frequency
- Method: transformation method
- Group 1: first comparison group
- Group 2: second comparison group
- n1: sample size of group 1
- n2: sample size of group 2
- statistic: Wilcoxon rank-sum test statistic
- P value: nominal P values calculated using Wilcoxon rank-sum test
- Adjusted P value: Adjusted P value corrected for multiple hypothesis testing with Benjamini-Hochberg procedure
- Percent expression: percent of cells expressing indicated surface protein marker
- Mean: mean expression of indicated surface protein marker
- Protein: protein name
- Scores: z-score for the computation of a p-value for each protein for each group (output from scanpy.tl.rank_genes_groups function)
- Logfoldchanges: log2 fold change for each gene for each group (output from scanpy.tl.rank_genes_groups function)
Data supporting Fig 7G (data from Iwasaki 2024 Arthritis Res. Ther.)
- Gene: gene name
- Normalized Expression: normalized transcript expression
- Time: sample time point (0 months or 3 months)
- Patient ID: participant identifier
- Timepoint: pre- or post- abatacept (ABT) treatment

Files and variables

See data description for details. See manuscript for detailed methods.

File: Datafile_S1_-_ARIvHC1_deg_results.csv

File: Datafile_S2_-_Spectra_average_scores.csv

File: Datafile_S3_-_conv_nonc_hc_degs.csv

File: Datafile_S4_-_ALTRA_HC2_CV_comparison.csv

File: Datafile_S5_-_Longitudinal_degs_table.csv

File: Datafile_S6_-_paired_degs.csv

File: Datafile_S7_-_TeaSeq_DAPs.csv

File: Datafile_S8_-_TNFi_deg_results.csv

File: Datafile_S9_-_Individual-level_data_for_experiments_where_n_below_20.xlsx

Code/software

R and Python code used to perform analyses and generate figures is provided in GitHub (https://github.com/aifimmunology/ALTRA-manuscript/) or Zenodo (https://doi.org/10.5281/zenodo.17064925).

Access information

Other publicly accessible locations of the data:

Allen Institute interactive data visualizations: https://apps.allenimmunology.org/aifi/insights/ra-progression/
Processed scRNA-seq data from human PBMCs generated in this study can be downloaded from GEO under accession number GSE274680
Raw scRNA-seq fastq files are deposited in dbGap (phs003944.v1.p1)

Data was derived from the following sources:

Abatacept treatment data from RA patients was derived from Iwasaki et al. (2024) PMID: 38167328 and can be downloaded from Zenodo (https://zenodo.org/records/8250013).
TNFi treatment data from RA patients was derived from Farutin et al. (2019) PMID: 31647025 and can be downloaded from GEO under accession number GSE129705
T cell TEA-seq data was derived from Thomson et al. (2023) PMID: 37845489 and can be downloaded from Zenodo (https://zenodo.org/records/14783021).

Human subjects data

No identifiable information or data was submitted to the Dryad database. To de-identify metadata, participant names and visit details were anonymized by replacing with codes. Subject IDs were assigned an alphanumeric code that reflected their group (i.e. ARI = at-risk individual) and a random but sequential number. Participants signed a consent and authorization form for study participation that included sharing of de-identified data. These forms were reviewed and approved by Institutional Review Boards (IRB), as detailed in the paper.

Progression to rheumatoid arthritis in at-risk individuals is defined by systemic inflammation and by T and B cell dysregulation

Data files

Abstract

Description of the data and file structure

Datafile_S1

Datafile_S2

Datafile_S3

Datafile_S4

Datafile_S5

Datafile_S6

Datafile_S7

Datafile_S8

Datafile_S9

Files and variables

File: Datafile_S1_-_ARIvHC1_deg_results.csv

File: Datafile_S2_-_Spectra_average_scores.csv

File: Datafile_S3_-_conv_nonc_hc_degs.csv

File: Datafile_S4_-_ALTRA_HC2_CV_comparison.csv

File: Datafile_S5_-_Longitudinal_degs_table.csv

File: Datafile_S6_-_paired_degs.csv

File: Datafile_S7_-_TeaSeq_DAPs.csv

File: Datafile_S8_-_TNFi_deg_results.csv

File: Datafile_S9_-_Individual-level_data_for_experiments_where_n_below_20.xlsx

Code/software

Access information

Human subjects data

Progression to rheumatoid arthritis in at-risk individuals is defined by systemic inflammation and by T and B cell dysregulation

Data files

Abstract

README: Progression to rheumatoid arthritis in at-risk individuals is defined by systemic inflammation and by T and B cell dysregulation

Description of the data and file structure

Datafile_S1

Datafile_S2

Datafile_S3

Datafile_S4

Datafile_S5

Datafile_S6

Datafile_S7

Datafile_S8

Datafile_S9

Files and variables

File: Datafile_S1_-_ARIvHC1_deg_results.csv

File: Datafile_S2_-_Spectra_average_scores.csv

File: Datafile_S3_-_conv_nonc_hc_degs.csv

File: Datafile_S4_-_ALTRA_HC2_CV_comparison.csv

File: Datafile_S5_-_Longitudinal_degs_table.csv

File: Datafile_S6_-_paired_degs.csv

File: Datafile_S7_-_TeaSeq_DAPs.csv

File: Datafile_S8_-_TNFi_deg_results.csv

File: Datafile_S9_-_Individual-level_data_for_experiments_where_n_below_20.xlsx

Code/software

Access information

Human subjects data