Progression to rheumatoid arthritis in at-risk individuals is defined by systemic inflammation and by T and B cell dysregulation
Data files
Sep 20, 2025 version files 166.30 MB
-
Datafile_S1_-_ARIvHC1_deg_results.csv
31.62 MB
-
Datafile_S2_-_Spectra_average_scores.csv
22.81 MB
-
Datafile_S3_-_conv_nonc_hc_degs.csv
44.48 MB
-
Datafile_S4_-_ALTRA_HC2_CV_comparison.csv
11.09 MB
-
Datafile_S5_-_Longitudinal_degs_table.csv
23.64 MB
-
Datafile_S6_-_paired_degs.csv
25.25 MB
-
Datafile_S7_-_TeaSeq_DAPs.csv
4.50 MB
-
Datafile_S8_-_TNFi_deg_results.csv
2.64 MB
-
Datafile_S9_-_Individual-level_data_for_experiments_where_n_below_20.xlsx
251.99 KB
-
README.md
14.71 KB
Abstract
Rheumatoid arthritis (RA) is frequently preceded by an at-risk phase of disease marked by presence of anti-citrullinated protein antibodies (ACPA) but absence of clinically detectable synovitis. Preemptive treatment of at-risk individuals (ARI) could prevent or delay future tissue damage but immunobiology of this state is unclear. Using integrative multiomics, we longitudinally profiled ARI, where one-third of participants developed clinical RA on study, and found evidence of systemic inflammation, broad activation of naïve T and B cell subsets, proinflammatory-skewing of effector B cells, expansion of memory T cells with an activated Tfh-like signature despite no appreciable elevation in circulating ACPA levels. CD4 T-cell gene signatures from ARI who went on to develop clinical RA reflected those of RA patients that responded to Abatacept, implicating these cell states in RA pathogenesis.
https://doi.org/10.5061/dryad.1rn8pk13z
Description of the data and file structure
Data files that support analyses shown in He, Glass et al. 2025 (Science Translational Medicine). Data were derived from measurements on peripheral blood mononuclear cells (PBMCs) from anti-citrullinated protein antibody positive (ACPA+) individuals at risk for developing rheumatoid arthritis (RA), and ACPA- controls. Detailed methods are contained in the paper.
Datafile_S1
Cross-sectional comparison of transcriptome profiles between ACPA+ at-risk individuals (ARI) and ACPA- controls (HC1). Results calculated using DESeq2 on pseudobulked scRNA-seq data.
- cell type: immune cell type
- gene: gene name
- baseMean: average of the normalized count values, dividing by size factors, taken over all samples
- log2FoldChange: fold change between ARI and HC1 transcripts, on a log2 scale; positive values indicate higher expression in ARI
- lfcSE: standard error estimate for the log2 fold change estimate
- stat: value of the test statistic for the gene or transcript
- pvalue: nominal P values were calculated using the Wald test
- FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
Datafile_S2
Average Spectra gene program scores across samples and PBMC cell types. Cross-sectional analysis comparing ACPA+ at-risk individuals (ARI), ACPA+ individuals with early RA (ERA) and ACPA- controls (HC1)
- Sample ID: unique sample identifier
- AIFI Level 1 Cell Type: immune cell type; derived from first hierarchical level of cell type identification from Allen Institute for Immunology (AIFI) immune cell atlas. The first hierarchical level is the least granular.
- Spectra Factor: gene module from Spectra paper (GET REF)
- mean Score: mean of Spectra score
- nonzero percent: percent of cells that have a non-zero Spectra score
- Batch: scRNA-seq batch samples were run in
- BMI: body mass index; ND = not determined
- Sex: sex
- age: participant age at time of sample collection
- status: grouping - ARI, ERA, HC1
Datafile_S3
Cross-sectional comparison of differential PBMC transcripts between ACPA+ converters (CONV), ACPA+ non-converters (NONC) and ACPA- controls (HC1). Results calculated using DESeq2 on pseudobulked scRNA-seq data. Data was from baseline samples.
- Cell Type: immune cell type
- contrast: specific comparison being made between groups
- gene: gene name
- baseMean: average of the normalized count values, dividing by size factors, taken over all samples
- log2FoldChange: fold change between ARI and HC1 transcripts, on a log2 scale; positive values indicate higher expression in 2nd group in the contrast column
- lfcSE: standard error estimate for the log2 fold change estimate
- stat: value of the test statistic for the gene or transcript
- pvalue: nominal P values were calculated using the Wald test
- FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
Datafile_S4
Intra-donor coefficient of variation (CV) in ACPA+ converters (CONV) compared with longitudinal healthy control samples (HC2). All input data were derived from longitudinal measurements. Data are reported by gene per cell type.
- Cell Type: immune cell type
- gene: gene name
- Mean CV in HC2: mean coefficient of variation in HC2 group
- Mean CV in ARI (Converters): mean coefficient of variation in CONV group
- Ratio of CV (Converters/HC2): ratio of mean CV values from CONV over HC2 groups
- Direction: indicates whether mean CV is higher in converters (ARI to clinical RA) or HC2
In a separate group of columns, we provide statistical results for the above CV calculations. These calculations are per cell type and are separated by a blank column from the calculated CV values.
- Cell type: immune cell type
- Estimate: mean of the differences in paired transcript expression between CONV and HC2
- Statistic: t-statistic from paired t test
- P value: nominal P values were calculated using paired t tests (see Method)
- Parameter: degrees of freedom
- Conf.low: lower bound of the 95% confidence interval for the mean difference
- Conf.high: upper bound of the 95% confidence interval for the mean difference
- Method: method for calculating P values
- Alternative: alternative hypothesis (two-sided)
- Direction: annotation indicating whether CV values were higher in CONV or HC2
- FDR: adjusted P values (Benjamini-Hochberg procedure)
Datafile_S5
Differentially expressed genes over time in ACPA+ converters (CONV). Calculated using linear mixed effect models as details in Supplementary Methods.
- Cell type: immune cell type
- Gene: gene name
- Term: model coefficient
- Effect size: change in transcript expression over time (per day); positive values indicate increased transcript expression as converters progress to clinical RA onset
- std.error: standard error for the effect size
- p.value: nominal P values calculated by linear mixed effect models
- FDR: P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
Datafile_S6
Differentially expressed genes in paired samples comparing last pre-symptomatic visit with clinical RA diagnosis visit. Paired pre-post analysis, as detailed in Supplementary Methods.
- cell type: immune cell type
- gene: gene name
- baseMean: average of the normalized count values, dividing by size factors, taken over all samples
- log2FoldChange: fold change between last pre-symptomatic visit vs. clinical RA diagnosis visit; positive values indicate higher transcript expression at clinical RA diagnosis.
- lfcSE: standard error estimate for the log2 fold change estimate
- stat: value of the test statistic for the gene or transcript
- pvalue: nominal P values were calculated using the Wald test
- FDR (q-value): P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
Datafile_S7
Differentially accessible chromatin regions from TEA-seq cross-sectional data comparing a subset of ACPA+ at-risk individuals (ARI) and ACPA- healthy controls (HC2). Details provided in Supplementary Methods. Only peaks with FDR < 0.2 are included in this table.
- chromosome: chromosome number
- start: starting genome coordinates for peak; annotations based on UCSC hg38 reference genome
- end: ending genome coordinates for peak; annotations based on UCSC hg38 reference genome
- width: size of peak (basepairs; bp)
- Tile: concatenated chromosome-start-end coordinates (i.e. location of peak)
- CellPopulation: immune cell type
- Foreground: testing group (ARI)
- Background: reference group (HC2)
- P value: nominal P value using zero-inflated Wilcoxon test from MOCHA (Rachid Zaim 2024 Nature Communications)
- FDR: multiple test correction from MOCHA
- Log2 Fold Change (Continuous): Fold change in accessibility on log2 scale; Foreground/Background; Not calculated when one group had no detectable peak
- tileType: annotated peak location; regions within 2000bp upstream and 100bp downstream of transcriptional start site (TSS) were considered promoter regions; intragenic = within gene; distal = outside gene and promoter
- Gene: gene name
Datafile_S8
Differentially expressed genes from responders and non-responders before and after TNF inhibitor therapy. Re-analyzed PBMC RNA-seq data from Farutin et al. 2019 Arthritis Res. Ther.
- gene: gene name
- baseMean: average of the normalized count values, dividing by size factors, taken over all samples
- log2FoldChange: fold change between last pre-symptomatic visit vs. clinical RA diagnosis visit; positive values indicate higher transcript expression at clinical RA diagnosis.
- lfcSE: standard error estimate for the log2 fold change estimate
- stat: value of the test statistic for the gene or transcript
- pvalue: nominal P values were calculated using the Wald test
- q-values: P values were corrected for multiple hypothesis testing using the Storey-Tibshirani method; resulting q-values are reported
- treatment: disease-modifying anti-rheumatic drug therapy patients were taking
- group: responder or non-responder to TNF therapy
Datafile_S9
Individual-level data for experiments where n < 20. To comply with figure numbering in the published article, data are provided in a single excel file with tabs specifying individual figures.
- Data supporting Fig 1D-E
- Assay: protein name
- Sample ID: unique sample identifier
- Cluster: k-means cluster each sample belongs to (see Fig 1C in main article)
- NPX: normalized protein expression, as measured by Olink; log2 scale
- Status: group each sample belongs to (i.e. converter, non-converter, etc.)
- Concentration: concentration of each protein in plasma, as measured by Mesoscale Discovery (MSD) assays
- Units: units for concentration
- Data supporting Fig 2F, S5G and 2I
- Sample ID: unique sample identifier
- Subject ID: participant identifier
- Sex: sex
- Status: group each sample belongs to (preclinical RA, clinical RA)
- Cell type: immune cell type
- Gene: gene name
- Normalized Expression: gene count, normalized using variance stabilizing normalization (VST); see Supplemental Methods for details
- Counts: RNA counts for cell type listed
- Total CD14 Monocyte Counts: RNA counts for CD14 monocytes
- Frequency within CD14 Monocytes: frequency of cell type within CD14 monocytes (counts/CD14mono counts)
- Data supporting fig S6I
- Subject ID: participant identifier
- 10X Sequencing Technology: indicates which 10X scRNA method was used (3' or 5' sequencing)
- IGHG3 Expression: germline transcript (GLT) expression
- Ig Isotyping Method: indicates whether RNA-based (this paper) or BCR-seq-based; see Supplemental Methods for further details
- Data supporting fig S8 (TEA-seq data from Thomson 2023 Nature Immunology)
- Cluster: RNA cluster
- gene: gene name (scRNA-seq)
- mean_expression: mean expression of transcript (left) or protein (right)
- percent_expressing: number of cells in corresponding cluster expressing that gene (left) or protein (right)
- Surface Protein: protein name (CITE panel)
- Data supporting Fig 6 and fig S12
- ATAC clusters: cluster from ATAC-seq portion of TEA-seq
- counts: number of cells for each cluster in each sample
- frequency: frequency of each cluster in total CD4 T cells for each sample
- cohort: groups (at-risk individuals/ARI or healthy control/HC2)
- Sample ID: unique sample identifier
- Subject ID: participant identifier
- CLR: centre log ratio-transformed frequency
- Method: transformation method
- Group 1: first comparison group
- Group 2: second comparison group
- n1: sample size of group 1
- n2: sample size of group 2
- statistic: Wilcoxon rank-sum test statistic
- P value: nominal P values calculated using Wilcoxon rank-sum test
- Adjusted P value: Adjusted P value corrected for multiple hypothesis testing with Benjamini-Hochberg procedure
- Percent expression: percent of cells expressing indicated surface protein marker
- Mean: mean expression of indicated surface protein marker
- Protein: protein name
- Scores: z-score for the computation of a p-value for each protein for each group (output from scanpy.tl.rank_genes_groups function)
- Logfoldchanges: log2 fold change for each gene for each group (output from scanpy.tl.rank_genes_groups function)
- Data supporting Fig 7G (data from Iwasaki 2024 Arthritis Res. Ther.)
- Gene: gene name
- Normalized Expression: normalized transcript expression
- Time: sample time point (0 months or 3 months)
- Patient ID: participant identifier
- Timepoint: pre- or post- abatacept (ABT) treatment
Files and variables
See data description for details. See manuscript for detailed methods.
File: Datafile_S1_-_ARIvHC1_deg_results.csv
File: Datafile_S2_-_Spectra_average_scores.csv
File: Datafile_S3_-_conv_nonc_hc_degs.csv
File: Datafile_S4_-_ALTRA_HC2_CV_comparison.csv
File: Datafile_S5_-_Longitudinal_degs_table.csv
File: Datafile_S6_-_paired_degs.csv
File: Datafile_S7_-_TeaSeq_DAPs.csv
File: Datafile_S8_-_TNFi_deg_results.csv
File: Datafile_S9_-_Individual-level_data_for_experiments_where_n_below_20.xlsx
Code/software
R and Python code used to perform analyses and generate figures is provided in GitHub (https://github.com/aifimmunology/ALTRA-manuscript/) or Zenodo (https://doi.org/10.5281/zenodo.17064925).
Access information
Other publicly accessible locations of the data:
- Allen Institute interactive data visualizations: https://apps.allenimmunology.org/aifi/insights/ra-progression/
- Processed scRNA-seq data from human PBMCs generated in this study can be downloaded from GEO under accession number GSE274680
- Raw scRNA-seq fastq files are deposited in dbGap (phs003944.v1.p1)
Data was derived from the following sources:
- Abatacept treatment data from RA patients was derived from Iwasaki et al. (2024) PMID: 38167328 and can be downloaded from Zenodo (https://zenodo.org/records/8250013).
- TNFi treatment data from RA patients was derived from Farutin et al. (2019) PMID: 31647025 and can be downloaded from GEO under accession number GSE129705
- T cell TEA-seq data was derived from Thomson et al. (2023) PMID: 37845489 and can be downloaded from Zenodo (https://zenodo.org/records/14783021).
Human subjects data
No identifiable information or data was submitted to the Dryad database. To de-identify metadata, participant names and visit details were anonymized by replacing with codes. Subject IDs were assigned an alphanumeric code that reflected their group (i.e. ARI = at-risk individual) and a random but sequential number. Participants signed a consent and authorization form for study participation that included sharing of de-identified data. These forms were reviewed and approved by Institutional Review Boards (IRB), as detailed in the paper.
