Transcription start site analysis for heterogenous CD4+ T cells using 5′ scRNA-seq

Published Apr 22, 2024 on Dryad. https://doi.org/10.5061/dryad.gtht76hv9

Data files

Apr 22, 2024 version files (12.57 GB total)

20230928_TC_anno_Gene_rank_bed6.bed (1.87 MB)

5scCTSSbed_All.zip (11.84 GB)

CD4bulk_5sc_TCver_TCRremoved.removeMono.harmonyintegration_FigS14.rds (728.32 MB)

Data_summary.xlsx.zip (23.01 KB)

md5sum_CTSS_files.txt (7.71 KB)

README.md (4.12 KB)

Abstract

These datasets were generated with ReapTEC (read-level pre-filtering and transcribed enhancer call) from 5′ single-cell RNA-seq data of human heterogenous CD4+ T cells. By taking advantage of a unique “cap signature” derived from the 5′ end of a transcript, ReapTEC simultaneously profiles gene expression and enhancer activity at nucleotide resolution using 5′-end single-cell RNA sequencing (5′ scRNA-seq). The ReapTEC pipeline is described in detail at https://github.com/MurakawaLab/ReapTEC.

README: Transcription start site analysis for heterogenous CD4+ T cells using 5′ scRNA-seq

https://doi.org/10.5061/dryad.gtht76hv9

Description of the data and file structure

Data_summary.xlsx.zip: Summary of single-cell experiments in this study.

5scCTSSbed_All.zip: Contains 102 files of count data for analyzing transcription start site (TSS) signals. Details are as follows.

Our original raw and processed 5′ scRNA-seq data have been deposited in the National Bioscience Database Center (NBDC) Human Database (accession code: hum0350). Raw sequencing data originating from human subjects have been deposited in the Japanese Genotype-phenotype Archive (JGA; accession code: JGAS000689). We retrieved 5′ scRNA-seq data for human memory CD4+ T cells stimulated with viral antigens from the Gene Expression Omnibus database (accession number GSE152522). In total, 102 5′ scRNA-seq datasets were processed with the ReapTEC pipeline (https://github.com/MurakawaLab/ReapTEC). During processing, count files for each TSS (CTSS files) were generated. The files in 5scCTSSbed_All.zip are the CTSS files produced by the ReapTEC pipeline, further sorted with “LANG=C sort -k 4”.
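For readers who want to check or reproduce the sort order without a shell, the following is a minimal Python sketch that approximates `LANG=C sort -k 4` for tab-separated lines. It assumes single-tab field separators; the example rows use placeholder values and do not reflect the actual CTSS column layout, which is defined by ReapTEC.

```python
def c_locale_sort_k4(lines):
    """Approximate `LANG=C sort -k 4` for tab-separated lines.

    The sort key runs from the 4th field to the end of the line and is
    compared byte by byte (C locale). Ties are broken by comparing the
    whole line, which is what sort does by default.
    """
    def key(line):
        text = line.rstrip("\n")
        fields = text.split("\t")
        return ("\t".join(fields[3:]).encode("utf-8"), text.encode("utf-8"))

    return sorted(lines, key=key)


# Placeholder rows for illustration only (hypothetical values).
rows = [
    "chr1\t100\t101\tnameB\t1\t+",
    "chr2\t50\t51\tnameA\t2\t-",
    "chr1\t10\t11\tnameA\t2\t+",
]
sorted_rows = c_locale_sort_k4(rows)
for row in sorted_rows:
    print(row)
```

Byte-wise comparison matters because locale-aware sorting can order uppercase, lowercase, and punctuation differently from the C locale.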

Naming convention (Please refer to Data_summary.xlsx)

Abbreviations: D1, donor 1; D2, donor 2; D3, donor 3

addExp_Bulk_5sc: Data originating from 5′ scRNA-seq of bulk CD4+ T cells, corresponding to the samples listed in the "5sc_snRNA_seq_added_samples" sheet of Data_summary.xlsx.

Bulk_5sc_[replication number]: Data originating from 5′ scRNA-seq of bulk CD4+ T cells, corresponding to the samples listed in the "5sc_3scRNA-seq_CD4bulkCITE-seq" sheet of Data_summary.xlsx.

Treg, Th1or2, Tfh, Th17, LAG3_5sc_[replication number]: Data originating from 5′ scRNA-seq of subpopulations, corresponding to the samples listed in the "5sc_3scRNA-seq_CD4bulkCITE-seq" sheet of Data_summary.xlsx.

Treg, Th1or2, Tfh, Th17, LAG3_CITE_[replication number]: Data originating from 5′ CITE-seq of subpopulations, corresponding to the samples listed in the "5CITE-seq_added_samples" sheet of Data_summary.xlsx.

SoftclipG_aCD4_5sc_48h_sorted.CTSS.bed: Data originating from 5′ scRNA-seq of activated CD4+ T cells, corresponding to the samples listed in the "5sc_snRNA_seq_added_samples" sheet of Data_summary.xlsx.

SoftclipG_[SRR number]_[stimulation hours]: Data originating from 5′ scRNA-seq of human memory CD4+ T cells stimulated with viral antigens, retrieved from the Gene Expression Omnibus database (accession number GSE152522).
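The naming convention above can be turned into a small lookup, sketched here in Python. This is only an illustration: the category labels are invented for this example, the rules match on the name fragments listed above, and every file name in the example other than SoftclipG_aCD4_5sc_48h_sorted.CTSS.bed is hypothetical. Check the actual names in the archive and Data_summary.xlsx before relying on it.

```python
import re

# Ordered rules: the first matching pattern wins, so the more specific
# "addExp_Bulk_5sc" is tested before the generic "Bulk_5sc_".
# Category labels are illustrative and not part of the dataset.
_SUBSETS = r"(?:Treg|Th1or2|Tfh|Th17|LAG3)"
NAMING_RULES = [
    (re.compile(r"addExp_Bulk_5sc"), "bulk_5sc_added_samples"),
    (re.compile(r"aCD4_5sc"), "activated_CD4_5sc"),
    (re.compile(_SUBSETS + r"_CITE_"), "subpopulation_5CITE"),
    (re.compile(_SUBSETS + r"_5sc_"), "subpopulation_5sc"),
    (re.compile(r"Bulk_5sc_"), "bulk_5sc"),
    (re.compile(r"SRR\d+"), "GSE152522_memory_CD4"),
]


def classify_ctss_name(file_name):
    """Return an illustrative category for a CTSS file name, or None."""
    for pattern, category in NAMING_RULES:
        if pattern.search(file_name):
            return category
    return None


# Only the first name below appears in the README; the rest are hypothetical.
for name in [
    "SoftclipG_aCD4_5sc_48h_sorted.CTSS.bed",
    "Treg_CITE_1",
    "Bulk_5sc_2",
]:
    print(name, "->", classify_ctss_name(name))
```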

md5sum_CTSS_files.txt: A text file containing the md5sum of the files included in 5scCTSSbed_All.zip.
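After unzipping 5scCTSSbed_All.zip, the checksums can be verified with a short Python sketch such as the one below. It assumes that md5sum_CTSS_files.txt follows the usual `md5sum` output layout (a hex digest, whitespace, then the file name); the function names and the directory argument are placeholders chosen for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(text):
    """Parse lines of the form '<hex digest>  <file name>' into a dict.

    Assumes standard md5sum output; a leading '*' (binary-mode marker)
    on the file name is dropped.
    """
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        expected[name.lstrip("*").strip()] = digest.lower()
    return expected


def verify_manifest(manifest_path, data_dir):
    """Return {file name: True/False}; missing files are reported as False."""
    expected = parse_md5_manifest(Path(manifest_path).read_text())
    results = {}
    for name, digest in expected.items():
        path = Path(data_dir) / name
        results[name] = path.is_file() and md5_of_file(path) == digest
    return results


# Example call (paths are placeholders):
# results = verify_manifest("md5sum_CTSS_files.txt", "5scCTSSbed_All")
# print([name for name, ok in results.items() if not ok])
```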

20230928_TC_anno_Gene_rank_bed6.bed: A BED6 file of robust TSS peaks generated using ReapTEC (with a cutoff of log2CPM ≥ 2 in at least one cell cluster across the 136 cell clusters of CD4+ T cells in this study). Robust TSS peaks were named according to the nearest known transcript and numbered in ascending order from upstream to downstream of the transcript.
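As a starting point for working with the peak file, here is a minimal Python sketch that reads standard BED6 records (chrom, start, end, name, score, strand; coordinates 0-based and half-open). The peak names in the example are placeholders; the real naming scheme is the one described above, and the sketch does not attempt to parse it.

```python
from typing import NamedTuple


class Bed6(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    name: str
    score: str   # kept as text; no assumption is made about its meaning
    strand: str


def parse_bed6(lines):
    """Parse tab-separated BED6 lines, skipping blanks and header lines."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
        records.append(Bed6(chrom, int(start), int(end), name, score, strand))
    return records


# Placeholder records for illustration (hypothetical coordinates and names).
example_records = parse_bed6([
    "chr1\t100\t150\tpeakA\t0\t+\n",
    "# a comment line\n",
    "chr2\t5\t9\tpeakB\t0\t-\n",
])
for rec in example_records:
    print(rec.name, rec.chrom, rec.end - rec.start, rec.strand)

# With the real file (path is a placeholder):
# with open("20230928_TC_anno_Gene_rank_bed6.bed") as handle:
#     peaks = parse_bed6(handle)
```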

CD4bulk_5sc_TCver_TCRremoved.removeMono.harmonyintegration_FigS14.rds: The Seurat object used in Supplementary Figure S14 (Oguchi et al. Science). The sorted CTSS files in 5scCTSSbed_All.zip and 20230928_TC_anno_Gene_rank_bed6.bed were used to count the reads mapping to each robust TSS peak at the single-cell level and to create count matrix files for Seurat version 5 (4.9.9.9067). Of these matrix files, those for bulk CD4+ T cells (22 files) were processed in Seurat. Robust TSS peaks expressed in three or more cells were retained. Doublets were predicted with the R package scDblFinder version 1.14.0 and removed; singlets were retained for downstream analysis. Cells expressing fewer than 200 TSS peaks and those with more than 3% of all transcripts derived from the mitochondrial genome were excluded. TSS peaks of T cell receptor–related transcripts were excluded from this analysis. Data integration and batch correction were performed using Harmony version 1.0.3 via the “IntegrateLayers” function implemented in Seurat.
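The peak and cell thresholds described above can be illustrated with a small, library-free Python sketch. This is not the Seurat workflow used to build the object: doublet removal, exclusion of T cell receptor–related peaks, and Harmony integration are omitted, the order of operations is an assumption, and the input layout (a nested dictionary of non-zero counts) is chosen only for this example.

```python
def filter_peaks_and_cells(counts, mito_peaks,
                           min_cells=3, min_peaks=200, max_mito_pct=3.0):
    """Apply the peak and cell thresholds to a sparse count table.

    counts:     {peak: {cell: count}} holding non-zero counts (assumed layout)
    mito_peaks: set of peak names on the mitochondrial genome
    Returns (kept_peaks, kept_cells) as sets.
    """
    # 1) Keep peaks detected in at least `min_cells` cells.
    kept_peaks = {
        peak: cells for peak, cells in counts.items()
        if sum(1 for c in cells.values() if c > 0) >= min_cells
    }

    # 2) Per-cell number of detected peaks, total counts, mitochondrial counts.
    n_peaks, total, mito = {}, {}, {}
    for peak, cells in kept_peaks.items():
        for cell, count in cells.items():
            if count <= 0:
                continue
            n_peaks[cell] = n_peaks.get(cell, 0) + 1
            total[cell] = total.get(cell, 0) + count
            if peak in mito_peaks:
                mito[cell] = mito.get(cell, 0) + count

    # 3) Drop cells with too few peaks or too high a mitochondrial fraction.
    kept_cells = {
        cell for cell in total
        if n_peaks[cell] >= min_peaks
        and 100.0 * mito.get(cell, 0) / total[cell] <= max_mito_pct
    }
    return set(kept_peaks), kept_cells


# Toy data with relaxed thresholds so the effect is visible (hypothetical).
toy_counts = {
    "p1": {"c1": 5, "c2": 3, "c3": 1},
    "p2": {"c1": 2, "c2": 1},
    "p3": {"c3": 4},
    "MT-p": {"c1": 1, "c2": 6},
}
toy_peaks, toy_cells = filter_peaks_and_cells(
    toy_counts, mito_peaks={"MT-p"}, min_cells=2, min_peaks=2, max_mito_pct=20.0
)
print(sorted(toy_peaks), sorted(toy_cells))
```

In the toy data, p3 is dropped because it is seen in a single cell, c2 is dropped for its high mitochondrial fraction, and c3 is dropped for having too few detected peaks.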

Code/Software
https://github.com/MurakawaLab/ReapTEC