Dryad

Data from: Clustering Deviation Index (CDI): A robust and accurate internal measure for evaluating scRNA-seq data clustering

Cite this dataset

Fang, Jiyuan et al. (2022). Data from: Clustering Deviation Index (CDI): A robust and accurate internal measure for evaluating scRNA-seq data clustering [Dataset]. Dryad. https://doi.org/10.5061/dryad.08kprr55h

Abstract

The clustering of cells has been widely used to explore the heterogeneity of cell populations in single-cell RNA-sequencing (scRNA-seq). We proposed a parametric model for monoclonal and polyclonal scRNA-seq data to evaluate clustering results. Based on the parametric model, we proposed a metric (CDI) to quantify the goodness-of-fit of cell clustering to the data. Here we presented CT26.WT and T-CELL as two datasets to examine the performance of our model and metric. CT26.WT contains wild-type CT26 cells from the murine colorectal carcinoma cell line, and cells in CT26.WT are highly homogeneous. T-CELL contains T-cells from tumor tissue of mice three weeks after 4T1 tumor injection. From these datasets and public datasets, we validated our model and benchmarked our metric.

Methods

This dataset contains six files. Four of them (matrix.mtx, features.tsv, barcodes.tsv, CT26_bulk_30k.txt) are for CT26.WT, and the other two are for T-CELL.

CT26.WT sample preparation: Murine colorectal carcinoma cell line CT26.WT was obtained from the cell culture facility of Duke University and cultured in DMEM media (Sigma Aldrich). All cells were cultured at 37 °C. Single-cell clones were chosen and cultured for over 220 days. Bulk RNA-seq and single-cell RNA-seq samples were prepared on the same day.

CT26.WT bulk RNA-seq: Total RNA from ~1,000,000 cells from each group was extracted using the miniprep kit (Zymo Research) according to the manufacturer's instructions. The libraries were then sequenced on the Illumina sequencing platform (HiSeq X Ten) by the Novogene Corporation Inc. (CA, USA) with a paired-end 150 bp (PE 150) sequencing strategy.

CT26.WT scRNA-seq: A total of ~10,000 cells of each clone were selected for single-cell RNA-seq. Single-cell RNA-seq libraries were prepared using Chromium Single Cell 3' Reagent Kits v3 (10x Genomics). The libraries were then sequenced on the Illumina sequencing platform by the Novogene Corporation Inc. (CA, USA) with a PE 150 sequencing strategy in single-index mode.

T-CELL scRNA-seq: In this study, tumors were first collected from female mice 3 weeks after 4T1 tumor injection. Tissues were then dissociated into single cells and homogenized. T cells were isolated by flow sorting with a stringent gating threshold and sequenced on the 10x platform.

T-CELL filtering: We filtered out genes with non-zero counts in fewer than 2% of cells and removed cells with non-zero counts in fewer than 2% of genes. After filtering, 2,989 cells from five cell types and 7,893 genes were retained.
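
The two 2% thresholds above can be illustrated with a short Python sketch. It is not the authors' code: the function name `filter_counts`, the genes-by-cells orientation, and the order of the steps (genes first, then cells evaluated on the retained genes) are all assumptions made for illustration. Plain lists keep the example dependency-free; a sparse matrix would be the practical choice on real data.

```python
def filter_counts(counts, min_frac=0.02):
    """Filter a count matrix given as a list of rows (one row per gene,
    one column per cell). Orientation and step order are assumptions."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0

    # Step 1: keep genes with non-zero counts in at least min_frac of cells.
    keep_genes = [
        g for g in range(n_genes)
        if sum(1 for x in counts[g] if x != 0) >= min_frac * n_cells
    ]

    # Step 2 (assumed to run after step 1): keep cells with non-zero counts
    # in at least min_frac of the retained genes.
    keep_cells = [
        c for c in range(n_cells)
        if sum(1 for g in keep_genes if counts[g][c] != 0)
        >= min_frac * len(keep_genes)
    ]

    filtered = [[counts[g][c] for c in keep_cells] for g in keep_genes]
    return filtered, keep_genes, keep_cells


# Toy 3-gene x 4-cell matrix; a 50% threshold is used only so that the
# effect is visible on such a small example.
toy = [
    [1, 0, 2, 0],
    [0, 0, 0, 3],
    [4, 5, 0, 0],
]
toy_filtered, toy_genes, toy_cells = filter_counts(toy, min_frac=0.5)
```

With the toy matrix, the second gene and the fourth cell are dropped. If the filtering were instead applied to genes and cells simultaneously on the unfiltered matrix, the retained sets could differ slightly.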

T-CELL annotation: The benchmark clustering labels of the T-CELL population were generated as a combination of protein-marker-based flow sorting labels and bioinformatics labels from Seurat v2. For evaluation purposes, we selected 5 distinct cell types: Regulatory Trm cells, Classical CD4 Tem cells, CD8 Trm cells, CD8 Tcm cells, and Active EM-like Treg cells.

Usage notes

CT26.WT: The raw count matrix is in "matrix.mtx", and it can be read with the readMM() function in the R package Matrix. The feature names and cell barcodes are in "features.tsv" and "barcodes.tsv", respectively. They can be read with the read.table() function in the R package utils. The bulk RNA-seq dataset is in "CT26_bulk_30k.txt", and it can also be read with read.table().
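
For readers working outside R, the sketch below shows how a Matrix Market coordinate file such as "matrix.mtx" is laid out, using a minimal pure-Python reader. The function name `read_mtx_coordinate` is hypothetical, and the reader assumes the "general" coordinate layout with one numeric value per entry; it does not handle the symmetric or pattern variants. In practice, `scipy.io.mmread` is the usual way to load such a file in Python.

```python
def read_mtx_coordinate(path):
    """Parse a Matrix Market coordinate file (general layout assumed).

    Returns (n_rows, n_cols, entries), where entries maps a 0-based
    (row, col) pair to its stored value.
    """
    entries = {}
    shape = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip the banner and comment lines
            fields = line.split()
            if shape is None:
                # First data line: row count, column count, stored entries.
                shape = (int(fields[0]), int(fields[1]), int(fields[2]))
                continue
            row, col, value = int(fields[0]), int(fields[1]), float(fields[2])
            entries[(row - 1, col - 1)] = value  # file indices are 1-based
    if shape is None:
        raise ValueError("no size line found in " + path)
    n_rows, n_cols, n_stored = shape
    if len(entries) != n_stored:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries
```

Whether rows correspond to features and columns to barcodes is not stated here; comparing the returned dimensions with the line counts of "features.tsv" and "barcodes.tsv" is one way to check the orientation.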

T-CELL: The filtered count matrix is in "Tcell_5type_filtered.rds", and the cell annotation is in "Tcell_5type_filtered_labels.rds". Both can be opened with the R function readRDS().

Funding

Duke University