Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks
Data files
Jan 09, 2025 version files (1.77 GB)
- data_hallmark_absent.csv: 977.81 MB
- data_hallmark_present.csv: 787.48 MB
- meta_hallmark_absent.csv: 1.33 MB
- meta_hallmark_present.csv: 972.08 KB
- README.md: 7.06 KB
Oct 22, 2025 version files (1.77 GB)
- data_hallmark_absent.csv: 977.81 MB
- data_hallmark_present.csv: 787.48 MB
- hallmark_curated_genes.csv: 6.64 KB
- meta_hallmark_absent.csv: 1.33 MB
- meta_hallmark_present.csv: 972.08 KB
- README.md: 7.44 KB
Abstract
https://doi.org/10.5061/dryad.zw3r228jc
Description of the data and file structure
Data Description: Experimental Efforts
This dataset is derived from single-cell transcriptomic data in the Weizmann 3CA repository, encompassing 2.7 million single-cell transcriptomes from 14 tumor types, collected from 922 patients across 51 global studies. The primary objective of the experimental efforts was to generate synthetic datasets for training and validating computational models that identify and analyze cancer hallmarks at single-cell resolution.
Single-cell RNA sequencing (scRNA-seq) data underwent a rigorous quality control process to ensure reliability and biological relevance. This included exclusion criteria based on mitochondrial transcript content (>15%) and mRNA transcript counts (<200 or >6,000 transcripts). Gene sets corresponding to 10 established cancer hallmarks were curated from multiple databases and literature, focusing on genes with direct or indirect involvement in hallmark-related pathways.
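The exclusion criteria above can be illustrated with a small filter. This is a minimal sketch, not the pipeline used to build the dataset: the function name `passes_qc`, the field names, and the treatment of the boundary values (exactly 15%, 200, or 6,000 are kept) are assumptions made for illustration.

```python
def passes_qc(n_transcripts: int, pct_mito: float) -> bool:
    """Return True if a cell survives the exclusion criteria quoted above."""
    # Exclude cells with more than 15% mitochondrial transcript content.
    if pct_mito > 15.0:
        return False
    # Exclude cells with fewer than 200 or more than 6,000 transcripts.
    if n_transcripts < 200 or n_transcripts > 6000:
        return False
    return True


# Hypothetical per-cell summaries; the field names are illustrative only.
cells = [
    {"cell": "c1", "n_transcripts": 1500, "pct_mito": 4.2},
    {"cell": "c2", "n_transcripts": 150, "pct_mito": 3.0},
    {"cell": "c3", "n_transcripts": 3000, "pct_mito": 22.5},
    {"cell": "c4", "n_transcripts": 7200, "pct_mito": 5.0},
]

kept = [c["cell"] for c in cells if passes_qc(c["n_transcripts"], c["pct_mito"])]
print(kept)
```

Only the first toy cell passes: the others fail on low transcript count, high mitochondrial content, and high transcript count, respectively.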
Digital Scores for each hallmark were calculated using the Mann-Whitney U test, and hallmark classification was determined via tissue-specific Otsu’s thresholding. To simulate clinical biopsy conditions, hallmark-positive and hallmark-negative cells were aggregated into synthetic biopsy datasets comprising 200 cells per sample, stratified by hallmark status. These datasets mimic the heterogeneity and composition of real-world tumor biopsies while ensuring no overlap between synthetic samples.
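The aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the source specifies 200 cells per synthetic sample, stratification by hallmark status, and no overlap between samples, but summing raw counts, grouping cells at random, and discarding leftover cells are assumptions. The function name `make_pseudobulk` and its parameters are hypothetical.

```python
import numpy as np


def make_pseudobulk(counts, is_positive, cells_per_sample=200, seed=0):
    """Build non-overlapping synthetic samples, stratified by hallmark status.

    counts      : (n_cells, n_genes) array of raw counts
    is_positive : boolean array of length n_cells (hallmark call per cell)
    Returns a dict with one (n_samples, n_genes) array per stratum.
    """
    counts = np.asarray(counts)
    is_positive = np.asarray(is_positive, dtype=bool)
    rng = np.random.default_rng(seed)
    samples = {}
    for label, mask in (("present", is_positive), ("absent", ~is_positive)):
        # Shuffle the cells of this stratum, then cut them into disjoint groups
        # so that no cell contributes to more than one synthetic sample.
        idx = rng.permutation(np.flatnonzero(mask))
        n_samples = len(idx) // cells_per_sample  # leftover cells are dropped
        groups = idx[: n_samples * cells_per_sample].reshape(
            n_samples, cells_per_sample
        )
        # Assumption: a synthetic sample is the sum of its cells' raw counts.
        samples[label] = counts[groups].sum(axis=1)
    return samples


# Toy input: 1,000 cells x 5 genes, every count equal to 1.
toy_counts = np.ones((1000, 5), dtype=int)
toy_status = np.arange(1000) < 450  # 450 hallmark-positive cells
bulk = make_pseudobulk(toy_counts, toy_status)
print(bulk["present"].shape, bulk["absent"].shape)
```

With 450 positive and 550 negative toy cells, each stratum yields two synthetic samples of 200 cells, and the remaining cells are left unused.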
The synthetic datasets were developed to facilitate robust model training and validation, enabling generalization to external datasets and bulk transcriptomic data. Validation was conducted on six external studies using pooled hallmark-positive cells to emulate bulk RNA sequencing conditions, ensuring consistency and clinical relevance. These datasets serve as a critical resource for advancing computational approaches in cancer hallmark identification and characterization.
Files and variables
File: meta_hallmark_present.csv
Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_present.csv file.
Variables:
- Sample Name: Unique identifier for each sample.
- Cancer: Type of cancer associated with the sample.
- Hallmark: Specific hallmark represented in the sample.
File: meta_hallmark_absent.csv
Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_absent.csv file.
Variables:
- Sample Name: Unique identifier for each sample.
- Cancer: Type of cancer associated with the sample.
- Hallmark: Specific hallmark absent in the sample.
File: data_hallmark_present.csv
Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is present.
Variables:
- Gene Names: Features representing gene names.
- Values: Corresponding raw counts for each gene.
File: data_hallmark_absent.csv
Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is absent.
Variables:
- Gene Names: Features representing gene names.
- Values: Corresponding raw counts for each gene.
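Because each data file is several hundred megabytes, it can help to inspect a few rows before loading everything. The sketch below assumes that samples are stored as rows and genes as columns, which the variable descriptions suggest but do not state explicitly. The helper names and the toy CSV are hypothetical stand-ins for the real files.

```python
import io

import pandas as pd


def preview_counts(source, n_rows=5):
    """Read only the first n_rows records instead of the whole file."""
    return pd.read_csv(source, index_col=0, nrows=n_rows)


def looks_like_raw_counts(frame):
    """Check that every column holds non-negative integers."""
    all_integer = all(pd.api.types.is_integer_dtype(t) for t in frame.dtypes)
    return bool(all_integer and (frame.min().min() >= 0))


# Toy stand-in for data_hallmark_present.csv (layout is an assumption).
toy_csv = io.StringIO(
    ",TP53,MYC,EGFR\n"
    "sample_a,12,0,7\n"
    "sample_b,3,45,0\n"
    "sample_c,0,9,21\n"
)

head = preview_counts(toy_csv, n_rows=2)
print(head)
print(looks_like_raw_counts(head))
```

For the real files, the same call would take the file name in place of `toy_csv`, for example `preview_counts('data_hallmark_present.csv')`.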
File: hallmark_curated_genes.csv
Description:
This file contains manually curated gene sets for each of the 10 cancer hallmarks.
Variables:
- Hallmark Names: Each column corresponds to one hallmark.
- Values: The entries in each column are the curated genes for that hallmark.
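One way to turn this column-per-hallmark layout into gene lists is sketched below. It assumes the file has a header row of hallmark names, no separate index column, and blank cells where a hallmark has fewer genes than the longest column. The function name `load_gene_sets` and the toy column names are hypothetical.

```python
import io

import pandas as pd


def load_gene_sets(source):
    """Return {hallmark name: list of curated genes} from a column-wise table."""
    table = pd.read_csv(source)
    return {
        # Shorter columns are padded with missing values, so drop those.
        hallmark: table[hallmark].dropna().astype(str).str.strip().tolist()
        for hallmark in table.columns
    }


# Toy stand-in for hallmark_curated_genes.csv; names are placeholders.
toy_csv = io.StringIO(
    "Hallmark A,Hallmark B\n"
    "TP53,VEGFA\n"
    "MYC,\n"
)

gene_sets = load_gene_sets(toy_csv)
for name, genes in gene_sets.items():
    print(name, len(genes), genes)
```

If the real file turns out to carry an index column, passing `index_col=0` to `pd.read_csv` inside the helper would be the corresponding adjustment.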
Code/software
Viewing and Analyzing the Data: Required Software and Workflow
Required Software
To view and analyze the dataset, free and open-source software such as Python is recommended. Specifically, the analysis utilizes the pandas library to read, manipulate, and explore the data.
- Python Version: any (tested with Python 3.11)
- Required Packages: pandas (version 1.3.0 or later)
Workflow and File Relationships
The provided dataset files (meta_hallmark_present.csv, meta_hallmark_absent.csv, data_hallmark_present.csv, and data_hallmark_absent.csv) can be analyzed using the following workflow. The relationships between the metadata and data files are as follows:
- Metadata files (meta_hallmark_present.csv and meta_hallmark_absent.csv): Contain descriptive information about the samples in the corresponding data files.
- Data files (data_hallmark_present.csv and data_hallmark_absent.csv): Contain synthetic bulk transcriptomics data, with raw gene counts.
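The pairing between a metadata file and its data file can be checked before analysis. The sketch below assumes that both tables are indexed by sample name after loading with index_col=0 and that samples are rows in the data files. The helper `check_alignment` and the toy tables are hypothetical.

```python
import pandas as pd


def check_alignment(data, meta):
    """Summarise how the sample names of a data table and its metadata overlap."""
    data_ids, meta_ids = set(data.index), set(meta.index)
    return {
        "shared": len(data_ids & meta_ids),
        "only_in_data": sorted(data_ids - meta_ids),
        "only_in_meta": sorted(meta_ids - data_ids),
        "same_order": list(data.index) == list(meta.index),
    }


# Toy tables standing in for a data/metadata pair; all values are made up.
data = pd.DataFrame({"TP53": [1, 2], "MYC": [3, 4]}, index=["s1", "s2"])
meta = pd.DataFrame(
    {"Cancer": ["X", "Y"], "Hallmark": ["H1", "H2"]}, index=["s2", "s1"]
)

report = check_alignment(data, meta)
print(report)

# Reorder the metadata so that its rows follow the data table.
meta_aligned = meta.loc[data.index]
```

When the report shows the same samples in a different order, reindexing the metadata with `meta.loc[data.index]` brings the two tables into row-by-row correspondence.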
Code to View and Process Data
The dataset can be loaded and processed using the following Python code:
import pandas as pd
# Load the data files
data_pos = pd.read_csv('data_hallmark_present.csv', index_col=0) # Data where hallmark is present
data_neg = pd.read_csv('data_hallmark_absent.csv', index_col=0) # Data where hallmark is absent
# Load the metadata files
meta_pos = pd.read_csv('meta_hallmark_present.csv', index_col=0)
meta_neg = pd.read_csv('meta_hallmark_absent.csv', index_col=0)
# Extract patient IDs from sample names in the metadata
meta_pos['sample'] = meta_pos.index.str.split('&').str[0] # Extract patient ID for samples with hallmark present
meta_neg['sample'] = meta_neg.index.str.split('&').str[0] # Extract patient ID for samples with hallmark absent
# Display loaded data
print("Data with hallmark present:")
print(data_pos.head())
print("\nMetadata for hallmark present:")
print(meta_pos.head())
print("\nData with hallmark absent:")
print(data_neg.head())
print("\nMetadata for hallmark absent:")
print(meta_neg.head())
Loaded Files and Processed Variables
- data_hallmark_present.csv and data_hallmark_absent.csv: These files are loaded into data_pos and data_neg, respectively, using the pandas.read_csv() function. The index_col=0 argument ensures the first column is used as the row index.
- meta_hallmark_present.csv and meta_hallmark_absent.csv: These files are loaded into meta_pos and meta_neg, respectively. The index_col=0 argument is similarly applied. A new column, sample, is created by extracting the patient ID from the sample name (assuming the ID precedes the '&' character).
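The patient ID extraction can be tried on a toy metadata table. The sample names used here (such as "P01&0") are invented to illustrate the '&' convention and may not match the naming scheme of the real files; the Cancer and Hallmark values are placeholders as well.

```python
import pandas as pd

# Toy metadata indexed by sample name; every value is a made-up placeholder.
meta = pd.DataFrame(
    {"Cancer": ["X", "X", "Y"], "Hallmark": ["H1", "H2", "H1"]},
    index=pd.Index(["P01&0", "P01&1", "P02&0"], name="Sample Name"),
)

# Same operation as in the loading code: keep the text before the first '&'.
meta["sample"] = meta.index.str.split("&").str[0]

# Count how many synthetic samples were derived from each patient.
samples_per_patient = meta.groupby("sample").size()
print(samples_per_patient)
```

Grouping by the extracted ID in this way may be useful when samples from the same patient should stay together, for instance on one side of a train/test split.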
Software and Workflow Overview
- Software: Python and pandas
- Files Relationship:
  - meta_hallmark_present.csv ↔ data_hallmark_present.csv
  - meta_hallmark_absent.csv ↔ data_hallmark_absent.csv
- Steps:
  - Load the data and metadata files using pandas.read_csv.
  - Process the metadata to extract patient IDs.
  - Display or analyze the data using Python tools and libraries.
This workflow allows seamless viewing and processing of the dataset files. For detailed analysis, additional tools such as visualization libraries (matplotlib or seaborn) can also be used.
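As one possible next step, the hallmark-present and hallmark-absent tables can be stacked into a single labeled table for model training. This sketch assumes samples are rows and genes are columns; the helper `combine_with_labels`, the label name `hallmark_present`, and the toy tables are hypothetical. Since each sample refers to one specific hallmark, the Hallmark column of the metadata would normally be used to subset both tables to a single hallmark first.

```python
import pandas as pd


def combine_with_labels(data_pos, data_neg):
    """Stack both tables and return (features, labels) with 1 = hallmark present."""
    # join="inner" keeps only the genes found in both tables.
    combined = pd.concat([data_pos, data_neg], axis=0, join="inner")
    labels = pd.Series(
        [1] * len(data_pos) + [0] * len(data_neg),
        index=combined.index,
        name="hallmark_present",
    )
    return combined, labels


# Toy stand-ins for data_pos and data_neg; all values are made up.
toy_pos = pd.DataFrame({"TP53": [5, 8], "MYC": [1, 0]}, index=["p1", "p2"])
toy_neg = pd.DataFrame({"MYC": [7], "TP53": [2], "EGFR": [4]}, index=["n1"])

features, labels = combine_with_labels(toy_pos, toy_neg)
print(features)
print(labels)
```

The resulting table can then be passed to visualization or modeling libraries together with the label vector.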
Access information
Other publicly accessible locations of the data:
- This is the only place where data can be found.
Data was derived from the following sources:
Version changes
22-October-2025: Added file hallmark_curated_genes.csv.
