Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks

Priyadarshi, Shreyansh 1; Mazumder, Camellia 2; Neekhra, Bhavesh 1; Gupta, Debayan 1; Haldar, Shubhasis 2

Published Jan 09, 2025; Updated Oct 22, 2025 on Dryad. https://doi.org/10.5061/dryad.zw3r228jc

Data files

Jan 09, 2025 version files 1.77 GB

Oct 22, 2025 version files 1.77 GB

Abstract

Evidence before this study

We conducted an extensive literature search using Google Scholar without language restrictions, employing search terms such as “(Predicting OR Classifying OR Annotating) and (cancer hallmarks) AND (Deep OR Machine Learning) OR (Artificial Intelligence OR AI).” Despite notable advances in molecular oncology and computational methodologies, a critical gap remains: no existing machine learning or deep learning framework comprehensively predicts cancer hallmarks from tumor biopsy samples. Current research primarily targets specific molecular pathways associated with individual hallmarks, leaving clinicians without an integrated model to interpret hallmark activity at the level of an individual tumor. Moreover, the absence of wet-lab techniques capable of annotating all cancer hallmarks in biopsy samples has further impeded progress, limiting the clinical utility of hallmark-related insights for precision oncology.

Added value of this study

This study introduces OncoMark, a novel neural multi-task learning (N-MTL) framework designed to predict cancer hallmark activity from transcriptomic data obtained from biopsy samples. OncoMark addresses the lack of hallmark-specific data by generating synthetic biopsy datasets annotated with hallmark activity, meticulously modeled to reflect real-world tumor biology while maintaining clinical relevance. The framework employs a multi-task learning approach to capture interdependencies among hallmarks, advancing beyond isolated predictions to offer a holistic view of tumor biology. Validation on six independent datasets comprising 159 patient samples demonstrated its generalizability and reproducibility. Further external validation using eight datasets, encompassing over 11,679 cancer and 8,348 normal patient samples, reinforced its robustness. To promote clinical integration, a user-friendly web-based tool was developed, enabling seamless access for oncologists and researchers.

Implications of all the available evidence

The OncoMark framework represents a transformative advancement in cancer diagnostics and treatment planning. By enabling accurate and reproducible prediction of hallmark activity from biopsy samples, this model paves the way for precision oncology at scale. Its ability to systematically capture hallmark interdependencies provides deeper insights into tumor behavior, guiding the development of individualized, targeted therapies. The incorporation of a web-based interface ensures the accessibility of this innovation to clinicians worldwide, bridging the gap between computational oncology and clinical practice. Following further validation and integration into healthcare workflows, OncoMark has the potential to improve cancer outcomes by delivering timely, cost-effective, and precise tumor analyses, facilitating informed therapeutic decision-making with unparalleled precision.


Description of the data and file structure

Data Description: Experimental Efforts

This dataset is derived from single-cell transcriptomic data sourced from the Weizmann 3CA repository: 2.7 million single-cell transcriptomes from 14 tumor types, collected from 922 patients across 51 global studies. The primary objective of the experimental efforts was to generate synthetic datasets for training and validating computational models that identify and analyze cancer hallmarks at single-cell resolution.

Single-cell RNA sequencing (scRNA-seq) data underwent a rigorous quality control process to ensure reliability and biological relevance. This included exclusion criteria based on mitochondrial transcript content (>15%) and mRNA transcript counts (<200 or >6,000 transcripts). Gene sets corresponding to 10 established cancer hallmarks were curated from multiple databases and literature, focusing on genes with direct or indirect involvement in hallmark-related pathways.

Digital Scores for each hallmark were calculated using the Mann-Whitney U test, and hallmark classification was determined via tissue-specific Otsu’s thresholding. To simulate clinical biopsy conditions, hallmark-positive and hallmark-negative cells were aggregated into synthetic biopsy datasets comprising 200 cells per sample, stratified by hallmark status. These datasets mimic the heterogeneity and composition of real-world tumor biopsies while ensuring no overlap between synthetic samples.
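As a rough illustration of the tissue-specific thresholding step, the sketch below implements Otsu's method with NumPy and applies it separately to each tissue. The function names (`otsu_threshold`, `label_by_tissue`), the number of histogram bins, and the toy scores are assumptions made for illustration; the README does not describe how the authors implemented this step.

```python
import numpy as np

def otsu_threshold(scores, n_bins=256):
    """Return the cut that maximises between-class variance (Otsu's method)."""
    scores = np.asarray(scores, dtype=float)
    hist, edges = np.histogram(scores, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                 # cells in bins up to each candidate cut
    w1 = w0[-1] - w0                     # cells in bins above each candidate cut
    s0 = np.cumsum(hist * centers)
    # Class means on each side of every cut (guard against empty classes).
    m0 = s0 / np.maximum(w0, 1)
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (m0 - m1) ** 2
    best = int(np.argmax(between))
    return edges[best + 1]               # right edge of the last "negative" bin

def label_by_tissue(scores, tissues):
    """Label each cell hallmark-positive using a threshold fitted per tissue."""
    scores = np.asarray(scores, dtype=float)
    tissues = np.asarray(tissues)
    labels = np.zeros(scores.shape, dtype=bool)
    for tissue in np.unique(tissues):
        mask = tissues == tissue
        labels[mask] = scores[mask] >= otsu_threshold(scores[mask])
    return labels

# Toy digital scores for two hypothetical tissues with different baselines.
scores = np.array([0.10, 0.12, 0.80, 0.90, 0.40, 0.45, 0.95, 0.99])
tissues = np.array(["lung"] * 4 + ["breast"] * 4)
labels = label_by_tissue(scores, tissues)
```

Fitting the threshold per tissue means a score of 0.45 can be negative in one tissue while a lower score could still be positive in another, which is the motivation the text gives for tissue-specific cut-offs.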

The synthetic datasets were developed to facilitate robust model training and validation, enabling generalization to external datasets and bulk transcriptomic data. Validation was conducted on six external studies using pooled hallmark-positive cells to emulate bulk RNA sequencing conditions, ensuring consistency and clinical relevance. These datasets serve as a critical resource for advancing computational approaches in cancer hallmark identification and characterization.

Files and variables

File: `meta_hallmark_present.csv`

Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_present.csv file.

Variables:

Sample Name: Unique identifier for each sample.
Cancer: Type of cancer associated with the sample.
Hallmark: Specific hallmark represented in the sample.

File: `meta_hallmark_absent.csv`

Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_absent.csv file.

Variables:

Sample Name: Unique identifier for each sample.
Cancer: Type of cancer associated with the sample.
Hallmark: Specific hallmark absent in the sample.

File: `data_hallmark_present.csv`

Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is present.

Variables:

Gene Names: Features representing gene names.
Values: Corresponding raw counts for each gene.

File: `data_hallmark_absent.csv`

Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is absent.

Variables:

Gene Names: Features representing gene names.
Values: Corresponding raw counts for each gene.

File: `hallmark_curated_genes.csv`

Description:
This file contains manually curated gene sets for each of the 10 cancer hallmarks.

Variables:

Hallmark Names: Each column represents one hallmark.
Values: The values in each column are the curated genes for that hallmark.
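A minimal sketch of reading the curated gene sets into a dictionary is shown below. It assumes one column per hallmark with no separate index column, and that shorter columns are padded with empty cells; the helper name `load_gene_sets`, the toy hallmark names, and the toy genes are illustrative only and should be checked against the actual file.

```python
import io
import pandas as pd

# Toy stand-in for hallmark_curated_genes.csv (assumed layout, not the real content).
toy_csv = io.StringIO(
    "Hallmark A,Hallmark B\n"
    "TP53,VEGFA\n"
    "MYC,\n"
)

def load_gene_sets(path_or_buffer):
    """Return {hallmark name: list of curated genes}, skipping empty cells."""
    table = pd.read_csv(path_or_buffer)
    return {col: table[col].dropna().astype(str).tolist() for col in table.columns}

gene_sets = load_gene_sets(toy_csv)
```

With the real file, `load_gene_sets('hallmark_curated_genes.csv')` would be the equivalent call, provided the layout assumption holds.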

Code/software

Viewing and Analyzing the Data: Required Software and Workflow

Required Software

To view and analyze the dataset, free and open-source software such as Python is recommended. Specifically, the analysis utilizes the pandas library to read, manipulate, and explore the data.

Python Version: any Python 3 release supported by the installed pandas version (tested with Python 3.11)
Required Packages:
- pandas (version 1.3.0 or later)

Workflow and File Relationships

The provided dataset files (meta_hallmark_present.csv, meta_hallmark_absent.csv, data_hallmark_present.csv, and data_hallmark_absent.csv) can be analyzed using the following workflow. The relationships between the metadata and data files are as follows:

Metadata files (meta_hallmark_present.csv and meta_hallmark_absent.csv): Contain descriptive information about the samples in the corresponding data files.
Data files (data_hallmark_present.csv and data_hallmark_absent.csv): Contain synthetic bulk transcriptomics data, with raw gene counts.
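Because each metadata file describes the samples in its paired data file, it can be useful to confirm that the sample names line up before any analysis. The sketch below is one way to do that; the README does not say whether samples are stored as rows or columns, so the hypothetical helper `align_to_metadata` checks both, and the toy sample names and gene names are made up.

```python
import pandas as pd

def align_to_metadata(data, meta):
    """Return `data` with one row per sample, in the same order as `meta`."""
    if set(meta.index) <= set(data.index):
        samples_by_row = data
    elif set(meta.index) <= set(data.columns):
        samples_by_row = data.T          # samples were stored as columns
    else:
        raise ValueError("metadata sample names not found in the data file")
    return samples_by_row.loc[meta.index]

# Toy frames standing in for a data/metadata pair (hypothetical names and values).
data = pd.DataFrame({"S1&0": [5, 0], "S2&0": [1, 7]}, index=["GENE1", "GENE2"])
meta = pd.DataFrame(
    {"Cancer": ["Lung", "Breast"], "Hallmark": ["H1", "H1"]},
    index=["S2&0", "S1&0"],
)
aligned = align_to_metadata(data, meta)
```

After alignment, row `i` of `aligned` and row `i` of `meta` refer to the same sample, so metadata columns can be used directly as labels or grouping keys.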

Code to View and Process Data

The dataset can be loaded and processed using the following Python code:

import pandas as pd

# Load the data files
data_pos = pd.read_csv('data_hallmark_present.csv', index_col=0)  # Data where hallmark is present
data_neg = pd.read_csv('data_hallmark_absent.csv', index_col=0)   # Data where hallmark is absent

# Load the metadata files
meta_pos = pd.read_csv('meta_hallmark_present.csv', index_col=0)
meta_neg = pd.read_csv('meta_hallmark_absent.csv', index_col=0)

# Extract patient IDs from sample names in the metadata
meta_pos['sample'] = meta_pos.index.str.split('&').str[0]  # Extract patient ID for samples with hallmark present
meta_neg['sample'] = meta_neg.index.str.split('&').str[0]  # Extract patient ID for samples with hallmark absent

# Display loaded data
print("Data with hallmark present:")
print(data_pos.head())

print("\nMetadata for hallmark present:")
print(meta_pos.head())

print("\nData with hallmark absent:")
print(data_neg.head())

print("\nMetadata for hallmark absent:")
print(meta_neg.head())

Loaded Files and Processed Variables

- data_hallmark_present.csv and data_hallmark_absent.csv: Loaded into data_pos and data_neg, respectively, using the pandas.read_csv() function. The index_col=0 argument ensures the first column is used as the row index.
- meta_hallmark_present.csv and meta_hallmark_absent.csv: Loaded into meta_pos and meta_neg, respectively, with the same index_col=0 argument. A new column, sample, is created by extracting the patient ID from the sample name (assuming the ID is the part before the '&' character).
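One plausible use of the extracted patient ID, which the README does not state explicitly, is to keep all synthetic samples from the same patient on the same side of a train/test split. The sketch below shows this with the standard library only; the function name `split_by_patient`, the split fraction, and the toy sample names are assumptions.

```python
import random
from collections import defaultdict

def split_by_patient(sample_names, test_fraction=0.25, seed=0):
    """Split sample names so that no patient appears in both train and test."""
    by_patient = defaultdict(list)
    for name in sample_names:
        # Same rule as above: the patient ID is assumed to precede the '&'.
        by_patient[name.split("&")[0]].append(name)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test = [s for p in patients[:n_test] for s in by_patient[p]]
    train = [s for p in patients[n_test:] for s in by_patient[p]]
    return train, test

# Hypothetical sample names in the "<patient>&<suffix>" form.
names = ["P1&0", "P1&1", "P2&0", "P3&0", "P3&1", "P4&0"]
train, test = split_by_patient(names)
```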

Software and Workflow Overview

Software: Python and pandas
Files Relationship:
- meta_hallmark_present.csv ↔ data_hallmark_present.csv
- meta_hallmark_absent.csv ↔ data_hallmark_absent.csv
Steps:
1. Load the data and metadata files using pandas.read_csv.
2. Process the metadata to extract patient IDs.
3. Display or analyze the data using Python tools and libraries.

This workflow allows seamless viewing and processing of the dataset files. For detailed analysis, additional tools such as visualization libraries (matplotlib or seaborn) can also be used.
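As a small follow-on sketch, the two metadata tables can be stacked into a single label table with a binary column marking whether the hallmark is present. The `label` column name and the toy rows are assumptions; `Cancer` and `Hallmark` are the variable names listed for the metadata files.

```python
import pandas as pd

# Toy stand-ins for meta_pos and meta_neg (hypothetical sample names and values).
meta_pos = pd.DataFrame({"Cancer": ["Lung"], "Hallmark": ["H1"]}, index=["P1&0"])
meta_neg = pd.DataFrame({"Cancer": ["Lung"], "Hallmark": ["H1"]}, index=["P2&0"])

# 1 = hallmark present, 0 = hallmark absent.
labels = pd.concat([meta_pos.assign(label=1), meta_neg.assign(label=0)])

# Number of synthetic samples per hallmark and label, useful for checking balance.
counts = labels.groupby(["Hallmark", "label"]).size()
```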

Access information

Other publicly accessible locations of the data:

This is the only place where data can be found.

Data was derived from the following sources:

Weizmann Curated Cancer Cell Atlas (3CA)

Version changes

22-October-2025: Added file hallmark_curated_genes.csv.

Methods

Dataset Collection and Processing

We utilized a large-scale dataset comprising 2.7 million single-cell transcriptomes derived from 14 tumor types, collected from 922 patients across 51 independent studies conducted globally. This dataset was sourced from the Weizmann Institute's 3CA repository.

Quality Control

Before generating synthetic datasets for model training, the raw single-cell transcriptomic data underwent a rigorous quality control (QC) process. Cells with more than 15% mitochondrial transcript content, or with fewer than 200 or more than 6,000 expressed mRNA transcripts, were excluded to ensure data reliability.
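The QC cut-offs above can be sketched as a boolean filter over a cells-by-genes count matrix. This is a minimal illustration, not the authors' pipeline: the function name `qc_pass` is hypothetical, and "transcripts" is interpreted here as the total count per cell, which is an assumption (it could also mean the number of detected genes).

```python
import numpy as np

def qc_pass(counts, is_mito, max_mito_frac=0.15, min_tx=200, max_tx=6000):
    """Boolean mask of cells (rows) that pass the QC cut-offs quoted above."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)           # assumption: total transcripts per cell
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    return (mito_frac <= max_mito_frac) & (total >= min_tx) & (total <= max_tx)

# Toy matrix: 4 cells x 3 genes, with the last gene flagged as mitochondrial.
counts = np.array([
    [300, 200, 20],     # 520 transcripts, 3.8% mitochondrial: kept
    [100, 50, 10],      # 160 transcripts: too few
    [4000, 3000, 100],  # 7,100 transcripts: too many
    [300, 200, 200],    # 28.6% mitochondrial: too high
])
is_mito = np.array([False, False, True])
mask = qc_pass(counts, is_mito)
```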

Gene Set Curation

Gene sets representing cancer hallmarks were compiled from multiple databases, retaining only genes identified in at least two independent sources. This selection was refined through manual literature reviews to exclude genes without direct or indirect roles in hallmark-related pathways.
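The "at least two independent sources" rule can be illustrated with a simple count of how many sources list each gene. The source names and gene lists below are placeholders, and the helper `genes_in_multiple_sources` is hypothetical; the subsequent manual literature review is not something code can reproduce.

```python
from collections import Counter

def genes_in_multiple_sources(sources, min_sources=2):
    """Keep genes that appear in at least `min_sources` of the given sources."""
    # set() ensures a gene listed twice in one source is counted only once.
    seen = Counter(gene for genes in sources.values() for gene in set(genes))
    return sorted(gene for gene, n in seen.items() if n >= min_sources)

# Placeholder sources for a single hallmark.
sources = {
    "database_a": ["TP53", "MYC", "KRAS"],
    "database_b": ["TP53", "KRAS", "KRAS"],
    "database_c": ["MYC", "EGFR"],
}
kept = genes_in_multiple_sources(sources)
```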

Digital Scoring

Using the curated gene sets, Digital Scores were calculated for each of the 10 cancer hallmarks across all cells using the Mann-Whitney U test. To ensure robust binary classification, hallmark presence or absence was determined through Otsu’s thresholding method. Tissue-specific digital score thresholds were calculated to account for variations in hallmark expression across different tumor tissues.
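The text does not give the exact formula for the Digital Score, so the following is only one plausible reading: within a single cell, compare the expression of the hallmark's genes against all other genes with the Mann-Whitney U statistic, and scale it by the number of gene pairs so the score lies between 0 and 1. The function name `digital_score` and the toy values are assumptions.

```python
import numpy as np

def digital_score(expression, in_gene_set):
    """Mann-Whitney U of gene-set vs. other genes in one cell, scaled to [0, 1]."""
    expression = np.asarray(expression, dtype=float)
    in_gene_set = np.asarray(in_gene_set, dtype=bool)
    inside, outside = expression[in_gene_set], expression[~in_gene_set]
    # U counts pairs where the gene-set value is larger; ties count as half.
    diff = inside[:, None] - outside[None, :]
    u = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return u / (inside.size * outside.size)

# One toy cell with six genes; the first two belong to the hallmark gene set.
expression = [9, 7, 0, 1, 7, 0]
in_gene_set = [True, True, False, False, False, False]
score = digital_score(expression, in_gene_set)
```

A score near 1 means the hallmark genes rank above most other genes in that cell, and a score near 0.5 means no enrichment. If a p-value is needed rather than a scaled statistic, `scipy.stats.mannwhitneyu` provides the corresponding test.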

Synthetic Data Generation

To simulate clinical biopsy conditions while preserving biological fidelity, synthetic biopsy datasets were created by aggregating 200 hallmark-specific cells from each patient sample. Cells were grouped by hallmark status (positive or negative) to generate distinct hallmark-specific synthetic samples, ensuring no overlap across samples and minimizing cross-sample contamination. Synthetic datasets with positive and negative ground truths were created separately for each hallmark, facilitating robust model training and mimicking the heterogeneous composition of real-world clinical samples.
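The aggregation step can be sketched as follows: shuffle the cells of one patient and hallmark status, cut them into disjoint groups of 200, and combine each group into one synthetic bulk profile. Summing raw counts and dropping leftover cells are assumptions made for this illustration, as is the function name `make_pseudobulk`.

```python
import numpy as np

def make_pseudobulk(counts, cells_per_sample=200, seed=0):
    """Sum disjoint groups of cells (rows) into synthetic bulk profiles."""
    counts = np.asarray(counts)
    order = np.random.default_rng(seed).permutation(counts.shape[0])
    n_samples = counts.shape[0] // cells_per_sample   # leftover cells are dropped
    groups = order[: n_samples * cells_per_sample].reshape(n_samples, cells_per_sample)
    # counts[groups] has shape (n_samples, cells_per_sample, n_genes).
    return counts[groups].sum(axis=1)

# Toy input: 450 hallmark-positive cells x 3 genes, one count per gene per cell.
cells = np.ones((450, 3), dtype=int)
bulk = make_pseudobulk(cells)
```

Because each cell index appears in at most one group, no cell contributes to more than one synthetic sample, which matches the no-overlap requirement described above.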

Validation

For validation purposes, six external studies were processed using the same synthetic data creation methods applied to the training data. In these datasets, all hallmark-positive cells for each patient were pooled to generate synthetic datasets resembling bulk RNA sequencing data. This approach ensured consistency in data processing while allowing the model to generalize effectively to clinically relevant bulk transcriptomic datasets.
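For the validation sets, pooling all hallmark-positive cells per patient amounts to a grouped sum. The sketch below shows this with pandas on a toy table; the cell, patient, and gene names are placeholders, and summing counts (rather than averaging) is an assumption.

```python
import pandas as pd

# Toy cells x genes table with per-cell patient and hallmark annotations.
cells = pd.DataFrame(
    {"GENE1": [1, 2, 3, 4], "GENE2": [0, 1, 0, 5]},
    index=["c1", "c2", "c3", "c4"],
)
patient = pd.Series(["P1", "P1", "P2", "P2"], index=cells.index)
positive = pd.Series([True, True, False, True], index=cells.index)

# Keep only hallmark-positive cells, then sum counts within each patient.
pooled = cells[positive].groupby(patient[positive]).sum()
```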
