Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks
Data files
Jan 09, 2025 version files 1.77 GB
- data_hallmark_absent.csv (977.81 MB)
- data_hallmark_present.csv (787.48 MB)
- meta_hallmark_absent.csv (1.33 MB)
- meta_hallmark_present.csv (972.08 KB)
- README.md (7.06 KB)
Abstract
README: Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks
https://doi.org/10.5061/dryad.zw3r228jc
Description of the data and file structure
Data Description: Experimental Efforts
This dataset is derived from single-cell transcriptomic data from the Weizmann 3CA repository, encompassing 2.7 million single-cell transcriptomes from 14 tumor types, collected from 922 patients across 51 global studies. The primary objective of the experimental efforts was to generate synthetic datasets for training and validating computational models that identify and analyze cancer hallmarks at single-cell resolution.
Single-cell RNA sequencing (scRNA-seq) data underwent a rigorous quality control process to ensure reliability and biological relevance. Cells were excluded if their mitochondrial transcript content exceeded 15% or if their mRNA transcript counts fell below 200 or above 6,000. Gene sets corresponding to 10 established cancer hallmarks were curated from multiple databases and the literature, focusing on genes with direct or indirect involvement in hallmark-related pathways.
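The exclusion criteria above can be written as a per-cell filter. The following is a minimal sketch, not the authors' pipeline. It assumes a cells x genes count table, that mitochondrial genes can be recognized by an "MT-" name prefix, and that "transcript counts" means total counts per cell. The function name qc_filter and the toy values are hypothetical.

```python
import pandas as pd

def qc_filter(counts: pd.DataFrame, max_mito_frac: float = 0.15,
              min_counts: int = 200, max_counts: int = 6000) -> pd.DataFrame:
    """Keep cells (rows) that pass the mitochondrial and transcript-count cutoffs."""
    total = counts.sum(axis=1)
    # Assumption: mitochondrial genes are the columns whose name starts with "MT-".
    mito_cols = [g for g in counts.columns if g.upper().startswith("MT-")]
    # Cells with zero total counts get NaN here and are dropped by the comparison below.
    mito_frac = counts[mito_cols].sum(axis=1) / total.where(total > 0)
    keep = (mito_frac <= max_mito_frac) & (total >= min_counts) & (total <= max_counts)
    return counts.loc[keep]

# Toy cells x genes table with made-up counts.
toy = pd.DataFrame(
    {"MT-CO1": [10, 400, 5, 100],
     "TP53": [500, 600, 50, 3000],
     "MYC": [490, 1000, 45, 4000]},
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)
kept = qc_filter(toy)
print(kept.index.tolist())
```

In this toy table only cell_a passes: cell_b has 20% mitochondrial content, cell_c has 100 total counts, and cell_d has 7,100.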
Digital Scores for each hallmark were calculated using the Mann-Whitney U test, and hallmark classification was determined via tissue-specific Otsu’s thresholding. To simulate clinical biopsy conditions, hallmark-positive and hallmark-negative cells were aggregated into synthetic biopsy datasets comprising 200 cells per sample, stratified by hallmark status. These datasets mimic the heterogeneity and composition of real-world tumor biopsies while ensuring no overlap between synthetic samples.
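The thresholding and aggregation steps can be illustrated with a short sketch. It is not the authors' code: the per-cell scores are simulated rather than computed with the Mann-Whitney U test, the histogram-based Otsu implementation is a generic one, and summing counts over disjoint groups of 200 cells is an assumption about how cells were aggregated. The source applies thresholds per tissue, so in practice the function would be called once per tissue. The names otsu_threshold and make_pseudobulk are hypothetical.

```python
import numpy as np

def otsu_threshold(scores, bins=256):
    """Return the histogram cut that maximizes between-class variance."""
    hist, edges = np.histogram(scores, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)            # values in bins up to and including k
    w1 = w0[-1] - w0                # values in bins above k
    csum = np.cumsum(hist * centers)
    valid = (w0 > 0) & (w1 > 0)     # both classes must be non-empty
    m0 = csum[valid] / w0[valid]
    m1 = (csum[-1] - csum[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (m0 - m1) ** 2
    # Upper edge of the best bin: values >= this edge form the upper class.
    return float(edges[1:][valid][np.argmax(between)])

def make_pseudobulk(counts, cells_per_sample=200, seed=0):
    """Sum disjoint random groups of cells (rows) into synthetic bulk profiles."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(counts.shape[0])
    n_samples = counts.shape[0] // cells_per_sample   # leftover cells are dropped
    groups = order[: n_samples * cells_per_sample].reshape(n_samples, cells_per_sample)
    return counts[groups].sum(axis=1)                 # shape: samples x genes

# Simulated scores for 1,000 cells: 600 low-scoring and 400 high-scoring.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.2, 0.03, 600), rng.normal(0.8, 0.03, 400)])
counts = rng.poisson(2.0, size=(scores.size, 5))      # made-up cells x genes counts

threshold = otsu_threshold(scores)
positive = scores >= threshold
bulk_present = make_pseudobulk(counts[positive])      # hallmark-positive samples
bulk_absent = make_pseudobulk(counts[~positive])      # hallmark-negative samples
print(threshold, bulk_present.shape, bulk_absent.shape)
```

Because each cell index appears in at most one group, no cell contributes to more than one synthetic sample, which mirrors the no-overlap property described above.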
The synthetic datasets were developed to facilitate robust model training and validation, enabling generalization to external datasets and bulk transcriptomic data. Validation was conducted on six external studies using pooled hallmark-positive cells to emulate bulk RNA sequencing conditions, ensuring consistency and clinical relevance. These datasets serve as a critical resource for advancing computational approaches in cancer hallmark identification and characterization.
Files and variables
File: meta_hallmark_present.csv
Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_present.csv file.
Variables:
- Sample Name: Unique identifier for each sample.
- Cancer: Type of cancer associated with the sample.
- Hallmark: Specific hallmark represented in the sample.
File: meta_hallmark_absent.csv
Description:
This file contains metadata for all the samples present in the corresponding data_hallmark_absent.csv file.
Variables:
- Sample Name: Unique identifier for each sample.
- Cancer: Type of cancer associated with the sample.
- Hallmark: Specific hallmark absent in the sample.
File: data_hallmark_present.csv
Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is present.
Variables:
- Gene Names: Features representing gene names.
- Values: Corresponding raw counts for each gene.
File: data_hallmark_absent.csv
Description:
This file contains synthetic bulk transcriptomics data where the specified hallmark is absent.
Variables:
- Gene Names: Features representing gene names.
- Values: Corresponding raw counts for each gene.
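Because the values are raw counts, samples with different total counts are not directly comparable. The sketch below shows one common option, counts per million followed by a log2 transform. It is not part of the described workflow, and it assumes samples are rows and genes are columns, which the file descriptions do not state. The function name log_cpm and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def log_cpm(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (row) to counts per million, then apply log2(x + 1)."""
    lib_size = counts.sum(axis=1)
    # Samples with zero total counts become NaN instead of dividing by zero.
    cpm = counts.div(lib_size.where(lib_size > 0), axis=0) * 1e6
    return np.log2(cpm + 1)

# Toy samples x genes table with made-up counts.
toy = pd.DataFrame(
    {"GENE_A": [10, 0], "GENE_B": [30, 50], "GENE_C": [60, 50]},
    index=["sample_1", "sample_2"],
)
norm = log_cpm(toy)
print(norm.round(2))
```

If the loaded table turns out to have genes as rows, transpose it with .T before calling the function.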
Code/software
Viewing and Analyzing the Data: Required Software and Workflow
Required Software
To view and analyze the dataset, free and open-source software such as Python is recommended. Specifically, the analysis utilizes the pandas library to read, manipulate, and explore the data.
- Python Version: any (tested with Python 3.11)
- Required Packages: pandas (version 1.3.0 or later)
Workflow and File Relationships
The provided dataset files (meta_hallmark_present.csv, meta_hallmark_absent.csv, data_hallmark_present.csv, and data_hallmark_absent.csv) can be analyzed using the following workflow. The relationships between the metadata and data files are as follows:
- Metadata files (meta_hallmark_present.csv and meta_hallmark_absent.csv): Contain descriptive information about the samples in the corresponding data files.
- Data files (data_hallmark_present.csv and data_hallmark_absent.csv): Contain synthetic bulk transcriptomics data, with raw gene counts.
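Since each metadata file describes the samples of its data file, it can help to confirm that the sample names match before analysis. The sketch below is one way to do this under stated assumptions: the file descriptions do not say whether samples are stored as rows or columns, so the hypothetical helper align accepts either layout. The sample and gene names are made up.

```python
import pandas as pd

def align(data: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Return data with samples as rows, ordered like the metadata index."""
    if meta.index.isin(data.index).all():
        samples_by_genes = data
    elif meta.index.isin(data.columns).all():
        samples_by_genes = data.T      # samples were stored as columns
    else:
        raise ValueError("metadata sample names not found in the data table")
    return samples_by_genes.loc[meta.index]

# Toy example in which genes are rows and samples are columns.
data = pd.DataFrame(
    {"s1": [5, 0, 2], "s2": [1, 7, 3]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
meta = pd.DataFrame({"Cancer": ["Lung", "Breast"]}, index=["s2", "s1"])

aligned = align(data, meta)
print(aligned)
```

After alignment, row i of the data table and row i of the metadata table refer to the same sample.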
Code to View and Process Data
The dataset can be loaded and processed using the following Python code:
import pandas as pd
# Load the data files
data_pos = pd.read_csv('data_hallmark_present.csv', index_col=0) # Data where hallmark is present
data_neg = pd.read_csv('data_hallmark_absent.csv', index_col=0) # Data where hallmark is absent
# Load the metadata files
meta_pos = pd.read_csv('meta_hallmark_present.csv', index_col=0)
meta_neg = pd.read_csv('meta_hallmark_absent.csv', index_col=0)
# Extract patient IDs from sample names in the metadata
meta_pos['sample'] = meta_pos.index.str.split('&').str[0] # Extract patient ID for samples with hallmark present
meta_neg['sample'] = meta_neg.index.str.split('&').str[0] # Extract patient ID for samples with hallmark absent
# Display loaded data
print("Data with hallmark present:")
print(data_pos.head())
print("\nMetadata for hallmark present:")
print(meta_pos.head())
print("\nData with hallmark absent:")
print(data_neg.head())
print("\nMetadata for hallmark absent:")
print(meta_neg.head())
Loaded Files and Processed Variables
- data_hallmark_present.csv and data_hallmark_absent.csv: These files are loaded into data_pos and data_neg, respectively, using the pandas.read_csv() function. The index_col=0 argument ensures the first column is used as the row index.
- meta_hallmark_present.csv and meta_hallmark_absent.csv: These files are loaded into meta_pos and meta_neg, respectively. The index_col=0 argument is similarly applied. A new column, sample, is created by extracting the patient ID from the sample name (assuming the ID is before the '&' character).
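To make the '&' split concrete, the following sketch applies the same expression to a small made-up metadata table. The sample names, cancer types, and hallmark labels are hypothetical and only illustrate the mechanics; the real naming scheme may differ.

```python
import pandas as pd

# Hypothetical sample names of the form "<patient>&<suffix>".
meta = pd.DataFrame(
    {"Cancer": ["Breast", "Breast", "Lung"],
     "Hallmark": ["H1", "H1", "H2"]},
    index=["patientA&1", "patientA&2", "patientB&1"],
)

# Split each index entry on '&' and keep the part before it.
meta["sample"] = meta.index.str.split("&").str[0]

# Count how many synthetic samples trace back to each patient ID.
per_patient = meta.groupby("sample").size()
print(meta)
print(per_patient)
```

A name that contains no '&' is returned unchanged by this expression, so unexpected naming shows up as full sample names in the new column.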
Software and Workflow Overview
- Software: Python and pandas
- Files Relationship:
  - meta_hallmark_present.csv ↔ data_hallmark_present.csv
  - meta_hallmark_absent.csv ↔ data_hallmark_absent.csv
- Steps:
  - Load the data and metadata files using pandas.read_csv.
  - Process the metadata to extract patient IDs.
  - Display or analyze the data using Python tools and libraries.
This workflow allows seamless viewing and processing of the dataset files. For detailed analysis, additional tools such as visualization libraries (matplotlib or seaborn) can also be used.
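As a first analysis step, the two metadata tables can be combined to see how many samples exist per cancer type and hallmark. The sketch below uses small made-up tables in place of the real files. It assumes the metadata columns are named exactly Cancer and Hallmark, as listed in the variable descriptions; the status column and all values are hypothetical.

```python
import pandas as pd

# Stand-ins for meta_pos and meta_neg with made-up rows.
meta_pos = pd.DataFrame(
    {"Cancer": ["Breast", "Breast", "Lung"],
     "Hallmark": ["Angiogenesis"] * 3},
    index=["p1", "p2", "p3"],
)
meta_neg = pd.DataFrame(
    {"Cancer": ["Breast", "Lung", "Lung"],
     "Hallmark": ["Angiogenesis"] * 3},
    index=["n1", "n2", "n3"],
)

# Tag each table with its hallmark status, then stack them.
meta_all = pd.concat([
    meta_pos.assign(status="present"),
    meta_neg.assign(status="absent"),
])

# Count samples per cancer type and hallmark, one column per status.
summary = (
    meta_all.groupby(["Cancer", "Hallmark", "status"])
    .size()
    .unstack("status", fill_value=0)
)
print(summary)
```

The resulting table can be passed to a plotting library, for example as a grouped bar chart, to check class balance before model training.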
Access information
Other publicly accessible locations of the data:
- This is the only place where data can be found.
Data was derived from the following sources: