Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models
Data files
Nov 03, 2023 version files 196.99 GB
-
README.md
781 B
-
TCGA-CESC0.h5
3.94 GB
-
TCGA-CESC1.h5
3.94 GB
-
TCGA-CESC2.h5
3.94 GB
-
TCGA-CESC3.h5
3.94 GB
-
TCGA-CESC4.h5
3.94 GB
-
TCGA-CESC5.h5
3.94 GB
-
TCGA-CESC6.h5
3.94 GB
-
TCGA-CESC7.h5
3.94 GB
-
TCGA-CESC8.h5
3.94 GB
-
TCGA-CESC9.h5
3.94 GB
-
TCGA-COAD0.h5
3.94 GB
-
TCGA-COAD1.h5
3.94 GB
-
TCGA-COAD2.h5
3.94 GB
-
TCGA-COAD3.h5
3.94 GB
-
TCGA-COAD4.h5
3.94 GB
-
TCGA-COAD5.h5
3.94 GB
-
TCGA-COAD6.h5
3.94 GB
-
TCGA-COAD7.h5
3.94 GB
-
TCGA-COAD8.h5
3.94 GB
-
TCGA-COAD9.h5
3.94 GB
-
TCGA-GBM0.h5
3.94 GB
-
TCGA-GBM1.h5
3.94 GB
-
TCGA-GBM2.h5
3.94 GB
-
TCGA-GBM3.h5
3.94 GB
-
TCGA-GBM4.h5
3.94 GB
-
TCGA-GBM5.h5
3.94 GB
-
TCGA-GBM6.h5
3.94 GB
-
TCGA-GBM7.h5
3.94 GB
-
TCGA-GBM8.h5
3.94 GB
-
TCGA-GBM9.h5
3.94 GB
-
TCGA-KIRP0.h5
3.94 GB
-
TCGA-KIRP1.h5
3.94 GB
-
TCGA-KIRP2.h5
3.94 GB
-
TCGA-KIRP3.h5
3.94 GB
-
TCGA-KIRP4.h5
3.94 GB
-
TCGA-KIRP5.h5
3.94 GB
-
TCGA-KIRP6.h5
3.94 GB
-
TCGA-KIRP7.h5
3.94 GB
-
TCGA-KIRP8.h5
3.94 GB
-
TCGA-KIRP9.h5
3.94 GB
-
TCGA-LUAD0.h5
3.94 GB
-
TCGA-LUAD1.h5
3.94 GB
-
TCGA-LUAD2.h5
3.94 GB
-
TCGA-LUAD3.h5
3.94 GB
-
TCGA-LUAD4.h5
3.94 GB
-
TCGA-LUAD5.h5
3.94 GB
-
TCGA-LUAD6.h5
3.94 GB
-
TCGA-LUAD7.h5
3.94 GB
-
TCGA-LUAD8.h5
3.94 GB
-
TCGA-LUAD9.h5
3.94 GB
Nov 27, 2023 version files 196.99 GB
Abstract
Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient’s gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascaded diffusion model to synthesize realistic whole-slide image tiles using the latent representation derived from the patient’s RNA-Seq data. Our results demonstrate that the generated tiles accurately preserve the distribution of cell types observed in real-world data, with state-of-the-art cell identification models successfully detecting important cell types in the synthetic samples. Furthermore, we illustrate that the synthetic tiles maintain the cell fraction observed in bulk RNA-Seq data and that modifications in gene expression affect the composition of cell types in the synthetic tiles. Next, we utilize the synthetic data generated by RNA-CDM to pretrain machine learning models and observe improved performance compared to training from scratch. 
Our study emphasizes the potential usefulness of synthetic data in developing machine learning models in scarce-data settings, while also highlighting the possibility of imputing missing data modalities by leveraging the available information. In conclusion, our proposed RNA-CDM approach for synthetic data generation in biomedicine, particularly in the context of cancer diagnosis, offers a novel and promising solution to address data scarcity. By generating synthetic data that align with real-world distributions and leveraging it to pretrain machine learning models, we contribute to the development of robust clinical decision support systems and potential advancements in precision medicine.
README: RNA-CDM Generated One Million Synthetic Images
https://doi.org/10.5061/dryad.6djh9w174
One million synthetic digital pathology images were generated using the RNA-CDM model presented in the paper "RNA-to-image multi-cancer synthesis using cascaded diffusion models".
Description of the data and file structure
There are ten h5 files per cancer type (TCGA-CESC, TCGA-COAD, TCGA-GBM, TCGA-KIRP, TCGA-LUAD), 50 files in total. Each h5 file contains 20,000 images, which gives one million images overall. Within a file, each key is a tile number, ranging from 0-20,000 in the first file of a cancer type to 180,000-200,000 in the last file. The tiles are saved as NumPy arrays.
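The file layout described above can be sketched in Python. This is an illustrative sketch, not code from the authors: the helper names `file_for_tile` and `load_tile` are hypothetical, and it assumes that each file holds tile numbers in a half-open range (0 to 19,999 in the first file, and so on), that keys are stored as decimal strings, and that each key maps directly to one dataset. Check these assumptions against an actual file (for example by listing a few keys) before relying on them.

```python
TILES_PER_FILE = 20_000   # stated in the README
FILES_PER_CANCER = 10     # stated in the README
CANCER_TYPES = ("TCGA-CESC", "TCGA-COAD", "TCGA-GBM", "TCGA-KIRP", "TCGA-LUAD")


def file_for_tile(cancer_type: str, tile_number: int) -> str:
    """Return the name of the h5 file expected to hold a given tile.

    Assumption: file k of a cancer type holds tiles
    k * 20,000 up to (k + 1) * 20,000 - 1.
    """
    if cancer_type not in CANCER_TYPES:
        raise ValueError(f"unknown cancer type: {cancer_type}")
    if not 0 <= tile_number < TILES_PER_FILE * FILES_PER_CANCER:
        raise ValueError(f"tile number out of range: {tile_number}")
    return f"{cancer_type}{tile_number // TILES_PER_FILE}.h5"


def load_tile(path: str, tile_number: int):
    """Read one tile from an h5 file as a NumPy array.

    Assumption: the key is the tile number written as a decimal string.
    h5py is imported here so the helper above works without it installed.
    """
    import h5py

    with h5py.File(path, "r") as f:
        return f[str(tile_number)][()]  # [()] reads the whole dataset into memory
```

For example, `file_for_tile("TCGA-GBM", 20_000)` returns `"TCGA-GBM1.h5"` under the range assumption above, and `load_tile("TCGA-GBM1.h5", 20_000)` would then read that tile from a downloaded copy of the file.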
Code/Software
The code used to generate this data is available under an academic license at https://rna-cdm.stanford.edu.
Manuscript citation
Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng. (2024). https://doi.org/10.1038/s41551-024-01193-8