Aberrant basal cell clonal dynamics shape early lung carcinogenesis
Data files
May 08, 2025 version; files total 12.29 GB
- Human_tracheal_epithelial_harmony_seurat.rds (10.42 GB)
- Murine_Variant_Calls.xlsx (130.56 MB)
- NTCU_GEX_Trachea_processed_Celltype_seurat.rds (1.10 GB)
- NTCU_Trachea_Epithelial_integrated_seurat.rds (639.31 MB)
- README.md (10.42 KB)
Abstract
We have used single-cell RNA sequencing (scRNA-seq) and low-input whole-genome sequencing (WGS) to investigate epithelial changes in the carcinogen-exposed airways and to track clonal trajectories. This dataset includes:
- Seurat object containing the processed scRNA-seq data for mouse tracheal cells from NTCU-treated and Control individuals.
- Seurat object containing data from epithelial subclustered cells isolated from mouse NTCU-treated and Control tracheal samples.
- Seurat object containing integrated data from human tracheal epithelial cells, sampled from both current-smoker and non-smoker individuals. The data spans multiple datasets and has been harmonised for downstream analysis.
- List of annotated substitutions, double base substitutions and indels detected across murine clones identified in WGS analyses.
The submitted files are the processed outputs of single-cell RNA sequencing (scRNA-seq) and whole-genome sequencing (WGS): Seurat objects, annotated variant calls, and detailed metadata. They are provided to support further analysis of the study "Aberrant basal cell clonal dynamics shape early lung carcinogenesis".
Descriptions
NTCU_GEX_Trachea_processed_Celltype_seurat.rds
This Seurat object contains processed single-cell RNA sequencing data for mouse tracheal cells from NTCU-treated and Control groups. The data includes annotations for various cell types, allowing for the analysis of gene expression profiles and cellular heterogeneity.
- Metadata columns:
  - orig.ident: Original identifier for the sample
  - nCount_originalexp: Total RNA count per cell
  - nFeature_originalexp: Number of features detected per cell
  - sample: Sample identifier
  - patientID: Identifier for the mouse
  - tissue: Tissue type (trachea)
  - phenotype: Phenotype group (NTCU-treated, control)
  - age: Age of the animal in weeks
  - scrublet__predicted_multiplet: Predicted multiplet status (Y/N)
  - scrublet__multiplet_scores: Scrublet scores for multiplet detection
  - scrublet__multiplet_zscores: Z-scores from Scrublet
  - batch: Batch number for sample processing
  - n_genes: Number of genes detected per cell
  - n_genes_by_counts: Number of genes expressed, calculated by counts
  - log1p_n_genes_by_counts: Log-transformed count of genes by expression
  - total_counts: Total number of counts for the sample
  - log1p_total_counts: Log-transformed total counts per sample
  - total_counts_mt: Total mitochondrial counts per sample
  - log1p_total_counts_mt: Log-transformed mitochondrial counts per sample
  - pct_counts_mt: Percentage of mitochondrial counts relative to total counts
  - total_counts_ribo: Total ribosomal RNA counts per sample
  - log1p_total_counts_ribo: Log-transformed ribosomal RNA counts
  - pct_counts_ribo: Percentage of ribosomal RNA counts relative to total counts
  - Cell_types: Annotated cell types based on clustering
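The QC column names above match the naming used by scanpy's `calculate_qc_metrics`; this is an inference from the names, not something the dataset states. Under that assumption, the percentage and log1p columns relate to the raw totals roughly as in the minimal sketch below. The helper `derive_qc_columns` and the input numbers are hypothetical and serve only as illustration.

```python
import math


def derive_qc_columns(total_counts, total_counts_mt, total_counts_ribo):
    """Recompute derived QC columns for one row of the metadata table.

    Assumes scanpy-style definitions (not confirmed by the dataset):
    log1p_* = ln(1 + x) and pct_counts_* = 100 * subset total / overall total.
    """
    if total_counts <= 0:
        raise ValueError("total_counts must be positive")
    return {
        "log1p_total_counts": math.log1p(total_counts),
        "log1p_total_counts_mt": math.log1p(total_counts_mt),
        "log1p_total_counts_ribo": math.log1p(total_counts_ribo),
        "pct_counts_mt": 100.0 * total_counts_mt / total_counts,
        "pct_counts_ribo": 100.0 * total_counts_ribo / total_counts,
    }


# Made-up totals: 5000 counts overall, 250 mitochondrial, 1000 ribosomal.
qc = derive_qc_columns(5000, 250, 1000)
print(qc["pct_counts_mt"], qc["pct_counts_ribo"])
```

Recomputing these values from the stored totals can be a quick sanity check after loading the object, provided the assumed definitions hold for this dataset.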
NTCU_Trachea_Epithelial_integrated_seurat.rds
This Seurat object contains data from epithelial subclustered cells isolated from mouse NTCU-treated and Control trachea samples. The cells are integrated for comparative analysis, facilitating the identification of cell subtypes and gene expression signatures.
- Metadata columns:
  - orig.ident: Sample identifier
  - nCount_originalexp: Total RNA count per cell
  - nFeature_originalexp: Number of features detected per cell
  - sample: Sample identifier
  - patientID: Identifier for the mouse
  - tissue: Tissue type (trachea)
  - phenotype: Phenotype group (NTCU-treated, control)
  - age: Age of the animal
  - scrublet__predicted_multiplet: Predicted multiplet status (Y/N)
  - scrublet__multiplet_scores: Scrublet scores for multiplet detection
  - scrublet__multiplet_zscores: Z-scores from Scrublet
  - batch: Batch number for sample processing
  - n_genes: Number of genes detected per cell
  - n_genes_by_counts: Number of genes expressed, calculated by counts
  - log1p_n_genes_by_counts: Log-transformed count of genes by expression
  - total_counts: Total number of counts for the sample
  - log1p_total_counts: Log-transformed total counts per sample
  - total_counts_mt: Total mitochondrial counts per sample
  - log1p_total_counts_mt: Log-transformed mitochondrial counts per sample
  - pct_counts_mt: Percentage of mitochondrial counts relative to total counts
  - total_counts_ribo: Total ribosomal RNA counts per sample
  - log1p_total_counts_ribo: Log-transformed ribosomal RNA counts
  - pct_counts_ribo: Percentage of ribosomal RNA counts relative to total counts
  - Montoro_Basal_Score: Basal cell score based on Montoro et al. scoring
  - Montoro_Club_Score: Club cell score based on Montoro et al.
  - Montoro_Ciliated_Score: Ciliated cell score based on Montoro et al.
  - Montoro_Tuft_Score: Tuft cell score based on Montoro et al.
  - Montoro_Neuroendocrine_Score: Neuroendocrine cell score from Montoro et al.
  - Montoro_Ionocyte_Score: Ionocyte cell score based on Montoro et al.
  - Hurskainen_AT2_Score: AT2 cell score based on Hurskainen et al.
  - Hurskainen_AT1_Score: AT1 cell score from Hurskainen et al.
  - Hurskainen_Ciliated_Score: Ciliated cell score based on Hurskainen et al.
  - Hurskainen_Club_Score: Club cell score based on Hurskainen et al.
  - Chen_AT2_Score: AT2 cell score from Chen et al.
  - Chen_Club_Score: Club cell score from Chen et al.
  - Chen_AT1_Score: AT1 cell score from Chen et al.
  - Chen_Basal_Score: Basal cell score based on Chen et al.
  - Chen_Goblet_Score: Goblet cell score from Chen et al.
  - Chen_Krt8_Score: KRT8-expressing cell score from Chen et al.
  - Chen_Proliferation_Score: Proliferation score based on Chen et al.
  - Chen_Club_Progenitor_Score: Club progenitor score from Chen et al.
  - Chen_AT1_AT2_Score: Combined AT1 and AT2 score from Chen et al.
  - Chen_Activated_Club_Score: Activated club cell score from Chen et al.
  - Chen_Tumour_Score: Tumour cell score based on Chen et al.
  - Chen_DATP_Score: DATP score from Chen et al.
  - Chen_PATS_Score: PATS score from Chen et al.
  - Goldfarbmuren_SMG_Basal_Score: SMG basal score from Goldfarbmuren et al.
  - Goldfarbmuren_Ciliated_Score: Ciliated cell score from Goldfarbmuren et al.
  - Goldfarbmuren_Diff_Basal_Score: Differentiated basal cell score from Goldfarbmuren et al.
  - Goldfarbmuren_Ionocytes_Tuft_Score: Ionocyte and tuft cell score from Goldfarbmuren et al.
  - Goldfarbmuren_Krt8_High_Score: High KRT8 expression score from Goldfarbmuren et al.
  - Goldfarbmuren_Mucus_Secretory_Score: Mucus secretory cell score from Goldfarbmuren et al.
  - Goldfarbmuren_PNEC_Score: PNEC cell score from Goldfarbmuren et al.
  - Goldfarbmuren_Proliferating_Basal_Score: Proliferating basal score from Goldfarbmuren et al.
  - Goldfarbmuren_Proteasomal_Basal_Score: Proteasomal basal cell score from Goldfarbmuren et al.
  - Goldfarbmuren_SMG_Secretory_Score: SMG secretory cell score from Goldfarbmuren et al.
  - Goldfarbmuren_Core_Response_Down_Score: Core response down-regulation score from Goldfarbmuren et al.
  - Goldfarbmuren_Core_Response_Up_Score: Core response up-regulation score from Goldfarbmuren et al.
  - integrated_snn_res.0.6: Seurat cluster resolution
  - seurat_clusters: Cluster assignments from Seurat analysis
  - integrated_snn_res.0.4: Seurat cluster resolution (alternative)
  - celltypes: Cell type annotations (e.g., basal, club, etc.)
Human_tracheal_epithelial_harmony_seurat.rds
This Seurat object contains harmonised data from human tracheal epithelial cells sampled from current-smoker and non-smoker individuals. The dataset spans multiple studies and has been integrated for downstream analysis.
- Metadata columns:
  - orig.ident: Original identifier for the sample
  - nCount_RNA: Total RNA count per cell
  - nFeature_RNA: Number of features detected per cell
  - Donor: Pseudonymized identifier for the individual from whom the sample was obtained
  - Sample: Sample identifier
  - Method: Method of RNA extraction or processing
  - Sex: Biological sex of the donor
  - age_range: Broad age category of the donor
  - Smoking_status: Smoking status of the donor (current-smoker, non-smoker)
  - dataset: Dataset identifier included
  - nUMI: Total UMI counts per cell
  - nGene: Total gene count per cell
  - log10GenesPerUMI: Log-transformed number of genes per UMI
  - Batch: Batch number for sample processing
  - percent.mito: Percentage of mitochondrial genes detected
  - percent.ribo: Percentage of ribosomal genes detected
  - RNA_snn_res.0.6: Seurat cluster resolution
  - seurat_clusters: Cluster assignments from Seurat analysis
  - Cluster_Annotations: Cell type or cluster annotation
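The description of `log10GenesPerUMI` leaves the exact formula open. A common convention in scRNA-seq QC workflows is the ratio log10(nGene) / log10(nUMI), a complexity measure that approaches 1 when nearly every UMI comes from a different gene. The sketch below assumes that convention, which the dataset does not confirm; `log10_genes_per_umi` is a hypothetical helper and the inputs are made up.

```python
import math


def log10_genes_per_umi(n_gene, n_umi):
    """Complexity score, assumed here to be log10(nGene) / log10(nUMI)."""
    if n_gene < 1 or n_umi < 2:
        # log10(1) == 0 would give a zero denominator; smaller values are undefined.
        raise ValueError("need n_gene >= 1 and n_umi >= 2")
    return math.log10(n_gene) / math.log10(n_umi)


# Made-up cell with 1000 detected genes and 10000 UMIs: roughly 0.75.
score = log10_genes_per_umi(1000, 10000)
print(round(score, 3))
```

If values recomputed this way from `nGene` and `nUMI` do not match the stored column, the authors used a different definition and the stored values should take precedence.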
Murine_Variant_Calls.xlsx
This file contains a list of annotated substitutions and indels detected across murine clones from NTCU-treated mice. The variants were identified using low-input whole-genome sequencing (WGS).
- Columns:
  - mouseID: Identifier for each mouse in the study
  - VariantID: Unique identifier for the detected variant
  - Chrom: Chromosome where the variant was located
  - Pos: Position of the variant on the chromosome
  - Ref: Reference allele
  - Alt: Alternate allele
  - Qual: Quality score for the variant
  - Filter: Filter status (passed/failed)
  - Gene: Gene affected by the variant
  - Transcript: Transcript associated with the variant
  - RNA: RNA sequence affected by the variant
  - CDS: Coding sequence affected by the variant
  - Protein: Protein sequence affected by the variant
  - Type: Type of variant (SNV, insertion, deletion)
  - Effect: Predicted functional effect (missense, silent, etc.)
  - sampleID: Identifier for the sample
  - Mut_Frags: Number of mutant fragments detected
  - mut_id: Mutation identifier
  - mutID_sampleID: Combination of mutation ID and sample ID
  - coverage: Sequencing coverage at the variant site
  - vaf: Variant allele frequency (VAF)
  - cluster_id: ID of the genomic cluster to which the variant belongs
  - lung_id: Identifier for the anatomical site of the lung
  - mouseID_cluster: Combination of mouse ID and cluster ID
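As a rough illustration of how the per-variant columns fit together, the sketch below recomputes a VAF as `Mut_Frags / coverage`, builds the two combined keys, and averages VAF per `mouseID_cluster`. The VAF formula, the underscore separator in the keys, and the `mut_id` format are all assumptions; the rows are invented and are not taken from `Murine_Variant_Calls.xlsx`.

```python
from collections import defaultdict

# Invented rows that reuse a subset of the spreadsheet's column names.
rows = [
    {"mouseID": "M1", "sampleID": "S1", "mut_id": "chr1_100_A_T",
     "Mut_Frags": 12, "coverage": 40, "cluster_id": 1},
    {"mouseID": "M1", "sampleID": "S2", "mut_id": "chr2_200_G_C",
     "Mut_Frags": 8, "coverage": 40, "cluster_id": 1},
    {"mouseID": "M2", "sampleID": "S3", "mut_id": "chr3_300_C_T",
     "Mut_Frags": 5, "coverage": 50, "cluster_id": 2},
]


def annotate(row):
    """Add derived columns to one variant row (assumed definitions)."""
    out = dict(row)
    # Assumption: VAF = mutant fragments / coverage at the site.
    out["vaf"] = row["Mut_Frags"] / row["coverage"] if row["coverage"] else None
    # Assumption: combined keys are joined with an underscore.
    out["mutID_sampleID"] = f'{row["mut_id"]}_{row["sampleID"]}'
    out["mouseID_cluster"] = f'{row["mouseID"]}_{row["cluster_id"]}'
    return out


annotated = [annotate(r) for r in rows]

# Group VAFs by mouseID_cluster and take the mean of each group.
vafs_by_cluster = defaultdict(list)
for r in annotated:
    if r["vaf"] is not None:
        vafs_by_cluster[r["mouseID_cluster"]].append(r["vaf"])
mean_vaf = {key: sum(v) / len(v) for key, v in vafs_by_cluster.items()}
print(mean_vaf)
```

Grouping by the combined key rather than by `cluster_id` alone keeps clusters from different mice apart, which is presumably why the spreadsheet carries `mouseID_cluster` as its own column.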
Key Information Sources
- Single-cell RNA sequencing (scRNA-seq) data from mouse and human tracheal epithelial cells was used in the analysis.
- Low-input whole-genome sequencing (WGS) of samples obtained by laser capture microdissection was used to identify and annotate substitutions and indels in murine clones.
Code/Software
R and Seurat are required to run the analysis. The Seurat version used for the analysis is 5.0.1.
The script used for data processing and integration is available at https://zenodo.org/records/15168190. It is annotated throughout to explain the major steps, from data loading and cleaning to clustering and visualization.
