Aberrant basal cell clonal dynamics shape early lung carcinogenesis

Alhendi, Ahmed1; Przybilla, Moritz2; Butler, Timothy2; Gómez-López, Sandra1; Whiteman, Zoe1; Rouhani, Maral1; Uddin, Imran1; Durrenberger, Pascal1; Martincorena, Iñigo2; Campbell, Peter2; Janes, Sam1

Published May 08, 2025 on Dryad. https://doi.org/10.5061/dryad.547d7wmhw

Data files

May 08, 2025 version files: 12.29 GB

- Human_tracheal_epithelial_harmony_seurat.rds: 10.42 GB
- Murine_Variant_Calls.xlsx: 130.56 MB
- NTCU_GEX_Trachea_processed_Celltype_seurat.rds: 1.10 GB
- NTCU_Trachea_Epithelial_integrated_seurat.rds: 639.31 MB
- README.md: 10.42 KB

Abstract

We used single-cell RNA sequencing (scRNA-seq) and low-input whole-genome sequencing (WGS) to investigate epithelial changes in carcinogen-exposed airways and to track clonal trajectories. This dataset includes:

- A Seurat object containing the processed scRNA-seq data for mouse tracheal cells from NTCU-treated and control mice.
- A Seurat object containing data from subclustered epithelial cells isolated from NTCU-treated and control mouse tracheal samples.
- A Seurat object containing integrated data from human tracheal epithelial cells, sampled from both current smokers and non-smokers. The data span multiple datasets and have been harmonised for downstream analysis.
- A list of annotated substitutions, double base substitutions and indels detected across murine clones identified in WGS analyses.

This submission contains the processed scRNA-seq and WGS data, including Seurat objects, annotated variant calls, and detailed metadata, supporting the study "Aberrant basal cell clonal dynamics shape early lung carcinogenesis".

Descriptions

NTCU_GEX_Trachea_processed_Celltype_seurat.rds

This Seurat object contains processed single-cell RNA sequencing data for mouse tracheal cells from NTCU-treated and Control groups. The data includes annotations for various cell types, allowing for the analysis of gene expression profiles and cellular heterogeneity.

Metadata columns:
- orig.ident: Original identifier for the sample
- nCount_originalexp: Total RNA count per cell
- nFeature_originalexp: Number of features detected per cell
- sample: Sample identifier
- patientID: Identifier for the mouse
- tissue: Tissue type (trachea)
- phenotype: Phenotype group (NTCU-treated, control)
- age: Age of the animal in weeks
- scrublet__predicted_multiplet: Predicted multiplet status (Y/N)
- scrublet__multiplet_scores: Scrublet scores for multiplet detection
- scrublet__multiplet_zscores: Z-scores from Scrublet
- batch: Batch number for sample processing
- n_genes: Number of genes detected per cell
- n_genes_by_counts: Number of genes with at least one count in the cell
- log1p_n_genes_by_counts: log1p-transformed n_genes_by_counts
- total_counts: Total number of counts per cell
- log1p_total_counts: log1p-transformed total counts per cell
- total_counts_mt: Total mitochondrial counts per cell
- log1p_total_counts_mt: log1p-transformed mitochondrial counts per cell
- pct_counts_mt: Percentage of mitochondrial counts relative to total counts
- total_counts_ribo: Total ribosomal gene counts per cell
- log1p_total_counts_ribo: log1p-transformed ribosomal gene counts per cell
- pct_counts_ribo: Percentage of ribosomal gene counts relative to total counts
- Cell_types: Annotated cell types based on clustering
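
The README does not state how the QC columns above were computed. As a minimal sketch, assuming the usual per-cell definitions behind these column names (raw totals, `log1p` of those totals, and percentages of the per-cell total), the relationships between them can be illustrated in plain Python. The function name `qc_metrics`, the gene sets, and the example counts are hypothetical and not taken from the dataset.

```python
import math


def qc_metrics(cell_counts, mito_genes, ribo_genes):
    """Summarise one cell's raw counts into QC-style metrics.

    cell_counts: dict mapping gene name -> raw count for a single cell.
    mito_genes, ribo_genes: sets of gene names supplied by the caller
    (an assumption; the README does not list the gene sets used).
    """
    total = sum(cell_counts.values())
    total_mt = sum(c for g, c in cell_counts.items() if g in mito_genes)
    total_ribo = sum(c for g, c in cell_counts.items() if g in ribo_genes)
    n_genes = sum(1 for c in cell_counts.values() if c > 0)

    def pct(part):
        # Guard against empty cells to avoid division by zero.
        return 100.0 * part / total if total else 0.0

    return {
        "n_genes_by_counts": n_genes,
        "log1p_n_genes_by_counts": math.log1p(n_genes),
        "total_counts": total,
        "log1p_total_counts": math.log1p(total),
        "total_counts_mt": total_mt,
        "log1p_total_counts_mt": math.log1p(total_mt),
        "pct_counts_mt": pct(total_mt),
        "total_counts_ribo": total_ribo,
        "log1p_total_counts_ribo": math.log1p(total_ribo),
        "pct_counts_ribo": pct(total_ribo),
    }


# Hypothetical cell with made-up counts, for illustration only.
example = qc_metrics(
    {"Krt5": 6, "Scgb1a1": 0, "mt-Co1": 3, "Rpl13": 1},
    mito_genes={"mt-Co1"},
    ribo_genes={"Rpl13"},
)
print(example["total_counts"], example["n_genes_by_counts"], example["pct_counts_mt"])
```

A check of this kind can be useful when re-deriving the stored columns from the count matrix to confirm which convention was used.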

NTCU_Trachea_Epithelial_integrated_seurat.rds

This Seurat object contains data from epithelial subclustered cells isolated from mouse NTCU-treated and Control trachea samples. The cells are integrated for comparative analysis, facilitating the identification of cell subtypes and gene expression signatures.

Metadata columns:
- orig.ident: Original identifier for the sample
- nCount_originalexp: Total RNA count per cell
- nFeature_originalexp: Number of features detected per cell
- sample: Sample identifier
- patientID: Identifier for the mouse
- tissue: Tissue type (trachea)
- phenotype: Phenotype group (NTCU-treated, control)
- age: Age of the animal in weeks
- scrublet__predicted_multiplet: Predicted multiplet status (Y/N)
- scrublet__multiplet_scores: Scrublet scores for multiplet detection
- scrublet__multiplet_zscores: Z-scores from Scrublet
- batch: Batch number for sample processing
- n_genes: Number of genes detected per cell
- n_genes_by_counts: Number of genes with at least one count in the cell
- log1p_n_genes_by_counts: log1p-transformed n_genes_by_counts
- total_counts: Total number of counts per cell
- log1p_total_counts: log1p-transformed total counts per cell
- total_counts_mt: Total mitochondrial counts per cell
- log1p_total_counts_mt: log1p-transformed mitochondrial counts per cell
- pct_counts_mt: Percentage of mitochondrial counts relative to total counts
- total_counts_ribo: Total ribosomal gene counts per cell
- log1p_total_counts_ribo: log1p-transformed ribosomal gene counts per cell
- pct_counts_ribo: Percentage of ribosomal gene counts relative to total counts
- Montoro_Basal_Score: Basal cell score based on Montoro et al. scoring
- Montoro_Club_Score: Club cell score based on Montoro et al.
- Montoro_Ciliated_Score: Ciliated cell score based on Montoro et al.
- Montoro_Tuft_Score: Tuft cell score based on Montoro et al.
- Montoro_Neuroendocrine_Score: Neuroendocrine cell score from Montoro et al.
- Montoro_Ionocyte_Score: Ionocyte cell score based on Montoro et al.
- Hurskainen_AT2_Score: AT2 cell score based on Hurskainen et al.
- Hurskainen_AT1_Score: AT1 cell score from Hurskainen et al.
- Hurskainen_Ciliated_Score: Ciliated cell score based on Hurskainen et al.
- Hurskainen_Club_Score: Club cell score based on Hurskainen et al.
- Chen_AT2_Score: AT2 cell score from Chen et al.
- Chen_Club_Score: Club cell score from Chen et al.
- Chen_AT1_Score: AT1 cell score from Chen et al.
- Chen_Basal_Score: Basal cell score based on Chen et al.
- Chen_Goblet_Score: Goblet cell score from Chen et al.
- Chen_Krt8_Score: KRT8-expressing cell score from Chen et al.
- Chen_Proliferation_Score: Proliferation score based on Chen et al.
- Chen_Club_Progenitor_Score: Club progenitor score from Chen et al.
- Chen_AT1_AT2_Score: Combined AT1 and AT2 score from Chen et al.
- Chen_Activated_Club_Score: Activated club cell score from Chen et al.
- Chen_Tumour_Score: Tumour cell score based on Chen et al.
- Chen_DATP_Score: DATP score from Chen et al.
- Chen_PATS_Score: PATS score from Chen et al.
- Goldfarbmuren_SMG_Basal_Score: SMG basal score from Goldfarbmuren et al.
- Goldfarbmuren_Ciliated_Score: Ciliated cell score from Goldfarbmuren et al.
- Goldfarbmuren_Diff_Basal_Score: Differentiated basal cell score from Goldfarbmuren et al.
- Goldfarbmuren_Ionocytes_Tuft_Score: Ionocyte and tuft cell score from Goldfarbmuren et al.
- Goldfarbmuren_Krt8_High_Score: High KRT8 expression score from Goldfarbmuren et al.
- Goldfarbmuren_Mucus_Secretory_Score: Mucus secretory cell score from Goldfarbmuren et al.
- Goldfarbmuren_PNEC_Score: PNEC cell score from Goldfarbmuren et al.
- Goldfarbmuren_Proliferating_Basal_Score: Proliferating basal score from Goldfarbmuren et al.
- Goldfarbmuren_Proteasomal_Basal_Score: Proteasomal basal cell score from Goldfarbmuren et al.
- Goldfarbmuren_SMG_Secretory_Score: SMG secretory cell score from Goldfarbmuren et al.
- Goldfarbmuren_Core_Response_Down_Score: Core response down-regulation score from Goldfarbmuren et al.
- Goldfarbmuren_Core_Response_Up_Score: Core response up-regulation score from Goldfarbmuren et al.
- integrated_snn_res.0.6: Seurat cluster resolution
- seurat_clusters: Cluster assignments from Seurat analysis
- integrated_snn_res.0.4: Seurat cluster resolution (alternative)
- celltypes: Cell type annotations (e.g., basal, club)

Human_tracheal_epithelial_harmony_seurat.rds

This Seurat object contains harmonised data from human tracheal epithelial cells sampled from current-smoker and non-smoker individuals. The dataset spans multiple studies and has been integrated for downstream analysis.

Metadata columns:
- orig.ident: Original identifier for the sample
- nCount_RNA: Total RNA count per cell
- nFeature_RNA: Number of features detected per cell
- Donor: Pseudonymized identifier for the individual from whom the sample was obtained
- Sample: Sample identifier
- Method: Method of RNA extraction or processing
- Sex: Biological sex of the donor
- age_range: Broad age category of the donor
- Smoking_status: Smoking status of the donor (current-smoker, non-smoker)
- dataset: Identifier of the source dataset
- nUMI: Total UMI counts per cell
- nGene: Total gene count per cell
- log10GenesPerUMI: Log-transformed number of genes per UMI
- Batch: Batch number for sample processing
- percent.mito: Percentage of counts from mitochondrial genes
- percent.ribo: Percentage of counts from ribosomal genes
- RNA_snn_res.0.6: Seurat cluster resolution
- seurat_clusters: Cluster assignments from Seurat analysis
- Cluster_Annotations: Cell type or cluster annotation
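
The README does not define `log10GenesPerUMI` beyond the description above. A minimal sketch, assuming the commonly used complexity score log10(nGene) / log10(nUMI), is shown here; the function name and the example values are hypothetical, and the formula should be checked against the stored column before relying on it.

```python
import math


def genes_per_umi_log10(n_gene, n_umi):
    """Complexity score under the assumed definition log10(nGene) / log10(nUMI).

    Returns None when the ratio is undefined (no genes detected, or
    one or fewer UMIs, where log10(nUMI) would be zero or undefined).
    """
    if n_gene < 1 or n_umi <= 1:
        return None
    return math.log10(n_gene) / math.log10(n_umi)


# Hypothetical cell: 1,000 genes detected from 10,000 UMIs.
score = genes_per_umi_log10(1000, 10000)
print(round(score, 3))
```

Under this assumed definition, values closer to 1 indicate that counts are spread across many genes, while low values suggest a few genes dominate the cell's counts.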

Murine_Variant_Calls.xlsx

This file contains a list of annotated substitutions and indels detected across murine clones from NTCU-treated mice. The variants were identified using low-input whole-genome sequencing (WGS).

Columns:
- mouseID: Identifier for each mouse in the study
- VariantID: Unique identifier for the detected variant
- Chrom: Chromosome on which the variant is located
- Pos: Position of the variant on the chromosome
- Ref: Reference allele
- Alt: Alternate allele
- Qual: Quality score for the variant
- Filter: Filter status (passed/failed)
- Gene: Gene affected by the variant
- Transcript: Transcript associated with the variant
- RNA: RNA sequence affected by the variant
- CDS: Coding sequence affected by the variant
- Protein: Protein sequence affected by the variant
- Type: Type of variant (SNV, insertion, deletion)
- Effect: Predicted functional effect (missense, silent, etc.)
- sampleID: Identifier for the sample
- Mut_Frags: Number of mutant fragments detected
- mut_id: Mutation identifier
- mutID_sampleID: Combination of mutation ID and sample ID
- coverage: Sequencing coverage at the variant site
- vaf: Variant allele frequency (VAF)
- cluster_id: ID of the genomic cluster to which the variant belongs
- lung_id: Identifier for the anatomical site in the lung
- mouseID_cluster: Combination of mouse ID and cluster ID
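
The README does not spell out how the derived columns are built. As a sketch under stated assumptions, `vaf` is taken here to be `Mut_Frags` divided by `coverage`, and the two combined columns are taken to be simple concatenations with an underscore separator. The function name, the separator, and the example row are all hypothetical and should be verified against the spreadsheet.

```python
def derive_variant_fields(row, sep="_"):
    """Recompute derived columns for one variant row.

    Assumptions (not confirmed by the README):
    - vaf = Mut_Frags / coverage
    - combined identifiers are joined with `sep`
    """
    coverage = row["coverage"]
    return {
        # VAF is undefined when no reads cover the site.
        "vaf": row["Mut_Frags"] / coverage if coverage > 0 else None,
        "mutID_sampleID": f'{row["mut_id"]}{sep}{row["sampleID"]}',
        "mouseID_cluster": f'{row["mouseID"]}{sep}{row["cluster_id"]}',
    }


# Made-up row for illustration; identifiers do not come from the dataset.
row = {
    "mouseID": "M1",
    "sampleID": "S1",
    "mut_id": "chr1:100:A>T",
    "Mut_Frags": 6,
    "coverage": 24,
    "cluster_id": 2,
}
derived = derive_variant_fields(row)
print(derived)
```

Comparing the recomputed values with the stored `vaf`, `mutID_sampleID`, and `mouseID_cluster` columns for a few rows is a quick way to confirm or reject these assumptions.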

Key Information Sources

Single-cell RNA sequencing (scRNA-seq) data from mouse and human tracheal epithelial cells was used in the analysis.
Low-input whole-genome sequencing (WGS) of samples obtained through laser capture microdissection was used to identify and annotate substitutions and indels in murine clones.

Code/Software

R and Seurat are required to run the analysis. The Seurat version used for the analysis is 5.0.1.
The script used for data processing and integration is available at https://zenodo.org/records/15168190. It is annotated throughout to explain the major steps, from data loading and cleaning to clustering and visualization.