Hematopathologist-annotated mass cytometry dataset of acute myeloid leukemia diagnostic specimens
Data files
Jul 18, 2025 version files (499.23 MB)
- aml_dataset_for_dryad.csv (499.23 MB)
- README.md (6.26 KB)
Abstract
This dataset comprises hematopathologist-annotated single-cell data from patients with Acute Myeloid Leukemia (AML), sourced from FlowRepository (ID: FR-FCMZ2E7). The dataset includes normalized, singlet-gated events with annotations provided by the authorship team of Tsai et al. One patient diagnosed with Myelodysplastic Syndrome (MDS) was excluded because of diagnostic ambiguity. To align with blast enumeration standards and pathology laboratory procedures, only CD45+ hematopoietic lineage cells were included.
Data preprocessing followed standard mass cytometry protocols, including hyperbolic arcsine transformation with a cofactor of 5, scaling markers to their 99.9th percentile, and exclusion of cells with marker values exceeding this threshold to remove artifacts and outliers. This dataset, processed using the R package tidytof, offers a high-quality resource for researchers investigating AML through single-cell analysis.
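The preprocessing steps described above can be illustrated with a minimal NumPy sketch. It is not the tidytof code used to build the dataset: the function name `preprocess`, the input layout (cells in rows, markers in columns), and the choice to compute the percentile on the transformed values are all assumptions made for illustration.

```python
import numpy as np


def preprocess(raw, cofactor=5.0, pct=99.9):
    """Illustrative arcsinh / percentile preprocessing (hypothetical helper).

    raw: array of shape (n_cells, n_markers) holding non-negative intensities.
    Returns the scaled values for retained cells and a boolean mask of
    which input rows were retained.
    """
    # Hyperbolic arcsine transformation with a cofactor of 5.
    transformed = np.arcsinh(np.asarray(raw, dtype=float) / cofactor)

    # Per-marker 99.9th percentile (assumed to be taken after the transform).
    caps = np.percentile(transformed, pct, axis=0)

    # Exclude any cell that exceeds the threshold in at least one marker.
    keep = np.all(transformed <= caps, axis=1)

    # Scale each marker by its threshold; guard against an all-zero marker.
    safe_caps = np.where(caps > 0, caps, 1.0)
    scaled = transformed[keep] / safe_caps
    return scaled, keep
```

Under these assumptions every retained value lies between 0 and 1, which makes markers with very different dynamic ranges directly comparable.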
https://doi.org/10.5061/dryad.jq2bvq8kj
Description of the data and file structure
This dataset, originally collected to study single-cell morphometric profiling in AML, has now been annotated by a hematopathologist to provide single-cell labels distinguishing cancerous blasts from healthy bone marrow cells. These expert annotations, combined with the high-dimensional molecular and morphological data, make the dataset uniquely suited for developing and validating machine learning methods for disease-associated cell identification. The addition of cell-level annotations enables applications such as training classifiers to identify leukemia cells and selecting biologically meaningful features. This resource supports advances in single-cell data analysis and provides a valuable benchmark for studying acute myeloid leukemia and related conditions.
Files and variables
File: aml_dataset_for_dryad.csv
Description: This dataset consists of single-cell measurements from Acute Myeloid Leukemia (AML) samples collected via mass cytometry. Each row represents a single cell, and each column contains either a sample-level label, a cell-level label, or marker intensity measurements. The dataset has been annotated by a hematopathologist, distinguishing cancerous blasts from healthy bone marrow cells. There are 13 AML patients and 3 non-AML patients.
Variables
- patient: Unique identifier for each patient in the dataset, taken from the original paper. IDs that begin with the string "AML" are AML patients; the remaining patients are non-AML patients.
- patient_label: Clinical label of the patient sample. Possible values:
  - aml: Indicates the sample originates from a patient diagnosed with Acute Myeloid Leukemia.
  - healthy: Indicates the sample originates from a patient not diagnosed with Acute Myeloid Leukemia.
- cell_label: Hematopathologist-provided label for each cell. Possible values:
  - blast: Indicates the cell is a cancerous blast.
  - non-blast: Indicates the cell is a healthy bone marrow cell.
- cd235ab: Marker intensity for CD235a/b, associated with erythroid lineage.
- cd61: Marker intensity for CD61, a marker for megakaryocyte lineage.
- cd71: Marker intensity for CD71, related to transferrin receptor expression.
- cd3: Marker intensity for CD3, a T-cell lineage marker.
- cd8: Marker intensity for CD8, a cytotoxic T-cell marker.
- cd2: Marker intensity for CD2, a pan-T-cell marker.
- cd5: Marker intensity for CD5, a T-cell and some B-cell marker.
- cd4: Marker intensity for CD4, a helper T-cell marker.
- cd7: Marker intensity for CD7, an early T-cell lineage marker.
- cd11c: Marker intensity for CD11c, a marker for dendritic and myeloid cells.
- cd23: Marker intensity for CD23, a B-cell activation marker.
- cd123: Marker intensity for CD123, the IL-3 receptor alpha chain.
- cd56: Marker intensity for CD56, associated with NK cells and some leukemias.
- cd45: Marker intensity for CD45, a pan-leukocyte marker.
- cd10: Marker intensity for CD10, associated with pre-B and germinal center B-cells.
- cd13: Marker intensity for CD13, a myeloid lineage marker.
- cd117: Marker intensity for CD117, a stem cell and progenitor cell marker.
- cd34: Marker intensity for CD34, a stem and progenitor cell marker.
- cd20: Marker intensity for CD20, a mature B-cell marker.
- cd19: Marker intensity for CD19, a pan-B-cell marker.
- cd22: Marker intensity for CD22, a mature B-cell marker.
- cd79a: Marker intensity for CD79a, a component of the B-cell receptor complex.
- cd15: Marker intensity for CD15, associated with granulocyte lineage.
- cd33: Marker intensity for CD33, a myeloid lineage marker.
- cd14: Marker intensity for CD14, a monocyte/macrophage marker.
- cd64: Marker intensity for CD64, an Fc-gamma receptor on macrophages and neutrophils.
- cd16: Marker intensity for CD16, an Fc-gamma receptor associated with NK cells and neutrophils.
- cd38: Marker intensity for CD38, a marker for plasma cells and activated lymphocytes.
- mpo: Marker intensity for myeloperoxidase, specific to myeloid cells.
- wga_102, wga_104, wga_105, wga_106, wga_108, wga_110: Morphometric markers measured using wheat germ agglutinin staining.
- ig_l: Marker intensity for immunoglobulin light chain lambda.
- lactoferrin: Marker intensity for lactoferrin, associated with neutrophils.
- ig_k: Marker intensity for immunoglobulin light chain kappa.
- r_rna: Marker intensity for ribosomal RNA, indicative of cellular activity.
- hp1b: Marker intensity for heterochromatin protein 1 beta.
- lamin_b1: Marker intensity for Lamin B1, associated with nuclear structure.
- vamp_7: Marker intensity for VAMP7, associated with vesicle trafficking.
- lysozyme: Marker intensity for lysozyme, present in monocytes and granulocytes.
- serpin_b1: Marker intensity for Serpin B1, an inhibitor of neutrophil elastase.
- lamin_a_c: Marker intensity for Lamin A/C, associated with nuclear lamina.
- beta_actin: Marker intensity for beta-actin, a cytoskeletal protein.
- hla_dr: Marker intensity for HLA-DR, a major histocompatibility complex (MHC) class II marker.
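As a rough illustration of the file layout described above (one row per cell, three label columns, and the remaining columns holding marker intensities), the sketch below streams rows with Python's standard `csv` module and computes the fraction of cells labeled `blast` for each patient. The helper names, the toy rows, and the patient IDs `AML_x` and `H_y` are hypothetical. It also assumes the CSV has a header row using exactly the variable names listed above and no additional columns.

```python
import csv
import io
from collections import defaultdict

# The three non-marker columns named in the variable list.
LABEL_COLUMNS = ("patient", "patient_label", "cell_label")


def marker_columns(fieldnames):
    """Return every column that is not one of the three label columns."""
    return [name for name in fieldnames if name not in LABEL_COLUMNS]


def blast_fraction_by_patient(rows):
    """Stream row dicts and return {patient: fraction of cells labeled 'blast'}."""
    totals = defaultdict(int)
    blasts = defaultdict(int)
    for row in rows:
        totals[row["patient"]] += 1
        if row["cell_label"] == "blast":
            blasts[row["patient"]] += 1
    return {patient: blasts[patient] / totals[patient] for patient in totals}


# Made-up rows with only two marker columns, for illustration only.
toy_csv = io.StringIO(
    "patient,patient_label,cell_label,cd45,cd34\n"
    "AML_x,aml,blast,0.61,0.80\n"
    "AML_x,aml,non-blast,0.70,0.02\n"
    "H_y,healthy,non-blast,0.66,0.01\n"
)
reader = csv.DictReader(toy_csv)
markers = marker_columns(reader.fieldnames)
fractions = blast_fraction_by_patient(reader)
```

To run the same summary on the real file, replace `toy_csv` with `open("aml_dataset_for_dryad.csv", newline="")`. Reading row by row avoids holding the full 499.23 MB file in memory.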
Human subjects data
All human subjects data used in this study were obtained under protocols approved by the Institutional Review Board at Stanford University. Written informed consent was obtained from all participants, including consent for future research use and public data sharing of de-identified samples.
The data submitted to Dryad have been fully de-identified in accordance with HIPAA and international data protection guidelines. Specifically, all direct personal identifiers (e.g., names, dates of birth, medical record numbers) were removed prior to analysis. Clinical metadata were limited to non-identifiable variables.
As such, the dataset is compliant with applicable regulations governing the sharing of human subjects data and is suitable for public domain distribution via Dryad.