
Global Characterization of Megakaryocytes in Bone Marrow, Peripheral Blood, and Cord Blood by Single-cell RNA Sequencing

Cite this dataset

Jing, Hongmei (2020). Global Characterization of Megakaryocytes in Bone Marrow, Peripheral Blood, and Cord Blood by Single-cell RNA Sequencing [Dataset]. Dryad.


Megakaryocytes (MK) are derived mainly from bone marrow (BM) and are primarily involved in platelet production. Recent studies have shown that BM-derived MK may have immune functions and that MK from peripheral blood (PB) are associated with prostate cancer. We analyzed more than 1.2 million single-cell transcriptomes from 132 samples of PB, BM, and cord blood (CB) from healthy individuals and patients, and obtained 4,474 single MK cells and 14 MK subtypes. We found that MK were widely distributed, that MK were more abundant in PB than in BM, and that some MK subtypes were specific to PB. We identified classical MK1, with typical MK characteristics, and non-classical MK2, closely related to immunity, which was the most common subtype in BM and CB. Classical MK1 was closely related to non-small cell lung cancer (NSCLC) and has diagnostic ability. MK2 may have potential adaptive immune functions and may play a role in NSCLC and in the autoimmune disease systemic lupus erythematosus (SLE). This study deepens our understanding of MK and suggests that MK have potential immune functions and are involved in various diseases.


We integrated single-cell RNA-seq data from 132 cases in healthy and disease states, comprising 78 BM samples, 46 PB samples, and 8 CB samples.

The 78 BM samples included 39 healthy specimens (BM26-31, MantonBM1-8, BM01-25), 35 acute myeloid leukemia (AML) specimens (AML01-35), and 4 hematopoietic stem cell transplant (HSCT) specimens taken before and after transplantation (HSCT01-04). The 46 PB samples included 28 healthy specimens (PB01-05, PB06-10, PB12, PB21-26, P01-11), 5 SLE specimens (PB27-31), 1 NSCLC specimen (PB32), and 12 chronic lymphocytic leukemia (CLL) specimens (PBCLL01-12). The 8 CB samples were all healthy specimens (MantonCB1-8). In all, the data comprise more than 1.2 million single-cell transcriptomes.

We collected scRNA-seq data for AML01-35 from GSE116256 (van Galen, P., Hovestadt, V., et al. 2019) in the Gene Expression Omnibus (GEO) database, BM01-25 from GSE120446 (Oetjen, K.A., Lindblad, K.E., et al. 2018), BM26-31 from GSE116256, PB06-10 from GSE128066 (Sun, Z., Chen, L., et al. 2019), PB12 from GSE132802 (Kim, D., Kobayashi, T., et al. 2020), PB21-26 from GSE132802, PB27-31 from GSE96583 (Kang, H.M., Subramaniam, M., et al. 2018), PB32 from GSE127471 (Newman, A.M., Steen, C.B., et al. 2019), and PBCLL01-12 from GSE111014 (Rendeiro, A.F., Krausgruber, T., et al. 2020). We collected scRNA-seq data for HSCT01-04, PB01-05, and P01-11 from the 10X website, and for MantonBM1-8 and MantonCB1-8 from the Human Cell Atlas (HCA) (Regev, A., Teichmann, S.A., et al. 2017). Except for AML01-35 and P01-11, which were sorted by flow cytometry, all of the above samples were obtained by centrifugation.

ScRNA‑seq Data Analysis

1. Pre-processing of scRNA‑seq data

Seurat v3.1.2 was used to filter the raw scRNA‑seq data, and cells meeting all of the following criteria were retained: 1) more than 100 but fewer than 30,000 genes detected; 2) more than 1,000 but fewer than 50,000 RNA molecules (UMIs); 3) less than 5% of UMIs from mitochondrial genes. We then merged, normalized, and scaled all the scRNA‑seq data using Seurat v3.1.2.
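The three filtering criteria can be illustrated with a minimal Python sketch. This is not the Seurat code used for the dataset; the `CellQC` record, the `passes_qc` helper, and the example cells are hypothetical, and only the thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell quality metrics (hypothetical record type for illustration)."""
    barcode: str
    n_genes: int      # number of genes detected in the cell
    n_umis: int       # total UMI count of the cell
    pct_mito: float   # percentage of UMIs from mitochondrial genes


def passes_qc(cell: CellQC) -> bool:
    """Apply the three thresholds stated in the text (strict inequalities)."""
    return (
        100 < cell.n_genes < 30000
        and 1000 < cell.n_umis < 50000
        and cell.pct_mito < 5.0
    )


# Made-up example cells, one per failure mode plus one that passes.
cells = [
    CellQC("cell_a", n_genes=850, n_umis=2400, pct_mito=2.1),   # passes
    CellQC("cell_b", n_genes=90, n_umis=1500, pct_mito=1.0),    # too few genes
    CellQC("cell_c", n_genes=1200, n_umis=900, pct_mito=3.0),   # too few UMIs
    CellQC("cell_d", n_genes=2000, n_umis=8000, pct_mito=7.5),  # too much mitochondrial RNA
]

kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```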

2. Integration of datasets

The merge function in Seurat v3.1.2 was used to integrate scRNA‑seq data from the different studies. We merged the scRNA‑seq metadata at the cell level and preserved the cell identities assigned before merging. We assembled the datasets and subsequently analyzed them as a single scRNA‑seq object.
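As a toy illustration of merging while keeping each cell's original identity, the sketch below combines per-sample count tables into one table and records the sample of origin as cell-level metadata. It is an assumption-laden stand-in, not Seurat's merge implementation: the `merge_samples` helper, the barcodes, and the counts are invented, and only the sample names BM01 and PB01 are taken from the dataset description.

```python
def merge_samples(samples):
    """Merge per-sample count tables into one table.

    samples: {sample_id: {barcode: {gene: count}}}
    Returns (merged_counts, metadata), where every cell ID is prefixed with
    its sample ID so that identical barcodes from different samples stay
    distinct, and metadata records the sample of origin for each cell.
    """
    merged_counts = {}
    metadata = {}
    for sample_id, cells in samples.items():
        for barcode, counts in cells.items():
            cell_id = f"{sample_id}_{barcode}"
            merged_counts[cell_id] = dict(counts)
            metadata[cell_id] = {"sample": sample_id}
    return merged_counts, metadata


# Hypothetical input: the same barcode occurs in two different samples.
samples = {
    "BM01": {"AAAC": {"PF4": 3, "PPBP": 1}},
    "PB01": {"AAAC": {"PF4": 0, "CD3E": 5}},
}

merged, meta = merge_samples(samples)
print(sorted(merged))
print(meta["PB01_AAAC"])
```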

3. Dimension reduction and unsupervised clustering

In short, we first selected the top 2,000 highly variable genes (features) in the scRNA‑seq meta dataset. RunPCA in Seurat v3.1.2 was used to perform principal component analysis (PCA) and to determine which PCs could be applied for further dimension reduction and clustering. FindClusters in Seurat was used for unsupervised clustering. t-SNE and UMAP were used to visualize the clustering results. FindMarkers or FindAllMarkers in Seurat was used to detect cluster-specific markers between any given cell groups or among clusters.
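The idea behind selecting highly variable features can be sketched as ranking genes by their variance across cells and keeping the top k. This is a deliberate simplification: it uses plain variance rather than the variance-stabilizing procedure that Seurat provides, and the `top_variable_genes` helper and the expression values are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expr, k):
    """Return the k genes with the largest variance across cells.

    expr: {gene: [expression value per cell]}
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:k]


# Made-up expression values for four cells.
expr = {
    "PF4": [0.0, 5.0, 0.0, 6.0],    # varies strongly between cells
    "ACTB": [4.0, 4.1, 3.9, 4.0],   # nearly constant
    "PPBP": [0.0, 3.0, 0.0, 2.0],   # varies moderately
}

top = top_variable_genes(expr, 2)
print(top)
```

In the actual analysis k would be 2,000 and the input would be the merged expression matrix.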

4. Determine cell types of each cluster

AddModuleScore in Seurat was used to calculate the scores of different features according to the average expression levels of a set of feature‑related genes (Table S2).

To identify MK populations among the 37 cell types, we analyzed conventional bulk RNA-Seq data of 211 hematopoietic cell samples from GSE24759 (Novershtern, N., Subramanian, A., et al. 2011) to obtain megakaryocyte-associated markers. These markers were PF4, PPBP, SELP, ITGB3, DNM3, EGF, PDGFA, ARHGAP6, CTDSPL, CLEC1B, and HSPC159.
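A simplified version of a module score for the MK marker set can be sketched as the mean expression of the signature genes minus the mean expression of a set of control genes. This is only an approximation of the idea: Seurat's AddModuleScore chooses control genes by expression bin, whereas here the control list (`ACTB`, `GAPDH`), the `module_score` helper, and the two example cells are assumptions made for illustration.

```python
from statistics import mean

# Megakaryocyte-associated markers listed in the text.
MK_MARKERS = [
    "PF4", "PPBP", "SELP", "ITGB3", "DNM3", "EGF",
    "PDGFA", "ARHGAP6", "CTDSPL", "CLEC1B", "HSPC159",
]

# Hypothetical control genes; not taken from the dataset description.
CONTROL_GENES = ["ACTB", "GAPDH"]


def module_score(cell_expr, signature, control):
    """Mean signature expression minus mean control expression for one cell.

    cell_expr: {gene: normalized expression}; missing genes count as 0.
    """
    sig = [cell_expr.get(gene, 0.0) for gene in signature]
    ctl = [cell_expr.get(gene, 0.0) for gene in control]
    return mean(sig) - mean(ctl)


# Made-up cells: one expressing all MK markers, one expressing none.
mk_like = {gene: 2.0 for gene in MK_MARKERS}
mk_like.update({"ACTB": 1.0, "GAPDH": 1.0})
other = {"CD3E": 3.0, "ACTB": 1.0, "GAPDH": 1.0}

mk_score = module_score(mk_like, MK_MARKERS, CONTROL_GENES)
other_score = module_score(other, MK_MARKERS, CONTROL_GENES)
print(mk_score, other_score)
```

Cells, or clusters of cells, with a high score for the MK marker set are the candidates for annotation as MK.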

5. Pseudo-time trajectory construction

The Monocle v2.12.0 R package was used to construct the pseudo-time trajectory. The reduceDimension function in Monocle was used to project the scRNA‑seq meta dataset into a lower-dimensional space, with dimension reduction performed by the DDRTree approach.

Classification performance of the signature gene expression of 14 MK subtypes between 377 non-cancer samples and 402 NSCLC samples

Gene expression profiles of 377 non-cancer and 402 NSCLC samples were obtained from GSE89843 (Best, M.G., Sol, N., et al. 2017) in the GEO database. An Illumina HiSeq 2500 was used to sequence each sample. Pearson correlation was used to calculate the correlation coefficient between the gene expression signature of each of the 14 MK subtypes and the blood platelet gene expression profile of each of the 377 non-cancer and 402 NSCLC samples. ROC curves were built from the correlation coefficients with the pROC package (R 3.6.2).
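The two steps described above, correlating each sample with a subtype signature and then evaluating how well the correlation separates the groups, can be sketched in plain Python. This is not the pROC workflow itself: the `pearson` and `auc` helpers, the four-gene signature, and the sample profiles are invented, and the assumption that NSCLC samples correlate more strongly with the signature is made only for this example. The AUC is computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative sample, with ties counted as one half.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical signature of one MK subtype over four genes.
signature = [5.0, 3.0, 1.0, 0.5]

# Made-up platelet expression profiles over the same four genes.
nsclc = [[4.8, 3.1, 1.2, 0.4], [5.1, 2.7, 0.9, 0.8]]
non_cancer = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.5, 1.5, 3.0]]

nsclc_r = [pearson(signature, profile) for profile in nsclc]
non_cancer_r = [pearson(signature, profile) for profile in non_cancer]

auc_value = auc(nsclc_r, non_cancer_r)
print(auc_value)
```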

Identification of protein expression by the Human Protein Atlas

To further verify MK-specific gene expression in MK at the protein level, we obtained immunostaining images for genes specifically expressed in megakaryocytes, taken in BM MK, from the Human Protein Atlas.

Statistical analysis

All statistical analyses were performed in R version 3.6.2. Single-cell expression levels were compared using the nonparametric Wilcoxon test. Pearson correlation was used for correlation analysis. MK proportions in different tissues were compared using Student's t-test (unpaired, two-sided). MK proportions in healthy individuals and patients were compared using the Mann-Whitney test. In all statistical tests, a P-value of less than 0.05 was considered statistically significant.


National Natural Science Foundation of China, Award: 81800195

National Natural Science Foundation of China, Award: 81460315

Key Clinical Projects of Peking University Third Hospital, Award: BYSYZD2019026

Peking University, Award: BMU2018MB004

National Natural Science Foundation of China, Award: 7132183

National Natural Science Foundation of China, Award: 7182178

China Health Promotion Foundation, Award: CHPF-zlkysx-001

Scientific Research Foundation from Health Commission of Jiangxi Province, China, Award: 20141114

Science and Technology Research Foundation from Educational Commission of Jiangxi Province, China

Science and Technology Research Foundation from Educational Commission of Jiangxi Province, China, Award: GJJ14676