Dryad

Machine learning identifies girls with central precocious puberty based on multi-source data

Cite this dataset

Pan, Liyan; Liu, Guangjian; Mao, Xiaojian; Liang, Huiying (2022). Machine learning identifies girls with central precocious puberty based on multi-source data [Dataset]. Dryad. https://doi.org/10.5061/dryad.bk3j9kd99

Abstract

Objective: The study aimed to develop simplified diagnostic models for identifying girls with central precocious puberty (CPP), without the expensive and cumbersome gonadotropin-releasing hormone (GnRH) stimulation test, which is the gold standard for CPP diagnosis.

Materials and Methods: Female patients who developed secondary sexual characteristics before 8 years of age and underwent a GnRH analog (GnRHa) stimulation test at a medical center in Guangzhou, China were enrolled. Data from clinical visits, laboratory tests, and medical imaging examinations were collected. We first extracted features from unstructured data such as clinical reports and medical images. Then, models based on either single-source or multi-source data were developed with an Extreme Gradient Boosting (XGBoost) classifier to classify patients as CPP or non-CPP.

Results: The model based on multi-source data performed best, achieving an AUC of 0.88 and a Youden index of 0.64. The performance of the single-source model based on basal laboratory tests, together with the feature importance of each variable, showed that the basal hormone test had the highest diagnostic value for CPP.

Conclusion: We developed three simplified models that use easily accessed clinical data before the GnRH stimulation test to identify girls who are at high risk of CPP. These models are tailored to the needs of patients in different clinical settings. Machine learning technologies and multi-source data fusion can help to make a better diagnosis than traditional methods.
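For readers reusing the dataset, the two metrics reported in the Results (AUC and Youden index) can be computed from predicted scores as in the minimal sketch below. It uses only the Python standard library; the function names and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def auc(y_true, y_score):
    """Area under the ROC curve as the share of positive/negative pairs
    in which the positive case has the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_index(y_true, y_score):
    """Return (J, threshold) maximizing J = sensitivity + specificity - 1.
    A case is predicted positive when its score is >= threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_j, best_threshold = -1.0, None
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # on ties, the lowest threshold is kept
            best_j, best_threshold = j, threshold
    return best_j, best_threshold


# Toy example (hypothetical values): 1 = CPP, 0 = non-CPP.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
print(auc(labels, scores))
print(youden_index(labels, scores))
```

The Youden index summarizes sensitivity and specificity at a single cut-off, whereas the AUC is threshold-free, which is why the two are usually reported together.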

Methods

In this study, girls with onset of secondary sexual characteristics before the age of 8 were enrolled from the Pediatric Endocrinology Department of Guangzhou Women and Children's Medical Center. Individuals with genetic disorders, tumors, lesions, McCune-Albright syndrome, neurofibromatosis, ovarian cysts, or other diseases, and those taking hormone medications, were excluded from this study. All patients underwent the GnRHa stimulation test.

This study was approved by the institutional review board of Guangzhou Women and Children’s Medical Center and conducted in accordance with the ethical guidelines of the Declaration of Helsinki of the World Medical Association. The requirement to obtain informed consent was waived because of the retrospective nature of the study. The data used in this study were anonymous, and no identifiable personal data of the patients were available for the analysis.

Rather than using free-text data, we extracted numerical values directly from unstructured reports (such as medical records and medical examination reports).
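The extraction rules themselves are not described here. As a rough illustration of the idea, the sketch below pulls the first number that follows a keyword in a report. The helper name `value_after`, the keywords, and the sample report text are all hypothetical; the original reports were in Chinese, and the same pattern works with Chinese keywords.

```python
import re


def value_after(report, keyword):
    """Return the first number that follows `keyword` in `report`, or None.
    Non-digit characters between the keyword and the number are skipped."""
    pattern = re.escape(keyword) + r"[^\d]*(\d+(?:\.\d+)?)"
    match = re.search(pattern, report)
    return float(match.group(1)) if match else None


# Hypothetical report text, written in English for readability.
report = "Uterine length 2.1 cm. Left ovarian volume 1.8 ml. Bone age 9.5 years."

features = {
    name: value_after(report, name)
    for name in ("Uterine length", "Left ovarian volume", "Bone age", "Right ovarian volume")
}
print(features)
```

A keyword that does not occur in the report yields `None`, which keeps missing measurements explicit for later handling.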

Usage notes

We removed data with a missing rate above 60%. Missing values of continuous variables were filled with the mean value of the samples in the corresponding age group.
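The two preprocessing steps above can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration: the 60% rule is applied per variable (column), age groups are taken as whole years of age, and the column names `lh` and `uterine_volume` are hypothetical.

```python
from collections import defaultdict


def drop_sparse_columns(rows, columns, max_missing=0.6):
    """Keep the columns whose share of missing (None) values is at most max_missing."""
    kept = []
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        if missing / len(rows) <= max_missing:
            kept.append(col)
    return kept


def impute_by_age_group(rows, column, group_key="age_group"):
    """Return copies of rows with missing `column` values replaced by the
    mean of the observed values in the same age group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row.get(column) is not None:
            sums[row[group_key]] += row[column]
            counts[row[group_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)
        group = new_row[group_key]
        if new_row.get(column) is None and counts[group] > 0:
            new_row[column] = sums[group] / counts[group]
        filled.append(new_row)
    return filled


# Hypothetical records: age_group is the age in whole years (an assumption).
rows = [
    {"age_group": 6, "lh": 0.2, "uterine_volume": None},
    {"age_group": 6, "lh": None, "uterine_volume": None},
    {"age_group": 6, "lh": 0.4, "uterine_volume": None},
    {"age_group": 7, "lh": 1.0, "uterine_volume": 3.0},
    {"age_group": 7, "lh": None, "uterine_volume": None},
]

kept = drop_sparse_columns(rows, ["lh", "uterine_volume"])
imputed = impute_by_age_group(rows, "lh")
print(kept)
print([row["lh"] for row in imputed])
```

If an age group has no observed value for a variable, the sketch leaves the entry missing rather than falling back to a global mean; how such cases were handled in the study is not stated.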

Because the reports from medical imaging examinations were written in Chinese, we used only the numerical values extracted from them. The same applies to the medical reports.

Funding

Ministry of Science and Technology of the People's Republic of China, Award: 2018YFC1315402

Guangzhou Women and Children's Medical Center, Award: YIP-2019-064

Guangzhou Women and Children's Medical Center, Award: IP-2019-017

Guangzhou Institute of Pediatrics, Guangzhou Women and Children’s Medical Center, Award: YIP-2019-064