Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions

Atoyebi, Temitope Olufunmi¹; Olanrewaju, Rashidah Funke¹; Blamah, N. V.¹; Uwazie, Emmanuel Chinanu¹

Research facility: Nasarawa State University

Published Nov 10, 2023 on Dryad. https://doi.org/10.5061/dryad.4xgxd25gn

Data files

Nov 10, 2023 version files 157.14 KB

Malaria_iIlment_and_Grading_Dataset.xlsx
148.04 KB
README.md
9.09 KB

Abstract

Malaria is a leading cause of death in the African region. Data mining can help extract valuable knowledge from the data available in the healthcare sector, which makes it possible to train models that predict patient health faster than clinical trials can. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes' theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets from public hospitals, but modeling with the multinomial Naive Bayes algorithm still has limitations. This study applies the MNB model to explore the relationships among 15 relevant attributes of public hospital data. The goal is to examine how dependencies between attributes affect the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The MNB model has 97% accuracy. It is concluded that this model outperforms the Gaussian Naive Bayes (GNB) classifier, which has 100% accuracy, and the random forest (RF), which also has 100% accuracy.

The dataset used in this work supports the detection of malaria in patients based on a range of determinants.

These determinants include the attributes “pregnant”, “treated mosquito net”, “Malaria rate of infection”, “infected mosquito treated net”, and “age”, plus the target attribute “class”. Other attributes cover biological factors, socio-economic realities, and environmental and political factors. The “class” attribute indicates whether the patient has malaria (Yes) or not (No).

This diverse dataset combines numerical and categorical features, which makes it well suited to supervised classification methods. Before a classification method was applied, the dataset was preprocessed to handle missing values and to translate categorical variables into a numerical format. The “position” attribute, originally represented as Yes or No, was converted to a binary value: 1 means Yes and 0 means No.
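The Yes/No to 1/0 conversion described above could look like the following minimal sketch. The helper name `encode_yes_no` and the sample records are hypothetical, and treating missing or unrecognized values as `None` is an assumption, since the source does not say how such values were handled.

```python
# Mapping from the original Yes/No labels to the binary coding in the text.
YES_NO_TO_BINARY = {"yes": 1, "no": 0}


def encode_yes_no(value):
    """Map a Yes/No string to 1/0; return None for missing or unknown values."""
    if value is None:
        return None
    return YES_NO_TO_BINARY.get(str(value).strip().lower())


# Illustrative records only; these are not rows from the dataset.
records = [
    {"age": 34, "position": "Yes"},
    {"age": 7, "position": "no "},
    {"age": 51, "position": None},
]

# Build new records with the "position" attribute converted to 1/0.
encoded = [{**row, "position": encode_yes_no(row["position"])} for row in records]
```

Normalizing case and whitespace before the lookup keeps entries such as "no " from being treated as unknown.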

Naive Bayes classifiers are both a supervised learning technique and a statistical classification technique. They rely on Bayes' theorem, which gives the probability of an event occurring given that another event has already occurred. They assume a probability-based model and quantify the uncertainty of future events in a systematic way by estimating event probabilities. This mechanism has been widely used in disease prediction and diagnosis.

The classifier is simple and is especially suitable when the dimensionality of the input is high.

Despite its simplicity, it can outperform more complex classification methods. It offers a useful perspective for understanding many learning algorithms, and it rests on a simple assumption: when classifying categorical data, the occurrences of each event (attribute) are independent of one another given the class. The model is trained in a supervised manner. Its main advantages are its simplicity and its ability to approximate the probability of a class for a given instance.
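As a small worked illustration of the independence assumption, the sketch below multiplies a class prior by per-attribute likelihoods and normalizes the result to obtain class probabilities. All probabilities, attribute names, and the function name `naive_bayes_posterior` are invented for illustration and are not estimated from the dataset.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(class | observed), treating attributes as independent given the class."""
    unnormalized = {
        label: priors[label]
        * prod(likelihoods[label][attr][value] for attr, value in observed.items())
        for label in priors
    }
    total = sum(unnormalized.values())
    return {label: score / total for label, score in unnormalized.items()}


# Illustrative probabilities only (assumptions, not dataset estimates).
priors = {"malaria": 0.5, "no_malaria": 0.5}
likelihoods = {
    "malaria": {
        "treated_net": {1: 0.2, 0: 0.8},
        "rainy_season": {1: 0.75, 0: 0.25},
    },
    "no_malaria": {
        "treated_net": {1: 0.6, 0: 0.4},
        "rainy_season": {1: 0.5, 0: 0.5},
    },
}

# A patient without a treated net, seen during the heavy rainfall season.
posterior = naive_bayes_posterior(
    priors, likelihoods, {"treated_net": 0, "rainy_season": 1}
)
```

With these numbers the unnormalized scores are 0.5 × 0.8 × 0.75 = 0.3 for malaria and 0.5 × 0.4 × 0.5 = 0.1 for no malaria, so the normalized probability of malaria is 0.3 / 0.4 = 0.75.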

Explaining Column Headers

Pregnancies: Whether the patient is pregnant (1 = pregnant, 0 = not pregnant)
Availability of Treated Net: Whether the patient sleeps under a treated net (1 = yes, 0 = otherwise)
Season: Level of rainfall and stagnant-water breeding (1 = heavy rainfall season, 0 = otherwise)
Rate of Malaria Infection (Lab Diagnosis): The malaria lab test result, as a percentage (%) rating
Malaria Parasite Density Fever- Rapid Diagnostic Test (Strip): Complaints by the patient regarding the severity of the fever
Complaints/Symptoms: Symptoms present, or how the patient feels
Outcome of the Result: Presence of malaria (1 = present, 0 = otherwise)
Electricity: Whether there is power and infrastructure such as a fan or AC to drive away mosquitoes (1 = yes, 0 = otherwise)
Laboratory Equipment: Whether there is functional laboratory equipment (1 = yes, 0 = otherwise)
Doctor to Patient: Whether the patient was attended to by a doctor on time (1 = yes, 0 = otherwise)
Environment - Sanitized or not: Whether the location is sanitized (1 = sanitized, 0 = otherwise)
Complicated/ Uncomplicated Malaria Diagnosis: Grading based on the lab diagnosis (1 = complicated, lab diagnosis greater than 70%; uncomplicated if the lab diagnosis is 70% or less)
Location (Urban/ Rural Area): 1 = urban area, 0 = rural area
Malaria Outcome Interpretation: 1 = patient has malaria, 0 = otherwise
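
The grading rule given for the Complicated/ Uncomplicated Malaria Diagnosis column can be sketched as a small function. The function name `grade_malaria` is hypothetical, and coding the uncomplicated case as 0 is an assumption: the column description states only that complicated cases are coded 1.

```python
def grade_malaria(lab_rate_percent, threshold=70.0):
    """Return 1 (complicated) if the lab rate exceeds the threshold, else 0.

    Coding "uncomplicated" as 0 is an assumption made for this sketch.
    """
    if not 0 <= lab_rate_percent <= 100:
        raise ValueError("lab rate must be a percentage between 0 and 100")
    return 1 if lab_rate_percent > threshold else 0


# Illustrative lab results, not values from the dataset.
grades = [grade_malaria(rate) for rate in (85.0, 70.0, 40.0)]
```

A lab result of exactly 70% falls on the uncomplicated side, matching the "equal to or less than 70%" wording of the column description.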

Description of the data and file structure
A total of 2121 cleaned, preprocessed records were collected and stored in a database. According to Sordo and Zeng (2005), a sample size of roughly 150 to 8500 should be adequate for training, with the testing set adequate for measuring classifier performance (Indira, Vasanthakumari, Jegadeeshwaran, & Sugumaran, 2015). In this study, the model is trained in classification phase 1. The records are expected to reveal positive and negative cases, which are used to train the model in classification phase 2.

The collected data are divided into two portions: one portion is extracted as a training set, and the other is used for testing. The training portion taken from one database table is called training set 1, and the training portion taken from another database table is called training set 2.

The dataset was split into two parts: a training sample containing 70% of the data and a test sample containing the remaining 30%. The models were then trained on the training sample using the MNB classification algorithm implemented in Python. The resulting models were tested on the remaining 30% of the data, and the results were compared with those of the other machine learning models using standard metrics.
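
The 70/30 split and MNB training step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the Laplace smoothing value `alpha=1.0`, the random seed, and the toy records are all hypothetical.

```python
import math
import random
from collections import defaultdict


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and return (train, test) in a 70/30 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 7) // 10
    return shuffled[:cut], shuffled[cut:]


def fit_mnb(rows, alpha=1.0):
    """Fit multinomial naive Bayes on (features, label) pairs with Laplace smoothing."""
    n_features = len(rows[0][0])
    class_counts = defaultdict(int)
    feature_totals = {}
    for features, label in rows:
        class_counts[label] += 1
        totals = feature_totals.setdefault(label, [0.0] * n_features)
        for i, value in enumerate(features):
            totals[i] += value
    model = {}
    for label, count in class_counts.items():
        totals = feature_totals[label]
        denom = sum(totals) + alpha * n_features
        log_prior = math.log(count / len(rows))
        log_probs = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_probs)
    return model


def predict_mnb(model, features):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(v * lp for v, lp in zip(features, log_probs))
    return max(model, key=score)


# Toy records: (treated_net, heavy_rain, sanitized) -> malaria label.
# Illustrative only; these are not rows from the dataset.
positives = [([0, 1, 0], 1)] * 10
negatives = [([1, 0, 1], 0)] * 10

train, test = split_70_30(positives + negatives, seed=0)
model = fit_mnb(train)
accuracy = sum(predict_mnb(model, x) == y for x, y in test) / len(test)
```

The toy records are perfectly separable, so the accuracy computed here says nothing about performance on the real data. The source does not name the library used; scikit-learn's `train_test_split` and `MultinomialNB` offer equivalent functionality.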

Sharing/Access information
Links to other publicly accessible locations of the data:

http://… NOT YET

Data was derived from the following sources:

The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis. The dataset also takes into account geographical location and the socio-economic factors available for patients living in those areas. Multinomial Naive Bayes is the model used to analyze the collected data for malaria prediction and grading.
The model is trained and tested using large datasets from two types of location: urban areas and rural areas.

Code/Software
Python is a high-level, interpreted programming language characterized by its simplicity, readability, and user-friendliness. The language has features for object-oriented computing, control flow statements, functions, data structures, and input/output. It supports both small-scale programming, which can quickly produce throw-away programs, and large-scale programming, which can produce comprehensive, intricate applications. Python is widely popular and is used in fields such as data science, machine learning, web development, and scientific computing. The key features of Python that led to its adoption for this research are listed below:
a. Easy to Learn and Use: Python is easy to learn and use. It has a simple syntax that is easy to read and understand, which makes it an ideal language for beginners.
b. Interpreted Language: Python is an interpreted language, which means that you don’t need to compile your code before executing it. This makes it easy to test and debug your code.
c. Object-Oriented: Python is an object-oriented programming language, which means that it allows you to create classes and objects that encapsulate data and behavior.
d. Cross-Platform: Python is a cross-platform language, which means that it can run on different operating systems such as Windows, Mac OS, and Linux.
e. Large Standard Library: Python comes with a large standard library that provides support for various programming tasks such as working with files, network programming, and more.
f. Third-Party Libraries: Python has a vast collection of third-party libraries that are available for various tasks such as data analysis, scientific computing, machine learning, web development, and more.
g. Dynamically Typed: Python is a dynamically typed language, which means that you don’t need to declare the data type of a variable before using it. This makes it easy to write and modify code quickly.

Experimental and Model Requirements

The type of equipment and materials required depends on the nature of the experiment. This phase covers the specifications and constraints that must be considered when designing a product, system, or process. The performance requirements describe the specific criteria that the design must meet, which is essential for ensuring that experiments and designs are conducted in a methodical and efficient way. The model development requirements for this research fall into two main parts, software and hardware, which are outlined below.

i. Software / Model Requirements
This refers to the functional specifications that describe what the developed model should do, together with the constraints and limitations it must adhere to. In other words, these are the specifications and constraints considered when building a machine learning model. The software used is listed below:
Python version 3.8
Anaconda (Jupyter)
Microsoft Excel (2016)

ii. Hardware
HP, Lenovo, or Mac computer with an Intel processor (4 GHz), 8 GB RAM, and a 64-bit operating system

Prior to data collection, the researcher was guided by ethical training certification on data collection and on the right to confidentiality and privacy, under the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: the first, data1, contains data for phase 1 of the classification, and the second, data2, contains data for phase 2 of the classification.
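
The two-table layout described above can be sketched as follows. The source stores the records in MySQL; this sketch substitutes Python's built-in `sqlite3` module with an in-memory database so that it runs without a server. The column names (`patient_id`, `outcome`, `lab_rate_percent`), the link between the tables through `patient_id`, and the rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named "malaria" (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# data1 holds records for classification phase 1; data2 holds records for phase 2.
# Column names are hypothetical and chosen only for illustration.
cur.execute("CREATE TABLE data1 (patient_id INTEGER PRIMARY KEY, outcome INTEGER)")
cur.execute(
    "CREATE TABLE data2 (patient_id INTEGER PRIMARY KEY, lab_rate_percent REAL)"
)

# Illustrative rows, not taken from the dataset.
cur.executemany("INSERT INTO data1 VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
cur.executemany("INSERT INTO data2 VALUES (?, ?)", [(1, 82.5), (3, 40.0)])
conn.commit()

# Count the positive records available for phase 1.
positive_count = cur.execute(
    "SELECT COUNT(*) FROM data1 WHERE outcome = 1"
).fetchone()[0]

# Fetch the phase 2 records for patients who were positive in phase 1.
phase2_rows = cur.execute(
    "SELECT d2.patient_id, d2.lab_rate_percent "
    "FROM data2 AS d2 JOIN data1 AS d1 ON d1.patient_id = d2.patient_id "
    "WHERE d1.outcome = 1 ORDER BY d2.patient_id"
).fetchall()

conn.close()
```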

Data Source Collection

The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis. The dataset also takes into account geographical location and the socio-economic factors available for patients living in those areas. Multinomial Naive Bayes is the model used to analyze the collected data for malaria prediction and grading.

Data Preprocessing:

Data preprocessing is performed to remove noise and outliers.

Transformation:

The data are transformed from analog (paper) records to electronic records.

Data Partitioning

The collected data are divided into two portions: one portion is extracted as a training set, and the other is used for testing. The training portion taken from one database table is called training set 1, and the training portion taken from another database table is called training set 2.

Classification and prediction:

Based on the nature of the variables in the dataset, this study uses the multinomial Naïve Bayes classification technique in two phases: classification phase 1 and classification phase 2. The framework operates as follows:

i. Data collection and preprocessing are carried out.

ii. The preprocessed data are stored in training set 1 and training set 2. These datasets are used during classification.

iii. The test dataset is stored in the database as the test dataset.

iv. Part of the test dataset is classified using classifier 1, and the remaining part is classified using classifier 2, as follows:

Classifier phase 1: It classifies records into positive or negative classes. A patient who has malaria is classified as positive (P), and a patient who does not have malaria is classified as negative (N).

Classifier phase 2: It classifies only the records that classifier 1 has labelled positive, and further classifies them into complicated and uncomplicated class labels. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, supply their values.
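
The two-phase flow described above can be sketched as a small pipeline in which only records labelled positive in phase 1 reach phase 2. The function names, record fields, and rule-based stand-ins below are hypothetical; in the study both phases are trained classifiers, and the stubs here exist only to make the control flow runnable.

```python
def two_phase_classify(record, phase1, phase2):
    """Run phase 1; pass the record to phase 2 only if it is labelled positive."""
    if phase1(record) != "P":
        return {"malaria": "N", "grade": None}
    return {"malaria": "P", "grade": phase2(record)}


def phase1_stub(record):
    """Stand-in for classifier 1: positive (P) or negative (N)."""
    return "P" if record["outcome"] == 1 else "N"


def phase2_stub(record):
    """Stand-in for classifier 2: complicated or uncomplicated."""
    return "complicated" if record["lab_rate_percent"] > 70 else "uncomplicated"


# Illustrative records, not rows from the dataset.
records = [
    {"outcome": 1, "lab_rate_percent": 85.0},
    {"outcome": 1, "lab_rate_percent": 55.0},
    {"outcome": 0, "lab_rate_percent": 10.0},
]

results = [two_phase_classify(r, phase1_stub, phase2_stub) for r in records]
```

Passing the two classifiers in as arguments keeps the routing logic separate from the models, so trained classifiers can replace the stubs without changing `two_phase_classify`.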