Data from: Hybrid machine learning approach to zero-inflated data improves accuracy of dengue prediction
Data files
Dec 04, 2025 version files 31.29 MB
- README.md (4.08 KB)
- Weekly_dengue_incidence_env_data.csv (31.28 MB)
Abstract
Background
Spatiotemporal dengue forecasting using machine learning (ML) can contribute to the development of prevention and control strategies for impending dengue outbreaks. However, training data for dengue incidence may be inflated with frequent zero values because of the rarity of cases, which lowers the prediction accuracy. This study aimed to understand the influence of spatiotemporal resolutions of training data on the accuracy of dengue incidence prediction using ML models, to understand how the influence of spatiotemporal resolution differs between quantitative and qualitative predictions of dengue incidence, and to improve the accuracy of dengue incidence prediction with zero-inflated data.
Methodology
We predicted dengue incidence at six spatiotemporal resolutions and compared their prediction accuracy. Six ML algorithms were compared: generalized additive models, random forests, conditional inference forest (CIF), artificial neural networks, support vector machines and regression, and extreme gradient boosting. Data from 2009 to 2012 were used for training, and data from 2013 were used for model validation with quantitative and qualitative dengue variables. To address the inaccuracy in the quantitative prediction of dengue incidence due to zero-inflated data at fine spatiotemporal scales, we developed a hybrid approach in which the second-stage quantitative prediction is performed only when/where the first-stage qualitative model predicts the occurrence of dengue cases.
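The hybrid approach described above can be sketched as a two-stage prediction. This is a minimal illustration and not the authors' implementation: the names `hybrid_predict`, `occurs`, `incidence`, and `absent_value` are hypothetical, the toy models are hand-written rules rather than fitted ML models, and filling `0.0` where no cases are predicted is an assumption.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def hybrid_predict(
    occurs: Callable[[Features], bool],
    incidence: Callable[[Features], float],
    rows: Sequence[Features],
    absent_value: float = 0.0,  # assumption: "no cases" is reported as 0.0
) -> List[float]:
    """Run the quantitative model only where the qualitative model
    predicts that dengue cases occur."""
    predictions: List[float] = []
    for row in rows:
        if occurs(row):
            # Second stage: quantitative prediction of incidence.
            predictions.append(incidence(row))
        else:
            # First stage predicted no occurrence; skip the second stage.
            predictions.append(absent_value)
    return predictions


# Hypothetical stand-ins for trained first- and second-stage models.
def toy_classifier(row: Features) -> bool:
    return row[0] > 25.0


def toy_regressor(row: Features) -> float:
    return 0.5 * row[0]


toy_rows = [(20.0,), (30.0,)]
toy_predictions = hybrid_predict(toy_classifier, toy_regressor, toy_rows)
```

Any classifier and regressor can be passed in as callables, so the same wrapper would work with whichever of the six algorithms performs best at each stage.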
Principal Findings
At higher resolutions, the dengue incidence data were zero-inflated, leaving too little information for ML to extract quantitative relationships between dengue incidence and environmental variables. Qualitative models, in which dengue occurrence is treated as a binary variable, eased the effect of the data distribution. Our novel hybrid approach of combining qualitative and quantitative predictions demonstrated high potential for predicting zero-inflated or rare phenomena, such as dengue.
Significance
Our research contributes valuable insights to the field of spatiotemporal dengue prediction and provides a novel solution to enhance prediction accuracy in zero-inflated data where hurdle or zero-inflated models cannot be applied.
https://doi.org/10.5061/dryad.x3ffbg7ss
This dataset contains weekly log-transformed dengue incidence and environmental data for each village in Metropolitan Manila, Philippines, from January 2009 to December 2013.
Description of the data and file structure
Weekly_dengue_incidence_env_data.csv
Column names are abbreviated; their descriptions are given below.
| Column name | Variable name |
|---|---|
| ID | Numeric identification of the observation |
| Year | Year of the observed data |
| week | Week of the observed data |
| City | City name of the observation |
| Village | Village name of the observation |
| LogDenIncd | Log transformed dengue incidence |
| tmin_L16 | Minimum land surface temperature (ºC) at lag 16 |
| tmax_L17 | Maximum land surface temperature (ºC) at lag 17 |
| tmean_L17 | Mean land surface temperature (ºC) at lag 17 |
| ndvi_L0 | Vegetation Index at lag 0 |
| gpm_L6 | Precipitation (mm/h) at lag 6 |
| nw_L5 | Northward wind speed (m/s) at lag 5 |
| ew_L11 | Eastward wind speed (m/s) at lag 11 |
| rh_L4 | Relative humidity (%) at lag 4 |
| Agr | Percentage of agricultural areas (%) |
| Gra | Percentage of grasslands (%) |
| For | Percentage of forest lands (%) |
| Wat | Percentage of surface covered by water bodies (%) |
| Ope | Percentage of open spaces (%) |
| Par | Percentage of parks and recreation areas (%) |
| Edu | Percentage of educational areas (%) |
| Hea | Percentage of health-related areas (%) |
| Cem | Percentage of cemetery areas (%) |
| Mil | Percentage of military areas (%) |
| Gov | Percentage of government areas (%) |
| Ind | Percentage of industrial areas (%) |
| Com | Percentage of commercial areas (%) |
| Tra | Percentage of transportation areas (%) |
| Inf. | Percentage of informal settlement areas (%) |
| Vlo | Percentage of very low residential density areas (%) |
| Low | Percentage of low residential density areas (%) |
| Med | Percentage of medium residential density areas (%) |
| Hig | Percentage of high residential density areas (%) |
| Vhi | Percentage of very high residential density areas (%) |
| RND | Road network density (m/m2) |
| FloodRisk | Level of flood risk |
The lag in each environmental variable name (for example, `_L16` for lag 16) comes from an analysis that aimed to identify the lag at which each factor correlates most strongly with dengue incidence. The resulting best lags are presented in S2 Table of the related article and in the variable name column above.
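As a hedged sketch of how the file might be read, the snippet below uses only the Python standard library to load the CSV, recover the lag from column names such as `tmin_L16`, and split rows into the 2009 to 2012 training years and the 2013 validation year described in the abstract. The helper names (`parse_lag`, `split_by_year`, `load_rows`) are hypothetical, and the code assumes that `Year` is stored as an integer string and that the file opens with the default text encoding.

```python
import csv
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Column names such as "tmin_L16" end in "_L" followed by the lag.
LAG_PATTERN = re.compile(r"^(?P<name>.+)_L(?P<lag>\d+)$")


def parse_lag(column: str) -> Optional[Tuple[str, int]]:
    """Return (base name, lag) for lagged columns, or None otherwise."""
    match = LAG_PATTERN.match(column)
    if match is None:
        return None
    return match.group("name"), int(match.group("lag"))


def split_by_year(
    rows: Iterable[Dict[str, str]], last_training_year: int = 2012
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split rows into training (<= last_training_year) and validation."""
    train: List[Dict[str, str]] = []
    validation: List[Dict[str, str]] = []
    for row in rows:
        # Assumption: "Year" parses as an integer, e.g. "2009".
        if int(row["Year"]) <= last_training_year:
            train.append(row)
        else:
            validation.append(row)
    return train, validation


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read the whole CSV into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

With the data file in the working directory, `train, validation = split_by_year(load_rows("Weekly_dengue_incidence_env_data.csv"))` would produce the two subsets. All values are returned as strings, so numeric columns such as `LogDenIncd` need converting with `float` before modelling.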
Code/Software
All satellite data (e.g., precipitation, temperature, vegetation, relative humidity, wind) were acquired using the Google Earth Engine (GEE) Code Editor platform. The GEE Code Editor is a web-based integrated development environment for writing and running JavaScript scripts to support geospatial analysis.
Additional processing to fill missing pixels in both land surface temperature and NDVI raster datasets was performed using the locally weighted regression method in GRASS GIS, version 7.8.3.
- Francisco, Micanaldo Ernesto; Carvajal, Thaddeus M.; Watanabe, Kozo (2024). Hybrid Machine Learning Approach to Zero-Inflated Data Improves Accuracy of Dengue Prediction. PLOS Neglected Tropical Diseases. https://doi.org/10.1371/journal.pntd.0012599
