A data-driven supervised machine learning approach to estimating global ambient air pollution concentrations with associated prediction intervals

Berrisford, Liam; Barbosa, Hugo; Menezes, Ronaldo

Published Jul 18, 2025 on Dryad. https://doi.org/10.5061/dryad.cfxpnvxg2

Data files

Jul 18, 2025 version files 164.70 GB

Abstract

Global ambient air pollution, a transboundary challenge, is typically addressed through interventions relying on data from spatially sparse and heterogeneously placed monitoring stations. These stations often encounter temporal data gaps due to issues such as power outages. In response, we have developed a scalable, data-driven, supervised machine learning framework. The models produced by the framework are designed to impute missing temporal and spatial measurements, thereby generating a comprehensive dataset for air pollutants including NO2, O3, PM10, PM2.5, and SO2. In this work we produce models providing concentration estimations at 261,377 locations across the globe. The dataset, with a fine granularity of 0.25° spatial resolution at hourly time intervals and accompanied by prediction intervals for each estimate, caters to a wide range of stakeholders relying on outdoor air pollution data for downstream assessments. This enables more detailed studies. Additionally, the model’s performance across various geographical locations is examined, providing insights and recommendations for strategic placement of future monitoring stations to further enhance the model’s accuracy.

Summary

This repository contains the dataset and code used for the manuscript "A data-driven supervised machine learning approach to estimating global ambient air pollution concentrations with associated prediction intervals." The study developed a supervised machine learning framework to predict hourly air pollution concentrations for NO2, O3, PM10, PM2.5, and SO2 at a 0.25° spatial resolution globally. The dataset includes the raw pollutant concentrations target vector, feature vectors (temporal, meteorological, remote sensing, and emissions), and the resulting predictions with associated prediction intervals.

File Structure

The file structure for this repository is as follows:

Output_Data_for_A_Data-Driven_Supervised_Machine_Learning_Approach_to_Estimating_Global_Ambient_Air_Pollution_Concentrations_With_Associated_Prediction_Intervals.zip: This folder contains all of the output data produced by the model. The name structure followed is "month-day-year hour.feather" for all of the hours during 2022. For example, the filename for the air pollution concentrations at 8 AM on 4th January 2022 would be "01-04-2022 080000.feather".

Output Directory

The Output/ directory contains the predicted hourly air pollution concentration datasets. Each file corresponds to a specific hour of a given date and includes modelled values for a range of pollutants and prediction quantiles.

File Naming Convention

Files are named using the following format:

DD-MM-YYYY HHMMSS.feather

DD-MM-YYYY: The date of the prediction (e.g., 01-01-2022)
HHMMSS: The time of day (in 24-hour format), representing the hour for which the predictions apply
Example: 000000 = 00:00, 010000 = 01:00, etc.
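
The naming convention above can be turned into a small helper for building and parsing file names. This is a minimal sketch that assumes the DD-MM-YYYY HHMMSS pattern exactly as stated in this section; the function names are hypothetical and not part of the repository code.

```python
from datetime import datetime

# strftime/strptime pattern for "DD-MM-YYYY HHMMSS" (assumed from the text above).
FILENAME_FORMAT = "%d-%m-%Y %H%M%S"


def output_filename(timestamp: datetime) -> str:
    """Return the expected file name for the hour containing `timestamp`."""
    # Files are hourly, so minutes and seconds are truncated to the hour start.
    hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
    return hour_start.strftime(FILENAME_FORMAT) + ".feather"


def parse_output_filename(filename: str) -> datetime:
    """Recover the timestamp encoded in an output file name."""
    stem, _, extension = filename.rpartition(".")
    if extension != "feather":
        raise ValueError(f"not a .feather file: {filename!r}")
    return datetime.strptime(stem, FILENAME_FORMAT)
```

For instance, output_filename(datetime(2022, 1, 1, 1)) gives "01-01-2022 010000.feather", matching the second row of the example table.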

Example Files

Filename	Date	Time (UTC)	Description
01-01-2022 000000.feather	1st Jan 2022	00:00	Hourly prediction for 1 Jan 2022 at 00:00 UTC
01-01-2022 010000.feather	1st Jan 2022	01:00	Hourly prediction for 1 Jan 2022 at 01:00 UTC
01-01-2022 020000.feather	1st Jan 2022	02:00	Hourly prediction for 1 Jan 2022 at 02:00 UTC
...	...	...	...

File Contents

Each .feather file contains structured predictions across the global 0.25° resolution grid for the given hour. Variables typically include:

Pollutant Predictions:
NO₂, O₃, PM₁₀, PM₂.₅, SO₂
For each pollutant:

- Mean estimate
- 5th percentile estimate
- 50th percentile (median) estimate
- 95th percentile estimate

Geographic Identifiers:
Latitude
Longitude
Global Model Grid ID (Unique cell identifier)

These files allow users to conduct hourly air pollution analyses at global scale, including spatially explicit exposure estimates, temporal trends, and quantile-based risk assessments.

Models_for_A_Data-Driven_Supervised_Machine_Learning_Approach_to_Estimating_Global_Ambient_Air_Pollution_Concentrations_With_Associated_Prediction_Intervals.zip:

global_models: Models produced during the study are grouped by experiment type (spatial, spatial_kfold, temporal), with the subdirectory name indicating the particular data that was excluded for each model.
global_auxiliary_models_quantile_regression: The supporting models were created to provide quantile regression estimates for all predictions.

Models Directory

The Models/ directory contains the trained machine learning models used to estimate air pollution concentrations globally. Models are organised by quantile (e.g., 0.05, 0.5, 0.95) or mean, pollutant type, spatial/temporal focus, and training data scope.

Each model is saved in two parts: a .pkl file for the model itself and a .csv with the results associated with that model. Within the Model directory, .csv files (LOOV_station) are also provided, documenting the monitoring stations that were excluded from the complete dataset for the experiment described in that directory. These stations constitute the Leave-One-Out Validation set.

Example Filepath

global_auxiliary_models_quantile_regression/0.5/temporal/models_global/no2/2014_2022_temporal/

This directory contains the median (0.5 quantile) prediction model for Nitrogen Dioxide (NO₂), trained on data spanning 2014–2022, using global station data for temporal generalisation.

Path Component	Description
`global_auxiliary_models_quantile_regression`	Directory containing quantile regression models for global air pollution.
`0.5`	Quantile of prediction (e.g., 0.05 = 5th percentile, 0.5 = median, 0.95 = 95th).
`temporal`	Indicates model was trained for temporal generalisation.
`models_global`	Trained using monitoring stations from all global regions.
`no2`	Pollutant type (Nitrogen Dioxide).
`2014_2022_temporal`	Time period for training data and temporal generalisation strategy.
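
The path components in the table above can be assembled or split programmatically. The following is a minimal sketch using only the standard library; the function names and dictionary keys are hypothetical labels chosen for illustration, and the six-level layout is assumed from the single example filepath given here.

```python
from pathlib import PurePosixPath

# Hypothetical labels for the six path levels shown in the example filepath.
COMPONENT_NAMES = ("root", "quantile", "experiment", "scope", "pollutant", "run")


def model_directory(quantile, experiment, scope, pollutant, run,
                    root="global_auxiliary_models_quantile_regression"):
    """Build a model directory path from its components."""
    return PurePosixPath(root, str(quantile), experiment, scope, pollutant, run)


def describe_model_directory(path):
    """Split a model directory path into named components."""
    parts = PurePosixPath(path).parts  # a trailing slash is ignored
    if len(parts) != len(COMPONENT_NAMES):
        raise ValueError(f"expected {len(COMPONENT_NAMES)} path components, got {len(parts)}")
    return dict(zip(COMPONENT_NAMES, parts))
```

Calling describe_model_directory on the example filepath returns a dictionary with "quantile" set to "0.5" and "pollutant" set to "no2".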

Feature_and_Target_Vector_Input_Data_for_A_Data-Driven_Supervised_Machine_Learning_Approach_to_Estimating_Global_Ambient_Air_Pollution_Concentrations_With_Associated_Prediction_Intervals.zip: The directory contains the .feather files for the training data for all of the different air pollutants covered in the study.

Core Dataset Files

The following .feather files contain the processed datasets used for model training and evaluation. Each file corresponds to a specific air pollutant and contains the final structured dataset with feature vectors and target values at monitoring station locations.

Filename	Pollutant	Description
`core_dataset_no2.feather`	Nitrogen Dioxide (NO2)	Contains hourly NO2 measurements and corresponding features.
`core_dataset_o3.feather`	Ozone (O3)	Contains hourly O3 measurements and corresponding features.
`core_dataset_pm10.feather`	Particulate Matter ≤10 µm (PM10)	Contains hourly PM10 measurements and corresponding features.
`core_dataset_pm25.feather`	Particulate Matter ≤2.5 µm (PM2.5)	Contains hourly PM2.5 measurements and corresponding features.
`core_dataset_so2.feather`	Sulphur Dioxide (SO2)	Contains hourly SO2 measurements and corresponding features.

Each file is saved in the Apache Arrow Feather format, enabling efficient loading and processing within Python using Pandas or PyArrow. These datasets are typically used to train and evaluate machine learning models for air pollution prediction.
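
As an illustration of how a core dataset might be prepared for training, the sketch below separates the target vector from the feature vectors. It assumes the target column follows the "<pollutant> Measurement" naming used in the Training Data Variables table of this README and that every other column is a feature; both are assumptions that should be checked against the actual files. The function names are hypothetical.

```python
import pandas as pd

# File-name token -> column prefix. PM2.5 appears as "pm25" in file names
# but as "pm2.5" in the variable tables of this README.
POLLUTANT_COLUMN_PREFIX = {
    "no2": "no2", "o3": "o3", "pm10": "pm10", "pm25": "pm2.5", "so2": "so2",
}


def core_dataset_filename(pollutant: str) -> str:
    """Return the core dataset file name for a pollutant token such as 'pm25'."""
    return f"core_dataset_{pollutant}.feather"


def split_features_target(df: pd.DataFrame, pollutant: str):
    """Drop rows without a measurement, then split into features and target."""
    target_column = f"{POLLUTANT_COLUMN_PREFIX[pollutant]} Measurement"  # assumed name
    labelled = df.dropna(subset=[target_column])
    features = labelled.drop(columns=[target_column])
    target = labelled[target_column]
    return features, target


# Usage with a real file (reading Feather requires pyarrow to be installed):
# df = pd.read_feather(core_dataset_filename("pm25"))
# X, y = split_features_target(df, "pm25")
```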

Note on File and Folder Names:
Some of the files and folders in this dataset contain long names that may exceed path length limits on certain operating systems, particularly Windows. If you encounter issues accessing or extracting these files due to path length constraints, you may rename the files and folders locally. Renaming does not affect the reproducibility or analysis of the data.

Overview

Addressing the global challenge of ambient air pollution requires scalable and high-resolution data. However, many regions lack a comprehensive monitoring infrastructure. This research introduces a machine learning framework that extends air quality estimation globally, using remote sensing, meteorological reanalysis, and emissions datasets to produce hourly pollutant concentrations at a 0.25-degree spatial resolution across the globe.

This work builds on the LightGBM-based framework introduced in "A framework for scalable ambient air pollution concentration estimation", adapting it for a global context. LightGBM, a fast and accurate gradient-boosted decision tree algorithm, enables this large-scale application with high predictive performance and interpretability.

Dataset Purpose

This dataset empowers diverse stakeholders, including researchers, policymakers, urban planners, and public health authorities, providing a robust basis for conducting air quality assessments and interventions at unprecedented resolution. The improved granularity facilitates more precise studies into air pollution impacts on human health, urban resilience, and environmental justice, surpassing the capabilities of conventional lower-resolution approaches. The framework’s computational efficiency and scalability further demonstrate its potential applicability for similar pollution estimation challenges globally, especially in regions with limited observational infrastructure.

The dataset is readily accessible and stored online, ensuring rapid retrieval and ease of use for various analytical and operational purposes. Users can confidently perform high-resolution analyses supported by validated machine-learning-driven estimates, thereby enhancing informed decision-making and targeted interventions aimed at reducing air pollution exposure and promoting sustainable urban environments.

For questions regarding this dataset, please reach out to Liam J Berrisford.

Models Datasets

All of the models used in the research are included within this dataset. The models included have been trained to predict the mean air pollution concentration and the 5th / 50th (median) / 95th percentile of the air pollution concentration. The model is saved as the LightGBM booster within a .txt file, and the model parameters are kept within a .json file. The code that is used to recreate a LightGBM object within Python that can be used to make predictions is available via the Environment Insights Python package. The directory structure provides details about whether all of the monitoring stations were used for the models, or whether all of the feature vectors were used for the respective model.

Air Pollution (Target Vector) Dataset Overview

The dataset includes hourly estimates for key ambient air pollutants: Nitrogen Dioxide (NO2), Ozone (O3), Particulate Matter with a diameter of 10 μm or less (PM10), Particulate Matter with a diameter of 2.5 μm or less (PM2.5), and Sulphur Dioxide (SO2). Rigorous validation was performed to assess the accuracy of model predictions for forecasting air pollution concentrations, estimating values at previously unmeasured locations, and capturing extreme pollution episodes.

Estimates were produced at the mean, as well as at the 0.05, 0.5, and 0.95 quantiles, to capture both the typical and extreme variations in air pollution concentrations. The mean provides a central tendency estimate, useful for general assessments and long-term policy planning. Quantile-based predictions, specifically at the 0.05, 0.5, and 0.95 quantiles, offer deeper insights into the variability and uncertainty inherent in pollution concentration estimates. These quantile predictions enable stakeholders to understand the range and extremes of pollutant levels, supporting risk assessments and targeted interventions, such as public health advisories or emergency response planning during pollution peaks.
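
To make the role of the quantile estimates concrete, the sketch below shows two standard diagnostics for quantile predictions: the pinball (quantile) loss and the empirical coverage of the interval between the 0.05 and 0.95 quantiles, which is nominally 90%. These are generic textbook definitions written in plain Python; this README does not state which metrics the authors used, so the code should not be read as their evaluation procedure.

```python
def pinball_loss(observed, predicted, quantile):
    """Mean pinball loss of `predicted` as an estimate of the given quantile."""
    total = 0.0
    for y, y_hat in zip(observed, predicted):
        error = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(observed)


def interval_coverage(observed, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return inside / len(observed)
```

For a well-calibrated model, interval_coverage computed with the 0.05 and 0.95 quantile predictions as bounds should be close to 0.9 on held-out measurements.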

Feature Vector Dataset Overview

The dataset developed in this study includes a comprehensive set of feature vectors used to estimate ambient air pollution concentrations across the globe. Feature vectors represent environmental conditions and phenomena known to influence air pollutant concentrations, including meteorological variables (e.g., wind speed and temperature), emissions from various human activities (e.g., traffic intensity, industrial processes), and remotely sensed air pollution measurements. Each feature was selected based on established scientific evidence linking these variables to the formation, dispersion, and accumulation of air pollutants. Incorporating such a diverse and detailed set of features enables the machine learning model to robustly capture complex spatial and temporal variations in air quality, ultimately improving the accuracy and applicability of pollution estimates.

Data Description

Data type	Point Estimates
Projection	EPSG:4326 WGS 84 (latitude/longitude)
Horizontal coverage	Global
Horizontal resolution	~0.25° (approx. 25 km at equator)
Vertical coverage	Surface only
Vertical resolution	Single layer
Temporal coverage	2022
Temporal resolution	Hourly
File format	NetCDF
Update frequency	Static

Coordinate Variables

Name	Units	Description
Timestamp	N/A	Time coordinate (hourly resolution)
Longitude	Degrees East	Longitude of centroid (EPSG:4326)
Latitude	Degrees North	Latitude of centroid (EPSG:4326)

Each NetCDF file is indexed by (Timestamp, Latitude, Longitude).

Data Variables

Output Variables

Name	Units	Description
Global Model Grid ID	–	Unique identifier for each grid cell in the global model. Synthetic monitoring station locations are the grid centroids.
no2 Prediction 0.05 Quantile	µg/m³	Estimated 5th percentile of modelled NO2 concentration.
no2 Prediction 0.5 Quantile	µg/m³	Estimated 50th percentile (median) of modelled NO2 concentration.
no2 Prediction 0.95 Quantile	µg/m³	Estimated 95th percentile of modelled NO2 concentration.
no2 Prediction Mean	µg/m³	Estimated mean of modelled NO2 concentration.
o3 Prediction 0.05 Quantile	µg/m³	Estimated 5th percentile of modelled O3 concentration.
o3 Prediction 0.5 Quantile	µg/m³	Estimated 50th percentile (median) of modelled O3 concentration.
o3 Prediction 0.95 Quantile	µg/m³	Estimated 95th percentile of modelled O3 concentration.
o3 Prediction Mean	µg/m³	Estimated mean of modelled O3 concentration.
pm10 Prediction 0.05 Quantile	µg/m³	Estimated 5th percentile of modelled PM10 concentration.
pm10 Prediction 0.5 Quantile	µg/m³	Estimated 50th percentile (median) of modelled PM10 concentration.
pm10 Prediction 0.95 Quantile	µg/m³	Estimated 95th percentile of modelled PM10 concentration.
pm10 Prediction Mean	µg/m³	Estimated mean of modelled PM10 concentration.
pm2.5 Prediction 0.05 Quantile	µg/m³	Estimated 5th percentile of modelled PM2.5 concentration.
pm2.5 Prediction 0.5 Quantile	µg/m³	Estimated 50th percentile (median) of modelled PM2.5 concentration.
pm2.5 Prediction 0.95 Quantile	µg/m³	Estimated 95th percentile of modelled PM2.5 concentration.
pm2.5 Prediction Mean	µg/m³	Estimated mean of modelled PM2.5 concentration.
so2 Prediction 0.05 Quantile	µg/m³	Estimated 5th percentile of modelled SO2 concentration.
so2 Prediction 0.5 Quantile	µg/m³	Estimated 50th percentile (median) of modelled SO2 concentration.
so2 Prediction 0.95 Quantile	µg/m³	Estimated 95th percentile of modelled SO2 concentration.
so2 Prediction Mean	µg/m³	Estimated mean of modelled SO2 concentration.
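
The output variable names in the table above follow a regular pattern, which makes it straightforward to select columns and derive the width of the prediction interval. The sketch below assumes the column names in the files match this table exactly; the helper names are hypothetical. Because it only uses indexing and subtraction, the same function works on a plain dictionary of values or on a pandas DataFrame loaded from one of the output files.

```python
POLLUTANTS = ("no2", "o3", "pm10", "pm2.5", "so2")


def prediction_column(pollutant, statistic):
    """Return the column name for 'Mean' or for a quantile (0.05, 0.5, 0.95)."""
    if statistic == "Mean":
        return f"{pollutant} Prediction Mean"
    return f"{pollutant} Prediction {statistic} Quantile"


def interval_width(data, pollutant):
    """Width of the 0.05 to 0.95 prediction interval, in µg/m³.

    `data` can be any mapping from column name to values, for example a
    dict of floats or a pandas DataFrame (which returns a Series).
    """
    lower = data[prediction_column(pollutant, 0.05)]
    upper = data[prediction_column(pollutant, 0.95)]
    return upper - lower
```

Wider intervals indicate grid cells or hours where the model is less certain about the estimated concentration.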

Input Variables

Name	Units	Description
100m U Component of Wind	m/s	East–west wind component at 100 m above ground level.
100m V Component of Wind	m/s	North–south wind component at 100 m above ground level.
10m U Component of Wind	m/s	East–west wind component at 10 m above ground level.
10m V Component of Wind	m/s	North–south wind component at 10 m above ground level.
2m Dewpoint Temperature	K	Temperature at which air becomes saturated, measured at 2 m above ground level.
2m Temperature	K	Air temperature at 2 m above ground level.
Boundary Layer Height	m	Height of the atmospheric boundary layer above ground level.
Downward UV Radiation at Surface	W/m²	Downward ultraviolet radiant flux received at Earth’s surface.
Instantaneous 10m Wind Gust	m/s	Peak wind gust speed observed at 10 m AGL over a short time interval.
Surface Pressure	hPa	Atmospheric pressure at ground level.
Total Column Rain Water	kg/m²	Vertically integrated amount of rain water in a column of air above the surface.
S5P NO₂	mol/m²	Tropospheric column amount of nitrogen dioxide (NO₂) from Sentinel‑5P.
S5P Absorbing Aerosol Index	-	Absorbing Aerosol Index (AAI), indicating the presence of UV-absorbing aerosols in the atmosphere.
S5P CO	mol/m²	Total column amount of carbon monoxide (CO) retrieved from Sentinel‑5P.
S5P O₃	mol/m²	Total column ozone (O₃) retrieved by Sentinel‑5P.
Anthropogenic Emissions Sum Sectors co	kilotonne	Total anthropogenic CO emissions from all sectors.
Anthropogenic Emissions Sum Sectors nox	kilotonne	Total anthropogenic NOₓ emissions from all sectors.
Anthropogenic Emissions Sum Sectors nmvocs	kilotonne	Total anthropogenic non-methane volatile organic compound emissions from all sectors.
Anthropogenic Emissions Sum Sectors other-vocs	kilotonne	Total anthropogenic emissions of other volatile organic compounds from all sectors.
Anthropogenic Emissions Sum Sectors so2	kilotonne	Total anthropogenic SO₂ emissions from all sectors.
Biogenic Emissions Biogenic CO	kilotonne	Total biogenic CO emissions.
Timestamp Local	N/A	Local timestamp (adjusted using UTC offset).
UTC Offset	hours	Offset from UTC time (in hours).
Month Number	-	Integer representing the month, from 1 (January) to 12 (December).
Week Number	-	Integer denoting the ISO week number (1–53).
Day of Week Number	-	Integer representing the weekday, from 0 (Monday) to 6 (Sunday).
Hour Number	-	Hour of the day on a 24-hour clock, from 0 (midnight) to 23 (11 pm).
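
The temporal variables at the end of the table above can be derived from a UTC timestamp and a UTC offset. The following is a minimal sketch using the standard library; it assumes the month, week, weekday, and hour numbers are computed from the local timestamp, which this README does not state explicitly, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def temporal_features(timestamp_utc: datetime, utc_offset_hours: float) -> dict:
    """Derive the temporal feature values for one timestamp."""
    # Shift the UTC time by the offset to obtain local time.
    local = timestamp_utc + timedelta(hours=utc_offset_hours)
    return {
        "Timestamp Local": local,
        "UTC Offset": utc_offset_hours,
        "Month Number": local.month,               # 1 (January) to 12 (December)
        "Week Number": local.isocalendar()[1],     # ISO week, 1 to 53
        "Day of Week Number": local.weekday(),     # 0 (Monday) to 6 (Sunday)
        "Hour Number": local.hour,                 # 0 to 23
    }
```

Note that ISO week numbering can assign early January dates to the last week of the previous year: 2 January 2022 falls in ISO week 52.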

Training Data Variables

Alongside the variables stored in the NetCDF files across this subdirectory, each NetCDF file has additional attributes detailing the given station's official Site Name, Site Code, and Site Type. For convenience, the Global Model Grid ID is also provided as an attribute.

Name	Units	Description
no2 Measurement	µg/m³	Measured nitrogen dioxide (NO2) concentration.
o3 Measurement	µg/m³	Measured ozone (O3) concentration.
pm10 Measurement	µg/m³	Measured particulate matter <10 µm (PM10) concentration.
pm2.5 Measurement	µg/m³	Measured particulate matter <2.5 µm (PM2.5) concentration.
so2 Measurement	µg/m³	Measured sulphur dioxide (SO2) concentration.

Code

During the course of the study, 21 .ipynb notebooks were used, which are also included in this repository:

global_air_pollution_concentrations_quantile_regression.ipynb: This notebook includes code for applying quantile regression to estimate global air pollution concentrations.
global_air_pollution_data_anlaysis.ipynb: This notebook is used for analyzing global air pollution data.
global_air_pollution_maps_predictions.ipynb: This notebook generates prediction maps of global air pollution concentrations.
global_all_maps_visulisation.ipynb: This notebook provides visualization scripts for displaying various global air pollution maps.
global_data_download_airPollution-openAQ.ipynb: This notebook downloads air pollution data from the OpenAQ platform.
global_data_GEE_monitoringStationFeatureVector.ipynb: This notebook creates feature vectors for monitoring stations using Google Earth Engine data.
global_data_monitoringStationSiteFeatureVector.ipynb: This notebook constructs site-specific feature vectors for air pollution monitoring stations.
global_data_preprocessing_trainingData_featureVector.ipynb: This notebook preprocesses and prepares the training data feature vectors for the machine learning model.
global_data_processing_grids_enviromentalFeatureVector.ipynb: This notebook processes environmental feature vectors for grid-based data analysis.
global_model_uncertainity_monitoring_station_placement.ipynb: This notebook analyzes model uncertainty to optimize the placement of air pollution monitoring stations.
global_data_processing_grids_remoteSensingFeatureVector.ipynb: This notebook processes remote sensing data to create feature vectors for grid-based environmental analysis.
global_model_absoluteLegislationCompliance.ipynb: This notebook evaluates air pollution data against absolute legislation compliance standards and the overall AQI index sum maps used within the manuscript.
global_model_figures.ipynb: This notebook generates figures and visualizations for the global air pollution model results.
global_model_individual_station_predicitions_spatial.ipynb: This notebook analyzes the spatial predictions for individual air pollution monitoring stations.
global_model_individual_station_predicitions_temporal.ipynb: This notebook analyzes the temporal predictions for individual air pollution monitoring stations.
global_model_individual_stations_correlation_bias.ipynb: This notebook examines the correlation and bias in predictions for individual monitoring stations.
global_model_output_air_pollution_map_analysis.ipynb: This notebook analyzes the output maps of air pollution concentrations generated by the global model.
global_model_training_emissions_area.ipynb: This notebook trains the model using emissions data for specific areas.
global_model_training_emissions_area_quantile_regression.ipynb: This notebook applies quantile regression to train the model using emissions data for specific areas.
globalEmissions.ipynb: This notebook processes and analyzes global emissions data for use in the air pollution model.