Patterns discovery dataset for particulate matter (PM2.5) pollution trends in Japan
Abstract
Air pollution presents a significant environmental risk, impacting human health, accelerating climate change, and disrupting ecosystems. Much air pollution research aims to pinpoint the most harmful pollutants and to map the regions exposed to high pollution levels. This study introduces a large-scale, high-quality dataset to advance the analysis of PM2.5 pollution and to reveal hidden patterns through pattern mining techniques. The dataset covers five years of hourly PM2.5 measurements collected from approximately 1,900 sensors across Japan, sourced from the Ministry of the Environment's Soramame platform. This platform offers hourly pollutant records, downloadable as monthly raw data files. The raw files arrive unorganised; they are sorted and stored in database tables that follow an Entity-Relationship (ER) schema.
The primary objective of this dataset is to aid in developing and validating pattern mining models that detect frequent patterns in PM2.5 data under diverse conditions. The dataset collection includes the "FINAL_DATASET" CSV file containing timestamps, sensor location IDs, and recorded PM2.5 values. Due to storage limitations, raw data files are excluded from the compressed ZIP (AEROS) file but can be accessed directly via the link provided in the README (Data). Because it supports the discovery of complex patterns, the dataset is a valuable resource for researchers applying pattern mining techniques to PM2.5 analysis. Sharing it publicly promotes collaboration and advances efforts to identify sensor locations or regions that are frequently polluted. Researchers are invited to use and contribute to the dataset, broadening its relevance and potential impact.
README: AEROS PM2.5 Dataset
Overview
The AEROS PM2.5 Dataset provides a comprehensive collection of hourly PM2.5 measurements recorded over a period of five years from sensors located across Japan. This dataset is a valuable resource for studying air quality trends, pollution patterns, and environmental health impacts.
Dataset Description
File Information
- File Name: FINAL_DATASET.csv
- Content: Hourly PM2.5 measurements collected from sensors located in Japan over five years.
Structure
The dataset includes the following columns:
- Timestamps: The date and time when the measurement was recorded.
- Sensor Location IDs: Unique identifiers for the sensor locations.
- PM2.5 Values (µg/m³): The recorded PM2.5 concentration at a specific timestamp and location.
Units
- PM2.5 Values: Measured in micrograms per cubic meter (µg/m³).
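The column description above can be turned into a small reader. The sketch below is illustrative only: it assumes a wide layout (one timestamp column followed by one column per sensor location ID), and the header names, timestamp format, and sample values are invented for the example. Check the actual header of FINAL_DATASET.csv before reusing it.

```python
import csv
import io
from datetime import datetime

# Hypothetical excerpt in an ASSUMED wide layout: timestamp first, then one
# column per sensor location ID. Names and values are invented.
SAMPLE = """timestamp,S001,S002,S003
2018-01-01 01:00:00,12.0,,8.0
2018-01-01 02:00:00,15.0,21.0,
"""

def read_measurements(handle):
    """Yield (timestamp, sensor_id, value); value is None for an empty cell."""
    reader = csv.reader(handle)
    header = next(reader)
    sensor_ids = header[1:]
    for row in reader:
        if not row:
            continue  # skip blank lines
        stamp = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        for sensor_id, cell in zip(sensor_ids, row[1:]):
            cell = cell.strip()
            value = float(cell) if cell else None  # PM2.5 in µg/m³
            yield (stamp, sensor_id, value)

# For the real file, replace the StringIO with open(path, newline="").
records = list(read_measurements(io.StringIO(SAMPLE)))
```

Yielding one tuple per timestamp and sensor keeps the reader independent of how many sensor columns the file has.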
Notes on Data
- Empty Cells: Represent instances where no PM2.5 data was recorded by the sensors at the corresponding timestamp.
- Value Interpretation:
  - Lower Values: Indicate lower PM2.5 concentrations (less pollution) at the specific timestamp and location.
  - Higher Values: Indicate higher PM2.5 concentrations (more pollution) at the specific timestamp and location.
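Because empty cells mean "not recorded" rather than zero, summary statistics should skip them. The following is a minimal sketch under that reading; the sensor IDs and readings are hypothetical, and empty cells are represented as None.

```python
# Hypothetical hourly readings per sensor (µg/m³); None marks an empty cell.
readings = {
    "S001": [12.0, 15.0, None, 9.0],
    "S002": [None, 21.0, None, 30.0],
}

def summarize(values):
    """Return the share of missing hours and min/max over recorded values only."""
    present = [v for v in values if v is not None]
    if not values:
        return {"missing_share": None, "min": None, "max": None}
    return {
        "missing_share": 1 - len(present) / len(values),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

summary = {sensor: summarize(vals) for sensor, vals in readings.items()}
```

Treating the gaps as zeros instead would understate pollution at sensors with many missing hours.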
Use Cases
This dataset can be used for:
- Analyzing air quality trends over time.
- Investigating regional variations in PM2.5 pollution.
- Developing predictive models for air pollution.
- Studying the correlation between pollution and environmental or health-related factors.
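As one illustration of the pattern-oriented use cases, the sketch below counts how often sets of sensors exceed a PM2.5 cutoff in the same hour. It is a naive enumeration for small inputs, not the authors' method and not a scalable algorithm such as Apriori or FP-growth. The cutoff, the minimum support, the sensor IDs, and the readings are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

THRESHOLD = 35.0  # illustrative cutoff in µg/m³ (assumption)
MIN_SUPPORT = 2   # minimum number of hours a set must appear in (assumption)

# Hypothetical hourly readings: timestamp -> {sensor_id: value or None}
hourly = {
    "2018-01-01 01:00": {"S001": 40.0, "S002": 38.0, "S003": 10.0},
    "2018-01-01 02:00": {"S001": 42.0, "S002": 36.0, "S003": None},
    "2018-01-01 03:00": {"S001": 12.0, "S002": 50.0, "S003": 37.0},
}

def frequent_sensor_sets(hourly, threshold, min_support, max_size=2):
    """Count sensor sets (up to max_size) that exceed the threshold together."""
    counts = Counter()
    for values in hourly.values():
        # One transaction per hour: the sensors at or above the cutoff.
        polluted = sorted(
            s for s, v in values.items() if v is not None and v >= threshold
        )
        for size in range(1, max_size + 1):
            for itemset in combinations(polluted, size):
                counts[itemset] += 1
    return {items: n for items, n in counts.items() if n >= min_support}

patterns = frequent_sensor_sets(hourly, THRESHOLD, MIN_SUPPORT)
```

With roughly 1,900 sensors, enumerating combinations quickly becomes infeasible, which is why dedicated pattern mining algorithms are the practical choice for the full dataset.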
User Interest
For users interested in a deeper analysis or in understanding the creation process of this dataset, please refer to the following GitHub repository:
Methods
The air pollution data was collected from Japan’s Soramame platform, which provides hourly updates on pollutant levels nationwide. The collected files span January 1, 2018, 01:00:00 to April 25, 2023, 22:00:00 and cover records from approximately 1,900 sensors stationed at various locations across Japan. The files arrive as unorganised CSVs and must be sorted systematically by year, month, time, sensor, and pollutant type. To maintain data integrity, we structured the dataset using an Entity-Relationship (ER) schema within a PostgreSQL database, comprising two main tables: the Sensor table (storing sensor name, ID, address, and location) and the Observations table (recording pollutant types and their values). A detailed step-by-step process is provided in the README. The result of this organisation is a consolidated CSV file containing PM2.5 levels, timestamps, and sensor details.
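The two-table layout described above can be sketched as follows. This is an approximation, not the authors' schema: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and every table name, column name, key, and sample row is hypothetical (for instance, "location" is assumed to be a latitude and longitude pair).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for PostgreSQL
conn.executescript("""
CREATE TABLE sensor (
    sensor_id TEXT PRIMARY KEY,
    name      TEXT,
    address   TEXT,
    latitude  REAL,   -- assumed representation of "location"
    longitude REAL
);
CREATE TABLE observations (
    sensor_id   TEXT NOT NULL REFERENCES sensor(sensor_id),
    observed_at TEXT NOT NULL,   -- hourly timestamp
    pollutant   TEXT NOT NULL,   -- pollutant type, e.g. 'PM2.5'
    value       REAL,            -- NULL when nothing was recorded
    PRIMARY KEY (sensor_id, observed_at, pollutant)
);
""")

# Invented sample rows, only to exercise the schema.
conn.execute(
    "INSERT INTO sensor VALUES (?, ?, ?, ?, ?)",
    ("S001", "Example station", "Example address", 35.0, 139.0),
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?)",
    [
        ("S001", "2018-01-01 01:00:00", "PM2.5", 12.0),
        ("S001", "2018-01-01 01:00:00", "NO2", 0.01),
        ("S001", "2018-01-01 02:00:00", "PM2.5", None),
    ],
)

# Extract only PM2.5 rows, as a consolidated export might.
rows = conn.execute("""
    SELECT o.observed_at, o.sensor_id, o.value
    FROM observations AS o
    JOIN sensor AS s ON s.sensor_id = o.sensor_id
    WHERE o.pollutant = 'PM2.5'
    ORDER BY o.observed_at, o.sensor_id
""").fetchall()
```

Keeping sensor metadata in its own table means each address and location is stored once, while the composite key on observations blocks duplicate readings for the same sensor, hour, and pollutant.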