Dryad

Automating the interpretation of PM2.5 time-resolved measurements using a data-driven approach

Data files

Dec 10, 2020 version files 74.11 MB

Abstract

The rapid development of automated measurement equipment enables researchers to collect greater quantities of time-resolved data from indoor and outdoor environments. Interpreting the resulting data can be time-consuming. This dataset contains the R code and time-resolved indoor and outdoor PM2.5 data used to illustrate a machine learning approach called Random Forest (RF). The method is used to study a dataset of 836 emission events that occurred over a two-week period in each of 18 apartments in California. The resulting RF model is then applied to PM2.5 data from an entirely separate dataset collected from 65 new homes in California. The RF model identifies 442 indoor emission events, with a few misidentifications. In the accompanying paper, we present the RF model development and evaluate its performance as the sample size and source vary. We discuss which characteristics of the dataset tended to help source identification, and why. For example, we show that data from many events and from different apartments are essential for the model to be suitable for analyzing the new, separate dataset. We also show that longitudinal data appear to be more helpful than the time frequency of measurements in a given apartment.