The rapid development of automated measurement equipment enables researchers to collect ever larger quantities of time-resolved data from indoor and outdoor environments, but interpreting the resulting data can be time-consuming. This dataset contains the R code and time-resolved indoor and outdoor PM_2.5 data to illustrate a machine learning approach called Random Forest (RF). The method is used to study a dataset of 836 emission events that occurred over two-week monitoring periods in 18 apartments in California. The resulting RF model is then applied to PM_2.5 data from an entirely separate dataset collected in 65 new homes in California, where it identifies 442 indoor emission events, with a few misidentifications. In the accompanying paper, we present the RF model development and evaluate its performance as the sample size and source vary. We discuss which characteristics of the dataset tended to help source identification, and why. For example, we show that data from many events and from different apartments are essential for the model to be suitable for analyzing the new, separate dataset. We also show that longitudinal data appear to be more helpful than the time frequency of measurements in a given apartment.

The ‘Dataset’ directory contains two datasets of indoor and outdoor PM2.5 data that were previously collected in field studies conducted by our research group at the Lawrence Berkeley National Laboratory. Dataset 1 contains PM2.5 data that were collected by Noris et al. (2013) from two weeks of monitoring in 18 low-income apartments in California. Dataset 1 is used as the training dataset; its indoor PM emission events were previously analyzed by Chan et al. (2018) using a rule-based method. Dataset 2 contains PM2.5 data that were collected by Singer et al. (2020) from 65 new California single-family homes for one week each.

The 18 apartments in Dataset 1 are identified by building number (‘Bldg’ = 1, 2, or 3), apartment number (‘Apt’ = 1 to 6), and whether the data were collected before (‘Period’ = 1) or after (‘Period’ = 2) retrofit. The 65 single-family homes in Dataset 2 are identified by building number (‘Bldg’). For Dataset 2, an adjustment factor of 1.23 was applied to the indoor PM2.5 concentration “data_value_raw” measured using a photometer; see Singer et al. (2020) for more details. The PM2.5 concentrations in Dataset 1 already incorporate an adjustment factor; see Chan et al. (2018) for more details.
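
As a small illustration of the adjustment described above, the sketch below scales raw photometer readings by 1.23. It is written in Python for brevity and is not part of the R scripts in this dataset; the function name `adjust_photometer` and the sample readings are hypothetical.

```python
# Illustrative only: apply the photometer adjustment factor used for Dataset 2.
ADJUSTMENT_FACTOR = 1.23  # value stated in the text; see Singer et al. (2020)


def adjust_photometer(raw_values, factor=ADJUSTMENT_FACTOR):
    """Return adjusted PM2.5 concentrations (ug/m3) from raw photometer readings."""
    return [factor * v for v in raw_values]


# Hypothetical raw readings (ug/m3), standing in for the "data_value_raw" column.
raw = [10.0, 20.0, 0.0]
adjusted = adjust_photometer(raw)
```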

Both datasets were processed to calculate the following “features”, some of which were used in the Random Forest model.

Indoor_value is the indoor PM_2.5 concentration (ug/m³)
Back_diff_x, where x = 1, 2, 3, 4, 5, and 10, is the backward difference in indoor PM_2.5 (ug/m³) relative to the value x timesteps before the current one.
Front_diff_x, where x = 1, 2, 3, 4, 5, and 10, is the forward difference in indoor PM_2.5 (ug/m³) relative to the value x timesteps after the current one.
Variance_y_min, where y = 4, 8, 12, and 16, is the standard deviation of y minutes of indoor PM_2.5 (ug/m³) centered on the current timestep.
Outdoor_value is the outdoor PM_2.5 concentration (ug/m³)
Outdoor_hourly is the 1-hour average outdoor PM_2.5 (ug/m³) calculated using data from the previous hour ending at the current timestep.
Extreme_point is a data flag indicating whether the indoor PM_2.5 at the current timestep is a local minimum or maximum: 1 = yes, 0 = no
Extreme_forward is the indoor PM_2.5 concentration (ug/m³) at the next local minimum or maximum datapoint
Extreme_backward is the indoor PM_2.5 concentration (ug/m³) at the previous local minimum or maximum datapoint
Extreme_diff = Extreme_forward - Extreme_backward is the difference in indoor PM_2.5 (ug/m³) between the two local minimum or maximum datapoints
Extreme_forward_outdoor is the outdoor PM_2.5 (ug/m³) at the next local minimum or maximum datapoint
Extreme_backward_outdoor is the outdoor PM_2.5 (ug/m³) at the previous local minimum or maximum datapoint
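
To make a few of the feature definitions above concrete, here is a minimal Python sketch of Back_diff_x, Front_diff_x, a centered standard deviation, and Extreme_point. It is not the code used to build the datasets. Several details are assumptions: the sign of the forward difference (later value minus current value), the window width expressed in timesteps rather than minutes, the use of the sample standard deviation, the treatment of series edges as missing (`None`), and the definition of a local extremum as a strict comparison with the two immediate neighbours. All function names and sample values are hypothetical.

```python
import statistics


def back_diff(values, x):
    """values[t] - values[t - x]; None where the earlier point does not exist."""
    return [values[t] - values[t - x] if t - x >= 0 else None
            for t in range(len(values))]


def front_diff(values, x):
    """values[t + x] - values[t] (assumed sign); None where the later point does not exist."""
    n = len(values)
    return [values[t + x] - values[t] if t + x < n else None
            for t in range(n)]


def centered_std(values, half_window):
    """Sample standard deviation over a window centered on each timestep.

    half_window is in timesteps (assumption); edges without a full window give None.
    """
    n = len(values)
    out = []
    for t in range(n):
        lo, hi = t - half_window, t + half_window
        if lo < 0 or hi >= n:
            out.append(None)
        else:
            out.append(statistics.stdev(values[lo:hi + 1]))
    return out


def extreme_point(values):
    """1 where a point is a strict local minimum or maximum, else 0 (end points are 0)."""
    flags = [0] * len(values)
    for t in range(1, len(values) - 1):
        prev, cur, nxt = values[t - 1], values[t], values[t + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            flags[t] = 1
    return flags


# Hypothetical indoor PM2.5 series (ug/m3) with one peak and one trough.
indoor = [5.0, 6.0, 12.0, 30.0, 22.0, 15.0, 9.0, 10.0]
bd1 = back_diff(indoor, 1)
fd2 = front_diff(indoor, 2)
sd = centered_std(indoor, 2)
ext = extreme_point(indoor)
```

In this sample series the peak of 30.0 and the trough of 9.0 are the only points flagged as local extrema.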

In addition to the above, the training Dataset 1 also contains the following data flags that were determined previously by Chan et al. (2018) using the rule-based method.

Emission is a data flag indicating whether the current datapoint was part of an indoor emission event: 1 = yes, 0 = no
Backward_E is a data flag indicating whether the previous local minimum or maximum datapoint was part of an indoor emission event: 1 = yes, 0 = no
Forward_E is a data flag indicating whether the next local minimum or maximum datapoint was part of an indoor emission event: 1 = yes, 0 = no
Decay is a data flag indicating whether the current datapoint was part of a decay period following an indoor emission: 1 = yes, 0 = no
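
One way to work with per-datapoint flags such as Emission and Decay is to group consecutive flagged datapoints into intervals. The Python sketch below does this under the assumption that each contiguous run of 1s forms one interval; the rule-based method of Chan et al. (2018) and the R scripts in this dataset may delimit events differently. The function name `flagged_runs` and the sample flags are hypothetical.

```python
from itertools import groupby


def flagged_runs(flags):
    """Return (start_index, end_index) for each contiguous run of 1s, end inclusive."""
    runs = []
    index = 0
    for value, group in groupby(flags):
        length = len(list(group))
        if value == 1:
            runs.append((index, index + length - 1))
        index += length
    return runs


# Hypothetical Emission flags for nine consecutive timesteps.
emission = [0, 1, 1, 1, 0, 0, 1, 1, 0]
events = flagged_runs(emission)
```

With these sample flags, `events` holds two intervals, covering timesteps 1 to 3 and 6 to 7.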

The ‘Dataset’ directory contains a third input file, ‘Dataset2_Volume.csv’. The file provides the approximate well-mixed air volume of the 65 single-family homes, which is needed to compute indoor PM_2.5 emission rates for Dataset 2. The well-mixed air volume (ft³) is computed as ‘FloorArea_sqft’ x ‘CeilingHgt_ft’ x ‘Factor’. ‘Factor’ is the percentage of the house air volume in the vicinity of the photometer used to measure indoor PM_2.5, within which the PM_2.5 concentration was assumed to be well-mixed during the indoor emission event and decay period.
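
The volume calculation above can be sketched in a few lines of Python. This is an illustration only, not the code in ‘4-Emission rate calculation.R’. It assumes that ‘Factor’ is stored as a fraction between 0 and 1; if the file stores it as a percentage, divide it by 100 first. The conversion to cubic metres is added for convenience, and the house dimensions are hypothetical.

```python
FT3_TO_M3 = 0.0283168  # cubic feet to cubic metres


def well_mixed_volume_ft3(floor_area_sqft, ceiling_hgt_ft, factor):
    """Well-mixed air volume in ft3: floor area x ceiling height x Factor.

    factor is assumed to be a fraction (0 to 1), not a percentage.
    """
    return floor_area_sqft * ceiling_hgt_ft * factor


# Hypothetical house: 2000 sq ft floor area, 9 ft ceilings, half the volume well-mixed.
volume_ft3 = well_mixed_volume_ft3(2000.0, 9.0, 0.5)
volume_m3 = volume_ft3 * FT3_TO_M3
```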

In addition to the two PM2.5 datasets and the computed features, there are two additional directories: ‘Code’ and ‘Results.’

‘Code’ contains four R scripts that build a Random Forest model using Dataset 1 as the training dataset, then apply the model to Dataset 2 and compute statistics of indoor PM_2.5 emission events. ‘Results’ contains outputs from the R scripts.

The ‘Code’ files are:

‘1-Building random forest using dataset1.R’ reads ‘Training_Dataset1.csv’, builds Random Forest models, and saves the results in two R object files: ‘Emission.RData’ and ‘Decay.RData’.
‘2-Apply resulting model to dataset2.R’ reads ‘Dataset2.csv’ and applies the Random Forest models stored in the two R object files, ‘Emission.RData’ and ‘Decay.RData’, to identify indoor PM_2.5 emission events and decay periods. The results are written to ‘Dataset2_analysis.csv’.
‘3-Dataset 2 summary.R’ reads ‘Dataset2_analysis.csv’ and outputs the indoor PM_2.5 emission events (‘Sum_E.csv’) and decay periods (‘Sum_D.csv’) identified by the Random Forest models. ‘Selected_events.csv’ contains the subset of emission events that meet the criteria outlined in the accompanying paper.
‘4-Emission rate calculation.R’ computes emission rates for the indoor PM_2.5 events and saves the results in two R object files: ‘Pre_E.RData’ and ‘Pre_D.RData’. Summary statistics of emission rates and other event characteristics are written to ‘Dataset2_result.csv’.

Before running the R scripts, users should update the file paths in the scripts to match their own working directory. R version 3.6.3 was used to produce the results. Two R libraries are needed: ‘lubridate’ version 1.7.8 and ‘randomForest’ version 4.6-14.

Automating the interpretation of PM2.5 time-resolved measurements using a data-driven approach
