10th International Aerosol Conference
September 2 - September 7, 2018
America's Center Convention Complex
St. Louis, Missouri, USA

Spatial Modeling of PM2.5 Concentrations Measured by a Low-Cost Sensor Network: Comparison of Linear and Machine-Learning Enabled Land Use Models

SAKSHI JAIN, Naomi Zimmerman, Albert Presto, Carnegie Mellon University

     Abstract Number: 968
     Working Group: Low-Cost and Portable Sensors

Abstract
Low-cost sensors for PM2.5 and other pollutants can be widely deployed to characterize coupled temporal and spatial variations in concentration, inform human exposures, and disseminate information to the public. Many previous studies have characterized spatial patterns of PM2.5 by building land use regression (LUR) models from distributed filter samplers. These models can be generated with high spatial resolution, thereby producing estimates of long-term (e.g., annual average) spatial patterns of concentration. Deployment of low-cost PM2.5 sensors, which typically sample in real time, creates the possibility of time-resolved and/or real-time modeling of PM2.5 concentration surfaces. Additionally, since the low-cost sensors operate in real time, it may be possible to train models on a smaller number of sampling locations than in past studies that used integrated filter samples.

Our aim in this study is to develop spatial models for PM2.5 based on measurements collected by a network of low-cost PM2.5 nephelometers. We test two different models: LUR and a machine-learning-enabled land use model (land use random forest, LURF). LUR relates measured PM2.5 concentrations to land use variables (e.g., land zoning, population density, traffic intensity) using multiple linear regression. LURF uses the same basic structure and land use variables as LUR but uses random forests to link observed concentrations to those variables. We expect LURF to outperform LUR because (1) near-source concentration profiles are not linear, and (2) LURF can resolve interactions between variables that are difficult to resolve in linear models, without overfitting.
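
As a rough sketch of how the two model forms differ, they can be written as follows. The notation is introduced here for illustration only and is not taken from the study: c-hat is the predicted PM2.5 concentration at site s and time t, x_k are the K predictor variables, and T_b are B regression trees, each fit to a bootstrap sample of the training data.

```latex
% LUR: multiple linear regression on K predictors (illustrative notation)
\hat{c}_{\mathrm{LUR}}(s,t) = \beta_0 + \sum_{k=1}^{K} \beta_k \, x_k(s,t)

% LURF: average over B regression trees T_b (illustrative notation)
\hat{c}_{\mathrm{LURF}}(s,t) = \frac{1}{B} \sum_{b=1}^{B} T_b\bigl(x_1(s,t), \dots, x_K(s,t)\bigr)
```

The linear form is additive in the predictors, whereas each tree splits on one variable at a time and can therefore represent nonlinear responses and interactions between variables.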

Models were developed for daily average PM2.5 concentrations for periods spanning August 2016 through May 2017. PM2.5 data were collected from 15 different sites in Pittsburgh, Pennsylvania. We tested different combinations of sensors used for model training and validation. Land use variables included a set of 15 different classes of time-independent (e.g., building height) and time-dependent (e.g., wind speed) predictor variables.

We used both k-fold cross-validation and hold-out validation to evaluate model performance. For k-fold cross-validation, the training dataset for a subset of the 15 PM2.5 sampling sites was divided into 5 folds. Five different models were built, each using 4 folds (80% of the data) for training and 1 fold for internal validation. The final model was an average across the models developed for each fold. For hold-out validation, we tested model performance for sensors not used in model building. For example, if the model was trained on 5 sampling sites, the remaining 10 were used to independently test the model. This validation tests model transferability in both space and time, as not all sensors operated concurrently. We also performed a second, independent hold-out validation against a larger network of 35 additional PM2.5 sensors.
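
The two splitting schemes described above might be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code: the function names, the random seed, the 100-row dataset, and the site labels 1 to 15 are all assumptions made for the example, and no model is actually fit.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, nearly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validation_splits(n_rows, k=5, seed=0):
    """Yield one (train, validation) pair of index lists per fold."""
    folds = k_fold_indices(n_rows, k, seed)
    for i, val in enumerate(folds):
        # Training rows are every fold except the one held for validation.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield sorted(train), sorted(val)


def site_hold_out(site_ids, n_train, seed=0):
    """Choose n_train sites for model building; all other sites are held out."""
    sites = sorted(set(site_ids))
    train_sites = set(random.Random(seed).sample(sites, n_train))
    test_sites = [s for s in sites if s not in train_sites]
    return sorted(train_sites), test_sites


# Hypothetical data: 100 daily observations and 15 sites labelled 1..15.
splits = list(cross_validation_splits(100, k=5))
train_sites, test_sites = site_hold_out(range(1, 16), n_train=5)
```

With 5 folds, each of the five splits trains on 80% of the rows and validates on the remaining 20%. The site-level split trains on 5 sites and tests on the other 10, as in the example given in the text.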

The LURF model significantly outperformed the LUR model in cross-validation and in most of the scenarios tested. With a testing window of up to 13 weeks, the R-squared value for LURF internal validation was above 0.75 and the Spearman rho was above 0.9 in all cases tested.
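
For readers unfamiliar with the two metrics, the sketch below shows one way they could be computed from paired observed and predicted concentrations. The abstract does not say which definition of R-squared was used; the coefficient of determination (1 - SS_res / SS_tot) is assumed here. The function names and sample values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical daily-average PM2.5 values (ug/m3), for illustration only.
observed = [8.0, 10.0, 12.0, 15.0, 20.0]
predicted = [9.0, 9.5, 13.0, 14.0, 18.0]
r2 = r_squared(observed, predicted)
rho = spearman_rho(observed, predicted)
```

R-squared penalizes errors in magnitude, while Spearman rho depends only on whether the predictions rank the observations in the correct order, so a model can score well on one and poorly on the other.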

LURF displayed greater temporal and spatial transferability than traditional LUR. A LUR model trained on 60 days of data from 10 sampling locations had an R-squared value of ~0.14, whereas a LURF model trained under similar conditions was able to predict concentrations at the remaining hold-out sites (R-squared > 0.5).