Feature Extraction and Prediction of PM2.5 Chemical Constituents in Seoul Using GAIN Machine Learning Models

SONGKANG KIM, Jieun Park, Ilhan Ryoo, Taeyeon Kim, Yeonseung Cheong, Hyejin Shin, Sunghwan Shim, Sujung Han, Minsu Kang, Seung-Muk Yi, Seoul National University, Seoul, Korea

     Abstract Number: 84
     Working Group: Source Apportionment

Abstract
Fine particulate matter (PM2.5) is a significant component of air pollution and has been linked to adverse health effects, making its continuous monitoring crucial. Obtaining PM2.5 chemical composition data is costly and time-consuming, yet complete and reliable data are not always available. Missing values are a common challenge in interpreting these data and limit their usefulness, and imputing them is difficult because of the complex nature of PM2.5 chemical composition. Incomplete data can also reduce the accuracy and reliability of modeling results, such as source apportionment. Previous studies have used real-time data to interpolate missing values between observations; in contrast, we used data from a sampler that collects atmospheric aerosols for 24 hours once every 3 or 6 days using 3 channels. The PM2.5 constituents targeted for prediction comprised 3 groups: 6 ions, 2 carbons, and 20 trace elements. Our study evaluated the applicability of feature extraction using machine learning models to predict missing values among the observed chemical constituents of PM2.5, in order to improve the reliability and availability of the data. We employed several machine learning models, among which the generative adversarial imputation network (GAIN) had the highest predictive power. The prediction accuracies of the models were compared to evaluate their applicability under stepwise increases in the input data and changes in the components to be predicted. In addition, source apportionment was performed using the positive matrix factorization (PMF) model on both the data completed with the predicted values and the data without interpolation of missing values. Based on these results, our study aimed to compare and test the performance of model-based missing value interpolation not only for real-time data with many measurements but also for directly sampled data with few observations.
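
The comparison of prediction accuracies described above can be illustrated with a minimal sketch. It is not the GAIN model or the evaluation protocol used in this study: a simple column-mean imputer stands in for the learned models, the hold-out fraction and RMSE metric are assumptions, and the function names (`mean_impute`, `holdout_rmse`) and the numbers in `samples` are hypothetical placeholders rather than measurements. The idea is to hide a share of the observed cells, impute them, and score the imputed values against the hidden ones.

```python
import math
import random


def column_means(rows):
    """Mean of each column, ignoring missing entries (None)."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def mean_impute(rows):
    """Baseline imputer (a stand-in, not GAIN): fill gaps with the column mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def holdout_rmse(rows, impute, frac=0.2, seed=0):
    """Hide a fraction of observed cells, impute, and return RMSE on the hidden cells."""
    rng = random.Random(seed)
    observed = [(i, j) for i, r in enumerate(rows)
                for j, v in enumerate(r) if v is not None]
    hidden = rng.sample(observed, max(1, int(frac * len(observed))))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    filled = impute(masked)
    squared = [(filled[i][j] - rows[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic placeholder values: rows are 24-hour samples, columns are constituents.
samples = [
    [5.1, 2.0, None],
    [4.8, None, 0.30],
    [6.0, 2.4, 0.35],
    [None, 1.9, 0.28],
    [5.5, 2.2, 0.33],
]
rmse = holdout_rmse(samples, mean_impute)
print(f"hold-out RMSE of the mean-imputation baseline: {rmse:.3f}")
```

Any other imputer with the same input and output shape could be passed to `holdout_rmse` in place of `mean_impute`, which allows several models to be scored on the same hidden cells.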