TY - JOUR
T1 - Imputation methods for addressing missing data in short-term monitoring of air pollutants
AU - Hadeed, Steven J.
AU - O'Rourke, Mary Kay
AU - Burgess, Jefferey L.
AU - Harris, Robin B.
AU - Canales, Robert A.
N1 - Funding Information:
This work was funded by NIEHS: P50ES026089, P30 ES006694, T32 ES007091, and EPA: R836151. “Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under Award Number P50ES026089. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.” “This publication was developed under Assistance Agreements No. 836151 awarded by the U.S. Environmental Protection Agency to The University of Arizona. It has not been formally reviewed by EPA. The views expressed in this document are solely those of the authors and do not necessarily reflect those of the Agency. EPA does not endorse any products or commercial services mentioned in this publication.”
Publisher Copyright:
© 2018
PY - 2020/8/15
Y1 - 2020/8/15
N2 - Monitoring of environmental contaminants is a critical part of exposure sciences research and public health practice. Missing data are often encountered when performing short-term monitoring (<24 h) of air pollutants with real-time monitors, especially in resource-limited areas. Approaches for handling consecutive periods of missing and incomplete data in this context remain unclear. Our aim is to evaluate existing imputation methods for handling missing data for real-time monitors operating for short durations. In a current field study, real-time PM2.5 monitors were placed outside 20 households and ran for 24 hours. Missing data were simulated in these households at four consecutive periods of missingness (20%, 40%, 60%, 80%). Univariate (Mean, Median, Last Observation Carried Forward, Kalman Filter, Random, Markov) and multivariate time-series (Predictive Mean Matching, Row Mean Method) methods were used to impute missing concentrations, and performance was evaluated using five error metrics (Absolute Bias, Percent Absolute Error in Means, R2 Coefficient of Determination, Root Mean Square Error, Mean Absolute Error). The univariate Markov, random, and mean imputation methods performed best, yielding 24-hour mean concentrations with the lowest error and highest R2 values across all levels of missingness. When evaluating error metrics minute-by-minute, Kalman filter, median, and Markov methods performed well at low levels of missingness (20–40%). However, at higher levels of missingness (60–80%), Markov, random, median, and mean imputation performed best on average. Multivariate methods were the worst-performing imputation methods across all levels of missingness. Imputation using univariate methods may provide a reasonable solution to addressing missing data for short-term monitoring of air pollutants, especially in resource-limited areas. Further efforts are needed to evaluate imputation methods that are generalizable across a diverse range of study environments.
KW - Ambient PM2.5
KW - Imputation
KW - Missing data
KW - Real-time monitoring
UR - http://www.scopus.com/inward/record.url?scp=85084299186&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084299186&partnerID=8YFLogxK
U2 - 10.1016/j.scitotenv.2020.139140
DO - 10.1016/j.scitotenv.2020.139140
M3 - Article
C2 - 32402974
AN - SCOPUS:85084299186
SN - 0048-9697
VL - 730
JO - Science of the Total Environment
JF - Science of the Total Environment
M1 - 139140
ER -