J. Cent. South Univ. Technol. (2010) 17: 307-315
DOI: 10.1007/s11771-010-0047-x
Prediction of dust fall concentrations in urban atmospheric environment through support vector regression
JIAO Sheng(焦胜)1, ZENG Guang-ming(曾光明)2, HE Li(何理)3,
HUANG Guo-he(黄国和)4, LU Hong-wei(卢宏玮)4, GAO Qing(高青)1
1. College of Architecture, Hunan University, Changsha 410082, China;
2. College of Environmental Science and Engineering, Hunan University, Changsha 410082, China;
3. Faculty of Engineering, Architecture and Science, Ryerson University, Toronto, Ontario, M5B 2K3, Canada;
4. Faculty of Engineering, University of Regina, Regina, Sask, S4S 0A2, Canada
© Central South University Press and Springer-Verlag Berlin Heidelberg 2010
Abstract: The support vector regression (SVR) method is a novel type of learning machine algorithm that has seldom been applied to the development of urban atmospheric quality models under multiple socio-economic factors. This study presents four SVR models, constructed by selecting linear, radial basis, spline, and polynomial functions as kernels, respectively, for the prediction of urban dust fall levels. The inputs of the models are industrial coal consumption, population density, traffic flow coefficient, and shopping density coefficient. The training and testing results show that the SVR model with the radial basis kernel outperforms the other three in both the training and testing processes. In addition, a number of scenario analyses reveal that the most suitable values of the parameters (the insensitive loss function parameter ε, the kernel width σ, and the error penalty coefficient C) are 0.001, 0.5, and 2 000, respectively.
Key words: support vector regression; urban air quality; dust fall; socio-economic factors; radial basis function
1 Introduction
Evaluation of potential air pollution impacts is usually conducted using physically based air quality simulation models [1-2]. Implementing these models normally requires identifying many atmospheric, meteorological, topographical, and environmental parameters [3-4]. However, when knowledge of air dispersion is unavailable, difficulties arise in model development, calibration, and verification [1, 5]. This challenge motivates the development of statistical models that are independent of the causes of pollutant emission and dispersion.
In addition to statistical models, nonparametric approaches have also demonstrated good performance in air quality forecasting due to their capability of approximating a wide class of nonlinear functions [6-10]. For example, HUANG [1] developed a stepwise cluster analysis modeling approach for dealing with discrete and nonlinear relationships between air pollutant levels and their associated pollution causes. SHIVELY and SAGER [10] proposed a nonparametric regression model for analyzing relationships between ambient ozone levels and related meteorological factors. Based on a long time series of hourly averages and a short time series of daily averages, PODNAR et al [11] used an artificial neural network method for simulating the transport and dispersion of chemical tracers in complex terrain. PEREZ and REYES [12] developed an artificial neural network model to forecast the maxima of 24 hourly averages of PM10 concentrations 1 d in advance. The model was applied to the case of five monitoring stations in the City of Santiago, Chile, and the results showed that, with suitable adaptations, the model may be used as an operational tool for air quality forecasting in large cities. GÓMEZ-SANCHIS et al [13] advanced a tropospheric ozone prediction model using an artificial neural network method, with surface meteorological variables and vehicle emission variables used as predictor variables for estimating ambient ozone concentrations.
The support vector machine (SVM) method, a novel type of learning machine algorithm, has received extensive attention from researchers due to its agreeable predictive capability [14]. With the introduction of Vapnik's ε-insensitive loss function [15], SVM has been extended to solve nonlinear regression problems. For instance, ASEFA et al [16] presented an application of the SVM method in network monitoring design, revealing the superior capability of SVM in capturing complex nonlinear relationships between variables. LU and WANG [17] examined the feasibility of applying SVM to predict air pollutant levels in advancing time series in the Hong Kong downtown area. The results revealed that SVM is superior to the conventional artificial neural network (ANN) model in predicting air quality parameters and has better generalization performance.
While there have been a few applications to air quality prediction [17], the existing SVR-based models were mostly built to capture the temporal variation trends of pollutant levels. Moreover, few previous statistical models placed emphasis on reflecting the effects of multiple socio-economic factors on urban air quality. Therefore, the objective of this work is to introduce the SVR method into the air quality prediction framework under the impacts of multiple socio-economic factors. Four models, constructed by selecting linear, radial basis, spline, and polynomial functions as the kernels, respectively, were investigated.
2 Materials and method
The model was developed for the City of Xiamen, in the south of China. Through principal component analysis, industrial coal consumption (x1, 10⁴ t/a), population density (x2, 10³ person/km²), traffic flow coefficient (x3, dimensionless), and shopping density coefficient (x4, dimensionless) were found to be the principal explanatory variables determining the air pollutant levels [18]. To investigate the urban air quality, the Xiamen Environmental Monitoring Station (XEMS) established an air quality monitoring network of 31 stations distributed over 31 grid squares (1 km×1 km) covering all the urban districts [1]. From this network, the yearly average data (1984-1988) from the 31 stations were obtained [18]. Though six principal pollutants were monitored, namely dust fall (DF), sulfur dioxide (SO2), nitrogen oxides (NOx), carbon monoxide (CO), ozone (O3), and hydrocarbons (HC), only DF was selected as the response variable (Y) to investigate the prediction performance of the SVR models. Note that the values of x3 and x4 have been divided by 100 to reduce computational errors.
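The input assembly and scaling described above can be sketched as follows; the numerical values are purely illustrative placeholders, not the XEMS monitoring data:

```python
import numpy as np

# Hypothetical grid-square records: columns are x1 (industrial coal
# consumption, 1e4 t/a), x2 (population density, 1e3 person/km2),
# x3 (traffic flow coefficient), x4 (shopping density coefficient).
raw = np.array([
    [12.5, 3.2, 140.0, 220.0],
    [ 8.1, 1.7,  60.0,  95.0],
    [15.0, 4.5, 180.0, 310.0],
])

X = raw.copy()
X[:, 2:] /= 100.0  # divide x3 and x4 by 100, as described in the text
```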
The support vector machine is based on the principle of structural risk minimization rather than the empirical risk minimization that is the basis of traditional neural network models [14]. It can deal with two pattern recognition problems: classification and regression. When the purpose is to map the relationships between input conditions and output pollutant concentrations, support vector regression (SVR) can be used. Suppose a set of training samples (G) is given as follows:
$G=\{(X_i, Y_i)\}_{i=1}^{m}, \; X_i \in \mathbf{R}^4, \; Y_i \in \mathbf{R}$  (1)
where Xi and Yi denote the explanatory and response variables of the ith sample, respectively; m is the number of training samples and i is the index for samples. Based on definition (1), the objective of the SVR problem is to find a function f(X) defined as:
$f(X)=\langle W, X\rangle + B$  (2)
where $f(X)$ is the estimated DF concentration; W is the support vector weight and B is the bias; $\langle W, X\rangle$ represents the dot product of W and X. The function should deviate from the observed response (Y) by at most ε for all the training data. Meanwhile, the function should be as flat as possible; flatness of f means seeking a small value of $\|W\|^2 = \langle W, W\rangle$. This problem can be visualized as a region (f(X)±ε) around the regression function. If the deviation between the observed and simulated values is smaller than ε, the regression function is acceptable; in other words, errors smaller than ε are simply ignored. As proposed by VAPNIK et al [15], the SVR problem can be formulated as the following optimization model:
$\min\limits_{W,B}\ \dfrac{1}{2}\|W\|^2$  (3a)
subject to:
$Y_i-\langle W, X_i\rangle-B\le\varepsilon$  (3b)
$\langle W, X_i\rangle+B-Y_i\le\varepsilon$  (3c)
The implicit assumption in Eq.(3) is that such a function exists that approximates all pairs (Xi, Yi) with ε precision, or in other words, that the convex optimization problem is feasible. However, in some cases this may not hold, so slack variables are introduced to cope with otherwise infeasible constraints and to control the errors, analogous to the "soft" margin loss function [19] used for SVM [20]. The problem can then be formulated as the following quadratic optimization problem by introducing a dual set of variables (referred to as Lagrange multipliers) [15]:
$\max\limits_{\alpha,\alpha^*}\ -\dfrac{1}{2}\sum\limits_{i,j=1}^{m}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle X_i, X_j\rangle-\varepsilon\sum\limits_{i=1}^{m}(\alpha_i+\alpha_i^*)+\sum\limits_{i=1}^{m}Y_i(\alpha_i-\alpha_i^*)$  (4a)
subject to:
$\sum\limits_{i=1}^{m}(\alpha_i-\alpha_i^*)=0$  (4b)
$0\le\alpha_i\le C$  (4c)
$0\le\alpha_i^*\le C$  (4d)
where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers, and C is the upper bound of the coefficients $\alpha_i$ and $\alpha_i^*$. The optimal $\alpha_i$ and $\alpha_i^*$ values can be obtained by solving the above quadratic optimization problem. Then, the optimal weight vector of the SVR model can be calculated by
$W^*=\sum\limits_{i=1}^{m}(\alpha_i-\alpha_i^*)X_i$  (5)
The optimal W* yields the following SVR prediction function for a new sample with input condition X*:
$f(X^*)=\sum\limits_{i=1}^{m}(\alpha_i-\alpha_i^*)\langle X_i, X^*\rangle+B$  (6)
The dot product between X* and Xi can be replaced by a kernel function [15]. Any function that satisfies Mercer's condition can serve as a kernel; therefore, various traditional functions (e.g., linear, radial basis, spline, polynomial, and tangent functions) can become the kernel of an SVR model. The input conditions corresponding to nonzero Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ are considered the support vectors. For all SVR models, the parameters ε and C are predefined, representing the tradeoff between model complexity and approximation error [21]. Meanwhile, the introduction of a kernel function may add new parameters. For example, the radial basis function requires an additional parameter (σ) to describe the width of the function, and the polynomial kernel adds a parameter (p) to characterize the degree of the polynomial. In contrast, the linear, spline, and tangent functions do not introduce any additional parameters into the SVR models. Since the selection of a kernel and the predefined parameter values determine the model's prediction accuracy, post-modeling approaches are required to aid in this task. Sensitivity analysis and scenario analysis are two frequently used tools for screening the best SVR model from various alternatives. While the genetic algorithm has been proposed to optimize the model parameters [14], it cannot optimize the model structure, i.e., it cannot answer which kernel function is the most appropriate.
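The kernel choices discussed above can be written down directly. A minimal sketch of three of them follows (the spline and tangent kernels are omitted; σ and p default to illustrative values, with σ=0.5 matching the width later recommended for the radial basis model):

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return float(np.dot(x, y))

def rbf_kernel(x, y, sigma=0.5):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is the width parameter
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2)
                        / (2.0 * sigma ** 2)))

def poly_kernel(x, y, p=2):
    # K(x, y) = (<x, y> + 1)^p; p is the polynomial degree
    return float((np.dot(x, y) + 1.0) ** p)
```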
3 Results and discussion
Various kernel functions (Table 1) can be used to identify an SVR model for predicting the DF concentrations. In this case, linear, radial basis, spline, and polynomial kernel functions were employed to formulate the SVR-1, SVR-2, SVR-3, and SVR-4 models, respectively. The four models were built in Matlab 7.0 with a quadratic optimization algorithm. Moreover, the model construction was divided into two processes to evaluate the quality of the acquired SVR models: a training process, in which the data of the first four years (124 samples) were used as training samples, and a testing process, in which the data of the last year (31 samples) were used as testing samples.
Table 1 Four SVR models proposed for simulating DF concentrations
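The paper fits each model by solving the dual quadratic program in Matlab. As a rough stand-in, a radial basis SVR fit can be sketched with subgradient descent on the ε-insensitive loss; this is an approximation for illustration only, not the quadratic optimization used in the study, and all hyperparameter defaults are illustrative:

```python
import numpy as np

def fit_rbf_svr(X, y, sigma=0.5, C=10.0, eps=0.001, lr=0.01, iters=1000):
    """Approximate kernel SVR: subgradient descent on the regularized
    eps-insensitive loss, with f(x) = sum_i beta_i K(x_i, x) + b.
    A sketch, not the exact dual quadratic program."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    m = len(y)
    # Gram matrix with the radial basis kernel
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    beta = np.zeros(m)
    b = 0.0
    for _ in range(iters):
        r = K @ beta + b - y                       # residuals
        g = np.sign(r) * (np.abs(r) > eps)         # eps-insensitive subgradient
        beta -= lr * (C * (K @ g) / m + K @ beta)  # loss term + ||f||^2 term
        b -= lr * C * g.mean()
    return beta, b, K
```

In-sample predictions are `K @ beta + b`; predicting at a new point would require the cross-kernel vector between the training inputs and that point.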
3.1 Training results of SVR models
Fig.1 presents the scatterplots of simulated and observed DF concentrations in the training process. It is shown that the SVR-1 model achieved the most unsatisfactory training performance. While the observed DF concentrations range between 0.08 and 0.45 t/(m²·d), the predicted levels only vary from 0.24 to 0.30 t/(m²·d). This narrowed prediction range indicates that the linear kernel function is not suitable, at least in this context, as large errors were introduced in the training process (root mean square error (RMSE) of 2.809 1). In addition, the SVR-3 and SVR-4 models only achieved mediocre training performance, with RMSE values over 2.4, showing that the spline and polynomial kernel functions are also unsuitable. In comparison, the SVR-2 model has the best training performance. As shown in Fig.1(b), the predicted DF concentrations agree well with the observed ones, except for a few data points slightly deviating from the straight line. The RMSE of this model is 1.345 1, significantly lower than those of the other three models.
Fig.1 Scatterplots of simulated and observed DF concentrations in training process: (a) SVR-1 model; (b) SVR-2 model; (c) SVR-3 model; (d) SVR-4 model
The R² of the SVR-2 model is 0.789 8, showing that approximately 80% of the data variability can be explained by the model; the remainder reflects the complexity of simulating air quality. By comparison, the other three models only achieve R² values lower than 0.65. Thus, the SVR-2 model has the best performance in the training process.
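The statistics used throughout this section (RMSE, R², and the mean relative error reported later) can be computed directly; a small sketch:

```python
import numpy as np

def rmse(y_obs, y_sim):
    """Root mean square error between observed and simulated series."""
    y_obs = np.asarray(y_obs, float)
    y_sim = np.asarray(y_sim, float)
    return float(np.sqrt(np.mean((y_obs - y_sim) ** 2)))

def r_squared(y_obs, y_sim):
    """Coefficient of determination: share of data variability explained."""
    y_obs = np.asarray(y_obs, float)
    y_sim = np.asarray(y_sim, float)
    ss_res = np.sum((y_obs - y_sim) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_relative_error(y_obs, y_sim):
    """Mean relative error, in percent."""
    y_obs = np.asarray(y_obs, float)
    y_sim = np.asarray(y_sim, float)
    return float(np.mean(np.abs(y_sim - y_obs) / np.abs(y_obs))) * 100.0
```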
3.2 Testing results of SVR models
Fig.2 compares the observed and simulated DF concentrations for the testing samples. Almost half of the predicted values agree well with the observed ones at grids 7, 10, 16, 19 to 24, and 26 to 31, with relative errors lower than 30%. This reveals that the four models are useful in simulating the DF concentrations at these grids. However, grids 5, 17, 18, and 27 are not well simulated by the four models, with the maximum absolute and relative errors over 0.06 t/(m²·d) and 40%, respectively. The large errors may come from two sources. The first is probably errors in the obtained air quality data and the input socio-economic data. The second is that the model itself can hardly capture all the relationships between the explanatory and response variables.
Fig.2 Comparison of observed and simulated DF concentrations in testing process: (a) SVR-1 model; (b) SVR-2 model; (c) SVR-3 model; (d) SVR-4 model
Table 2 presents the DF concentrations obtained, from which the testing performance can be analyzed. As the SVR-1 model achieved poor training performance, it did not perform satisfactorily in the testing process: the average and maximum absolute errors are 0.08 and 0.19 t/(m²·d), respectively, and the average and maximum relative errors are 40.83% and 222.25%, respectively. Neither the SVR-3 nor the SVR-4 model achieved agreeable testing performance; their average and maximum relative errors are larger than 40% and 200%, respectively. For the SVR-2 model, the maximum absolute error (0.17 t/(m²·d)) and relative error (132.68%) occurred at grids 3 and 5, respectively, both lower than those obtained from the other models. The R² also reveals that the SVR-2 model performs better than the other ones in the testing process.
Table 2 DF concentrations obtained from four SVR models
In this work, RMSE and mean relative error (MRE) are selected as performance indicators, and a number of scenarios are analyzed to investigate the effects of some key parameters on the model's prediction accuracy. The results show that varying ε from 0.001 to 1 has no significant effect on the prediction performance. Therefore, the scenario analysis was only conducted for the effects of σ and C. Fig.3 presents the variations of RMSE with C when σ takes 0.2, 0.5, 1.0, and 2.0, respectively. In the training process, the RMSE decreases sharply from 4.861 to almost zero when σ takes 0.2 (Fig.3(a)), showing that only when C is large enough can the model achieve a perfect training performance. In the testing process, however, the RMSE can only be reduced from 5.806 (C=1) to 4.087 (C=20). This indicates that though the training data can be perfectly fitted, the testing performance is not satisfactory, owing to the high RMSE values achieved. A reason for the remarkable discrepancy between the training and testing performances may be overfitting of the SVR-2 model, which neglects the impacts of errors contained in the training samples. When σ takes 1.0 or 2.0, the training performance is not significantly improved with increasing C, while the testing performance becomes worse and worse; thus, the training and testing performances can hardly be acceptable under these situations. Comparatively, when σ takes 0.5, the training performance is agreeable, with the RMSE decreasing from 3.395 to 1.047 (Fig.3(b)). For the testing, the lowest RMSE is 1.292 at C=2 000, also lower than under the other scenarios.
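The scenario analysis above amounts to a grid scan over the σ and C values. A generic sketch, where the `evaluate` callback (supplied by the caller) returns a testing score such as RMSE for a given pair:

```python
from itertools import product

def scenario_scan(evaluate, sigmas=(0.2, 0.5, 1.0, 2.0),
                  Cs=(1, 20, 200, 2000)):
    """Evaluate every (sigma, C) scenario and return the best pair
    together with all scores (lower score is better, e.g. testing RMSE)."""
    results = {(s, c): evaluate(s, c) for s, c in product(sigmas, Cs)}
    best = min(results, key=results.get)
    return best, results
```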
3.3 Sensitivity analysis
Fig.4 shows the variations of MRE with C when σ takes 0.2, 0.5, 1.0, and 2.0, respectively.
Fig.3 Variations of RMSE with parameters σ and C: (a) σ=0.2; (b) σ=0.5; (c) σ=1.0; (d) σ=2.0
For example, the MRE decreases sharply from 41.11% to almost 0.14% (close to zero) when σ takes 0.2 (Fig.4(a)). This reveals the superior capability of the SVR model in fitting the training data (even if large errors are introduced). However, the testing performance can hardly be improved considerably, as illustrated by the rather high MRE values for all C values. This finding is similar to that observed in Fig.3(a). Note that the SVR-2 model achieves the best testing performance when σ and C take 0.5 and 2 000, respectively. The acquired MRE is only 26.25%, remarkably lower than under any other scenario. This finding is also analogous to that in Fig.3. With satisfactory training and the best testing performance under this scenario, the SVR-2 model can thus be determined, with recommended ε, σ, and C values of 0.001, 0.5, and 2 000, respectively. These values are determined mainly as a compromise that not only ensures satisfactory prediction accuracy but also prevents overfitting.
3.4 Discussion
In general, all four SVR models can achieve high performance in the training process only when the parameters are identified properly. However, perfect training performance does not imply similar prediction performance. The error sources could be threefold. The first is errors originating from the input conditions (industrial coal consumption, population density, traffic flow coefficient, and shopping density coefficient) and the output DF concentrations. The second is the structure of the SVR model, in which only four explanatory variables are considered. Although the previous principal component analysis showed that these four factors significantly affect the DF levels, neglecting the effects of other factors could bias the predicted concentrations.
The last is the model parameters (ε, σ, and C), which are identified by comparing the training and testing performances over a number of scenarios. While the most satisfactory SVR-2 model was obtained through this sensitivity analysis, it is not necessarily the best choice, since only a limited number of scenarios can be considered. Neglecting scenarios at other ε, σ, and C levels could miss a better SVR-2 model than the recommended one. Thus, in future studies, optimization methods can be used to screen a truly optimized SVR-2 model. The objectives of the optimization problem can be to minimize the RMSE and MRE values, and the constraints can be the variation intervals of the ε, σ, and C levels. As the optimization problem is nonlinear, heuristic optimization algorithms such as the genetic algorithm and simulated annealing can be employed to find the globally optimized ε, σ, and C values. With the aid of such optimization methods, it would also be convenient to identify a globally optimized SVR model with other kernel functions (e.g., the sigmoid function) for which more parameters are required to be determined.
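As a lightweight alternative to a genetic algorithm, the parameter search suggested above can be sketched as a random search over the (ε, σ, C) intervals discussed in the text; the exact sampling ranges and log-uniform distributions are illustrative assumptions:

```python
import numpy as np

def random_search(evaluate, n_trials=200, seed=0):
    """Sample (eps, sigma, C) triples and keep the lowest-scoring one.
    `evaluate` should return a testing score, e.g. combined RMSE/MRE."""
    rng = np.random.default_rng(seed)
    best, best_score = None, float("inf")
    for _ in range(n_trials):
        eps = 10.0 ** rng.uniform(-3, 0)     # 0.001 .. 1, log-uniform
        sigma = rng.uniform(0.1, 2.0)        # kernel width range (assumed)
        C = 10.0 ** rng.uniform(0, 3.5)      # ~1 .. ~3160, log-uniform
        score = evaluate(eps, sigma, C)
        if score < best_score:
            best, best_score = (eps, sigma, C), score
    return best, best_score
```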
Fig.4 Variations of MRE with parameters σ and C: (a) σ=0.2; (b) σ=0.5; (c) σ=1.0; (d) σ=2.0
The DF concentrations can also be predicted by other methods, such as multiple regression analysis [22], artificial neural networks [23-24], and stepwise cluster analysis [1]. However, a key limitation of regression analysis is that the prediction accuracy depends to a high extent on the assumed model structure. For instance, if the relationship between the DF concentration and the input conditions is linear, a nonlinear hypothesis may introduce large prediction errors; conversely, a linear hypothesis is not appropriate if the DF concentration is nonlinearly related to the input conditions. Artificial neural networks (ANN) and stepwise cluster analysis (SCA) have been proved effective in identifying a variety of unknown linear, nonlinear, or discrete relationships between variables. However, a major concern with these methods is potential overtraining, which may lead to declined generalization capability [1, 25]. By contrast, SVR could be more robust if the key parameters are properly selected, as proven by previous studies [26-27].
4 Conclusions
(1) The support vector machine (SVM) method is a novel type of learning machine algorithm that has seen few applications in environmental and chemical prediction fields. This work used SVM to predict urban dust fall (DF) concentrations under the impacts of multiple socio-economic factors. To investigate the effects of kernels on prediction performance, four models were constructed by selecting linear, radial basis, spline, and polynomial functions as the kernels, respectively. The model construction was divided into two processes (training and testing) to evaluate model quality. The training results show that the SVR-1 model has the most unsatisfactory training performance, whereas the SVR-2 model achieves the best. In the testing process, it is also observed that the SVR-1 model has the worst performance, while the SVR-2 model performs best.
(2) Due to the difficulty in determining the required modeling parameters, a number of scenarios were analyzed by changing the parameters within their possible variation ranges. The results show that ε values ranging from 0.001 to 1 have no significant effect on the training performance; only when C is large enough can the model achieve perfect training performance; and when σ takes 0.5, the training performance is agreeable. Thus, the recommended ε, σ, and C values in this work are 0.001, 0.5, and 2 000, respectively.
(3) While scenario analysis can aid in screening modeling parameters, the computational effort can hardly be neglected, particularly as the numbers of statistical samples and explanatory variables increase. Thus, future studies will apply global optimization techniques to find the optimal parameters. Moreover, the four SVR models were not compared with conventional models developed by multiple regression analysis, artificial neural networks, or stepwise cluster analysis. Nevertheless, this work shows that satisfactory training and testing performances can be gained through SVM. This provides a new approach for the prediction of urban dust fall concentrations, particularly under the impacts of multiple socio-economic factors.
References
[1] HUANG Guo-he. Stepwise cluster analysis method for predicting air quality in an urban environment [J]. Atmos Environ, 1992, 26(3): 349-357.
[2] RUSSELL A G, WINNER D A, HARLEY R A, MCCUE K F, CASS G R. Mathematical modeling and control of the dry deposition flux of nitrogen-containing air pollutants [J]. Environ Sci Technol, 1993, 27(13): 2772-2782.
[3] LU Hong-wei, HUANG Guo-he, LIU Lei, HE Li. An interval-parameter fuzzy-stochastic programming approach for air quality management under uncertainty [J]. Environmental Engineering Science, 2008, 25(6): 895-909.
[4] LU Hong-wei, HUANG Guo-he, LIU Zhen-fang, HE Li. Greenhouse gas mitigation-induced rough-interval programming for municipal solid waste management [J]. J Air Waste Manage Assoc, 2008, 58(12): 1546-1559.
[5] LU Hong-wei, HUANG Guo-he, LIU Zhen-fang, HE Li, ZENG Guang-ming. An inexact dynamic optimization model for municipal solid waste management in association with greenhouse gas emission control [J]. J Environ Manage, 2009, 90(1): 396-409.
[6] HE Li, HUANG Guo-he, LU Hong-wei, ZENG Guang-ming. Wavelet-based multiresolution analysis technique for data cleaning and its application to water quality management system [J]. Expert Systems with Applications, 2008, 35(3): 1301-1310.
[7] HE Li, HUANG Guo-he, LU Hong-wei, ZENG Guang-ming. Optimization of surfactant-enhanced aquifer remediation for a laboratory BTEX system under parameter uncertainty [J]. Environ Sci Technol, 2008, 42(6): 2009-2014.
[8] HE Li, HUANG Guo-he, LU Hong-wei. Health-risk-based groundwater remediation system optimization through clusterwise linear regression [J]. Environ Sci Technol, 2008, 42(24): 9237-9243.
[9] HE Li, HUANG Guo-he, LU Hong-wei, ZENG Guang-ming. An integrated simulation, inference, and optimization method for identifying groundwater remediation strategies at petroleum- contaminated aquifers in western Canada [J]. Water Research, 2008, 42(10/11): 2629-2639.
[10] SHIVELY T S, SAGER T W. Semiparametric regression approach to adjusting for meteorological variables in air pollution trends [J]. Environ Sci Technol, 1999, 33(21): 3873-3880.
[11] PODNAR D, KORACIN D, PANORSKA A. Application of artificial neural networks to modeling the transport and dispersion of tracers in complex terrain [J]. Atmos Environ, 2002, 36: 561-570.
[12] PEREZ P, REYES J. An integrated neural network model for PM10 forecasting [J]. Atmos Environ, 2006, 40: 2845-2851.
[13] GÓMEZ-SANCHIS J, MARTIN-GUERRERO J D, SORIA-OLIVAS E, VILA-FRANCES J, CARRASCO J L, VALLE-TASCON S D. Neural networks for analyzing the relevance of input variables in the prediction of tropospheric ozone concentration [J]. Atmos Environ, 2006, 40: 6173-6180.
[14] PAI P F, HONG W C. Forecasting regional electricity load based on recurrent support vector machines with genetic algorithms [J]. Electric Power Systems Research, 2005, 74: 417-425.
[15] VAPNIK V, GOLOWICH S, SMOLA A. Support vector machine for function approximation, regression estimation, and signal processing [J]. Adv Neural Inf Process Syst, 1996(9): 281-287.
[16] ASEFA T, KEMBLOWSKI M, URROZ G, MCKEE M, KHALIL A. Support vectors-based groundwater head observation networks design [J]. Water Resour Res, 2004, 40(11): 1-10.
[17] LU W Z, WANG W J. Potential assessment of the “support vector machine” method in forecasting ambient air pollutant trends [J]. Chemosphere, 2005, 59: 693-701.
[18] SUN S. Principal component analysis of air pollutant sources in Xiamen, China [J]. China Envir Sci, 1989, 10: 23-41.
[19] HE L, CHAN C W, HUANG G H, ZENG G M. A probabilistic reasoning-based decision support system for selecting remediation technologies for petroleum-contaminated sites [J]. Expert Systems with Applications, 2006, 30(4): 783-795.
[20] CORTES C, VAPNIK V. Support vector networks [J]. Machine Learning, 1995, 20: 273-297.
[21] KHALIL A, ALMASRI M N, MCKEE M, KALUARACHCHI J J. Applicability of statistical learning algorithms in groundwater quality modeling [J]. Water Resour Res, 2005, 41: 1-16.
[22] YUE W, LI X, LIU J, LI Y, YU X, DENG B. Characterization of PM2.5 in the ambient air of Shanghai city by analyzing individual particles [J]. Sci Total Environ, 2006, 368: 916-925.
[23] CHALOULAKOU A, SAISANA M, SPYRELLIS N. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens [J]. Sci Total Environ, 2003, 313: 1-13.
[24] SAHOO G B, RAY C, MEHNERT E, KEEFER D A. Application of artificial neural networks to assess pesticide contamination in shallow groundwater [J]. Sci Total Environ, 2006, 367: 234-251.
[25] ELGAALI E, GARCIA L A. Using neural networks to model the impacts of climate change on water supplies [J]. J Water Res Pl.-ASCE, 2007, 133(3): 230-243.
[26] GOYAL P, CHAN A T, JAISWAL N. Statistical models for the prediction of respirable suspended particulate matter in urban cities [J]. Atmos Environ, 2006, 40: 2068-2077.
[27] HANRAHAN G, GARZA C, GARCIA E, MILLER K. Experimental design and response surface modeling: A method development application for the determination of reduced inorganic species in environmental samples [J]. Journal of Environmental Informatics, 2007, 9(2): 71-79.
Foundation item: Projects(2007JT3018, 2008JT1013, 2009FJ4056) supported by the Key Project in Hunan Science and Technology Program, China; Project(20090161120014) supported by the New Teachers Sustentation Fund in Doctoral Program, Ministry of Education, China
Received date: 2009-06-17; Accepted date: 2009-08-25
Corresponding author: JIAO Sheng, PhD; Tel: +86-731-88852921; E-mail: jiaosheng2008@163.com
(Edited by YANG You-ping)