Revisiting urban air quality forecasting: a regression approach

Vietnam Journal of Computer Science (2018) 5:177–184

Kostas Karatzas (kkara@auth.gr), Nikos Katsifarakis (nikolakk@auth.gr), Cezary Orlowski (corlowski@wsb.gda.pl), Arkadiusz Sarzyński (arek3108@gmail.com)

Affiliations:
Department of Mechanical Engineering, Environmental Informatics Research Group, Aristotle University, Thessaloniki, Greece
Institute of Management and Finance, WSB University in Gdańsk, Gdańsk, Poland
Department of Applied Business Informatics, Faculty of Management and Economics, Gdańsk University of Technology, Gdańsk, Poland

Abstract

We address air quality (AQ) forecasting as a regression problem, employing computational intelligence (CI) methods for the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece. Linear Regression and Artificial Neural Network models are developed, accompanied by Random Forest models, for five locations per study area and for a dataset of limited feature dimensionality. An ensemble approach is also used for generating and testing AQ forecasting models. Results indicate good model performance, with the correlation coefficient between forecasts and measurements of the daily mean PM10 concentration one day in advance reaching 0.765 for one of the TMA locations and 0.64 for one of the GMA locations. Overall, the results suggest that this modelling approach can support the provision of air quality forecasts on the basis of limited feature space dimensionality and by employing simple linear regression models.

Keywords: Computational intelligence · Air pollution · Regression models · Ensemble

1 Introduction

In a recently published paper [1] we underlined the importance of air quality (AQ) forecasting in urban environmental management as well as in contemporary smart city development [2,3]. In the current paper we revisit and extend our initial approach, focusing on urban AQ forecasting from the regression point of view and incorporating an ensemble modelling approach. In doing so, we take into account that in the framework of smart city information systems, environmental management plays an important role [4] and air pollution abatement is one of its main targets [5]. Air quality forecasting is among the main pillars of AQ management [6] and is carried out with the aid of appropriate AQ models. Such models establish a time-varying relationship between the concentration of air pollutants at a specific time and location, c(t, x), and other parameters p(t, x) affecting the urban atmospheric environment. Such a relationship may be expressed with the aid of the following general function:

$$c(t, x) = f(p(t, x)) \tag{1}$$

Here t represents time and x is the location vector corresponding to physical space. In this case the vector c(t, x) refers to concentration values of air pollutants like Nitrogen Dioxide (NO2), Carbon Monoxide (CO), Ozone (O3) and Particulate Matter (PM), while p(t, x) includes parameters like wind speed, wind direction, air temperature, solar radiation, air pollutant emissions, air pollutant concentrations, land use type, land surface height, etc. The nature of the function f is dictated by the model type employed: if f reconstructs the physical and chemical relationships between the parameters p(t, x) and the values c(t, x), where x addresses the whole area of interest in a 3-D gridded manner, then models are said to follow an analytic-deterministic approach [7], while if f is a statistical or data-mining oriented function, then models are said to follow a data-driven approach (as reported in [8] and in references therein). In the latter case, x refers to specific locations within the studied area, which usually correspond to AQ measuring station locations.
Thus, x does not vary with time and is excluded, leading to an equation of the form:

$$c(t) = f(p(t)) \tag{2}$$

The objective of this paper is to suggest CI-based, ensemble oriented models that are able to extract as much information as possible from atmospheric quality data of low dimensionality, and thus to contribute to the scientific area of urban AQ forecasting. For this reason we employ a variety of CI methods and we suggest and test ensemble functions f in Eqs. (1) and (2). The geographic areas of interest are the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece, and the parameter of interest is the daily concentration of Particulate Matter with a mean aerodynamic diameter of 10 µm (PM10), approx. 1/5th of the diameter of a human hair. This pollutant is able to penetrate into the bronchial part of the human lung system [9] and is one of the most important pollutants in the GMA [10] as well as in the TMA [11]. Air pollutant concentrations are addressed as numerical values. AQ forecasting follows a twofold approach:

a) Each AQ monitoring station is treated individually, i.e. AQ models are developed and tested per station location. Thus, the forecasting of the parameter of interest is performed as a regression problem.
b) Regression models are created based on ensemble modelling principles, and are evaluated via their ability to forecast AQ levels at different locations (i.e. at each monitoring station).

The mean daily concentration level of PM10 one day in advance is the target of the forecasting models under development. This choice corresponds to the requirements posed by relevant legislation for citizens as well as decision makers to be informed about the expected PM10 levels for the next day, which should not exceed 50 µg/m³ on more than 35 days per year according to the European Regulations [9,12] and according to the World Health Organization guidelines [13]. Combustion processes, traffic and natural sources directly emit PM10, while in some regions the mechanical degradation of the road surface and of winter tires also contributes to its production. PM10 are part of the inhalable fraction of PM and have adverse effects on human health [9].

The research question posed in the current paper moves one step ahead of our previously published results [1] and addresses (a) the ability of a low dimensionality feature space (small number of input parameters) to support effective data-driven models for PM10 forecasting and (b) the modelling approach to be used in terms of algorithms and their setup (single vs. ensemble oriented models). In addition, we make use of an ensemble approach based on an ANN model of simple architecture which can be applied to multiple geographic areas, thus simplifying the ensemble approach suggested by [14] and [15], while maintaining a performance comparable to the one reported by similar studies [16], and therefore providing a novel approach to the problem at hand.

In the rest of the paper we first present the materials of our study (Section 2), followed by the computational methods (Section 3). We then proceed with the presentation and the discussion of the results in Section 4, and we draw our conclusions in Section 5.

2 Materials: area of study and data made available

The areas of study as well as the AQ problem addressed have been the focus of multiple studies performed in the past. In the case of Gdańsk, ANNs have been employed for AQ forecasting in [17]. The same data set has been used for PM10 forecasting in [18] as well as for the adaptation of an AQ forecasting model developed for Gdańsk to the Thessaloniki area [19]. The air pollution of Thessaloniki has been studied and modeled with the aid of ANNs [20], with special emphasis on PM [21]. The similarity of the GMA and the TMA in terms of population and existence of a sea front suggests that there might also be a similarity in the way that PM oriented air pollution can be modeled in both areas. Moreover, the need for the construction of data-driven models which use a small number of input parameters suggested that a generalized, ensemble-based approach should be employed for the AQ modeling in both areas of interest, these being the novelty points of the research results at hand.

2.1 The two areas of interest

The city of Gdańsk is located on the Baltic coast in the south-west of the bay of Gdańsk, in the northern part of Poland. It is the capital of a tri-city metropolitan area merging with Gdynia (known for its shipyards) and Sopot (a recreational resort), with the GMA exceeding 1,000,000 residents when suburban communities are also taken into account. The economy in Gdańsk is dominated by shipbuilding, petrochemicals and chemical industries, which are all concentrated quite close to the city center. The majority of air pollutant emissions originate from the industrial sector, the port activities and the city traffic [22], while the most important pollutants are PM10, NO2 and SO2 (http://www.airqualitynow.eu).

The city of Thessaloniki faces an oval harbor bay and stands on rising ground at the heart of a long gulf which is formed by the peninsula of Chalcidice. Various municipalities surround the city while an industrial zone is located in the north-west of its outskirts. The TMA is the second largest urban agglomeration in Greece, accounting for more than 1,000,000 inhabitants, with a considerable accumulation of urban traffic as well as industrial activities. The TMA is characterized by high pollution levels especially related to PM10, while O3 appears to be high in suburban locations of the area and NO levels are still high in dense urban areas in association with traffic [11].

Table 1 The Air Quality monitoring stations used for the current study in GMA and TMA

| Area | Stations |
|---|---|
| GMA | AM1 (Gdańsk-Śródmieście), AM2 (Gdańsk-Stogi), AM3 (Gdańsk-Nowy Port), AM4 (Gdynia-Pogórze), AM5 (Gdańsk-Szadółki) |
| TMA | Egnatia, Martiou, Lagkada, Eptapyrgiou, Malakopi |

2.2 The atmospheric quality data

In both the GMA and the TMA a number of AQ monitoring stations operate (9 and 17 respectively), which routinely record concentration values of basic pollutants as well as the variation of meteorological parameters. As not all pollutants are recorded at all stations, and in order to focus on the pollutant of interest (PM10), we decided to select five stations from each area of interest (included in Table 1) that were able to provide PM10 concentrations as well as meteorological data, in order to come up with data sets that are identical in terms of the parameters they include. In order to deal with the non-negligible frequency of missing data, we selected data from the year 2013, which contained only daily PM10 concentrations as well as information for air temperature and relative humidity.

As a result, the same atmospheric parameters were used for the modelling and forecasting process at each station: the model input or feature vector x included five parameters, namely the PM10 concentration of the current day as well as the temperature and relative humidity of the current day, complemented by the day and the month of the year. The target parameter to be forecasted, y, was the PM10 concentration of the next day. A summary of the basic statistical characteristics of the parameters involved in our study is included in Table 2.

3 Computational methods

The forecasting of the numerical value of PM10 concentration levels for the next day was the goal set for the development of relevant forecasting models. For this reason, we made use of the available datasets for each AQ monitoring station to develop individual (per station) AQ forecasting models.

3.1 Algorithms for single station model creation

The algorithms applied were selected based on computational experiments employing various CI methods, which were conducted with the aid of Matlab (www.mathworks.com) as well as of the WEKA computational environment [23]. On this basis, we chose the following three algorithms as the basis for AQ model development:

(i) Linear Regression (LR). Here the relationship between the forecasted parameter and the input parameters is described by an equation of the form:

$$y = x \cdot \beta + \varepsilon \tag{3}$$

where x is the input vector, β is the slope vector and ε the error vector. The slope vector is commonly calculated via the least squares method, thus:

$$\beta = (x^{T} x)^{-1} x^{T} y \tag{4}$$

(ii) Artificial Neural Networks (ANNs). In ANNs the input vector x for each neuron k is weighted with the aid of a weighting vector w_k, and the result is summed (taking into account any bias) and then fed into a transfer function f to produce the overall output vector y_k:

$$y_k = f(w_k \cdot x) \tag{5}$$

The training of the ANN aims at reducing the error e between the model output y_k and the actual (real) observed value d_k, which here is the PM10 concentration of the next day for each station:

$$e = \lVert y_k - d_k \rVert \tag{6}$$

This error reduction is based on a number of methods, all of which aim at recalculating the initial weights so that the overall network error is minimized. In the case of the gradient descent method (which is the simplest of all but nevertheless representative of the way that the weights are recalculated), the relationship between the updated and the initial weighting vector for all neurons k of the ANN is given by:

$$w(t+1) = w(t) - a(t)\, g(t) \tag{7}$$

Here t and t + 1 denote the initial and the updated weights and a(t) is the learning rate (step size), while the error term is described by:

$$g(t) = J^{T}(t) \cdot e(t) \tag{8}$$

where J^T is the transposed Jacobian and e(t) is the overall error vector [1].
Table 2 Basic statistics for the AQ and meteorological parameters available for each station at GMA and TMA (PM10 in µg/m³, temperature in °C, relative humidity in %)

| Station | PM10 Min | PM10 Max | PM10 Mean | PM10 Std | Temp Min | Temp Max | Temp Mean | Temp Std | Hum Min | Hum Max | Hum Mean | Hum Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AM1 | 6 | 92 | 20.55 | 10.48 | −11.1 | 24.5 | 8.03 | 7.93 | 48 | 100 | 81.56 | 11.00 |
| AM2 | 6 | 66 | 21.45 | 10.57 | −11.1 | 24.5 | 7.41 | 7.77 | 48 | 100 | 82.00 | 10.88 |
| AM3 | 3 | 79 | 16.90 | 10.14 | −11.1 | 24.5 | 7.71 | 7.95 | 48 | 100 | 81.73 | 11.00 |
| AM4 | 0 | 61 | 16.97 | 10.20 | −11.1 | 24.5 | 7.82 | 7.92 | 48 | 100 | 81.65 | 10.97 |
| AM5 | 0 | 55 | 15.01 | 8.35 | −11.1 | 24.5 | 7.82 | 7.92 | 48 | 100 | 81.65 | 10.97 |
| Egnatia | 18 | 131 | 48.21 | 19.67 | 1.6 | 31.4 | 18.14 | 7.48 | 29 | 88 | 59.17 | 13.43 |
| Martiou | 9 | 113 | 34.44 | 18.69 | 1.3 | 31 | 18.13 | 7.58 | 33 | 87 | 60.80 | 13.00 |
| Lagkada | 20 | 244 | 57.04 | 32.62 | 0.8 | 31.6 | 18.01 | 7.76 | 33 | 89 | 60.56 | 13.39 |
| Eptapyrgiou | 8 | 135 | 28.90 | 18.03 | −0.7 | 30.1 | 16.88 | 7.45 | 31 | 94 | 60.07 | 15.34 |
| Malakopi | 7 | 119 | 29.18 | 17.41 | 0.1 | 29.7 | 16.91 | 7.48 | 31 | 89 | 61.39 | 14.56 |

In this specific case a MultiLayer Perceptron Network with a feed-forward architecture and a back propagation training method was used, with an input layer consisting of 5 nodes (i.e. all the input parameters per station), an output layer consisting of only one node (the PM10 concentration of the next day) and a hidden layer with 10 nodes. The sigmoid function is employed as the transfer function, while the gradient descent algorithm is used for minimizing the error function.

(iii) Random Forests (RF), an ensemble method originating from the Decision Tree family of algorithms [24] that has shown high capacity to effectively model atmospheric parameters of interest [1]. The method creates N subsets of the input vector x using random selection with replacement, each subset containing 2/3 of the initial data, while the remaining data are used to estimate error and variable importance. Then, for each subset, a decision tree is created with an arbitrary number of nodes, where for each node the splitting is based on a (randomly) selected subset of L attributes that optimize a target function (best split criterion). In our case

$$L = \operatorname{int}\left[\log_2(\text{Number of attributes}) + 1\right]$$

Each of the aforementioned random trees had an unlimited number of levels and nodes. The predictions created by the individual trees are averaged, and thus the ensemble-based overall prediction of the RF (here the PM10 concentration of the next day) is generated. A pseudocode for this method based on http://dataaspirant.com/ is presented below:

[Random forest pseudocode listing; not recoverable from the source text]

The prediction is then made on the basis of an ensemble of results, based on voting over all the trees generated.

3.2 Ensemble models

In addition to the above approach, we investigated the possibility of developing ensemble-based models common to all monitoring stations. More specifically:

1. A single ensemble model was created for each one of the two areas of interest, and then applied to all individual AQ monitoring stations of the same area (local ensemble).
2. The ensemble created in one of the geographic areas was applied to each one of the AQ monitoring stations of the other geographic area (foreign ensemble).
3. Both local and foreign ensembles were combined to generate a cross ensemble model, which was then applied to each one of the AQ monitoring stations of both geographic areas of interest.

The aforementioned approach was implemented for both LR and ANN models as follows:

1. Local ensemble: In the case of LR, the parameters of the slope vector β of the ensemble model were calculated as weighted mean values of the parameters of each one of the individual LR models, and the local ensemble model was then applied to all stations. In the case of the ANN models, the weights of the individual models were used for the calculation of the weighted mean value of the weights of the local ensemble model. In both cases, the weighted means were calculated on the basis of the correlation coefficients of each one of the models participating in the ensemble, as resulting from their application to the monitoring station for which they were developed.
2. Foreign ensemble: the calculation was done exactly as in the case of the local ensemble, yet making use of the foreign individual model slope vectors (for LR) and weights (for ANN) instead of the local individual model characteristics.
3. Cross ensemble: the parameters of the local and the foreign ensemble models were averaged in order to calculate the parameters of the cross ensemble models.

3.3 Model validation

In order to validate the results of the PM10 predictions, it is important to make use of as many of the available data as possible for the training as well as for the testing phase. For this reason we followed a 10-fold cross validation procedure [25] for each one of the individual models developed: we randomly divided the initial dataset into 10 equal subsets. Then 9 of these subsets were used for training the model, while the 10th one was used for testing. This process was repeated 10 times, each time leaving a different subset out of the training phase and using it for the test phase. The overall model results are the mean values of the statistical indices of the 10 models developed. The ensemble models were defined on the basis of the (pre-existing) individual models per algorithm used, and therefore no additional model validation was applied to them.

Model results were evaluated based on the following statistical indices:

(a) Pearson's correlation coefficient r, which describes the degree of linear relationship between forecasted and real PM10 concentration values.
(b) Mean Absolute Error (MAE), which is a measure of the mean absolute distance between forecasted and real values.
(c) Root Mean Squared Error (RMSE), which is the square root of the Mean Square Error and expresses the standard deviation of the differences between forecasted and actual values.

Table 3 Correlation coefficient (r), Mean absolute error (MAE) and Root mean square error (RMSE) for three models per monitoring station concerning the forecast of the mean daily PM10 concentration one day in advance

| Station | RF r | RF MAE | RF RMSE | ANN (MLP) r | ANN MAE | ANN RMSE | LR (multivariate) r | LR MAE | LR RMSE |
|---|---|---|---|---|---|---|---|---|---|
| AM1 | 0.530 | 6.441 | 8.947 | 0.226 | 8.244 | 12.168 | 0.545 | 6.380 | 8.767 |
| AM2 | 0.456 | 7.322 | 9.843 | 0.361 | 8.202 | 10.696 | 0.479 | 7.206 | 9.599 |
| AM3 | 0.401 | 6.105 | 7.785 | 0.233 | 7.528 | 10.572 | 0.406 | 6.758 | 9.306 |
| AM4 | 0.601 | 6.007 | 8.235 | 0.427 | 7.379 | 9.787 | 0.641 | 5.581 | 7.821 |
| AM5 | 0.592 | 4.853 | 6.821 | 0.301 | 5.373 | 7.400 | 0.607 | 4.754 | 6.690 |
| Egnatia | 0.664 | 10.381 | 14.710 | 0.506 | 12.610 | 16.671 | 0.693 | 9.935 | 14.118 |
| Martiou | 0.731 | 8.791 | 12.715 | 0.563 | 11.771 | 15.788 | 0.732 | 8.851 | 12.666 |
| Lagkada | 0.713 | 15.157 | 22.989 | 0.571 | 17.996 | 25.590 | 0.728 | 15.050 | 22.391 |
| Eptapyrgiou | 0.742 | 7.497 | 12.000 | 0.587 | 11.633 | 16.229 | 0.720 | 8.057 | 12.390 |
| Malakopi | 0.723 | 7.871 | 12.014 | 0.617 | 9.829 | 14.753 | 0.742 | 7.800 | 11.639 |

4 Results and discussion

Based on the model calculations performed as described in Section 3, Pearson's correlation coefficient r, accompanied by the Mean Absolute Error and the Root Mean Squared Error, was calculated for the three models developed and for each one of the ten AQ monitoring stations for which data were available (Table 3).

Results suggest that the algorithm leading to the best (highest) correlation coefficient between forecasted and monitored values is LR, with r ranging from 0.406 for station AM3 up to 0.641 for station AM4 for the GMA. Concerning the TMA, LR is again the best algorithm in terms of the highest correlation coefficient achieved, with r ranging from 0.693 for the Egnatia station up to 0.742 for the Malakopi station (Table 3). The RF algorithm can be ranked 2nd, achieving correlation coefficients very close to the ones obtained with LR (and surpassing it for the Eptapyrgiou station), while in some cases leading to the best (lowest) MAE (as in the AM3, Martiou and Eptapyrgiou stations) and to the best (lowest) RMSE (as in the AM3 and Eptapyrgiou stations). LR is a simple algorithm of linear logic, generally considered weak in depicting nonlinear phenomena like the ones involved in AQ problems, and usually performing more poorly when compared with algorithms like ANNs or RF [1]. The success of this algorithm in our case has to do with the limited number of atmospheric quality parameters available in all studied areas and stations (low number of features), thus leading to the (possible) exclusion of nonlinear dependencies from the available dataset, and dictating persistence as the main mechanism affecting the forecast of PM10 levels one day in advance [26].

In the case of the ensemble approach used (local, foreign and cross ensembles), the results of the two algorithms employed (LR and ANN) are presented in Table 4. The optimum ensemble approach is selected on the basis of the highest correlation coefficient achieved, taken together with the lowest possible error metric values (MAE and RMSE). On this basis the local ensemble achieves the best results, followed by the cross ensemble and leaving the foreign ensemble last. The result may be attributed to the ability of the local ensemble to better represent the dependencies between the modelled parameter (mean daily PM10 concentration for the next day) and the parameters of the feature space (input parameters). In terms of algorithms employed, LR is always better in comparison to ANNs. Concerning the areas of study, r values range from 0.505 (station AM2) up to 0.64 (station AM4) for the GMA, while r values range from 0.710 (station Egnatia) up to 0.765 (station Malakopi) for the TMA. The value range of the correlation coefficient achieved for the TMA corresponds to a value range of the coefficient of determination (which is the correlation coefficient squared) between 0.504 (for Egnatia) and 0.585 (for Malakopi), which are better in comparison to the values achieved for the TMA but for two different stations, as reported by [27] and [28].

By comparing ensembles with the local models, it is evident that in the case of LR-based models, the local ensemble provides a better performance in comparison to the local models for all GMA stations with the exception of AM4, while in the case of the TMA local ensembles outperform local models for three out of five stations (Lagkada, Eptapyrgiou and Malakopi). In the case of ANN models, the local ensemble and the local models perform almost equally in terms of correlation coefficient values achieved.

5 Conclusions

In this paper, we address the problem of air quality forecasting for two different geographical areas of interest, the GMA and the TMA, by employing a regression approach, making use of a limited dimension feature space, and targeting the forecast of the mean daily PM10 concentration of the next day. We initially develop location specific models by employing ANNs, LR and RF, achieving correlation coefficients between 0.406 and 0.641 for the GMA stations, and between 0.693 and 0.742 for the TMA stations. The best performance was provided by the LR models, followed by the RF and the ANN models. In addition, we developed and tested three types of ensemble models per area, namely the local, the foreign and the cross ensemble models. Their application proved the local ensemble models to be superior for both the ANN and LR algorithms.
These results indicate that even when the feature space is of limited dimensionality, the best individual model outperforms the common model for all the monitoring stations, making use of the ensemble principle, and employing the recalculation of weights in a simple LR model. This suggests that city authorities may develop effective AQ models by targeting their investment in AQ monitoring to the parameters of interest, a vast feature space not being necessary for the success of the modelling approach.

In terms of geographic area of interest, models for the GMA show a lower overall performance in comparison to TMA models, regardless of the algorithm employed. Taking into account that in both areas the same features were made available and used for the development of the relevant models, this result indicates the importance of additional feature space parameters (reflecting atmospheric mechanisms) in order to further improve modelling performance. Concerning the choice of algorithms for the development of AQ models, the superiority of LR-based models in our study supports the finding that in the case of feature spaces of low dimension, the basic mechanisms which influence the quality of the atmospheric environment are persistence and linear dependencies. This result is of use for those wishing to apply AQ models in the frame of an urban environmental management system while having only a low-dimension feature space available for model deployment.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
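The ensemble construction of Section 3.2 and the indices of Section 3.3 can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the procedure actually run in the study: the slope vectors and correlation values are invented, the function names are hypothetical, the weights are taken to be the correlation coefficients normalized by their sum, and the cross ensemble is taken as the unweighted mean of the local and foreign parameter vectors.

```python
import numpy as np


def evaluation_indices(forecast, observed):
    """Pearson r, MAE and RMSE between forecasted and observed values."""
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    diff = forecast - observed
    r = np.corrcoef(forecast, observed)[0, 1]
    mae = np.mean(np.abs(diff))
    rmse = np.sqrt(np.mean(diff ** 2))  # square root of the mean square error
    return r, mae, rmse


def weighted_ensemble(slopes, correlations):
    """Correlation-weighted mean of per-station slope vectors (assumed form)."""
    slopes = np.asarray(slopes, dtype=float)          # (stations, features)
    weights = np.asarray(correlations, dtype=float)   # (stations,)
    return weights @ slopes / weights.sum()


# Invented slope vectors (two features) and correlations for three stations.
station_slopes = [[0.60, 0.10], [0.50, 0.20], [0.70, 0.00]]
station_r = [0.50, 0.25, 0.25]
local_beta = weighted_ensemble(station_slopes, station_r)

# Invented foreign ensemble; the cross ensemble averages the two vectors.
foreign_beta = np.array([0.40, 0.30])
cross_beta = (local_beta + foreign_beta) / 2.0

print(local_beta, cross_beta)
print(evaluation_indices([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The same weighted mean would apply to ANN weight vectors of identical architecture; averaging network weights is only meaningful here because every station model shares the same 5-10-1 layout.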
Table 4 Results of the local, foreign and cross ensemble models for the ANN and LR algorithms in all stations of the GMA and TMA

ANN ensembles:

| Station | Local r | Local MAE | Local RMSE | Foreign r | Foreign MAE | Foreign RMSE | Cross r | Cross MAE | Cross RMSE |
|---|---|---|---|---|---|---|---|---|---|
| AM1 | 0.242 | 7.935 | 11.453 | 0.192 | 9.101 | 13.637 | 0.220 | 8.432 | 12.684 |
| AM2 | 0.360 | 8.210 | 10.801 | 0.332 | 8.793 | 11.249 | 0.342 | 8.402 | 10.835 |
| AM3 | 0.232 | 7.612 | 10.713 | 0.202 | 8.234 | 11.183 | 0.230 | 7.712 | 10.812 |
| AM4 | 0.433 | 7.254 | 9.610 | 0.400 | 8.013 | 10.615 | 0.416 | 7.531 | 10.258 |
| AM5 | 0.308 | 5.310 | 7.284 | 0.282 | 5.619 | 7.735 | 0.294 | 5.593 | 7.634 |
| Egnatia | 0.511 | 12.518 | 16.619 | 0.422 | 14.195 | 18.490 | 0.482 | 13.151 | 16.215 |
| Martiou | 0.563 | 11.767 | 15.731 | 0.442 | 14.015 | 18.183 | 0.542 | 12.835 | 16.015 |
| Lagkada | 0.572 | 18.104 | 26.107 | 0.451 | 22.526 | 30.015 | 0.532 | 20.015 | 28.124 |
| Eptapyrgiou | 0.588 | 11.652 | 16.310 | 0.452 | 13.124 | 19.258 | 0.551 | 12.106 | 17.514 |
| Malakopi | 0.616 | 9.848 | 14.981 | 0.442 | 13.253 | 18.315 | 0.561 | 11.257 | 16.938 |

LR ensembles:

| Station | Local r | Local MAE | Local RMSE | Foreign r | Foreign MAE | Foreign RMSE | Cross r | Cross MAE | Cross RMSE |
|---|---|---|---|---|---|---|---|---|---|
| AM1 | 0.561 | 6.018 | 8.852 | 0.525 | 6.512 | 9.165 | 0.558 | 6.161 | 8.962 |
| AM2 | 0.505 | 6.935 | 9.719 | 0.361 | 8.935 | 11.292 | 0.442 | 7.422 | 10.219 |
| AM3 | 0.453 | 5.922 | 9.752 | 0.337 | 7.815 | 10.621 | 0.405 | 6.760 | 9.412 |
| AM4 | 0.640 | 5.601 | 8.016 | 0.584 | 6.215 | 9.121 | 0.624 | 5.686 | 8.174 |
| AM5 | 0.629 | 4.611 | 6.731 | 0.441 | 6.125 | 8.017 | 0.545 | 4.951 | 7.192 |
| Egnatia | 0.710 | 9.911 | 14.183 | 0.681 | 10.165 | 14.317 | 0.705 | 9.932 | 14.209 |
| Martiou | 0.741 | 8.841 | 12.673 | 0.712 | 9.252 | 12.851 | 0.737 | 8.853 | 12.672 |
| Lagkada | 0.749 | 15.001 | 22.843 | 0.727 | 15.053 | 22.481 | 0.743 | 15.009 | 22.912 |
| Eptapyrgiou | 0.742 | 7.951 | 12.414 | 0.721 | 8.053 | 12.415 | 0.743 | 7.939 | 12.420 |
| Malakopi | 0.765 | 7.716 | 11.863 | 0.727 | 8.183 | 12.285 | 0.757 | 7.731 | 11.915 |

References

1. Karatzas, K., Katsifarakis, N., Orlowski, C., Sarzyński, A.: Urban air quality forecasting: a regression and a classification approach. In: Nguyen, N.T. et al. (eds.) Intelligent Information and Database Systems, 9th Asian Conference on Intelligent Information and Database Systems, Part II, Lecture Notes in Artificial Intelligence, vol. 10192, pp. 1–10 (2017). https://doi.org/10.1007/978-3-319-54430-4_52
2. Riffat, S., Powell, R., Aydin, D.: Future cities and environmental sustainability. Future Cities Environ. 2, 1 (2016). https://doi.org/10.1186/s40984-016-0014-2
3. Webel, S.: Forecasting Software that's a Breath of Fresh Air. Pictures of the Future Siemens Magazine (2016). http://www.siemens.com/innovation/en/home/pictures-of-the-future/infrastructure-and-finance/smart-cities-air-pollution-forecasting-models.html. Accessed 18 Aug 2017
4. Dawe, S., Paradice, D.: A systems approach to smart city infrastructure: a small city perspective. In: Proceedings of the Thirty Seventh International Conference on Information Systems, Dublin. http://iot-smartcities.lero.ie/wp-content/uploads/2016/12/A-Systems-Approach-to-Smart-City-Infrastructure-A-Small-City-Perspective.pdf. Accessed 18 Aug 2017
5. Marinov, M.B., Topalov, I., Gieva, E., Nikolov, G.: Air quality monitoring in urban environments. In: 39th International Spring Seminar on Electronics Technology (ISSE), Pilsen, pp. 443–448 (2016). https://doi.org/10.1109/ISSE.2016.7563237
6. Bukoski, B., Taylor, E.M.: Air quality forecasting. Air quality management 129–138 (2014)
7. Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A., Klein, T., Miranda, A.I., Monteiro, A., Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K.E.J., Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap, M., Reimer, E., Jakobs, H., Eben, K.: A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 12, 1–87 (2012)
8. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Modelling Pract. Theory 15(10), 1310–1319 (2007)
9. EEA, 2016: Air quality in Europe - 2016 report, European Environment Agency, https://doi.org/10.2800/80982. https://www.eea.europa.eu//publications/air-quality-in-europe-2016. Accessed 18 Aug 2017
10. Juda-Rezler, K., Trapp, W., Reizer, M.: Modelling the impact of climate changes on particulate matter levels over Poland. In: Steyn, D.G., Rao, S.T. (eds.) Air pollution modeling and its application XX, pp. 499–450 (2010)
11. Moussiopoulos, N., Vlachokostas, C., Tsilingiridis, G., Douros, I., Hourdakis, E., Naneris, C., Sidiropoulos, C.: Air quality status in Greater Thessaloniki Area and the emission reductions needed for attaining the EU air quality legislation. Sci. Total Environ. 407(4), 1268–1285 (2009)
12. Andrews, A.: The clean air handbook, a practical guideline to EU air quality law, https://www.clientearth.org/reports/20140515-clientearth-air-pollution-clean-air-handbook.pdf. Accessed 18 Aug 2017
13. WHO: Air Quality Guidelines, global update 2005, ISBN 92 890 2192 6 via http://www.euro.who.int. Accessed 18 Aug 2017
14. Siwek, K., Osowski, S.: Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors. Eng. Appl. Artif. Intel. 25(6), 1246–1258 (2012)
15. Zhou, Q., Jiang, H., Wang, J., Zhou, J.: A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network. Sci. Total Environ. 496, 264–274 (2014)
16. Biancofiore, F., Busilacchio, M., Verdecchia, M., Tomassetti, B., Aruffo, E., Bianco, S., Di Tommaso, S., Colangeli, C., Rosatelli, G., Carlo, P.: Recursive neural network model for analysis and forecast of PM10 and PM2.5. Atmos. Pollut. Res. 8(4), 652–659 (2017)
17. Khokhlov, V.N., Glushkov, A.V., Loboda, N.S., Bunyakova, Y.Y.: Short-range forecast of atmospheric pollutants using non-linear prediction method. Atmos. Environ. 42(31), 7284–7292 (2008)
18. Orłowski, C., Sarzyński, A.: A model for forecasting PM10 levels with the use of artificial neural networks. In: Information Systems Architecture and Technology - the use of IT Technologies to Support Organizational Management in Risky Environment, Wrocław (2014)
19. Orłowski, C., Sarzyński, A., Karatzas, K., Katsifarakis, N., Nazarko, J.: Adaptation of an ANN-based air quality forecasting model to a new application area. In: Król, D., Nguyen, N., Shirai, K. (eds.) Advanced Topics in Intelligent Information and Database Systems, pp. 479–488 (2017)
20. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Model. Pract. Theory 15(10), 1310–1319 (2007)
21. Voukantsis, D., Karatzas, K., Kukkonen, J., Räsänen, T., Karppinen, A., Kolehmainen, M.: Intercomparison of air quality data using principal component analysis, and forecasting of PM10 and PM2.5 concentrations using artificial neural networks, in Thessaloniki and Helsinki. Sci. Total Environ. 409, 1266–1276 (2011)
22. Szczepaniak, K., Astel, A., Bode, P., Sârbu, C., Biziuk, M., Raińska, E., Gos, K.: Assessment of atmospheric inorganic pollution in the urban region of Gdańsk. J. Radioanal. Nuclear Chem. 270(1), 35–42 (2006)
23. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
24. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
25. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. Fourteenth Int. Joint Conf. Artif. Intel. 2(12), 1137–1143 (1995)
26. EPA: Guidelines for developing an air quality (ozone and PM2.5) forecasting program, U.S. Environmental Protection Agency report EPA-456/R-03-002, https://www3.epa.gov/airnow/aq_forecasting_guidance-1016.pdf. Accessed 18 Aug 2017
27. Voukantsis, D., Niska, H., Karatzas, K., Riga, M., Damialis, A., Vokou, D.: Forecasting daily pollen concentrations using data-driven modeling methods in Thessaloniki, Greece. Atmos. Environ. 44(39), 5101–5111 (2010)
28. Tzima, F., Mitkas, P., Voukantsis, D., Karatzas, K.: Sparse episode identification in environmental datasets: the case of air quality assessment. Expert Syst. with Appl. 38(5), 5019–5027 (2011)

Revisiting urban air quality forecasting: a regression approach

Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Computer Science; Information Systems and Communication Service; Artificial Intelligence (incl. Robotics); Computer Applications; e-Commerce/e-business; Computer Systems Organization and Communication Networks; Computational Intelligence
ISSN
2196-8888
eISSN
2196-8896
D.O.I.
10.1007/s40595-018-0113-0

Abstract

We address air quality (AQ) forecasting as a regression problem, employing computational intelligence (CI) methods for the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece. Linear Regression, Artificial Neural Network and Random Forest models are developed for five locations per study area, using a dataset of limited feature dimensionality. An ensemble approach is also used for generating and testing AQ forecasting models. Results indicate good model performance: the correlation coefficient between forecasts and measurements of the daily mean PM10 concentration one day in advance reaches 0.765 for one of the TMA locations and 0.64 for one of the GMA locations. Overall, the results suggest that this modelling approach can support the provision of air quality forecasts on the basis of a feature space of limited dimensionality and by employing simple linear regression models.

Keywords Computational intelligence · Air pollution · Regression models · Ensemble

Kostas Karatzas, kkara@auth.gr
Nikos Katsifarakis, nikolakk@auth.gr
Cezary Orlowski, corlowski@wsb.gda.pl
Arkadiusz Sarzyński, arek3108@gmail.com

Department of Mechanical Engineering, Environmental Informatics Research Group, Aristotle University, Thessaloniki, Greece
Institute of Management and Finance, WSB University in Gdańsk, Gdańsk, Poland
Department of Applied Business Informatics, Faculty of Management and Economics, Gdańsk University of Technology, Gdańsk, Poland

Vietnam Journal of Computer Science (2018) 5:177–184

1 Introduction

In a recently published paper [1] we underlined the importance of air quality (AQ) forecasting in urban environmental management as well as in contemporary smart city development [2,3]. In the current paper we revisit and extend our initial approach, focusing on urban AQ forecasting from the regression point of view and incorporating an ensemble modelling approach. To do so, we take into account that in the framework of smart city information systems, environmental management plays an important role [4] and air pollution abatement is one of its main targets [5]. Air quality forecasting is among the main pillars of AQ management [6] and is realized with the aid of appropriate AQ models. Such models establish a time-varying relationship between the concentration of air pollutants at a specific time and location, c(t, x), and other parameters p(t, x) affecting the urban atmospheric environment. Such a relationship may be expressed with the aid of the following general function:

$$c(t, x) = f(p(t, x)) \tag{1}$$

Here t represents time and x is the location vector corresponding to physical space. In this case the vector c(t, x) refers to concentration values of air pollutants like Nitrogen Dioxide (NO2), Carbon Monoxide (CO), Ozone (O3) and Particulate Matter (PM), while p(t, x) includes parameters like wind speed, wind direction, air temperature, solar radiation, air pollutant emissions, air pollutant concentrations, land use type, land surface height, etc. The nature of function f is dictated by the model type employed. If f reconstructs the physical and chemical relationships between the parameters p(t, x) and the values c(t, x), where x addresses the whole area of interest in a 3-D gridded manner, then models are said to follow an analytic-deterministic approach [7]. If f is a statistical or data-mining oriented function, then models are said to follow a data-driven approach (as reported in [8] and in references therein). In the latter case, x refers to specific areas within the studied area, which usually correspond to AQ measuring station locations.
Thus, x is not varying with time and is excluded, leading to an equation of the form:

$$c(t) = f(p(t)) \tag{2}$$

The objective of this paper is to suggest CI-based, ensemble oriented models that are able to extract as much information as possible from atmospheric quality data of low dimensionality, and thus to contribute to the scientific area of urban AQ forecasting. For this reason we employ a variety of CI methods and we suggest and test ensemble functions f in Eqs. (1) and (2). The geographic areas of interest are the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece, and the parameter of interest is the daily concentration of Particulate Matter with a mean aerodynamic diameter of 10 µm (PM10), approx. 1/5th of the diameter of the human hair. The specific pollutant is able to penetrate into the bronchial part of the human lung system [9] and is one of the most important pollutants in the GMA [10] as well as in the TMA [11]. Air pollutant concentrations are addressed as numerical values. AQ forecasting follows a twofold approach:

a) Each AQ monitoring station is treated individually, i.e. AQ models are developed and tested per station location. Thus, the forecasting of the parameter of interest is performed as a regression problem.
b) Regression models are created based on ensemble modelling principles, and are evaluated via their ability to forecast AQ levels at different locations (i.e. at each monitoring station).

The mean daily concentration level of PM10 one day in advance is the target of the forecasting models under development. This choice corresponds to the requirements posed by relevant legislation for citizens as well as decision makers to be informed about the expected PM10 levels for the next day, which are not to exceed 50 µg/m³ on more than 35 days per year according to the European Regulations [9,12] and according to the World Health Organization guidelines [13]. Combustion processes, traffic and natural sources directly emit PM10, while in some regions the mechanical degradation of the road surface and of winter tires also contributes to its production. PM10 are part of the inhalable fraction of PM and have adverse effects on human health [9].

The research question posed in the current paper moves one step ahead of our previously published results [1] and addresses (a) the ability of a low dimensionality feature space (small number of input parameters) to support effective data-driven models for PM10 forecasting and (b) the modelling approach to be used in terms of algorithms and their setup (single vs. ensemble oriented models). In addition, we make use of an ensemble approach based on an ANN model of simple architecture which can be applied to multiple geographic areas, thus simplifying the ensemble approach suggested by [14] and [15], while maintaining a performance comparable to the one reported by similar studies [16], and therefore providing a novel approach to the problem at hand.

In the rest of the paper we first present the materials of our study (Section 2), followed by the computational methods (Section 3). We then proceed with the presentation and the discussion of the results in Section 4, and we draw our conclusions in Section 5.

2 Materials: area of study and data made available

The areas of study as well as the AQ problem addressed have been the focus of multiple studies performed in the past. In the case of Gdańsk, ANNs have been employed for AQ forecasting in [17]. The same data set has been used for PM10 forecasting in [18] as well as for the adaptation of an AQ forecasting model developed for Gdańsk to the Thessaloniki area [19]. The air pollution of Thessaloniki has been studied and modeled with the aid of ANNs [20], with special emphasis on PM [21]. The similarity of the GMA and the TMA in terms of population and the existence of a sea front suggests that there might also be a similarity in the way that PM-oriented air pollution can be modeled in both areas. Moreover, the need for the construction of data-driven models which use a small number of input parameters suggested that a generalized, ensemble-based approach should be employed for the AQ modeling in both areas of interest, these being the novelty points of the research results at hand.

2.1 The two areas of interest

The city of Gdańsk is located on the Baltic coast in the south-west of the bay of Gdańsk, in the northern part of Poland. It is the capital of a tri-city metropolitan area merging with Gdynia (known for its shipyards) and Sopot (a recreational resort), with more than 1,000,000 residents in the GMA when suburban communities are also taken into account. The economy in Gdańsk is dominated by shipbuilding, petrochemicals and chemical industries, which are all concentrated quite close to the city center. The majority of air pollutant emissions originate from the industrial sector, the port activities and the city traffic [22], while the most important pollutants are PM10, NO2 and SO2 (http://www.airqualitynow.eu).

The city of Thessaloniki faces an oval harbor bay and stands on rising ground at the heart of a long gulf which is formed by the peninsula of Chalcidice. Various municipalities surround the city while an industrial zone is located in the north-west of its outskirts. The TMA is the second largest urban agglomeration in Greece, accounting for more than 1,000,000 inhabitants, with a considerable accumulation of urban traffic as well as industrial activities. The TMA is characterized by high pollution levels especially related to PM10, while O3 appears to be high in suburban locations of the area and NO levels are still high in dense urban areas in association with traffic [11].

2.2 The atmospheric quality data

In both the GMA and the TMA a number of AQ monitoring stations operate (9 and 17 respectively), which routinely record concentration values of basic pollutants as well as the variation of meteorological parameters. As not all pollutants are recorded at all stations, and in order to focus on the pollutant of interest (PM10), we decided to select five stations from each area of interest (included in Table 1) that were able to provide PM10 concentrations as well as meteorological data, in order to come up with data sets that are identical in terms of the parameters they include. In order to deal with the non-negligible frequency of missing data, we selected data from the year 2013 which contained only daily PM10 concentrations as well as information for air temperature and relative humidity.

Table 1 The Air Quality monitoring stations used for the current study in GMA and TMA

GMA stations   AM1 (Gdańsk-Śródmieście), AM2 (Gdańsk-Stogi), AM3 (Gdańsk-Nowy Port), AM4 (Gdynia-Pogórze), AM5 (Gdańsk-Szadółki)
TMA stations   Egnatia, Martiou, Lagkada, Eptapyrgiou, Malakopi

As a result, the same atmospheric parameters were used for the modelling and forecasting process at each station: the model input or feature vector x included five parameters, namely the PM10 concentration of the current day as well as the temperature and relative humidity of the current day, complemented by the day and the month of the year. The target parameter to be forecasted, y, was the PM10 concentration of the next day. A summary of the basic statistical characteristics of the parameters involved in our study is included in Table 2.

3 Computational methods

The forecasting of the numerical value of PM10 concentration levels for the next day was the goal set for the development of relevant forecasting models. For this reason, we made use of the available datasets for each AQ monitoring station to develop individual (per station) AQ forecasting models.

3.1 Algorithms for single station model creation

The algorithms applied were selected based on computational experiments employing various CI methods, which were conducted with the aid of Matlab (www.mathworks.com) as well as of the WEKA computational environment [23]. On this basis, we chose the following three algorithms as the basis for AQ model development:

(i) Linear Regression (LR). Here the relationship between the forecasted parameter and the input parameters is described by an equation of the form:

$$y = x \cdot \beta + \varepsilon \tag{3}$$

where x is the input vector, β is the slope vector and ε the error vector. The slope vector is commonly calculated via the least squares method, thus:

$$\beta = (x^{T} \cdot x)^{-1} \cdot x^{T} \cdot y \tag{4}$$

(ii) Artificial Neural Networks (ANNs). In ANNs the input vector x for each neuron k is weighted with the aid of a weighting vector $w_k$, and the result is summed (taking into account any bias) and then fed into a transfer function f to produce the overall output vector $y_k$:

$$y_k = f(w_k \cdot x) \tag{5}$$

The training of the ANN aims at reducing the error e between the model output $y_k$ and the actual (real) value observed $d_k$, which here is the PM10 concentration of the next day for each station.

$$e = \lVert y_k - d_k \rVert \tag{6}$$

This error reduction is based on a number of methods, all of which aim at recalculating the initial weights so that the overall network error is minimized. In the case of the gradient descent method (which is the simplest of all but nevertheless representative of the way that the weights are recalculated), the relationship between the updated and the initial weighting vector for all neurons k of the ANN is given by:

$$w(t + 1) = w(t) - a(t)\, g(t) \tag{7}$$

Here w(t) and w(t + 1) denote the initial and the updated weights and a(t) is the learning rate, while the gradient term is described by:

$$g(t) = J^{T}(t) \cdot e(t) \tag{8}$$

where $J^{T}$ is the (transposed) Jacobian and e(t) is the overall error vector [1].
In this specific case a MultiLayer Perceptron network with a feed-forward architecture and a back propagation training method was used, with an input layer consisting of 5 nodes (i.e. all the input parameters per station), an output layer consisting of only one node (the PM10 concentration of the next day) and a hidden layer with 10 nodes. The sigmoid function is employed as the transfer function while the gradient descent algorithm is used for minimizing the error function.

(iii) Random Forests (RF), an ensemble method originating from the Decision Tree family of algorithms [24] that has shown high capacity to effectively model atmospheric parameters of interest [1]. The method creates N subsets of the input vector x using random selection with replacement, each subset containing 2/3 of the initial data, while the remaining data are used to estimate error and variable importance. Then for each subset, a decision tree is created with the aid of an arbitrary number of nodes, where for each node the splitting is based on a (randomly) selected subset of L attributes that optimize a target function (best split criterion). In our case $L = \mathrm{int}[\log_2(\text{Number of attributes}) + 1]$. Each of the aforementioned random trees had an unlimited number of levels and nodes. The prediction created by each tree is averaged and thus the ensemble-based overall prediction of the RF (here the PM10 concentration of the next day) is generated.

[Figure: pseudocode for the Random Forest method, based on http://dataaspirant.com/; not reproduced in this text version.]

The prediction is then made on the basis of an ensemble of results based on voting for each one of the trees generated.

Table 2 Basic statistics for the AQ and meteorological parameters available for each station at GMA and TMA

Dataset        PM10 (µg/m³)                  Temperature (°C)               Humidity (%)
               Min   Max   Mean    Std       Min     Max    Mean    Std     Min   Max   Mean    Std
AM1            6     92    20.55   10.48     −11.1   24.5   8.03    7.93    48    100   81.56   11.00
AM2            6     66    21.45   10.57     −11.1   24.5   7.41    7.77    48    100   82.00   10.88
AM3            3     79    16.90   10.14     −11.1   24.5   7.71    7.95    48    100   81.73   11.00
AM4            0     61    16.97   10.20     −11.1   24.5   7.82    7.92    48    100   81.65   10.97
AM5            0     55    15.01   8.35      −11.1   24.5   7.82    7.92    48    100   81.65   10.97
Egnatia        18    131   48.21   19.67     1.6     31.4   18.14   7.48    29    88    59.17   13.43
Martiou        9     113   34.44   18.69     1.3     31     18.13   7.58    33    87    60.80   13.00
Lagkada        20    244   57.04   32.62     0.8     31.6   18.01   7.76    33    89    60.56   13.39
Eptapyrgiou    8     135   28.90   18.03     −0.7    30.1   16.88   7.45    31    94    60.07   15.34
Malakopi       7     119   29.18   17.41     0.1     29.7   16.91   7.48    31    89    61.39   14.56

3.2 Ensemble models

In addition to the above approach, we investigated the possibility of developing ensemble-based models common to all monitoring stations. More specifically:

1. A single ensemble model was created for each one of the two areas of interest, and then applied to all individual AQ monitoring stations for the same area (local ensemble).
2. The ensemble created in one of the geographic areas was applied to each one of the AQ monitoring stations of the other geographic area (foreign ensemble).
3. Both local and foreign ensembles are combined to generate a cross ensemble model, which is then applied to each one of the AQ monitoring stations for both geographic areas of interest.

The aforementioned approach was realized for both LR and ANN models as follows:

1. Local ensemble: In the case of LR, the parameters of the slope vector β of the ensemble model were calculated as weighted mean values of the parameters of each one of the individual LR models, and the local ensemble model was then applied to all stations. In the case of the ANN models, the weights of the individual models were used for the calculation of the weighted mean value of the weights of the local ensemble model. In both cases, the weighted means were calculated on the basis of the correlation coefficients of each one of the models participating in the ensemble, as resulting from their application to the monitoring station for which they were developed.
2. Foreign ensemble: the calculation was done exactly as in the case of the local ensemble, yet making use of the foreign individual model slope vectors (for LR) and weights (for ANN) instead of the local individual model characteristics.
3. Cross ensemble: the parameters of the local and the foreign ensemble models were averaged in order to calculate the parameters of the cross ensemble models.

3.3 Model validation

In order to validate the results of the PM10 predictions, it is important to make use of as many of the available data as possible for the training as well as for the testing phase. For this reason we followed a 10-fold cross validation procedure [25] for each one of the individual models developed: we randomly divided the initial dataset into 10 equal subsets. Then 9 out of these subsets were used for training the model, while the 10th one was used for testing. This process was repeated 10 times, each time leaving a different subset out of the training phase and using it for the test phase. The overall model results are the mean values of the statistical indices of the 10 models developed. Concerning the ensemble models, these were defined on the basis of the (pre-existing) individual models per algorithm used, and therefore no additional model validation was used.

Model results were evaluated based on the following statistical indices:

(a) Pearson's correlation coefficient r, which describes the degree of linear relationship between forecasted and real PM10 concentration values.
(b) Mean Absolute Error (MAE), which is a measure of the mean absolute distance between forecasted and real values.
(c) Root Mean Squared Error (RMSE), which is the square root of the Mean Square Error and expresses the standard deviation of the differences between forecasted and actual values.

4 Results and discussion

Based on the model calculations performed as described in Section 3, the Pearson's correlation coefficient r, accompanied by the Mean Absolute Error and the Root Mean Squared Error, was calculated for the three models developed and for each one of the ten AQ monitoring stations for which data were available (Table 3).

Table 3 Correlation coefficient (r), Mean absolute error (MAE) and Root mean square error (RMSE) for three models per monitoring station concerning the forecast of the mean daily PM10 concentration one day in advance

Dataset        Random forest               ANN (Multilayer perceptron)   Linear Regression (Multivariate)
               r       MAE      RMSE       r       MAE      RMSE         r       MAE      RMSE
AM1            0.530   6.441    8.947      0.226   8.244    12.168       0.545   6.380    8.767
AM2            0.456   7.322    9.843      0.361   8.202    10.696       0.479   7.206    9.599
AM3            0.401   6.105    7.785      0.233   7.528    10.572       0.406   6.758    9.306
AM4            0.601   6.007    8.235      0.427   7.379    9.787        0.641   5.581    7.821
AM5            0.592   4.853    6.821      0.301   5.373    7.400        0.607   4.754    6.690
Egnatia        0.664   10.381   14.710     0.506   12.610   16.671       0.693   9.935    14.118
Martiou        0.731   8.791    12.715     0.563   11.771   15.788       0.732   8.851    12.666
Lagkada        0.713   15.157   22.989     0.571   17.996   25.590       0.728   15.050   22.391
Eptapyrgiou    0.742   7.497    12.000     0.587   11.633   16.229       0.720   8.057    12.390
Malakopi       0.723   7.871    12.014     0.617   9.829    14.753       0.742   7.800    11.639

Results suggest that the algorithm leading to the best (highest) correlation coefficient between forecasted and monitored values is LR, with an r ranging from 0.406 for station AM3 up to 0.641 for station AM4 for the GMA. Concerning the TMA, LR is again the best algorithm in terms of the highest correlation coefficient achieved, with an r value ranging from 0.693 for the Egnatia station up to 0.742 for the Malakopi station. The RF algorithm can be ranked as 2nd, achieving correlation coefficients very close to the ones obtained with the aid of LR (and surpassing it for the Eptapyrgiou station), while in some cases leading to the best (lowest) MAE (as in the AM3, Martiou and Eptapyrgiou stations) and to the best (lowest) RMSE (as in the AM3 and the Eptapyrgiou stations). LR is a simple algorithm of linear logic, generally considered weak in depicting nonlinear phenomena like the ones involved in AQ problems, and usually performing more poorly when compared with algorithms like ANNs or RF [1]. The success of the specific algorithm in our case has to do with the limited number of atmospheric quality parameters being available in all studied areas and stations (low number of features), thus leading to the (possible) exclusion of nonlinear dependencies from the available dataset, and dictating persistence as the main mechanism affecting the forecast of PM10 levels one day in advance [26].

In the case of the ensemble approach used (local, foreign and cross ensembles), the results of the two algorithms employed (LR and ANN) are presented in Table 4. The optimum ensemble approach is selected on the basis of the highest correlation coefficient achieved, taken in parallel with the lowest possible error metric values (MAE and RMSE). On this basis the local ensemble achieves the best results, followed by the cross ensemble and leaving the foreign ensemble last. The result may be attributed to the ability of the local ensemble to better represent the dependencies between the modelled parameter (mean daily PM10 concentration for the next day) and the parameters of the feature space (input parameters). In terms of algorithms employed, LR is always better in comparison to ANNs. Concerning the areas of study, r values for the LR local ensemble range from 0.453 (station AM3) up to 0.64 (station AM4) for the GMA, while r values range from 0.710 (station Egnatia) up to 0.765 (station Malakopi) for the TMA. The value range of the correlation coefficient achieved for the TMA corresponds to a value range of the coefficient of determination (which is actually the correlation coefficient squared) between 0.504 (for Egnatia) and 0.585 (for Malakopi), which are better in comparison to the values achieved for the TMA but for two different stations, as reported by [27] and [28].

By comparing ensembles with the local (per-station) models, it is evident that in the case of LR-based models, the local ensemble provides a better performance in comparison to the local models for all GMA stations with the exception of AM4, while in the case of the TMA local ensembles outperform local models for three out of five stations (Lagkada, Eptapyrgiou and Malakopi). In the case of ANN models, both the local ensemble and the local models perform almost equally in terms of correlation coefficient values achieved.

5 Conclusions

In this paper, we address the problem of air quality forecasting for two different geographical areas of interest, the GMA and the TMA, by employing a regression approach, making use of a limited dimension feature space, and targeting the forecast of the mean daily PM10 concentration of the next day. We initially develop location specific models by employing ANNs, LR and RF, achieving correlation coefficients between 0.406 and 0.641 for the GMA stations, and between 0.693 and 0.742 for the TMA stations. The best performance was provided by the LR models, followed by the RF and the ANN models. In addition, we developed and tested three types of ensemble models per area, namely the local, the foreign and the cross ensemble models. Their application proved the local ensemble models to be superior for both the ANN and LR algorithms. These results indicate that even when the feature space is of limited dimensionality, the best individual model outperforms the common model for all the monitoring stations, making use of the ensemble principle, and employing the recalculation of weights in a simple LR model. This suggests that city authorities may develop effective AQ models by targeting their investment in AQ monitoring to the parameters of interest, a vast feature space not being necessary for the success of the modelling approach.

In terms of geographic area of interest, models for the GMA present a lower overall performance in comparison to TMA models, regardless of the algorithm employed. Taking into account that in both areas the same features were made available and used for the development of the relevant models, this result indicates the importance of additional feature space parameters (reflecting atmospheric mechanisms) in order to further improve modelling performance. When it comes to the choice of algorithms for the development of AQ models, the superiority of LR-based models in our study supports the finding that in the case of feature spaces of low dimension, the basic mechanisms which influence the quality of the atmospheric environment are persistence and linear dependencies. This result is of use for those wishing to apply AQ models in the frame of an urban environmental management system, having a low-dimension feature space available for model deployment.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
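The evaluation indices of Sect. 3.3 and the correlation-weighted averaging behind the LR local ensemble of Sect. 3.2 can be sketched as follows. This is a hedged illustration only: the paper does not give the exact weighting formula, so normalising the correlation coefficients to sum to one is an assumption, and the function names and toy numbers are hypothetical.

```python
import numpy as np


def pearson_r(forecast, observed):
    """Pearson's correlation coefficient between forecasts and observations."""
    return float(np.corrcoef(forecast, observed)[0, 1])


def mae(forecast, observed):
    """Mean Absolute Error."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(forecast, observed):
    """Root Mean Squared Error: square root of the mean squared difference."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def weighted_ensemble_slopes(slopes, correlations):
    """Weighted mean of per-station slope vectors (local or foreign ensemble).

    slopes has shape (n_stations, n_params); correlations holds the r value
    each individual model achieved at its own station. Using r / sum(r) as
    weights is an assumption about the unspecified weighting scheme.
    """
    slopes = np.asarray(slopes, dtype=float)
    weights = np.asarray(correlations, dtype=float)
    weights = weights / weights.sum()
    return weights @ slopes


def cross_ensemble_slopes(local_slopes, foreign_slopes):
    """Cross ensemble: plain average of local and foreign ensemble parameters."""
    return (np.asarray(local_slopes, dtype=float)
            + np.asarray(foreign_slopes, dtype=float)) / 2.0


# Toy numbers (not taken from the paper's data).
observed = np.array([10.0, 20.0, 30.0, 40.0])
forecast = np.array([12.0, 18.0, 33.0, 37.0])
toy_slopes = [[1.0, 0.5], [3.0, 1.5]]   # two stations, two parameters each
toy_r = [0.25, 0.75]                    # r of each individual model
local = weighted_ensemble_slopes(toy_slopes, toy_r)
```

With these weights, a station whose individual model correlates better with its own measurements contributes more to the common slope vector, which is the behaviour described for the local and foreign ensembles.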
Table 4 Results of the local, foreign and cross ensemble models for the ANN and LR algorithms in all stations of the GMA and TMA

ANN ensembles
Dataset        ANN-local ensemble          ANN-foreign ensemble        ANN-cross ensemble
               r       MAE      RMSE       r       MAE      RMSE       r       MAE      RMSE
AM1            0.242   7.935    11.453     0.192   9.101    13.637     0.220   8.432    12.684
AM2            0.360   8.210    10.801     0.332   8.793    11.249     0.342   8.402    10.835
AM3            0.232   7.612    10.713     0.202   8.234    11.183     0.230   7.712    10.812
AM4            0.433   7.254    9.610      0.400   8.013    10.615     0.416   7.531    10.258
AM5            0.308   5.310    7.284      0.282   5.619    7.735      0.294   5.593    7.634
Egnatia        0.511   12.518   16.619     0.422   14.195   18.490     0.482   13.151   16.215
Martiou        0.563   11.767   15.731     0.442   14.015   18.183     0.542   12.835   16.015
Lagkada        0.572   18.104   26.107     0.451   22.526   30.015     0.532   20.015   28.124
Eptapyrgiou    0.588   11.652   16.310     0.452   13.124   19.258     0.551   12.106   17.514
Malakopi       0.616   9.848    14.981     0.442   13.253   18.315     0.561   11.257   16.938

LR ensembles
Dataset        LR-local ensemble           LR-foreign ensemble         LR-cross ensemble
               r       MAE      RMSE       r       MAE      RMSE       r       MAE      RMSE
AM1            0.561   6.018    8.852      0.525   6.512    9.165      0.558   6.161    8.962
AM2            0.505   6.935    9.719      0.361   8.935    11.292     0.442   7.422    10.219
AM3            0.453   5.922    9.752      0.337   7.815    10.621     0.405   6.760    9.412
AM4            0.640   5.601    8.016      0.584   6.215    9.121      0.624   5.686    8.174
AM5            0.629   4.611    6.731      0.441   6.125    8.017      0.545   4.951    7.192
Egnatia        0.710   9.911    14.183     0.681   10.165   14.317     0.705   9.932    14.209
Martiou        0.741   8.841    12.673     0.712   9.252    12.851     0.737   8.853    12.672
Lagkada        0.749   15.001   22.843     0.727   15.053   22.481     0.743   15.009   22.912
Eptapyrgiou    0.742   7.951    12.414     0.721   8.053    12.415     0.743   7.939    12.420
Malakopi       0.765   7.716    11.863     0.727   8.183    12.285     0.757   7.731    11.915

References

1. Karatzas, K., Katsifarakis, N., Orlowski, C., Sarzyński, A.: Urban air quality forecasting: a regression and a classification approach. In: Nguyen, N.T. et al. (eds.) Intelligent Information and Database Systems, 9th Asian Conference on Intelligent Information and Database Systems, Part II, Lecture Notes in Artificial Intelligence, vol. 10192, pp. 1–10 (2017). https://doi.org/10.1007/978-3-319-54430-4_52
2. Riffat, S., Powell, R., Aydin, D.: Future cities and environmental sustainability. Future Cities Environ. 2, 1 (2016). https://doi.org/10.1186/s40984-016-0014-2
3. Webel, S.: Forecasting Software that's a Breath of Fresh Air. Pictures of the Future, Siemens Magazine (2016). http://www.siemens.com/innovation/en/home/pictures-of-the-future/infrastructure-and-finance/smart-cities-air-pollution-forecasting-models.html. Accessed 18 Aug 2017
4. Dawe, S., Paradice, D.: A systems approach to smart city infrastructure: a small city perspective. In: Proceedings of the Thirty Seventh International Conference on Information Systems, Dublin. http://iot-smartcities.lero.ie/wp-content/uploads/2016/12/A-Systems-Approach-to-Smart-City-Infrastructure-A-Small-City-Perspective.pdf. Accessed 18 Aug 2017
5. Marinov, M.B., Topalov, I., Gieva, E., Nikolov, G.: Air quality monitoring in urban environments. In: 39th International Spring Seminar on Electronics Technology (ISSE), Pilsen, pp. 443–448 (2016). https://doi.org/10.1109/ISSE.2016.7563237
6. Bukoski, B., Taylor, E.M.: Air quality forecasting. Air Quality Management, pp. 129–138 (2014)
7. Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A., Klein, T., Miranda, A.I., Monteiro, A., Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K.E.J., Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap, M., Reimer, E., Jakobs, H., Eben, K.: A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 12, 1–87 (2012)
8. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Modelling Pract. Theory 15(10), 1310–1319 (2007)
9. EEA, 2016: Air quality in Europe - 2016 report, European Environment Agency. https://doi.org/10.2800/80982. https://www.eea.europa.eu//publications/air-quality-in-europe-2016. Accessed 18 Aug 2017
10. Juda-Rezler, K., Trapp, W., Reizer, M.: Modelling the impact of climate changes on particulate matter levels over Poland. In: Steyn, D.G., Rao, S.T. (eds.) Air Pollution Modeling and its Application XX, pp. 499–450 (2010)
11. Moussiopoulos, N., Vlachokostas, C., Tsilingiridis, G., Douros, I., Hourdakis, E., Naneris, C., Sidiropoulos, C.: Air quality status in Greater Thessaloniki Area and the emission reductions needed for attaining the EU air quality legislation. Sci. Total Environ. 407(4), 1268–1285 (2009)
12. Andrews, A.: The clean air handbook, a practical guideline to EU air quality law. https://www.clientearth.org/reports/20140515-clientearth-air-pollution-clean-air-handbook.pdf. Accessed 18 Aug 2017
13. WHO: Air Quality Guidelines, global update 2005, ISBN 92 890 2192 6, via http://www.euro.who.int. Accessed 18 Aug 2017
16. Biancofiore, F., Busilacchio, M., Verdecchia, M., Tomassetti, B., Aruffo, E., Bianco, S., Di Tommaso, S., Colangeli, C., Rosatelli, G., Carlo, P.: Recursive neural network model for analysis and forecast of PM10 and PM2.5. Atmospheric Pollut. Res. 8(4), 652–659 (2017)
17. Khokhlov, V.N., Glushkov, A.V., Loboda, N.S., Bunyakova, Y.Y.: Short-range forecast of atmospheric pollutants using non-linear prediction method. Atmos. Environ. 42(31), 7284–7292 (2008)
18. Orłowski, C., Sarzyński, A.: A model for forecasting PM10 levels with the use of artificial neural networks. In: Information Systems Architecture and Technology - the use of IT Technologies to Support Organizational Management in Risky Environment, Wrocław (2014)
19. Orłowski, C., Sarzyński, A., Karatzas, K., Katsifarakis, N., Nazarko, J.: Adaptation of an ANN-based air quality forecasting model to a new application area. In: Król, D., Nguyen, N., Shirai, K. (eds.) Advanced Topics in Intelligent Information and Database Systems, pp. 479–488 (2017)
20. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Model. Pract. Theory 15(10), 1310–1319 (2007)
21. Voukantsis, D., Karatzas, K., Kukkonen, J., Räsänen, T., Karppinen, A., Kolehmainen, M.: Intercomparison of air quality data using principal component analysis, and forecasting of PM10 and PM2.5 concentrations using artificial neural networks, in Thessaloniki and Helsinki. Sci. Total Environ. 409, 1266–1276 (2011)
22. Szczepaniak, K., Astel, A., Bode, P., Sârbu, C., Biziuk, M., Raińska, E., Gos, K.: Assessment of atmospheric inorganic pollution in the urban region of Gdańsk. J. Radioanal. Nuclear Chem. 270(1), 35–42 (2006)
23. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
24. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
25. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. Fourteenth Int. Joint Conf. Artif. Intel. 2(12), 1137–1143 (1995)
26. EPA: Guidelines for developing an air quality (ozone and PM2.5) forecasting program, U.S. Environmental Protection Agency report EPA-456/R-03-002. https://www3.epa.gov/airnow/aq_forecasting_guidance-1016.pdf. Accessed 18 Aug 2017
27. Voukantsis, D., Niska, H., Karatzas, K., Riga, M., Damialis, A., Vokou, D.: Forecasting daily pollen concentrations using data-driven modeling methods in Thessaloniki, Greece. Atmos. Environ. 44(39), 5101–5111 (2010)
28. Tzima, F., Mitkas, P., Voukantsis, D., Karatzas, K.: Sparse episode identification in environmental datasets: the case of air quality assessment. Expert Syst. with Appl. 38(5), 5019–5027 (2011)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

14. 
Siwek, K., Osowski, S.: Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors. Eng. Appl. Artif. Intel. 25(6), 1246–1258 (2012) 15. Zhou, Q., Jiang, H., Wang, J., Zhou, J.: A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network. Sci. Total Environ. 496, 264– 274 (2014)

Journal

Vietnam Journal of Computer Science, Springer Journals

Published: May 24, 2018
