Risk-based framework for SLA violation abatement from the cloud service provider’s perspective

Abstract

The constant growth of the cloud market creates new challenges for cloud service providers. One such challenge is the need to avoid possible service level agreement (SLA) violations and their consequences through good SLA management. Researchers have proposed various frameworks and have made significant advances in managing SLAs from the perspective of both cloud users and providers. However, none of these approaches guides the service provider on the necessary steps to take for SLA violation abatement; that is, the prediction of possible SLA violations, the process to follow when the system identifies the threat of SLA violation, and the recommended action to take to avoid SLA violation. In this paper, we approach this process of SLA violation detection and abatement from a risk management perspective. We propose a Risk Management-based Framework for SLA violation abatement (RMF-SLA) which follows the formation of an SLA and comprises SLA monitoring, violation prediction and decision recommendation. Through experiments, we validate and demonstrate the suitability of the proposed framework for assisting cloud providers to minimize possible service violations and penalties.

1. INTRODUCTION

Cloud computing has captured a huge customer base in enterprise and small business due to its ability to provide users with a wide range of flexible services at reduced cost. Due to its wide adoption, cloud computing is often referred to as the fifth utility [1]. Enterprises and businesses using the operational paradigm of cloud computing have drastically reduced their business costs by moving from capital expenditure (e.g. buying resources and building data centres) to operational expenditure, thus enabling them to focus on their core business activities [2].
Features such as the elastic scaling of resources, pay-as-you-go pricing and metered resource usage have also enabled the users of such enterprises and businesses to reduce their operational costs [3]. However, while such features are beneficial from the user's perspective, they create the illusion that businesses have an infinite quantity of resources that can be accessed as required. This may not be true in all cases, especially when the business is a small to medium enterprise (SME) cloud service provider. Unlike large scale cloud service providers, such as Amazon, an SME has a finite quantity of computing resources with which to manage its users' requests [4]. As shown in the literature, these issues between a service provider and service user are addressed by defining and managing a service level agreement (SLA). An SLA describes all the service level objectives (SLOs) and agreed quality of service (QoS) parameters [5] and shows the commitment and obligations of each party, including the deliverables and the penalties to be applied in the case of SLA violation [6]. As is the case in any business activity, the primary aim of a service provider in cloud computing is to fulfill its commitment to the many users with whom it has formed an SLA, and so avoid violations. This falls within the broad domain of cloud service management. Recent contributions in this area [7–10] have looked at different methods, such as the automatic extraction of metrics, ontology-based semantic reasoning and a linked unified service description language, to manage the SLA and avoid service violations. However, these approaches consider the management of a service after a violation has taken place; in other words, they adopt a reactive approach to service management, which may be detrimental to the cloud service provider's reputation and may negatively impact the likelihood of attracting future business from existing or new cloud service users.
This can be avoided if service providers proactively manage their services. In this form of service management, service providers constantly monitor the SLOs after the SLA has been formed to ensure that possible violations are averted before they occur. In our previous work [4], we observed that this proactive management after an SLA has been formed but before violation occurs works well for large scale cloud service providers. This is because large providers have abundant resources and can easily obtain additional resources if/when required to avert possible violations as they are detected. However, for an SME cloud service provider that has a finite quantity of computing resources, obtaining such additional resources at the time and in the quantity required after an SLA has been formed may not be possible. For SLA violations to be proactively managed by such cloud service providers, we emphasize that the service management process should start before the formation of the SLA [4], during the SLA negotiation/formation phase (referred to here as the pre-interaction phase) in which the cloud service provider pre-allocates its available resources to users after conducting a vetting process. In the SLA execution phase (referred to here as the post-interaction phase), which includes SLA monitoring, SLA violation prediction and decisions on violation abatement, the SLOs are constantly monitored to ensure that possible violations are averted. From the perspective of SME cloud service providers, therefore, active service management in both the pre-interaction phase and the post-interaction phase will lead to the better administration of the SLA, maximizing the likely commitment of the service provider, reducing the prospect of SLA violation, and achieving maximum financial returns [4, 11, 12]. Our previous work proposed the provider-based optimized personalized viable-SLA (OPV-SLA) framework for service management [13, 14]. 
OPV-SLA is divided into two parts, namely the pre-interaction phase and the post-interaction phase. In the pre-interaction phase, the provider starts the process of SLA management by negotiating and forming a viable SLA, which is then proactively managed in the post-interaction phase. In this paper, we explain the workings of the OPV-SLA post-interaction phase, which we term the Risk Management Framework for SLA violation abatement (RMF-SLA). In this framework, the runtime performance of the SLOs is captured and predicted, and the service provider recommends the appropriate actions to take to proactively mitigate the risk of SLA violation.

The rest of the paper is organized as follows. Section 2 describes the related literature on SLA management. Sections 3 and 4 detail the components of the RMF-SLA along with their workings. Section 5 describes the evaluation of RMF-SLA and Section 6 concludes the paper.

2. LITERATURE REVIEW

The activities in SLA management can be broadly categorized into two time periods, namely the pre-interaction phase and the post-interaction phase, as mentioned in the Introduction. The activities in the pre-interaction phase are SLA negotiation and formation, while the activities in the post-interaction phase are QoS prediction for future intervals, runtime QoS monitoring, the comparison of actual and promised QoS parameters, and determining the best course of action for SLA management in the event of observed differences [4]. As our focus in this paper is on the post-interaction phase, we present a summary of some of the existing approaches to SLA management and violation abatement in the literature. Wood et al. [15] proposed the Sandpiper approach for SLA monitoring and resource management to detect hotspots that indicate a possibility of violation. To eliminate a hotspot, Sandpiper resizes and shifts the virtual machine or adjusts resources.
It gathers the usage records of virtual and physical servers and flags a hotspot when resource usage exceeds a defined threshold. The proposed approach manages the runtime workload of the servers. Other approaches such as [16–19] map low-level resource metrics to SLA parameters. This is done by mapping the service status to a predefined threshold and identifying the deviation between the agreed and actual behavior to detect SLA violations using case-based reasoning (CBR) approaches. Although the proposed idea of mapping service resource metrics to SLA parameters helps the service provider to identify potential violations based on current performance measures, it may not guarantee commitment to the requirements of all customers, as the performance measures are not formed and agreed in the pre-interaction phase. These approaches also do not describe what needs to be done when the system identifies a likely violation. Some approaches offer a limited set of rules and use a CBR approach, which has its own limitations, such as adaptation, processing time and storage, and usually does not produce optimal results [20]. Another work in this category, by Falasi et al. [21], presents the Sky framework, which adaptively implements SLAs to manage changes in a federated cloud environment. The framework is capable of managing multilevel SLAs but does not describe the process for handling SLA violation. Also, SLAs are not formed during the negotiation process of the pre-interaction phase in this framework, which may not guarantee the requirement commitment of customers in the post-interaction phase. There are a number of approaches with self-management features which try to manage SLA violation before end users are affected. Brandic et al. [18] proposed a bottom-up hierarchical layered approach for the propagation of SLA violation when a violation threat is found.
Mosallanejad and Atan [22] proposed a hierarchical self-healing approach in which each layer of the cloud is responsible for managing the problem by itself. If the problem cannot be resolved at a given layer, the framework informs the upper layer for possible remedial action. Lu et al. [23] proposed an actor system framework that adopts a parent-child relationship for managing SLA violation. When the actor system detects a possible SLA violation, it first tries to resolve it, and then sends the error information to the upper parent actor if it is unable to do so. A multilayer monitoring approach was proposed by Katsaros et al. [24] that monitors SLAs based on observing time intervals and SLA parameters. The proposed approach features runtime adaptability of resource provision, estimation and decision making. Although the self-management approaches [18, 22, 23] attempt to adjust violations when they are detected by the system, they do not suggest what action to take to avoid the violation occurring. Moreover, these approaches lack the agreement process in the pre-interaction phase of SLA management. Other approaches in the literature use a third-party broker to manage SLAs. Lee et al. [25] proposed a cloud service broker portal that provides a gateway for cloud service providers and users to interact with each other. The portal has a single entry point for a cloud service broker, a cloud service provider and a cloud service user. It interacts with a unique interface designed for each stakeholder, and has a brokerage API that integrates various cloud service providers into the cloud service broker portal. The cloud service brokerage model has five components [26]. Such frameworks help cloud users to select a suitable cloud provider that satisfies the functional and non-functional requirements of the SLA; from a provider's perspective, however, they lack the pre-interaction processing steps and the actions to be taken if an SLA violation is detected.
Other SLA management approaches focus on trusted relationships between the provider and the user. Noor and Sheng [27] proposed an adaptive credibility model that offers trust as a service to the service provider. The proposed approach helps the service provider to differentiate between biased and unbiased feedback. Although this approach is helpful for the service provider, it is only effective for a system that has existing users. It lacks a process for differentiating between prospective users and for determining the recommended action (RAc) to be taken on SLA formation and violation. Fan and Perros [28] differentiated between biased and unbiased feedback based on the familiarity and consistency of the feedback. They proposed a trust value range and ranked users based on that value. However, without a bootstrapping mechanism, this method cannot be applied to new users who have only recently subscribed to services. Another category of SLA management approaches uses proactive mechanisms to identify and predict likely SLA violations. QoS parameters are predicted using a user-based collaborative filtering (CF) mechanism, an item-based CF mechanism and a stream processing framework [29–31]. Cardellini et al. [32] proposed heuristic policies to predict QoS parameters and determine the resources needed in future intervals using the recursive least squares method; however, the process of managing SLA violations when they are predicted by the system is not defined. It can be seen from the above discussion that even though many approaches have been proposed in the literature for cloud SLA management, not all of them guide the service provider on the steps required for SLA violation abatement. In Table 1, we compare SLA management approaches on the three criteria required for SLA abatement, namely the ability to predict possible SLA violations, a description of the process to be followed when the system identifies an SLA violation threat, and the SLA violation abatement recommendation.
It is important to mention that most of the existing approaches focus on the post-interaction phase of SLA management, that is, after a user and provider have formed their SLA. As mentioned in Section 1, this is not beneficial for SME cloud service providers, since the careful prior negotiation of SLOs is necessary to maximize the likelihood of consumer commitment, reduce the possibility of SLA violation, and gain maximum financial returns. To address these drawbacks, we proposed the OPV-SLA management framework, shown in Fig. 1, in our previous work [13, 14, 49]. This framework first assists the user and provider to agree on QoS expectations and then monitors the terms of the agreement for violations.

Table 1. Comparison of SLA management approaches in the post-interaction phase.

Source | Predicts SLA/SLO/QoS | Defines procedure for when a violation threat is detected | SLA violation recommendation
Emeakaroha et al. [16, 17] | ✓ | ✗ | ✗
Brandic et al. [18] | ✓ | ✓ | ✗
Haq et al. [19] | ✓ | ✓ | ✗
Emeakaroha et al. [33] | ✓ | ✓ | ✗
Mosallanejad and Atan [22] | ✗ | ✓ | ✗
Katsaros et al. [24] | ✗ | ✗ | ✗
Al Falasi et al. [21] | ✗ | ✗ | ✗
Chandrasekar et al. [34] | ✓ | ✗ | ✗
Alhamad et al. [35] | ✗ | ✗ | ✗
Wang et al. [36] | ✗ | ✗ | ✗
Hammadi and Hussain [37] | ✗ | ✗ | ✗
Muchahari and Sinha [38] | ✗ | ✗ | ✗
Cicotti et al. [31] | ✓ | ✗ | ✗
Romano et al. [30] | ✓ | ✗ | ✗
Sun et al. [39] | ✓ | ✗ | ✗
Hussain et al. [40] | ✓ | ✓ | ✓
Leitner et al. [41] | ✓ | ✗ | ✗
Ciciani et al. [42] | ✓ | ✗ | ✗
Cardellini et al. [43] | ✓ | ✗ | ✗
Son et al. [44] | ✗ | ✗ | ✗
Silaghi et al. [45] | ✗ | ✗ | ✗
Badidi et al. [46] | ✗ | ✗ | ✗
Pacheco-Sanchez et al. [47] | ✓ | ✗ | ✗
Wood et al. [15] | ✗ | ✓ | ✗
Schmieders et al. [48] | ✓ | ✓ | ✗
Noor and Sheng [27] | ✗ | ✗ | ✗
Fan and Perros [28] | ✗ | ✗ | ✗

Figure 1. OPV-SLA management framework (reproduced from [13]).

The process of forming customized SLAs between providers and users in the pre-interaction phase is explained in our previous work [13]. In this paper, we explain the process for determining and abating SLA violation, which forms part of the post-interaction phase. It is important to mention that various approaches in the literature have used techniques such as QoS prediction [29], a workflow detection control model [39], machine learning regression techniques [41] and a workload analyzer [50] to ascertain the possibility of SLA violation. In our method, we analyze the notion of risk as the criterion for ascertaining possible SLA violation and the subsequent actions to take for its abatement. A related work in this category that uses risk as the criterion for SLA management is Kiran et al. [51], who proposed a risk assessment framework for cloud service provisioning. Their framework assists both SaaS and IaaS providers to identify, evaluate and mitigate risk in service provisioning. The risk assessment between a service provider and an infrastructure provider consists of six steps: assessing the infrastructure provider's business dealings, the service provider's business dealings, the potential for service failure under the SLA, the reliability of the services offered under the SLA, the service provider's runtime operation and monitoring of QoS parameters, and lastly, the infrastructure provider's potential infrastructure failure. Zhang et al.
[52] proposed a risk management framework that analyzes, assesses and mitigates risk to help the service provider to achieve better management of SLAs. Risk assessment in this approach comprises four steps: defining the likelihood of vulnerabilities and associated threats, determining the magnitude of risk, finding the level of risk, and taking all necessary actions to mitigate risk. Cicotti et al. [53] proposed a model that predicts future QoS based on runtime monitoring data and data from a probabilistic model-checking method. The system generates an alert when it detects probable QoS violation and helps service providers to stop or minimize possible service violation. Albakri et al. [54] proposed a security risk management framework that allows users to evaluate risk and contribute to the risk assessment process. The framework permits users to define the legal requirements, identify the risk factors, and obtain feedback from a service provider. While there are approaches that consider the notion of risk in SLA management, most of them are unable to guide a service provider in relation to the steps to be taken to determine and address possible SLA violation. In the next section, we define our RMF-SLA, which assists cloud service providers to identify and assess the risk of SLA violation occurring in the post-interaction phase and to manage it by considering a set of decision parameters.

3. RISK MANAGEMENT FRAMEWORK FOR SLA VIOLATION ABATEMENT

As shown in Fig. 1, RMF-SLA is a combination of five modules that address the detection and abatement of SLA violation:

Module 1: Threshold Formation Module (TFM)
Module 2: Runtime QoS Monitoring Module (RQoSMM)
Module 3: QoS Prediction Module (QoSPM)
Module 4: Risk Identification Module (RIM)
Module 5: Risk Management Module (RMM)

The workings of each module of RMF-SLA are explained in the following subsections.

3.1. Module 1: TFM

This is the first module of the RMF-SLA framework, as shown in Fig. 2.
It takes the QoS values of the SLOs determined between the cloud provider and the user in the pre-interaction phase and forms two thresholds for determining and managing violations: the Agreed threshold (Ta) and the Safe threshold (Ts).

Agreed threshold (Ta): This threshold value is described in the SLA and is mutually agreed by the user and the provider. When both parties finalize their SLA, they agree on certain thresholds for each service level objective (SLO) and QoS parameter. A service provider that does not comply with the agreed QoS parameters commits a service violation and is liable for violation penalties.

Safe threshold (Ts): To avoid possible service violation and penalties, we propose that a provider should define a safe threshold (Ts) that is stricter than the agreed threshold (Ta). This is a customized threshold defined by the provider. It raises an alarm of possible SLA violation when a runtime QoS reaches or exceeds the threshold and invokes Module 5, the RMM, to take the necessary action to avert the violation.

Figure 2. Provider-based RMF-SLA.

To explain the importance of Ts and Ta, let us consider a provider and user forming an SLA in the pre-interaction phase who agree on 80% availability of a resource (memory). This 80% availability of memory is the Ta value agreed by both parties. For service management and SLA violation abatement, the provider defines a customized threshold for the total memory, say 90%, which is the Ts value. When the availability at runtime falls below this 90% threshold, the framework alerts the service provider and activates the RMM to manage the risk of the QoS value of the SLO falling below the Ta value.

3.2. Module 2: RQoSMM

This is the second module of RMF-SLA, which is responsible for monitoring the runtime QoS parameters of each SLO in the SLA.
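The two-threshold check of Module 1, together with the memory-availability example above (Ta = 80%, Ts = 90%), can be sketched as follows. This is an illustrative sketch only; the function name and return labels are our assumptions, not part of the framework's specification.

```python
# Illustrative sketch of the TFM thresholds (Section 3.1); names are assumed.
# Ta: availability agreed in the SLA; Ts: the stricter, provider-defined
# safe threshold that triggers the RMM before an actual violation occurs.

def check_availability(runtime_availability: float, ta: float, ts: float) -> str:
    """Classify a runtime availability reading against the two thresholds.

    Returns "ok" while availability stays at or above Ts, "at_risk" when it
    drops below Ts but still meets Ta (the RMM should be invoked), and
    "violation" once it falls below the agreed threshold Ta.
    """
    if ts <= ta:
        raise ValueError("Ts must be stricter (higher) than Ta for availability")
    if runtime_availability < ta:
        return "violation"   # SLA breached; penalties apply
    if runtime_availability < ts:
        return "at_risk"     # safe threshold crossed; activate RMM
    return "ok"

# The memory-availability example from the text: Ta = 80%, Ts = 90%.
print(check_availability(0.95, ta=0.80, ts=0.90))  # ok
print(check_availability(0.85, ta=0.80, ts=0.90))  # at_risk
print(check_availability(0.75, ta=0.80, ts=0.90))  # violation
```

The gap between Ts and Ta is the provider's reaction window: the wider it is, the earlier the RMM is engaged, at the cost of more frequent alerts.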
The captured runtime QoS values are sent to Module 3—QoSPM, where the QoS values of the SLOs in the near future are determined.

3.3. Module 3: QoSPM

The QoSPM is the third module of RMF-SLA, which predicts the user's resource usage in each SLO. The module uses an optimal prediction algorithm for each SLO to predict the user's likely resource usage based on his or her usage history. The choice of an optimal prediction algorithm plays a key role in decision making, since the accuracy of a prediction method depends on the characteristics of the dataset. In our previous work [55], we considered stochastic, neural network and different time series prediction methods and analyzed their prediction accuracy on a dataset from the Amazon EC2 cloud. We observed from the evaluation results that an optimal prediction result is obtained by considering small intervals for prediction and using the autoregressive integrated moving average (ARIMA) method. ARIMA is one of the most efficient versions of the autoregressive moving average (ARMA) method, formulated by the mathematical statisticians George Box and Gwilym Jenkins in the 1970s [56] for use with business and economic data. It has been widely used as an optimal prediction method in the cloud service domain. For example, Calheiros et al. [57] developed a cloud virtual machine workload prediction model and observed 91% prediction accuracy when using the ARIMA method. Other researchers, such as Rehman et al. [58], have used the ARIMA method to forecast QoS values from the user perspective with good accuracy. Hence, for the prediction of QoS we use the ARIMA method in QoSPM. To enhance the accuracy of the prediction result, RQoSMM constantly inputs the values of the SLOs in previous time intervals to QoSPM. For example, RQoSMM captures the QoS values from time interval 1 to time interval 10 (t1–t10) to predict them over time intervals 11–14 (t11–t14).
When the QoS values over the interval t15 to t18 are predicted, RQoSMM gives QoSPM the captured QoS values up to t14, to ensure that an optimal prediction result closely related to the observed data with minimum deviation [55] is achieved. The pseudocode of the workings of QoSPM is as follows:

for (i = start_limit; i <= end_limit; i++)
  if (RQoSMM is empty)
    input[i] = prev_observation[i];
  else
    input[i] = RQoSMM[i] + prev_observation[i];
Pred_output = Prediction_algo(input);

The algorithm starts by ensuring that the runtime data are available. If a transaction has just started and runtime QoS data are not available, QoSPM considers the user's data from the Identity Management Module (IMM) in the pre-interaction phase. As explained in our previous work [13], this module stores the interaction history of a user. If a user is new and has no previous transaction records, the commitment of the user to the QoS values is ascertained by IMM using the top-K nearest neighbors and their transactions. Once these values have been obtained, the QoSPM prediction is made not by taking the relative values of the SLOs, but by taking the percentage value of the level of the SLO commitment relative to the level of the SLO requested.
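The input-assembly loop in the pseudocode above can be rendered as runnable Python. This is a sketch under assumptions: the list-based data structures are ours, and a simple moving-average forecast stands in for the ARIMA model that QoSPM actually uses.

```python
# Sketch of QoSPM input assembly (Section 3.3). The moving-average predictor
# below is a stand-in for ARIMA, used only to keep the example self-contained.

def assemble_input(rqosmm, prev_observation):
    """Build the prediction input: fall back to historical observations when
    no runtime data exists, otherwise combine the two series per interval,
    mirroring `input[i] = RQoSMM[i] + prev_observation[i]` in the pseudocode.
    Both lists are assumed to be index-aligned."""
    combined = []
    for i in range(len(prev_observation)):
        if not rqosmm:
            combined.append(prev_observation[i])
        else:
            combined.append(rqosmm[i] + prev_observation[i])
    return combined

def predict_next(series, horizon=4, window=3):
    """Stand-in for Prediction_algo: forecast each future step as the mean of
    the last `window` values, feeding each forecast back into the history
    (as RQoSMM feeds captured values back into QoSPM)."""
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        step = sum(history[-window:]) / window
        forecasts.append(step)
        history.append(step)
    return forecasts

# Predict t11-t14 from observations captured over t1-t10.
observed = [70, 72, 71, 73, 74, 73, 75, 76, 75, 77]
print(predict_next(assemble_input([], observed), horizon=4))
```

In the framework itself, `predict_next` would be replaced by an ARIMA fit over the assembled series, with the window re-anchored as each new interval is captured.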
This is because the level of resources requested by a user in the current SLO interaction may be different from what was requested in the past, so a standardized scale on which to represent these values is needed for fair analysis. To obtain the standardized scale, we take the relative values of the previous transaction, as presented in Equation (1):

\( R_{pred} = \sum_{i=1}^{m} \left( \frac{R_{used}}{R_{request}} \right)_i \times R_{c_i} \)  (1)

where R_used is the amount of resources used for the previous SLA, R_request is the amount of resources requested for the previous SLA, i is the time interval from 1 to m, m is the total time interval, R_ci is the predicted resource for the new SLA at the ith interval and R_pred is the total predicted amount of resources for the currently requested SLA. For example, if a user used 60 of the 100 resource units requested in the previous SLA during interval i, that interval contributes 0.6 × R_ci to R_pred.

The output of the QoSPM is forwarded to Module 4—RIM, which invokes Module 5—RMM when the possibility of SLA violation is detected. 3.4.
Module 4: RIM

RIM is responsible for comparing the predicted values from QoSPM with the Ts value formed in Module 1. If the value from QoSPM reaches or exceeds the Ts value, Module 5—RMM is activated to abate possible SLA violation.

3.5. Module 5: RMM

As mentioned earlier, RMM is invoked when RIM determines the possibility of an SLA violation occurring. Once invoked, RMM estimates the severity of the risk of SLA violation and determines how to manage it. As presented in Fig. 3, the RMM takes three inputs: the risk attitude of the provider, the reputation of the consumer and the transaction trend. It uses these to take appropriate actions and to manage the risk of SLA violation. RMM, as shown in Fig. 5, comprises two sub-modules, the risk estimation module (REM) and the risk mitigation module (RMtM).

REM: This sub-module is responsible for estimating the risk of an SLA violation occurring. The notion of risk is subjective, as is the process for managing it, so to determine the severity of the risk from the subjective viewpoint of the provider, the following three inputs are considered in REM:

Risk attitude (RA) of the provider: The RA of the provider represents its capacity to deal with risk. A provider's RA is risk averse, risk neutral or risk taking. A provider with a risk averse attitude is more reluctant to take a risk (in this case, to allow an SLA violation to occur) than a provider with a risk neutral or risk taking attitude [13].

Reputation of the user for whom the possibility of SLA violation is being determined: Reputation is the reliability or trust value a provider places on a user to uphold the terms of the SLA. The reputation of a user shows the user's commitment to previously formed SLAs with the provider and is represented as bronze, silver or gold. The process for determining the class of reputation is described in our previous work [13].
The reputation of a user is an input of REM because we consider that if a provider values a user highly (represented as silver or gold class), the provider will prefer to take immediate action to minimize the possible risk of SLA violation, in contrast to a similar situation with a user whose reputation is bronze class.

Transaction trend (TT) curve over future time intervals: The third input to REM is the TT curve, which shows a user's use of an SLO over future time intervals. This shows the prevailing use of resources by the user over a period of time (from QoSPM) and how this usage maps against the formed Ta and Ts values. When the TT curve exceeds Ts, it may either move towards Ta, as shown in Fig. 4a, or away from Ta, as shown in Fig. 4b. REM captures the direction of the TT curve to ascertain the risk of SLA violation and estimate the steps required to mitigate the risk.

RMtM: As discussed above, REM estimates the risk of possible SLA violation occurring by considering the relevant inputs. Subsequently, RMtM recommends an appropriate action to manage and mitigate the risk. A fuzzy inference system performs this computation, recommending immediate action, delayed action or no action. When the risk of violation is assessed as high, RMtM recommends that the service provider take immediate action. In taking this action, the service provider stops accepting new requests and arranges for sufficient resources to be provided in the fastest possible time to avoid service violation. When the risk of violation is estimated as medium or low, RMtM decides and recommends whether to take delayed action or no action. Here, it is implied that the provider accepts the risk but keeps the situation under observation, with the intention of taking any necessary action within a certain timeframe.

Figure 3. Working of RMM in RMF-SLA.
TT curve moving towards and away from Ta.

To summarize, the working of RMF-SLA is as follows and as shown in Fig. 5:

Step 1: After forming the Ts and Ta thresholds, QoSPM collects data from IMM and RQoSMM.
Step 2: QoSPM predicts the QoS usage values in future time intervals.
Step 3: RIM compares the predicted values from QoSPM with the Ts value.
Step 4: If the value from QoSPM is below Ts, no action is taken and the runtime QoS parameters of the SLO continue to be monitored. However, if the value from QoSPM reaches or exceeds Ts, RMM is activated to manage the risk of SLA violation.
Step 5: The REM of RMM estimates the risk of SLA violation by capturing the RA of the provider, the reputation of the user and the TT curve. A fuzzy inference system (FIS) is used to estimate the risk of SLA violation occurring.
Step 6: Depending on the risk estimated in REM, RMtM suggests the appropriate action for the cloud provider to take to mitigate the risk of violation: immediate action, delayed action or no action.

Figure 5. The working of RMF-SLA in the post-interaction phase of OPV-SLA.

The fuzzy inference system for estimating the possible risk of SLA violation and determining the appropriate mitigation action is explained in the next section.

4. FIS FOR DETERMINING POSSIBLE SLA VIOLATIONS AND THEIR ABATEMENT IN RMF-SLA

To assess the possible risk of SLA violation and manage its abatement, we use a Mamdani-type FIS [59] to combine the various inputs. Figure 6 represents the inputs and the output used to manage possible SLA violations. The FIS and the membership functions of each of its inputs and outputs are explained in the following subsections.
Figure 6. FIS for assessing and managing the risk of SLA violation in RMF-SLA.

4.1. Defining the fuzzy sets and membership function for input—RA of the provider

The RA of the provider defines the provider's propensity to take risk. Depending on its RA, a service provider may be risk averse, risk neutral or risk taking; these are the fuzzy sets over which the RA is represented. A risk averse provider attempts to avoid any risk, whether small or large. A risk neutral provider takes the middle ground: depending on the nature of the risk, it may decide to take action or to ignore the risk. A risk taking provider has a bold attitude, ignoring small risks and taking action only for risks that will have a significant effect. We consider 1–5 as the universe of discourse over which the fuzzy sets of this input are represented. The membership function for this input is shown in Fig. 7, and the corresponding membership function for each fuzzy set is as follows:

μ_risk averse(RA) = (3 − x)/2 if 1 < x ≤ 3; 0 if 3 < x ≤ 5

μ_risk neutral(RA) = (x − 1)/2 if 1 < x ≤ 3; (5 − x)/2 if 3 < x ≤ 5

μ_risk taking(RA) = 0 if 1 < x ≤ 3; (x − 3)/2 if 3 < x ≤ 5

Figure 7. RA of a provider in assessing the possibility of SLA violation occurring.

4.2. Defining the fuzzy sets and membership function for input—reputation of the user

User reputation (R) is the trustworthiness of the user's commitment, in previous transactions, to the defined SLAs with the service provider. The fuzzy sets over which the reputation value of a user is represented are bronze, silver and gold, and the universe of discourse is from 0 to 100. The membership function for this input is shown in Fig.
8, and the corresponding membership function for each fuzzy set is as follows:

μ_Bronze(R) = 1 if 0 < x ≤ 40; (45 − x)/5 if 40 < x ≤ 45; 0 if 45 < x ≤ 100

μ_Silver(R) = 0 if 0 < x ≤ 40; (x − 40)/5 if 40 < x ≤ 45; 1 if 45 < x ≤ 70; (75 − x)/5 if 70 < x ≤ 75; 0 if 75 < x ≤ 100

μ_Gold(R) = 0 if 0 < x ≤ 70; (x − 70)/5 if 70 < x ≤ 75; 1 if 75 < x ≤ 100

Figure 8. Membership function for the reliability of a user.

4.3. Defining the fuzzy sets and membership function for input—TT

TT shows the trajectory of the predicted resource usage in future intervals. The values of the predicted trajectory are obtained from QoSPM. The fuzzy sets used to represent input TT are towards Ta and away from Ta. The universe of discourse over which the input TT is represented is from 0 to 1. The membership function for this input is shown in Fig. 9, and the corresponding membership function for each fuzzy set is as follows:

μ_Away(TT) = 1 − x if 0 < x ≤ 1

μ_Towards(TT) = x if 0 < x ≤ 1

Figure 9. Membership function for the TT.

4.4. Defining the fuzzy sets and membership function for output—RAc

The output RAc is the appropriate action to be taken to manage the possible risk of violation occurring; the recommended output is Immediate Action, Delayed Action or No Action. These are the fuzzy sets used to represent the output, and the universe of discourse over which they are represented is from 0 to 1. The membership function for this output is shown in Fig. 10, and the corresponding membership function for each fuzzy set is as follows:

μ_No Action(RAc) = 1 if 0 < x ≤ 0.01; 0 if 0.01 < x ≤ 1

μ_Delayed Action(RAc) = 0 if 0 < x ≤ 0.01; (1 − x)/0.99 if 0.01 < x ≤ 1

μ_Immediate Action(RAc) = 0 if 0 < x ≤ 0.01; (x − 0.01)/0.99 if 0.01 < x ≤ 1

Figure 10. Membership function for recommended risk mitigation action.

4.5.
Fuzzy rules for possible risk of SLA violation occurring and mitigation action to be taken

The combination of linguistic variables for the inputs, resulting in a total of eighteen rules, is presented in Table 2. The variables are: the RA of the provider [risk averse (Ra), risk neutral (Rn) or risk taking (Rt)], the reputation of the user [bronze (B), silver (S) or gold (G)] and the TT [towards (T) or away (A)].

Table 2. FIS rules for the assessment and abatement of SLA violation risk.

Rule #  RA  Reputation  TT      Recommended risk mitigation action
1   If  Ra  and B  and  T  then  IA
2   If  Ra  and B  and  A  then  IA
3   If  Ra  and S  and  T  then  IA
4   If  Ra  and S  and  A  then  DA
5   If  Ra  and G  and  T  then  DA
6   If  Ra  and G  and  A  then  DA
7   If  Rn  and B  and  T  then  IA
8   If  Rn  and B  and  A  then  DA
9   If  Rn  and S  and  T  then  IA
10  If  Rn  and S  and  A  then  NA
11  If  Rn  and G  and  T  then  DA
12  If  Rn  and G  and  A  then  NA
13  If  Rt  and B  and  T  then  DA
14  If  Rt  and B  and  A  then  NA
15  If  Rt  and S  and  T  then  DA
16  If  Rt  and S  and  A  then  NA
17  If  Rt  and G  and  T  then  NA
18  If  Rt  and G  and  A  then  NA
In the next section, we demonstrate how the service provider can assess and manage the risk of possible SLA violation with the consumer using the RMF-SLA framework of OPV-SLA.

5. VALIDATION OF RMF-SLA FRAMEWORK FOR MANAGING POSSIBLE SLA VIOLATION ABATEMENT

To demonstrate the applicability of the RMF-SLA framework for service providers in the abatement of possible SLA violations, we use the dataset for the Amazon EC2 IaaS cloud service EC2 US West, collected from CloudClimate [60] through the PRTG monitoring service [61]. This dataset is used for QoS prediction, for managing SLAs and for the abatement of possible violations. The prototype was built using Microsoft Visual Studio 2010 for the interface, Microsoft SQL Server Management Studio 2008 for the databases and MATLAB for the FIS application.
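In crisp form (setting aside the fuzzy blending between adjacent sets), the eighteen rules of Table 2 reduce to a lookup table. A minimal Python sketch, using the abbreviations of Table 2; the dictionary name is our own:

```python
# Table 2 as a crisp lookup: (risk attitude, reputation, trend) -> action.
# Ra/Rn/Rt = risk averse/neutral/taking; B/S/G = bronze/silver/gold;
# T/A = trend towards/away from Ta; IA/DA/NA = immediate/delayed/no action.
FIS_RULES = {
    ("Ra", "B", "T"): "IA", ("Ra", "B", "A"): "IA",
    ("Ra", "S", "T"): "IA", ("Ra", "S", "A"): "DA",
    ("Ra", "G", "T"): "DA", ("Ra", "G", "A"): "DA",
    ("Rn", "B", "T"): "IA", ("Rn", "B", "A"): "DA",
    ("Rn", "S", "T"): "IA", ("Rn", "S", "A"): "NA",
    ("Rn", "G", "T"): "DA", ("Rn", "G", "A"): "NA",
    ("Rt", "B", "T"): "DA", ("Rt", "B", "A"): "NA",
    ("Rt", "S", "T"): "DA", ("Rt", "S", "A"): "NA",
    ("Rt", "G", "T"): "NA", ("Rt", "G", "A"): "NA",
}

# Rule 9: risk neutral provider, silver user, trend towards Ta
print(FIS_RULES[("Rn", "S", "T")])  # -> IA (immediate action)
```

The full Mamdani FIS additionally weighs each rule by the membership degrees of its antecedents and defuzzifies the aggregated output; this lookup captures only the rule base itself.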
To implement RMF-SLA, we first need to form an SLA between the user and provider in the OPV-SLA pre-interaction phase. Readers should refer to our previous work [13], in which we explain these computations in detail. The outcome of this phase is a well-formed SLA between the service provider and service user which maximizes the likelihood of the service provider's commitment to the formed SLOs, reduces the potential for SLA violation and achieves the maximum financial return for the available resources. To ensure the successful fulfilment of the SLA, the service provider needs to undertake the following management steps in the post-interaction phase: prediction, monitoring and decision-making, which are assisted by RMF-SLA, as explained next.

Using the EC2 US West dataset, we adopt CPU usage as the SLO we want to monitor to proactively pre-determine possible SLA violations. As discussed in Section 3, the first module of RMF-SLA is TFM, which defines the Ts value for the SLO being monitored. This is different from the Ta value, which is decided during the formation of the SLA. The next two stages in RMF-SLA are RQoSMM and QoSPM. QoSPM predicts the QoS values over a future period. A number of prediction methods are available, each of which generates a different output depending on the nature of the dataset being used. RMF-SLA uses ten prediction methods, namely cascade forward backpropagation (CFBP), Elman backpropagation (EBP), generalized regression (GR), nonlinear autoregressive neural network with external input (NARX), simple exponential smoothing (SES), simple moving average (SMA), weighted moving average (WMA), extrapolation (EXP), Holt-Winters double exponential smoothing (HWDES) and the autoregressive integrated moving average (ARIMA) method.
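Three of the statistical methods above are simple enough to sketch directly. The window size, weights and smoothing factor below are illustrative choices for the sketch, not the tuned parameters used in the experiments.

```python
def sma(history, window=3):
    """Simple moving average: forecast the next value as the mean of
    the last `window` observations."""
    return sum(history[-window:]) / window

def wma(history, weights=(1, 2, 3)):
    """Weighted moving average: more recent observations receive larger
    weights (here the most recent observation gets weight 3)."""
    recent = history[-len(weights):]
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)

def ses(history, alpha=0.5):
    """Simple exponential smoothing: the level is updated as
    alpha * observation + (1 - alpha) * previous level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Observed CPU usage (ms), first six intervals of Table 3
cpu_ms = [577, 561, 577, 576, 577, 561]
print(sma(cpu_ms), wma(cpu_ms), ses(cpu_ms))
```

The neural-network-based methods (CFBP, EBP, GR, NARX) and ARIMA require fitted models and are not reproduced here.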
Root mean square error (RMSE) and mean absolute deviation (MAD) are used as the benchmarks to measure prediction accuracy, and the method which gives the least error is used for prediction. We use an example to explain the process. Figure 11 shows the observed QoS of the SLO CPU usage for a period of 1 h on 6 September 2016, from 06:35 AM to 07:30 AM. To test the accuracy of the prediction methods, we take the QoS values for that SLO from a previous time period and use them to predict the QoS values for 06:35 AM to 07:30 AM on 6 September 2016. The neural network-based methods were trained on 1002 data points from the previous 6 days. The results of the observed and predicted QoS values are shown in Table 3 and Fig. 12. The prediction results are given at 5-min intervals, and all values are measured in milliseconds (ms). The accuracy of each method is measured using RMSE and MAD. The prediction accuracy of all methods is presented in Table 4 and Fig. 13.

Table 3. Prediction results of 10 methods at 5-min intervals.
Time           6:35:00 6:40:00 6:45:00 6:50:00 6:55:00 7:00:00 7:05:00 7:10:00 7:15:00 7:20:00 7:25:00 7:30:00
Observed data  577 561 577 576 577 561 560 561 577 562 561 577
CFBP   565.7835 569.1289 565.7835 565.9902 565.7835 569.1289 569.3407 569.1289 565.7835 568.9174 569.1289 565.7835
EBP    561.1551 560.3656 561.1551 560.921 561.1551 560.3656 560.4925 560.3656 561.1551 560.2562 560.3656 561.1551
GR     563.7587 562.58 568.6284 566.7982 563.7587 560.2106 560.2518 562.58 563.9038 559.747 562.58 563.7587
NARX   579.6209 578.983 561.8996539 575.5140582 564.1123463 563.7748057 561.8996539 576.9943986 577.0424372 575.5140582 562.7078841 576.8213273
SES    577.0000 577.0000 569.0000 573.0000 574.5000 575.7500 568.3750 564.1875 562.5938 569.7969 565.8984 563.4492
SMA    577.0000 569.0000 571.6667 571.3333 576.6667 571.3333 566.0000 560.6667 566.0000 566.6667 566.6667 566.6667
WMA    577.0000 566.3333 571.6667 576.3333 576.6667 566.3333 560.3333 560.6667 571.6667 567.0000 561.3333 571.6667
EXP    577.0000 561.0000 545.0000 593.0000 575.0000 578.0000 545.0000 559.0000 562.0000 593.0000 547.0000 560.0000
HWDES  577.0000 572.2000 572.9680 573.4475 574.2363 570.1543 566.4477 563.7661 566.5231 564.5816 562.7676 566.1918
ARIMA  576.5432 561.0000 577.0000 576.9987 577.0009 561.0887 560.0087 561.0000 577.0000 562.0000 561.0000 577.0594

Table 4. Prediction accuracy of all methods.

Prediction method  RMSE         MAD
CFBP               9.489859312  9.365921429
EBP                10.31080812  7.167585714
GR                 7.873491133  5.757521429
NARX               9.959856064  7.185961724
SES                9.317532303  7.878696987
SMA                6.52346193   5.523809524
WMA                3.743777907  2.761904762
EXP                18.48937919  15.00000000
HWDES              6.216755602  4.999325847
ARIMA              0.461865174  0.249580313
Figure 11. CPU usage data for 1 h [60].

Figure 12. Prediction output of each approach at 5-min intervals.

Figure 13. Prediction accuracy of all methods using RMSE and MAD as a benchmark.

We evaluate CPU usage every 5 min for 1 h, starting on 6 September 2016 at 06:35 AM and ending at 07:30 AM. Figure 11 presents the CPU usage for this period. From Table 4, we can see that of all the prediction methods, ARIMA gives the best prediction result, with an RMSE of 0.461865174 and a MAD of 0.249580313. Extending our example, QoSPM uses the ARIMA method to predict the QoS of CPU usage for the next hour, from 07:40 AM to 08:35 AM, as shown in Table 5. To determine the possibility of SLA violation and its abatement, we consider that the values of Ts and Ta are 575 ms and 599 ms, respectively, as shown in Fig. 14. Ta is the value of the SLO determined on the formation of the SLA and Ts is the safe threshold defined by the provider.
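The RMSE and MAD benchmarks used above can be stated compactly. A Python sketch; the sample values are taken from the Observed and ARIMA rows of Table 3 for illustration:

```python
import math

def rmse(observed, predicted):
    """Root mean square error over paired observed/predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mad(observed, predicted):
    """Mean absolute deviation over paired observed/predicted values."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

observed  = [577, 561, 577, 576]                 # observed CPU usage (ms)
predicted = [576.5432, 561.0, 577.0, 576.9987]   # ARIMA predictions (Table 3)
print(rmse(observed, predicted), mad(observed, predicted))
```

Both measures are computed per method over the full hour; the method with the smallest errors (here ARIMA) is then selected for forecasting.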
RIM compares the predicted QoS values with these values and, if the Ts value is exceeded, RMM is activated to ascertain and manage the risk of SLA violation.

Table 5. Prediction of the SLO over a period of 1 h using the ARIMA method.

Time   7:40:00 7:45:00 7:50:00 7:55:00 8:00:00 8:05:00 8:10:00 8:15:00 8:20:00 8:25:00 8:30:00 8:35:00
ARIMA  567 569 577 579 580 582 585 589 571 591 593 594

Figure 14. The Ts and Ta values of the predicted QoS over a future period.

From Fig. 14, we see that at the third time interval (07:50 AM) the predicted result exceeds the Ts threshold. At this stage, RMM considers the RA of the provider, the reputation of the user and the projected TT to suggest an appropriate action. In this scenario, we consider that the reputation of the user from the pre-interaction phase is 45 (silver), the RA of the provider is risk neutral and the TT is moving towards the agreed threshold value. These inputs are processed by the FIS rules, and the recommended output is immediate action. This is because the provider is risk neutral and the TT has exceeded Ts and is moving towards the Ta value, so if the provider does not take action, there is a high risk of SLA violation.
The provider needs to take immediate action by arranging the supply of the deficient resources, either from its own capacity or from external sources, to avoid possible violation. Similarly, at 08:15 AM we see that the predicted QoS value drops back below Ts. In this scenario, the output from the FIS recommends that no action be taken, as no likelihood of SLA violation is determined. From the above example, we see that RMF-SLA suggests the appropriate action to be taken to manage a potential SLA violation according to the RA of the provider, the user's reputation and the TT. The combination of RMF-SLA with the pre-interaction phase module of OPV-SLA assists an SME service provider to first form a viable SLA and then manage the risk associated with possible SLA violations.

6. CONCLUSION

The SLA is the key agreement made between a service provider and a service user in a cloud computing environment. To increase and maintain their reputation, service providers need a viable SLA management framework that helps them first to form viable SLAs and then to intelligently predict the occurrence of possible SLA violations before recommending an appropriate action. Our proposed OPV-SLA management framework helps service providers, particularly SME providers with limited resources, to achieve this. In this paper, we have briefly explained the OPV-SLA framework and focused on its post-interaction phase module, RMF-SLA, which is responsible for QoS prediction, detecting the possible occurrence of SLA violations and recommending the best possible decision to avert violation. We have demonstrated the application of RMF-SLA with an example and have shown how the proposed method assists cloud service providers in SLA management. In future work, we will investigate hidden patterns between SLOs and low-level metrics to predict likely violations for SLA management.

REFERENCES

1 Weinhardt, C., Anandasivam, D.-I.-W.A., Blau, B., Borissov, D.-I.N., Meinl, D.-M.T., Michalk, D.-I.-W.W. et al. (2009) Cloud computing—a classification, business models, and research directions. Bus. Inf. Syst. Eng., 1, 391–399.
2 Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A. et al. (2009) Above the Clouds: A Berkeley View of Cloud Computing. Technical Report No. UCB/EECS-2009-28, EECS Department, University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html.
3 Rhoton, J. (2013) Cloud Computing Explained: Implementation Handbook for Enterprises. Recursive Press, London, UK.
4 Hussain, W., Hussain, F.K., Hussain, O.K., Damiani, E. and Chang, E. (2017) Formulating and managing viable SLAs in cloud computing from a small to medium service provider's viewpoint: a state-of-the-art review. Information Systems, 71, 240–259. http://www.sciencedirect.com/science/article/pii/S0306437917302697.
5 Ludwig, H., Keller, A., Dan, A., King, R. and Franck, R. (2003) A service level agreement language for dynamic electronic services. Electron. Commer. Res., 3, 43–59. https://link.springer.com/article/10.1023/A:1021525310424.
6 Hussain, W., Hussain, F.K. and Hussain, O.K. (2014) Maintaining Trust in Cloud Computing through SLA Monitoring. Int. Conf. Neural Information Processing, Kuching, Malaysia, pp. 690–697. Springer International Publishing, Switzerland.
7 Mittal, S., Joshi, K.P., Pearce, C. and Joshi, A. (2016) Automatic Extraction of Metrics from SLAs for Cloud Service Management. 2016 IEEE Int. Conf. Cloud Engineering (IC2E), Berlin, Germany, pp. 139–142. IEEE.
8 Karim, B., Qing, T., Villar, J.R. and de la Cal, E. (2017) Resource Brokerage Ontology for Vendor-independent Cloud Service Management. 2017 IEEE 2nd Int. Conf.
Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, pp. 466–472. IEEE.
9 García, J.M., Fernández, P., Pedrinaci, C., Resinas, M., Cardoso, J. and Ruiz-Cortés, A. (2017) Modeling service level agreements with linked USDL agreement. IEEE Trans. Serv. Comput., 10, 52–65.
10 Jaramillo, G.E., Ardagna, C.A. and Anisetti, M. (2015) A Hybrid Representation Model for Service Contracts. 2015 Int. Conf. Information and Communication Technology Research (ICTRC), Abu Dhabi, United Arab Emirates, pp. 246–249. IEEE.
11 Messina, F., Pappalardo, G., Santoro, C., Rosaci, D. and Sarné, G.M. (2016) A multi-agent protocol for service level agreement negotiation in cloud federations. Int. J. Grid Utility Comput., 7, 101–112.
12 Feng, G. and Buyya, R. (2016) Maximum revenue-oriented resource allocation in cloud. Int. J. Grid Utility Comput., 7, 12–21.
13 Hussain, W., Hussain, F., Hussain, O. and Chang, E. (2016) Provider-based optimized personalized viable SLA (OPV-SLA) framework to prevent SLA violation. Comput. J., 59, 1760–1783.
14 Hussain, W., Hussain, F.K. and Hussain, O.K. (2016) SLA Management Framework to Avoid Violation in Cloud. Int. Conf. Neural Information Processing, Kyoto, Japan, pp. 309–316. Springer.
15 Wood, T., Shenoy, P., Venkataramani, A. and Yousif, M. (2009) Sandpiper: black-box and gray-box resource management for virtual machines. Comput. Netw., 53, 2923–2938.
16 Emeakaroha, V.C., Brandic, I., Maurer, M. and Dustdar, S. (2010) Low Level Metrics to High Level SLAs-LoM2HiS Framework: Bridging the Gap between Monitored Metrics and SLA Parameters in Cloud Environments. 2010 Int. Conf. High Performance Computing and Simulation (HPCS), pp. 48–54. IEEE.
17 Emeakaroha, V.C., Netto, M.A.
, Calheiros, R.N., Brandic, I., Buyya, R. and De Rose, C.A. (2012) Towards autonomic detection of SLA violations in cloud infrastructures. Future Generation Comput. Syst., 28, 1017–1029.
18 Brandic, I., Emeakaroha, V.C., Maurer, M., Dustdar, S., Acs, S., Kertesz, A. et al. (2010) LAYSI: A Layered Approach for SLA-Violation Propagation in Self-manageable Cloud Infrastructures. 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops (COMPSACW), Seoul, South Korea, pp. 365–370. IEEE.
19 Haq, I.U., Brandic, I. and Schikuta, E. (2010) SLA Validation in Layered Cloud Infrastructures. Int. Conf. Economics of Grids, Clouds, Systems, and Services, Ischia, Italy, pp. 153–164. ACM.
20 Cheetham, W., Varma, A. and Goebel, K. (2001) Case-Based Reasoning at General Electric. Proc. Fourteenth Int. Florida Artificial Intelligence Research Society Conference, Florida, USA, pp. 93–97.
21 Al Falasi, A., Serhani, M.A. and Dssouli, R. (2013) A Model for Multi-levels SLA Monitoring in Federated Cloud Environment. 2013 IEEE 10th Int. Conf. Ubiquitous Intelligence & Computing and 2013 IEEE 10th Int. Conf. Autonomic & Trusted Computing (UIC/ATC), Vietri sul Mare, Italy, pp. 363–370.
22 Mosallanejad, A. and Atan, R. (2013) HA-SLA: a hierarchical autonomic SLA model for SLA monitoring in cloud computing. J. Softw. Eng. Appl., 6, 114.
23 Lu, K., Yahyapour, R., Wieder, P., Yaqub, E., Abdullah, M., Schloer, B. et al. (2015) Fault-tolerant service level agreement lifecycle management in clouds using actor system. Future Generation Comput. Syst., 54, 247–259.
24 Katsaros, G., Kousiouris, G., Gogouvitis, S.V., Kyriazis, D., Menychtas, A. and Varvarigou, T. (2012) A self-adaptive hierarchical monitoring mechanism for Clouds. J. Syst. Softw., 85, 1029–1041.
25 Lee, J., Kim, J., Kang, D.-J., Kim, N. and Jung, S. (2014) Cloud Service Broker Portal: Main Entry Point for Multi-cloud Service Providers and Consumers. 2014 16th Int. Conf. Advanced Communication Technology (ICACT), Pyeongchang, South Korea, pp. 1108–1112. IEEE.
26 Jrad, F., Tao, J. and Streit, A. (2012) SLA Based Service Brokering in Intercloud Environments. CLOSER, pp. 76–81. https://s3.amazonaws.com/academia.edu.documents/42086411/SLA_based_Service_Brokering_in_Interclou20160204-30232-ruvxfi.pdf.
27 Noor, T.H. and Sheng, Q.Z. (2011) Trust as a Service: A Framework for Trust Management in Cloud Environments. Web Information System Engineering—WISE 2011, Sydney, Australia, pp. 314–321. Springer.
28 Fan, W. and Perros, H. (2013) A Reliability-Based Trust Management Mechanism for Cloud Services. 2013 12th IEEE Int. Conf. Trust, Security and Privacy in Computing and Communications (TrustCom), Melbourne, VIC, Australia, pp. 1581–1586. IEEE.
29 Zhang, Y., Zheng, Z. and Lyu, M.R. (2011) Exploring Latent Features for Memory-Based QoS Prediction in Cloud Computing. 2011 30th IEEE Symposium on Reliable Distributed Systems (SRDS), Madrid, Spain, pp. 1–10. IEEE.
30 Romano, L., De Mari, D., Jerzak, Z. and Fetzer, C. (2011) A Novel Approach to QoS Monitoring in the Cloud. 2011 First Int. Conf. Data Compression, Communications and Processing (CCP), Palinuro, Italy, pp. 45–51. IEEE.
31 Cicotti, G., Coppolino, L., D'Antonio, S. and Romano, L. (2015) How to monitor QoS in cloud infrastructures: the QoSMONaaS approach. Int. J. Comput. Sci. Eng., 1, 29–45.
32 Cardellini, V., Casalicchio, E., Lo Presti, F. and Silvestri, L.
(2011) SLA-Aware Resource Management for Application Service Providers in the Cloud. 2011 First International Symposium on Network Cloud Computing and Applications (NCCA), Toulouse, France, pp. 20–27. IEEE.
33 Emeakaroha, V.C., Ferreto, T.C., Netto, M.A., Brandic, I. and De Rose, C.A. (2012) CASViD: Application Level Monitoring for SLA Violation Detection in Clouds. 2012 IEEE 36th Annual Computer Software and Applications Conference (COMPSAC), Izmir, Turkey, pp. 499–508. IEEE.
34 Chandrasekar, A., Chandrasekar, K., Mahadevan, M. and Varalakshmi, P. (2012) QoS Monitoring and Dynamic Trust Establishment in the Cloud. Advances in Grid and Pervasive Computing, pp. 289–301. Springer.
35 Alhamad, M., Dillon, T. and Chang, E. (2010) SLA-Based Trust Model for Cloud Computing. 2010 13th Int. Conf. Network-Based Information Systems (NBiS), Takayama, Japan, pp. 321–324. IEEE.
36 Wang, M., Wu, X., Zhang, W., Ding, F., Zhou, J. and Pei, G. (2011) A Conceptual Platform of SLA in Cloud Computing. 2011 IEEE Ninth Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Sydney, Australia, pp. 1131–1135. IEEE.
37 Hammadi, A.M. and Hussain, O. (2012) A Framework for SLA Assurance in Cloud Computing. 2012 26th Int. Conf. Advanced Information Networking and Applications Workshops (WAINA), Fukuoka, Japan, pp. 393–398. IEEE.
38 Muchahari, M.K. and Sinha, S.K. (2012) A New Trust Management Architecture for Cloud Computing Environment. 2012 International Symposium on Cloud and Services Computing (ISCOS), Mangalore, India, pp. 136–140.
39 Sun, Y., Tan, W., Li, L., Lu, G. and Tang, A. (2013) SLA Detective Control Model for Workflow Composition of Cloud Services. 2013 IEEE 17th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD), Whistler, BC, Canada, pp. 165–171.
40 Hussain, O.K., Hussain, F.K., Singh, J., Janjua, N.K. and Chang, E.
( 2014 ) A user-based early warning service management framework in cloud computing . Comput. J. , 58 , 472 – 496 . Google Scholar CrossRef Search ADS 41 Leitner , P. , Wetzstein , B. , Rosenberg , F. , Michlmayr , A. , Dustdar , S. and Leymann , F. ( 2010 ) Runtime Prediction of Service Level Agreement Violations for Composite Services. Service-Oriented Computing. ICSOC/ServiceWave 2009 Workshops, Stockholm, Sweden, pp. 176–186. Springer. 42 Ciciani , B. , Didona , D. , Di Sanzo , P. , Palmieri , R. , Peluso , S. , Quaglia , F. et al. . ( 2012 ) Automated Workload Characterization in Cloud-based Transactional Data Grids. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pp. 1525–1533. IEEE. 43 Cardellini , V. , Casalicchio , E. , Lo Presti , F. and Silvestri , L. ( 2011 ) SLA-Aware Resource Management for Application Service Providers in the Cloud. 2011 First Int. Symposium on Network Cloud Computing and Applications (NCCA), pp. 20–27. IEEE. 44 Son , S. , Kang , D.-J. , Huh , S.P. , Kim , W.-Y. and Choi , W. ( 2016 ) Adaptive trade-off strategy for bargaining-based multi-objective SLA establishment under varying cloud workload . J. Supercomput. , 72 , 1597 – 1622 . Google Scholar CrossRef Search ADS 45 Silaghi , G.C. , ŞErban , L.D. and Litan , C.M. ( 2012 ) A time-constrained SLA negotiation strategy in competitive computational grids . Future Generation Comput. Syst. , 28 , 1303 – 1315 . Google Scholar CrossRef Search ADS 46 Badidi , E. ( 2013 ) A Cloud Service Broker for SLA-Based SAAS Provisioning. 2013 Int. Conf. Information Society (i-Society), Toronto, Canada, pp. 61–66. IEEE. 47 Pacheco-Sanchez , S. , Casale , G. , Scotney , B. , McClean , S. , Parr , G. and Dawson , S. ( 2011 ) Markovian Workload Characterization for QOS Prediction in the Cloud. 2011 IEEE Int. Conf. Cloud Computing (CLOUD), Washington, DC, USA, pp. 147–154. IEEE. 48 Schmieders , E. , Micsik , A. , Oriol , M. , Mahbub , K. 
and Kazhamiakin , R. ( 2011 ) Combining SLA Prediction and Cross Layer Adaptation for Preventing SLA Violations. http://eprints.sztaki.hu/6563/1/2ndwoss_Micsik.pdf (Accessed March 2017). 49 Hussain , W. , Hussain , F.K. and Hussain , O. ( 2016 ) Allocating Optimized Resources in the Cloud by a Viable SLA Model. 2016 IEEE Int. Conf. Fuzzy Systems (FUZZ-IEEE), Vancouver, Canada, pp. 1282–1287. IEEE. 50 Ciciani , B. , Didona , D. , Di Sanzo , P. , Palmieri , R. , Peluso , S. , Quaglia , F. et al. . ( 2012 ) Automated Workload Characterization in Cloud-Based Transactional Data Grids. 2012 IEEE 26th Int. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, pp. 1525–1533. IEEE. 51 Kiran , M. , Jiang , M. , Armstrong , D.J. and Djemame , K. ( 2011 ) Towards a Service Lifecycle Based Methodology for Risk Assessment in Cloud Computing. 2011 IEEE Ninth Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Sydney, NSW, Australia, pp. 449–456. IEEE. 52 Zhang , X. , Wuwong , N. , Li , H. and Zhang , X. ( 2010 ) Information Security Risk Management Framework for the Cloud Computing Environments. 2010 IEEE 10th Int. Conf. Computer and Information Technology (CIT), Bradford, UK, pp. 1328–1334. IEEE. 53 Cicotti , G. , Coppolino , L. , D’Antonio , S. and Romano , L. ( 2015 ) Runtime model checking for SLA compliance monitoring and QoS prediction . J. Wireless Mobile Netw. Ubiquitous Comput. Dependable Appl. (JoWUA) , 6 , 4 – 20 . 54 Albakri , S.H. , Shanmugam , B. , Samy , G.N. , Idris , N.B. and Ahmed , A. ( 2014 ) Security risk assessment framework for cloud computing environments . Secur. Commun. Netw. , 7 , 2114 – 2124 . Google Scholar CrossRef Search ADS 55 Hussain , W. , Hussain , F.K. and Hussain , O. ( 2016 ) QoS Prediction Methods to Avoid SLA Violation in Post-Interaction Time Phase. 11th IEEE Conf. Industrial Electronics and Applications (ICIEA 2016) Hefei, China, pp. 32–37. IEEE. 56 Box , G.E. , Jenkins , G.M. 
and Reinsel , G.C. ( 2011 ) Time Series Analysis: Forecasting and Control , vol. 734 . John Wiley & Sons ., US http://213.55.85.90:8080/bitstream/handle/123456789/8965/Time%20Series%20Analysis.pdf?sequence=1&isAllowed=y. 57 Calheiros , R.N. , Masoumi , E. , Ranjan , R. and Buyya , R. ( 2015 ) Workload prediction using ARIMA model and its impact on cloud applications’ QoS . IEEE Trans. Cloud Comput. , 3 , 449 – 458 . Google Scholar CrossRef Search ADS 58 ur Rehman , Z. , Hussain , O.K. , Hussain , F.K. , Chang , E. and Dillon , T. ( 2015 ) User-side QoS forecasting and management of cloud services . World Wide Web , 18 , 1677 – 1716 . Google Scholar CrossRef Search ADS 59 Mamdani , E.H. and Assilian , S. ( 1975 ) An experiment in linguistic synthesis with a fuzzy logic controller . Int. J. Man. Mach. Stud. , 7 , 1 – 13 . Google Scholar CrossRef Search ADS 60 CloudClimate . Watching the Cloud. http://www.cloudclimate.com (Accessed March 2017). 61 P. N. Monitor . https://prtg.paessler.com (Accessed March 2017). © The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png The Computer Journal Oxford University Press

Publisher: Oxford University Press
ISSN: 0010-4620
eISSN: 1460-2067
DOI: 10.1093/comjnl/bxx118

Features such as the elastic scaling of resources, pay-as-you-go and metered resource usage have also enabled users of such enterprises and businesses to reduce their operational costs [3]. However, while such features are beneficial from the user’s perspective, they create the illusion that businesses have an infinite quantity of resources that can be accessed as required. This may not be true in all cases, especially when the business is a small-to-medium enterprise (SME) cloud service provider. Unlike large-scale cloud service providers, such as Amazon, an SME has a finite quantity of computing resources with which to manage its users’ requests [4]. As shown in the literature, these issues between a service provider and service user are addressed by defining and managing a service level agreement (SLA). An SLA describes all the service level objectives (SLOs) and agreed quality of service (QoS) parameters [5] and shows the commitment and obligations of each party, including the deliverability and penalties to be applied in the case of SLA violation [6]. As is the case in any business activity, the primary aim of a service provider in cloud computing is to fulfill its commitment to the many users with whom it has formed an SLA, to avoid violations. This falls within the broad domain of cloud service management. Recent contributions in this area [7–10] have looked at different methods, such as the automatic extraction of metrics, ontology-based semantic reasoning and linked unified service description language to manage the SLA and avoid service violations. However, these approaches consider the management of a service after a violation has taken place; in other words, they adopt a reactive approach to service management, which may be detrimental to the cloud service provider’s reputation and may negatively impact the likelihood of attracting future business from existing or new cloud service users. 
This can be avoided if service providers proactively manage their services. In this form of service management, service providers constantly monitor the SLOs after the SLA has been formed to ensure that possible violations are averted before they occur. In our previous work [4], we observed that this proactive management after an SLA has been formed but before violation occurs works well for large scale cloud service providers. This is because large providers have abundant resources and can easily obtain additional resources if/when required to avert possible violations as they are detected. However, for an SME cloud service provider that has a finite quantity of computing resources, obtaining such additional resources at the time and in the quantity required after an SLA has been formed may not be possible. For SLA violations to be proactively managed by such cloud service providers, we emphasize that the service management process should start before the formation of the SLA [4], during the SLA negotiation/formation phase (referred to here as the pre-interaction phase) in which the cloud service provider pre-allocates its available resources to users after conducting a vetting process. In the SLA execution phase (referred to here as the post-interaction phase), which includes SLA monitoring, SLA violation prediction and decisions on violation abatement, the SLOs are constantly monitored to ensure that possible violations are averted. From the perspective of SME cloud service providers, therefore, active service management in both the pre-interaction phase and the post-interaction phase will lead to the better administration of the SLA, maximizing the likely commitment of the service provider, reducing the prospect of SLA violation, and achieving maximum financial returns [4, 11, 12]. Our previous work proposed the provider-based optimized personalized viable-SLA (OPV-SLA) framework for service management [13, 14]. 
OPV-SLA is divided into two parts, namely the pre-interaction phase and the post-interaction phase. In the pre-interaction phase, the provider starts the process of SLA management by negotiating and forming a viable SLA, which is then proactively managed in the post-interaction phase. In this paper, we explain the workings of the OPV-SLA post-interaction phase, which we term the Risk Management Framework for SLA violation abatement (RMF-SLA). In this framework, the runtime performance of the SLOs is captured and predicted, and the service provider recommends the appropriate actions to take to proactively mitigate the risk of SLA violation. The rest of the paper is organized as follows. Section 2 describes the related literature on SLA management. Sections 3 and 4 detail the components of the RMF-SLA along with their workings. Section 5 describes the evaluation of RMF-SLA and Section 6 concludes the paper. 2. LITERATURE REVIEW The activities in SLA management can be broadly categorized into two time periods, namely the pre-interaction phase and the post-interaction phase, as mentioned in the Introduction. The activities in the pre-interaction phase are the SLA negotiation and formation, while the activities in the post-interaction phase are QoS prediction for future intervals, runtime QoS monitoring, the comparison between actual and promised QoS parameters and determining the best course of action for SLA management in the event of observed differences [4]. As our focus in this paper is on the post-interaction phase, we present a summary of some of the existing approaches to SLA management and violation abatement in the literature. Wood et al. [15] proposed the Sandpiper approach for SLA monitoring and resource management to detect hotspots that indicate a possibility of violation. To eliminate a hotspot, Sandpiper resizes and shifts the virtual machine or adjusts resources. 
It gathers the usage records of virtual and physical servers and flags a hotspot when resource usage exceeds a defined threshold. The proposed approach manages the runtime workload of the servers. Other approaches such as [16–19] map low-level resource metrics to SLA parameters. This is done by mapping the service status to the predefined threshold and identifying the deviation between the agreed and actual behavior to detect SLA violations using case-based reasoning (CBR) approaches. Although the proposed idea of mapping service resource metrics to SLA parameters helps the service provider to identify potential violations based on current performance measures, it may not guarantee commitment to the requirements of all customers, as the performance measures are not formed and agreed in the pre-interaction phase. These approaches do not describe what needs to be done when the system identifies a likely violation. Some approaches offer a limited set of rules and use a CBR approach, which has its own limitations such as adaptation, processing time and storage, and usually does not produce optimal results [20]. Another work in this category by Falasi et al. [21] presents the Sky framework, which adaptively implements SLAs to manage changes in a federated cloud environment. The framework is capable of managing multilevel SLAs but does not describe the process for handling SLA violation. Also, SLAs are not formed during the negotiation process of the pre-interaction phase in this framework, which may not guarantee the requirement commitment of customers in the post-interaction phase. There are a number of approaches with self-management features which try to manage SLA violation before end users are affected. Brandic et al. [18] proposed a bottom-up hierarchical layered approach for the propagation of SLA violation when a violation threat is found. 
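The threshold-based detection common to Sandpiper [15] and the metric-mapping approaches can be sketched as follows. This is an illustrative sketch rather than code from any of the cited systems; the 85% threshold and the three-interval sustain window are assumed values.

```python
from collections import deque

def make_hotspot_detector(threshold: float, sustain: int):
    """Flag a hotspot when utilization exceeds `threshold` for
    `sustain` consecutive monitoring intervals (illustrative policy)."""
    window = deque(maxlen=sustain)

    def observe(utilization: float) -> bool:
        window.append(utilization)
        # A hotspot requires a full window of sustained breaches,
        # which filters out transient spikes.
        return len(window) == sustain and all(u > threshold for u in window)

    return observe

observe = make_hotspot_detector(threshold=0.85, sustain=3)
readings = [0.70, 0.90, 0.92, 0.95]   # CPU utilization per interval
flags = [observe(u) for u in readings]
# Only the fourth reading completes three consecutive breaches.
```

Requiring several consecutive breaches before flagging is one way to trade detection latency against false alarms; a single-sample check would fire on the second reading already.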
Mosallanejad and Atan [22] proposed a hierarchical self-healing approach in which each layer of the cloud is responsible for managing the problem by itself. If the problem cannot be resolved at that layer, the framework informs the upper layer for possible remedial action. Lu et al. [23] proposed an actor system framework that adopts a parent-child relationship for managing SLA violation. When the actor system detects a possible SLA violation, it first tries to resolve it, and then sends the error information to the upper parent actor if it is unable to do so. A multilayer monitoring approach was proposed by Katsaros et al. [24] that monitors SLAs based on observing time intervals and SLA parameters. The proposed approach has the features of runtime adaptability of resource provision, estimation and decision making. Although the self-management approaches [18, 22, 23] attempt to adjust violations when they are detected by the system, they do not suggest what action to take to avoid the violation occurring. Moreover, these approaches lack the agreement process in the pre-interaction phase of SLA management. Other approaches in the literature use a third-party broker to manage SLAs. Lee et al. [25] proposed a cloud service broker portal that provides a gateway for cloud service providers and users to interact with each other. The portal has a single entry point for a cloud service broker, a cloud service provider and a cloud service user. It interacts with a unique interface designed for each stakeholder, and has a brokerage API that integrates various cloud service providers into the cloud service broker portal. The cloud service brokerage model has five components [26]. These frameworks help cloud users to select a suitable cloud provider that satisfies the functional and non-functional requirements of the SLA; from a provider’s perspective, however, they lack the pre-interaction processing steps and actions to be taken if an SLA violation is detected. 
Other SLA management approaches focus on trusted relationships between the provider and the user. Noor and Sheng [27] proposed an adaptive credibility model that offers trust as a service to the service provider. The proposed approach helps the service provider to differentiate between biased and unbiased feedback. Although this approach is helpful for the service provider, it is only effective for a system that has existing users. It lacks a process for differentiating between possible users and the recommended action (RAc) to be taken on SLA formation and violation. Fan and Perros [28] differentiated between biased and unbiased feedback based on the familiarity and consistency of the feedback. They proposed a trust value range and ranked users based on that value. However, without a bootstrapping mechanism, this method cannot be applied to new users who have only recently subscribed to services. Another category of SLA management approaches uses proactive mechanisms to identify and predict likely SLA violations. QoS parameters are predicted using a user-based collaborative filtering (CF) mechanism, item-based CF mechanism, and stream processing framework [29–31]. Cardellini et al. [32] proposed heuristic policies to predict QoS parameters and determine the resources needed in future intervals using the recursive least squares method; however the process of managing SLA violations when they are predicted by the system is not defined. It can be seen from the above discussion that even though many approaches have been proposed in the literature for cloud SLA management, not all of them guide the service provider on the steps required for SLA violation abatement. In Table 1, we compare SLA management approaches on the three criteria required for SLA abatement, namely the ability to predict possible SLA violations, a description of the process to be followed when the system identifies an SLA violation threat, and the SLA violation abatement recommendation. 
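A user-based collaborative filtering predictor of the kind referenced in [29] can be sketched as below. The cosine-similarity weighting and the sample QoS histories are illustrative assumptions, not the exact formulation used in that work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length QoS vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def predict_qos(target_history, neighbours, missing_idx):
    """Predict the target user's unobserved QoS value as a
    similarity-weighted average of the neighbours' observations."""
    # Similarity is computed only over the intervals the target observed.
    observed = [i for i in range(len(target_history)) if i != missing_idx]
    weights, values = [], []
    for hist in neighbours:
        sim = cosine([target_history[i] for i in observed],
                     [hist[i] for i in observed])
        weights.append(sim)
        values.append(hist[missing_idx])
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total if total else 0.0

# Target user has observed availability for two intervals; the third is unknown.
target = [0.80, 0.90, 0.0]                      # 0.0 is a placeholder
neighbours = [[0.78, 0.88, 0.85], [0.82, 0.92, 0.90]]
pred = predict_qos(target, neighbours, missing_idx=2)
```

With two near-identical neighbours the prediction lands roughly midway between their observed values for the missing interval.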
It is important to mention that most of the existing approaches focus on the post-interaction phase of SLA management, that is, after a user and provider have formed their SLA. As mentioned in Section 1, this is not beneficial for SME cloud service providers, since the careful prior negotiation of SLOs is necessary to maximize the likelihood of a consumer commitment, reduce the possibility of SLA violation, and gain maximum financial returns. To address these drawbacks, we proposed the OPV-SLA management framework, shown in Fig. 1, in our previous work [13, 14, 49]. This framework first assists the user and provider to agree on QoS expectations and then monitors the terms of the agreement for violations.

Table 1. Comparison of SLA management approaches in the post-interaction phase.

Source | Predicts SLA/SLO/QoS | Defines procedure for when a violation threat is detected | SLA violation recommendation
Emeakaroha et al. [16, 17] | ✓ | ✗ | ✗
Brandic et al. [18] | ✓ | ✓ | ✗
Haq et al. [19] | ✓ | ✓ | ✗
Emeakaroha et al. [33] | ✓ | ✓ | ✗
Mosallanejad and Atan [22] | ✗ | ✓ | ✗
Katsaros et al. [24] | ✗ | ✗ | ✗
Al Falasi et al. [21] | ✗ | ✗ | ✗
Chandrasekar et al. [34] | ✓ | ✗ | ✗
Alhamad et al. [35] | ✗ | ✗ | ✗
Wang et al. [36] | ✗ | ✗ | ✗
Hammadi and Hussain [37] | ✗ | ✗ | ✗
Muchahari and Sinha [38] | ✗ | ✗ | ✗
Cicotti et al. [31] | ✓ | ✗ | ✗
Romano et al. [30] | ✓ | ✗ | ✗
Sun et al. [39] | ✓ | ✗ | ✗
Hussain et al. [40] | ✓ | ✓ | ✓
Leitner et al. [41] | ✓ | ✗ | ✗
Ciciani et al. [42] | ✓ | ✗ | ✗
Cardellini et al. [43] | ✓ | ✗ | ✗
Son et al. [44] | ✗ | ✗ | ✗
Silaghi et al. [45] | ✗ | ✗ | ✗
Badidi [46] | ✗ | ✗ | ✗
Pacheco-Sanchez et al. [47] | ✓ | ✗ | ✗
Wood et al. [15] | ✗ | ✓ | ✗
Schmieders et al. [48] | ✓ | ✓ | ✗
Noor and Sheng [27] | ✗ | ✗ | ✗
Fan and Perros [28] | ✗ | ✗ | ✗

Figure 1. OPV-SLA management framework (reproduced from [13]).

The process of forming customized SLAs between providers and users in the pre-interaction phase is explained in our previous work [13]. In this paper, we explain the process for determining and abating SLA violation, which forms part of the post-interaction phase. It is important to mention that various approaches in the literature have used techniques such as QoS prediction [29], workflow detection control model [39], machine learning regression technique [41] and workload analyzer [50] to ascertain the possibility of SLA violation. In our method, we analyze the notion of risk as the criterion for ascertaining possible SLA violation and the subsequent actions to take for its abatement. A related work in this category that uses risk as the criterion for SLA management is Kiran et al. [51], who proposed a risk assessment framework for cloud service provisioning. Their proposed framework assists both SaaS and IaaS providers to identify, evaluate and mitigate risk in service provisioning. The risk assessment between a service provider and an infrastructure provider consists of six steps: the infrastructure provider’s business dealings, the service provider’s business dealings, the potential for service failure under the SLA, the reliability of the services offered under the SLA, the service provider for runtime operation and monitoring of QoS parameters, and lastly, the infrastructure provider for potential infrastructure failure. Zhang et al. 
[52] proposed a risk management framework that analyzes, assesses and mitigates risk to help the service provider to achieve better management of SLAs. Risk assessment in this approach comprises four steps: to define the likelihood of vulnerabilities and associated threats, to determine the magnitude of risk, to find the level of risk, and to take all necessary actions to mitigate risk. Cicotti et al. [53] proposed a model that predicts future QoS based on runtime monitoring data and data from a probabilistic model-checking method. The system generates an alert when it detects a probable QoS violation and helps service providers to stop or minimize possible service violations. Albakri et al. [54] proposed a security risk management framework that allows users to evaluate risk and contribute to the risk assessment process. The framework permits users to define the legal requirements, identify the risk factors, and obtain feedback from a service provider. While there are approaches that consider the notion of risk in SLA management, most of them are unable to guide a service provider in relation to the steps to be taken to determine and address possible SLA violation. In the next section, we define our RMF-SLA, which assists cloud service providers to identify and assess the risk of SLA violation occurring in the post-interaction phase and to manage it by considering a set of decision parameters. 3. RISK MANAGEMENT FRAMEWORK FOR SLA VIOLATION ABATEMENT As shown in Fig. 1, RMF-SLA is a combination of five modules that address the detection and abatement of SLA violation. They are:

Module 1: Threshold Formation Module (TFM)
Module 2: Runtime QoS Monitoring Module (RQoSMM)
Module 3: QoS Prediction Module (QoSPM)
Module 4: Risk Identification Module (RIM)
Module 5: Risk Management Module (RMM)

The workings of each module of RMF-SLA are explained in the following subsections. 3.1. Module 1: TFM This is the first module of the RMF-SLA framework, as shown in Fig. 2. 
It takes the QoS values of the SLOs determined between the cloud provider and the user in the pre-interaction phase and forms two thresholds for determining and managing violations. These two thresholds are the Agreed threshold (Ta) and the Safe threshold (Ts). Agreed threshold (Ta): This threshold value is described in the SLA and is mutually agreed by the user and the provider. When both parties have finalized their SLAs, they agree on certain thresholds for each service level objective (SLO) and QoS parameter. A service provider that does not comply with the agreed QoS parameters commits a service violation and is liable for violation penalties. Safe threshold (Ts): To avoid possible service violation and penalties, we propose that a provider should define a safe threshold (Ts) that is stricter than the agreed threshold (Ta). This is a customized threshold defined by the provider. It raises an alarm of possible SLA violation when a runtime QoS value crosses the threshold and invokes Module 5, the RMM, to take the necessary action to avert the violation. Figure 2. Provider-based RMF-SLA. To explain the importance of Ts and Ta, let us consider a provider and user forming an SLA in the pre-interaction phase who agree on having 80% availability of a resource (memory). This 80% availability of memory is the Ta value agreed by both parties. For service management and SLA violation abatement, the provider defines a customized threshold for the total memory, say 90%, which is the Ts value. When the availability at runtime falls below this 90% threshold, the framework alerts the service provider and activates the RMM to manage the risk of the QoS value of the SLO falling below the Ta value. 3.2. Module 2: RQoSMM This is the second module of RMF-SLA, which is responsible for monitoring the runtime QoS parameters of each SLO in the SLA. 
The captured runtime QoS values are sent to Module 3—QoSPM where the QoS values of the SLOs in the near future are determined. 3.3. Module 3: QoSPM The QoSPM is the third module of RMF-SLA, which predicts users’ resource usage in each SLO. The module uses an optimal prediction algorithm for each SLO to predict the user’s likely resource usage based on his or her usage history. The choice of an optimal prediction algorithm plays a key role in decision making, since the accuracy of the prediction method depends on the choice of dataset. In our previous work [55], we considered the stochastic, neural network and various time series prediction methods and analyzed their prediction accuracy on a dataset from the Amazon EC2 cloud. We observed from the evaluation results that an optimal prediction result is obtained by considering small intervals for prediction and using the autoregressive integrated moving average (ARIMA) method. ARIMA generalizes the autoregressive moving average (ARMA) method and was popularized by the statisticians George Box and Gwilym Jenkins in the 1970s [56] for use with business and economic data. It has been widely used as an optimal prediction method in the cloud service domain. For example, Calheiros et al. [57] developed a cloud virtual machine workload prediction model and observed 91% prediction accuracy when using the ARIMA method. Other researchers, such as Rehman et al. [58], have used the ARIMA method to forecast the QoS values from the user perspective with good accuracy. Hence, for the prediction of QoS we use the ARIMA method in QoSPM. To enhance the accuracy of the prediction result, RQoSMM constantly inputs the value of the SLOs in previous time intervals to QoSPM. For example, RQoSMM captures the QoS values from time interval 1 to time interval 10 (t1–t10) to predict them over time interval 11–14 (t11–t14). 
When the QoS values over the interval t15 to t18 are predicted, RQoSMM gives QoSPM the captured QoS values up to t14, to ensure that an optimal prediction result closely related to the observed data with minimum deviation [55] is achieved. The pseudocode of the workings of QoSPM is as follows:

for (i = start_limit; i <= end_limit; i++)
  if (RQoSMM is empty)
    input[i] = prev_observation[i];
  else
    input[i] = RQoSMM[i] + prev_observation[i];
Pred_output = Prediction_algo(input);

The algorithm starts by ensuring that the runtime data are available. If a transaction has just started and runtime QoS data are not available, QoSPM considers the user’s data from the Identity Management Module (IMM) in the pre-interaction phase. As explained in our previous work [13], this module stores the interaction history of a user. If a user is new and has no previous transaction records, the commitment of the user to the QoS values is ascertained by IMM using top-K nearest neighbors and their transactions. Once these values have been obtained, the QoSPM prediction is not made by taking the relative values of the SLOs, but by taking the percentage value of the level of the SLO commitment to the level of the SLO requested. 
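The sliding-window prediction performed by QoSPM can be sketched in Python as follows. To keep the example self-contained, a short moving average stands in for the ARIMA model the framework actually uses; the function names, window length and sample values are illustrative assumptions.

```python
def forecast(history, horizon, order=3):
    """Placeholder forecaster: repeat the mean of the last `order`
    observations. In RMF-SLA an ARIMA model would fill this role."""
    level = sum(history[-order:]) / order
    return [level] * horizon

def qospm_predict(monitored, prev_observation, horizon=4):
    """Sliding-window step as in the QoSPM pseudocode: if no new
    runtime data has arrived, predict from the previous observations
    alone; otherwise fold the newly monitored intervals from RQoSMM
    into the history, then predict the next `horizon` intervals."""
    history = prev_observation + monitored
    return forecast(history, horizon)

# Memory usage observed over intervals t1..t10; predict t11..t14.
usage = [62, 64, 63, 66, 68, 70, 69, 72, 74, 73]
pred = qospm_predict(monitored=[], prev_observation=usage)
```

On the next cycle, the values captured for t11–t14 would be passed as `monitored`, extending the history before t15–t18 are predicted.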
This is because the level of resources requested by a user in the current SLO interaction may differ from what was requested in the past, so a standardized scale is needed on which to represent these values for fair analysis. To obtain this standardized scale, we take the relative values of the previous transaction, as presented in Equation (1):

R_{pred} = \sum_{i=1}^{m} \left( \frac{R_{used}}{R_{request}} \right)_i \times R_{c_i}    (1)

where R_used is the amount of resources used under the previous SLA, R_request is the amount of resources requested under the previous SLA, i is the time interval (from 1 to m), m is the total number of time intervals, R_{c_i} is the predicted resource for the new SLA at the ith interval and R_pred is the total predicted amount of resources for the currently requested SLA.

The output of QoSPM is forwarded to Module 4—RIM, which invokes Module 5—RMM when the possibility of SLA violation is detected.

3.4.
Module 4: RIM

RIM is responsible for comparing the predicted values from QoSPM with the Ts value formed in Module 1. If the predicted value reaches or exceeds the Ts value, Module 5—RMM is activated to abate the possible SLA violation.

3.5. Module 5: RMM

As mentioned earlier, RMM is invoked when RIM determines that an SLA violation may occur. Once invoked, RMM estimates the severity of the risk of SLA violation and determines how to manage it. As presented in Fig. 3, RMM takes three inputs (the risk attitude of the provider, the reputation of the consumer and the transaction trend) to determine the appropriate action and manage the risk of SLA violation. As shown in Fig. 5, RMM comprises two sub-modules, the risk estimation module (REM) and the risk mitigation module (RMtM).

REM: This sub-module is responsible for estimating the risk of an SLA violation occurring. The notion of risk is subjective, as is the process for managing it, so to determine the severity of the risk from the subjective viewpoint of the provider, REM considers the following three inputs:

Risk attitude (RA) of the provider: The RA of the provider represents its capacity to deal with risk. A provider's RA is risk averse, risk neutral or risk taking. A provider with a risk averse attitude is more reluctant to take a risk (in this case, to allow an SLA violation to occur) than a provider with a risk neutral or risk taking attitude [13].

Reputation of the user for whom the possibility of SLA violation is being determined: Reputation is the reliability or trust value a provider places on a user to uphold the terms of the SLA. The reputation of a user reflects the user's commitment to previously formed SLAs with the provider and is represented as bronze, silver or gold. The process for determining the reputation class is described in our previous work [13].
The reputation of a user is an input to REM because we consider that if a provider values a user highly (silver or gold class), the provider will prefer to take immediate action to minimize the possible risk of SLA violation, in contrast to a similar situation with a user whose reputation is bronze class.

Transaction trend (TT) curve over future time intervals: The third input to REM is the TT curve, which shows a user's use of an SLO over future time intervals. It shows the prevailing use of resources by the user over a period of time (from QoSPM) and how this usage maps against the formed Ta and Ts values. When the TT curve exceeds Ts, it may either move towards Ta, as shown in Fig. 4a, or away from Ta, as shown in Fig. 4b. REM captures the direction of the TT curve to ascertain the risk of SLA violation and estimate the steps required to mitigate the risk.

RMtM: As discussed above, REM estimates the risk of a possible SLA violation occurring by considering the relevant inputs. Subsequently, RMtM recommends an appropriate action to manage and mitigate the risk. A fuzzy inference system is used to perform the computation, with the recommendation being to take immediate action, delayed action or no action. When the risk of violation is assessed as high, RMtM recommends that the service provider take immediate action. In taking this action, the service provider stops accepting new requests and arranges for sufficient resources to be provided in the fastest possible time to avoid service violation. When the risk of violation is estimated as medium or low, RMtM decides whether to recommend delayed action or no action. Here, it is implied that the provider accepts the risk but keeps the situation under observation, with the intention of taking any necessary action within a certain timeframe.

Figure 3. Working of RMM in RMF-SLA.
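The RMtM decision just described can be sketched as a simple mapping. This is an illustrative simplification with hypothetical names; in RMF-SLA the choice between delayed and no action for medium or low risk is made by the fuzzy inference system rather than a hard rule:

```python
def recommend_action(risk_level: str) -> str:
    """Map an estimated SLA-violation risk level to a mitigation action.

    Illustrative thresholds only; the paper's RMtM derives the action
    through a fuzzy inference system rather than a fixed mapping.
    """
    mapping = {
        "high": "immediate action",   # stop new requests, provision resources now
        "medium": "delayed action",   # accept the risk, act within a timeframe
        "low": "no action",           # keep the SLO under observation only
    }
    return mapping[risk_level]

print(recommend_action("high"))  # immediate action
```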
Figure 4. TT curve moving towards and away from Ta.

To summarize, the working of RMF-SLA is as follows and as shown in Fig. 5:

Step 1: After the Ts and Ta thresholds are formed, QoSPM collects data from IMM and RQoSMM.
Step 2: QoSPM predicts the QoS usage values in future time intervals.
Step 3: RIM compares the predicted values from QoSPM with the Ts value.
Step 4: If the predicted value is below Ts, no action is taken and the runtime QoS parameters of the SLO continue to be monitored. However, if the predicted value reaches or exceeds Ts, RMM is activated to manage the risk of SLA violation.
Step 5: REM of RMM estimates the risk of SLA violation by capturing the RA of the provider, the reputation of the user and the TT curve. A fuzzy inference system (FIS) is used to estimate the risk of SLA violation occurring.
Step 6: Depending on the risk estimated by REM, RMtM suggests the appropriate action for the cloud provider to take to mitigate the risk of violation: immediate action, delayed action or no action.

Figure 5. The working of RMF-SLA in the post-interaction phase of OPV-SLA.

The fuzzy inference system for estimating the possible risk of SLA violation and determining the appropriate mitigation action is explained in the next section.

4. FIS FOR DETERMINING POSSIBLE SLA VIOLATIONS AND THEIR ABATEMENT IN RMF-SLA

To assess the possible risk of SLA violation and manage its abatement, we use a Mamdani type FIS [59] to combine the various inputs. Figure 6 represents the inputs and the output used to manage possible SLA violations. The FIS and the membership functions of each of its inputs and outputs are explained in the following subsections.
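Since the sections that follow build on a Mamdani FIS, the core inference step (clip each output set at its rule's firing strength, aggregate with max, defuzzify by centroid) can be sketched in a few lines. This is a generic pure-Python illustration over a discretized universe, not the authors' MATLAB implementation:

```python
def mamdani_defuzzify(firings, out_mfs, xs):
    """Mamdani-style inference over a discretized output universe `xs`:
    each output membership function is clipped at its rule's firing
    strength, the clipped sets are aggregated with max, and the crisp
    output is the centroid of the aggregate."""
    agg = [max(min(f, mf(x)) for f, mf in zip(firings, out_mfs)) for x in xs]
    num = sum(x * m for x, m in zip(xs, agg))
    den = sum(agg)
    return num / den if den else 0.0

# Toy example: two crisp output sets on [0, 1]; only the first rule fires,
# so the centroid falls in the middle of the first set.
xs = [i / 10 for i in range(11)]
low = lambda x: 1.0 if x <= 0.4 else 0.0
high = lambda x: 1.0 if x > 0.4 else 0.0
print(mamdani_defuzzify([1.0, 0.0], [low, high], xs))  # ~0.2
```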
Figure 6. FIS for assessing and managing the risk of SLA violation in RMF-SLA.

4.1. Defining the fuzzy sets and membership function for input—RA of the provider

The RA of the provider defines the provider's propensity to take risk. Depending on its RA, a service provider may be risk averse, risk neutral or risk taking; these are the fuzzy sets over which the RA is represented. A risk averse provider attempts to avoid any risk, whether small or large. A risk neutral provider takes the middle ground; depending on the nature of the risk, it may decide to take action or to ignore the risk. A risk taking provider has a bold attitude, ignoring small risks and taking action only for risks that will have a significant effect. We consider 1–5 as the universe of discourse over which the fuzzy sets of this input are represented. The membership function for this input is shown in Fig. 7, and the membership function for each fuzzy set is as follows:

μ_riskaverse(RA) = (3 − x)/2 if 1 < x ≤ 3; 0 if 3 < x ≤ 5
μ_riskneutral(RA) = (x − 1)/2 if 1 < x ≤ 3; (5 − x)/2 if 3 < x ≤ 5
μ_risktaking(RA) = 0 if 1 < x ≤ 3; (x − 3)/2 if 3 < x ≤ 5

Figure 7. RA of a provider in assessing the possibility of SLA violation occurring.

4.2. Defining the fuzzy sets and membership function for input—reputation of the user

User reputation (R) is the trustworthiness of the user's commitment to the SLAs defined with the service provider in previous transactions. The fuzzy sets over which the reputation value of a user is represented are bronze, silver and gold, and the universe of discourse is from 0 to 100. The membership function for this input is shown in Fig. 8, and the membership function for each fuzzy set is as follows:

μ_Bronze(R) = 1 if 0 < x ≤ 40; (45 − x)/5 if 40 < x ≤ 45; 0 if 45 < x ≤ 100
μ_Silver(R) = 0 if 0 < x ≤ 40; (x − 40)/5 if 40 < x ≤ 45; 1 if 45 < x ≤ 70; (75 − x)/5 if 70 < x ≤ 75; 0 if 75 < x ≤ 100
μ_Gold(R) = 0 if 0 < x ≤ 70; (x − 70)/5 if 70 < x ≤ 75; 1 if 75 < x ≤ 100

Figure 8. Membership function for the reliability of a user.

4.3. Defining the fuzzy sets and membership function for input—TT

TT shows the trajectory of the predicted resource usage in future intervals. The values of the predicted trajectory are obtained from QoSPM. The fuzzy sets used to represent the input TT are towards Ta and away from Ta. The universe of discourse over which the input TT is represented is from 0 to 1. The membership function for this input is shown in Fig. 9, and the membership function for each fuzzy set is as follows:

μ_Away(TT) = (1 − x)/1 if 0 < x ≤ 1
μ_Towards(TT) = x/1 if 0 < x ≤ 1

Figure 9. Membership function for the TT.

4.4. Defining the fuzzy sets and membership function for output—RAc

The output RAc is the appropriate action to be taken to manage the possible risk of violation occurring; the recommended output is Immediate Action, Delayed Action or No Action. These are the fuzzy sets used to represent the output, and the universe of discourse over which they are represented is 0 to 1. The membership function for this output is shown in Fig. 10, and the membership function for each fuzzy set is as follows:

μ_NoAction(RAc) = 1 if 0 < x ≤ 0.01; 0 if 0.01 < x ≤ 1
μ_DelayedAction(RAc) = 0 if 0 < x ≤ 0.01; (x − 0.01)/0.99 if 0.01 < x ≤ 1
μ_ImmediateAction(RAc) = 0 if 0 < x ≤ 0.01; (1 − x)/0.99 if 0.01 < x ≤ 1

Figure 10. Membership function for recommended risk mitigation action.

4.5.
Fuzzy rules for the possible risk of SLA violation occurring and the mitigation action to be taken

The combinations of linguistic variables for the inputs, resulting in a total of eighteen rules, are presented in Table 2. The variables are: the RA of the provider [risk averse (Ra), risk neutral (Rn) or risk taking (Rt)], the reputation of the user [bronze (B), silver (S) or gold (G)] and the TT [towards (T) or away (A)].

Table 2. FIS rules for the assessment and abatement of SLA violation risk.

Rule #  RA  Reputation  TT  Recommended risk mitigation action
1   If Ra and B and T then IA
2   If Ra and B and A then IA
3   If Ra and S and T then IA
4   If Ra and S and A then DA
5   If Ra and G and T then DA
6   If Ra and G and A then DA
7   If Rn and B and T then IA
8   If Rn and B and A then DA
9   If Rn and S and T then IA
10  If Rn and S and A then NA
11  If Rn and G and T then DA
12  If Rn and G and A then NA
13  If Rt and B and T then DA
14  If Rt and B and A then NA
15  If Rt and S and T then DA
16  If Rt and S and A then NA
17  If Rt and G and T then NA
18  If Rt and G and A then NA

In the next section, we demonstrate how the service provider can assess and manage the risk of possible SLA violation with the consumer using the RMF-SLA framework of OPV-SLA.

5. VALIDATION OF RMF-SLA FRAMEWORK FOR MANAGING POSSIBLE SLA VIOLATION ABATEMENT

To demonstrate the applicability of the RMF-SLA framework for service providers in the abatement of possible SLA violations, we use the dataset from Amazon EC2 IaaS cloud services (EC2 US West) collected from CloudClimate [60] through the PRTG monitoring service [61]. This dataset is used for QoS prediction and for managing SLAs and the abatement of possible violation. The prototype was built using Microsoft Visual Studio 2010 to develop the interface, Microsoft SQL Server Management Studio 2008 for the databases and MATLAB to design the FIS application.
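To make the membership functions and rule base concrete, here is a small pure-Python sketch of the reputation fuzzification (Section 4.2) and the Table 2 rule lookup. It is an illustration of the FIS logic, not the MATLAB implementation used in the prototype:

```python
def mu_reputation(x: float) -> dict:
    """Degrees of membership of a reputation value x in [0, 100] in the
    bronze/silver/gold fuzzy sets (trapezoidal shapes as in Section 4.2)."""
    bronze = 1.0 if x <= 40 else (45 - x) / 5 if x <= 45 else 0.0
    silver = (0.0 if x <= 40 else
              (x - 40) / 5 if x <= 45 else
              1.0 if x <= 70 else
              (75 - x) / 5 if x <= 75 else 0.0)
    gold = 0.0 if x <= 70 else (x - 70) / 5 if x <= 75 else 1.0
    return {"bronze": bronze, "silver": silver, "gold": gold}

# Table 2: (risk attitude, reputation class, transaction trend) -> action.
RULES = {
    ("Ra", "B", "T"): "IA", ("Ra", "B", "A"): "IA", ("Ra", "S", "T"): "IA",
    ("Ra", "S", "A"): "DA", ("Ra", "G", "T"): "DA", ("Ra", "G", "A"): "DA",
    ("Rn", "B", "T"): "IA", ("Rn", "B", "A"): "DA", ("Rn", "S", "T"): "IA",
    ("Rn", "S", "A"): "NA", ("Rn", "G", "T"): "DA", ("Rn", "G", "A"): "NA",
    ("Rt", "B", "T"): "DA", ("Rt", "B", "A"): "NA", ("Rt", "S", "T"): "DA",
    ("Rt", "S", "A"): "NA", ("Rt", "G", "T"): "NA", ("Rt", "G", "A"): "NA",
}

# A user with reputation 45 is fully silver; for a risk-neutral provider
# with the trend moving towards Ta, rule 9 fires.
print(mu_reputation(45)["silver"])  # 1.0
print(RULES[("Rn", "S", "T")])      # IA
```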
To implement RMF-SLA, we first need to form an SLA between the user and provider in the OPV-SLA pre-interaction phase. Readers should refer to our previous work [13], in which we explain these computations in detail. The outcome of this phase is a well-formed SLA between the service provider and service user which maximizes the likelihood of the service provider's commitment to the formed SLOs, reduces the potential for SLA violation and achieves the maximum financial return for the available resources. To ensure the successful fulfillment of the SLA, the service provider needs to undertake the following management steps in the post-interaction phase: prediction, monitoring and decision making, which are assisted by RMF-SLA, as explained next. Using the EC2 US West dataset from Amazon EC2 IaaS cloud services, we adopt CPU usage as the SLO to be monitored to proactively pre-determine possible SLA violations. As discussed in Section 4, the first module of RMF-SLA is TFM, which defines the Ts value for the SLO being monitored. This is different from the Ta value, which is decided during the formation of the SLA. The next two stages in RMF-SLA are RQoSMM and QoSPM. QoSPM predicts the QoS values over a future period. A number of prediction methods are available, each of which generates a different output depending on the nature of the dataset being used. RMF-SLA uses ten prediction methods, namely cascade forward backpropagation (CFBP), Elman backpropagation (EBP), generalized regression (GR), nonlinear autoregressive neural network with external input (NARX), simple exponential smoothing (SES), simple moving average (SMA), weighted moving average (WMA), extrapolation (EXP), Holt-Winters double exponential smoothing (HWDES) and the autoregressive integrated moving average method (ARIMA).
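As an illustration of how two of the simpler candidates (SMA and WMA) generate one-step-ahead forecasts that can then be scored against the observations, here is a self-contained Python sketch; the sample values, window size and weights are illustrative, not those of the prototype:

```python
def sma(history, window=3):
    """Simple moving average of the most recent `window` observations."""
    tail = history[-window:]
    return sum(tail) / len(tail)

def wma(history, weights=(1, 2, 3)):
    """Weighted moving average; the newest observation gets the largest weight."""
    tail = history[-len(weights):]
    w = weights[-len(tail):]
    return sum(v * wi for v, wi in zip(tail, w)) / sum(w)

def one_step_forecasts(series, predictor):
    """Forecast each point from the observations preceding it, as QoSPM
    would produce rolling one-step-ahead predictions."""
    return [predictor(series[:i]) for i in range(1, len(series))]

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def mad(actual, predicted):
    """Mean absolute deviation between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

observed = [577, 561, 577, 576, 577, 561, 560, 561]  # CPU usage samples (ms)
for name, method in (("SMA", sma), ("WMA", wma)):
    preds = one_step_forecasts(observed, method)
    print(name, round(rmse(observed[1:], preds), 2), round(mad(observed[1:], preds), 2))
```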
Root mean square error (RMSE) and mean absolute deviation (MAD) are used as the benchmarks to measure prediction accuracy, and the method which gives the least error is used for prediction. We use an example to explain the process. Figure 11 shows the observed QoS of the SLO CPU usage for a period of 1 h on 6 September 2016, from 06:35 AM to 7:30 AM. To test the accuracy of the prediction methods, we take the QoS values for that SLO from a previous time period and use them to predict the QoS values for 06:35 AM to 7:30 AM on 6 September 2016. The neural network-based methods were trained on 1002 data samples from the previous 6 days. The observed and predicted QoS values are shown in Table 3 and Fig. 12. The prediction results are given at 5-min intervals, and all values are measured in milliseconds (ms). The accuracy of each method is measured using RMSE and MAD, and the prediction accuracy of all methods is presented in Table 4 and Fig. 13.

Table 3. Prediction results of 10 methods at 5-min intervals.

Time           6:35     6:40     6:45     6:50     6:55     7:00     7:05     7:10     7:15     7:20     7:25     7:30
Observed data  577      561      577      576      577      561      560      561      577      562      561      577
CFBP           565.7835 569.1289 565.7835 565.9902 565.7835 569.1289 569.3407 569.1289 565.7835 568.9174 569.1289 565.7835
EBP            561.1551 560.3656 561.1551 560.9210 561.1551 560.3656 560.4925 560.3656 561.1551 560.2562 560.3656 561.1551
GR             563.7587 562.5800 568.6284 566.7982 563.7587 560.2106 560.2518 562.5800 563.9038 559.7470 562.5800 563.7587
NARX           579.6209 578.9830 561.8997 575.5141 564.1123 563.7748 561.8997 576.9944 577.0424 575.5141 562.7079 576.8213
SES            577.0000 577.0000 569.0000 573.0000 574.5000 575.7500 568.3750 564.1875 562.5938 569.7969 565.8984 563.4492
SMA            577.0000 569.0000 571.6667 571.3333 576.6667 571.3333 566.0000 560.6667 566.0000 566.6667 566.6667 566.6667
WMA            577.0000 566.3333 571.6667 576.3333 576.6667 566.3333 560.3333 560.6667 571.6667 567.0000 561.3333 571.6667
EXP            577.0000 561.0000 545.0000 593.0000 575.0000 578.0000 545.0000 559.0000 562.0000 593.0000 547.0000 560.0000
HWDES          577.0000 572.2000 572.9680 573.4475 574.2363 570.1543 566.4477 563.7661 566.5231 564.5816 562.7676 566.1918
ARIMA          576.5432 561.0000 577.0000 576.9987 577.0009 561.0887 560.0087 561.0000 577.0000 562.0000 561.0000 577.0594

Table 4. Prediction accuracy of all methods.

Prediction method  RMSE         MAD
CFBP               9.489859312  9.365921429
EBP                10.31080812  7.167585714
GR                 7.873491133  5.757521429
NARX               9.959856064  7.185961724
SES                9.317532303  7.878696987
SMA                6.52346193   5.523809524
WMA                3.743777907  2.761904762
EXP                18.48937919  15.00000000
HWDES              6.216755602  4.999325847
ARIMA              0.461865174  0.249580313

Figure 11. CPU usage data for 1 h [60].

Figure 12. Prediction output of each approach at 5-min intervals.

Figure 13. Prediction accuracy of all methods using RMSE and MAD as a benchmark.

We evaluate CPU usage every 5 min for 1 h, starting on 6 September 2016 at 06:35 AM and ending at 7:30 AM. Figure 11 presents the CPU usage for the period. From Table 4, we can see that of all the prediction methods, ARIMA gives the optimal prediction result, with an RMSE value of 0.461865174 and a MAD value of 0.249580313. Extending our example, QoSPM uses the ARIMA method to predict the QoS of the CPU usage for the next hour, from 7:40 AM to 8:35 AM, as shown in Table 5. To determine the possibility of SLA violation and its abatement, we consider that the values for Ts and Ta are 575 ms and 599 ms, respectively, as shown in Fig. 14. Ta is the value of the SLO determined on the formation of the SLA and Ts is the safe threshold defined by the provider.
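With Ts = 575 ms and Ta = 599 ms, the comparison performed by RIM amounts to a simple scan over the predicted series. The sketch below uses the ARIMA predictions reported in Table 5; the function name is ours:

```python
TS, TA = 575, 599  # safe threshold and agreed SLO value, in ms

# ARIMA predictions for 7:40 AM to 8:35 AM at 5-min intervals (Table 5)
predicted = [567, 569, 577, 579, 580, 582, 585, 589, 571, 591, 593, 594]

def first_breach(series, ts):
    """Return the index of the first interval whose predicted QoS reaches
    or exceeds the Ts threshold (the point at which RIM activates RMM),
    or None if the series never breaches Ts."""
    for i, value in enumerate(series):
        if value >= ts:
            return i
    return None

print(first_breach(predicted, TS))  # 2, i.e. the third interval (7:50 AM)
```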
RIM compares the predicted QoS values with these thresholds, and if the Ts value is exceeded, RMM is activated to ascertain and manage the risk of SLA violation.

Table 5. Prediction of the SLO over a period of 1 h using the ARIMA method.

Time   7:40  7:45  7:50  7:55  8:00  8:05  8:10  8:15  8:20  8:25  8:30  8:35
ARIMA  567   569   577   579   580   582   585   589   571   591   593   594

Figure 14. The Ts and Ta values of the predicted QoS over a future period.

From Fig. 14, we see that at the third time interval (7:50 AM) the predicted result exceeds the Ts threshold. At this stage, RMM considers the RA of the provider, the reputation of the user and the projected TT to suggest an appropriate action. In this scenario, we consider that the reputation of the user at the pre-interaction phase is 45 (silver), the RA of the provider is risk neutral and the TT is moving towards the agreed threshold value. These inputs are processed by the FIS rules, and the recommended output is immediate action. This is because the provider is risk neutral and the TT has exceeded Ts and is moving towards the Ta value, so if the provider does not take action, there is a high risk of SLA violation.
The provider needs to take immediate action by arranging the supply of the deficient resources, either from its own capacity or from external sources, to avoid a possible violation. Similarly, at 8:20 AM we see that the predicted QoS value drops back below Ts. In this scenario, the FIS recommends that no action be taken, as no likelihood of SLA violation is determined. From the above example, we see that RMF-SLA suggests the appropriate action to be taken to manage a potential SLA violation according to the RA of the provider, the user's reputation and the TT. The combination of RMF-SLA with the pre-interaction phase module of OPV-SLA assists an SME service provider to first form a viable SLA and then manage the risk associated with possible SLA violations.

6. CONCLUSION

The SLA is the key agreement made between a service provider and a service user in a cloud computing environment. To increase and maintain their reputation, service providers need a viable SLA management framework that helps them to first form viable SLAs and then intelligently predict the occurrence of possible SLA violations before recommending an appropriate action to be taken. Our proposed OPV-SLA management framework helps service providers, particularly SME providers with limited resources, to achieve this. In this paper, we have briefly explained the OPV-SLA framework and focused on its post-interaction phase module, namely RMF-SLA, which is responsible for QoS prediction, detecting the possible occurrence of SLA violations and recommending the best possible decision to avert violation. We have demonstrated the application of RMF-SLA with an example and have shown how the proposed method assists cloud service providers in SLA management. In our future work, we will investigate the hidden patterns between SLOs and low-level metrics to predict likely violations for SLA management.

REFERENCES

1 Weinhardt, C., Anandasivam, D.-I.-W.A., Blau, B., Borissov, D.-I.N., Meinl, D.-M.T.
, Michalk, D.-I.-W.W. et al. (2009) Cloud computing—a classification, business models, and research directions. Bus. Inf. Syst. Eng., 1, 391–399.

2 Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A. et al. (2009) Above the Clouds: A Berkeley View of Cloud Computing. Technical Report No. UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html.

3 Rhoton, J. (2013) Cloud Computing Explained: Implementation Handbook for Enterprises. Recursive Press, London, UK.

4 Hussain, W., Hussain, F.K., Hussain, O.K., Damiani, E. and Chang, E. (2017) Formulating and managing viable SLAs in cloud computing from a small to medium service provider's viewpoint: A state-of-the-art review. Information Systems, 71, 240–259. http://www.sciencedirect.com/science/article/pii/S0306437917302697.

5 Ludwig, H., Keller, A., Dan, A., King, R. and Franck, R. (2003) A service level agreement language for dynamic electronic services. Electron. Commer. Res., 3, 43–59. https://link.springer.com/article/10.1023/A:1021525310424.

6 Hussain, W., Hussain, F.K. and Hussain, O.K. (2014) Maintaining Trust in Cloud Computing through SLA Monitoring. Int. Conf. Neural Information Processing, Kuching, Malaysia, pp. 690–697. Springer International Publishing, Switzerland.

7 Mittal, S., Joshi, K.P., Pearce, C. and Joshi, A. (2016) Automatic Extraction of Metrics from SLAs for Cloud Service Management. 2016 IEEE Int. Conf. Cloud Engineering (IC2E), Berlin, Germany, pp. 139–142. IEEE.

8 Karim, B., Qing, T., Villar, J.R. and de la Cal, E. (2017) Resource Brokerage Ontology for Vendor-independent Cloud Service Management. 2017 IEEE 2nd Int. Conf.
Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, pp. 466–472. IEEE.

9 García, J.M., Fernández, P., Pedrinaci, C., Resinas, M., Cardoso, J. and Ruiz-Cortés, A. (2017) Modeling service level agreements with linked USDL agreement. IEEE Trans. Serv. Comput., 10, 52–65.

10 Jaramillo, G.E., Ardagna, C.A. and Anisetti, M. (2015) A Hybrid Representation Model for Service Contracts. 2015 Int. Conf. Information and Communication Technology Research (ICTRC), Abu Dhabi, United Arab Emirates, pp. 246–249. IEEE.

11 Messina, F., Pappalardo, G., Santoro, C., Rosaci, D. and Sarné, G.M. (2016) A multi-agent protocol for service level agreement negotiation in cloud federations. Int. J. Grid Utility Comput., 7, 101–112.

12 Feng, G. and Buyya, R. (2016) Maximum revenue-oriented resource allocation in cloud. Int. J. Grid Utility Comput., 7, 12–21.

13 Hussain, W., Hussain, F., Hussain, O. and Chang, E. (2016) Provider-based optimized personalized viable SLA (OPV-SLA) framework to prevent SLA violation. Comput. J., 59, 1760–1783.

14 Hussain, W., Hussain, F.K. and Hussain, O.K. (2016) SLA Management Framework to Avoid Violation in Cloud. Int. Conf. Neural Information Processing, Kyoto, Japan, pp. 309–316. Springer.

15 Wood, T., Shenoy, P., Venkataramani, A. and Yousif, M. (2009) Sandpiper: black-box and gray-box resource management for virtual machines. Comput. Netw., 53, 2923–2938.

16 Emeakaroha, V.C., Brandic, I., Maurer, M. and Dustdar, S. (2010) Low Level Metrics to High Level SLAs—LoM2HiS Framework: Bridging the Gap between Monitored Metrics and SLA Parameters in Cloud Environments. 2010 Int. Conf. High Performance Computing and Simulation (HPCS), pp. 48–54. IEEE.

17 Emeakaroha, V.C., Netto, M.A.
, Calheiros , R.N. , Brandic , I. , Buyya , R. and De Rose , C.A. ( 2012 ) Towards autonomic detection of SLA violations in cloud infrastructures . Future Generation Comput. Syst. , 28 , 1017 – 1029 . Google Scholar CrossRef Search ADS 18 Brandic , I. , Emeakaroha , V.C. , Maurer , M. , Dustdar , S. , Acs , S. , Kertesz , A. et al. . ( 2010 ) Laysi: A Layered Approach for sla-violation Propagation in Self-manageable Cloud Infrastructures. 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops (COMPSACW), Seoul, South Korea, pp. 365–370. IEEE. 19 Haq , I.U. , Brandic , I. and Schikuta , E. ( 2010 ) Sla Validation in Layered Cloud Infrastructures . Int. Conf. Economics of Grids, Clouds, Systems, and Services, Ischia, Italy, pp. 153–164. ACM. 20 Cheetham , W. , Varma , A. and Goebel , K. ( 2001 ) Case-Based Reasoning at General Electric. Proc. Fourteenth Int. Florida Artificial Intelligence Research Society Conference, Florida, USA, pp. 93–97. 21 Al Falasi , A. , Serhani , M.A. and Dssouli , R. ( 2013 ) A Model for Multi-levels SLA Monitoring in Federated Cloud Environment. 2013 IEEE 10th Int. Conf. Ubiquitous Intelligence & Computing and 2013 IEEE 10th Int. Conf. Autonomic & Trusted Computing (UIC/ATC), Vietri sul Mere, Italy, pp. 363–370. 22 Mosallanejad , A. and Atan , R. ( 2013 ) HA-SLA: a hierarchical autonomic SLA model for SLA monitoring in cloud computing . J. Softw. Eng. Appl. , 6 , 114 . Google Scholar CrossRef Search ADS 23 Lu , K. , Yahyapour , R. , Wieder , P. , Yaqub , E. , Abdullah , M. , Schloer , B. et al. ( 2015 ) Fault-tolerant service level agreement lifecycle management in clouds using actor system . Future Generation Comput. Syst. , 54 , 247 – 259 . Google Scholar CrossRef Search ADS 24 Katsaros , G. , Kousiouris , G. , Gogouvitis , S.V. , Kyriazis , D. , Menychtas , A. and Varvarigou , T. ( 2012 ) A self-adaptive hierarchical monitoring mechanism for Clouds . J. Syst. Softw. , 85 , 1029 – 1041 . 
Google Scholar CrossRef Search ADS 25 Lee , J. , Kim , J. , Kang , D.-J. , Kim , N. and Jung , S. ( 2014 ) Cloud Service Broker Portal: Main Entry Point for Multi-cloud Service Providers and Consumers. 2014 16th Int. Conf. Advanced Communication Technology (ICACT), Pyeongchang, South Korea, pp. 1108–1112. IEEE. 26 Jrad , F. , Tao , J. and Streit , A. ( 2012 ) SLA Based Service Brokering in Intercloud Environments. CLOSER, pp. 76–81. https://s3.amazonaws.com/academia.edu.documents/42086411/SLA_based_Service_Brokering_in_Interclou20160204-30232-ruvxfi.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1512550771&Signature=qL6vwB7ws31nCBxcZ0wHuKejzx4%3D&response-content-disposition=inline%3B%20filename%3DSLA_based_Service_Brokering_in_Interclou.pdf. 27 Noor , T.H. and Sheng , Q.Z. ( 2011 ) Trust as a Service: A Framework for Trust Management in Cloud Environments. Web Information System Engineering–WISE 2011, Sydney, Australia, pp. 314–321. Springer. 28 Fan , W. and Perros , H. ( 2013 ) A Reliability-Based Trust Management Mechanism for Cloud Services. 2013 12th IEEE Int. Conf. Trust, Security and Privacy in Computing and Communications (TrustCom), Melbourne, VIC, Australia, pp. 1581–1586. IEEE. 29 Zhang , Y. , Zheng , Z. and Lyu , M.R. ( 2011 ) Exploring Latent Features for Memory-Based QoS Prediction in Cloud Computing. 2011 30th IEEE Symposium on Reliable Distributed Systems (SRDS), Madrid, Spain, pp. 1–10. IEEE. 30 Romano , L. , De Mari , D. , Jerzak , Z. and Fetzer , C. ( 2011 ) A Novel Approach to QoS Monitoring in the Cloud. 2011 First Int. Conf. Data Compression, Communications and Processing (CCP), Palinuro, Italy, pp. 45–51. IEEE. 31 Cicotti , G. , Coppolino , L. , D’Antonio , S. and Romano , L. ( 2015 ) How to monitor QoS in cloud infrastructures: the QoSMONaaS approach . Int. J. Comput. Sci. Eng. , 1 , 29 – 45 . 32 Cardellini , V. , Casalicchio , E. , Lo Presti , F. and Silvestri , L. 
( 2011 ) Sla-Aware Resource Management for Application Service Providers in the Cloud. 2011 First International Symposium on Network Cloud Computing and Applications (NCCA), Toulouse, France, pp. 20–27. IEEE. 33 Emeakaroha , V.C. , Ferreto , T.C. , Netto , M.A. , Brandic , I. and De Rose , C.A. ( 2012 ) Casvid: Application Level Monitoring for SLA Violation Detection in Clouds. 2012 IEEE 36th Annual Computer Software and Applications Conference (COMPSAC), Izmir, Turkey, pp. 499–508. IEEE. 34 Chandrasekar , A. , Chandrasekar , K. , Mahadevan , M. and Varalakshmi , P. ( 2012 ) QoS Monitoring and Dynamic Trust Establishment in the Cloud. Advances in Grid and Pervasive Computing, Springer, pp. 289–301. Springer. 35 Alhamad , M. , Dillon , T. and Chang , E. ( 2010 ) SLA-Based Trust Model for Cloud Computing. 2010 13th Int. Conf. Network-Based Information Systems (NBiS), Takayama, Japan, pp. 321–324. IEEE. 36 Wang , M. , Wu , X. , Zhang , W. , Ding , F. , Zhou , J. and Pei , G. ( 2011 ) A Conceptual Platform of SLA in Cloud Computing. 2011 IEEE Ninth Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Sydney, Australia, pp. 1131–1135. IEEE. 37 Hammadi , A.M. and Hussain , O. ( 2012 ) A Framework for SLA Assurance in Cloud Computing. 2012 26th Int. Conf. Advanced Information Networking and Applications Workshops (WAINA), Fukuoka, Japan, pp. 393–398. IEEE. 38 Muchahari , M.K. and Sinha , S.K. ( 2012 ) A New Trust Management Architecture for Cloud Computing Environment. 2012 International Symposium on Cloud and Services Computing (ISCOS), Mangalore, India, pp. 136–140. 39 Sun , Y. , Tan , W. , Li , L. , Lu , G. and Tang , A. ( 2013 ) SLA Detective Control Model for Workflow Composition of Cloud Services. 2013 IEEE 17th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD), Whistler, BC, Canada, pp. 165–171. 40 Hussain , O.K. , Hussain , F.K. , Singh , J. , Janjua , N.K. and Chang , E. 
( 2014 ) A user-based early warning service management framework in cloud computing . Comput. J. , 58 , 472 – 496 . Google Scholar CrossRef Search ADS 41 Leitner , P. , Wetzstein , B. , Rosenberg , F. , Michlmayr , A. , Dustdar , S. and Leymann , F. ( 2010 ) Runtime Prediction of Service Level Agreement Violations for Composite Services. Service-Oriented Computing. ICSOC/ServiceWave 2009 Workshops, Stockholm, Sweden, pp. 176–186. Springer. 42 Ciciani , B. , Didona , D. , Di Sanzo , P. , Palmieri , R. , Peluso , S. , Quaglia , F. et al. . ( 2012 ) Automated Workload Characterization in Cloud-based Transactional Data Grids. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pp. 1525–1533. IEEE. 43 Cardellini , V. , Casalicchio , E. , Lo Presti , F. and Silvestri , L. ( 2011 ) SLA-Aware Resource Management for Application Service Providers in the Cloud. 2011 First Int. Symposium on Network Cloud Computing and Applications (NCCA), pp. 20–27. IEEE. 44 Son , S. , Kang , D.-J. , Huh , S.P. , Kim , W.-Y. and Choi , W. ( 2016 ) Adaptive trade-off strategy for bargaining-based multi-objective SLA establishment under varying cloud workload . J. Supercomput. , 72 , 1597 – 1622 . Google Scholar CrossRef Search ADS 45 Silaghi , G.C. , ŞErban , L.D. and Litan , C.M. ( 2012 ) A time-constrained SLA negotiation strategy in competitive computational grids . Future Generation Comput. Syst. , 28 , 1303 – 1315 . Google Scholar CrossRef Search ADS 46 Badidi , E. ( 2013 ) A Cloud Service Broker for SLA-Based SAAS Provisioning. 2013 Int. Conf. Information Society (i-Society), Toronto, Canada, pp. 61–66. IEEE. 47 Pacheco-Sanchez , S. , Casale , G. , Scotney , B. , McClean , S. , Parr , G. and Dawson , S. ( 2011 ) Markovian Workload Characterization for QOS Prediction in the Cloud. 2011 IEEE Int. Conf. Cloud Computing (CLOUD), Washington, DC, USA, pp. 147–154. IEEE. 48 Schmieders , E. , Micsik , A. , Oriol , M. , Mahbub , K. 
and Kazhamiakin , R. ( 2011 ) Combining SLA Prediction and Cross Layer Adaptation for Preventing SLA Violations. http://eprints.sztaki.hu/6563/1/2ndwoss_Micsik.pdf (Accessed March 2017). 49 Hussain , W. , Hussain , F.K. and Hussain , O. ( 2016 ) Allocating Optimized Resources in the Cloud by a Viable SLA Model. 2016 IEEE Int. Conf. Fuzzy Systems (FUZZ-IEEE), Vancouver, Canada, pp. 1282–1287. IEEE. 50 Ciciani , B. , Didona , D. , Di Sanzo , P. , Palmieri , R. , Peluso , S. , Quaglia , F. et al. . ( 2012 ) Automated Workload Characterization in Cloud-Based Transactional Data Grids. 2012 IEEE 26th Int. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, pp. 1525–1533. IEEE. 51 Kiran , M. , Jiang , M. , Armstrong , D.J. and Djemame , K. ( 2011 ) Towards a Service Lifecycle Based Methodology for Risk Assessment in Cloud Computing. 2011 IEEE Ninth Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Sydney, NSW, Australia, pp. 449–456. IEEE. 52 Zhang , X. , Wuwong , N. , Li , H. and Zhang , X. ( 2010 ) Information Security Risk Management Framework for the Cloud Computing Environments. 2010 IEEE 10th Int. Conf. Computer and Information Technology (CIT), Bradford, UK, pp. 1328–1334. IEEE. 53 Cicotti , G. , Coppolino , L. , D’Antonio , S. and Romano , L. ( 2015 ) Runtime model checking for SLA compliance monitoring and QoS prediction . J. Wireless Mobile Netw. Ubiquitous Comput. Dependable Appl. (JoWUA) , 6 , 4 – 20 . 54 Albakri , S.H. , Shanmugam , B. , Samy , G.N. , Idris , N.B. and Ahmed , A. ( 2014 ) Security risk assessment framework for cloud computing environments . Secur. Commun. Netw. , 7 , 2114 – 2124 . Google Scholar CrossRef Search ADS 55 Hussain , W. , Hussain , F.K. and Hussain , O. ( 2016 ) QoS Prediction Methods to Avoid SLA Violation in Post-Interaction Time Phase. 11th IEEE Conf. Industrial Electronics and Applications (ICIEA 2016) Hefei, China, pp. 32–37. IEEE. 56 Box , G.E. , Jenkins , G.M. 
and Reinsel , G.C. ( 2011 ) Time Series Analysis: Forecasting and Control , vol. 734 . John Wiley & Sons ., US http://213.55.85.90:8080/bitstream/handle/123456789/8965/Time%20Series%20Analysis.pdf?sequence=1&isAllowed=y. 57 Calheiros , R.N. , Masoumi , E. , Ranjan , R. and Buyya , R. ( 2015 ) Workload prediction using ARIMA model and its impact on cloud applications’ QoS . IEEE Trans. Cloud Comput. , 3 , 449 – 458 . Google Scholar CrossRef Search ADS 58 ur Rehman , Z. , Hussain , O.K. , Hussain , F.K. , Chang , E. and Dillon , T. ( 2015 ) User-side QoS forecasting and management of cloud services . World Wide Web , 18 , 1677 – 1716 . Google Scholar CrossRef Search ADS 59 Mamdani , E.H. and Assilian , S. ( 1975 ) An experiment in linguistic synthesis with a fuzzy logic controller . Int. J. Man. Mach. Stud. , 7 , 1 – 13 . Google Scholar CrossRef Search ADS 60 CloudClimate . Watching the Cloud. http://www.cloudclimate.com (Accessed March 2017). 61 P. N. Monitor . https://prtg.paessler.com (Accessed March 2017). © The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

The Computer Journal, Oxford University Press. Published: 1 September 2018.
