Survivability Modeling and Analysis of Cloud Service in Distributed Data Centers

ABSTRACT

Analyzing the survivability of a cloud service is critical, as the migration of applications and services from local systems to the cloud is an irresistible trend. However, former research on cloud service or virtualized system (VS) availability and/or reliability was carried out only from the steady-state perspective. This paper analyzes the survivability of a cloud service after a service breakdown by presenting a continuous-time Markov chain model and its closed-form solutions. The service breakdown may be caused by virtual machine (VM) and/or VM monitor (VMM) bugs, software rejuvenation, host failures or NAS (Network Attached Storage) failures. To improve cloud service survivability, the VS applies two techniques: VM failover and VM live-migration. Through the proposed model and the survivability metrics defined in this paper, we can quantitatively assess system survivability while providing insights into the investment efforts in system recovery strategies. To study the impact of key parameters on system survivability, the paper also provides a parameter sensitivity analysis through numerical experiments.

INTRODUCTION

System virtualization (SV) technology has been widely used for academic and industrial purposes. In SV-based virtualized systems [1], a virtual machine monitor (VMM) is a layer of software between one or more operating systems and the hardware. It emulates the hardware of a physical machine and is thus a key component of the virtualized system; it often becomes a single point of failure. A virtual machine (VM) runs on top of the VMM to emulate a particular computer system, and the cloud service is deployed on a VM. Like traditional software, VMs and VMMs are subject to software aging [2, 3], bugs, crashes and so on. The underlying host and the storage system may also suffer software or hardware problems that lead to failures. Most virtualized systems are deployed at a Cloud Service Provider (CSP)'s data center, and the entire data center infrastructure may be destroyed in extreme cases such as an earthquake or hurricane. All these problems not only degrade performance but also reduce cloud service availability and increase service downtime. Software rejuvenation [3], failover and live VM-migration [4] are the common high-availability (HA) techniques adopted in a data center. Because of the tremendous growth of cloud services, the need for performability analysis [5] of a cloud service with these HA techniques increases. A long cloud service downtime can cause productivity loss and even business loss; for example, the Amazon AWS S3 outage on 28 February 2017 led to large-scale application and service collapse [6].

Steady-state performance and dependability of the virtualized environment have been widely studied. Survivability, a transient measure, is generally defined as the ability of a system to recover a pre-defined service in a timely manner after the occurrence of a disaster [7, 8], where a disaster may be any kind of undesired event affecting the system of interest. However, there is little work on survivability analysis, which could help improve a system's capability to provide critical services when system damage occurs.
This paper is an extension of our previous work originally reported at the 41st IEEE Conference on Local Computer Networks (LCN) [9]. We improve on the previous article in the following respects. In the previous work, we ignored the possibility that the physical machine (Host) and the Network Attached Storage (NAS) may suffer software faults or hardware failures, which could also lead to a service breakdown. To simplify the model, we also did not consider the failure of VM failover and VM live-migration; in a practical environment, these processes may themselves fail due to software faults or network failures. Finally, the previous work considered only a single data center, whereas this paper also considers a backup server and a backup data center. These extensions bring our model much closer to practical systems.

This paper considers cloud service breakdowns which may be caused by VM rejuvenation, VMM rejuvenation, VM bugs, VMM bugs, host failures or NAS failures. We quantify survivability as the transient performance of the cloud from the instant when the service breakdown occurs until the application services deployed in the data center recover. By service recovery, we mean that the application deployed in the data center can provide service either in its original place or after moving to another virtual machine.

The cloud service system considered in this paper consists of three main components: two data centers (DCs), namely DC1 and DC2, and a Backup Server, as shown in Fig. 1. DC1 contains more than one host: Host1, which acts as the main host, and Host2. Host3 is deployed at DC2 and, like Host2, acts as a backup host. Host1 contains an active VM (VM1) and a standby VM (VM2). Two further standby VMs (VM3, VM4) reside on Host2 and Host3, respectively. The cloud service, denoted as the application (App) in the following, is deployed in the active VM. The virtualized system applies two techniques to improve service survivability: VM failover and live VM-migration. When the active VM fails, a standby VM on the same host takes over, or the VM is migrated to another host to continue the cloud service. We describe the system architecture in more detail in section System Description and Model. The unique features of system virtualization highlight the challenges of addressing the survivability of a virtualized system.

Figure 1. The system architecture.

This paper aims to assess the survivability of the cloud service and the virtualized system after a service breakdown. We employ a continuous-time Markov chain (CTMC) to build a survivability model of the cloud service. Survivability metrics of interest are then defined, and closed-form solutions to these metrics are derived in order to quantify the attributes of the cloud service during virtualized system recovery. The attributes include (i) the downtime of the cloud service by time t; (ii) the probability of cloud service recovery by time t; and (iii) the mean time to recover the cloud service. Sensitivity analyses are also carried out through numerical experiments. As far as we know, no prior work on survivability analysis and evaluation of such a virtualized system exists. The proposed model and metrics can be used to analyze not only the survivability of the cloud service but also the survivability of the whole data center.
Therefore, compared with existing transient analyses of virtualized systems, our work provides more insights for system designers and operators deciding which survivability techniques to deploy and how to optimize their benefits. Note that although the distributions of all relevant event times are assumed to be exponential, a number of techniques are available to relax this assumption if needed [10]. We leave the relaxation of this assumption for future work. The main contributions of this paper are summarized as follows: (i) we present a CTMC-based survivability modeling method and analysis of distributed data centers; (ii) our model takes into account the full set of cloud service breakdown causes described above and considers both a backup server and a backup data center; (iii) the closed-form solutions of our model allow us not only to assess the survivability of the cloud service but also to carry out parameter sensitivity analysis.

The paper is organized as follows. Section Background and Related Work describes background knowledge and related work. Section System Description and Model provides the system model and gives the closed-form solutions. Section Numerical Analysis and Discussion presents the evaluation results. The conclusion and future work are given in section Conclusion and Future Work.

BACKGROUND AND RELATED WORK

This section first presents HA techniques and then the stochastic analysis of virtualized systems. Software failures may be caused by inherent software design defects or by improper usage. For long-running VMMs and VMs, one of the major causes of software failures is software aging. Aging not only increases the failure rate and degrades cloud service performance but can also lead to a system crash [2, 3]. Software rejuvenation [3] is a software fault-tolerance technique to counteract software aging: it gracefully stops the execution of an application/system and periodically restarts it from a clean internal state in a proactive manner. Failover and live VM-migration are also common techniques for achieving VM high availability in real virtualized systems, such as Amazon EC2. Failover is a backup operational mechanism in which secondary system components perform the functions of a system component (such as a processor, a server, a network or a database) when the primary component cannot work. In a virtualized system, it is achieved by creating an active VM and several standby VMs. When the active VM suffers a failure or is about to be rejuvenated, one standby VM can take over the role of the active VM and continue the task execution. The live VM-migration mechanism can move a running VM or application between different physical machines: the memory, storage and network connectivity information of the original VM is transferred from the original host to the destination host.

Recently, studies have applied analytic modeling to cloud service availability analysis. Most of them focus on steady-state analysis; see [11] and the references therein. Although survivability analysis has been conducted in other fields [12, 13], few researchers have analyzed the survivability of the virtualized environment. In [14], the authors analyzed the survivability of three VM-based architectures through simulations. Their survivability definition differs from the one used in this paper: they defined it as the probability that the system delivers the pre-defined service under attack, which is in fact a steady-state metric.
Although they carried out transient analysis, their purpose was to check how quickly the steady state can be reached. In addition, they analyzed only the viewpoint of the cloud service, whereas our analysis covers both the cloud service and the recovery of the whole virtualized system. Therefore, our model and metrics can provide more guidance for system designers and administrators managing a virtualized system. In [15], the authors proposed a discrete-time Markov chain-based stochastic model to analyze the survivability of a system in the presence of intrusions; their survivability is also a steady-state attribute. In [16], the authors considered two data centers, one of which acted as backup, and presented stochastic Petri net models to evaluate survivability metrics in IaaS systems. The key difference from our work is that our model covers not only cloud service survivability but also the recovery of the whole virtualized system, which they did not consider; in addition, we define more survivability metrics and give closed-form solutions. In [17], the authors proposed a survivability modeling approach to assess the impact of failures that cause system unavailability and used software rejuvenation to proactively recover from soft failures; the proposed model can be used to compute the mean time to repair of such systems. The authors in [18] used a semi-Markov model to analyze the availability of an IaaS cloud system, studying the impact of sudden and hidden failures of physical machines on the availability of the whole cloud system. In [19], the authors adopted a formal hybrid modeling method to evaluate the performance, availability and survivability of cloud computing systems; their work combines stochastic Petri nets, Markov chains, reliability block diagrams and other high-level models to estimate metrics related to system disaster recovery. The authors in [20] presented a study on quantifying the availability of an IaaS cloud system using Petri nets and Markov chains, giving closed-form equations to assess the system and reduce solution time. The authors also gave a quantitative performability analysis of a cloud system in [21], considering the effects of workload variation, failure rate and system QoS. In [22], the authors applied four different sensitivity analysis techniques to identify the parameters that affect the availability of a mobile cloud system most; their results show that reducing certain system parameters could improve the overall availability. In [23], the authors proposed a stochastic reward net model for availability evaluation of a cloud computing system, with a calibration phase to improve the accuracy of the results; this work also focused on the evaluation of failure recovery. The authors in [24] introduced a failure mitigation technique for distributed systems, proposing a so-called cloud standby method for disaster recovery in which a standby system replaces the failed system whenever a disaster occurs. Despite the quality of these works, most of them perform only steady-state evaluation and do not consider distributed cloud data centers.

SYSTEM DESCRIPTION AND MODEL

This section introduces the virtualized system and then presents the CTMC model.
The system architecture

Two data centers (DC1 and DC2) and a backup server are the three main components of the virtualized system considered in this paper; see Fig. 1. The two DCs are located at different places, generally far away from each other. The parameters used in this paper are defined in Table 1.

Table 1. Parameter definition.

Notation | Definition | Default value
q1 | Probability that service breakdown is due to the active VM rejuvenation or its aging-related failure | 0.5999
q2 | Probability that service breakdown is caused by a VM crash | 0.1
q3 | Probability that service breakdown is due to the active VMM rejuvenation or its aging-related failure | 0.1
q4 | Probability that service breakdown is caused by a VMM crash | 0.1
q5 | Probability that service breakdown is caused by a host failure | 0.1
q6 | Probability that service breakdown is caused by a NAS failure | 0.0001
γ | Decision time | 0
α | 1/α is the mean VM-failover time | 3.0 s
β | 1/β is the mean VM live-migration time | 5.0 s
σ | 1/σ is the mean VM image transfer time from the backup server to DC2 | 20 s
a | 1/a is the mean VM rejuvenation time | 1 min
b | 1/b is the mean VM repair time | 30 min
c | 1/c is the mean VMM rejuvenation time | 2 min
d | 1/d is the mean VMM repair time | 1 h
e | 1/e is the mean host repair time | 3 days
f | 1/f is the mean NAS repair time | 3 days
g | 1/g is the mean NAS restart time | 5 min
h | 1/h is the mean host restart time | 2 min
i | 1/i is the mean VMM restart time | 1 min
j | 1/j is the mean VM restart time | 30 s
k | 1/k is the mean App restart time | 1 s
p1 | Probability that VM-failover succeeds | 0.9
p2 | Probability that VM live-migration succeeds | 0.95
p3 | Probability that the VM reload at DC2 succeeds | 0.99
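For readers who want to experiment with the model numerically, the Table 1 defaults can be collected in code and converted to rates per second. The following Python sketch is purely illustrative; the variable names (`rates`, `q`, `p1`–`p3`) are our own and are not part of the original paper.

```python
# Default model parameters from Table 1, expressed as rates per second.
# Times given in minutes/hours/days are converted; names are illustrative only.

MIN, HOUR, DAY = 60.0, 3600.0, 86400.0

# Initial-state probabilities q1..q6 (cause of the service breakdown).
q = [0.5999, 0.1, 0.1, 0.1, 0.1, 0.0001]

# Success probabilities of the HA mechanisms:
# VM-failover, VM live-migration, VM reload at DC2.
p1, p2, p3 = 0.9, 0.95, 0.99

rates = {
    "alpha": 1 / 3.0,       # VM-failover               (mean 3 s)
    "beta":  1 / 5.0,       # VM live-migration         (mean 5 s)
    "sigma": 1 / 20.0,      # VM image transfer to DC2  (mean 20 s)
    "a": 1 / (1 * MIN),     # VM rejuvenation           (mean 1 min)
    "b": 1 / (30 * MIN),    # VM repair                 (mean 30 min)
    "c": 1 / (2 * MIN),     # VMM rejuvenation          (mean 2 min)
    "d": 1 / (1 * HOUR),    # VMM repair                (mean 1 h)
    "e": 1 / (3 * DAY),     # Host repair               (mean 3 days)
    "f": 1 / (3 * DAY),     # NAS repair                (mean 3 days)
    "g": 1 / (5 * MIN),     # NAS restart               (mean 5 min)
    "h": 1 / (2 * MIN),     # Host restart              (mean 2 min)
    "i": 1 / (1 * MIN),     # VMM restart               (mean 1 min)
    "j": 1 / 30.0,          # VM restart                (mean 30 s)
    "k": 1 / 1.0,           # App restart               (mean 1 s)
}
```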
DC1, as the main data center, contains two hosts (Host1 and Host2) and one Network Attached Storage (NAS1). The two hosts have the same system architecture and are both connected to NAS1, which stores the VM images. Host1, the main host, contains a VMM that runs an active VM (VM1) with the desired application and one standby VM (VM2) running the same application as VM1 for the purpose of the VM-failover mechanism. We use a dotted line to represent Host2 as a backup host with a VM vacancy (VM3), i.e. capacity for running a VM migrated from the main host (Host1). It is a spare host that can take on the main-host role when a VM migrates to it successfully. Finally, there is a management host in DC1, responsible for controlling the entire cloud environment and for detecting VMM failures by means of a specific cloud management tool. It is not shown in Fig. 1 as it has no effect on our model. All VM images and VMM code files are stored in NAS1. In addition, the running state of the active VM is stored in NAS1 periodically. These data can then be used for failover, VM restart, VMM restart and live VM-migration. An agent is installed in each VM to perform the VM rejuvenation operation. Similarly, an agent is deployed in the VMM to monitor all the VMs running on it, detect their abnormal behaviors and trigger the rejuvenation of a VM when it detects anomalies.
DC2 is geographically far from DC1 and is used for disaster tolerance. As shown in Fig. 1, the infrastructure of DC2 is basically the same as that of DC1. We use a dashed box to represent VM4 in Host3, which is also a VM vacancy used for running a VM migrated from DC1 or loaded from the backup server. Note that in a real-world environment a data center contains many hosts and VMs; since only one application or service is considered in our model, only three hosts and four VMs are shown in Fig. 1. Both DC1 and DC2 are connected to the backup server, which supports disaster tolerance by storing backup copies of the VM images from the data centers. Once one data center suffers a disaster, the VM images are immediately transferred to the other data center, and the application can be restarted after the backup VM images are reloaded.

There are dependencies among system components: a change in one component's state triggers state changes in the components that depend on it. We write A ≻ B to denote that A depends on B. In our virtualized system, the component dependency is App ≻ VM ≻ VMM ≻ Host ≻ Storage. We assume that no other service breakdown occurs before the cloud service has recovered from the current one. A non-aging Mandelbug-related failure and an aging-related failure can happen to all VMs (both primary and standby) and to the VMMs of both the main and backup hosts. When such a failure happens to the active VM (VM1), the application deployed on it breaks down. Meanwhile, the running state of the active VM stored in the NAS is sent to the standby VM (VM2). The VM-failover mechanism is then triggered, and the standby VM2 is selected to take charge of the application and act as the active VM during the repair period. We assume that at least one VM on a VMM can be used as a standby VM at any time; that is, the failed VM is repaired or rejuvenated in time and becomes the new standby VM.

When the VMM in Host1 fails due to VMM rejuvenation, aging-related bugs or non-aging Mandelbugs, the VM live-migration mechanism is triggered as soon as the application service goes down. The VM images are transferred to Host2 and loaded as VM3. The VM live-migration mechanism is also used when Host1 itself fails due to software bugs or hardware problems. In the extreme case where DC1 suffers a disaster or NAS1 suffers software or hardware problems leading to a NAS1 failure or breakdown, Host3 plays its role: the backup VM image is transferred from the backup server, loaded as VM4, and the application service is restarted. Note that every HA mechanism, such as migration or failover, may itself fail. Once the HA mechanisms fail, the system can only wait for the broken parts to be repaired and restart the service after that.

VM live-migration is also used to migrate the active VM to a backup host when the VMM needs to be rejuvenated or crashes, or when the main host fails. When the VMM crashes, the VM information stored in the NAS is migrated to a backup host. The live-migration technique can move the active VM with all its requests and sessions from the main host to a backup host without losing any in-flight request or session data during VMM rejuvenation or repair.

Survivability was originally defined by the ANSI T1A1.2 committee as the transient performance of a system after an undesirable event [25].
The metrics used to quantify survivability vary according to the system and the system attributes of interest. In this paper, we classify the metrics into two categories. Instantaneous metrics are transient metrics that capture the state of the system at time t after the occurrence of an undesired event; an example is the probability that the cloud service is recovered by time t. Cumulative metrics are integrals of instantaneous metrics, that is, expected accumulated rewards in the interval (0, t]. The metrics considered in this paper are:

Metric m1. Probability that the App service is recovered by time t.
Metric m2. Mean accumulated loss due to App service breakdown in the interval [0, t].
Metric m3. Mean time to full App service recovery.

Note that survivability metrics are transient metrics computed after the announcement of a service breakdown. In the remainder of this paper, time t refers to the time immediately after a service breakdown and is measured in seconds.

The survivability model

Figure 2 depicts our survivability model, a Markov model with an absorbing state. Survivability aims to capture the evolution of the system under consideration after an unexpected event occurs; thus, the model in Fig. 2 does not include the failure process leading to the service breakdown. The initial state of our model is conditioned to be the service breakdown state. The mean time to determine the cause of the service breakdown is denoted γ. This value is much smaller than the other time intervals considered in this paper, so we assume γ = 0, i.e. the average time spent in state 'App service breakdown' is zero. The initial probabilities of States 1, 2, 3, 4, 5 and 6 are then q1, q2, q3, q4, q5 and q6, respectively.

Figure 2. Phased recovery model of the virtualized system.

We assume that no other failure occurs during the system recovery. All states except the absorbing state are transient, so their steady-state probabilities are zero. The definition of each state is given in Table 2. When the cloud service breakdown occurs due to rejuvenation of the active VM (VM1) or an active VM reboot caused by an aging-related bug, a standby VM (VM2) on the same host is selected to take over the cloud service. This failover process requires a very small delay, denoted 1/α. Meanwhile, the VM must be rejuvenated, which takes time 1/a. After rejuvenation, the system decides whether to restart the application according to whether the VM-failover succeeded; with a probability p1 of successful VM-failover and a mean application restart time of 1/k, the rate from State 8 to State 20 is k(1 − p1). VMC represents a service breakdown due to a VM crash caused by a non-aging Mandelbug in the VM. The failover process is the same as from VMR. Meanwhile, the VM must be repaired, which takes time 1/b. After the repair, the system goes to 'VM ready to reboot' (State 7) and it takes time 1/j to boot the VM. Similarly, after the reboot, the system decides whether to restart the application service according to whether the VM-failover succeeded.

Table 2. State notations.
No. | Notation | Definition
1 | VMR | Service breakdown due to VM rejuvenation or a VM aging-related failure
2 | VMC | Service breakdown due to a VM non-aging Mandelbug-related failure
3 | VMMR | Service breakdown due to VMM rejuvenation
4 | VMMC | Service breakdown due to a VMM non-aging Mandelbug-related failure
5 | Host Fails | Service breakdown due to a host failure
6 | NAS Fails | Service breakdown due to a NAS failure
7 | VM Ready1 | VM is recovered from the breakdown caused by the active VM crash and is ready to reboot
8 | VM Good1 | VM is fresh and the App service is ready to restart if the VM-failover failed
9 | VMM Ready | VMM is recovered from the breakdown caused by the VMM crash and is ready to restart
10 | Host Ready | Host is recovered from the breakdown and is ready to restart
11 | NAS Ready | NAS is recovered from the breakdown and is ready to restart
12 | VM Good2 | VM has rebooted and the App service is ready to restart
13 | VM Ready2 | Backup VM image is reloaded at DC2 and is ready to boot
14 | VMM Good1 | VMM is recovered from the breakdown caused by the VMM rejuvenation and is ready to load the suspended VM image
15 | VMM Good2 | VMM is recovered from the breakdown caused by a VMM crash or host failure and is ready to boot a fresh VM image
16 | Host Good | Host has rebooted and is ready to restart the VMM
17 | NAS Good | NAS has restarted and is ready to reboot a host
18 | Host Good2 | Host has rebooted from a fresh NAS and is ready to restart the VMM
19 | VMM Good3 | VMM has restarted on a fresh host and is ready to load a fresh VM image
20 | App Working | The application is healthy and can provide the cloud service

VMMR denotes the situation where the cloud service breakdown occurs due to VMM rejuvenation or a VMM reboot caused by an aging-related bug in the VMM. VM live-migration is applied with mean time 1/β and succeeds with probability p2.
Before the VMM rejuvenation, the VM on the VMM is suspended so that the App can provide service again as soon as the VM is rebooted after the rejuvenation. As with VM rejuvenation, after the VMM rejuvenation the system decides whether to reboot the VM according to whether the live-migration succeeded; thus, the rate from State 14 to State 20 is j(1 − p2). VMMC represents the case where the cloud service breakdown is caused by a VMM crash due to a non-aging Mandelbug in the VMM. The crashed VMM is repaired with mean repair time 1/d. After the repair, the system goes to 'VMM ready to reboot' (State 9) and it takes time 1/i to restart the VMM. The difference from VMMR is that the VM cannot be suspended, because the crash is unpredictable; hence, after rebooting the VM, the App must be restarted, so the system goes to State 12 instead of State 20.

Host Fails denotes the situation where the cloud service breakdown occurs due to host software bugs or hardware problems. VM live-migration proceeds as described above. In addition, the host must be repaired with mean time 1/e and then rebooted with mean time 1/h. Since a host failure is also unpredictable, the App service must be restarted after the VM reboots, as in the VMMC case.

The last case is NAS Fails, which can happen due to a network storage malfunction or a disaster at the data center. As the storage has failed, VM live-migration cannot be carried out. However, thanks to the backup server, the VM images can be transferred from the backup server to the other data center and then booted there. The mean time to transfer a VM image from the backup server to a data center is 1/σ. The transfer may also fail due to network problems, so it has a success probability p3. The mean times to repair and restart the NAS are 1/f and 1/g, respectively. Since the images in the backup server do not always reflect the latest state of the VM, a reboot is required after the image transfer completes, and the App must also be restarted.

With the above assumptions, the phased recovery process is modeled as a continuous-time Markov chain (CTMC) on the state space S. The infinitesimal generator Q is a 20 × 20 matrix whose only non-zero off-diagonal entries are

Q_{1,8} = a, Q_{1,20} = αp1; Q_{2,7} = b, Q_{2,20} = αp1; Q_{3,14} = c, Q_{3,20} = βp2; Q_{4,9} = d, Q_{4,20} = βp2; Q_{5,10} = e, Q_{5,20} = βp2; Q_{6,11} = f, Q_{6,13} = σp3; Q_{7,8} = j; Q_{8,20} = k(1 − p1); Q_{9,15} = i; Q_{10,16} = h; Q_{11,17} = g; Q_{12,20} = k; Q_{13,12} = j; Q_{14,20} = j(1 − p2); Q_{15,12} = j(1 − p2); Q_{16,15} = i; Q_{17,18} = h; Q_{18,19} = i; Q_{19,12} = j(1 − p3),

and whose diagonal entries are Q_{1,1} = A1, Q_{2,2} = A2, Q_{3,3} = A3, Q_{4,4} = A4, Q_{5,5} = A5, Q_{6,6} = A6, Q_{8,8} = A7, Q_{14,14} = Q_{15,15} = A8, Q_{19,19} = A9, Q_{7,7} = Q_{13,13} = −j, Q_{9,9} = Q_{16,16} = Q_{18,18} = −i, Q_{10,10} = Q_{17,17} = −h, Q_{11,11} = −g, Q_{12,12} = −k and Q_{20,20} = 0, where A1 = −a − αp1, A2 = −b − αp1, A3 = −c − βp2, A4 = −d − βp2, A5 = −e − βp2, A6 = −f − σp3, A7 = −k(1 − p1), A8 = −j(1 − p2), A9 = −j(1 − p3). State 20 (App Working) is absorbing.

For each t ≥ 0 and i ∈ S, define

π_i(t) = Pr{X(t) = i},   (1)

the transient probability that the system is in state i at time t, with ∑_{i=1}^{20} π_i(t) = 1. Let

π(t) = [π_1(t), π_2(t), ..., π_20(t)]   (2)

denote the row vector of transient state probabilities of X(t). The initial state probabilities are set as π_1(0) = q1, π_2(0) = q2, π_3(0) = q3, π_4(0) = q4, π_5(0) = q5, π_6(0) = q6 and π_i(0) = 0 for i ∈ {7, 8, ..., 20}.
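As a concrete illustration, the generator described above can be assembled programmatically from the non-zero transitions just listed. The sketch below builds on the hypothetical `rates` dictionary introduced after Table 1; it reflects our reading of the reconstructed generator, not code released by the authors.

```python
import numpy as np

def build_generator(rates, p1, p2, p3):
    """20x20 infinitesimal generator Q of the phased recovery CTMC.

    States are numbered 1..20 as in Table 2 (indices 0..19 internally);
    State 20 ('App Working') is absorbing.
    """
    r = rates
    # (from_state, to_state, rate) for every non-zero off-diagonal entry of Q.
    transitions = [
        (1, 8,  r["a"]),              (1, 20, r["alpha"] * p1),
        (2, 7,  r["b"]),              (2, 20, r["alpha"] * p1),
        (3, 14, r["c"]),              (3, 20, r["beta"] * p2),
        (4, 9,  r["d"]),              (4, 20, r["beta"] * p2),
        (5, 10, r["e"]),              (5, 20, r["beta"] * p2),
        (6, 11, r["f"]),              (6, 13, r["sigma"] * p3),
        (7, 8,  r["j"]),
        (8, 20, r["k"] * (1 - p1)),
        (9, 15, r["i"]),
        (10, 16, r["h"]),
        (11, 17, r["g"]),
        (12, 20, r["k"]),
        (13, 12, r["j"]),
        (14, 20, r["j"] * (1 - p2)),
        (15, 12, r["j"] * (1 - p2)),
        (16, 15, r["i"]),
        (17, 18, r["h"]),
        (18, 19, r["i"]),
        (19, 12, r["j"] * (1 - p3)),
    ]
    Q = np.zeros((20, 20))
    for src, dst, rate in transitions:
        Q[src - 1, dst - 1] = rate
    # Each diagonal entry equals minus the total exit rate of its state,
    # which reproduces A1..A9 and leaves the absorbing row all zero.
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q
```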
In general, the row vector of transient state probabilities at time t is

π(t) = {π_i(t), i ∈ S}.   (3)

With infinitesimal generator Q, the closed-form expressions for π_i(t) are obtained from the Kolmogorov forward equation:

dπ(t)/dt = π(t)Q.   (4)

Associating each state with a reward rate allows the computation of the survivability metrics of interest, including the mean time to system recovery. Let ϖ_i denote the reward rate of state i. For example, when ϖ_i = 1 for i ∈ {1, ..., 19} and ϖ_20 = 0, the accumulated reward gives the downtime of the cloud service (m2), and its limit as t → ∞ gives the mean time to full App recovery (m3). The quantity

L_i(τ) = ∫_0^τ ϖ_i π_i(x) dx   (5)

denotes the mean reward accumulated in state i by time τ after a service breakdown. Based on these formulas, the metrics defined in section The System Architecture are computed as follows: the probability that the App service is recovered by time t (metric m1) is π_20(t); the mean accumulated loss due to App service breakdown by time t (metric m2) is ∑_{i=1}^{19} L_i(t).

NUMERICAL ANALYSIS AND DISCUSSION

This section gives a detailed evaluation of our model solutions. We assume that all time intervals in the model are exponentially distributed. Unless otherwise specified, the parameter values are based on the existing related literature [11] and the references therein; the default values are listed in Table 1.

We first evaluate the probability that the App service is recovered by time t by varying the initial probabilities of the first six states. The results are shown in Fig. 3. The label '0.5999_0.1_0.1_0.1_0.1_0.0001' denotes the results under q1 = 0.5999, q2 = q3 = q4 = q5 = 0.1 and q6 = 0.0001; similarly, '0.1_0.5999_0.1_0.1_0.1_0.0001' denotes the results under q2 = 0.5999, q1 = q3 = q4 = q5 = 0.1 and q6 = 0.0001. These results indicate that the probability that the App service has recovered increases over time. We also observe that the App recovers quickly when q1 = 0.5999 or q2 = 0.5999, that is, when the service breakdown is caused mainly by VM aging-related bugs/VM rejuvenation or by a VM crash. Since VM-failover is much faster than VM repair or rejuvenation, these two cases yield similar availability even though rejuvenation is faster than repair; the red and blue lines in Fig. 3 almost overlap. The same holds for VMMR and VMMC. The corresponding recovery probability reaches 0.9 in less than 30 s. However, when the service breakdown is caused mainly by host failures (that is, 0.1_0.1_0.1_0.1_0.5999_0.0001), the App service takes much longer to recover than in the VM and VMM failure cases: the recovery probability has not reached 0.9 after 50 s. Also, as VM live-migration is slower than VM-failover, the green and yellow lines stabilize more slowly than the red and blue lines. The slow increase in the App recovery probability leads to larger service downtime, as shown in Fig. 4. We do not show the probabilities obtained by varying q6, as recovery from a NAS failure is much slower than in the other five cases.

Figure 3. Probability of the App being recovered by time t, varying q1, q2, q3, q4, q5 and q6.

Figure 4. Mean accumulated downtime of the App by time t.
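Curves such as those in Figs. 3 and 4, as well as the metrics m1–m3, can be reproduced by solving π(t) = π(0)e^{Qt} numerically. The sketch below reuses the hypothetical `q`, `rates`, `p1`–`p3` and `build_generator` from the earlier snippets; it is an illustration of the computation, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

def transient_probs(Q, pi0, t):
    """Transient state probabilities pi(t) = pi(0) * exp(Q t)."""
    return pi0 @ expm(Q * t)

# Initial distribution: mass q1..q6 on breakdown states 1..6, zero elsewhere.
pi0 = np.zeros(20)
pi0[:6] = q
Q = build_generator(rates, p1, p2, p3)

t = 30.0                                      # observation time in seconds

# m1: probability that the App service is recovered by time t (state 20).
m1 = transient_probs(Q, pi0, t)[19]

# m2: mean accumulated downtime in [0, t]; with reward 1 on states 1-19 this
# equals the integral of (1 - pi_20(x)), i.e. the sum of L_i(t) for i = 1..19.
xs = np.linspace(0.0, t, 301)
downtime = np.array([1.0 - transient_probs(Q, pi0, x)[19] for x in xs])
m2 = trapezoid(downtime, xs)

# m3: mean time to full App recovery = mean time to absorption in state 20.
# Solve A tau = -1 on the sub-generator A restricted to the 19 transient states.
A = Q[:19, :19]
tau = np.linalg.solve(A, -np.ones(19))
m3 = pi0[:19] @ tau

print(f"m1(t={t:g}s) = {m1:.4f}, m2 = {m2:.2f} s, m3 = {m3:.2f} s")
```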
The system performance is sensitive to the system parameters listed in Table 1. The instantaneous probability of each state can help identify the performance bottleneck and hence the parameters responsible for it. Given the initial probability distribution q1 = q2 = q3 = q4 = q5 = q6 = 1/6, Figs. 5 and 6 show the probability that the App service has recovered by time t under different parameter values. Figure 5 shows the effect of different VM-failover rates on the system performance: the higher the VM-failover rate, the faster the system reaches its steady state, and the green line, which represents the fastest VM-failover rate, stabilizes first. Figure 6 shows the effect of different VM rejuvenation rates and VM repair rates; changing these rates makes no significant difference to the recovery probability. In a nutshell, the VM-failover rate has a greater impact on system performance. Similarly, the VM-migration rate may have a greater impact on system performance than the VMM repair rate, the VMM rejuvenation rate or the host repair rate. These results can help determine the best system parameters to optimize the revenue of service providers and help them make the best use of a limited budget.

Figure 5. Effect of the VM-failover rate on the App service recovery probability by time t under 1/6_1/6_1/6_1/6_1/6_1/6.

Figure 6. Effect of the VM repair rate and VM rejuvenation rate on the App service recovery probability by time t under 1/6_1/6_1/6_1/6_1/6_1/6.

CONCLUSION AND FUTURE WORK

Services are migrating from local systems to the cloud, and survivability issues must be addressed before cloud techniques can be used for crucial services. This paper explores CTMC model-based survivability analysis of a cloud service after a service breakdown. We propose the model, define the survivability metrics and carry out a numerical analysis to study the impact of the underlying parameters on system survivability. These results can provide insights into the cost/benefit trade-offs of investment efforts in system recovery strategies. Note that this paper treats the virtualized system restoration process as a CTMC; however, the virtualized system may be time-variant after the service breakdown, that is, the parameter values in Table 1 could change and are sometimes unpredictable. How to characterize a time-variant virtualized system restoration process after a service breakdown is part of our future work. Furthermore, the modeling in this paper is an approximation of the real restoration process; for example, we consider only one backup VM in VM-failover and live-migration, whereas in practical systems the number of hosts and VMs in a DC could reach hundreds to thousands, depending on its pre-designed configuration. We shall extend the current study so that the model approximates real system behavior more accurately.

FUNDING

The research of Zhi Chen and Xiaolin Chang was supported by the National Natural Science Foundation of China (No. 61572066).

REFERENCES
1. Vallée, G., Naughton, T., Ong, H., Tikotekar, A., Engelmann, C., Bland, W., Aderholdt, F. and Scott, S.L. (2008) Virtual system environments: systems and virtualization management. Stand. New Technol., 18, 72–83.
2. Grottke, M., Matias, R. and Trivedi, K.S. (2008) The Fundamentals of Software Aging. IEEE Int. Conf. Software Reliability Engineering Workshops (ISSRE Wksp), Seattle, WA, USA, 11–14 Nov. 2008, pp. 1–6. IEEE.
3. Huang, Y., Kintala, C., Kolettis, N. and Fulton, N.D. (1995) Software Rejuvenation: Analysis, Module and Applications. Int. Symp. Fault-Tolerant Computing (FTCS-25), Digest of Papers, Pasadena, CA, USA, 27–30 June 1995, pp. 381–390. IEEE.
4. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I. and Warfield, A. (2005) Live Migration of Virtual Machines. Proc. 2nd Symp. Networked Systems Design & Implementation, Vol. 2, pp. 273–286. USENIX Association, Berkeley, CA, USA.
5. Trivedi, K.S., Andrade, E.C. and Machida, F. (2012) Combining Performance and Availability Analysis in Practice. Advances in Computers, 84, pp. 1–38. Elsevier, Oxford, UK.
6. Amazon Web Services, Inc. (2017) Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.
7. Ellison, R.J., Fisher, D.A., Linger, R.C., Lipson, H.F., Longstaff, T. and Mead, N.R. (1997) Survivable Network Systems: An Emerging Discipline.
8. Trivedi, K.S. and Xia, R. (2015) Quantification of system survivability. Telecommun. Syst., 60(4), 451–470.
9. Chang, X., Zhang, Z., Li, X. and Trivedi, K.S. (2016) Model-Based Survivability Analysis of a Virtualized System. IEEE 41st Conf. Local Computer Networks (LCN), Dubai, 7–10 Nov. 2016. IEEE.
10. Bolch, G., Greiner, S., de Meer, H. and Trivedi, K.S. (2006) Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, Hoboken, NJ, USA.
11. Nguyen, T.A., Kim, D.S. and Park, J.S. (2016) Availability modeling and analysis of a data center for disaster tolerance. Future Gener. Comput. Syst., 56, 27–50.
12. Heegaard, P.E. and Trivedi, K.S. (2009) Network survivability modeling. Comput. Netw., 53(8), 1215–1234.
13. Liu, Y. and Trivedi, K.S. (2004) A General Framework for Network Survivability Quantification. MMB & PGTS 2004, GI/ITG Conf. Measuring and Evaluation of Computer and Communication Systems, pp. 369–378.
14. Yang, Y., Zhang, Y., Wang, A.H., Yu, M., Zang, W., Liu, P. and Jajodia, S. (2013) Quantitative survivability evaluation of three virtual machine-based server architectures. J. Netw. Comput. Appl., 36(2), 781–790.
15. Zheng, J., Okamura, H. and Dohi, T. (2015) Survivability analysis of VM-based intrusion tolerant systems. IEICE Trans. Inf. Syst., 98(12), 2082–2090.
16. Silva, B., Maciel, P.R.M., Zimmermann, A. and Brilhante, J. (2014) Survivability Evaluation of Disaster Tolerant Cloud Computing Systems. Proc. Probabilistic Safety Assessment & Management Conf.
17. Jacques-Silva, G., Avritzer, A., Menasché, D.S., Koziolek, A., Happe, L. and Suresh, S. (2015) Survivability Modeling to Assess Deployment Alternatives Accounting for Rejuvenation. IEEE Int. Symp. Software Reliability Engineering Workshops (ISSREW), Gaithersburg, MD, USA, 2–5 Nov. 2015. IEEE.
18. Ivanchenko, O., Kharchenko, V., Ponochovny, Y., Blindyuk, I. and Smoktii, O. Semi-Markov Availability Model for Infrastructure as a Service Cloud Considering Hidden Failures of Physical Machines.
19. Silva, B. (2016) A Framework for Availability, Performance and Survivability Evaluation of Disaster Tolerant Cloud Computing Systems.
20. Ghosh, R., Longo, F., Frattini, F., Russo, S. and Trivedi, K.S. (2014) Scalable analytics for IaaS cloud availability. IEEE Trans. Cloud Comput., 2(1), 57–70.
21. Ghosh, R., Trivedi, K.S., Naik, V.K. and Kim, D.S. (2010) End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud: An Interacting Stochastic Models Approach. IEEE 16th Pacific Rim Int. Symp. Dependable Computing (PRDC), Tokyo, Japan, 13–15 Dec. 2010. IEEE.
22. Matos, R., Araujo, J., Oliveira, D., Maciel, P. and Trivedi, K. (2015) Sensitivity analysis of a hierarchical model of mobile cloud computing. Simul. Model. Pract. Theory, 50, 151–164.
23. Xu, X., Lu, Q., Zhu, L., Li, Z., Sakr, S., Wada, H. and Webber, I. (2013) Availability Analysis for Deployment of In-Cloud Applications. Proc. 4th Int. ACM SIGSOFT Symp. Architecting Critical Systems, Vancouver, Canada, 17–21 June 2013. ACM, New York, NY, USA.
24. Lenk, A. and Tai, S. (2014) Cloud Standby: Disaster Recovery of Distributed Systems in the Cloud. Eur. Conf. Service-Oriented and Cloud Computing, Manchester, UK, 2–4 Sep. 2014, pp. 32–46. Springer, Berlin.
25. Technical Report 68 (2001) Enhanced Network Survivability Performance. ANSI T1A1.2 Working Group on Network Survivability Performance.

Survivability Modeling and Analysis of Cloud Service in Distributed Data Centers

Loading next page...
 
/lp/ou_press/survivability-modeling-and-analysis-of-cloud-service-in-distributed-LfP6APOTh4
Publisher
Oxford University Press
Copyright
© The British Computer Society 2017. All rights reserved. For permissions, please email: journals.permissions@oup.com
ISSN
0010-4620
eISSN
1460-2067
D.O.I.
10.1093/comjnl/bxx116
Publisher site
See Article on Publisher Site

Abstract

Abstract Analyzing the survivability of a cloud service is critical as the application or service migration from local to cloud is an irresistible trend. However, former research on cloud service or virtual system (VS) availability and/or reliability was only carried out from the perspective of steady state. This paper aims to analyze the survivability of the cloud service after a service breakdown occurrence by presenting a model and the closed-form solutions with the use of continuous-time Markov chain. The service breakdown may be caused by virtual machine (VM) and/or VM monitor (VMM) bugs or software rejuvenation and/or host failures and NAS (Network Area Storage) failures. In order to improve the cloud service survivability, the VS applies two techniques: VM failover and VM live-migration. Through the model proposed and the survivability metrics defined in this paper, we are able to quantitatively assess the system survivability while providing insights into the investment efforts in system recovery strategies. In order to study the impact of key parameters on system survivability, this paper also provides a parameter sensitivity analysis through numerical experiments. INTRODUCTION System virtualization (SV) technology has been widely used for academic and industrial purposes. In SV-based virtualized systems [1], a virtual machine monitor (VMM) is a layer of software between one or more operating systems and the hardware, which performs the emulation of hardware of a physical machine and thus is a key component in the virtualized systems. It often becomes the single point of failure. A virtual machine (VM) runs on the top of VMM for emulating a particular computer system and the cloud service is deployed on a VM. Like traditional softwares, VM and VMM are also subject to the problems of software aging [2, 3], bugs, crashes and so on. The underlying host and the storage system may also face problems which could lead to failures, not only software problems but also hardware problems. Most virtualization systems are deployed at a Cloud Service Provider (CSP)’s data center; the entire data center infrastructure is likely to be destroyed in some extreme cases, like earthquake or hurricane. All these problems not only degrade the performance but also reduce the cloud service availability and then increase service downtime. Software rejuvenation [3], failover and live VM-migration [4] are the common high-availability (HA) techniques adopted in a data center. Because of the tremendous growth of the cloud service, the need for the performability analysis [5] of a cloud service with these HA techniques increases. A large downtime of cloud service can cause productivity loss and even business loss. For example, Amazon AWS S3 outage on 28 February 2017, lead to large-scale application and service collapse [6]. Steady-state performance and dependability of the virtualized environment have been widely studied. The survivability, a transient measure, is generally defined as the ability of the system to recover pre-defined service in a timely manner after the occurrence of disaster [7, 8] which could be any kind of undesired occurrence to the system of interest. However, there is less work on survivability analysis, which could help improve the systems’ capability to provide critical services when system damage occurs. This paper is an extension of our previous work originally reported in The 41nd IEEE Conference on Local Computer Networks (LCN) [9]. 
We improved the previous article from the following aspects: In the previous work, we ignored the situations that the physical machine (Host) and the Network Area Storage (NAS) will face software mistakes or hardware failure which could also lead to a service breakdown. In order to simplify the model, we did not consider the failure of VM failover and VM live migration before. But in the practical environment, the process of VM failover or VM-migration may also fail due to software mistakes or network failures. In the previous work, we only considered a single data center while in this paper the backup server and backup data center are all considered. Those make our model much closer to practical systems. This paper considers cloud service breakdown which may be caused by VM rejuvenation, VMM rejuvenation, VM bugs, VMM bugs, host failures or NAS failures. We quantify the survivability as the transient performance of the cloud from the instant when the service breakdown occurs until the application services deployed in the data center perform recovery. By service recovery, we mean that the application deployed in the data center can still provide service in their original place or move into another virtual machines. The cloud service system considered in this paper consists of three main components: Two Data Center (DC)s (namely, DC1 and DC2) and a Backup Server, as shown in Fig. 1. There are more than one host in DC1: Host1, acts as main host and Host2. Host3 is deployed at DC2, act as a backup host, as well as Host2. Host1 contains an active VM (VM1) and a standby VM (VM2). There are another two standby VMs (VM3, VM4) on Host2 and Host3, respectively. The cloud service, denoted as application (App) in the following, is deployed in the active VM. The virtualized system applies two techniques to improve service survivability: VM failover and live VM-migration. When the active VM fails, a standby VM on the same host is used or migrated to the other host to continue the cloud service. We will describe the system architecture in more detail in section System Description and Model. The unique features of system virtualization highlight the challenges to address the survivability for the virtualized system. Figure 1. View largeDownload slide The system architecture. Figure 1. View largeDownload slide The system architecture. This paper aims to assess the survivability of the cloud service and the virtualized system after a service breakdown occurrence. We employ continuous-time Markov chain (CTMC) to present a survivability model of the cloud service. Survivability metrics of interest are furthermore defined and the closed-form solutions to these metrics are derived in order to quantify the attributes of the cloud service during the virtualized system recovery. The attributes include (i) the downtime of the cloud service by time t; (ii) the probabilities of the cloud service recovery by time t; (iii) meantime to recover the cloud service. Sensitivity analyses are also carried out through numerical experiments. As far as we know, no work has been carried out on survivability analysis and evaluation to such a virtualized system exists. The proposed model and metrics could analyze not only the survivability of cloud service but also the survivability of the whole data center. 
Therefore, compared with the existing transient analyses of virtualized systems, more insights could be provided for system designers and operators to select and deploy appropriate survivability techniques and optimize their benefits. Note that although the distribution of all relevant event times is assumed to be exponential, a number of techniques are available to relax this assumption if needed [10]. We leave the relaxation of this assumption for future work. The main contributions of this paper are summarized as follows: We present a CTMC-based survivability modeling method and analysis of distributed data centers. Our model takes into account all possible cloud service breakdown situations, and both the backup server and the backup data center are considered. The closed-form solutions of our model can not only assess the survivability of the cloud service but also support parameter sensitivity analysis. The paper is organized as follows. Section Background and Related Work describes background knowledge and related work. Section System Description and Model provides the system model and gives the closed-form solutions. In section Numerical Analysis and Discussion, we present evaluation results. The conclusion and future work are in section Conclusion and Future Work.

BACKGROUND AND RELATED WORK This section first presents HA techniques and then the stochastic analysis of the virtualized system. Software failures may be caused by inherent software design defects or by improper usage. For long-running VMMs and VMs, one of the major causes of software failures is software aging. This issue not only increases the failure rate and degrades the cloud service performance but can also lead to a system crash [2, 3]. Software rejuvenation [3] is a software fault-tolerance technique to defend against software aging. This technique gracefully stops the execution of an application/system and periodically restarts it at a clean internal state in a proactive manner. Failover and live VM-migration are also common techniques used for achieving VM high availability in real virtualized systems, such as Amazon EC2. Failover is a backup operational mechanism, in which secondary system components perform the functions of a system component (such as a processor, a server, a network or a database) when the primary component cannot work. In a virtualized system, it is achieved by creating an active VM and several standby VMs. When the active VM suffers a failure or gets ready to be rejuvenated, one standby VM can take over the role of the active VM to continue the task execution. The live VM-migration mechanism can move a running VM or application between different physical machines. The memory, storage and network connectivity information of the original VM is transferred from the original host to the destination host. Recently, studies have been carried out on cloud service availability analysis using analytic modeling approaches. Most of them focused on steady-state analysis, see [11] and references therein. Although survivability analysis has been conducted in other fields [12, 13], few researchers have analyzed the survivability of the virtualized environment. In [14], the authors analyzed the survivability of three VM-based architectures through simulations. Their survivability definition is different from the definition of this paper. They defined it as the probability for the system to deliver the pre-defined service under attacks. Actually, their survivability is a steady-state metric.
Although they carried out transient analysis, their purpose was to check how quickly the steady state can be reached. In addition, they only analyzed from the viewpoint of the cloud service while our analysis is carried out from the viewpoint of both the cloud service and the whole virtualized system recovery. Therefore, our model and metrics could provide more guidelines for system designers and administrators to manage the virtualized system. In [15], the authors proposed a discrete-time Markov chain-based stochastic model to analyze the survivability of the system in the presence of intrusion. Their survivability is also a steady-state attribute. In [16], the authors considered two data centers, one of which performed the role of backup. This work presented stochastic Petri net models to evaluate survivability metrics in IaaS systems. The key difference between our model and theirs is that our model covers not only cloud service survivability but also the recovery of the whole virtualized system; they did not consider the recovery of the whole data centers. In addition, more survivability metrics are defined and closed-form solutions are given in this paper. In [17], the authors proposed a survivability modeling approach to assess the impact of failures which cause system unavailability, and used software rejuvenation to proactively recover from soft failures. The proposed model can be used to compute the mean time to repair of reliable systems. The authors in [18] used a semi-Markov model to analyze the availability of an IaaS cloud system, studying the impact of sudden and hidden failures of physical machines on the availability of the whole cloud system. In [19], the authors adopted a formal hybrid modeling method to evaluate the performance, availability and survivability of cloud computing systems. Their work combines stochastic Petri nets, Markov chains, reliability block diagrams and other high-level models to estimate the metrics related to system disaster recovery. The authors in [20] presented a study on quantifying the availability of an IaaS cloud system using Petri nets and Markov chains. They gave closed-form equations to assess the system and reduce the solution time. The authors also gave a quantitative performability analysis of a cloud system in [21], where the effects of workload variation, failure rate and system QoS are considered. In [22], the authors showed four different sensitivity analysis techniques to identify the parameters that affect the availability of a mobile cloud system most. The analysis results show that reducing some system parameters could improve the overall availability. In [23], the authors proposed a stochastic reward net model for availability evaluation of a cloud computing system. They proposed a calibration phase to improve the accuracy of results. This work also focused on the evaluation of failure recovery. The authors in [24] introduced a failure mitigation technique in distributed systems. They proposed to use a so-called cloud standby method for disaster recovery: a standby system replaces the failed system whenever a disaster occurs. Despite these works' remarkable quality, most of them only performed steady-state evaluation and did not consider distributed cloud data centers.

SYSTEM DESCRIPTION AND MODEL This section gives an introduction to the virtualized system and then presents the CTMC model.
The system architecture Two data centers (DCs), namely DC1 and DC2, and a backup server are the three main components of the virtualized system considered in this paper, see Fig. 1. The two DCs are located at different places which are generally far away from each other. The definitions of the parameters used in this paper are listed in Table 1.

Table 1. Parameter definition.
Notation | Definition | Default value
q1 | Probability that the service breakdown is due to the active VM rejuvenation or its aging-related failure | 0.5999
q2 | Probability that the service breakdown is caused by a VM crash | 0.1
q3 | Probability that the service breakdown is due to the active VMM rejuvenation or its aging-related failure | 0.1
q4 | Probability that the VMM crashes | 0.1
q5 | Probability that the host fails | 0.1
q6 | Probability that the NAS fails | 0.0001
γ | Decision time | 0
α | 1/α is the mean VM-failover time | 3.0 s
β | 1/β is the mean VM live-migration time | 5.0 s
σ | 1/σ is the mean VM image transfer time from the backup server to DC2 | 20 s
a | 1/a is the mean VM rejuvenation time | 1 min
b | 1/b is the mean VM repair time | 30 min
c | 1/c is the mean VMM rejuvenation time | 2 min
d | 1/d is the mean VMM repair time | 1 h
e | 1/e is the mean host repair time | 3 days
f | 1/f is the mean NAS repair time | 3 days
g | 1/g is the mean NAS restart time | 5 min
h | 1/h is the mean host restart time | 2 min
i | 1/i is the mean VMM restart time | 1 min
j | 1/j is the mean VM restart time | 30 s
k | 1/k is the mean App restart time | 1 s
p1 | Probability that VM-failover succeeds | 0.9
p2 | Probability that VM live-migration succeeds | 0.95
p3 | Probability that the VM reload at DC2 succeeds | 0.99

DC1, as the main data center, contains two hosts (Host1 and Host2) and one Network Area Storage (NAS1). The two hosts have the same system architecture and are both connected with NAS1 to store VM images. Host1, the main host, contains a VMM which runs an active VM (VM1) with the desired application and one standby VM (VM2) that runs the same application as VM1 for the purpose of the VM-failover mechanism. A dotted line in Fig. 1 represents Host2, a backup host with a VM vacancy (VM3) capable of running a VM migrated from the main host (Host1). It is a spare host which can take over the main host role when a VM migrates to it successfully. Finally, there is a management host in DC1 which is responsible for controlling the entire cloud environment and detecting VMM failures by means of a specific cloud management tool. We do not show it in Fig. 1 as it has no effect on our model. All VM images and VMM code files are stored in the network attached storage (NAS1). In addition, all the running information of the active VMs is stored in NAS1 periodically. These data can then be used for failover, VM restart, VMM restart and live VM-migration. An agent is installed in each VM for the VM rejuvenation operation. Similarly, an agent is deployed in the VMM in order to monitor all the VMs running on this VMM, detect their abnormal behaviors and trigger the rejuvenation of a VM when it detects anomalies.
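As a compact summary, the topology of Fig. 1 can be written down as a small data structure. The sketch below is purely illustrative: the component names follow the figure, while the field wording is our own shorthand rather than part of the model.

# Illustrative sketch of the Fig. 1 topology; names follow the figure,
# field wording is our own shorthand.
topology = {
    "backup_server": {"role": "stores backup VM images for both data centers"},
    "DC1": {
        "storage": "NAS1",   # VM images, VMM code files, periodically saved VM state
        "hosts": {
            "Host1": {"role": "main host",   "vms": {"VM1": "active", "VM2": "standby"}},
            "Host2": {"role": "backup host", "vms": {"VM3": "vacancy for a migrated VM"}},
        },
    },
    "DC2": {
        "storage": "NAS (same role as NAS1)",
        "hosts": {
            "Host3": {"role": "backup host", "vms": {"VM4": "vacancy for a reloaded VM"}},
        },
    },
}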
DC2 is geographically far away from DC1. It is used for disaster tolerance. As shown in Fig. 1, the infrastructure of DC2 is basically the same as that of DC1. We use a dashed box to represent VM4 in Host3, which is also a VM vacancy used for running the VM migrated from DC1 or loaded from the backup server. It is important to note that there are many hosts and VMs in a data center in a real-world environment. Since only one application or service is considered in our model, only three hosts and four VMs are described in Fig. 1. Both DC1 and DC2 are connected with the backup server. The backup server is used to support disaster tolerance as it stores VM images from the data centers as a backup. Once one data center suffers a disaster, the VM images will be immediately transferred to the other data center. The application can be restarted after reloading the backup VM images. There are dependencies among system components. That is, the change of a component state will trigger the state change of dependent components. We define A ≻ B as A depends on B. Therefore, in our virtualized system, the component dependency is App ≻ VM ≻ VMM ≻ Host ≻ Storage. We assume that there is no other service breakdown before the cloud service has recovered from the current one. A non-aging Mandelbug-related failure and an aging-related failure could happen to all VMs (including primary and standby) and VMMs of both the main and backup hosts. When such a failure happens to the active VM (VM1), the application deployed in it will be broken. Meanwhile, the active VM's running state stored in the NAS will be sent to the standby VM (VM2). Then the VM-failover mechanism will be triggered and the standby VM2 is selected to take charge of the application and act as the active VM during the repair period. We assume that at least one VM on a VMM can be used as a standby VM at any time. That is to say, the failed VM will be repaired or rejuvenated in time and become the new standby VM. When the VMM in Host1 fails due to VMM rejuvenation, aging-related bugs or non-aging Mandelbugs, the VM live-migration mechanism will be triggered as soon as the application service goes down. The VM image will be transferred to Host2 and loaded as VM3. The VM live-migration mechanism can also be used when the main host (Host1) itself fails due to software bugs or hardware problems. In the extreme case where DC1 suffers a disaster, or NAS1 suffers software or hardware problems leading to a NAS1 failure or breakdown, Host3 will take over by transferring the backup VM image from the backup server, loading it as VM4 and then restarting the application service. Note that every high-availability (HA) mechanism like migration or failover has a possibility of failure. Once the HA mechanisms fail, the system can only wait for the broken parts to be repaired and restart the service after that. VM live-migration is used to migrate the active VM to a backup host when the VMM needs to be rejuvenated or crashes, or when the main host fails. When the VMM crashes, the VM information stored in the NAS will be migrated to a backup host. The live-migration technique can move the active VM with all the requests and sessions from the main host to a backup host without losing any in-flight request or session data during the VMM rejuvenation or repair.
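As a concrete illustration of the recovery logic just described, the following sketch maps each breakdown cause to the HA mechanism attempted first, and to the fallback path when that mechanism fails. It is our own schematic code, not the authors' implementation; the function and variable names are hypothetical, and the success probabilities correspond to p1, p2 and p3 in Table 1.

# Schematic sketch (ours, not the authors' implementation) of the recovery logic:
# which HA mechanism is attempted first for each breakdown cause, and the
# fallback when that mechanism fails. Probabilities correspond to p1, p2, p3.
import random

P_FAILOVER_OK = 0.90    # p1: VM-failover succeeds
P_MIGRATION_OK = 0.95   # p2: VM live-migration succeeds
P_RELOAD_OK = 0.99      # p3: backup VM image reload at DC2 succeeds

FIRST_ACTION = {
    "VM rejuvenation / aging-related failure":  ("failover to standby VM2 on Host1", P_FAILOVER_OK),
    "VM crash (non-aging Mandelbug)":           ("failover to standby VM2 on Host1", P_FAILOVER_OK),
    "VMM rejuvenation / aging-related failure": ("live-migrate active VM to Host2",  P_MIGRATION_OK),
    "VMM crash (non-aging Mandelbug)":          ("live-migrate active VM to Host2",  P_MIGRATION_OK),
    "Host failure":                             ("live-migrate active VM to Host2",  P_MIGRATION_OK),
    "NAS failure / DC1 disaster":               ("reload backup VM image on Host3",  P_RELOAD_OK),
}

def recover(cause, rng=random):
    """Return the recovery steps taken for one service breakdown."""
    action, p_ok = FIRST_ACTION[cause]
    if rng.random() < p_ok:
        return [action, "App continues on the new active VM"]
    # If the HA mechanism fails, the broken component must be repaired and the
    # stack restarted bottom-up along the dependency App > VM > VMM > Host > Storage.
    return [action + " (failed)",
            "repair/restart the failed component",
            "reboot the VM",
            "restart the App"]

if __name__ == "__main__":
    for cause in FIRST_ACTION:
        print(cause, "->", recover(cause))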
Survivability was originally defined by the ANSI T1A1.2 committee as the transient performance of a system after an undesirable event [25]. The metrics used to quantify survivability vary according to the system and the system attributes of interest. In this paper, we classify the metrics into two categories: Instantaneous metrics are transient metrics that capture the state of the system at time t after the occurrence of an undesired event. An example of an instantaneous metric is the probability that the cloud service is recovered by time t. Cumulative metrics are integrals of instantaneous metrics, that is, expected accumulated rewards in the interval (0, t]. The metrics considered in this paper include Metric m1: probability that the App service is recovered by time t. Metric m2: mean accumulated loss of App service breakdown in the interval [0, t]. Metric m3: mean time up to full App service recovery. Note that survivability metrics are transient metrics computed after the announcement of a service breakdown. In the remainder of this paper, time t refers to the time immediately after a service breakdown and is measured in seconds.

The survivability model Figure 2 describes our survivability model. It is a Markov model with an absorbing state. Survivability aims to capture the evolution of the system under consideration after an unexpected event occurs. Thus, the model in Fig. 2 does not include the failure process leading to service breakdown. The initial state of our model is conditioned to be the service breakdown state. The mean time to determine the cause of the service breakdown is denoted as γ. This mean time is much smaller than the other time intervals considered in this paper. Hence, we assume γ = 0, that is, the average time spent in the state 'App service breakdown' is zero. The initial probabilities of States 1, 2, 3, 4, 5 and 6 are then equal to q1, q2, q3, q4, q5 and q6, respectively. Figure 2. Phased recovery model of the virtualized system. We assume that no other failure occurs during the system recovery. All the states except the absorbing state are transient, so their steady-state probabilities are zero. The definition of each state is given in Table 2. When the cloud service breakdown occurs due to the active VM (VM1) rejuvenation or the active VM reboot caused by an aging-related bug in the VM, a standby VM (VM2) on the same host is selected for handling the cloud service. This failover process requires a very small delay, denoted as 1/α. Meanwhile, the VM must rejuvenate, which takes time 1/a. After rejuvenation, the system decides whether to restart the application according to whether the VM-failover was successful or not. The probability of a successful VM-failover is p1, so the rate from State 8 to State 20 is k(1−p1), as the mean application restart time is 1/k. VMC represents a service breakdown due to a VM crash caused by a non-aging Mandelbug in the VM. The failover process is the same as that from VMR. Meanwhile, the VM must be repaired, which takes time 1/b. After the repair, the system goes to 'VM ready to reboot' (State 7) and it takes 1/j time to boot the VM system. Similarly, after the reboot, the system decides whether to start the application service according to whether the VM-failover was successful or not. Table 2. State Notations.
Notation | Definition
VMR (1) | Service breakdown due to VM rejuvenation or a VM aging-related failure
VMC (2) | Service breakdown due to a VM non-aging Mandelbug-related failure
VMMR (3) | Service breakdown due to VMM rejuvenation
VMMC (4) | Service breakdown due to a VMM non-aging Mandelbug-related failure
Host Fails (5) | Service breakdown due to a host failure
NAS Fails (6) | Service breakdown due to a NAS failure
VM Ready1 (7) | The VM is recovered from the breakdown caused by the active VM crash and is ready to reboot
VM Good1 (8) | The VM is fresh and the App service is ready to restart if the VM-failover failed
VMM Ready (9) | The VMM is recovered from the breakdown caused by the VMM crash and is ready to restart
Host Ready (10) | The host is recovered from the breakdown and is ready to restart
NAS Ready (11) | The NAS is recovered from the breakdown and is ready to restart
VM Good2 (12) | The VM has already rebooted and the App service is ready to restart
VM Ready2 (13) | The backup VM image is reloaded at DC2 and ready to boot
VMM Good1 (14) | The VMM is recovered from the breakdown caused by the VMM rejuvenation and is ready to load the suspended VM image
VMM Good2 (15) | The VMM is recovered from the breakdown caused by a VMM crash or host failure and is ready to boot a fresh VM image
Host Good (16) | The host has already rebooted and is ready to restart the VMM
NAS Good (17) | The NAS has already restarted and is ready to reboot a host
Host Good2 (18) | The host has already rebooted from a fresh NAS and is ready to restart the VMM
VMM Good3 (19) | The VMM has already restarted from a fresh host and is ready to load a fresh VM image
App Working (20) | The application is healthy and can provide the cloud service

VMMR denotes the situation where the cloud service breakdown occurs due to the VMM rejuvenation or the VMM reboot caused by an aging-related bug in the VMM. VM live-migration is applied with mean time 1/β and the probability of success is p2.
Before the VMM rejuvenation, the VM in the VMM will be suspended so that the App can provide service again as soon as the VM is rebooted after the rejuvenation. As with VM rejuvenation, after the VMM rejuvenation the system will decide whether to reboot the VM according to whether the live-migration was successful or not. Thus, the rate from State 14 to State 20 is j(1−p2). VMMC represents the case where the cloud service breakdown is caused by a VMM crash due to a non-aging Mandelbug in the VMM. The crashed VMM is repaired with mean repair time 1/d. After the repair, the system goes to 'VMM ready to reboot' (State 9) and it takes 1/i time to restart the VMM. The difference from VMMR is that the system cannot be suspended, as the crash is unpredictable. Hence, after rebooting the VM, we should restart the App, so the system goes to State 12 instead of State 20. Host Fails denotes the situation where the cloud service breakdown occurs due to host software bugs or hardware problems. The VM live-migration is the same as previously mentioned. In addition, the host needs to be repaired with mean time 1/e and then rebooted with mean time 1/h. Since the failure of a host is also unpredictable, we need to restart the App service after rebooting the VM system, as in the VMMC situation. The last case is NAS Fails. This could happen due to a network storage malfunction or a disaster at the data center. As the storage fails, we cannot carry out VM live-migration. However, due to the existence of a backup server, we can transfer the VM images from the backup server to the other data center and then boot the images. The mean time to transfer a VM image from the backup server to a data center is 1/σ. The transmission may also fail due to network problems, so we have a success probability p3. The mean times to repair and restart the NAS are 1/f and 1/g, respectively. Since the images in the backup server are not always the latest state of the VM, a reboot is required after the image transfer is complete and the App also needs to be restarted. With the above assumptions, the phased recovery process is mathematically modeled as a continuous-time Markov chain (CTMC) on the state space S. The infinitesimal generator of the Markov chain, with rows and columns ordered by the state numbering in Table 2, is

Q =
[ A1 0 0 0 0 0 0 a 0 0 0 0 0 0 0 0 0 0 0 αp1 ]
[ 0 A2 0 0 0 0 b 0 0 0 0 0 0 0 0 0 0 0 0 αp1 ]
[ 0 0 A3 0 0 0 0 0 0 0 0 0 0 c 0 0 0 0 0 βp2 ]
[ 0 0 0 A4 0 0 0 0 d 0 0 0 0 0 0 0 0 0 0 βp2 ]
[ 0 0 0 0 A5 0 0 0 0 e 0 0 0 0 0 0 0 0 0 βp2 ]
[ 0 0 0 0 0 A6 0 0 0 0 f 0 σp3 0 0 0 0 0 0 0 ]
[ 0 0 0 0 0 0 −j j 0 0 0 0 0 0 0 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 A7 0 0 0 0 0 0 0 0 0 0 0 −A7 ]
[ 0 0 0 0 0 0 0 0 −i 0 0 0 0 0 i 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 −h 0 0 0 0 0 h 0 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 −g 0 0 0 0 0 g 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 −k 0 0 0 0 0 0 0 k ]
[ 0 0 0 0 0 0 0 0 0 0 0 j −j 0 0 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 A8 0 0 0 0 0 −A8 ]
[ 0 0 0 0 0 0 0 0 0 0 0 −A8 0 0 A8 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i −i 0 0 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 −h h 0 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 −i i 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 −A9 0 0 0 0 0 0 A9 0 ]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

where A1 = −a − αp1, A2 = −b − αp1, A3 = −c − βp2, A4 = −d − βp2, A5 = −e − βp2, A6 = −f − σp3, A7 = −k(1−p1), A8 = −j(1−p2) and A9 = −j(1−p3). For each t ≥ 0 and i ∈ S, define

πi(t) = Pr{X(t) = i} (1)

as the transient probability that the system is in state i at time t, so that ∑_{i=1}^{20} πi(t) = 1. Let

π(t) = [π1(t), π2(t), ..., π20(t)] (2)

denote the row vector of transient state probabilities of X(t). The initial state probabilities are set as π1(0) = q1, π2(0) = q2, π3(0) = q3, π4(0) = q4, π5(0) = q5, π6(0) = q6 and πi(0) = 0 for i ∈ {7, 8, ..., 20}.
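For readers who wish to reproduce the analysis, the generator above can be assembled numerically from the Table 1 defaults. The following Python sketch is our own illustration (not the authors' code); the function and variable names are hypothetical, and all rates are expressed per second.

# Sketch: assemble the 20x20 infinitesimal generator Q from the Table 1 defaults.
# Illustrative code only; states are numbered 1..20 as in Table 2.
import numpy as np

MIN, HOUR, DAY = 60.0, 3600.0, 86400.0
# Rates are reciprocals of the mean times in Table 1, in units of 1/second.
alpha, beta, sigma = 1/3.0, 1/5.0, 1/20.0
a, b, c, d = 1/MIN, 1/(30*MIN), 1/(2*MIN), 1/HOUR
e, f, g, h = 1/(3*DAY), 1/(3*DAY), 1/(5*MIN), 1/(2*MIN)
i, j, k = 1/MIN, 1/30.0, 1/1.0
p1, p2, p3 = 0.9, 0.95, 0.99

def build_Q():
    Q = np.zeros((20, 20))
    # Nonzero off-diagonal transition rates (1-based state indices).
    rates = {
        (1, 8): a,   (1, 20): alpha*p1,   # VMR: rejuvenate / successful failover
        (2, 7): b,   (2, 20): alpha*p1,   # VMC: repair / successful failover
        (3, 14): c,  (3, 20): beta*p2,    # VMMR: rejuvenate / successful live-migration
        (4, 9): d,   (4, 20): beta*p2,    # VMMC: repair / successful live-migration
        (5, 10): e,  (5, 20): beta*p2,    # Host fails: repair / successful live-migration
        (6, 11): f,  (6, 13): sigma*p3,   # NAS fails: repair / successful image reload at DC2
        (7, 8): j,                        # reboot repaired VM
        (8, 20): k*(1 - p1),              # restart App if failover failed
        (9, 15): i,                       # restart repaired VMM
        (10, 16): h,                      # reboot repaired host
        (11, 17): g,                      # restart repaired NAS
        (12, 20): k,                      # restart App
        (13, 12): j,                      # boot the reloaded VM image at DC2
        (14, 20): j*(1 - p2),             # resume suspended VM if live-migration failed
        (15, 12): j*(1 - p2),             # boot a fresh VM if live-migration failed
        (16, 15): i,                      # restart VMM on the rebooted host
        (17, 18): h,                      # reboot host on the restarted NAS
        (18, 19): i,                      # restart VMM (NAS-failure path)
        (19, 12): j*(1 - p3),             # load a fresh VM image if the DC2 reload failed
    }
    for (r, s), rate in rates.items():
        Q[r - 1, s - 1] = rate
    np.fill_diagonal(Q, -Q.sum(axis=1))   # each diagonal entry = minus its row sum
    return Q

pi0 = np.zeros(20)
pi0[:6] = [0.5999, 0.1, 0.1, 0.1, 0.1, 0.0001]   # q1..q6: initial breakdown-cause probabilities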
Generally, the row vector of transient state probabilities at time t can be written as

π(t) = {πi(t), i ∈ S}. (3)

With the infinitesimal generator Q, we can derive the closed-form expressions for πi(t) using the Kolmogorov differential equation:

dπ(t)/dt = π(t)Q. (4)

Associating each state with a reward rate allows the computation of the survivability metrics of interest, such as the mean time to system recovery. Let ϖi denote the reward rate of state i. For example, when ϖi = 1 for i ∈ [1, 19] and ϖ20 = 0, the accumulated reward gives the downtime of the cloud service (m2), and its limit as t → ∞ gives the mean time to full App recovery (m3). The mean reward accumulated by time τ at state i after a service breakdown is defined as

Li(τ) = ∫_0^τ ϖi πi(x) dx. (5)

Based on these formulas, we obtain the formulas for computing the metrics mentioned in section The System Architecture: the probability that the App service is recovered by time t (the value of m1) is π20(t), and the mean accumulated loss of App service breakdown by time t (the value of m2) is ∑_{i=1}^{19} Li(t).

NUMERICAL ANALYSIS AND DISCUSSION This section gives a detailed evaluation of our model solutions. We assume that all time intervals in the model are exponentially distributed. Unless otherwise specified, all the parameter values in the model are based on the existing related literature [11] and the references therein. The default parameter values are listed in Table 1. We first evaluate the probability of the App service being recovered by time t by varying the initial probabilities of the first six states. The results are shown in Fig. 3. '0.5999_0.1_0.1_0.1_0.1_0.0001' denotes the results under q1 = 0.5999, q2 = q3 = q4 = q5 = 0.1 and q6 = 0.0001. Similarly, '0.1_0.5999_0.1_0.1_0.1_0.0001' denotes the results under q2 = 0.5999, q1 = q3 = q4 = q5 = 0.1 and q6 = 0.0001. These results indicate that the probability that the App service has recovered increases over time. We also observe that the App recovers quickly in the cases q1 = 0.5999 and q2 = 0.5999, that is, when the service breakdown is caused mainly by VM aging-related bugs/VM rejuvenation or a VM crash. Since VM-failover is much faster than VM repair or rejuvenation, the recovery probabilities in these two cases are similar even though rejuvenation is faster than repair; the red line and the blue line in Fig. 3 almost overlap. The same is true for VMMR and VMMC. The corresponding recovery probability reaches 0.9 in less than 30 seconds. However, when the service breakdown is mainly caused by a host failure (that is, 0.1_0.1_0.1_0.1_0.5999_0.0001), it takes much more time for the App service to recover, compared with VM failure and VMM failure: the App recovery probability has not reached 0.9 after 50 s. Also, as VM live-migration is slower than VM-failover, the green and yellow lines are slower to stabilize than the red and blue lines. The slow increase in the App recovery probability leads to larger service downtime, as shown in Fig. 4. We do not show the probabilities obtained by varying q6, as the probability of a NAS failure is much smaller than that of the other five causes. Figure 3. Probability of the App being recovered by time t, varying q1, q2, q3, q4, q5 and q6. Figure 4. Mean accumulated downtime of the App by time t.
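To show how these metrics can be evaluated numerically, the sketch below solves π(t) = π(0)·exp(Qt) and computes m1, m2 and m3. It builds on the hypothetical build_Q() and pi0 definitions from the previous sketch and is, again, our own illustration rather than the authors' code.

# Sketch: transient solution of the CTMC and the metrics m1, m2, m3.
# Assumes build_Q() and pi0 from the previous sketch are in scope.
import numpy as np
from scipy.linalg import expm

Q = build_Q()

def pi(t):
    """Transient state probabilities pi(t) = pi(0) * exp(Q t)."""
    return pi0 @ expm(Q * t)

def m1(t):
    """Probability that the App service is recovered by time t (pi_20(t))."""
    return pi(t)[19]

def m2(t, steps=400):
    """Mean accumulated downtime in [0, t]: integral of (1 - pi_20(x)) dx."""
    xs = np.linspace(0.0, t, steps)
    ys = np.array([1.0 - m1(x) for x in xs])
    return float(np.sum((ys[:-1] + ys[1:]) * 0.5 * np.diff(xs)))   # trapezoidal rule

def m3():
    """Mean time to full App recovery (mean time to absorption in state 20)."""
    T = Q[:19, :19]                          # generator restricted to the transient states
    tau = np.linalg.solve(T, -np.ones(19))   # expected absorption time from each transient state
    return float(pi0[:19] @ tau)

if __name__ == "__main__":
    for t in (10.0, 30.0, 60.0):
        print(f"t={t:5.1f} s  m1={m1(t):.4f}  m2={m2(t):.2f} s")
    print("m3 =", m3(), "s")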
The system performance is sensitive to the system parameters listed in Table 1. The instantaneous probability of each state can help identify the performance bottleneck and then determine the related parameters leading to the bottleneck. Given the initial probability distribution q1 = q2 = q3 = q4 = q5 = q6 = 1/6, Figs 5 and 6 show the probabilities that the App service has recovered by time t for different parameter values. Figure 5 shows the effect of different VM-failover rates on the system performance. It can be seen from the figure that the higher the VM-failover rate, the faster the system reaches the steady state; the green line represents the highest VM-failover rate. Figure 6 shows the effect of different VM rejuvenation rates and VM repair rates on the system performance. We can see that there is no significant difference in the probabilities when changing the VM repair and rejuvenation rates. In a nutshell, the VM-failover rate has a greater impact on system performance. Similarly, the VM-migration rate may have a greater impact on system performance than the VMM repair rate, the VMM rejuvenation rate or the host repair rate. These results could help determine the best system parameters to optimize the revenue of service providers and help them make the best use of a limited budget. Figure 5. Effect of the VM-failover rate on the virtualized App service recovery probability by time t under 1/6_1/6_1/6_1/6_1/6_1/6. Figure 6. Effect of the VM repair rate and VM rejuvenation rate on the virtualized App service recovery probability by time t under 1/6_1/6_1/6_1/6_1/6_1/6.

CONCLUSION AND FUTURE WORK Services are being migrated from local systems to the cloud, and survivability issues must be addressed before the cloud can be used for crucial services. This paper explores CTMC model-based survivability analysis of a cloud service after a service breakdown occurrence. We propose the model and define the survivability metrics. A numerical analysis is carried out to study the impact of the underlying parameters on the system survivability. These results could provide insights on the cost/benefit trade-offs of investment efforts in system recovery strategies. Note that this paper treats the virtualized system restoration process as a CTMC. However, the virtualized system may be time-variant after the service breakdown. That is, the values of the parameters in Table 1 could change and are sometimes unpredictable. How to characterize the time-variant virtualized system restoration process after a service breakdown is part of our future work. Furthermore, the modeling in this paper is an approximation of the real restoration process. For example, we only consider one backup VM in VM-failover and live-migration. However, in practical systems, the number of hosts and VMs in a DC could reach hundreds to thousands depending on its pre-designed configuration. We shall extend the current study so that the model approximates real system behaviors more accurately.

FUNDING The research of Zhi Chen and Xiaolin Chang was supported by the National Natural Science Foundation of China (No. 61572066).
REFERENCES
1 Vallée, G., Naughton, T., Ong, H., Tikotekar, A., Engelmann, C., Bland, W., Aderholdt, F. and Scott, S.L. (2008) Virtual system environments: systems and virtualization management. Stand. New Technol., 18, 72–83.
2 Grottke, M., Matias, R. and Trivedi, K.S. (2008) The Fundamentals of Software Aging. IEEE Int. Conf. Software Reliability Engineering Workshops (ISSRE Wksp), Seattle, WA, USA, 11–14 Nov. 2008, Vol. 44, pp. 1–6. IEEE.
3 Huang, Y., Kintala, C., Kolettis, N. and Fulton, N.D. (1995) Software Rejuvenation: Analysis, Module and Applications. Int. Symp. Fault-Tolerant Computing (FTCS-25), Digest of Papers, Pasadena, CA, USA, 27–30 June 1995, Vol. 137, pp. 381–390. IEEE.
4 Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I. and Warfield, A. (2005) Live Migration of Virtual Machines. Proc. 2nd Conf. Symp. Networked Systems Design & Implementation, Vol. 2, pp. 273–286. USENIX Association, Berkeley, CA, USA.
5 Trivedi, K.S., Andrade, E.C. and Machida, F. (2012) Combining Performance and Availability Analysis in Practice. In Advances in Computers, 84, pp. 1–38. Elsevier, Oxford, UK.
6 Amazon Web Services, Inc. (2017) Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.
7 Ellison, R.J., Fisher, D.A., Linger, R.C., Lipson, H.F., Longstaff, T. and Mead, N.R. (1997) Survivable Network Systems: An Emerging Discipline.
8 Trivedi, K.S. and Xia, R. (2015) Quantification of system survivability. Telecommun. Syst., 60(4), 451–470.
9 Chang, X., Zhang, Z., Li, X. and Trivedi, K.S. (2016) Model-Based Survivability Analysis of a Virtualized System. IEEE 41st Conf. Local Computer Networks (LCN), Dubai, 7–10 Nov. 2016. IEEE.
10 Bolch, G., Greiner, S., de Meer, H. and Trivedi, K.S. (2006) Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, Hoboken, NJ, USA.
11 Nguyen, T.A., Kim, D.S. and Park, J.S. (2016) Availability modeling and analysis of a data center for disaster tolerance. Future Gener. Comput. Syst., 56, 27–50.
12 Heegaard, P.E. and Trivedi, K.S. (2009) Network survivability modeling. Comput. Netw., 53(8), 1215–1234.
13 Liu, Y. and Trivedi, K.S. (2004) A General Framework for Network Survivability Quantification. MMB & PGTS 2004, GI/ITG Conf. Measuring and Evaluation of Computer and Communication Systems, pp. 369–378.
14 Yang, Y., Zhang, Y., Wang, A.H., Yu, M., Zang, W., Liu, P. and Jajodia, S. (2013) Quantitative survivability evaluation of three virtual machine-based server architectures. J. Netw. Comput. Appl., 36(2), 781–790.
15 Zheng, J., Okamura, H. and Dohi, T. (2015) Survivability analysis of VM-based intrusion tolerant systems. IEICE Trans. Inf. Syst., 98(12), 2082–2090.
16 Silva, B., Maciel, P.R.M., Zimmermann, A. and Brilhante, J. (2014) Survivability Evaluation of Disaster Tolerant Cloud Computing Systems. Proc. Probabilistic Safety Assessment & Management Conf.
17 Jacques-Silva, G., Avritzer, A., Menasché, D.S., Koziolek, A., Happe, L. and Suresh, S. (2015) Survivability Modeling to Assess Deployment Alternatives Accounting for Rejuvenation. IEEE Int. Symp. Software Reliability Engineering Workshops (ISSREW), Gaithersburg, MD, USA, 2–5 Nov. 2015. IEEE.
18 Ivanchenko, O., Kharchenko, V., Ponochovny, Y., Blindyuk, I. and Smoktii, O. Semi-Markov Availability Model for Infrastructure as a Service Cloud Considering Hidden Failures of Physical Machines.
19 Silva, B. (2016) A Framework for Availability, Performance and Survivability Evaluation of Disaster Tolerant Cloud Computing Systems.
20 Ghosh, R., Longo, F., Frattini, F., Russo, S. and Trivedi, K.S. (2014) Scalable analytics for IaaS cloud availability. IEEE Trans. Cloud Comput., 2(1), 57–70.
21 Ghosh, R., Trivedi, K.S., Naik, V.K. and Kim, D.S. (2010) End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud: An Interacting Stochastic Models Approach. IEEE 16th Pacific Rim Int. Symp. Dependable Computing (PRDC), Tokyo, Japan, 13–15 Dec. 2010. IEEE.
22 Matos, R., Araujo, J., Oliveira, D., Maciel, P. and Trivedi, K. (2015) Sensitivity analysis of a hierarchical model of mobile cloud computing. Simul. Model. Pract. Theory, 50, 151–164.
23 Xu, X., Lu, Q., Zhu, L., Li, Z., Sakr, S., Wada, H. and Webber, I. (2013) Availability Analysis for Deployment of In-Cloud Applications. Proc. 4th Int. ACM SIGSOFT Symp. Architecting Critical Systems, Vancouver, Canada, 17–21 June 2013. ACM, New York, NY, USA.
24 Lenk, A. and Tai, S. (2014) Cloud Standby: Disaster Recovery of Distributed Systems in the Cloud. Eur. Conf. Service-Oriented and Cloud Computing, Manchester, UK, 2–4 Sep. 2014, pp. 32–46. Springer, Berlin.
25 Technical Report 68 (2001) Enhanced Network Survivability Performance. ANSI T1A1.2 Working Group on Network Survivability Performance.

© The British Computer Society 2017. All rights reserved. For permissions, please email: journals.permissions@oup.com
