Abstract
A mismatch between resource supply and demand in cloud computing leads to inefficient utilization of resources or performance degradation. Therefore, this paper establishes a runtime model to evaluate the available capacity of computing resources on the basis of similar tasks. This model takes advantage of a characteristic of cloud workloads: similar tasks in cloud computing have similar execution logic. The model evaluates the available resource capacity according to task similarity, thus avoiding the resource consumption of existing benchmarks. We apply the model to propose a resource capacity evaluation method called Caipan, which considers numerous factors according to resource type. This method obtains accurate results in a timely manner at little cost. We use the results of Caipan to develop several algorithms that aim to match resource supply and demand and to improve cloud platform performance. We test the Caipan method and the Caipan-based algorithms in both dedicated and real-world cloud environments. The test results show that the Caipan method obtains the available resource capacity accurately and in a timely manner, and effectively supports the optimization of both algorithms and platforms. Moreover, the algorithms based on Caipan reduce the mismatch between resource supply and demand and significantly improve cloud platform performance.

1. INTRODUCTION
Currently, the resource managers and schedulers that are widely used in cloud computing platforms consider many factors, including resource type, hardware architecture and isolation, when allocating resources. However, they always allocate a fixed quantity of resources to the same task, ignoring the differences in resource quality between servers. For example, Hadoop [1] uses a 'slot' as the resource partition unit; a slot represents a fixed quantity of CPU and memory resources. The Yarn platform [2] packages virtual cores and memory into containers and allocates containers to jobs. Neither Hadoop nor Yarn guarantees the quality of the allocated resources. Cgroups can limit the resource usage of processes, but it can neither perceive nor guarantee resource quality. Resource quality and capacity are precisely what Caipan focuses on, so Caipan and cgroups are not in conflict but complementary.

Resource quantity and quality are two aspects of resource availability. Resource quality is very important in resource management for the following reasons. First, resource quality may influence the resource demand of a task; for example, the same task demands different amounts of time on different models of CPU. Second, resource quality may represent a specific amount of a resource; for example, high-quality network resources have greater bandwidth than lower-quality network resources. Third, resource quality changes continually. Such changes are often caused by the way in which workloads utilize the resources, as well as by many other factors, including thermal throttling and the number of running tasks. For example, the workload determines the context-switching frequency of the CPU, which, in turn, affects the quality of the CPU at a given time. Therefore, ignoring resource quality in resource management leads to a mismatch between resource supply and demand.
In addition, the widespread heterogeneity of servers in cloud computing facilities contributes to the diversity of resource quality [3]; the performance of the underlying machines is neither uniform nor constant for many reasons (such as thermal throttling and shared workload interference) [4]; and the workloads in cloud computing are heterogeneous [5, 6]. Together, these factors increase the frequency and degree of mismatch between resource supply and demand. Through a series of experiments, we found that the mismatches result in either a waste or a shortage of resources, and eventually cause inefficient utilization of resources or performance degradation. On the one hand, resource fragmentation occurs when tasks are oversupplied, and resources are utilized inefficiently. On the other hand, resource bottlenecks occur when the resources assigned to tasks are insufficient; this results in resource competition and eventually causes performance degradation. The primary reason for these mismatches is that the resource managers currently used in cloud platforms cannot effectively quantify the available capacity of the resources, including both resource quantity and quality. To resolve this problem, a resource manager needs a method that can evaluate the available capacity of computing resources at runtime.

The existing evaluation methods [7–9] are inapplicable to cloud platforms because of the following problems. On the one hand, the methods that do not use test programs usually establish a mathematical model based on various performance indicators; however, this approach can retrieve only the resource quantity, not the resource quality. On the other hand, the methods that do use test programs cannot obtain the evaluation results until the tests are complete; the cost of this approach is therefore high, and the evaluation results are not available in a timely fashion. To address these problems, this paper proposes a method for evaluating available resource capacity. Although this method, called Caipan, also uses test programs when new servers join the cloud, it primarily relies on the concept of 'similar tasks' (which are ubiquitous in cloud computing) to evaluate the quality of resources at runtime. It provides timely results by comparing similar tasks that are running on different resources. Consequently, the cost of this approach is very low, and Caipan can support both cloud platform and algorithm optimizations.

The main points of this paper can be summarized as follows. We establish a model to evaluate the available capacity of cloud computing resources on the basis of the concept of similar tasks. This model takes advantage of the characteristic that cloud workloads designed for large-scale computing clusters often comprise hundreds of thousands of similar tasks [10, 11]; therefore, the model can use similar tasks instead of test programs. The model evaluates the available resource capacity on the basis of runtime information that it collects from similar tasks running on different resources, as well as other information. We propose a method called Caipan, which is based on the evaluation model described above and evaluates CPU, memory, storage and network resources. We implement a platform-independent service for this method; this evaluation service achieves accurate and timely results at runtime with little cost.
Based on the evaluation results, we design algorithms for estimating task resource requirements, resource allocation, load balancing and abnormality recognition. We implement all the algorithms in the Yarn platform to reduce mismatches and optimize performance. The tests show that the evaluation results are both effective and timely. Algorithms based on Caipan reduce the mismatch between resource supply and demand, and improve resource utilization and performance. Caipan provides good support for algorithm and platform optimization.

The remainder of this paper is organized as follows. In Section 2, we establish an evaluation model that is based on similar tasks. Section 3 proposes a method called Caipan that is based on the previously developed model. Section 4 introduces the architecture and implementation of Caipan. Section 5 discusses the design of algorithms that are based on the results from Caipan. A variety of test results are provided in Section 6. Section 7 describes related work, and Section 8 provides conclusions and outlines ideas for future work.

2. A MODEL FOR EVALUATING AVAILABLE RESOURCE CAPACITY ON THE BASIS OF SIMILAR TASKS
The workloads of cloud computing applications designed for large-scale parallel processing are typically composed of many tasks that use similar execution logic and have similar resource usage patterns. We define similar tasks on the basis of similar execution logic, and we then use these similar tasks rather than test programs to evaluate the available resource capacity. We establish a model for evaluating the available resource capacity that considers similar tasks, hardware information and test information together.

Definition 2.1 Resource quality θ. Suppose that the quantity of free resource R is μ, the total quantity is γ and the current capacity of the occupied portion of R is C. Then the resource quality can be calculated as θ = C/(γ − μ).

Definition 2.2 Capacity of computing resource Cr. Suppose that the quantity of free resource R is μ and the current quality of R is θ. Then the capacity of resource R can be expressed as (μ, θ), and its value can be calculated as Cr = μ × θ.

Definition 2.3 Similar task. Let L represent the execution logic of a task, S represent the task stage and D represent the size of the data that the task processes. The value of L meets the following constraints: there is a positive correlation between the value of L and the complexity of the execution logic of the task, and Li is equal to Lj if and only if the execution logic of task i is identical to the execution logic of task j. If the relationship Li = Lj ∧ Si = Sj ∧ Di = Dj holds between task i and task j, then the two tasks are designated similar tasks.

Theorem 2.4 The quality of a resource at runtime can be measured by the ratio between the indicator of a task and the indicators of similar tasks.

Proof. As shown in (1), when the execution environment (except for resource quality) is the same, there is a mapping relationship f from the execution logic L and the resource quality Q to the performance indicator P, where ρ is the correlation coefficient. Suppose that the execution logic of task i is Li, and that we add some complex code to the execution logic of task i to obtain task j. The complexity of task j is then larger than that of task i, so Lj > Li. When the two tasks run in the same execution environment, the performance indicator of task i is better; thus, P and L move in opposite directions, and we get ρL,P < 0. The task performance indicator increases when the quality of the resource allocated to the task is better, and we get ρQ,P > 0:

  f: L, Q → P,  ρQ,P > 0,  ρL,P < 0   (1)

If task i and task j are similar, they have the same execution logic, and the performance indicator P has a positive correlation with the quality Q. Then, we obtain the relationship shown in the following equation:

  for all similar tasks i and j:  Pi < Pj ⟺ Qi < Qj   (2)
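To make the model concrete, the following minimal Python sketch (our illustration; all names are hypothetical, and this is not the paper's implementation) computes the resource quality θ of Definition 2.1, the available capacity Cr of Definition 2.2 and a relative-quality estimate in the spirit of Theorem 2.4, which compares two resources by the performance indicators of similar tasks running on them.

    # Sketch of Definitions 2.1-2.2 and the ratio idea of Theorem 2.4.
    def resource_quality(total, free, occupied_capacity):
        # Definition 2.1: theta = C / (gamma - mu), where gamma is the total
        # quantity, mu the free quantity and C the capacity of the occupied part.
        used = total - free
        return occupied_capacity / used if used > 0 else 0.0

    def available_capacity(free, quality):
        # Definition 2.2: Cr = mu * theta.
        return free * quality

    def relative_quality(indicators_a, indicators_b):
        # Theorem 2.4 (sketch): compare two resources by the ratio of the
        # average performance indicators of similar tasks running on them;
        # a ratio > 1 suggests resource A currently offers better quality.
        avg_a = sum(indicators_a) / len(indicators_a)
        avg_b = sum(indicators_b) / len(indicators_b)
        return avg_a / avg_b

    # Two servers with the same free CPU share but different quality expose
    # different available capacity -- the mismatch the paper targets.
    print(available_capacity(4.0, 0.9))  # 3.6
    print(available_capacity(4.0, 0.6))  # 2.4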
Algorithm 2 Resource allocation.
Input: S: the set of servers
  T: resource requirements of the task
  C: resource information of a server
  θ: CPU quality
1: for s in S do
2:   if Cidle × MIN(θ/M, 1) > Tc then
3:     if Cm − Mt × MIN(MAX(1 − Cr−m, 0) + Th0, Th1) > Tm then
4:       if Cs > Ts and Cb − Ths × Bs/Cstorage > Tb and Te/Tw > Thw then
5:         if Cn − Thn × Bn/Cnetwork > Tn then
6:           Allocate resources to task;
7:           R ← R \ {T};
8:         end if
9:       end if
10:      end if
11:   end if
12: end for

The algorithm traverses the server set S of a cloud platform to find a suitable server for the task (line 1). Initially, the algorithm determines whether the CPU resource in a given server meets the CPU requirement Tc (line 2); Cidle is the percentage of time that the CPU is idle, and M is the median quality of CPUs of the same model. Because the memory bandwidth used by a task cannot be controlled, the algorithm considers only the size of the available memory, Cm, the reservation ratio, MIN(MAX(1 − Cr−m, 0) + Th0, Th1), and the memory requirement of the task, Tm (line 3); Mt is the total size of the memory resource, and Cr−m is the current memory capacity. The value of the reservation ratio ranges between Th0 and Th1 according to the memory capacity; it decreases when the memory capacity increases, and increases when the memory capacity decreases. The algorithm then checks the disk capacity, Cs, against the storage requirement of the task, Ts, and the available disk bandwidth, Cb, minus the reserved share, Ths × Bs/Cstorage, against the bandwidth requirement of the task, Tb (line 4). If these requirements are met and the disk is in good condition, the storage resources pass the check; Bs is the peak bandwidth (obtained during testing), Cstorage is the storage capacity evaluated by Caipan, and Te and Tw are the average execution time and wait time of an IO request, respectively. Finally, the algorithm determines whether the available network bandwidth, Cn, minus the reserved share, Thn × Bn/Cnetwork, of the peak network bandwidth, Bn, meets the network bandwidth requirement of the task, Tn (line 5). Thn is a threshold, and Cnetwork is the network capacity evaluated by Caipan. The reservation ratio is inversely proportional to the capacity of the network; in other words, the ratio increases to avoid network congestion when the network capacity is poor. After all the conditions have been met, the resource allocation algorithm allocates resources to the task and updates the stored resource information (lines 6 and 7).
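A minimal Python sketch of Algorithm 2's chained feasibility checks follows; the data structures and threshold names are our own illustrative assumptions, not the paper's code.

    # Sketch of Algorithm 2. Each server is a dict of monitored values and
    # each task a dict of requirements; th holds configured thresholds.
    def cpu_ok(s, t):
        # Line 2: idle CPU share, scaled by quality relative to the model median.
        return s["cpu_idle"] * min(s["cpu_quality"] / s["cpu_quality_median"], 1.0) > t["cpu"]

    def memory_ok(s, t, th):
        # Line 3: reserve a share of total memory that shrinks as capacity grows.
        reserve = min(max(1.0 - s["mem_capacity"], 0.0) + th["mem_lo"], th["mem_hi"])
        return s["mem_free"] - s["mem_total"] * reserve > t["mem"]

    def storage_ok(s, t, th):
        # Line 4: free space, bandwidth minus a capacity-scaled reserve, disk health.
        return (s["disk_free"] > t["disk"]
                and s["disk_bw"] - th["disk"] * s["disk_bw_peak"] / s["storage_capacity"] > t["disk_bw"]
                and s["io_exec_time"] / s["io_wait_time"] > th["io"])

    def network_ok(s, t, th):
        # Line 5: bandwidth minus a reserve that grows as network capacity drops.
        return s["net_bw"] - th["net"] * s["net_bw_peak"] / s["net_capacity"] > t["net"]

    def allocate(servers, task, th):
        # Lines 1 and 6: return the first server that passes every check.
        for s in servers:
            if (cpu_ok(s, task) and memory_ok(s, task, th)
                    and storage_ok(s, task, th) and network_ok(s, task, th)):
                return s
        return None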
5.3. The load-balancing algorithm
A balanced workload is an important factor in cloud platform performance. The load-balancing algorithm balances the load by selecting servers for tasks on the basis of the available resource capacity and the resource characteristics of the tasks:

  Rjr = Card(T) × tjr / ∑i∈T tir   (18)

Definition 5.1 The requirement degree of a task for resource r. Suppose that the requirement of task i for resource r is tir. Then the requirement degree of task j for resource r is Rjr, which can be calculated using Formula (18). In Formula (18), T is the set of all tasks and Card(T) is the number of elements in set T.

Definition 5.2 The primary task resource. The resource for which the requirement degree is the largest among all the requirement degrees of a task is defined as the primary resource of that task. The primary resource of a task describes the resource characteristic of the task.

To start, the load-balancing algorithm, Algorithm 3, calculates the requirement degrees for the resources used by a task (line 3) and ascertains the primary resource r of task T (line 4). Then, the algorithm traverses the set of servers, S, and finds a server that is available for allocation. The algorithm filters the servers by the available capacity, Crs, of the primary resource r within the time threshold ThT (lines 9 and 10), and it halts primary-resource filtration if the elapsed time is longer than ThT (line 12). The algorithm uses the function Alloc() to allocate the resources of the selected server to the task; Alloc(T, s) checks the task resource requirements against the resources on server s to ensure that the supply of resources is sufficient to meet the demand. The checking and allocation procedures are the same as those used in resource allocation (Algorithm 2).

Algorithm 3 Load balancing.
Input: T: the task resource requirements
  A: the average resource requirements of all tasks
  X: resource types including CPU, memory, storage and network
1: M ← 0;
2: for x in X do
3:   Rx ← Tx/Ax;
4:   r ← … [the remaining lines of Algorithm 3, which select the resource with the largest Rx as the primary resource, filter the servers by its available capacity Crs within ThT and then call Alloc(T, s), are not recoverable from the extracted text]
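The following short Python sketch (illustrative names; not the paper's implementation) computes the requirement degree of Formula (18) and the primary resource of Definition 5.2.

    # Sketch of Formula (18) and Definition 5.2. Tasks are dicts of requirements.
    def requirement_degree(tasks, j, r):
        # Formula (18): Rjr = Card(T) * tjr / sum over i in T of tir.
        return len(tasks) * tasks[j][r] / sum(t[r] for t in tasks)

    def primary_resource(tasks, j):
        # Definition 5.2: the resource with the largest requirement degree.
        return max(tasks[j], key=lambda r: requirement_degree(tasks, j, r))

    tasks = [
        {"cpu": 4.0, "mem": 2.0, "disk": 1.0},
        {"cpu": 1.0, "mem": 8.0, "disk": 1.0},
        {"cpu": 1.0, "mem": 2.0, "disk": 6.0},
    ]
    print(primary_resource(tasks, 1))  # 'mem': the second task is memory-dominated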
5.4. The abnormal server recognition algorithm
Algorithm 4 recognizes abnormal servers on the basis of the runtime indicators collected by Caipan: it timestamps servers whose indicators cross the corresponding thresholds, moves servers that remain suspect for too long into the abnormal set and stops allocating the resources of suspected or abnormal servers.

Algorithm 4 Abnormal server recognition.
Input: S: set of servers
  G: set of suspected abnormal servers
  H: set of abnormal servers
1: for C in S do
2:   if Cp > Thp then
3:     All timestamps ← 0;
4:     G ← G \ {C};
5:     H ← H \ {C};
6:     Continue;
7:   end if
8:   CIPC < ThIPC ? T0 ← system time : T0 ← 0;
9:   Cpf > Thpf and Cfree < Thfree ? T1 ← system time : T1 ← 0;
10:  Cw > Thw or Cl > Thl ? T2 ← system time : T2 ← 0;
11:  if T0 + T1 + T2 ≠ 0 then
12:    G ← G ∪ {C};
13:  else
14:    G ← G \ {C};
15:    H ← H \ {C};
16:    Continue;
17:  end if
18:  T ← system time;
19:  if (T − T0) × T0 > Tht × T0 or (T − T1) × T1 > Tht × T1 or (T − T2) × T2 > Tht × T2 then
20:    Record abnormal information;
21:    H ← H ∪ {C};
22:  end if
23:  if C ∈ G or C ∈ H then
24:    Stop resource allocation of C;
25:  end if
26: end for

5.5. The stragglers recognition algorithm
In cloud computing, a job is completed only when all tasks that belong to the job are completed. The completion time of a job is directly related to the existence of stragglers; therefore, it is important to avoid such tasks. The recognition algorithm in this section recognizes stragglers quickly and launches re-execute tasks as needed on the basis of information provided by Caipan. In addition, it filters resources for re-execute tasks so that they do not become stragglers themselves. The procedure for recognizing stragglers is shown in Algorithm 5. The algorithm monitors the average progress per second of task T, Ts, and the average IPC, TIPC. When the Ts and TIPC of task T are less than the thresholds Ths and ThIPC, respectively, T0 is recorded as the current system time and task T is added to set G (lines 2 and 3). The task is confirmed as a straggler and added to set H if the time the task has stayed in set G is greater than the threshold Tt (lines 5 and 6). Then, the algorithm determines whether it is necessary to launch a re-execute task (line 7). The algorithm assumes that a re-execute task can be completed in the average completion time TA and that the straggler is currently executing at speed Ts. The straggler has already made some progress, Tp; therefore, the algorithm applies for resources and launches a re-execute task only if doing so will improve the completion time (relative to the straggler) by at least Tht (lines 7 and 8). When a re-execute task is launched, the algorithm selects a server in which the available capacity of the re-execute task's primary resource is larger than a certain threshold; this ensures that the re-execute task itself avoids becoming a straggler.

The threshold Tt can be adjusted to make a tradeoff between the time required to recognize stragglers and accuracy. The algorithm is very sensitive to stragglers when Tt is small, and a small Tt considerably reduces the time required for stragglers to be recognized. However, a small Tt can compromise recognition accuracy because the algorithm tends to mistake slightly sluggish tasks for stragglers. After recognizing a straggler, the algorithm weighs the pros and cons to determine whether to launch a re-execute task; in other words, not all stragglers trigger a re-execute task, which largely compensates for the reduced accuracy. The conditions required to launch a re-execute task can be changed by adjusting the threshold Tht. Moreover, the total resources that re-execute tasks are allowed to occupy and the number of re-execute tasks are also configurable; these settings prevent the unchecked growth of re-execute tasks.

Algorithm 5 Stragglers recognition.
Input: S: set of tasks
  G: set of suspected stragglers
  H: set of stragglers
1: for T in S do
2:   T0 ← (Ts < Ths and TIPC < ThIPC) ? system time : 0;
3:   G ← G ∪ {T};
4:   T ← system time;
5:   if (T − T0) × T0 > Tt × T0 then
6:     H ← H ∪ {T};
7:     if (100% − Tp)/Ts > (100% + Tht) × TA then
8:       Apply resources and run re-execute task T;
9:     end if
10:  end if
11: end for

The efficiency of Algorithm 5 depends on accurate thresholds. Therefore, the algorithm stops working when the thresholds are inaccurate; it resumes after Caipan has gathered sufficient information about the workload and platform.
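Section 5.5's two decisions (confirm a straggler, then decide whether re-execution pays off) can be sketched in a few lines of Python; the field and threshold names below are hypothetical, and this is a sketch of the logic rather than the paper's implementation.

    import time

    # Sketch of Algorithm 5's checks (hypothetical names).
    def update_suspect(task, th_speed, th_ipc):
        # Lines 2-3: timestamp a task whose progress rate and IPC both fall
        # below their thresholds; otherwise clear the suspicion timestamp.
        if task["speed"] < th_speed and task["ipc"] < th_ipc:
            task.setdefault("t0", time.time())
        else:
            task.pop("t0", None)

    def is_straggler(task, t_t):
        # Lines 4-6: confirm once the task has stayed suspect longer than Tt.
        return "t0" in task and time.time() - task["t0"] > t_t

    def worth_reexecuting(task, avg_completion, th_gain):
        # Line 7: the straggler's remaining time, (100% - progress) / speed,
        # must exceed the expected re-execution time by the margin th_gain.
        remaining = (1.0 - task["progress"]) / task["speed"]
        return remaining > (1.0 + th_gain) * avg_completion

    task = {"speed": 0.004, "ipc": 0.3, "progress": 0.35}
    update_suspect(task, th_speed=0.01, th_ipc=0.5)
    if is_straggler(task, t_t=30.0) and worth_reexecuting(task, 60.0, 0.2):
        print("launch a re-execute task on a server with ample primary resource")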
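The single-resource test programs described in Section 6.2.1 can be approximated by very small Python programs of the following shape; these are our hedged stand-ins, since the paper does not publish its actual test programs.

    import os, random, tempfile

    def cpu_stress(iterations=1_000_000):
        # Many complicated calculations on a small amount of data.
        x = 1.0001
        for _ in range(iterations):
            x = (x * x) % 1.7
        return x

    def memory_stress(size_mb=256, touches=1_000_000):
        # Random accesses across a large buffer to exercise memory.
        buf = bytearray(size_mb * 1024 * 1024)
        total = 0
        for _ in range(touches):
            total += buf[random.randrange(len(buf))]
        return total

    def storage_stress(rounds=32, chunk_mb=16):
        # Repeatedly write and read back data on disk.
        chunk = os.urandom(chunk_mb * 1024 * 1024)
        with tempfile.TemporaryFile() as f:
            for _ in range(rounds):
                f.seek(0); f.write(chunk)
                f.seek(0); f.read()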
In the other experiments, we used all the servers of the cluster to deploy the cloud platform Yarn. The resource manager of the Yarn platform and the evaluation service of Caipan were deployed on the type A servers.

Table 1. Cluster server configurations.

            Type A                         Type B                         Type C
  Number    4                              4                              22
  CPU       2×Intel Xeon E5430 @2.66 GHz   2×Intel Xeon E5310 @1.66 GHz   2×Intel Xeon E5410 @2.33 GHz
  Cores     8                              8                              8
  Memory    16 GB                          8 GB                           8 GB
  Storage   2×146 GB Ultra320 SCSI         146 GB Ultra320 SCSI           146 GB Ultra320 SCSI
  Network   1×100 M and 1×1000 M           1×100 M and 1×1000 M           1×100 M and 1×1000 M

The version of Yarn deployed here was 2.6.0, and the version of the JDK was 1.7.0_45. We configured numerous frameworks on the platform, including MapReduce, Spark and Storm. The workload consisted of many applications, including word count, global searches of regular expressions, page rank, sort, Monte Carlo sampling, inverted index and log query. The workload scale indicates the size of the data being processed by the workload; one unit of scale is defined as the data size processed once by all servers when the number of tasks is equal to the total number of computing cores. The data came from Wikipedia and from random generation, and the largest data size was 300 GB. The data sizes processed by the jobs were similar to the job sizes at Facebook [15], and the times at which the jobs were submitted followed a Poisson distribution. The experiments were performed both in a dedicated environment and in a typical cloud environment. In the dedicated environment, the cluster ran only the Yarn workload; in the cloud environment, the cluster ran both the Yarn workload and scientific computing applications, which is similar to the usual real-world state of a cluster. Each experiment was repeated three times, and the average was taken as the final result.
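As a worked reading of this scale definition (our inference from the figures reported later in this section, not an explicit statement in the paper): the cluster has 30 servers × 8 cores = 240 cores, so one unit of scale corresponds to one wave of 240 tasks. Sections 6.3 and 6.4 report 240 GB at scale 16 and 300 GB at scale 20, and 240 GB/16 = 300 GB/20 = 15 GB per scale unit, i.e. 15 GB/240 ≈ 64 MB per task, which matches a typical HDFS block size.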
6.2. Experiments with Caipan
All the experimental results were recorded when Caipan was running stably.

6.2.1. Validity test
The quantity of free resources was collected from system information, and the hardware and test information concerning the resources remained constant at runtime. Therefore, the validity of the available resource capacity evaluated by Caipan depends on the validity of the resource quality calculations. Consequently, this experiment showed the validity of Caipan's results regarding available resource capacity from two aspects: first, the resource qualities evaluated by Caipan reflect the actual status of the resources; second, the relationships between the resource qualities evaluated by Caipan are consistent with the relationships between the average completion times of tasks running on the resources. We can conclude that Caipan's evaluation results reflect the true resource capacity effectively when the evaluated resource quality changes in accordance with the actual status of the resources and when the results are confirmed by the average completion times of tasks running on the resources. The tests were performed by resource type. Moreover, the experiments used different programs that each have a different primary resource and rarely need the others: the CPU test program performs many complicated calculations on a small amount of data, the memory test program accesses memory randomly and the storage test program repeatedly reads and writes data on the disk. Because the network evaluation considers only free bandwidth, there is no need to prove its validity.

At the beginning of the experiment in which resource quality was affected by workload, we submitted the same workloads to every working node. Then, we increased the workload scales of the type A servers, which have better performance. The resource quality values that changed with the workload scale are shown in Fig. 5. Lines A1 and A2 are the resource quality values of the two type A servers, while lines B1 and B2 are the resource quality values of the two type B servers. On the left side of each image, the values of A1 and A2 are larger than the values of B1 and B2; that is, the resource quality of the type A servers is better than that of the type B servers when the workload scales are the same. This is consistent with the hardware configuration shown in Table 1. Then, in the middle of each image, lines A1 and A2 begin to fall, while lines B1 and B2 gradually rise as the workload scales of the type A servers increase. The higher workloads of the type A servers lead to a fall in their resource quality and, thus, a change in the median of the quality values, which in turn changes the evaluated resource quality of the type B servers. The resource quality values of the type B servers become larger than those of the type A servers as the workload scales of the type A servers continue to increase; at some point, the resources of the type B servers perform better than those of the type A servers. At the right side of each image, the resource quality values become steady when the workload scales stop changing. From the graphs in Fig. 5, we can conclude that the assessed resource quality changes in accordance with the true resource status.

Figure 5. Resource quality values change with workload.

In the experiment relating resource quality to average completion time, we submitted the same workload to every working node. The resource qualities and the normalized average task completion times are shown in Figs. 6 and 7.

Figure 6. Runtime qualities of resources used by tasks.

In Fig. 6, the lines A1, A2, B1 and B2 have the same meanings as the respective lines in Fig. 5. As can be seen in Fig. 6, the resource qualities of servers of the same type were roughly stable because the program type of each workload was monotonous and the workload scales were identical. The slight fluctuations in quality could have various causes, such as changes in performance, the median heap and the sets of similar tasks.
The three subgraphs of Fig. 6 show that the fluctuation in the quality of storage is larger than that in the quality of memory, which in turn is larger than that in the quality of CPU. A possible explanation is that the fluctuations are related to the number of factors considered during the evaluation; that is, the fluctuations are small when the evaluation is comprehensive. The columns A′1, A′2, B′1 and B′2 in Fig. 7 are the normalized average completion times of tasks running on the resources whose qualities are shown as A1, A2, B1 and B2 in Fig. 6. Based on the results shown in Figs. 6 and 7, we can draw the following conclusions. First, the resource qualities of the type A servers are still larger than those of the type B servers, and the resource qualities of servers of the same type are roughly stable. Second, the average completion times of tasks run on the type A servers are better than those of tasks run on the type B servers, and the average completion times on servers of the same type are approximately equal. Finally, the average completion time of tasks is consistent with the quality of the resources used to run the tasks. Consequently, the available resource capacities evaluated by Caipan are valid.

Figure 7. Normalized average completion times for tasks using the resources.

6.2.2. Timeliness test
We added a network program to the test program set. This experiment measures the efficiency of Caipan on the basis of the time consumed during evaluation. We recorded timestamps when runtime information was collected and when evaluations were completed, and we then calculated the evaluation time by subtracting the collection timestamp from the completion timestamp. The evaluation times of Caipan with information collection cycles of 50, 100, 200, 400 and 800 ms are shown in Fig. 8. As can be seen in Fig. 8, the evaluation times for resources of the same type increased slightly as the collection cycle decreased from 800 ms to 50 ms. This slight increase occurs because Caipan evaluates the available resource capacity more frequently, so the overall cost of collecting, processing and evaluating is greater when the collection cycle is short. The evaluation times for storage resources are roughly consistent with those for network resources in the same cycle, while the evaluation times for CPU and memory resources are significantly longer than those for storage and network resources. In addition, the fluctuations in the evaluation times are more extreme for the CPU and memory resources than for the storage and network resources. The reason is that the evaluations of CPU and memory resources must process and access more data than the evaluations of storage and network resources.

Figure 8. Evaluation times of Caipan for different information collection cycles.

The average evaluation time for CPU resources is 13.65 ms, the average evaluation time for memory resources is 10.23 ms and the average evaluation times for storage and network resources are both less than 6 ms. Therefore, the evaluations of available resource capacity calculated by Caipan are timely, and the Caipan method is fast.
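The timeliness measurement itself reduces to simple timestamp bookkeeping; a minimal sketch (evaluate() is a stand-in for Caipan's evaluation, not its real API) is:

    import time

    def timed_evaluation(evaluate, collected_info):
        # Record a timestamp when runtime information is collected and another
        # when the evaluation completes; their difference is the evaluation time.
        t_collect = time.monotonic()
        result = evaluate(collected_info)
        t_done = time.monotonic()
        return result, (t_done - t_collect) * 1000.0  # milliseconds

    result, ms = timed_evaluation(lambda info: sum(info) / len(info), [0.9, 0.8])
    print(f"evaluation took {ms:.3f} ms")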
6.2.3. Scalability test
In the scalability test of Caipan, the collection cycle was set to 6000 ms. The collection service sent the information many times per cycle to simulate different cluster scales. The simulated clusters consisted of 600, 1200, 1800, 2400 and 3000 nodes. We recorded the time delay between collecting the information and obtaining the evaluation results. The elapsed times when evaluating different cluster scales are shown in Fig. 9. The figure shows that each set of evaluation times, and their fluctuations, are larger than those of the previous test; in other words, the times and fluctuations increase with the cluster scale. The most likely reason is that increasing the cluster scale increases the costs of network transmission, processing the collected information, maintaining data structures and recognizing similar tasks. Again, the evaluation times for CPU and memory resources are larger than those for storage and network resources, because the evaluation of CPU and memory resources uses the median heap more frequently, and the cost of maintaining the similar-task set is larger, than for storage and network resources.

Figure 9. Evaluation times at different cluster scales.

When the scale of a simulated cluster reaches 3000 nodes, the average evaluation times of the CPU, memory, storage and network resources are 25.15 ms, 17.55 ms, 8.43 ms and 6.4 ms, respectively, showing that Caipan achieves timely results even for a large-scale cluster. In addition, this experiment indicates that Caipan has good scalability and is suitable for application to large-scale cloud platforms.
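The 'median heap' referred to above (and used as the median CPU quality M in Algorithm 2) is not detailed in the text; a standard two-heap running median, which fits the usage described here, can be sketched as follows (our reconstruction, not necessarily Caipan's implementation):

    import heapq

    class MedianHeap:
        # Classic two-heap running median: a max-heap (stored negated) holds
        # the lower half of the samples and a min-heap holds the upper half.
        def __init__(self):
            self.lo = []  # max-heap via negation
            self.hi = []  # min-heap

        def add(self, x):
            if self.lo and x > -self.lo[0]:
                heapq.heappush(self.hi, x)
            else:
                heapq.heappush(self.lo, -x)
            # Rebalance so that the halves differ in size by at most one.
            if len(self.lo) > len(self.hi) + 1:
                heapq.heappush(self.hi, -heapq.heappop(self.lo))
            elif len(self.hi) > len(self.lo) + 1:
                heapq.heappush(self.lo, -heapq.heappop(self.hi))

        def median(self):
            if len(self.lo) > len(self.hi):
                return -self.lo[0]
            if len(self.hi) > len(self.lo):
                return self.hi[0]
            return (-self.lo[0] + self.hi[0]) / 2.0

    m = MedianHeap()
    for q in [0.92, 0.81, 0.88, 0.60, 0.95]:
        m.add(q)
    print(m.median())  # 0.88: the median quality of the observed CPUs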
6.3. Test of the resource allocation algorithm
The workload scales of the resource allocation test ranged from 2 to 16 in increments of 0.5; the largest data size processed in this test was 240 GB, at a workload scale of 16. This experiment tested both the Fair allocation algorithm [15] and the Caipan-based resource allocation algorithm, in both the dedicated environment and the cloud environment. We recorded the completion times of the workloads at different scales; a shorter completion time means better performance at the same workload scale. The completion times for different workload scales in the dedicated environment and the cloud environment are shown in Figs. 10 and 11, respectively.

Figure 10. Completion times of resource allocation algorithms in the dedicated environment.

As shown in Fig. 10, the completion times of the Caipan-based allocation algorithm were shorter than those of the Fair allocation algorithm at the same workload scale. In addition, the gap between the completion times of the two algorithms became increasingly obvious as the workload scale increased. The Caipan-based resource allocation algorithm improved the completion times of workloads by approximately 42% compared with the Fair allocation algorithm when the workload scale was larger than 14; the largest improvement was 42.93%. In the dedicated environment, the improvements are attributable largely to reduced resource fragmentation and the avoidance of resource bottlenecks: the Caipan-based algorithm allocates resources on the basis of the resource requirements of tasks and the available resource capacities as evaluated by Caipan, which reduces the mismatch between resource supply and demand.

Figure 11 shows the completion times of the algorithms in the cloud environment. The completion times increased sharply compared with the results from the dedicated environment at the same workload scale. This performance degradation is caused by the other applications in the environment, which share and compete with the test workload for resources. As shown in Fig. 11, the improvement in completion time became more significant as the workload scale increased, a result that is consistent with the results from the dedicated environment. The difference is that the Caipan-based resource allocation algorithm performed even better here, reducing the workload completion times by approximately 55% when the workload scale was larger than 14. The reason is that the Fair allocation algorithm neither perceives the changes in available resource capacity caused by other applications nor adjusts its allocations accordingly; this can cause resources to be wasted or to become bottlenecks, and the end result is a sharp increase in workload completion times. In contrast, the Caipan-based resource allocation algorithm adapts to cloud environments in which many other applications are running, because it adjusts its resource allocations according to changes in the available resource capacity.

Figure 11. Completion times of resource allocation algorithms in the cloud environment.

6.4. Test of the load-balancing algorithm
The load-balancing experiments were performed in both the dedicated environment and the cloud environment. We submitted a workload with a scale of 20 for this test and recorded the usage information of the CPU, memory, storage and network resources when the servers in the cloud platform were running at full load. We calculated the average and the standard deviation of the sampled utilizations as the test results. In this experiment, we used the percentage of time that the CPU was occupied (busy) as the measure of CPU resource utilization and the percentage of occupied memory as the measure of memory resource utilization. In addition, we used the ratio between the current disk IO speed and the actual peak IO speed as the measure of storage resource utilization, and the ratio between the current network transmission speed and the actual peak bandwidth as the measure of network resource utilization.
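These utilization metrics can be sketched as follows (illustrative names and values; the paper's instrumentation is not published):

    import statistics

    def storage_utilization(io_speed, peak_io_speed):
        # Current disk IO speed divided by the measured peak IO speed.
        return io_speed / peak_io_speed

    def network_utilization(tx_speed, peak_bandwidth):
        # Current transmission speed divided by the actual peak bandwidth.
        return tx_speed / peak_bandwidth

    def summarize(samples):
        # Average and standard deviation of the sampled utilizations, as
        # reported in Tables 2 and 3; a smaller deviation means a more
        # uniform load across servers.
        return statistics.mean(samples), statistics.pstdev(samples)

    cpu_samples = [0.72, 0.68, 0.75, 0.70]   # CPU busy-time fractions
    print(summarize(cpu_samples))
    print(storage_utilization(90.0, 200.0))  # e.g. current vs. peak MB/s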
The average utilization percentages and the standard deviations obtained by the two tested load-balancing algorithms in the dedicated environment are shown in Fig. 12; the columns show the average utilization percentages, and the length of each error bar is twice the standard deviation. The details of the improvement achieved in the dedicated environment by the Caipan-based load-balancing algorithm compared with the Fair allocation algorithm are shown in Table 2. The results show that the Caipan-based load-balancing algorithm significantly increases the average resource utilization of the servers and significantly decreases the standard deviations of the resource utilization. This means that the resource usage of the servers in the cloud platform tends to be more uniform, which indicates that the Caipan-based load-balancing algorithm achieves good results.

Figure 12. Resource utilization and standard deviations in the dedicated environment.

Table 2. Details of the improvements achieved by the Caipan-based algorithm compared with the Fair allocation algorithm in the dedicated environment.

        CPU (%)   Memory (%)   Storage (%)   Network (%)
  Pi    18.6      8.89         14.91         14.15
  Pr    45.4      30.5         31.37         32.06

Pi: percentage increase in average resource utilization. Pr: percentage reduction of standard deviations.

The average utilization percentages and the standard deviations obtained by the two load-balancing algorithms in the cloud environment are shown in Fig. 13, and the details of the improvement achieved in the cloud environment by the Caipan-based load-balancing algorithm are shown in Table 3. The results show that the Caipan-based load-balancing algorithm also achieves good results in the cloud environment: the average resource utilization increases, and the differences between the resource utilizations of the servers in the cloud platform decrease. Consequently, the workload is distributed more uniformly among the servers.

Figure 13. Resource utilization and standard deviations in the cloud environment.

Table 3. Details of the improvements achieved by the Caipan-based algorithm compared with the Fair allocation algorithm in the cloud environment.

        CPU (%)   Memory (%)   Storage (%)   Network (%)
  Pi    15.63     5.67         8.18          5.16
  Pr    52.65     24.05        28.94         34.44

Pi: percentage increase in average resource utilization. Pr: percentage reduction of standard deviations.

Comparing the resource utilization in the two test environments, it is clear that the utilization rates in the cloud environment are higher than those in the dedicated environment.
Moreover, the differences in utilization rate between the servers in the cloud environment are even larger than the differences in the dedicated environment. This occurs because scientific computing applications share the computing resources in the cloud environment. On the one hand, the workload competes for limited resources on some servers and, consequently, the utilization of resources increases on those servers; on the other hand, the utilization of other servers does not change, which results in an increase in the standard deviations of the resource utilization percentages. However, the increase in resource utilization caused by this situation does not imply an improvement in performance; on the contrary, intense competition for resources leads to resource bottlenecks and eventually results in performance degradation. Comparing the results of the Caipan-based load-balancing algorithm in Figs. 12 and 13, we can see that the resource utilization percentages remain almost unchanged and that their standard deviations increase only slightly in the cloud environment compared with the dedicated environment. This indicates that the Caipan-based load-balancing algorithm is well adapted to coexisting with other applications in the cloud environment and adjusts its resource allocation appropriately. Comparing Table 2 with Table 3, the principal findings concerning the Caipan-based load-balancing algorithm are as follows. On the one hand, the increases in average resource utilization in the cloud environment are smaller than those in the dedicated environment; this occurs because the scientific computing applications running in the cloud environment already raise resource utilization and thus limit the room for improvement. On the other hand, the reductions in the standard deviations achieved by the Caipan-based load-balancing algorithm are similar in the two test environments. Again, this result highlights the effectiveness of the Caipan-based load-balancing algorithm.

6.5. Test of the abnormal server recognition algorithm
The experiment to test the Caipan-based abnormal server recognition algorithm was performed only in the dedicated environment. We submitted a workload with a scale of 20 and randomly created numerous resource abnormalities in the servers. In addition, we canceled an abnormality when it was recognized, and we recorded the time required for recognition to measure the efficiency of the algorithm. The times required for the recognition of abnormal servers are shown in Fig. 14. The algorithm recognized 96.25% of the created abnormalities, and the underlying reasons for the abnormalities were recorded. As can be seen in Fig. 14, the time required for abnormality recognition decreased gradually to approximately 50 s as the running time of the system increased, because the time required for recognition is related to the accuracy of the thresholds used in the algorithm. The evaluation results obtained by Caipan tend to become more accurate as the system continues to run; this means that the thresholds also become more accurate, which finally appears as a reduction in the time required for recognition.

Figure 14. Time required for abnormal server recognition.
From the test results presented in Fig. 14, we can conclude that the Caipan-based abnormal server recognition algorithm recognizes server abnormalities rapidly. It then records the reason for the abnormality and stops allocating the resources of the abnormal server. This avoids the performance degradation caused by server abnormalities, which improves the stability of cloud platforms.

6.6. Test of the stragglers recognition algorithm
The experiments to test the Caipan-based stragglers recognition algorithm were performed in the dedicated environment. We interfered with tasks randomly to create stragglers in the submitted workload, which had a scale of 20. The tasks that we caused to be stragglers amounted to 5% of the total number of tasks [16], and the stragglers were caused only by problems related to resources. We recorded the time required to recognize the stragglers and the completion time of the workload as the test results. The straggler recognition times are shown in Fig. 15. All the stragglers were recognized in the experiment. Moreover, the recognition time decreased steadily to approximately 35 s; this is because the recognition time is related to the thresholds in the algorithm, and these thresholds tend to become more accurate as an increasing amount of execution information is obtained.

Figure 15. Times required for stragglers recognition.

In a cloud platform, recognizing stragglers quickly and taking effective measures are important ways to keep workload completion times from slipping. The completion times of the workload under the different algorithms are shown in Fig. 16. From the columns in the figure, we can see that the completion time of the workload increased by 13.44% in the presence of stragglers, compared with their absence, when the cloud platform used the Caipan-based stragglers recognition algorithm. However, the completion time increased by 18.93% and 35.05% when the cloud platform used the dynamic threshold detection algorithm [17] and the LATE algorithm [18], respectively. Therefore, the Caipan-based stragglers recognition algorithm improved performance by 21.61% compared with the LATE algorithm, and by 5.49% compared with the dynamic threshold detection algorithm.

Figure 16. Completion times of a workload affected by stragglers.

The Caipan-based stragglers recognition algorithm is more sensitive to the progress of tasks than the other two algorithms; it compromises on recognition accuracy to prioritize finding stragglers quickly. The algorithm decides whether to launch re-execute tasks to replace stragglers on the basis of information about resources and tasks, and this approach helps to limit the number of re-execute tasks launched. Therefore, the reduced recognition accuracy does not have a visible impact on performance.

6.7. Test of the cloud platform
In the following experiments, we used all of the Caipan-based algorithms discussed above in the Yarn cloud platform. As one baseline, we used the Fair allocation algorithm as the resource scheduling algorithm and the LATE algorithm as the stragglers recognition algorithm in the Yarn platform.
In the other baseline, the dynamic threshold detection algorithm was substituted for the LATE algorithm as the stragglers recognition algorithm. The experiments were performed with workloads at different scales in both the dedicated environment and the cloud environment, and we recorded the completion times of the workloads as the test results. We validated the optimization achieved by the Caipan-based algorithms to show how Caipan offers optimization support. The completion times of workloads at different scales using the Caipan-based optimized Yarn platform and the other two Yarn platforms in the dedicated environment are shown in Fig. 17. As shown in Fig. 17, the completion time of the Caipan-based optimized Yarn platform was better than that of the other two platforms under the same workload. Moreover, the improvement increased with the workload scale, because the results evaluated by Caipan tend to become more accurate, and its algorithms more effective, as the workload scale grows. The improvement in workload completion time stabilized at approximately 62% compared with the original platform when the workload scale was greater than 14.5, and at approximately 52% compared with the Yarn platform using the dynamic threshold detection algorithm when the workload scale was greater than 15.

Figure 17. Completion times of workloads at different scales in the dedicated environment.

The completion times of workloads at different scales using the Caipan-based optimized Yarn platform and the other two Yarn platforms in the cloud environment are shown in Fig. 18. Consistent with the results in the dedicated environment, the performance of the Caipan-based optimized Yarn platform was better than that of the other platforms, and the improvement increased with the workload scale. The improvement in workload completion time stabilized at approximately 64% compared with the original platform when the workload scale was above 14, and at approximately 53.5% compared with the Yarn platform using the dynamic threshold detection algorithm when the workload scale was greater than 15. The difference between the two test environments is that, at a workload scale of 2, the Caipan-based optimized Yarn platform improved performance by 18.71% and 25.17% compared with the other two platforms in the cloud environment; this is far better than the results in the dedicated environment at the same scale, where performance degraded by 1% against one platform and improved by only 0.54% against the other. Two factors contribute to this situation: first, small-scale workloads require less time to finish; second, the other applications in the cloud environment influence the workload of the Yarn platform. Together, these factors explain why the completion times of small workloads increase sharply in the cloud environment, which in turn increases the room for optimization.

Figure 18. Completion times of workloads at different scales in the cloud environment.
Based on the test results in Figs. 17 and 18, we can conclude that the Caipan-based algorithms significantly improve the performance of the cloud platform in both the dedicated environment and the cloud environment.

7. RELATED WORK
Many studies [3, 5, 6] have analyzed trace files from Yahoo!, Google and Facebook in recent years. They have noted that both the computing resource hardware of servers and the resource characteristics of workloads pose heterogeneity challenges. Heterogeneous hardware exacerbates the differences in resource quality between the servers of a cloud platform and leads to mismatches between resource supply and demand. On the one hand, such mismatches may result in wasted resources and thus limit the performance of a cloud platform; on the other hand, they may force the workload to compete for limited resources, which causes performance degradation.

Hadoop [1] is an open-source distributed cloud platform maintained by the Apache foundation. It implements the MapReduce programming model and can be used on commodity clusters to provide distributed computing services. Yarn [2] is a cloud platform that evolved from Hadoop; it uses a two-level scheduler that separates resource management from task management, which allows it to support diverse workloads. Mesos [19] is a resource management platform for cloud computing proposed by the AMP laboratory at the University of California, Berkeley; it uses a two-level scheduler similar to that of the Yarn platform to support diverse workloads. Many companies, including Apple, eBay, Twitter, Amazon, Yahoo! and IBM, use Hadoop, Yarn and Mesos to build their cloud computing platforms or centers [20, 21]; however, these platforms have not resolved the problem of mismatch between resource supply and demand. Facebook designed a management platform named Corona [22] to manage MapReduce jobs; Corona pushes messages between servers to improve the stability of the platform. Fuxi [23] is a resource and job management system implemented by Alibaba; it has outstanding stability, is good at fault tolerance and is currently used to manage hundreds of thousands of concurrent tasks in a large cloud platform with thousands of nodes. Borg [24] is the resource management system that Google has used to manage its large cloud platform for more than 10 years. Omega [25] is another resource management system implemented by Google; it increases the concurrency of resource allocation using parallelism, shared state and lock-free optimistic concurrency control. However, the resource management systems and platforms implemented by these companies still experience the same problem of mismatch between resource supply and demand. The root cause of such mismatches is the inability of the systems or platforms to understand the runtime-available capacities of computing resources and the requirements of workloads.

The mismatch between resource supply and demand is one of the direct causes of individual tasks becoming stragglers, which eventually affects job completion times [4, 16]. Stragglers can arise for many reasons [16], but this paper focuses only on stragglers caused by problems related to resources. The LATE algorithm [18] assumes that tasks run at a constant speed and judges whether a task is lagging behind by calculating the remaining execution time of the task; it launches re-execute tasks to replace stragglers to ensure job completion.
Mantri [26] analyzes runtime reports of task progress to discover stragglers and takes measures such as restarting tasks and network-aware task allocation to mitigate their effects. Ouyang et al. [17] designed an algorithm that calculates a threshold dynamically to recognize stragglers. All of these algorithms can mitigate the effects of the mismatch between resource supply and demand; however, none of them can resolve the mismatch itself. An efficient way to resolve the mismatch is to evaluate the available capacity of computing resources and assess the resource requirements of tasks at runtime.

Many studies have evaluated resource capacity. Bruneo [7] designed an analytical model based on stochastic reward nets that can be used to analyze cloud platform behavior, including utilization, availability and wait time. Hwang et al. [8] built a performance evaluation model based on the results of benchmarks such as CloudSuite, HiBench, BenchClouds and TPC-W, which measure the efficiency, elasticity and QoS of cloud centers. Wang et al. [9] designed an analytical model that evaluates the performance of heterogeneous virtual machines by applying a continuous-time Markov chain. These studies paid attention only to the peak performance of cloud platforms. In addition, their results lack timeliness and are therefore not applicable to real-world resource management platforms, which require timely information about available resource capacity. This paper applies the observation that many similar tasks exist in cloud workloads to establish an evaluation model. The model uses similar tasks instead of test programs, and it combines the runtime information of similar tasks, resource hardware information and test results to evaluate the available capacity of resources at runtime. Based on this model, we proposed a resource evaluation method, named Caipan, that classifies resources by type and evaluates the available capacity of each type of computing resource. We applied the evaluation results to design various algorithms. These Caipan-based algorithms increased the degree of matching between resource supply and demand, and improved both the performance and the stability of cloud platforms.

Some previous studies have applied similar principles. Apollo [10] analyzed the availability of resources according to task completion times, which it estimated using similar tasks, and proposed a coordinated scheduling framework based on resource availability. In this paper, we use the runtime information of similar tasks to compare resource capacities and to estimate task resource requirements; Apollo did not apply this principle to evaluate resource capacity. Moreover, the Caipan-based algorithm that assesses task resource requirements also considers additional factors, such as the differences between resources. Google proposed CPI2 [11], which applies a similar principle to find anomalous behaviors; however, it considers only the indicator of cycles per instruction and does not evaluate the available resource capacity. Compared with the definitions of similar tasks in previous studies, the definition in this paper considers additional factors such as execution logic, task stage and input data size.
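As a rough illustration of this similar-task principle (a sketch under assumed data structures, not the paper's exact formulation), the code below groups finished tasks by execution logic, task stage and an input-size bucket, then compares runtimes across servers to rank their relative available capacity; the record fields, the 64 MiB bucket and the averaging are all illustrative assumptions.

```python
# A rough sketch of ranking servers' relative available capacity via
# similar tasks. The grouping key (execution logic, stage, input-size
# bucket) follows this paper's definition of similar tasks; everything
# else is an illustrative assumption, including positive runtimes.
from collections import defaultdict
from statistics import mean

def similarity_key(task):
    # Tasks are treated as similar when they share execution logic and
    # stage and process inputs of comparable size (bucketed to 64 MiB).
    return (task["logic"], task["stage"], task["input_bytes"] // (64 << 20))

def relative_capacity(finished_tasks):
    # Returns a per-server score; higher means more available capacity.
    groups = defaultdict(list)
    for t in finished_tasks:
        groups[similarity_key(t)].append(t)

    speeds = defaultdict(list)
    for tasks in groups.values():
        if len({t["server"] for t in tasks}) < 2:
            continue  # need the same kind of task on at least two servers
        baseline = mean(t["runtime_s"] for t in tasks)
        for t in tasks:
            # A server that runs similar tasks twice as fast as the group
            # average scores about 2.0 for this group.
            speeds[t["server"]].append(baseline / t["runtime_s"])

    return {server: mean(s) for server, s in speeds.items()}
```

Because every batch of similar tasks refreshes such scores as a side effect of normal execution, a scheduler consuming them can track changes in resource quality without running extra benchmarks, which is the timeliness-and-cost argument made above.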
Caipan uses information obtained from hardware performance counters, which previous studies have used to analyze performance and schedule tasks. For example, Zhang et al. [27] scheduled CPU resources according to the resource characteristics of tasks, measured using hardware performance counters; however, they did not use that information to evaluate resource capacities at runtime. GWP [28] is a tool for analyzing cloud center performance. It applies information from hardware performance counters to find hotspots in code and to monitor the running state of cloud centers, but it does not produce resource evaluations that can be used in resource management.

Caipan can be applied in many open-source cloud platforms, including Hadoop, Yarn and Mesos, to provide evaluation results at runtime and support the optimization of algorithms and platforms. For example, the algorithm in [17] could use the information provided by Caipan to identify stragglers and place task replicas more efficiently. As another example, DPPACS [29] adopts a data-placement strategy and schedules tasks within a Hadoop cloud infrastructure based on data block availability; we believe its scheduling efficiency could be improved if, in addition to knowledge of data blocks, it took the available resource capacity evaluated by Caipan into account. Google described the lessons learned in moving from Borg to Kubernetes [30], noting that application monitoring and introspection can be dramatically improved by tying collected information to applications rather than to machines. We reached a similar conclusion during the algorithm tests in this paper.

8. CONCLUSION

This paper proposes a new approach to resolving the mismatch between resource supply and demand in cloud computing. First, we applied the concept of similar tasks instead of test programs and considered several factors together to establish a model that evaluates the available resource capacity at runtime. Second, we detailed the evaluation factors for different types of resources and implemented a classified evaluation method named Caipan. Finally, we designed algorithms based on Caipan's evaluation results to resolve the mismatch between resource supply and demand and to optimize cloud platforms. We performed experiments in both a dedicated environment and a cloud environment. The test results showed that Caipan obtains accurate results quickly, scales well and provides strong support for optimizing algorithms and cloud platforms. In addition, the Caipan-based algorithms improved the degree of matching between resource supply and demand and significantly improved the performance of cloud platforms. In future work, we plan to apply Caipan to more cloud platforms to provide better resource evaluation services. We will also concentrate on enriching the evaluation factors and improving Caipan's efficiency.

FUNDING

This work was supported by the National Key Research and Development Program [No. 2016YFB0200902 to X. Zhang] and the National Natural Science Foundation of China [No. 61572394 to X. Dong].

REFERENCES

1 Apache Hadoop, http://hadoop.apache.org/.

2 Vavilapalli, V.K. et al. (2013) Apache Hadoop YARN: yet another resource negotiator. Proc. 4th Annual Symposium on Cloud Computing (SoCC '13), New York, NY, USA, October, pp. 1–16. ACM.
3 Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H. and Kozuch, M.A. (2012) Heterogeneity and dynamicity of clouds at scale: Google trace analysis. Proc. 3rd ACM Symposium on Cloud Computing (SoCC '12), New York, NY, USA, October, pp. 1–13. ACM.

4 Dean, J. and Barroso, L.A. (2013) The tail at scale. Commun. ACM, 56, 74–80.

5 Mishra, A.K., Hellerstein, J.L., Cirne, W. and Das, C.R. (2010) Towards characterizing cloud backend workloads: insights from Google compute clusters. SIGMETRICS Perform. Eval. Rev., 37, 34–41.

6 Chen, Y., Alspaugh, S. and Katz, R.H. (2012) Design insights for MapReduce from diverse production workloads. Technical Report UCB/EECS-2012-17, EECS Department, University of California, Berkeley, Berkeley, CA, USA.

7 Bruneo, D. (2014) A stochastic model to investigate data center performance and QoS in IaaS cloud computing systems. IEEE Trans. Parallel Distrib. Syst., 25, 560–569.

8 Hwang, K., Bai, X., Shi, Y., Li, M., Chen, W.G. and Wu, Y. (2016) Cloud performance modeling with benchmark evaluation of elastic scaling strategies. IEEE Trans. Parallel Distrib. Syst., 27, 130–143.

9 Wang, B., Chang, X. and Liu, J. (2015) Modeling heterogeneous virtual machines on IaaS data centers. IEEE Commun. Lett., 19, 537–540.

10 Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J., Qian, Z., Wu, M. and Zhou, L. (2014) Apollo: scalable and coordinated scheduling for cloud-scale computing. Proc. 11th USENIX Conf. Operating Systems Design and Implementation (OSDI '14), Berkeley, CA, USA, October, pp. 285–300. USENIX Association.

11 Zhang, X., Tune, E., Hagmann, R., Jnagal, R., Gokhale, V. and Wilkes, J. (2013) CPI2: CPU performance isolation for shared compute clusters. Proc. 8th ACM Eur. Conf. Computer Systems (EuroSys '13), New York, NY, USA, April, pp. 379–391. ACM.

12 Curnow, H.J. and Wichmann, B.A. (1976) A synthetic benchmark. Comput. J., 19, 43–49.

13 Dongarra, J.J. (1985) Performance of various computers using standard linear equations software in a Fortran environment. SIGARCH Comput. Archit. News, 13, 3–11.

14 McCalpin, J.D. (1995) A survey of memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsl., 1, 19–25.

15 Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S. and Stoica, I. (2009) Job scheduling for multi-user MapReduce clusters. Technical Report UCB/EECS-2009-55, EECS Department, University of California, Berkeley, CA, USA.

16 Garraghan, P., Ouyang, X., Yang, R., McKee, D. and Xu, J. (2016) Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Trans. Serv. Comput., 1–1.

17 Ouyang, X., Garraghan, P., McKee, D., Townend, P. and Xu, J. (2016) Straggler detection in parallel computing systems through dynamic threshold calculation. Proc. 2016 IEEE 30th Int. Conf. Advanced Information Networking and Applications (AINA), Crans-Montana, Switzerland, March, pp. 414–421. IEEE.

18 Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R. and Stoica, I. (2008) Improving MapReduce performance in heterogeneous environments. Proc. 8th USENIX Conf. Operating Systems Design and Implementation (OSDI '08), Berkeley, CA, USA, December, pp. 29–42. USENIX Association.
19 Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S. and Stoica, I. (2011) Mesos: a platform for fine-grained resource sharing in the data center. Proc. 8th USENIX Conf. Networked Systems Design and Implementation (NSDI '11), Berkeley, CA, USA, March, pp. 295–308. USENIX Association.

20 Powered by Hadoop wiki, https://wiki.apache.org/hadoop/poweredby.

21 Apache Mesos, http://mesos.apache.org/.

22 Facebook Corona, https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920.

23 Zhang, Z., Li, C., Tao, Y., Yang, R., Tang, H. and Xu, J. (2014) Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. Proc. VLDB Endow., 7, 1393–1404.

24 Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E. and Wilkes, J. (2015) Large-scale cluster management at Google with Borg. Proc. 10th European Conf. Computer Systems (EuroSys '15), New York, NY, USA, April, pp. 1–17. ACM.

25 Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M. and Wilkes, J. (2013) Omega: flexible, scalable schedulers for large compute clusters. Proc. 8th ACM Eur. Conf. Computer Systems (EuroSys '13), New York, NY, USA, April, pp. 351–364. ACM.

26 Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B. and Harris, E. (2010) Reining in the outliers in map-reduce clusters using Mantri. Proc. 9th USENIX Conf. Operating Systems Design and Implementation (OSDI '10), Berkeley, CA, USA, October, pp. 265–278. USENIX Association.

27 Zhang, X., Dwarkadas, S., Folkmanis, G. and Shen, K. (2007) Processor hardware counter statistics as a first-class system resource. Proc. 11th USENIX Workshop on Hot Topics in Operating Systems (HotOS '07), Berkeley, CA, USA, May, pp. 1–6. USENIX Association.

28 Ren, G., Tune, E., Moseley, T., Shi, Y., Rus, S. and Hundt, R. (2010) Google-wide profiling: a continuous profiling infrastructure for data centers. IEEE Micro, 30, 65–79.

29 Reddy, K.H.K. and Roy, D.S. (2016) DPPACS: a novel data partitioning and placement aware computation scheduling scheme for data-intensive cloud applications. Comput. J., 59, 64.

30 Burns, B., Grant, B., Oppenheimer, D., Brewer, E. and Wilkes, J. (2016) Borg, Omega, and Kubernetes. Queue, 14, 70–93.