SmartRec: Fast Recovery from Single Failures in Heterogeneous RAID-Coded Storage Systems

Abstract

Reconstruction I/Os in heterogeneous RAID-coded storage systems commonly encounter workload fluctuation. This paper proposes SmartRec, a heterogeneity-aware single-failure recovery scheme for RAIDs that tolerate double and multiple disk failures. We start by formulating the recovery of single-disk failures as an optimization problem in the context of online and heterogeneous disk arrays. To take into account both static heterogeneity associated with disk configurations and dynamic heterogeneity caused by I/O loads, SmartRec periodically selects an appropriate reconstruction solution according to up-to-date disk utilization. The selected solution specifies the amount of data retrieved from each surviving disk and is expected to achieve the minimal recovery time, which is determined by both the candidate reconstruction sequences and the reconstruction I/O capability of the surviving disks. We build a response-time model in SmartRec to measure the reconstruction I/O capability of surviving disks during a recovery process. To quantitatively compare SmartRec against three alternatives (i.e. ConRec, MinRec and BalRec), we build four analytical models and validate them using empirical evaluations. We implement the four reconstruction schemes in a heterogeneous RAID and carry out comparative online reconstruction tests by replaying real-world workloads under various configurations. The experimental results show that SmartRec outperforms the three existing reconstruction schemes in terms of reconstruction time by up to 35.3%, with an average of 25.8%.

1. INTRODUCTION

Owing to both volume scalability and I/O parallelism, disk arrays have been widely deployed in modern data centers.
Different from single-fault–tolerant RAID-4 and RAID-5 layouts, many double- and triple-fault–tolerant RAID layouts were proposed to guarantee higher data reliability. For example, EVENODD [1], RDP [2] and X-Code [3] tolerate double failures, whereas TP [4] and STAR [5] tolerate triple failures. Disk failures are the norm in large-scale storage systems [6]. Furthermore, single-disk failures play a dominant role when disks fail in RAID-coded storage; for example, evidence shows that about 99.75% of recoveries are carried out for single-disk failures [7]. Fast online recovery from disk failures is crucial to OLTP applications (e.g. financial transaction systems and retail sale systems), which impose requirements on both the I/O performance and the data availability of the underlying storage subsystems.

Various reconstruction optimization schemes were proposed to recover from single-disk failures [8–13]. These schemes seek the optimal recovery solution by exploiting several reconstruction sequences when a single disk fails in RAID-coded storage systems that tolerate double and multiple concurrent failures. Minimizing the number of reconstruction I/Os is an efficient approach to accomplishing high-speed recovery. Reconstruction I/Os are induced by retrieving surviving blocks from surviving disks to reconstruct lost blocks. The RDOR algorithm [11] is an optimal disk recovery scheme for single-disk failures of RAIDs using RDP codes, achieving the smallest number of disk reads; two similar recovery schemes were proposed for single-disk recovery in RAIDs powered by EVENODD and X-Code [8, 13]. An enumeration recovery algorithm was proposed to discover an I/O-optimal recovery for single-disk failures in general XOR-based erasure codes [9]. The enumeration recovery minimizes read reconstruction I/Os at the cost of exponential computational overhead in searching for a recovery solution.
To speed up the search process, a replace recovery scheme was developed to offer near-I/O-optimal recovery [10].

The aforementioned recovery schemes attempt to optimize reconstruction performance by minimizing read reconstruction I/Os. However, these schemes ignore the distribution of reconstruction I/Os across surviving disks. Intuitively, the total recovery time is determined by the time spent reading data from the most heavily loaded disk in a RAID-coded storage system. To balance read reconstruction I/Os, Luo et al. proposed two recovery algorithms (i.e. C-algorithm and U-algorithm) that minimize read reconstruction I/Os from each surviving disk while balancing read reconstruction I/Os among surviving disks [14]; Xiang et al. designed an extended hybrid recovery approach that balances the number of blocks read among surviving disks [12].

Unfortunately, balanced read reconstruction I/Os do not necessarily lead to minimized recovery time. A disk's I/O capability available to reconstruction is not only restricted by its inherent bandwidth but also influenced by the user I/O load. Thus, the I/O capability of an involved disk fluctuates during the entire reconstruction period. That is, there is an issue of disk heterogeneity in RAID reconstruction. In this paper, we propose a 'Heterogeneity-aware' reconstruction scheme, SmartRec, which matches the number of reconstruction reads from a surviving disk to the I/O capability of that disk.

We use a concrete example to illustrate why disk heterogeneity must be taken into consideration to improve reconstruction performance. Figure 1 shows the normalized reading time using the 'Balanced-I/O' and 'Heterogeneity-aware' recovery policies. The balanced-I/O policy balances read reconstruction I/Os among surviving disks (e.g. {4, 4, 3} reads on disks {#1, #2, #3}, respectively).
With the Heterogeneity-aware policy in place, the I/O capability of all the surviving disks can be fully utilized during the entire reconstruction process. We observe from Fig. 1 that the normalized reading times of the 'Balanced-I/O' and 'Heterogeneity-aware' policies are 1.0=max{1.0,0.4,0.5} and 0.6=max{0.5,0.6,0.5}, respectively. In short, a fast reconstruction is expected when reconstruction I/Os are assigned to disks according to each disk's I/O capability.

Figure 1. Comparison of reading time between balanced-I/O recovery policy and heterogeneity-aware recovery policy. (a) I/O capability of three surviving disks; (b) number of reconstruction reads; (c) reading time.

The design of recovery schemes for online and heterogeneous disk arrays is still an open problem. Two factors make reconstruction I/Os heterogeneous: (1) disk configuration: to maintain and upgrade disk arrays [15], new disks from different vendors may be added to existing RAIDs, so that the new and old disks are heterogeneous in nature; (2) I/O load fluctuation: even with homogeneous disks in a disk array, imbalanced user I/Os issued among disks make the I/O capacity of the disks heterogeneous from the perspective of reconstruction I/Os. To the best of our knowledge, little research effort has been directed towards enabling reconstruction schemes to handle reconstruction I/O heterogeneity in surviving disks.

To reduce reconstruction time while alleviating performance degradation of user I/Os, our SmartRec scheme adopts a flexible recovery strategy that matches the dynamically changing reconstruction capacity of heterogeneous surviving disks.
Specifically, we periodically select an appropriate reconstruction solution according to up-to-date disk utilization within each recovery window; the selected solution achieves the minimal reconstruction time determined by both the candidate recovery sequences and the I/O capability of the involved surviving disks. We build a model to estimate the average response time of reconstruction I/Os governed by SmartRec, and we apply this response-time model to measure the reconstruction I/O capability of surviving disks during a recovery process. We formulate the reconstruction problem of single-disk failures as an optimization problem in the context of online and heterogeneous disk arrays. We implement the SmartRec scheme as well as three existing schemes in a heterogeneous RAID. We carry out comparative online reconstruction tests by replaying real-world workloads under a wide variety of configurations. The experimental results show that SmartRec outperforms the three existing reconstruction schemes in terms of reconstruction time by up to 35.3%, with an average of 25.8%.

The contributions of this paper are summarized as follows:
(1) We propose a heterogeneity-aware reconstruction scheme (i.e. SmartRec) to recover the data of a single failed disk in parallel RAID-coded storage systems that tolerate concurrent multiple-disk failures.
(2) We build a response-time model for SmartRec to quantify the reconstruction I/O capability of surviving disks during a recovery process.
(3) We develop four analytical models to theoretically compare SmartRec against the three alternative schemes (i.e. ConRec, MinRec and BalRec), and validate the correctness of the four models using empirical evaluations.
(4) We implement SmartRec and the three existing reconstruction schemes in a heterogeneous RAID; the experimental results show that SmartRec outperforms them in terms of reconstruction time and user response time.

The rest of this paper is organized as follows.
Section 2 briefly overviews the related work. Erasure-coded RAID is introduced in Section 3. Section 4 details the design issues of SmartRec. The analytical models for the four reconstruction schemes are presented in Section 5. The comparative experiments are given in Section 6. Finally, we conclude our work in Section 7.

2. RELATED WORK

Data recovery for erasure-coded storage systems has drawn much attention over the years [16]; a number of recovery techniques have been proposed to improve recovery performance from various perspectives [10, 12, 17–20]. In this section, we overview existing single-failure recovery techniques for multiple-fault–tolerant RAID-coded storage systems. We classify these techniques into the following three categories.

Conventional Recovery Scheme (ConRec): In ConRec, all symbols of a single failed disk are reconstructed using only one set of independent parity chains (see Section 3), which share no common symbol. All surviving symbols in the parity chain set are retrieved to reconstruct the failed disk, i.e. reconstruction I/Os involve all symbols of each surviving disk. Examples of naive recovery adopting the ConRec scheme for single-disk failure reconstruction can be found in the literature [1–3].

Minimal Recovery Scheme (MinRec): In MinRec, all lost symbols of a single failed disk are regenerated by a set of diverse parity chains (a.k.a. hybrid parity chains), which share common symbols. Therefore, reconstruction I/Os include only a part of all surviving symbols, thereby improving reconstruction performance; however, the distribution of reconstruction I/Os in MinRec may be non-uniform across surviving disks, which may lead to low reconstruction performance. To minimize read reconstruction I/Os, Xiang et al. [12] and Wang et al. [8] proposed optimal single-disk failure reconstruction schemes for RDP and EVENODD, respectively. For any RAID codes, Khan et al.
[20] designed an enumeration recovery algorithm for single-disk failures, whereas Zhu et al. [10] developed a replace recovery algorithm that generates a near-optimal recovery scheme and speeds up the search process compared with the enumeration recovery scheme.

Balanced Recovery Scheme (BalRec): In BalRec, a single failed disk is reconstructed by hybrid parity chains; this reconstruction strategy is similar to that of MinRec. Whereas MinRec minimizes read reconstruction I/Os from surviving disks, BalRec balances read reconstruction I/Os among surviving disks. Because I/O accesses proceed in parallel in a RAID, Luo et al. [14] argued that the critical factor influencing the recovery performance of failed disks lies in the read reconstruction I/Os of the most heavily loaded disk (i.e. the bottleneck disk). To address the non-uniform data distribution induced by minimal read reconstruction I/O load, Luo et al. proposed two improved reconstruction algorithms, called C-algorithm and U-algorithm, which balance the read reconstruction I/O load on surviving disks.

Some of the above-mentioned recovery schemes are summarized in Table 1. Existing balance-oriented reconstruction optimization schemes balance the amount of reconstruction data retrieved among surviving disks. In contrast, our SmartRec scheme aims to balance the workload of surviving disks: the number of data blocks retrieved from a surviving disk depends on the reconstruction I/O capability of the disk, which usually changes dynamically with I/O intensity, disk I/O schedulers and disk hardware configurations.

Table 1. Comparison of RAID reconstruction schemes.

Schemes            Codes        Balance   Metric
RDOR [12]          RDP          Yes (a)   min-reconstruction I/O
Wang et al. [8]    EVENODD      No        min-reconstruction I/O
Khan et al. [20]   RAID codes   No        min-reconstruction I/O
Zhu et al. [22]    RAID codes   No        transmission latency
Luo et al. [14]    RAID codes   Yes (a)   bal-reconstruction I/O
SmartRec           RAID codes   Yes (b)   reconstruction time

(a) Existing load-balancing reconstruction schemes ensure that read reconstruction I/Os across surviving disks are equal or close; these techniques ignore disk processing capability, which is affected by user I/Os. In some sense, the capability of disks can be considered equal in the case of offline reconstructions, where there is no user I/O.
(b) SmartRec considers dynamically changing disk capability (i.e. a higher disk capability indicates that a disk can process more reconstruction I/Os). The disk capability is affected by two types of factors: (1) static factors: disk configurations (e.g. disk models and access patterns); (2) dynamic factors, which depend on user I/Os.

Apart from optimizing reconstruction performance by selecting among combinations of recovery chains, other studies propose different techniques for failure recovery in storage systems. For example, Pro [18] integrates user I/O access locality into the reconstruction process to optimize the reconstruction sequence, Workout [17] improves the reconstruction time and user response time by outsourcing users' workloads during recovery, and Wan et al. [19] and Xie et al. [21] improve recovery performance by exploring better cache utilization.

In this paper, we focus on single-failure recovery in parallel heterogeneous RAID-coded storage systems. Targeting the heterogeneous recovery problem, Zhu et al. [22] developed a cost-based heterogeneous recovery algorithm called CHR for heterogeneous networked storage systems. Luo et al. [23] proposed the LaRS scheme for optimizing failure recovery in heterogeneous erasure-coded storage clusters. Our SmartRec differs from CHR and LaRS in the following four respects.
First, CHR and LaRS focus on the heterogeneity of network transmission bandwidth in networked storage systems, whereas SmartRec addresses the heterogeneity of disk bandwidth in the context of RAID systems. We design SmartRec for RAID systems because we observe that the heterogeneity problem of data recovery in RAID systems lies in disks rather than in network resources. Second, the recovery cost model of SmartRec is different from that of CHR and LaRS. CHR and LaRS use coarse-grained models in a static environment; SmartRec uses a fine-grained model that incorporates the concept of a window to quantify recovery cost in a dynamic environment. Third, CHR and LaRS pay no attention to user response time, whereas SmartRec employs a real-time monitor to keep track of user and disk request response times. We investigate the impact of SmartRec on user requests in various dynamic scenarios. Last, CHR was implemented at the file system level using NCFS, whereas we implement SmartRec at the block level to further speed up recovery. LaRS is implemented in an RS-coded [24] storage cluster, while SmartRec is implemented in a RAID-coded storage system (i.e. one that uses only XOR operations).

3. ERASURE-CODED RAID

In this study, we focus on RAID codes computed using XOR-based operations. Throughout this paper, we apply notation similar to that used in [25] for erasure codes. Plank et al. [25] evaluated five different types of erasure codes, compared their encoding and decoding performance, and demonstrated which features and parameters lead to good coding performance. For instance, an optimization called Code-Specific Hybrid Reconstruction [26] is necessary to achieve good decoding speeds in many of the codes. These results give storage system designers an idea of what to expect in terms of coding performance when designing their storage systems.
Data stored in an erasure-coded storage system is partitioned into fixed-size stripes, each of which is a two-dimensional array with a fixed number of rows and n columns. Each column, called a strip, holds w symbols and corresponds to a unique disk. Each parity strip is encoded using one strip from each data disk, and the collection of k+m strips that are encoded together is called a stripe. Each stripe stores a fixed number (n*w) of blocks. Stripes are independently encoded in such a way that data and parity strips are rotated among disks for balanced load [25].

Let us take the TP code as an example to illustrate the construction mechanisms of RAID codes tolerating multiple-disk failures. Figure 2 shows that the TP code is defined in a (p−1)-row-by-(p+2)-column matrix, where p>2 is a prime number. The first p−1 disks (i.e. data disks) hold all data blocks, whereas the last three disks (i.e. parity disks) store all parity blocks. Let C_{i,j} represent the block in the ith row and the jth column. There are three types of parity blocks, namely, row parity blocks, diagonal parity blocks and anti-diagonal parity blocks. A parity chain is an independent minimal fault-tolerant unit. There are three kinds of parity chains (i.e. row parity chains, diagonal parity chains and anti-diagonal parity chains); each parity chain is a set of blocks containing a parity block and the corresponding data blocks. In a parity chain, one failed block is reconstructed from the surviving blocks. To facilitate this analysis, we consider cases where blocks with the same shape form a parity chain. For example, if block C_{0,0} fails, the parity-chain mechanism reconstructs block C_{0,0} from the surviving blocks {C_{0,1}, C_{0,2}, C_{0,3}, C_{0,4}}.

Figure 2. Data layout of TP code in a stripe. (a) Row parity chain set. (b) Diagonal parity chain set. (c) Anti-diagonal parity chain set.

A TP-coded storage system has p+2 disks denoted by d_0, d_1, …, d_{p+1}. Without loss of generality, we let disks d_{p−1}, d_p and d_{p+1} hold row, diagonal and anti-diagonal parity blocks, respectively. Let d_f be the failed disk, where 0≤f≤p−2. We suppose that reconstruction operations read r_i blocks from the ith surviving disk in a recovery sequence. The jth reconstruction sequence RS_j={{y_i}_{0≤i≤p−2}}_j specifies how each block of failed disk d_f is reconstructed: y_i=0 indicates that the ith block of d_f is reconstructed by a row parity chain; y_i=1 means that the ith block of d_f is reconstructed by a diagonal parity chain; and y_i=−1 means that the ith block of d_f is reconstructed by an anti-diagonal parity chain. Various strategies that speed up the search for reconstruction sequences can be found in the literature [9, 10]. Each reconstruction sequence corresponds to a reconstruction I/O distribution. Among all available reconstruction sequences, SmartRec chooses those whose I/O distributions induce minimal reconstruction I/Os [20] as candidate recovery solutions R_cand (i.e. R_cand={RS_1, RS_2, …, RS_j}). Once the jth reconstruction sequence RS_j is determined, we can obtain the associated reconstruction I/O distribution {{r_{j,i}}_{i∈d_s}}_j, where d_s∈{0,…,f−1,f+1,…,p+1}. For example, if disk '0' fails (see Fig. 2) and RS_j={1,1,0,−1}, then the reconstruction I/O distribution is {{r_{j,i}}_{i∈d_s}}_j={0,2,3,3,2,2,1}.

4. DESIGN ISSUES

4.1. The overview of SmartRec

SmartRec aims to optimize the reconstruction process by dynamically balancing the I/O capability of surviving disks. In heterogeneous disk arrays, various types of disks may experience different access patterns, strategies and I/O capabilities. Reconstruction time, of course, is determined by the slowest surviving disk (i.e.
bottleneck disk), which takes the most time to retrieve data for recovery. In addition, for online storage systems, user I/O intensity changes dynamically during the course of data recovery [18]. Thus, the reconstruction bandwidth varies with the changing I/O intensity within each recovery window during an online recovery process in heterogeneous disk arrays. SmartRec assigns the recovery load according to the I/O capability of surviving disks, thereby taking full advantage of the dynamically changing reconstruction bandwidth across multiple recovery windows.

Figure 3 depicts a sample data reconstruction process governed by SmartRec. The regenerated data of the failed disk is placed on a replacement disk during data recovery; the replacement disk holds both reconstructed units and units that still need reconstruction. The data-recovery process is divided into multiple fixed-size recovery windows (i.e. K−1, K and K+1). The recovery of failed units in each recovery window retrieves different amounts of data from surviving disks. We refer to such a recovery data assignment as the reconstruction I/O distribution (see details in Section 3) across surviving disks. The I/O distribution retrieved from surviving disks depends on capability-I/O-based candidate recovery sequences (e.g. High, Medium and Low), which indicate different amounts of data being retrieved for reconstruction. For example, 'High' means that the amount of data being retrieved best matches the I/O capability of the involved surviving disks in the current recovery window; in this case, the reconstruction time is minimized. We refer to this recovery sequence as a capability-I/O-matched recovery solution, e.g. RS_match={0,1,0,1,1,−1}. The capability-I/O-matched recovery solutions are derived from both the candidate recovery sequences and the I/O capability of disks (i.e. the reconstruction bandwidth of each disk).
Each recovery window corresponds to multiple candidate recovery sequences, among which SmartRec attempts to select the capability-I/O-matched recovery solution. For instance, assume that the I/O capabilities of surviving disks {#1, #2, #3, #4, #5, #6, #7, #8} are 38.32 ms/MB, 9.63 ms/MB, 40.51 ms/MB, 16.01 ms/MB, 16.87 ms/MB, 9.37 ms/MB, 6.45 ms/MB and 16.05 ms/MB, respectively, in the current Kth recovery window. The candidate recovery sequences of a TP-coded storage system with an array of 9 disks (i.e. p=7) are as follows: RS_1 is {0,1,0,1,1,−1}, corresponding to the reconstruction I/O distribution {3,3,4,3,4,4,3,1}; RS_2 is {0,1,−1,−1,0,−1}, corresponding to {3,3,4,3,4,4,1,3}; RS_3 is {0,−1,0,−1,−1,1}, corresponding to {4,4,3,4,3,3,1,3}; and RS_4 is {0,−1,1,1,0,1}, corresponding to {4,4,3,4,3,3,3,1}.

Figure 3. The snapshot of a data reconstruction process in SmartRec.

Recall that SmartRec seeks the minimal reconstruction time determined by both the candidate recovery sequences and the I/O capability of the involved surviving disks. For each candidate, the reconstruction time is the number of reconstruction I/Os on each disk multiplied by the I/O capability of that disk, taken at the slowest (bottleneck) disk; the minimal reconstruction time is the smallest such value among the candidates. Thus, recovery sequence {0,−1,0,−1,−1,1}, which achieves the minimal reconstruction time, is selected as the capability-I/O-matched recovery solution.

SmartRec schedules reconstruction threads using a capability-I/O-based time-sharing policy. The reconstruction thread for a recovery window is activated as soon as a capability-I/O-matched solution is selected among all the candidate recovery sequences. A recovery window corresponds to a reconstruction time slice (e.g. K−1 is associated with T_{k−1}, K with T_k, and K+1 with T_{k+1}).
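The selection in the example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bottleneck_time` and `select_matched` are hypothetical, the per-disk cost is assumed to be the read count multiplied by the measured ms/MB figure (so the result is in relative units), and ties are assumed to be broken in favour of the first candidate listed.

```python
from typing import Dict, List, Sequence, Tuple

# (recovery sequence, reads per surviving disk); hypothetical container type.
Candidate = Tuple[List[int], List[int]]


def bottleneck_time(reads: Sequence[int], cost_per_read: Sequence[float]) -> float:
    """Relative time of the slowest disk: max over disks of reads * cost."""
    if len(reads) != len(cost_per_read):
        raise ValueError("one cost value is required per surviving disk")
    return max(r * c for r, c in zip(reads, cost_per_read))


def select_matched(candidates: Dict[str, Candidate],
                   cost_per_read: Sequence[float]) -> Tuple[str, float]:
    """Pick the candidate with the smallest bottleneck time.

    Assumption: on a tie, the first candidate in insertion order wins,
    which is how Python's min() behaves.
    """
    best = min(candidates,
               key=lambda name: bottleneck_time(candidates[name][1], cost_per_read))
    return best, bottleneck_time(candidates[best][1], cost_per_read)


# Values taken from the example in the text (disks #1..#8, ms/MB).
capability = [38.32, 9.63, 40.51, 16.01, 16.87, 9.37, 6.45, 16.05]
candidates = {
    "RS1": ([0, 1, 0, 1, 1, -1], [3, 3, 4, 3, 4, 4, 3, 1]),
    "RS2": ([0, 1, -1, -1, 0, -1], [3, 3, 4, 3, 4, 4, 1, 3]),
    "RS3": ([0, -1, 0, -1, -1, 1], [4, 4, 3, 4, 3, 3, 1, 3]),
    "RS4": ([0, -1, 1, 1, 0, 1], [4, 4, 3, 4, 3, 3, 3, 1]),
}

best, cost = select_matched(candidates, capability)
print(best, candidates[best][0], round(cost, 2))
# prints: RS3 [0, -1, 0, -1, -1, 1] 153.28
```

Under this cost model, RS_1 and RS_2 are bounded by disk #3 (4 × 40.51 = 162.04), while RS_3 and RS_4 are bounded by disk #1 (4 × 38.32 = 153.28). RS_3 and RS_4 therefore tie, and the sketch returns RS_3 only because it is listed first.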
The size of corrupted units is fixed in a recovery window; however, the reconstruction time slice varies with the changing reconstruction bandwidth within different recovery windows. The overarching goal of SmartRec is to minimize reconstruction time slices. It is worth noting that the size of a reconstruction time slice in SmartRec is associated with the number of candidate recovery sequences. A capability-I/O-matched recovery solution minimizing the reconstruction time slice is expected to be chosen in each recovery window.

Figure 4 depicts the I/O-capability selection strategy among multiple reconstruction time slices (i.e. T_{k−1}, T_k and T_{k+1}). To measure the current I/O capability of the involved surviving disks (e.g. disk i and disk i+1), we collect statistics of historical access times during each recovery window. Such measurements are important because the actual I/O capability of a window is not known until that window has completed. Therefore, the reconstruction bandwidth of the previous recovery window is used to determine a predicted capability-I/O-matched reconstruction solution for the current recovery window (see Equation (6)). For instance, for the current (K−1)th window (i.e. the T_{k−1} reconstruction time slice), the I/O capability is derived from the (K−2)th window. Similarly, the I/O capability of the kth recovery window is governed by that of the (k−1)th recovery window.

Figure 4. The I/O-capability selection strategy among multiple recovery windows.

4.2. The implementation details

In this section, we discuss the design issues of SmartRec for online single-disk failure recovery in heterogeneous disk arrays. We present design principles and algorithm descriptions. The notation and parameters used throughout this paper can be found in Table 2.
The SmartRec architecture consists of three key components, namely access monitor (AM), reconstruction scheduler (RS) and reconstruction executor (RE). Table 2. The notation used in the recovery models. Parameters Meaning S Size of block unit BWrec,k BWrec,k denote reconstruction bandwidth among bottleneck disk in the k th recovery window BWk,i BWk,i denote reconstruction bandwidth in i th disk during k th recovery window rj,i I/Os retrieved from i th disk in j th recovery sequence Rcand a set of candidate recovery sequence Numwin # of recovery windows for need reconstruction Numrec,k−1,i # of processed reconstruction I/O in i th disk during (k−1) th recovery window Tcon, Tmin, Tbal, Tsmart Total reconstruction time of ConRec, MinRec, BalRec and SmartRec tl the response time of l th processed reconstruction I/O trec,k,i average response time of reconstruction I/O in the i th disk during k th recovery window, i.e. reconstruction I/O capability of disk df, ds df denote failed disk, ds denote surviving disks Numcon,k, Nummin,k, reconstruction I/O’s amount of ConRec, MinRec, BalRec and SmartRec Numbal,k, Numsmart,k among bottleneck disk in k th recovery window. Parameters Meaning S Size of block unit BWrec,k BWrec,k denote reconstruction bandwidth among bottleneck disk in the k th recovery window BWk,i BWk,i denote reconstruction bandwidth in i th disk during k th recovery window rj,i I/Os retrieved from i th disk in j th recovery sequence Rcand a set of candidate recovery sequence Numwin # of recovery windows for need reconstruction Numrec,k−1,i # of processed reconstruction I/O in i th disk during (k−1) th recovery window Tcon, Tmin, Tbal, Tsmart Total reconstruction time of ConRec, MinRec, BalRec and SmartRec tl the response time of l th processed reconstruction I/O trec,k,i average response time of reconstruction I/O in the i th disk during k th recovery window, i.e. 
reconstruction I/O capability of disk df, ds df denote failed disk, ds denote surviving disks Numcon,k, Nummin,k, reconstruction I/O’s amount of ConRec, MinRec, BalRec and SmartRec Numbal,k, Numsmart,k among bottleneck disk in k th recovery window. View Large Table 2. The notation used in the recovery models. Parameters Meaning S Size of block unit BWrec,k BWrec,k denote reconstruction bandwidth among bottleneck disk in the k th recovery window BWk,i BWk,i denote reconstruction bandwidth in i th disk during k th recovery window rj,i I/Os retrieved from i th disk in j th recovery sequence Rcand a set of candidate recovery sequence Numwin # of recovery windows for need reconstruction Numrec,k−1,i # of processed reconstruction I/O in i th disk during (k−1) th recovery window Tcon, Tmin, Tbal, Tsmart Total reconstruction time of ConRec, MinRec, BalRec and SmartRec tl the response time of l th processed reconstruction I/O trec,k,i average response time of reconstruction I/O in the i th disk during k th recovery window, i.e. reconstruction I/O capability of disk df, ds df denote failed disk, ds denote surviving disks Numcon,k, Nummin,k, reconstruction I/O’s amount of ConRec, MinRec, BalRec and SmartRec Numbal,k, Numsmart,k among bottleneck disk in k th recovery window. 
The access monitor (AM) is responsible for capturing user and reconstruction I/Os. AM calculates the average response time of user I/Os in the RAID-coded storage system; it also keeps track of the average access time of reconstruction I/Os within each recovery window. AM performs the following steps: (1) AM checks whether all recovery windows have been reconstructed. If $Num_{win} \neq 0$, AM captures user and reconstruction I/O requests, including their I/O type and address information; (2) AM calculates the average response time of user I/Os (i.e. $t_{user}$) and the average response time of reconstruction I/Os during the $k$th recovery window (i.e. $t_{rec,k}$). This average response time quantifies the reconstruction I/O capability of the disks (see Equation (6)). The design of AM is outlined in Algorithm 1.

Algorithm 1 Access Monitor (AM).
Input: $t_{user,array}[]$ and $t_{rec,array}[]$, two circular arrays that hold the response times of user I/Os and reconstruction I/Os, respectively.
Output: $t_{user}$ and $t_{rec,k}$.
Notation: $f_{user}$: # of user I/Os during recovery. $f_{rec,k,i}$: # of reconstruction I/Os retrieved from the $i$th surviving disk during the $k$th recovery window.
 1: while $Num_{win} \neq 0$ do
 2:   if (User IO is true) do //user IO request
 3:     $T_{user} = 0$
 4:     for each ($j$th user IO in $t_{user,array}[]$) do
 5:       $T_{user} = T_{user} + t_{user,array}[j]$
 6:     end for
 7:     $t_{user} = T_{user} / f_{user}$
 8:     return $t_{user}$
 9:   else //reconstruction IO request
10:     for each ($i$th surviving disk) do
11:       $T_{rec,k,i} = 0$
12:       for each ($j$th reconstruction IO in $t_{rec,array}[]$) do
13:         $T_{rec,k,i} = T_{rec,k,i} + t_{rec,array}[j]$
14:       end for
15:       $t_{rec,k,i} = T_{rec,k,i} / f_{rec,k,i}$
16:     end for
17:     return $t_{rec,k} = \{t_{rec,k,i}\}_{i \in d_s}$
18:   end if
19: end while

The reconstruction scheduler (RS) selects a capability-I/O-matched recovery solution within the $k$th recovery window, according to the candidate recovery sequences $R_{cand}$ (see also Section 3) and the reconstruction I/O capability of the disks (i.e. $t_{rec,k}$, see Algorithm 1) in the current recovery window. Algorithm 2 shows the pseudo-code of the RS algorithm.
The RS process is divided into two phases: (1) according to $t_{rec,k}$ and each candidate recovery sequence $RS_j$, RS calculates the reconstruction time of every candidate recovery sequence (i.e. $max(RS_j)$) (see Steps 5–12 in Algorithm 2); (2) using the results of phase 1, RS selects the minimal value among all the $max(RS_j)$ values. The sequence that attains this minimum becomes the capability-I/O-matched recovery solution ($RS_{match}$) (see Steps 14–23 in Algorithm 2).

Algorithm 2 Reconstruction Scheduler (RS).
Input: $R_{cand}$ and $t_{rec,k}$.
Output: Capability-I/O-matched recovery solution ($RS_{match}$).
Notation: $R_{cand} = \{RS_1, RS_2, \ldots, RS_m\}$ is a set of recovery sequences. $t_{rec,k} = \{t_{rec,k,1}, t_{rec,k,2}, \ldots, t_{rec,k,n}\}$, where $n$ is the # of disks. Given the $j$th recovery sequence $RS_j$: $RS_j \times t_{rec,k} = \{r_{j,1} \times t_{rec,k,1}, \ldots, r_{j,n} \times t_{rec,k,n}\}$ and $max(RS_j) = \max_{i=1}^{n} \{r_{j,i} \times t_{rec,k,i}\}$.
 1: while $Num_{win} \neq 0$ do
 2:   if $Num_{rec,k-1} = 0$ then
 3:     /*Failed units have been reconstructed in the (k−1)th window.*/
 4:     Phase 1: /*compute $max(RS_j)$, $1 \le j \le m$*/
 5:     for ($j = 1$; $j \le m$; $j$++) do
 6:       $max(RS_j) = 0$
 7:       for ($i = 1$; $i \le n$; $i$++) do
 8:         if ($max(RS_j) < r_{j,i} \times t_{rec,k,i}$) do
 9:           $max(RS_j) = r_{j,i} \times t_{rec,k,i}$
10:         end if
11:       end for
12:     end for
13:     Phase 2: /*find the capability-I/O-matched sequence*/
14:     match = 1
15:     $minR_{cand} = max(RS_{match})$
16:     for ($j = 2$; $j \le m$; $j$++) do
17:       if ($max(RS_j) < minR_{cand}$) do
18:         match = $j$
19:         $minR_{cand} = max(RS_j)$
20:       end if
21:     end for
22:     return match
23:   end if
24:   Initialize $Num_{rec,k}$
25:   $Num_{win} = Num_{win} - 1$
26: end while
27: Invoke RE($RS_{match}$)

The main function of RE is to retrieve reconstruction I/Os according to a capability-I/O-matched recovery solution and to rebuild the corresponding failed units on a replacement disk. RE comprises three phases: (1) the reading data process; (2) the XOR data process; (3) the writing data process. Algorithm 3 outlines the RE design. The three phases are detailed as follows.

During the reading data process (see Steps 4–9 in Algorithm 3), all processes associated with surviving disks go through the following five steps to retrieve data in accordance with a capability-I/O-matched recovery solution. First, the reconstruction sequence $RS_{match}$ for the current recovery window is obtained. Second, a request is issued to read the surviving units needed to rebuild the indicated failed unit into a read buffer manager. Third, RE waits for the read request to complete. Fourth, the unit's data is transferred to a centralized buffer manager for an XOR operation; this transfer may be blocked if the buffer is full. Last, once all failed units in the current recovery window are reconstructed, $Num_{rec,k}$ has been decreased to zero.
In the XOR data process (see Steps 11–17 in Algorithm 3), RE regenerates failed data from the data retrieved by reconstruction I/Os. Specifically, RE first checks whether the read buffer holds any data. If the read buffer is not empty, RE reads data from the buffer for an XOR operation. Finally, RE writes the XOR result to the write buffer.

Algorithm 3 Reconstruction Executor (RE).
State: To regenerate failed data from the capability-I/O-matched recovery solution (see Algorithm 2), RE creates $n$ threads, each of which is associated with one disk.
Input: $RS_{match}$
 1: Create ReadBuffer to handle reconstruction I/Os.
 2: Create WriteBuffer to store regenerated failed data.
 3: /*Reading Data Process*/
 4: while ($Num_{win} \neq 0$) and ($Num_{rec,k} > 0$) do
 5:   if ReadBuffer() ≠ Full then
 6:     read($RS_{match}$) → ReadBuffer()
 7:   end if
 8:   $Num_{rec,k} = Num_{rec,k} - 1$
 9: end while
10: /*XOR Data Process*/
11: while $Num_{win} \neq 0$ do
12:   if ReadBuffer() ≠ null then
13:     bufferdata ← ReadBuffer(size, address)
14:     XorResult = Xor(bufferdata)
15:     WriteBuffer(XorResult)
16:   end if
17: end while
18: /*Writing Data Process*/
19: while $Num_{win} \neq 0$ do
20:   if WriteBuffer() ≠ null then
21:     readresult ← read(size, address, WriteBuffer)
22:     write(readresult) → replacedisk
23:   end if
24: end while

During the writing data process (see Steps 19–24 in Algorithm 3), RE writes rebuilt data to the replacement disk: a request is issued to write the data in the write buffer to the replacement disk. If the write buffer is empty, the writing process stalls; it resumes when the write buffer holds data again.

To balance the reconstruction load dynamically and thereby improve recovery performance, SmartRec tracks how the I/O capability of all surviving disks changes during the entire recovery. SmartRec divides the entire reconstruction area into multiple consecutive, non-overlapping data areas. Each such area is called a recovery window and is associated with multiple candidate recovery sequences. Within each recovery window, the reconstruction threads select the capability-I/O-matched recovery solution, i.e. the candidate recovery sequence with the minimal recovery time. In our implementation, the recovery process creates $n$ threads associated with $n-1$ surviving disks and one replacement disk. The whole reconstruction area holds 10 GB of data, which is evenly divided into five recovery windows, i.e. each recovery window covers 2 GB of data. With this approach, SmartRec keeps the reconstruction load dynamically balanced.

4.3. Overhead analysis

Space Overhead Analysis: For the online recovery process, memory overhead is mainly governed by the user I/O trace and the reconstruction I/Os, which correspond to three in-memory data buffers: TraceBuffer, ReadBuffer and WriteBuffer. TraceBuffer holds the user I/O trace, with each request assigned to its target address. ReadBuffer stores the reconstruction blocks retrieved from surviving disks; these blocks are combined by XOR operations to generate the failed data blocks, which are saved in WriteBuffer. For instance, assuming that 20 reconstruction blocks of 16 KB each are retrieved into ReadBuffer for one stripe, the size of ReadBuffer is 20 × 16 KB = 320 KB, i.e. approximately 0.32 MB. WriteBuffer is smaller, because it holds only the regenerated blocks. The memory overhead of TraceBuffer is about 2.5 MB for the part of the web-2 trace that we replay. This memory overhead is only temporary and is released after the reclaim process completes. We therefore consider the memory overhead acceptable for a real storage server with 16 GB of memory.

Computing Overhead Analysis: AM is mainly responsible for calculating the average response time of both user I/Os and reconstruction I/Os, so its overhead is mainly computational. RS is responsible for selecting the capability-I/O-matched recovery solution based on both the average response time of reconstruction I/Os and the candidate recovery solutions. RS creates multiple threads associated with the candidate recovery solutions; each thread calculates the recovery time of its candidate for the current recovery window. The candidate recovery solutions, i.e. those with minimal reconstruction I/O distributions, are generated in advance rather than during the execution of RS, so the overhead of enumerating all possibilities is hidden. The remaining computing overhead is negligible compared with the disk I/O overhead incurred during the entire recovery.
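The per-window work of RS can be made concrete with a small sketch. The Python function below is an illustration, not the authors' implementation: it takes the candidate sequences as lists of per-disk read counts $r_{j,i}$ together with the measured $t_{rec,k,i}$ values and performs both phases of Algorithm 2 in a single pass. The function name, the argument layout and the response times in the usage lines are assumptions made for this example; the two read distributions are the RDP ($p=5$) distributions quoted in Section 5.3.

```python
def select_recovery_sequence(candidates, avg_resp_time):
    """Pick the candidate whose bottleneck disk finishes first.

    candidates    -- candidates[j][i] is r_{j,i}, the number of reads that
                     candidate sequence j places on surviving disk i
    avg_resp_time -- avg_resp_time[i] is t_{rec,k,i} for the current window
    Returns (index of the selected candidate, its bottleneck time).
    """
    best_index, best_time = None, float("inf")
    for j, reads in enumerate(candidates):
        # Phase 1: bottleneck time of candidate j, max_i r_{j,i} * t_{rec,k,i}
        bottleneck = max(r * t for r, t in zip(reads, avg_resp_time))
        # Phase 2: keep the candidate with the smallest bottleneck time
        if bottleneck < best_time:
            best_index, best_time = j, bottleneck
    return best_index, best_time


# Read distributions from the RDP (p = 5) example; response times (ms) are hypothetical.
candidates = [[4, 4, 4, 4, 0], [3, 2, 2, 3, 2]]
slow_last_disk = [2.0, 2.0, 2.0, 2.0, 10.0]
slow_first_disk = [10.0, 2.0, 2.0, 2.0, 2.0]
print(select_recovery_sequence(candidates, slow_last_disk))
print(select_recovery_sequence(candidates, slow_first_disk))
```

With $m$ candidates and $n$ surviving disks, one selection costs $m \times n$ multiplications and comparisons per recovery window, which is consistent with the argument that the computing overhead is small relative to disk I/O.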
I/O Overhead Analysis: I/O overhead is divided into the access time of foreground user I/Os and that of background reconstruction I/Os. User I/Os and reconstruction I/Os influence each other during the entire recovery process. The RE process creates $n$ threads associated with $n-1$ surviving disks and one replacement disk. Data are read from the surviving disks in parallel, and the read and write operations are then pipelined. This approach further decreases the I/O overhead of the disks.

5. RECOVERY MODEL

We now formulate the online single-failure disk recovery problem as an optimization model for heterogeneous disk arrays. The notation used in our models is listed in Table 2. We take the TP codes as an example to build our models. Similar to the existing recovery schemes reported in [10, 14], SmartRec focuses on the recovery of data disks; we adopt a conventional recovery scheme to recover parity disks.

5.1. Problem formulation

We develop a reconstruction-time model to formulate the problem solved by our proposed reconstruction scheme. The reconstruction time equals the amount of reconstruction I/O data divided by the reconstruction bandwidth, which is determined by the slowest disk. In the cases of ConRec, MinRec and BalRec, there is only a single recovery sequence regardless of the number of recovery windows. Thus, the reconstruction I/O load is constant during the entire recovery process. SmartRec differs from these three recovery schemes in that it is designed to optimize online reconstruction performance for heterogeneous disk arrays. The model of SmartRec is formulated as follows:

$$T_{smart} = \sum_{k=1}^{Num_{win}} \frac{Num_{smart,k} \times S}{BW_{rec,k}} \tag{1}$$

where $Num_{smart,k}$ is the reconstruction I/O load and $BW_{rec,k}$ is the reconstruction bandwidth of the bottleneck disk in the $k$th recovery window.
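As a unit check on Equation (1), the sum can be evaluated directly. The sketch below is a minimal illustration under assumed inputs: the function name and the example numbers (two windows, 16 KB blocks, bandwidths in KB/s) are hypothetical and are not measurements from the paper.

```python
def total_reconstruction_time(io_counts, block_size, bandwidths):
    """Evaluate Equation (1): the sum over windows of Num_k * S / BW_k.

    io_counts  -- io_counts[k] is the number of reconstruction I/Os on the
                  bottleneck disk in window k
    block_size -- S, in the same data unit used by the bandwidths (here KB)
    bandwidths -- bandwidths[k] is BW_{rec,k}, here in KB/s
    """
    if len(io_counts) != len(bandwidths):
        raise ValueError("one bandwidth value is needed per recovery window")
    return sum(num * block_size / bw for num, bw in zip(io_counts, bandwidths))


# Hypothetical numbers: 1000 and 2000 blocks of 16 KB at 32000 and 16000 KB/s.
seconds = total_reconstruction_time([1000, 2000], 16, [32000, 16000])
print(seconds)  # 0.5 s for the first window plus 2.0 s for the second
```

The second window dominates because it combines the larger load with the lower bandwidth, which is the situation SmartRec tries to avoid by re-selecting the recovery sequence in each window.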
In the offline recovery process for a heterogeneous RAID, rebuilding a single failed disk is governed by one reconstruction sequence, because the reconstruction bandwidth is fixed. In such a case, there is no need to vary reconstruction sequences among multiple recovery windows. The disk I/O capability depends only on the disk bandwidth of the heterogeneous RAID; the bandwidth is independent of I/O intensity. In the online recovery case, the entire reconstruction process can be divided into multiple recovery windows, in which the reconstruction bandwidth varies with the changing user-I/O intensity. Each window may adopt a customized reconstruction sequence according to the run-time I/O capability of the disks. The primary optimization objective is to minimize the reconstruction time $T_{smart}$. Thus, we express this objective as

$$\text{Minimize} \sum_{k=1}^{Num_{win}} \frac{Num_{smart,k} \times S}{BW_{rec,k}} \tag{2}$$

5.2. SmartRec recovery model

To implement the design idea of 'matching the number of reconstruction I/Os to disks' reconstruction capability', we have to develop a light-weight evaluation mechanism to measure the I/O capability of disks. We face two challenges: (1) how do we choose an appropriate evaluation metric? and (2) how do we minimize the overhead caused by such an evaluation mechanism? We address the first challenge by using I/O transmission bandwidth to quantify disk I/O capability in disk arrays; the average response time is measured to determine the I/O capability of disks. A similar approach can be found in prior studies [22]. We overcome the second challenge by increasing the recovery window size. This involves a tradeoff between evaluation overhead and disk utilization. It is critical to determine the window granularity, because a larger window granularity leads to a smaller evaluation cost but lower disk utilization, and vice versa. The recovery problem solved by SmartRec is formulated as an optimal reconstruction model for heterogeneous disk arrays.
Given the $j$th recovery sequence, the reconstruction time on the $i$th disk during the $k$th recovery window is denoted $T_{k,j,i}$. Thus, we have

$$T_{k,j,i} = \frac{r_{j,i} \times S}{BW_{k,i}} \tag{3}$$

where $T_{k,j,i}$ represents the reconstruction time of the $i$th disk within the $k$th recovery window. This time can also be expressed as $r_{j,i} \times t_{rec,k,i}$ (see Phase 1 in Algorithm 2). Recall that the reconstruction time of any single surviving disk equals the number of retrieved reconstruction I/Os ($r_{j,i}$) multiplied by the average response time of a reconstruction I/O ($t_{rec,k,i}$). According to (3), we derive the reconstruction time of the $j$th recovery sequence within the $k$th recovery window ($T_{k,j}$):

$$T_{k,j} = \max_{i \in d_s} \{T_{k,j,i}\} \tag{4}$$

where $T_{k,j}$ is the maximal reconstruction time among all surviving disks. $T_{k,j}$ represents the reconstruction time of the bottleneck disk during a recovery window (see $max(RS_j)$ in Algorithm 2). The reconstruction time in the $k$th recovery window is expressed as follows:

$$T_k = \min_{j \in R_{cand}} \left\{ \max_{i \in d_s} \{T_{k,j,i}\} \right\} \tag{5}$$

where $T_k$ is the minimal bottleneck-disk reconstruction time over all candidate recovery sequences in the $k$th recovery window (see Phase 2 in Algorithm 2).

To quantify the available bandwidth for data recovery, SmartRec keeps track of the historical access times of reconstruction I/Os. The available reconstruction bandwidth of the $i$th surviving disk during the $k$th recovery window is then formulated as

$$BW_{k,i} = \frac{S}{t_{rec,k,i}}, \quad \text{where} \quad t_{rec,k,i} = \frac{\sum_{l=1}^{Num_{rec,k-1,i}} t_l}{Num_{rec,k-1,i}} \tag{6}$$

where $t_{rec,k,i}$ represents the average response time of reconstruction I/Os used for the current recovery window (i.e. the reconstruction I/O capability of the $i$th surviving disk within the $k$th recovery window). From (6), we derive the reconstruction I/O capability of the current ($k$th) recovery window from the measurements taken in the previous ($(k-1)$th) recovery window (see also Fig. 4).
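Equations (3) and (6) can be checked numerically with a few lines of Python. The sketch below is illustrative only: the function names, the 16 KB block size and the three response-time samples are assumptions, chosen so that the arithmetic is easy to follow.

```python
import math


def average_response_time(response_times):
    """t_{rec,k,i}: mean response time of the reconstruction I/Os that disk i
    processed in the previous recovery window (Equation (6))."""
    if not response_times:
        raise ValueError("no reconstruction I/O was observed in the previous window")
    return sum(response_times) / len(response_times)


def reconstruction_bandwidth(block_size, response_times):
    """BW_{k,i} = S / t_{rec,k,i} (Equation (6))."""
    return block_size / average_response_time(response_times)


def disk_reconstruction_time(reads, block_size, response_times):
    """T_{k,j,i} = r_{j,i} * S / BW_{k,i} (Equation (3))."""
    return reads * block_size / reconstruction_bandwidth(block_size, response_times)


# Hypothetical samples: three reconstruction I/Os took 2, 4 and 6 ms.
samples_ms = [2.0, 4.0, 6.0]
t_rec = average_response_time(samples_ms)                 # 4.0 ms per I/O
bandwidth = reconstruction_bandwidth(16.0, samples_ms)    # 4.0 KB per ms
t_disk = disk_reconstruction_time(3, 16.0, samples_ms)    # 12.0 ms for 3 reads

# The block size cancels, so the time reduces to r_{j,i} * t_{rec,k,i}.
assert math.isclose(t_disk, 3 * t_rec)
```

Because $S$ cancels, the scheduler only needs the per-disk read counts and the measured average response times; it never has to compute a bandwidth explicitly.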
According to (3) and (6), we obtain $T_{k,j,i}$ as

$$T_{k,j,i} = r_{j,i} \times t_{rec,k,i} \tag{7}$$

According to Equations (3)–(7), the total reconstruction time ($T_{smart}$) is modeled as

$$T_{smart} = \sum_{k=1}^{Num_{win}} T_k = \sum_{k=1}^{Num_{win}} \min_{j \in R_{cand}} \left\{ \max_{i \in d_s} \{T_{k,j,i}\} \right\} = \sum_{k=1}^{Num_{win}} \min_{j \in R_{cand}} \left\{ \max_{i \in d_s} \{r_{j,i} \times t_{rec,k,i}\} \right\} \tag{8}$$

where $T_{smart}$ is the sum of all reconstruction time slices of SmartRec. Each reconstruction time slice corresponds to one recovery window. The length of a reconstruction time slice varies with the dynamically changing reconstruction bandwidth, which is determined by both the inherent heterogeneity of the disk array and the I/O intensity of user requests.

5.3. Model validation

We introduce reconstruction time ratios to validate the proposed models. The models are validated for heterogeneous RAIDs in both the online and the offline recovery processes. Let $R_{con,smart}$ be the reconstruction time ratio between ConRec and SmartRec. We can express $R_{con,smart}$ as

$$R_{con,smart} = \frac{T_{con}}{T_{smart}} = \frac{\sum_{k=1}^{Num_{win}} Num_{con,k}}{\sum_{k=1}^{Num_{win}} Num_{smart,k}} \tag{9}$$

where $T_{con}$ and $T_{smart}$ are the reconstruction times of ConRec and SmartRec, respectively. In the offline recovery case, the reconstruction bandwidth is a constant that depends on the slowest disk in the heterogeneous RAID. Thus, the entire recovery process adopts a single reconstruction sequence for both ConRec and SmartRec (i.e. $Num_{win} = 1$). In an online recovery, the reconstruction bandwidth is governed by both the changing user-I/O intensity and the bottleneck disk's bandwidth. Thus, the reconstruction process is divided into multiple recovery windows (i.e. $Num_{win} > 1$). The reconstruction I/O load of ConRec is constant during the entire recovery process, because ConRec adopts only one parity chain to rebuild the failed disk. In contrast, the reconstruction sequence in SmartRec is adjusted according to the reconstruction I/O load and the bottleneck disk. Equation (9) indicates that the reconstruction time ratio translates into the ratio of reconstruction I/O loads between the two recovery schemes.
We validate the models in the offline and online scenarios for an RDP-coded heterogeneous storage system with $p = 5$, 7, 11 and 13. In the offline scenario, ConRec and SmartRec each maintain a fixed reconstruction sequence. The reconstruction sequence of ConRec is induced by only one type of parity chain set, whereas that of SmartRec is governed by a hybrid parity chain set. Thus, ConRec places an identical reconstruction I/O load on all the surviving disks it reads; the I/O load of SmartRec, however, is assigned non-uniformly to the surviving disks. For example, in the case of RDP with $p = 5$, the distribution of reconstruction I/Os for ConRec is $\{r_{j,i}\}_{i \in d_s} = \{4, 4, 4, 4, 0\}$ in a stripe, which is induced by reconstruction sequence $RS_j = \{0, 0, 0, 0\}$. In the SmartRec case, the distribution of reconstruction I/Os in a stripe is $\{r_{j,i}\}_{i \in d_s} = \{3, 2, 2, 3, 2\}$, governed by reconstruction sequence $RS_j = \{0, 0, 1, 1\}$. Thus, $R_{con,smart}$ is derived from the ratio of I/Os retrieved from the bottleneck disk according to the theoretical analysis (i.e. $R_{con,smart} = 4/2 = 2$ when $p = 5$).

Recall that in the online scenario, the reconstruction sequence of ConRec is fixed regardless of the changing user-I/O intensity; the sequence of SmartRec, however, varies with the changing I/O intensity among multiple recovery windows. Consider a case where $p = 5$ and $Num_{win} = 5$. The reconstruction I/O load of the bottleneck disk in ConRec is 4 in every window, because its reconstruction sequence is fixed. For SmartRec, the reconstruction I/O load of the bottleneck disk is 2, 3, 3, 2 and 3 in the five recovery windows, respectively. The varying load results from the changing reconstruction bandwidth. From Equation (9), $R_{con,smart} = (4+4+4+4+4)/(2+3+3+2+3) = 20/13 \approx 1.54$.

Figures 5 and 6 compare the reconstruction time ratios derived from our models with the experimental results obtained from the implemented prototypes using real-world disk arrays, for the offline and online scenarios, respectively.
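The two theoretical ratios in the RDP ($p = 5$) example can be reproduced with a short calculation. The helper below is a sketch under the assumption stated by Equation (9), namely that the time ratio reduces to the ratio of bottleneck-disk I/O loads summed over the recovery windows; the function name is hypothetical.

```python
def reconstruction_time_ratio(loads_a, loads_b):
    """Ratio of total bottleneck-disk I/O loads of two schemes (Equation (9)).

    loads_a, loads_b -- per-window bottleneck-disk loads of the two schemes
    """
    return sum(loads_a) / sum(loads_b)


# Offline: a single window, ConRec load 4 versus SmartRec load 2.
offline_ratio = reconstruction_time_ratio([4], [2])

# Online: five windows, ConRec load fixed at 4, SmartRec load 2, 3, 3, 2, 3.
online_ratio = reconstruction_time_ratio([4] * 5, [2, 3, 3, 2, 3])

print(offline_ratio, round(online_ratio, 2))
```

The online ratio is lower than the offline one because, in three of the five windows, the bottleneck disk of SmartRec carries a load of 3 rather than 2.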
We observe that the difference between the theoretical ratios and their experimental counterparts is marginal (i.e. 11% and 12% for the offline and online scenarios, respectively). The experimental results confirm that the proposed models effectively estimate the reconstruction time spent in recovering single-disk failures in disk arrays. Although array codes differ in how their diverse parity chains are constructed, these parity chains share common symbols, which leads to a non-uniform distribution of reconstruction I/O load. In addition, the benefit of our scheme is attributed to this non-uniform distribution of reconstruction I/Os among array codes. Thus, we draw similar conclusions for EVENODD, TP and STAR: reconstruction time ratios can be derived from the I/O load ratio of the bottleneck disk, which depends only on the deployed reconstruction strategies.

Figure 5. Comparisons of offline reconstruction time ratios obtained from the models and empirical studies. (a) $T_{con}/T_{smart}$. (b) $T_{min}/T_{smart}$. (c) $T_{bal}/T_{smart}$.

Figure 6. Comparisons of online reconstruction time ratios obtained from the models and empirical studies. (a) $T_{con}/T_{smart}$. (b) $T_{min}/T_{smart}$. (c) $T_{bal}/T_{smart}$.

In the future, the measurement accuracy of the experimental process needs to be improved to narrow the gap between the theoretical results and their experimental counterparts. To improve the measurement accuracy, we adopt the average response time of reconstruction I/Os to evaluate the I/O capability of surviving disks for retrieving reconstruction I/Os.
We record the historical number of accesses and the total I/O time of reconstruction I/Os to calculate the average response time. We plan to further capture the changing character of I/O capability among surviving disks by analyzing the access behavior of both user I/Os and reconstruction I/Os to decrease the evaluation overhead, by changing the size of recovery windows to improve disk utilization, and by scheduling reconstruction threads more efficiently to improve the measurement accuracy of I/O capability.

6. PERFORMANCE EVALUATION

We implement the proposed SmartRec reconstruction scheme along with the three alternatives (i.e. ConRec, MinRec and BalRec) in a real-world storage server. We conduct a wide range of experiments to quantitatively compare the performance of the four reconstruction schemes.

6.1. Experiment environment

We implement the four prototypes in a high-performance storage server, where all disks are organized in the form of RAID. The server is equipped with two 6-core Intel(R) Xeon(R) E7540@2.00 GHz CPUs and 16 GB of DDR3 main memory. All disks are Seagate ST9300605SS SAS-II disks connected by a MegaRAID SAS 1078 controller with a 512 MB dedicated cache. The operating system is Ubuntu 10.04 LTS x86-64 with Linux kernel 2.6.32. To emulate heterogeneous RAIDs, we vary a disk parameter, i.e. /sys/block/sdx/queue/read_ahead_kb, to create heterogeneity in the I/O abilities of the disks. For example, when we set /sys/block/sdx/queue/read_ahead_kb to 0, the bandwidth of a Seagate ST9300605SS SAS-II disk is approximately 30 MB/s. Table 3 lists the four types of heterogeneous disks used in our experiments. We adopt IOMeter [27] to measure the disk I/O speed, which ranges between 30 MB/s and 150 MB/s. Thus, we implement three heterogeneous RAIDs, namely Conf-A, Conf-B and Conf-C, by changing the disk configuration parameters.
Disk bandwidth ranges between 100 MB/s and 150 MB/s in Conf-A, between 78 MB/s and 150 MB/s in Conf-B, and between 30 MB/s and 150 MB/s in Conf-C.

Table 3. The configuration of heterogeneous disks within RAID.

Heterogeneous disk type | Disk bandwidth (MB/s) | Disk configuration parameter (read_ahead_kb)
Disk-1 | 30 | 0
Disk-2 | 78 | 4
Disk-3 | 100 | 9
Disk-4 | 150 | 128

6.2. Evaluation methodology

We evaluate the performance of the four reconstruction schemes in recovering a single failed disk in disk arrays powered by multiple-fault-tolerant RAID codes: EVENODD [1], RDP [2], STAR [5] and TP [4]. The former two codes are specific to RAID-6 systems, which tolerate exactly two failures; the latter two codes tolerate exactly three failures. The amount of data stored on each disk is set to 10 GB, which is sufficiently large to evaluate the reconstruction times of the tested schemes. Such an amount of tested data, which has been used in other studies [17], can adequately cover the footprint of the evaluated workloads. The reconstruction performance is measured in terms of the completion time spent in reconstructing 10 GB of data with a recovery window of 2 GB. Each experiment is repeated five times and the average reconstruction time is calculated. All the reconstruction schemes are implemented in the C programming language. To replay the foreground user I/Os, we create n threads to assign the user I/Os to the corresponding disks, i.e.
each thread is responsible for replaying user I/Os to its corresponding disk. This multi-threaded strategy makes it possible to record the queue time and service time of user I/Os. To regenerate failed data using the Capability-I/O-Matched recovery solution (see Algorithm 3), we create n threads, each of which is associated with one disk. Of these, n−1 threads read surviving data blocks in parallel into ReadBuffers in memory. The ReadBuffers associated with the surviving disks form a linked list on which XOR operations are performed to regenerate the failed data. The remaining thread writes the XOR results. Thanks to the multi-threaded design, the read, XOR and write operations are pipelined.

To adequately evaluate the improvement of SmartRec over the three alternatives, we conduct comparative experiments in the offline recovery and online recovery scenarios. In evaluating online reconstruction performance, we implement different RAID codes, test various array sizes and stripe unit sizes, and configure I/O heterogeneity in our storage system. We adopt an open-loop model during the online recovery test, where traces are replayed according to the timestamps recorded in the trace files (i.e. I/O arrival rates are independent of I/O request completions [28]). The trace replayer issues I/O requests to the appropriate data chunks according to address mappings. The trace replay tool is RAIDmeter [18], which replays traces at the block level and evaluates the user response time of the storage device. We evaluate the online reconstruction performance using the Web-2 trace [29], which represents read-intensive applications. The Web-2 trace used in our experiments is obtained from the Storage Performance Council [29]. It was collected from a machine running a web search engine.
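The open-loop replay model mentioned above can be illustrated with a short sketch. It is not RAIDmeter's implementation: the function name replay_open_loop, the request tuple layout and the injectable clock and sleep parameters are assumptions made for illustration. The point it shows is that each request is issued at its recorded offset from the start of the trace, independently of whether earlier requests have completed; the issue callback is therefore expected to hand the request off without waiting for completion.

```python
import time
from typing import Callable, Iterable, List, Tuple

# (timestamp in seconds, byte offset, size in bytes); layout is illustrative.
Request = Tuple[float, int, int]


def replay_open_loop(
    requests: Iterable[Request],
    issue: Callable[[int, int], None],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Issue requests at their recorded arrival times (open-loop model).

    Returns, for each request, how late it was issued relative to its
    scheduled time (0.0 when issued on time).
    """
    ordered = sorted(requests)
    if not ordered:
        return []
    trace_start = ordered[0][0]
    wall_start = clock()
    lateness = []
    for timestamp, offset, size in ordered:
        due = wall_start + (timestamp - trace_start)
        delay = due - clock()
        if delay > 0:
            sleep(delay)  # wait for the arrival time, never for a completion
        lateness.append(max(0.0, clock() - due))
        issue(offset, size)  # should return immediately (e.g. enqueue to a worker)
    return lateness
```

Passing a simulated clock and sleep function, as the parameters allow, makes it possible to check the schedule without waiting in real time.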
The Web-2 trace is read-dominated (99.98% read ratio) and exhibits high locality in its access pattern; block request sizes are 8 KB, 16 KB, 24 KB and 32 KB. The Web-2 files record 4 579 809 I/O requests in total. In this paper, we mainly focus on dynamically balancing reconstruction I/Os according to I/O capacity during the recovery process, and only 0.02% of the requests in the Web-2 files are writes. Optimizing read or write operations in I/O workloads has been widely studied in the literature [17, 18]. For fairness and simplicity, we only consider read operations issued by the RAIDmeter replay tool. We also assume that the faulty disk is the first disk in the test system.

6.3. Offline reconstruction performance

To examine the advantage of SmartRec over the ConRec, MinRec and BalRec reconstruction schemes, we implement four RAID codes (i.e. EVENODD, STAR, RDP and TP), each of which is tested with different RAID sizes. Figure 7 shows the experimental results of the offline reconstruction performance comparison.

Figure 7. Offline reconstruction performance comparison. Stripe unit size = 64 KB. (a) EVENODD. (b) STAR. (c) RDP. (d) TP.

We draw two observations from Fig. 7. First, SmartRec consistently outperforms the other three reconstruction schemes regardless of RAID code and RAID size. The performance improvement offered by SmartRec can be attributed to the fact that SmartRec assigns the data reconstruction load in accordance with the I/O ability of the disks during recovery, so that I/O heterogeneity is taken into account. For example, in the case of the TP-coded storage system, where P is set to 7 and the stripe unit size is 64 KB, the recovery times are 434.43 s, 336.88 s, 274.04 s and 238.84 s for ConRec, MinRec, BalRec and SmartRec, respectively.
Second, the reconstruction performance of the four schemes is consistently and marginally degraded as the RAID size increases, mainly because the total amount of data to be read for reconstruction grows with the RAID size. This trend holds even though the amount of data to reconstruct is the same on every disk. For example, in the TP-coded storage system, the recovery times of SmartRec are 207.62 s, 238.84 s and 373.58 s when P is set to 5, 7 and 11, respectively.

6.4. Online reconstruction performance

Compared with the offline recovery scenario, the online recovery environment has external user I/O requests. Hence, the reconstruction bandwidth is determined by the overall disk bandwidth and the bandwidth consumed by external user I/Os. To adequately assess the effectiveness of SmartRec over the other three reconstruction schemes, we conduct online reconstruction performance comparisons from the following four aspects.

6.4.1. Different RAID codes and different RAID sizes

We carry out comparative experiments on storage systems powered by four RAID codes (i.e. EVENODD, STAR, RDP and TP), each of which is tested with different RAID sizes. Figure 8 shows the experimental results of the four reconstruction schemes.

Figure 8. Online reconstruction performance comparison. Stripe unit size = 64 KB. (a) EVENODD. (b) STAR. (c) RDP. (d) TP.

We observe from Fig. 8 that the online reconstruction performance exhibits a trend similar to that of the offline recovery scenario. In other words, SmartRec achieves better reconstruction performance than the other three reconstruction schemes regardless of the RAID code, and the reconstruction performance of all four schemes shows marginal degradation as the RAID size increases.
For example, in the TP-coded storage system where P is 7 and the stripe unit size is 64 KB, the reconstruction times are 521.23 s, 496.28 s, 436.58 s and 385.16 s for ConRec, MinRec, BalRec and SmartRec, respectively. SmartRec is superior to the other three schemes because it chooses a flexible, capability-I/O-matched distribution of reconstruction data in each recovery window (i.e. it minimizes each reconstruction time slice) to minimize the overall reconstruction time. Furthermore, the reconstruction time of the online scenario is longer than that of the offline scenario regardless of RAID code and RAID size. The reason is that the online reconstruction process has to handle the contention between user I/Os and reconstruction I/Os.

6.4.2. User response time

Now we examine the average user response times of the storage systems in which the four reconstruction schemes are running. We focus on a TP-coded system where P is set to 7 and the stripe unit size is 64 KB. We collect the response times of all user I/Os and calculate the average user response time over each 20 s window. These averages form an average user response time series. Figure 9 shows the average user response times of the four reconstruction schemes.

Figure 9. Average user response time comparison. P = 7, stripe unit size = 64 KB.

From Fig. 9, we observe that SmartRec completes the reconstruction process faster than the other three schemes and that the average response time of SmartRec is slightly lower than those of ConRec, MinRec and BalRec during data recovery. For instance, at the 140 s time slot, the average user response time is 7.1 ms, 6.1 ms, 6.2 ms and 5.8 ms for ConRec, MinRec, BalRec and SmartRec, respectively. This is because SmartRec judiciously distributes the reconstruction load among disks so that I/Os are well balanced across all the disks.
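The idea of matching the reconstruction load to disk capability can be illustrated with a small sketch. It is a simplification, not the paper's Algorithm 3: the names bottleneck_time and pick_solution are hypothetical, and the capability figures below are invented values chosen so that the normalized times match the three-disk example in the Introduction (1.0 for the balanced policy versus 0.6 for the heterogeneity-aware one). Because surviving disks are read in parallel, the time of a candidate solution is taken to be that of its slowest disk, and the candidate with the smallest such time is selected.

```python
from typing import Dict, Sequence


def bottleneck_time(reads: Sequence[float], capability: Sequence[float]) -> float:
    """Time of the slowest disk when disk i serves reads[i] blocks
    at capability[i] blocks per time unit."""
    if len(reads) != len(capability):
        raise ValueError("reads and capability must have the same length")
    if any(c <= 0 for c in capability):
        raise ValueError("capability must be positive")
    return max(r / c for r, c in zip(reads, capability))


def pick_solution(candidates: Dict[str, Sequence[float]],
                  capability: Sequence[float]) -> str:
    """Return the name of the candidate with the smallest bottleneck time."""
    return min(candidates, key=lambda name: bottleneck_time(candidates[name], capability))


# Hypothetical per-disk capabilities (blocks per time unit) for disks #1, #2, #3.
capability = [4, 10, 6]
candidates = {
    "balanced-io": [4, 4, 3],          # equal-ish read counts per disk
    "heterogeneity-aware": [2, 6, 3],  # more reads on the faster disk
}
best = pick_solution(candidates, capability)
```

In an online setting the capability vector would be re-estimated for every recovery window, so the selected candidate can change from one window to the next.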
As a result, disks with high I/O capability handle large data volumes and vice versa. Therefore, SmartRec reduces the user I/O queuing delay relative to the three competing schemes, thereby shortening the average user response time.

6.4.3. Different stripe unit sizes

We evaluate the impact of different stripe unit sizes on reconstruction performance. Again, we study the TP-coded system where P is 7. The stripe unit size is set to 16 KB, 32 KB, 64 KB and 128 KB. Figure 10 depicts the comparative experimental results illustrating the impact of the stripe unit size.

Figure 10. Reconstruction performance of the TP-coded storage system under different stripe unit sizes. P = 7.

Figure 10 reveals that the reconstruction performance of the four schemes improves gradually as the stripe unit size goes up. For example, when the stripe unit size is set to 16 KB, 32 KB, 64 KB and 128 KB, the reconstruction time of SmartRec is 715.64 s, 586.48 s, 385.16 s and 324.50 s, respectively. This trend is reasonable, because a large stripe unit size leads to a reduced number of I/O accesses, which in turn helps to improve access sequentiality. Moreover, SmartRec performs better in terms of data reconstruction than the other schemes regardless of the stripe unit size.

6.4.4. Different heterogeneous configurations

To assess the sensitivity of the four reconstruction schemes running in heterogeneous RAIDs, we evaluate the TP-coded storage system, where P is 7 and the stripe unit size is 64 KB. We conduct comparative experiments on three different heterogeneous configurations, namely, Conf-A (100 MB/s–150 MB/s), Conf-B (78 MB/s–150 MB/s) and Conf-C (30 MB/s–150 MB/s). Figure 11 depicts the experimental results of the reconstruction schemes on heterogeneous RAIDs.

Figure 11. Reconstruction performance comparison under different heterogeneous configurations. P = 7 and stripe unit size = 64 KB.

Figure 11 illustrates that SmartRec consistently outperforms the other three reconstruction schemes under all the heterogeneous configurations. We also observe that the speedup in reconstruction performance gradually increases as disk heterogeneity rises. For example, in the Conf-A, Conf-B and Conf-C cases, SmartRec improves reconstruction performance by 18.8%, 25% and 35.3% over the ConRec scheme, respectively. RAID systems with high heterogeneity gain more benefit from SmartRec than those with low heterogeneity, because SmartRec is tailored for heterogeneous recovery environments and considers both the reconstruction data distribution and the I/O capabilities across all disks.

7. CONCLUSION

In this paper, we have focused on the issue of single-disk failure recovery in parallel RAID-coded storage systems that tolerate multiple failures. We showed that existing single-failure recovery schemes such as ConRec, MinRec and BalRec speed up the recovery process of storage systems without addressing heterogeneity in RAID environments. We solved this problem by proposing a heterogeneity-aware single-disk failure recovery scheme called SmartRec, which improves recovery performance by periodically selecting an appropriate reconstruction solution that retrieves surviving data based on the I/O capabilities of the surviving disks. We built recovery-time models for the four reconstruction schemes and validated the correctness of the four models using empirical evaluations. We compared SmartRec against the three alternatives in online data reconstruction tests by replaying real-world workloads under various disk configurations.
The experimental results show that our SmartRec recovery scheme outperforms the three existing recovery schemes in terms of reconstruction time and user I/O performance. For example, in the online TP-coded heterogeneous RAIDs with 9 disks, SmartRec improves reconstruction performance over the three existing alternatives by up to 35.3% with an average of 25.8%. As a future research direction, we plan to improve the measurement accuracy of reconstruction I/O capability among surviving disks by incorporating varied window granularity, and to further investigate the impact of storage heterogeneity on reconstruction I/O flows.

FUNDING

This work is supported in part by the National Science Foundation of China under Grant nos. 61572209, 61472152 and 61762075, and the Fundamental Research Funds for the Central Universities under HUST: 2015MS006. This work is also supported by Key Laboratory of Information Storage System. Xiao Qin's work was supported by the U.S. National Science Foundation under Grants CCF-0845257 (CAREER Award), CNS-0917137 and CCF-0742187, and by the 111 Project under Grant B07038. Meanwhile, it is also supported by the Chunhui planning project of the Ministry of Education under Grant no. Z2015059, the Province Science Foundation of QingHai under Grant nos. 2016-ZJ-920Q, 2014-ZJ-908, 2016-ZJ-739 and 2015-ZJ-718, and the commercialization of research findings project of QingHai under Grant no. 2016-SF-130. This work is also supported by Key Laboratory of IoT of QingHai Province under Grant no. 2017-ZJ-Y21, and Society Science Foundation of China under Grant no. 15XMZ057.

REFERENCES

[1] Blaum, M., Brady, J., Bruck, J. and Menon, J. (1995) EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput., 44, 192–202.
[2] Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J. and Sankar, S. (2004) Row-Diagonal Parity for Double Disk Failure Correction. Proc. 3rd USENIX Conf. File and Storage Technologies, Berkeley, CA, USA, FAST'04, pp. 1–14. USENIX Association.
[3] Xu, L. and Bruck, J. (1999) X-code: MDS array codes with optimal encoding. IEEE Trans. Inf. Theory, 45, 272–276.
[4] Corbett, P.F. and Goel, A. (2011) Triple Parity Technique for Enabling Efficient Recovery from Triple Failures in a Storage Array. US Patent 8,010,874.
[5] Huang, C. and Xu, L. (2008) STAR: an efficient coding scheme for correcting triple storage node failures. IEEE Trans. Comput., 57, 889–901.
[6] Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003) The Google file system. ACM SIGOPS Operating Systems Review, pp. 29–43. ACM Association.
[7] Pinheiro, E., Weber, W.-D. and Barroso, L.A. (2007) Failure Trends in a Large Disk Drive Population. Proc. 5th USENIX Conf. File and Storage Technologies, Berkeley, CA, USA, FAST'07, pp. 17–28. USENIX Association.
[8] Wang, Z., Dimakis, A.G. and Bruck, J. (2010) Rebuilding for array codes in distributed storage systems. GLOBECOM Workshops (GC Wkshps), 2010 IEEE, pp. 1905–1909. IEEE.
[9] Khan, O., Burns, R., Plank, J. and Huang, C. (2011) In Search of I/O-Optimal Recovery from Disk Failures. 3rd USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage'11, Portland, OR, USA, June 14, 2011, pp. 6–11. USENIX Association.
[10] Zhu, Y., Lee, P.P.C., Hu, Y., Xiang, L. and Xu, Y. (2012) On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice. IEEE 28th Symp. Mass Storage Systems and Technologies, MSST 2012, April 16–20, 2012, Asilomar Conference Grounds, Pacific Grove, CA, USA, pp. 1–12. IEEE.
[11] Xiang, L., Xu, Y., Lui, J.C.S. and Chang, Q. (2010) Optimal Recovery of Single Disk Failure in RDP Code Storage Systems. SIGMETRICS 2010, Proc. 2010 ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems, New York, USA, 14–18 June 2010, pp. 119–130. ACM.
[12] Xiang, L., Xu, Y., Lui, J.C.S., Chang, Q., Pan, Y. and Li, R. (2011) A hybrid approach to failed disk recovery using RAID-6 codes: algorithms and performance evaluation. TOS, 7, 11:1–11:34.
[13] Xu, S., Li, R., Lee, P., Zhu, Y., Xiang, L., Xu, Y. and Lui, J. (2013) Single disk failure recovery for X-code-based parallel storage systems. IEEE Trans. Comput., 63, 995–1007.
[14] Luo, X. and Shu, J. (2013) Load-Balanced Recovery Schemes for Single-Disk Failure in Storage Systems with Any Erasure Code. 42nd Int. Conf. Parallel Processing, ICPP 2013, Lyon, France, October 1–4, 2013, pp. 552–561. IEEE.
[15] Schroeder, B. and Gibson, G.A. (2007) Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? 5th USENIX Conf. File and Storage Technologies, FAST 2007, February 13–16, 2007, San Jose, CA, USA, pp. 1–16. USENIX Association.
[16] Drapeau, A.L. et al. (1994) RAID-II: A High-Bandwidth Network File Server. Proc. 21st Annual Int. Symp. Computer Architecture, Chicago, IL, USA, April 1994, pp. 234–244. IEEE.
[17] Wu, S., Jiang, H., Feng, D., Tian, L. and Mao, B. (2009) WorkOut: I/O Workload Outsourcing for Boosting RAID Reconstruction Performance. 7th USENIX Conf. File and Storage Technologies, February 24–27, 2009, San Francisco, CA, USA. Proceedings, pp. 239–252. USENIX Association.
[18] Tian, L., Feng, D., Jiang, H., Zhou, K., Zeng, L., Chen, J., Wang, Z. and Song, Z. (2007) PRO: A Popularity-Based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems. 5th USENIX Conf. File and Storage Technologies, FAST 2007, February 13–16, 2007, San Jose, CA, USA, pp. 277–290. USENIX Association.
[19] Wan, S., Cao, Q., Huang, J., Li, S., Li, X., Zhan, S., Yu, L., Xie, C. and He, X. (2011) Victim Disk First: An Asymmetric Cache to Boost the Performance of Disk Arrays under Faulty Conditions. Proc. 2011 USENIX Annual Technical Conference, pp. 13–25. USENIX Association.
[20] Khan, O., Burns, R.C., Plank, J.S., Pierce, W. and Huang, C. (2012) Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. Proc. 10th USENIX Conf. File and Storage Technologies, FAST 2012, San Jose, CA, USA, February 14–17, 2012, pp. 1–20. USENIX Association.
[21] Xie, T. and Wang, H. (2008) Micro: a multilevel caching-based reconstruction optimization for mobile storage systems. IEEE Trans. Comput., 57, 1386–1398.
[22] Zhu, Y., Lee, P.P.C., Xiang, L., Xu, Y. and Gao, L. (2012) A Cost-Based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes. IEEE/IFIP Int. Conf. Dependable Systems and Networks, DSN 2012, Boston, MA, USA, June 25–28, 2012, pp. 1–12. IEEE.
[23] Luo, H., Huang, J., Cao, Q. and Xie, C. (2014) LaRS: A Load-Aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters. 9th IEEE Int. Conf. Networking, Architecture, and Storage, NAS 2014, Tianjin, China, August 6–8, 2014, pp. 168–175. IEEE.
[24] Reed, I.S. and Solomon, G. (1960) Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math., 8, 300–304.
[25] Plank, J.S., Luo, J., Schuman, C.D., Xu, L. and Wilcox-O'Hearn, Z. (2009) A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage. 7th USENIX Conf. File and Storage Technologies, February 24–27, 2009, San Francisco, CA, USA. Proceedings, pp. 253–265. USENIX Association.
[26] Hafner, J.L., Deenadhayalan, V., Rao, K.K. and Tomlin, J.A. (2005) Matrix Methods for Lost Data Reconstruction in Erasure Codes. Proc. FAST'05 Conf. File and Storage Technologies, December 13–16, 2005, San Francisco, California, USA, pp. 1–14. USENIX Association.
[27] Scheibli, D., Eiler, J. and Randall, R. Iometer: I/O Subsystem Measurement and Characterization Tool. http://www.iometer.org.
[28] Schroeder, B., Wierman, A. and Harchol-Balter, M. (2006) Open Versus Closed: A Cautionary Tale. 3rd Symp. Networked Systems Design and Implementation (NSDI 2006), May 8–10, 2006, San Jose, California, USA, Proceedings, pp. 1–18.
[29] Marc Liberatore. OLTP Application I/O and Search Engine I/O. http://traces.cs.umass.edu/index.php/Storage/Storage.

Author notes
Handling editor: Antonio Fernandez Anta
© The British Computer Society 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
The Computer Journal, Oxford University Press

Publisher
Oxford University Press
Copyright
© The British Computer Society 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
0010-4620
eISSN
1460-2067
D.O.I.
10.1093/comjnl/bxx106

Abstract

Abstract It is not uncommon for reconstruction I/Os to encounter workload fluctuation in heterogeneous RAID-coded storage systems. This paper proposes a heterogeneity-aware single-failure recovery scheme—SmartRec—to tolerate double and multiple disk failures in RAIDs. We start this study by formulating the data recovery problem of single-disk failures in form of an optimization function in the context of online and heterogeneous disk arrays. To take both static heterogeneity associated with disk configurations and dynamic heterogeneity affected by I/O loads into account, SmartRec periodically selects an appropriate reconstruction solution according to up-to-date disk utilization. The appropriate reconstruction solution indicates the amount of data being retrieved across surviving disks and is expected to achieve minimal recovery time, which is induced by both candidate reconstruction sequences and reconstruction I/O capability of surviving disks. We build a response-time model in SmartRec to measure the reconstruction I/O capability of surviving disks during a recovery process. To quantitatively compare the SmartRec scheme against three alternatives (i.e. ConRec, MinRec and BalRec), we build four analytical models and validate the correctness of the four models using empirical evaluations. We implement the four reconstruction schemes in a heterogeneous RAID, and carry out comparative online reconstruction tests by replaying real-world workloads under various configurations. The experimental results illustrate that our SmartRec scheme outperforms the three existing reconstruction schemes in terms of reconstruction time by up to 35.3% with an average of 25.8%. 1. INTRODUCTION Owing to both volume scalability and I/O parallelism, disk arrays have been widely deployed in modern data centers. Different from single-fault–tolerant RAID-4 and RAID-5 layouts, many double- and triple-fault–tolerant RAID layouts were proposed to guarantee higher data reliability. 
For example, EVENODD [1], RDP [2] and X-Code [3] codes tolerate double failures; whereas TP [4] and STAR [5] tolerate triple failures. Disk failures are a norm event in large-scale storage systems [6]. Furthermore, single-disk failure plays a dominated role when disks fail in RAID-coded storage; for example, evidence shows that about 99.75% of recoveries are carried out for single-disk failures [7]. Fast online recovery from disk failures is very crucial to OLTP applications (e.g. Financial transaction systems, retail sale systems, etc.), which pose requirements of both I/O performance and data availability to underlying storage subsystems. Various reconstruction optimization schemes were proposed to recover single-disk failures [8–13]. These proposed reconstruction schemes seek the optimal recovery solution by exploiting several reconstruction sequences when a single-disk fails in RAID-coded storage systems that tolerate double and multiple concurrent failures. Minimizing the number of reconstruction I/Os is an efficient approach to accomplishing high-speed recovery. Reconstruction I/Os are induced by retrieving surviving blocks from surviving disks to reconstruct lost blocks. The RDOR algorithm [11] is an optimal disk recovery scheme for single-disk failures of RAIDs using RDP codes, achieving the smallest number of disk reads; two similar recovery schemes were proposed for single-disk recovery in RAIDs powered by EVENODD code and X-Code [8, 13]. An enumeration recovery algorithm is proposed to discover an optimal I/O recovery for single-disk failures in general XOR-based erasure codes [9]. The enumeration recovery can minimize read reconstruction I/Os at the cost of exponential computational overhead in searching for a recovery solution. To speedup the search process, a replace recovery scheme was developed to offer near-I/O-optimal recovery [10]. 
The aforementioned recovery schemes attempt to optimize reconstruction performance by minimizing read reconstruction I/Os. However, these schemes ignore the distribution of reconstruction I/Os across surviving disks. Intuitively, the total recovery time is determined by the time spent in reading data from a disk with the heaviest load in RAID-coded storage systems. To balance read reconstruction I/Os, Luo et al. proposed two recovery algorithms (i.e. C-algorithm and U-algorithm) to minimize read reconstruction I/Os from each surviving disk while balancing read reconstruction I/Os among surviving disks [14]; Xiang et al. designed an extended hybrid recovery approach to balancing the number of blocks being read among surviving disks [12]. Unfortunately, balanced read reconstruction I/Os do not necessarily lead to minimized recovery time. A disk’s I/O capability available to reconstruction is not only restricted by its inherent bandwidth but also influenced by user I/O load. Thus, the I/O capability of an involved disk is fluctuated during the entire reconstruction period. That is, there is an issue of disk heterogeneity in RAID reconstruction. In this paper, we propose a ‘Heterogeneity-aware’ reconstruction scheme—SmartRec—by making the number of reconstruction reads from a surviving disk to be matched with I/O capability of the disk. We use a concrete example to illustrate that it is necessary to take disk heterogeneity into consideration to improve reconstruction performance. Figure 1 shows the normalized reading time using ‘Balanced-I/O’ and ‘Heterogeneity-aware’ recovery policies. The balanced-I/O policy balances read reconstruction I/Os among surviving disks (e.g. {4, 4, 3} reads on disks {#1, #2, #3}, respectively). With the Heterogeneity-aware policy in place, the I/O capability of all the surviving disks can be fully utilized during the entire reconstruction process. We observe from Fig. 
1 that the normalized reading time of the ‘Balanced-I/O’ and ‘Heterogeneity-aware’ policies are 1.0=max{1.0,0.4,0.5} and 0.6=max{0.5,0.6,0.5}, respectively. In a word, it is expected to carry out a fast reconstruction when reconstruction I/Os are assigned to disks according to the disk’s I/O capability. Figure 1. View largeDownload slide Comparison of reading time between balanced-I/O recovery policy and heterogeneity-aware recovery policy. (a) I/O capability of three surviving disks; (b) number of reconstruction reads; (c) reading time. Figure 1. View largeDownload slide Comparison of reading time between balanced-I/O recovery policy and heterogeneity-aware recovery policy. (a) I/O capability of three surviving disks; (b) number of reconstruction reads; (c) reading time. The design of recovery schemes for online and heterogeneous disk arrays is still an open problem. There are two factors in making heterogeneous reconstruction I/Os: (1) Disk configuration, to maintain and upgrade disk arrays [15], new disks from different vendors may be appended into existing RAIDs, where the new and old disks are heterogeneous in nature; (2) I/O load fluctuation, given homogeneous disks in a disk array, imbalanced user I/Os issued among disks make I/O capacity of the disks heterogeneous from the perspective of reconstruction I/Os. To the best of our knowledge, little research effort has been directed towards enabling reconstruction schemes to handle reconstruction I/O heterogeneity in surviving disks. To reduce reconstruction time while alleviating performance degradation of user I/Os, our SmartRec scheme attains a flexible and pertinent recovery strategy to match the reconstruction capacity that is dynamically changing in heterogeneous surviving disks. 
Specifically, we periodically select an appropriate reconstruction solution according to up-to-date disk utilization within each recovery window, which achieves minimal reconstruction time induced by both candidate recovery sequences and I/O capability of involved surviving disks. We build a model to estimate the average response time of reconstruction I/Os governed by SmartRec; we apply this response-time model to measure the reconstruction I/O capability of surviving disks during a recovery process. We formulate the reconstruction problem of single-disk failures in the form of an optimization function in the context of online and heterogeneous disk arrays. We implement the SmartRec scheme as well as the other three existing schemes in a heterogeneous RAID. We carry out comparative online reconstruction tests by replaying real-world workloads under a wide variety of configurations. The experimental results illustrate that our SmartRec scheme significantly outperforms the three existing reconstruction schemes in terms of reconstruction time by up to 35.3% with an average of 25.8%. The contributions of this paper are summarized as follows: We propose a heterogeneous-aware reconstruction scheme (i.e. SmartRec) to recover single-failure disk’s data in parallel RAID-coded storage systems that tolerate concurrent multiple-disk failures. We build a response-time model for SmartRec to quantify the reconstruction I/O capability of surviving disks during a recovery process. We develop four analytical models to theoretically compare SmartRec against the three alternative schemes (i.e. ConRec, MinRec and BalRec), and validate the correctness of the four models using empirical evaluations. We implement SmartRec and the three existing reconstruction schemes in a heterogeneous RAID, and the experimental results illustrate that our SmartRec scheme outperforms them in terms of reconstruction time and user response time. The rest of this paper is organized as follows. 
Section 2 briefly overviews the related work. Erasure-coded RAID is introduced in Section 3. Section 4 details the design issues of SmartRec. The analytical models for the four reconstruction schemes are presented in Section 5. The comparative experiments are given in Section 6. Finally, we conclude our work in Section 7.

2. RELATED WORK

Data recovery for erasure-coded storage systems has drawn much attention over the years [16]; a number of recovery techniques have been proposed to improve recovery performance from various perspectives [10, 12, 17–20]. In this section, we overview existing single-failure recovery techniques for multiple-fault-tolerant RAID-coded storage systems. We classify these techniques into the following three categories.

Conventional Recovery Scheme (ConRec): In ConRec, all symbols of a single failed disk are reconstructed using only a set of independent parity chains (see Section 3), which share no common symbol. All surviving symbols in the parity chain set are retrieved to reconstruct the failed disk, i.e. reconstruction I/Os involve all symbols of each surviving disk. For example, the naive recovery described in the literature [1–3] adopts the ConRec scheme for single-disk failure reconstruction.

Minimal Recovery Scheme (MinRec): In MinRec, all lost symbols of a single failed disk are regenerated by a set of diverse parity chains (a.k.a. hybrid parity chains), which share common symbols. Reconstruction I/Os therefore include only a part of the surviving symbols, which improves reconstruction performance; however, the distribution of reconstruction I/Os in MinRec may be non-uniform across surviving disks, which can lower reconstruction performance. To minimize read reconstruction I/Os, Xiang et al. [12] and Wang et al. [8] proposed optimal single-disk failure reconstruction schemes for RDP and EVENODD, respectively. For any RAID code, Khan et al.
[20] designed an enumeration recovery algorithm for single-disk failures, whereas Zhu et al. [10] developed a replace recovery algorithm that generates a near-optimal recovery scheme and speeds up the search relative to the enumeration recovery scheme.

Balanced Recovery Scheme (BalRec): In BalRec, a single failed disk is reconstructed by hybrid parity chains; this reconstruction strategy is similar to that of MinRec. Unlike MinRec, which minimizes the read reconstruction I/Os on surviving disks, BalRec balances read reconstruction I/Os among surviving disks. Because RAID accesses disks in parallel, Luo et al. [14] argued that the critical factor for recovery performance is the read reconstruction I/O load of the most heavily loaded disk (i.e. the bottleneck disk). To address the non-uniform data distribution induced by minimizing read reconstruction I/Os, Luo et al. proposed two improved reconstruction algorithms, the C-algorithm and the U-algorithm, which balance the read reconstruction I/O load on surviving disks.

Some of the above-mentioned recovery schemes are summarized in Table 1. Unlike the existing balance-oriented schemes, which balance the amount of data retrieved from each surviving disk, SmartRec aims to balance the workload of surviving disks: the number of data blocks retrieved from a surviving disk depends on that disk’s reconstruction I/O capability, which changes dynamically with I/O intensity, disk I/O schedulers and disk hardware configurations.

Table 1. Comparison of RAID reconstruction schemes.
Schemes | Codes | Balance | Metric
RDOR [12] | RDP | Yes (a) | min-reconstruction I/O
Wang et al. [8] | EVENODD | No | min-reconstruction I/O
Khan et al. [20] | RAID codes | No | min-reconstruction I/O
Zhu et al. [22] | RAID codes | No | transmission latency
Luo et al. [14] | RAID codes | Yes (a) | bal-reconstruction I/O
SmartRec | RAID codes | Yes (b) | reconstruction time
(a) Existing load-balancing reconstruction schemes ensure that read reconstruction I/Os across surviving disks are equal or close; these techniques ignore disk processing capability, which is affected by user I/Os. In some sense, the capability of disks can be considered equal in offline reconstructions, where there is no user I/O.
(b) SmartRec considers dynamically changing disk capability (i.e. a higher disk capability indicates that a disk can process more reconstruction I/Os). The disk capability is affected by two types of factors: (1) static factors: disk configurations (e.g. disk models and access patterns); (2) dynamic factors, which depend on user I/Os.

Apart from optimizing reconstruction performance by selecting among combinations of recovery chains, other studies propose different techniques for failure recovery in storage systems. Pro [18] integrates user I/O access locality into the reconstruction process to optimize the reconstruction sequence; Workout [17] improves reconstruction time and user response time by outsourcing users’ workloads during recovery; and Wan et al. [19] and Xie et al. [21] improve recovery performance through better cache utilization. In this paper, we focus on single-failure recovery in parallel heterogeneous RAID-coded storage systems. Targeting the heterogeneous recovery problem, Zhu et al. [22] developed a cost-based heterogeneous recovery algorithm called CHR for heterogeneous networked storage systems, and Luo et al. [23] proposed the LaRS scheme to optimize failure recovery in heterogeneous erasure-coded storage clusters. SmartRec differs from CHR and LaRS in the following four respects.
First, CHR and LaRS focus on the heterogeneity of network transmission bandwidth in networked storage systems, whereas SmartRec addresses the heterogeneity of disk bandwidth in RAID systems. We design SmartRec for RAID systems because we observe that the heterogeneity problem of data recovery in RAID systems lies in disks rather than network resources. Second, the recovery cost model of SmartRec differs from that of CHR and LaRS: CHR and LaRS use coarse-grained models for a static environment, whereas SmartRec uses a fine-grained model that incorporates the concept of a recovery window to quantify recovery cost in a dynamic environment. Third, CHR and LaRS pay no attention to user response time, whereas SmartRec employs a real-time monitor to keep track of user and disk request response times. We investigate the impact of SmartRec on user requests in various dynamic scenarios. Last, CHR was implemented at the file-system level using NCFS, whereas we implement SmartRec at the block level to further speed up recovery. LaRS is implemented in an RS-coded [24] storage cluster, while SmartRec is implemented in a RAID-coded storage system (i.e. one that uses only XOR operations).

3. ERASURE-CODED RAID

In this study, we focus on RAID codes computed using XOR-based operations. Throughout this paper, we adopt notation similar to that used in [25] for erasure codes. Plank et al. [25] evaluated the performance of five different types of erasure codes, compared their encoding and decoding performance, and demonstrated which features and parameters lead to good coding performance. For instance, an optimization called Code-Specific Hybrid Reconstruction [26] is necessary to achieve good decoding speeds in many of the codes. These results give storage system designers an idea of what to expect in terms of coding performance when designing their storage systems.
Data stored in an erasure-coded storage system is partitioned into fixed-size stripes, each of which is a two-dimensional array with a fixed number of rows and n columns. Each column is called a strip; it holds w symbols and corresponds to a unique disk. Each parity strip is encoded using one strip from each data disk, and the collection of k+m strips that are encoded together is called a stripe. Each stripe stores a fixed number n×w of blocks. Stripes are independently encoded in such a way that data and parity strips are rotated among disks for balanced load [25].

Let us take the TP code as an example to illustrate the construction mechanisms of RAID codes tolerating multiple-disk failures. Figure 2 shows that the TP code is defined in a (p−1)-row-by-(p+2)-column matrix, where p>2 is a prime number. The first p−1 disks (i.e. data disks) hold all data blocks, whereas the last three disks (i.e. parity disks) store all parity blocks. Let C_{i,j} denote the block in the ith row and the jth column. There are three types of parity blocks, namely row parity blocks, diagonal parity blocks and anti-diagonal parity blocks. A parity chain is an independent minimal fault-tolerant unit. There are three kinds of parity chains (i.e. row parity chains, diagonal parity chains and anti-diagonal parity chains); each parity chain is a set of blocks containing a parity block and the corresponding data blocks. Within a parity chain, a single failed block can be reconstructed from the surviving blocks. To facilitate the analysis, we assume that blocks drawn with the same shape (see Fig. 2) form a parity chain. For example, if block C_{0,0} fails, it is reconstructed from the surviving blocks {C_{0,1}, C_{0,2}, C_{0,3}, C_{0,4}} of its row parity chain.

Figure 2. Data layout of TP code in a stripe. (a) Row parity chain set. (b) Diagonal parity chain set. (c) Anti-diagonal parity chain set.

A TP-coded storage system has p+2 disks denoted by d_0, d_1, …, d_{p+1}. Without loss of generality, we let disks d_{p−1}, d_p and d_{p+1} hold the row, diagonal and anti-diagonal parity blocks, respectively. Let d_f be the failed disk, where 0≤f≤p−2. We suppose that reconstruction operations read r_i blocks from the ith surviving disk in a recovery sequence. The jth reconstruction sequence RS_j = {y_i}_{0≤i≤p−2} specifies how each block of failed disk d_f is reconstructed: y_i=0 indicates that the ith block of d_f is reconstructed by a row parity chain; y_i=1 means that it is reconstructed by a diagonal parity chain; and y_i=−1 means that it is reconstructed by an anti-diagonal parity chain. Various strategies that speed up the search for reconstruction sequences can be found in the literature [9, 10].

Each reconstruction sequence corresponds to a reconstruction I/O distribution. Among all available reconstruction sequences, SmartRec chooses those whose I/O distributions induce minimal reconstruction I/Os [20] as candidate recovery solutions R_cand (i.e. R_cand = {RS_1, RS_2, …, RS_j}). Once the jth reconstruction sequence RS_j is determined, we obtain the associated reconstruction I/O distribution {r_{j,i}}_{i∈d_s}, where d_s ∈ {0,…,f−1,f+1,…,p+1}. For example, if disk ‘0’ fails (see Fig. 2) and RS_j = {1,1,0,−1}, then the reconstruction I/O distribution is {0,2,3,3,2,2,1}, where the leading 0 corresponds to the failed disk itself.

4. DESIGN ISSUES

4.1. The overview of SmartRec

SmartRec aims to optimize the reconstruction process by dynamically balancing the reconstruction load against the I/O capability of surviving disks. In heterogeneous disk arrays, different types of disks may exhibit different access patterns, strategies and I/O capabilities. Reconstruction time, of course, is determined by the slowest surviving disk (i.e.
bottleneck disk), which takes the most time to retrieve data for recovery. In addition, for online storage systems, user I/O intensity changes dynamically during data recovery [18]. Thus, in heterogeneous disk arrays the reconstruction bandwidth varies with the changing I/O intensity from one recovery window to the next during an online recovery process. SmartRec assigns recovery load according to the I/O capability of surviving disks, thereby taking full advantage of the dynamically changing reconstruction bandwidth across recovery windows.

Figure 3 depicts a sample data reconstruction process governed by SmartRec. During data recovery, the regenerated data of the failed disk is placed on a replacement disk, which holds both reconstructed units and units that still need reconstruction. The data-recovery process is divided into multiple fixed-size recovery windows (e.g. K−1, K and K+1). Recovering the failed units of each recovery window requires retrieving different amounts of data from the surviving disks. We refer to such a recovery data assignment as the reconstruction I/O distribution across surviving disks (see Section 3 for details). The I/O distribution retrieved from surviving disks depends on capability-I/O-based candidate recovery sequences (e.g. High, Medium and Low), which indicate different amounts of data being retrieved for reconstruction. For example, ‘High’ means that the amount of data being retrieved best matches the I/O capability of the surviving disks involved in the current recovery window; in this case, the reconstruction time is minimized. We refer to this recovery sequence as a capability-I/O-matched recovery solution, e.g. RS_match = {0,1,0,1,1,−1}. The capability-I/O-matched recovery solutions are derived from both the candidate recovery sequences and the I/O capability of the disks (i.e. the reconstruction bandwidth of each disk).
Each recovery window corresponds to multiple candidate recovery sequences, among which SmartRec attempts to select the capability-I/O-matched recovery solution. For instance, assume that in the current Kth recovery window the I/O capabilities of surviving disks {#1, #2, #3, #4, #5, #6, #7, #8} are 38.32, 9.63, 40.51, 16.01, 16.87, 9.37, 6.45 and 16.05 ms/MB, respectively. For a TP-coded storage system with an array of 9 disks (i.e. p=7), the candidate recovery sequences and their reconstruction I/O distributions are:
RS_1 = {0,1,0,1,1,−1}, with distribution {3,3,4,3,4,4,3,1};
RS_2 = {0,1,−1,−1,0,−1}, with distribution {3,3,4,3,4,4,1,3};
RS_3 = {0,−1,0,−1,−1,1}, with distribution {4,4,3,4,3,3,1,3};
RS_4 = {0,−1,1,1,0,1}, with distribution {4,4,3,4,3,3,3,1}.

Figure 3. The snapshot of a data reconstruction process in SmartRec.

Recall that SmartRec seeks the minimal reconstruction time induced by both the candidate recovery sequences and the I/O capability of the surviving disks involved. For each candidate sequence, the reconstruction time is governed by its slowest disk, i.e. by the largest product of a disk’s reconstruction I/O count and its I/O capability; the minimal reconstruction time is the smallest of these values across all candidates. Here, RS_1 and RS_2 are bounded by disk #3 (4 × 40.51 = 162.04), whereas RS_3 and RS_4 are bounded by disk #1 (4 × 38.32 = 153.28). Thus, recovery sequence RS_3 = {0,−1,0,−1,−1,1}, which achieves the minimal reconstruction time, is selected as the capability-I/O-matched recovery solution.

SmartRec schedules reconstruction threads using a capability-I/O-based time-sharing policy. The reconstruction thread for a recovery window is activated as soon as a capability-I/O-matched solution is selected among all the candidate recovery sequences. A recovery window corresponds to a reconstruction time slice (e.g. K−1 is associated with T_{k−1}, K with T_k, and K+1 with T_{k+1}).
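The selection in the example above can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `bottleneck_time` and `select_matched` are hypothetical and do not come from the SmartRec implementation, and the cost of a candidate is taken to be the largest per-disk product of reconstruction reads and I/O capability, as the text describes. Ties are resolved in favour of the first candidate.

```python
# Illustrative sketch only; function names are hypothetical, not SmartRec's API.

def bottleneck_time(reads, capability):
    """Largest per-disk cost (reads * time per unit) for one candidate sequence."""
    return max(r * t for r, t in zip(reads, capability))


def select_matched(candidates, capability):
    """Return (index, cost) of the first candidate with the smallest bottleneck cost."""
    costs = [bottleneck_time(reads, capability) for reads in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


# I/O capability of surviving disks #1..#8 in the K-th window (ms/MB, from the text).
capability = [38.32, 9.63, 40.51, 16.01, 16.87, 9.37, 6.45, 16.05]

# Reconstruction I/O distributions of RS_1..RS_4 (from the text).
candidates = [
    [3, 3, 4, 3, 4, 4, 3, 1],
    [3, 3, 4, 3, 4, 4, 1, 3],
    [4, 4, 3, 4, 3, 3, 1, 3],
    [4, 4, 3, 4, 3, 3, 3, 1],
]

index, cost = select_matched(candidates, capability)
print(f"RS_{index + 1} selected, bottleneck cost {cost:.2f}")
```

With these inputs, RS_3 and RS_4 have the same bottleneck cost; the sketch keeps the first of the two, which agrees with the sequence chosen in the text.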
The size of corrupted units is fixed in a recovery window; however, the reconstruction time slice varies with the changing reconstruction bandwidth across recovery windows. The overarching goal of SmartRec is to minimize the reconstruction time slices. It is worth noting that the size of a reconstruction time slice in SmartRec is associated with the number of candidate recovery sequences. A capability-I/O-matched recovery solution that minimizes the reconstruction time slice is expected to be chosen in each recovery window.

Figure 4 depicts the I/O-capability selection strategy across multiple reconstruction time slices (i.e. T_{k−1}, T_k and T_{k+1}). To measure the current I/O capability of the surviving disks involved (e.g. disk i and disk i+1), we collect statistics of historical access times during each recovery window. Such measurements are important because the actual I/O capability of the next window is unknown during the current window. Therefore, the reconstruction bandwidth of the previous recovery window is used to determine a predicted capability-I/O-matched reconstruction solution for the current recovery window (see Equation (6)). For instance, for the current (K−1)th window (i.e. the T_{k−1} reconstruction time slice), the I/O capability is derived from the (K−2)th window. Similarly, the I/O capability of the kth recovery window is governed by that of the (k−1)th recovery window.

Figure 4. The I/O-capability selection strategy among multiple recovery windows.

4.2. The implementation details

In this section, we discuss the design issues of SmartRec for online single-disk failure recovery in heterogeneous disk arrays. We present design principles and algorithm descriptions. The notation and parameters used throughout this paper can be found in Table 2.
The SmartRec architecture consists of three key components, namely the access monitor (AM), the reconstruction scheduler (RS) and the reconstruction executor (RE).

Table 2. The notation used in the recovery models.
Parameters | Meaning
S | Size of a block unit
BW_{rec,k} | Reconstruction bandwidth of the bottleneck disk in the kth recovery window
BW_{k,i} | Reconstruction bandwidth of the ith disk during the kth recovery window
r_{j,i} | I/Os retrieved from the ith disk in the jth recovery sequence
R_cand | Set of candidate recovery sequences
Num_win | Number of recovery windows that still need reconstruction
Num_{rec,k−1,i} | Number of processed reconstruction I/Os on the ith disk during the (k−1)th recovery window
T_con, T_min, T_bal, T_smart | Total reconstruction time of ConRec, MinRec, BalRec and SmartRec
t_l | Response time of the lth processed reconstruction I/O
t_{rec,k,i} | Average response time of reconstruction I/Os on the ith disk during the kth recovery window, i.e. the reconstruction I/O capability of the disk
d_f, d_s | d_f denotes the failed disk; d_s denotes the surviving disks
Num_{con,k}, Num_{min,k}, Num_{bal,k}, Num_{smart,k} | Number of reconstruction I/Os of ConRec, MinRec, BalRec and SmartRec on the bottleneck disk in the kth recovery window

The access monitor (AM) is responsible for capturing user and reconstruction I/Os. AM calculates the average response time of user I/Os in RAID-coded storage systems; it also keeps track of the average access time of reconstruction I/Os within each recovery window. AM performs the following steps: (1) AM checks whether all recovery windows have been reconstructed. If Num_win ≠ 0, then AM captures user and reconstruction I/O requests, which include I/O type and address information; (2) AM calculates the average response time of user I/Os (i.e. t_user) and the average response time of reconstruction I/Os during the kth recovery window (i.e. t_{rec,k}). This average response time is an indicator that quantifies the reconstruction I/O capability of disks (see Equation (6)). The design of AM is outlined in Algorithm 1.

Algorithm 1 Access Monitor (AM).
Input: t_{user,array}[] and t_{rec,array}[], two circular arrays that hold the response times of user I/Os and reconstruction I/Os, respectively.
Output: t_user and t_{rec,k}.
Notation: f_user: number of user I/Os during recovery. f_{rec,k,i}: number of reconstruction I/Os retrieved from the ith surviving disk during the kth recovery window.
 1: while Num_win ≠ 0 do
 2:   if (user I/O is true) then // user I/O request
 3:     T_user = 0
 4:     for each (jth user I/O in t_{user,array}[]) do
 5:       T_user = T_user + t_{user,array}[j]
 6:     end for
 7:     t_user = T_user / f_user
 8:     return t_user
 9:   else // reconstruction I/O request
10:     for each (ith surviving disk) do
11:       T_{rec,k,i} = 0
12:       for each (jth reconstruction I/O in t_{rec,array}[]) do
13:         T_{rec,k,i} = T_{rec,k,i} + t_{rec,array}[j]
14:       end for
15:       t_{rec,k,i} = T_{rec,k,i} / f_{rec,k,i}
16:     end for
17:     return t_{rec,k} = {t_{rec,k,i}}_{i∈d_s}
18:   end if
19: end while

The reconstruction scheduler (RS) selects a capability-I/O-matched recovery solution within the kth recovery window, according to the candidate recovery sequences R_cand (see also Section 3) and the reconstruction I/O capability of the disks (i.e. t_{rec,k}, see Algorithm 1) in the current recovery window. Algorithm 2 shows the pseudo-code of the RS algorithm.
The RS process is divided into two phases: (1) according to t_{rec,k} and each candidate recovery sequence RS_j, RS calculates the reconstruction time of every candidate recovery sequence (i.e. max(RS_j)) (see Steps 5–12 in Algorithm 2); (2) using the results from phase 1, RS selects the minimal value among all the max(RS_j) values. The sequence that attains this minimum becomes the capability-I/O-matched recovery solution (RS_match) (see Steps 14–23 in Algorithm 2).

Algorithm 2 Reconstruction Scheduler (RS).
Input: R_cand and t_{rec,k}.
Output: Capability-I/O-matched recovery solution (RS_match).
Notation: R_cand = {RS_1, RS_2, …, RS_m} is a set of m recovery sequences. t_{rec,k} = {t_{rec,k,1}, t_{rec,k,2}, …, t_{rec,k,n}}, where n is the number of disks. Given the jth recovery sequence RS_j, we define RS_j × t_{rec,k} = {r_{j,1} × t_{rec,k,1}, …, r_{j,n} × t_{rec,k,n}} and max(RS_j) = max_{1≤i≤n} {r_{j,i} × t_{rec,k,i}}.
 1: while Num_win ≠ 0 do
 2:   if Num_{rec,k−1} = 0 then
 3:     /* Failed units have been reconstructed in the (k−1)th window. */
 4:     Phase 1: /* compute max(RS_j), 1≤j≤m */
 5:     for (j = 1; j ≤ m; j++) do
 6:       max(RS_j) = 0
 7:       for (i = 1; i ≤ n; i++) do
 8:         if (max(RS_j) < r_{j,i} × t_{rec,k,i}) then
 9:           max(RS_j) = r_{j,i} × t_{rec,k,i}
10:         end if
11:       end for
12:     end for
13:     Phase 2: /* find the capability-I/O-matched sequence */
14:     match = 1
15:     min_Rcand = max(RS_match)
16:     for (j = 2; j ≤ m; j++) do
17:       if (max(RS_j) < min_Rcand) then
18:         match = j
19:         min_Rcand = max(RS_j)
20:       end if
21:     end for
22:     return match
23:   end if
24:   Initialize Num_{rec,k}
25:   Num_win = Num_win − 1
26: end while
27: Invoke RE(RS_match)

The main function of RE is to retrieve reconstruction I/Os according to a capability-I/O-matched recovery solution and to rebuild the corresponding failed units on a replacement disk. RE comprises three phases: (1) the reading data process; (2) the XOR data process; (3) the writing data process. Algorithm 3 outlines the RE design. The operations of the components are detailed as follows.

During the reading data process (see Steps 4–9 in Algorithm 3), all processes associated with surviving disks perform the following five steps to retrieve data in accordance with a capability-I/O-matched recovery solution. First, the reconstruction sequence RS_match for the current recovery window is obtained. Second, a request is issued to read the data indicated for a failed unit into a read buffer manager. Third, RE waits for the read request to complete. Fourth, the unit’s data is transferred to a centralized buffer manager for an XOR operation; such a data transfer may be blocked if the buffer is full. Last, once all failed units in the current recovery window are reconstructed, Num_{rec,k} will have been decreased to zero.
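The three RE phases (reading, XOR, writing) can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the authors' implementation: the phases run sequentially in one thread instead of one thread per disk, the buffers are plain `queue.Queue` objects, and the names `xor_blocks`, `rebuild_window`, `read_block` and `replacement` are hypothetical.

```python
# Illustrative sketch of a read -> XOR -> write pipeline; all names are hypothetical.
from functools import reduce
from queue import Queue


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_window(chains, read_block, replacement, buffer_size=4):
    """Rebuild every failed unit of one recovery window.

    chains maps the address of a failed unit to the (disk, address) pairs of
    the surviving blocks in the parity chain chosen for that unit.
    """
    read_buffer = Queue(maxsize=buffer_size)
    write_buffer = Queue()

    for failed_addr, chain in chains.items():
        # Reading phase: fetch the surviving blocks of the chosen parity chain.
        blocks = [read_block(disk, addr) for disk, addr in chain]
        read_buffer.put((failed_addr, blocks))

        # XOR phase: regenerate the lost block from the buffered data.
        addr, buffered = read_buffer.get()
        write_buffer.put((addr, xor_blocks(buffered)))

        # Writing phase: store the regenerated block on the replacement disk.
        addr, data = write_buffer.get()
        replacement[addr] = data


# Toy usage: one row parity chain with three data blocks and one parity block.
lost = bytes([0x69, 0x69])                  # block on the failed disk 0
d1, d2 = bytes([0x0F, 0xF0]), bytes([0x33, 0x33])
parity = xor_blocks([lost, d1, d2])         # row parity stored on disk 3
store = {(1, 0): d1, (2, 0): d2, (3, 0): parity}

replacement = {}
rebuild_window({0: [(1, 0), (2, 0), (3, 0)]},
               lambda disk, addr: store[(disk, addr)],
               replacement)
print(replacement[0] == lost)
```

The bounded read buffer mirrors the remark that a transfer may block when the buffer is full; in this single-threaded sketch the buffer is drained immediately, so no blocking actually occurs.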
In the XOR data process (see Steps 11–17 in Algorithm 3), RE regenerates failed data from the data retrieved by the reconstruction I/Os. Specifically, RE first checks whether the read buffer holds any data. If the read buffer is not empty, RE reads data from the buffer and performs an XOR operation. Finally, RE writes the XOR result to the write buffer.

Algorithm 3 Reconstruction Executor (RE).
State: To regenerate failed data from the capability-I/O-matched recovery solution (see Algorithm 1), RE creates n threads, each of which is associated with one disk.
Input: RS_{match}
 1: Create ReadBuffer to handle reconstruction I/Os.
 2: Create WriteBuffer to store regenerated failed data.
 3: /* Reading Data Process */
 4: while (Num_{win} ≠ 0) and (Num_{rec,k} > 0) do
 5:   if ReadBuffer() ≠ Full then
 6:     read(RS_{match}) → ReadBuffer()
 7:   end if
 8:   Num_{rec,k} = Num_{rec,k} − 1
 9: end while
10: /* XOR Data Process */
11: while Num_{win} ≠ 0 do
12:   if ReadBuffer() ≠ null then
13:     bufferdata ← ReadBuffer(size, address)
14:     XorResult = Xor(bufferdata)
15:     WriteBuffer(XorResult)
16:   end if
17: end while
18: /* Writing Data Process */
19: while Num_{win} ≠ 0 do
20:   if WriteBuffer() ≠ null then
21:     readresult ← read(size, address, WriteBuffer)
22:     write(readresult) → replacedisk
23:   end if
24: end while

During the writing data process (see Steps 19–24 in Algorithm 3), RE writes the rebuilt data to the replacement disk: a request is issued to write the data held in the write buffer to the replacement disk. If the write buffer is empty, the writing process stalls; it resumes as soon as the write buffer holds data again.

To obtain a dynamically balanced reconstruction I/O load and thereby improve recovery performance, SmartRec tracks the changes in I/O capability across all surviving disks throughout the recovery. SmartRec divides the entire reconstruction area into multiple consecutive, non-overlapping data areas, each of which is called a recovery window and is associated with multiple candidate recovery sequences. Within each recovery window, the reconstruction threads select, from the candidate recovery sequences, the capability-I/O-matched recovery solution that yields the minimal recovery time. In our implementation, the recovery process creates n threads associated with the n−1 surviving disks and one replacement disk. The reconstruction area holds 10 GB of data and is evenly divided into five recovery windows, i.e. each recovery window covers 2 GB of data. With this approach, SmartRec maintains a dynamically balanced reconstruction load.

4.3.
Overhead analysis

Space overhead analysis: During online recovery, memory overhead is mainly governed by the user I/O trace and the reconstruction I/Os, which correspond to three in-memory data buffers: TraceBuffer, ReadBuffer and WriteBuffer. TraceBuffer holds the user I/O trace records mapped to the appropriate addresses. ReadBuffer stores the reconstruction blocks retrieved from surviving disks; these blocks are XORed to generate the failed data blocks, which are saved in WriteBuffer. For instance, if 20 reconstruction blocks of 16 KB each are retrieved into ReadBuffer for a stripe, ReadBuffer occupies approximately 0.32 MB (20 × 16 KB = 320 KB). WriteBuffer is smaller, because it only holds the regenerated blocks. TraceBuffer requires about 2.5 MB for a part of the Web-2 trace. Moreover, this memory overhead is temporary: the memory is released once the reclaim process completes. The memory overhead is therefore acceptable for a real storage server with 16 GB of memory.

Computing overhead analysis: AM is mainly responsible for calculating the average response time of both user I/Os and reconstruction I/Os, so its overhead is mainly computational. RS selects the capability-I/O-matched recovery solution based on both the average response time of reconstruction I/Os and the candidate recovery solutions. RS creates multiple threads, one per candidate recovery solution, and each thread calculates the recovery time of its candidate for the current recovery window. The candidate recovery solutions with minimal reconstruction I/O distributions are generated in advance rather than during the execution of RS, so the cost of enumerating all possibilities is hidden. Overall, the computing overhead is negligible compared with the disk I/O overhead incurred during the entire recovery.
I/O overhead analysis: I/O overhead consists of the access time of foreground user I/Os and that of background reconstruction I/Os, which affect each other throughout the recovery process. RE creates n threads associated with the n−1 surviving disks and one replacement disk. Data are read from the surviving disks in parallel, and reads and writes are pipelined, which further decreases the disk I/O overhead.

5. RECOVERY MODEL

We now formulate the online single-disk failure recovery problem as an optimization model for heterogeneous disk arrays. The notation used in our models is listed in Table 2. We take the TP code as an example to build our models. Similar to the existing recovery schemes reported in [10, 14], SmartRec focuses on the recovery of data disks; we adopt a conventional recovery scheme to recover parity disks.

5.1. Problem formulation

We develop a reconstruction-time model to formulate the problem solved by the reconstruction schemes. The reconstruction time equals the amount of reconstruction I/O data divided by the reconstruction bandwidth, which is determined by the slowest disk. In ConRec, MinRec and BalRec, there is only a single recovery sequence regardless of the number of recovery windows; thus, the reconstruction I/O load is constant during the entire recovery process. SmartRec differs from these three recovery schemes in that it is designed to optimize online reconstruction performance for heterogeneous disk arrays. The model of SmartRec is formulated as follows:

$$T_{smart} = \sum_{k=1}^{Num_{win}} \frac{Num_{smart,k} \times S}{BW_{rec,k}} \tag{1}$$

where $Num_{smart,k}$ is the reconstruction I/O load and $BW_{rec,k}$ is the reconstruction bandwidth of the bottleneck disk in the kth recovery window.
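As a rough illustration of Equation (1), the sketch below sums the per-window time slices in Python. It assumes that S denotes the size of one reconstruction unit (the notation table is not reproduced in this section) and that bandwidth is expressed in the same size unit per second. The function name and all sample values are hypothetical and are not taken from the experiments.

```python
def total_reconstruction_time(loads, unit_size_mb, bandwidths_mb_s):
    """Evaluate Equation (1): sum over windows of Num_smart,k * S / BW_rec,k.

    loads            -- Num_smart,k, units read from the bottleneck disk in window k
    unit_size_mb     -- S, assumed here to be the size of one unit in MB
    bandwidths_mb_s  -- BW_rec,k, bottleneck reconstruction bandwidth in MB/s
    """
    if len(loads) != len(bandwidths_mb_s):
        raise ValueError("one load and one bandwidth are needed per window")
    if any(bw <= 0 for bw in bandwidths_mb_s):
        raise ValueError("bandwidths must be positive")
    return sum(n * unit_size_mb / bw for n, bw in zip(loads, bandwidths_mb_s))


# Hypothetical three-window recovery with 64 KB (0.0625 MB) units.
example_loads = [1000, 1500, 1500]
example_bandwidths = [30.0, 50.0, 25.0]
t_smart = total_reconstruction_time(example_loads, 0.0625, example_bandwidths)
print(f"T_smart = {t_smart:.2f} s")
```

Each window contributes its own term, so in this made-up example the third window costs twice as much as the second one although both carry the same load, simply because its bottleneck bandwidth is halved.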
In the offline recovery process for a heterogeneous RAID, rebuilding a single failed disk is governed by one reconstruction sequence, because the reconstruction bandwidth is fixed: the disk I/O capability depends only on the disk bandwidths of the heterogeneous RAID, which are independent of I/O intensity. In such a case, there is no need to vary reconstruction sequences among recovery windows. In the online recovery case, the reconstruction process is divided into multiple recovery windows, across which the reconstruction bandwidth varies with the changing user-I/O intensity. Each window may adopt a customized reconstruction sequence according to the run-time I/O capability of the disks. The primary optimization objective is to minimize the reconstruction time $T_{smart}$. Thus, we express this objective as

$$\text{Minimize} \sum_{k=1}^{Num_{win}} \frac{Num_{smart,k} \times S}{BW_{rec,k}} \tag{2}$$

5.2. SmartRec recovery model

To implement the design idea of 'matching the number of reconstruction I/Os to the disks' reconstruction capability', we have to develop a lightweight evaluation mechanism to measure the I/O capability of disks. We face two challenges: (1) how do we choose an appropriate evaluation metric? and (2) how do we minimize the overhead caused by such an evaluation mechanism? We address the first challenge by using I/O transmission bandwidth to quantify disk I/O capability in disk arrays; the average response time is measured to determine the I/O capability of each disk. A similar approach can be found in a prior study [22]. We address the second challenge by increasing the recovery window size, which requires a tradeoff between evaluation overhead and disk utilization. Determining the window granularity is critical: a larger window lowers the evaluation cost but also lowers disk utilization, and vice versa. The recovery problem solved by SmartRec is formulated as an optimal reconstruction model for heterogeneous disk arrays.
Given the jth recovery sequence, the reconstruction time of the ith disk during the kth recovery window is denoted by $T_{k,j,i}$. Thus, we have

$$T_{k,j,i} = \frac{r_{j,i} \times S}{BW_{k,i}} \tag{3}$$

where $T_{k,j,i}$ represents the reconstruction time of the ith disk within the kth recovery window. This time is also expressed as $r_{j,i} \times t_{rec,k,i}$ (see Phase 1 in Algorithm 2). Recall that the reconstruction time of a single surviving disk equals the number of retrieved reconstruction I/Os ($r_{j,i}$) multiplied by the average response time of a reconstruction I/O ($t_{rec,k,i}$). According to (3), we derive the reconstruction time of the jth recovery sequence within the kth recovery window ($T_{k,j}$):

$$T_{k,j} = \max_{i \in ds} \{T_{k,j,i}\} \tag{4}$$

where $T_{k,j}$ is the maximal reconstruction time among all surviving disks, i.e. the reconstruction time of the bottleneck disk during the recovery window (see max(RS_j) in Algorithm 2). The reconstruction time in the kth recovery window is expressed as follows:

$$T_k = \min_{j \in R_{cand}} \{\max_{i \in ds} \{T_{k,j,i}\}\} \tag{5}$$

where $T_k$ is the minimal bottleneck-disk reconstruction time over all candidate recovery sequences in the kth recovery window (see Phase 2 in Algorithm 2).

To quantify the available bandwidth for data recovery, SmartRec keeps track of the historical access times of reconstruction I/Os. The available reconstruction bandwidth of the ith surviving disk during the kth recovery window is then formulated as

$$BW_{k,i} = \frac{S}{\dfrac{\sum_{l \in Num_{rec,k-1,i}} t_l}{Num_{rec,k-1,i}}}, \quad \text{where} \quad t_{rec,k,i} = \frac{\sum_{l \in Num_{rec,k-1,i}} t_l}{Num_{rec,k-1,i}} \tag{6}$$

where $t_{rec,k,i}$ represents the average response time of reconstruction I/Os used for the current recovery window (i.e. the reconstruction I/O capability of the ith surviving disk within the kth recovery window). From (6), the reconstruction I/O capability of the current (kth) recovery window is derived from that of the previous ((k−1)th) recovery window (see also Fig. 4).
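Equations (3)–(6) and the two phases of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the access times are invented, and the two candidate distributions are borrowed from the RDP (p = 5) case discussed in the model validation. The per-disk time is computed as the number of reads multiplied by the average response time, the form used in Phase 1 of Algorithm 2.

```python
def average_response_time(access_times):
    """t_rec,k,i in Eq. (6): mean access time recorded in window k-1."""
    if not access_times:
        raise ValueError("no reconstruction I/O was recorded for this disk")
    return sum(access_times) / len(access_times)


def bandwidth(unit_size, access_times):
    """BW_k,i in Eq. (6): unit size S divided by the average response time."""
    return unit_size / average_response_time(access_times)


def select_sequence(candidates, t_rec):
    """Eq. (5): pick the candidate whose bottleneck disk finishes first.

    candidates -- list of per-disk read counts r_j,i, one list per sequence
    t_rec      -- per-disk average response times t_rec,k,i
    Returns (index of the selected sequence, its window time T_k).
    """
    best_index, best_time = None, float("inf")
    for j, reads in enumerate(candidates):
        # Eq. (4): the slowest disk bounds the time of sequence j.
        bottleneck = max(r * t for r, t in zip(reads, t_rec))
        if bottleneck < best_time:  # strict '<' keeps the first minimum
            best_index, best_time = j, bottleneck
    return best_index, best_time


# Invented access times (ms) from the previous window; disk 2 is the slow one.
history = [[4, 6], [5, 5], [9, 11], [5, 5], [5, 5]]
t_rec = [average_response_time(h) for h in history]

# Candidate 0 reads 4 units from four disks; candidate 1 spreads the reads.
candidates = [[4, 4, 4, 4, 0], [3, 2, 2, 3, 2]]
match, window_time = select_sequence(candidates, t_rec)
print(match, window_time)
```

With these made-up response times the second candidate is selected, because it asks the slow disk for only two units per stripe.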
According to (3) and (6), we obtain $T_{k,j,i}$ as

$$T_{k,j,i} = r_{j,i} \times t_{rec,k,i} \tag{7}$$

According to (3)–(7), the total reconstruction time ($T_{smart}$) is modeled as

$$T_{smart} = \sum_{k \in Num_{win}} \{T_k\} = \sum_{k \in Num_{win}} \{\min_{j \in R_{cand}} \{\max_{i \in ds} \{T_{k,j,i}\}\}\} = \sum_{k \in Num_{win}} \{\min_{j \in R_{cand}} \{\max_{i \in ds} \{r_{j,i} \times t_{rec,k,i}\}\}\} \tag{8}$$

where $T_{smart}$ is the sum of all reconstruction time slices of SmartRec, each of which corresponds to a recovery window. The length of a reconstruction time slice varies with the dynamically changing reconstruction bandwidth, which is determined by both the inherent heterogeneity of the disk array and the I/O intensity of user requests.

5.3. Model validation

We introduce reconstruction time ratios to validate the proposed models for heterogeneous RAIDs in both the online and offline recovery processes. Let $R_{con,smart}$ be the reconstruction time ratio between ConRec and SmartRec. We express $R_{con,smart}$ as

$$R_{con,smart} = \frac{T_{con}}{T_{smart}} = \frac{\sum_{k=1}^{Num_{win}} Num_{con,k}}{\sum_{k=1}^{Num_{win}} Num_{smart,k}} \tag{9}$$

where $T_{con}$ and $T_{smart}$ are the reconstruction times of ConRec and SmartRec, respectively. In the offline recovery case, the reconstruction bandwidth is a constant that depends on the slowest disk in a heterogeneous RAID. Thus, the entire recovery process adopts a single reconstruction sequence for both ConRec and SmartRec (i.e. $Num_{win} = 1$). In online recovery, the reconstruction bandwidth is governed by both the changing user-I/O intensity and the bottleneck disk's bandwidth. Thus, the reconstruction process is divided into multiple recovery windows (i.e. $Num_{win} > 1$). The reconstruction I/O load of ConRec is constant during the entire recovery process, because ConRec adopts only one type of parity chain to rebuild a single failed disk. In contrast, the reconstruction sequence in SmartRec is adjusted in each window according to the reconstruction I/O load and the bottleneck disk. Equation (9) suggests that the reconstruction time ratio translates into the ratio of reconstruction I/O loads between the two recovery schemes.
We validate the models in the offline and online scenarios for an RDP-coded heterogeneous storage system with p = 5, 7, 11 and 13. In the offline scenario, ConRec and SmartRec each maintain a fixed reconstruction sequence. The reconstruction sequence of ConRec is induced by only one type of parity chain set, whereas that of SmartRec is governed by a hybrid parity chain set. Thus, the reconstruction I/O load of ConRec is identical among the surviving disks it reads, whereas the I/O load of SmartRec is assigned non-uniformly to the surviving disks. For example, in the case of RDP with p = 5, the distribution of reconstruction I/Os for ConRec is $\{r_{j,i}\}_{i \in ds} = \{4,4,4,4,0\}$ in a stripe, which is induced by the reconstruction sequence $RS_j = \{0,0,0,0\}$. In the SmartRec case, the distribution of reconstruction I/Os in a stripe is $\{r_{j,i}\}_{i \in ds} = \{3,2,2,3,2\}$, governed by the reconstruction sequence $RS_j = \{0,0,1,1\}$. Thus, according to the theoretical analysis, $R_{con,smart}$ is derived from the ratio of I/Os retrieved from the bottleneck disk (i.e. $R_{con,smart} = 4/2 = 2$ when p = 5).

Recall that in the online scenario, the reconstruction sequence of ConRec is fixed regardless of the changing user-I/O intensity, whereas the sequence of SmartRec varies with the changing I/O intensity across recovery windows. Consider a case where p = 5 and $Num_{win} = 5$: the reconstruction I/O load of the bottleneck disk in ConRec is 4 in every window, because its reconstruction sequence is fixed. For SmartRec, the reconstruction I/O load of the bottleneck disk is 2, 3, 3, 2 and 3 in the five recovery windows, respectively; the varied load results from the changing reconstruction bandwidth. From (9), $R_{con,smart} = (4+4+4+4+4)/(2+3+3+2+3) = 20/13 \approx 1.54$.

Figures 5 and 6 compare the reconstruction time ratios derived from our models with the experimental results obtained from the implemented prototypes on real-world disk arrays, for the offline and online scenarios, respectively.
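The two theoretical ratios worked out above for RDP with p = 5 can be reproduced with a short Python check. The helper name is hypothetical, and the ratio is taken between the summed bottleneck-disk loads of the two schemes, which is how the worked online example combines the five windows.

```python
def reconstruction_time_ratio(loads_a, loads_b):
    """Ratio of total bottleneck-disk loads of scheme A to scheme B."""
    if len(loads_a) != len(loads_b):
        raise ValueError("both schemes need one load per recovery window")
    return sum(loads_a) / sum(loads_b)


# Offline case: a single window, bottleneck loads 4 (ConRec) and 2 (SmartRec).
offline_ratio = reconstruction_time_ratio([4], [2])

# Online case: five windows, ConRec stays at 4 while SmartRec varies.
online_ratio = reconstruction_time_ratio([4] * 5, [2, 3, 3, 2, 3])

print(offline_ratio, round(online_ratio, 2))
```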
We observe that the difference between the theoretical ratios and their experimental counterparts is marginal (i.e. 11% and 12% for the offline and online scenarios, respectively). The experimental results confirm that the proposed models effectively estimate the reconstruction times spent in recovering from single-disk failures in disk arrays. Although array codes differ in how they construct their diverse parity chains, the parity chains of each code share common symbols, which leads to a non-uniform distribution of reconstruction I/O load. The benefit of our scheme stems from this non-uniform distribution of reconstruction I/Os, which is common to these array codes. Thus, we draw similar conclusions for EVENODD, TP and STAR: reconstruction time ratios can be derived from the I/O load ratio of the bottleneck disk, which depends only on the deployed reconstruction strategy.

Figure 5. Comparisons of offline reconstruction time ratios obtained from the models and empirical studies. (a) $T_{con}/T_{smart}$. (b) $T_{min}/T_{smart}$. (c) $T_{bal}/T_{smart}$.

Figure 6. Comparisons of online reconstruction time ratios obtained from the models and empirical studies. (a) $T_{con}/T_{smart}$. (b) $T_{min}/T_{smart}$. (c) $T_{bal}/T_{smart}$.

In the future, the measurement accuracy of the experiments needs to be improved to narrow the gap between the theoretical results and their experimental counterparts. To improve the measurement accuracy, we adopt the average response time of reconstruction I/Os to evaluate the I/O capability of surviving disks for retrieving reconstruction I/Os.
We record the number of historical accesses and the total I/O time of reconstruction I/Os to calculate the average response time. We plan to further capture the changing I/O capability of surviving disks by analyzing the access behavior of both user I/Os and reconstruction I/Os to decrease the evaluation overhead, by adjusting the size of recovery windows to improve disk utilization, and by scheduling reconstruction threads more efficiently to improve the measurement accuracy of I/O capability.

6. PERFORMANCE EVALUATION

We implement the proposed SmartRec reconstruction scheme along with the three alternatives (i.e. ConRec, MinRec and BalRec) on a real-world storage server, and conduct a wide range of experiments to quantitatively compare the performance of the four reconstruction schemes.

6.1. Experiment environment

We implement the four prototypes on a high-performance storage server in which all disks are organized as a RAID. The server is equipped with two 6-core Intel(R) Xeon(R) E7540 CPUs at 2.00 GHz and 16 GB of DDR3 main memory. All disks are Seagate ST9300605SS SAS-II drives connected to a MegaRAID SAS 1078 controller with a 512 MB dedicated cache. The operating system is Ubuntu 10.04 LTS x86-64 with Linux kernel 2.6.32. To emulate heterogeneous RAIDs, we vary a disk parameter, /sys/block/sdx/queue/read_ahead_kb, to create heterogeneity in the I/O capabilities of the disks. For example, when /sys/block/sdx/queue/read_ahead_kb is set to 0, the bandwidth of a Seagate ST9300605SS SAS-II disk is approximately 30 MB/s. Table 3 lists the four types of heterogeneous disks used in our experiments. We use IOMeter [27] to measure the disk I/O speed, which ranges between 30 MB/s and 150 MB/s. We thus build three heterogeneous RAIDs by changing the disk configuration parameter, namely Conf-A, Conf-B and Conf-C.
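The read-ahead adjustment described above can be scripted. The following Python sketch is a hypothetical helper, not part of the authors' prototype. It writes a value to the read_ahead_kb entry of a block device, and the demonstration runs against a temporary directory that mimics the sysfs layout, so it can be tried without root privileges. On a real system the default root /sys/block would be used and the call would typically require root.

```python
import tempfile
from pathlib import Path

# Read-ahead values (KB) listed in Table 3 for Disk-1 to Disk-4.
READ_AHEAD_KB = {"Disk-1": 0, "Disk-2": 4, "Disk-3": 9, "Disk-4": 128}


def set_read_ahead_kb(device, kilobytes, sysfs_root="/sys/block"):
    """Write `kilobytes` to <sysfs_root>/<device>/queue/read_ahead_kb."""
    if kilobytes < 0:
        raise ValueError("read-ahead size must be non-negative")
    knob = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    knob.write_text(f"{kilobytes}\n")
    return knob


# Dry run: a temporary directory stands in for /sys/block (assumption made
# only so that the example is safe to execute anywhere).
with tempfile.TemporaryDirectory() as fake_sysfs:
    (Path(fake_sysfs) / "sdx" / "queue").mkdir(parents=True)
    knob = set_read_ahead_kb("sdx", READ_AHEAD_KB["Disk-1"], sysfs_root=fake_sysfs)
    written = knob.read_text().strip()

print(written)
```

The bandwidth that results from a given setting depends on the drive and the workload, so it should be measured (the text uses IOMeter) rather than assumed.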
Disk bandwidth ranges between 100 MB/s and 150 MB/s in Conf-A, between 78 MB/s and 150 MB/s in Conf-B, and between 30 MB/s and 150 MB/s in Conf-C.

Table 3. The configuration of heterogeneous disks within RAID.

Heterogeneous disk type   Disk bandwidth (MB/s)   Disk configuration parameter (read_ahead_kb)
Disk-1                    30                      0
Disk-2                    78                      4
Disk-3                    100                     9
Disk-4                    150                     128

6.2. Evaluation methodology

We evaluate the performance of the four reconstruction schemes in recovering a single failed disk in disk arrays powered by multiple-fault-tolerant RAID codes: EVENODD [1], RDP [2], STAR [5] and TP [4]. The former two codes are specific to RAID-6 systems, which tolerate exactly two failures; the latter two codes tolerate exactly three failures. The amount of data stored on each disk is set to 10 GB, which is sufficiently large to evaluate the reconstruction times of the tested schemes. Such an amount of test data, which has been used in other studies [17], adequately covers the footprint of the evaluated workloads. Reconstruction performance is measured as the completion time of reconstructing 10 GB of data with a recovery window of 2 GB. Each experiment is repeated five times, and the average reconstruction time is calculated.

All reconstruction schemes are implemented in the C programming language. To replay the foreground user I/Os, we create n threads that dispatch the user I/Os to the corresponding disks, i.e. each thread is responsible for replaying user I/Os to its disk. This multi-threaded strategy makes it possible to collect the queue time and service time of user I/Os. To regenerate failed data from the capability-I/O-matched recovery solution (see Algorithm 3), we create n threads, each of which is associated with one disk. Of these, n−1 threads read surviving data blocks in parallel into ReadBuffers in memory. The ReadBuffers associated with the surviving disks form a linked list that feeds the XOR operation regenerating the failed data. The remaining thread writes the XOR results. Owing to this multi-threaded design, the read, XOR and write operations are pipelined.

To adequately evaluate the improvement of SmartRec over the three alternatives, we conduct comparative experiments in the offline and online recovery scenarios. In evaluating online reconstruction performance, we implement different RAID codes, test various array sizes and stripe unit sizes, and configure the I/O heterogeneity of our storage system. We adopt an open-loop model during the online recovery test, where traces are replayed according to the timestamps recorded in the trace files (i.e. I/O arrival rates are independent of I/O request completions [28]). The trace replayer issues I/O requests to the appropriate data chunks according to address mappings. The trace replay tool is RAIDmeter [18], which replays traces at the block level and evaluates user response time in the storage device. We evaluate the online reconstruction performance using the Web-2 trace [29], which represents read-intensive applications. The Web-2 trace used in our experiments is obtained from the Storage Performance Council [29] and was collected from a machine running a web search engine.
It is read-dominated (99.98% read ratio) and exhibits high locality in its access pattern; block request sizes are 8 KB, 16 KB, 24 KB and 32 KB. The Web-2 trace files record a total of 4 579 809 I/O requests. In this paper, we mainly focus on dynamically balancing the reconstruction I/O load during the recovery process, and only 0.02% of the requests in the Web-2 trace are writes. Optimizing read or write operations in I/O workloads has been widely studied in the literature [17, 18]. For fairness and simplicity, we only consider read operations issued by the RAIDmeter replay tool. We also assume that the faulty disk is the first disk in the test system.

6.3. Offline reconstruction performance

To examine the advantage of SmartRec over the ConRec, MinRec and BalRec reconstruction schemes, we implement four RAID codes (i.e. EVENODD, STAR, RDP and TP), each of which is tested with different RAID sizes. Figure 7 shows the experimental results of the offline reconstruction performance comparison.

Figure 7. Offline reconstruction performance comparison. Stripe unit size = 64 KB. (a) EVENODD. (b) STAR. (c) RDP. (d) TP.

We draw two observations from Fig. 7. First, SmartRec consistently outperforms the other three reconstruction schemes regardless of RAID codes and RAID sizes. The performance improvement offered by SmartRec can be attributed to the fact that SmartRec assigns the data reconstruction load in accordance with the I/O capability of the disks during recovery, thereby taking I/O heterogeneity into account. For example, in the TP-coded storage system where P is set to 7 and the stripe unit size is 64 KB, the recovery times are 434.43 s, 336.88 s, 274.04 s and 238.84 s for ConRec, MinRec, BalRec and SmartRec, respectively.
Second, the reconstruction performance of the four schemes degrades consistently, though marginally, with increasing RAID sizes, mainly because the total amount of reconstruction data grows with the RAID size. This trend holds even though the amount of data to be reconstructed is the same for all disks. For example, in the TP-coded storage system, the recovery times of SmartRec are 207.62 s, 238.84 s and 373.58 s when P is set to 5, 7 and 11, respectively.

6.4. Online reconstruction performance

Compared with the offline recovery scenario, the online recovery environment also serves external user I/O requests. Hence, the reconstruction bandwidth is determined by the overall disk bandwidth and the bandwidth consumed by external user I/Os. To adequately assess the effectiveness of SmartRec relative to the other three reconstruction schemes, we compare online reconstruction performance from the following four aspects.

6.4.1. Different RAID codes and different RAID sizes

We carry out comparative experiments on storage systems powered by four RAID codes (i.e. EVENODD, STAR, RDP and TP), each with different RAID sizes. Figure 8 shows the experimental results of the four reconstruction schemes.

Figure 8. Online reconstruction performance comparison. Stripe unit size = 64 KB. (a) EVENODD. (b) STAR. (c) RDP. (d) TP.

We observe from Fig. 8 that the online reconstruction performance exhibits a trend similar to that of the offline recovery scenario. In other words, SmartRec achieves better reconstruction performance than the other three reconstruction schemes regardless of the RAID code, and the reconstruction performance of all four schemes shows marginal degradation with increasing RAID sizes.
For example, in the TP-coded storage system where P is 7 and the stripe unit size is 64 KB, the reconstruction times are 521.23 s, 496.28 s, 436.58 s and 385.16 s for ConRec, MinRec, BalRec and SmartRec, respectively. SmartRec is superior to the other three schemes because it chooses a flexible, capability-I/O-matched reconstruction data distribution in each recovery window (i.e. it minimizes each reconstruction time slice) to minimize the overall reconstruction time. Furthermore, the reconstruction time in the online scenario is longer than that in the offline scenario regardless of RAID codes and RAID sizes, because the online reconstruction process has to handle the contention between user I/Os and reconstruction I/Os.

6.4.2. User response time

We now examine the average user response times of the storage systems running the four reconstruction schemes. We focus on a TP-coded system where P is set to 7 and the stripe unit size is 64 KB. We collect the response times of all user I/Os and calculate the average user response time over each 20-s window; these averages form an average user response time series. Figure 9 shows the average user response times of the four reconstruction schemes.

Figure 9. Average user response time comparison. P = 7, stripe unit size = 64 KB.

From Fig. 9, we observe that SmartRec completes the reconstruction process faster than the other three schemes and that the average user response time under SmartRec is slightly lower than those under ConRec, MinRec and BalRec during data recovery. For instance, at the 140-s time slot, the average user response time is 7.1 ms, 6.1 ms, 6.2 ms and 5.8 ms for ConRec, MinRec, BalRec and SmartRec, respectively. This is because SmartRec judiciously distributes the reconstruction load among disks to balance I/Os across all the disks.
As a result, disks with high I/O capability handle a large data volume and vice versa. Therefore, SmartRec incurs a shorter user I/O queuing delay than the three competing schemes, thereby shortening average user response times.

6.4.3. Different stripe unit sizes

We evaluate the impact of different stripe unit sizes on reconstruction performance. Again, we study the TP-coded system where P is 7. The stripe unit size is set to 16 KB, 32 KB, 64 KB and 128 KB. Figure 10 depicts the comparative experimental results illustrating the impact of stripe unit size.

Figure 10. Reconstruction performance of the TP-coded storage system under different stripe unit sizes. P = 7.

Figure 10 reveals that the reconstruction performance of the four schemes gradually improves as the stripe unit size increases. For example, when the stripe unit size is set to 16 KB, 32 KB, 64 KB and 128 KB, the reconstruction time of SmartRec is 715.64 s, 586.48 s, 385.16 s and 324.50 s, respectively. This trend is reasonable, because a larger stripe unit size reduces the number of I/O accesses, which in turn improves access sequentiality. Moreover, SmartRec reconstructs data faster than the other schemes regardless of stripe unit size.

6.4.4. Different heterogeneous configurations

To assess the sensitivity of the four reconstruction schemes to the degree of heterogeneity in RAIDs, we evaluate the TP-coded storage system where P is 7 and the stripe unit size is 64 KB. We conduct comparative experiments on three heterogeneous configurations, namely Conf-A (100 MB/s–150 MB/s), Conf-B (78 MB/s–150 MB/s) and Conf-C (30 MB/s–150 MB/s). Figure 11 depicts the experimental results of the reconstruction schemes on heterogeneous RAIDs.

Figure 11. Reconstruction performance comparison under different heterogeneous configurations. P = 7 and stripe unit size = 64 KB.

Figure 11 illustrates that SmartRec consistently outperforms the other three reconstruction schemes under all the heterogeneous configurations. We also observe that the improvement in reconstruction performance gradually increases as disk heterogeneity rises. For example, in the Conf-A, Conf-B and Conf-C cases, SmartRec improves reconstruction performance by 18.8%, 25% and 35.3% over the ConRec scheme, respectively. RAID systems with high heterogeneity gain more benefit from SmartRec than those with low heterogeneity, because SmartRec is tailored for heterogeneous recovery environments and considers both the reconstruction data distribution and the I/O capabilities of all disks.

7. CONCLUSION

In this paper, we have focused on single-disk failure recovery in parallel RAID-coded storage systems that tolerate multiple failures. We showed that existing single-failure recovery schemes such as ConRec, MinRec and BalRec speed up the recovery process of storage systems without addressing heterogeneity in RAID environments. We solved this problem by proposing a heterogeneity-aware single-disk failure recovery scheme called SmartRec, which improves recovery performance by periodically selecting an appropriate reconstruction solution that retrieves surviving data based on the I/O capabilities of the surviving disks. We built recovery-time models for the four reconstruction schemes and validated the four models using empirical evaluations. We compared SmartRec against the three alternatives in online data reconstruction tests by replaying real-world workloads under various disk configurations.
The experimental results show that our SmartRec recovery scheme outperforms the three existing recovery schemes in terms of reconstruction time and user I/O performance. For example, in the online TP-coded heterogeneous RAIDs with 9 disks, SmartRec improves reconstruction performance over the three existing alternatives by up to 35.3% with an average of 25.8%. As a future research direction, we plan to improve the measurement accuracy of reconstruction I/O capability among surviving disks by incorporating varied window granularity, and to further investigate the impact of storage heterogeneity on reconstruction I/O flows.

FUNDING

This work is supported in part by the National Science Foundation of China under Grant nos. 61572209, 61472152 and 61762075, and the Fundamental Research Funds for the Central Universities under HUST: 2015MS006. This work is also supported by the Key Laboratory of Information Storage System. Xiao Qin’s work was supported by the U.S. National Science Foundation under Grants CCF-0845257 (CAREER Award), CNS-0917137 and CCF-0742187, and by the 111 Project under Grant B07038. This work is also supported by the Chunhui planning project of the Ministry of Education under Grant no. Z2015059, the Province Science Foundation of QingHai under Grant nos. 2016-ZJ-920Q, 2014-ZJ-908, 2016-ZJ-739 and 2015-ZJ-718, and the commercialization of research findings project of QingHai under Grant no. 2016-SF-130. This work is also supported by the Key Laboratory of IoT of QingHai Province under Grant no. 2017-ZJ-Y21, and the Society Science Foundation of China under Grant no. 15XMZ057.

REFERENCES

1 Blaum, M., Brady, J., Bruck, J. and Menon, J. (1995) Evenodd: an efficient scheme for tolerating double disk failures in raid architectures. IEEE Trans. Comput., 44, 192–202.
2 Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J. and Sankar, S.
(2004) Row-Diagonal Parity for Double Disk Failure Correction. Proc. 3rd USENIX Conf. File and Storage Technologies, Berkeley, CA, USA, FAST’04, pp. 1–14. USENIX Association.
3 Xu, L. and Bruck, J. (1999) X-code: Mds array codes with optimal encoding. IEEE Trans. Inf. Theory, 45, 272–276.
4 Corbett, P.F. and Goel, A. (2011) Triple Parity Technique for Enabling Efficient Recovery from Triple Failures in a Storage Array. US Patent 8,010,874.
5 Huang, C. and Xu, L. (2008) Star: an efficient coding scheme for correcting triple storage node failures. IEEE Trans. Comput., 57, 889–901.
6 Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003) The Google file system. ACM SIGOPS Operating Systems Review, pp. 29–43. ACM Association.
7 Pinheiro, E., Weber, W.-D. and Barroso, L.A. (2007) Failure Trends in a Large Disk Drive Population. Proc. 5th USENIX Conf. File and Storage Technologies, Berkeley, CA, USA, FAST’07, pp. 17–28. USENIX Association.
8 Wang, Z., Dimakis, A.G. and Bruck, J. (2010) Rebuilding for array codes in distributed storage systems. GLOBECOM Workshops (GC Wkshps), 2010 IEEE, pp. 1905–1909. IEEE.
9 Khan, O., Burns, R., Plank, J. and Huang, C. (2011) In Search of I/O-Optimal Recovery from Disk Failures. 3rd USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage’11, Portland, OR, USA, June 14, 2011, pp. 6–11. USENIX Association.
10 Zhu, Y., Lee, P.P.C., Hu, Y., Xiang, L. and Xu, Y. (2012) On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice. IEEE 28th Symp. Mass Storage Systems and Technologies, MSST 2012, April 16–20, 2012, Asilomar Conference Grounds, Pacific Grove, CA, USA, pp. 1–12. IEEE.
11 Xiang, L., Xu, Y., Lui, J.C.S. and Chang, Q. (2010) Optimal Recovery of Single Disk Failure in RDP Code Storage Systems. SIGMETRICS 2010, Proc. 2010 ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems, New York, USA, 14–18 June 2010, pp. 119–130. ACM.
12 Xiang, L., Xu, Y., Lui, J.C.S., Chang, Q., Pan, Y. and Li, R. (2011) A hybrid approach to failed disk recovery using RAID-6 codes: algorithms and performance evaluation. TOS, 7, 11:1–11:34.
13 Xu, S., Li, R., Lee, P., Zhu, Y., Xiang, L., Xu, Y. and Lui, J. (2013) Single disk failure recovery for x-code-based parallel storage systems. IEEE Trans. Comput., 63, 995–1007.
14 Luo, X. and Shu, J. (2013) Load-Balanced Recovery Schemes for Single-Disk Failure in Storage Systems with Any Erasure Code. 42nd Int. Conf. Parallel Processing, ICPP 2013, Lyon, France, October 1–4, 2013, pp. 552–561. IEEE.
15 Schroeder, B. and Gibson, G.A. (2007) Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? 5th USENIX Conf. File and Storage Technologies, FAST 2007, February 13–16, 2007, San Jose, CA, USA, pp. 1–16. USENIX Association.
16 Drapeau, A.L. et al. (1994) RAID-II: A High-Bandwidth Network File Server. Proc. 21st Annual Int. Symp. Computer Architecture, Chicago, IL, USA, April 1994, pp. 234–244. IEEE.
17 Wu, S., Jiang, H., Feng, D., Tian, L. and Mao, B. (2009) WorkOut: I/O Workload Outsourcing for Boosting RAID Reconstruction Performance. 7th USENIX Conf. File and Storage Technologies, February 24–27, 2009, San Francisco, CA, USA, Proceedings, pp. 239–252. USENIX Association.
18 Tian, L., Feng, D., Jiang, H., Zhou, K., Zeng, L., Chen, J., Wang, Z. and Song, Z. (2007) PRO: A Popularity-Based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems. 5th USENIX Conf. File and Storage Technologies, FAST 2007, February 13–16, 2007, San Jose, CA, USA, pp. 277–290. USENIX Association.
19 Wan, S., Cao, Q., Huang, J., Li, S., Li, X., Zhan, S., Yu, L., Xie, C. and He, X. (2011) Victim Disk First: An Asymmetric Cache to Boost the Performance of Disk Arrays under Faulty Conditions. Proc. 2011 USENIX Annual Technical Conference, pp. 13–25. USENIX Association.
20 Khan, O., Burns, R.C., Plank, J.S., Pierce, W. and Huang, C. (2012) Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. Proc. 10th USENIX Conf. File and Storage Technologies, FAST 2012, San Jose, CA, USA, February 14–17, 2012, pp. 1–20. USENIX Association.
21 Xie, T. and Wang, H. (2008) Micro: a multilevel caching-based reconstruction optimization for mobile storage systems. IEEE Trans. Comput., 57, 1386–1398.
22 Zhu, Y., Lee, P.P.C., Xiang, L., Xu, Y. and Gao, L. (2012) A Cost-Based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes. IEEE/IFIP Int. Conf. Dependable Systems and Networks, DSN 2012, Boston, MA, USA, June 25–28, 2012, pp. 1–12. IEEE.
23 Luo, H., Huang, J., Cao, Q. and Xie, C. (2014) LaRS: A Load-Aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters. 9th IEEE Int. Conf. Networking, Architecture, and Storage, NAS 2014, Tianjin, China, August 6–8, 2014, pp. 168–175. IEEE.
24 Reed, I.S. and Solomon, G. (1960) Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math., 8, 300–304.
25 Plank, J.S., Luo, J., Schuman, C.D., Xu, L. and Wilcox-O’Hearn, Z. (2009) A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage. 7th USENIX Conf. File and Storage Technologies, February 24–27, 2009, San Francisco, CA, USA, Proceedings, pp. 253–265. USENIX Association.
26 Hafner, J.L., Deenadhayalan, V., Rao, K.K. and Tomlin, J.A. (2005) Matrix Methods for Lost Data Reconstruction in Erasure Codes. Proc. FAST’05 Conf. File and Storage Technologies, December 13–16, 2005, San Francisco, California, USA, pp. 1–14. USENIX Association.
27 Scheibli, D., Eiler, J. and Randall, R. Iometer: I/O Subsystem Measurement and Characterization Tool. http://www.iometer.org.
28 Schroeder, B., Wierman, A. and Harchol-Balter, M. (2006) Open Versus Closed: A Cautionary Tale. 3rd Symp. Networked Systems Design and Implementation (NSDI 2006), May 8–10, 2006, San Jose, California, USA, Proceedings, pp. 1–18.
29 Liberatore, M. OLTP Application I/O and Search Engine I/O. http://traces.cs.umass.edu/index.php/Storage/Storage.

Author notes
Handling editor: Antonio Fernandez Anta

© The British Computer Society 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

The Computer Journal, Oxford University Press

Published: Nov 14, 2017
