Huang, Jianzhong; Zhou, Panping; Qin, Xiao; Wang, Yanqun; Xie, Changsheng

Abstract
For the sake of cost-effectiveness, it is conventional wisdom to employ (k+r,k) erasure codes to archive rarely accessed replicas, i.e. erasure-coded data archival. Existing research on erasure-coded data archival optimization mainly aims to reduce archival traffic within storage clusters. Apart from archival traffic, both non-sequential reads and imbalanced loads can deteriorate archival performance. Traditional distributed archival schemes (DArch for short) for randomly distributed replicas tend to suffer from two problems: (i) non-sequential reads, because underlying file systems split a data block into multiple smaller data chunks, and (ii) imbalanced loads, since archival tasks are assigned according to the data locality of replicas. To overcome such drawbacks, we incorporate both a prefetching mechanism and a balancing strategy into erasure-coded archival for replica-based storage clusters, and propose three new archival schemes: a prefetching-enabled archival scheme (i.e. P-DArch), a balancing-enabled archival scheme (i.e. B-DArch) and a prefetching-and-balancing-enabled archival scheme (i.e. PB-DArch). We implement a proof-of-concept prototype, where all four archival schemes are deployed and quantitatively evaluated. The experimental results show that both the prefetching mechanism and the balancing strategy can effectively optimize the archival performance of a replica-based storage cluster exhibiting a random data layout. In a (12,9) RS-coded archival scenario, P-DArch, B-DArch and PB-DArch outperform DArch by factors of 2.95, 1.72 and 3.85, respectively.

1. INTRODUCTION

1.1. Motivations
To protect data from various failures while maintaining high I/O parallelism, many production storage systems (see, for example, Google GFS [1], Hadoop HDFS [2], Microsoft FDS [3]) maintain replicated data (e.g. three copies for each block). Since there is strong temporal locality in data accesses (e.g. a majority of accesses in the Yahoo M45 Hadoop clusters and Facebook data warehouse clusters happen within the first day after creation, and the access frequency of newly created data continuously decreases as the data age [4]), it is common wisdom to archive rarely used files to low-cost storage. Compared to replication, erasure codes offer the same fault tolerance at a much lower cost [5–7] (e.g. 1.x for erasure codes versus 3.0 for triplication). As mentioned in Microsoft WAS [6], Google GFS II [8] and Facebook HDFS-RAID [9], it is a cost-effective solution to archive infrequently accessed data replicas using erasure codes, that is, erasure-coded data archival. Research on data archival mainly focuses on two fields: (i) designing long-term archival storage systems for newly created data using erasure codes, including Pergamum [7], POTSHARDS [10], DAWN [11] and so forth; and (ii) developing erasure-coded archival schemes for rarely accessed replicas, e.g. Synchronous Encoding [4], Decentralized Erasure Codes [12], RapidRAID [13], 3X [14] and aHDFS [15]. As to the latter, this paper addresses an important research issue: how to efficiently archive infrequently accessed data. Existing erasure-coded archival schemes attempt to optimize archival processes by minimizing the network traffic induced by archival operations, without taking nodes' network bandwidths and I/O accesses into account.
We take Synchronous Encoding [4] as an example to illustrate the erasure-coded archival process. As shown in Fig. 1, there are two storage clusters: (i) a replica-based production storage cluster, which supplies source data for upper-layer data processing applications, and (ii) a (k+r,k) erasure-coded archival storage cluster, which holds rarely accessed data migrated from the production storage cluster. A (k+r,k) erasure-coded archival process includes two phases: (i) k source data blocks {D1,D2,…,Dk} are delivered from the production storage cluster to the archival storage cluster; and (ii) r corresponding parity blocks {P1,P2,…,Pr} generated from the k source data blocks are transferred to the archival cluster. The traffic incurred by data migration, parity generation and parity migration is k blocks, k blocks and r blocks, respectively. Thus, the archival traffic in Synchronous Encoding is 2k+r blocks. Here, archival traffic refers to the total amount of data transferred over the network during data archival.

Figure 1. Diagram of the traditional (k+r,k) erasure-coded archival process, in which k source data blocks {D1,D2,…,Dk} are migrated from a production cluster to an archival cluster, and r corresponding parity blocks {P1,P2,…,Pr} generated from the k source data blocks by an archival manager are also delivered to the archival cluster.

It is noted that the parity generation traffic could be reduced to k−1 blocks if the encoding operation is conducted by a node storing one of the source data blocks. Furthermore, storage nodes in a real-world storage cluster are logically grouped into a production cluster and an archival cluster. That is, both the replica-based production cluster and the erasure-coded archival cluster may share a collection of identical storage nodes. As mentioned in Section 2.1, each of the k+r storage nodes may keep a data block of one stripe and a parity block of another stripe. As such, a parity block of a stripe can either (i) be placed on a separate node or (ii) be mixed with data blocks of other stripes across all nodes. In this study, we focus on such a data placement.

Two issues can be pinpointed by analyzing the existing erasure-coded archival processes:

Non-sequential disk reads. Usually, underlying file systems split a data block into multiple smaller data chunks [2]. An encoding node generates r corresponding parity blocks after retrieving the relevant source data blocks from the other nodes via the cluster's interconnection network. In such an archiving process, chunks across different blocks are usually non-sequential since the capacity of a modern disk is large,1 and random disk reads incur a large amount of disk seek time, thereby leading to low read throughput and degrading overall archival performance. Furthermore, when multiple encoding nodes concurrently handle data archival processes, if source data blocks of different archival stripes are located on the same disk, then fetching these blocks from the disk will trigger an excessive number of random reads, because chunks in different data blocks are also stored non-sequentially on the disk.
Imbalanced I/Os (e.g. network load and archival tasks). In the case where there is only a dedicated encoding node (e.g. the archival manager in Fig. 1 in Synchronous Encoding [4]), the overall archival performance is affected by the sending bandwidth of each data-block provider node. Given a stripe, the basic unit of archival, an encoding node must obtain k data blocks from k different provider nodes before generating parity blocks. If the k provider nodes have different sending bandwidths, then the node with the lowest bandwidth slows down the entire archival process. On the other hand, in the case where there are several encoding nodes (i.e. distributed data archiving, as in Decentralized Erasure Codes [12], RapidRAID [13] and 3X [14]), data-block replicas are usually distributed among nodes in a random manner; thus, a majority of archival tasks are assigned to a small group of provider nodes owing to the data locality of data-block replicas. Here, the parity-block generation for a stripe is referred to as an archival task.

From the methodological perspective, an archival scheme using a single encoding node can be regarded as a special case of an archival scheme adopting multiple encoding nodes. Hence, this study mainly investigates the latter case (i.e. distributed data archiving (DArch)). On the basis of DArch, we propose a prefetching-enabled archival scheme to alleviate the non-sequential-reading problem (i.e. P-DArch), a balancing-enabled archival scheme to address the I/O-imbalance issue (i.e. B-DArch) and a prefetching-and-balancing-enabled archival scheme (i.e. PB-DArch). With the prefetching mechanism in place, P-DArch speeds up the data-block-reading stage by turning random disk reads into sequential reads. With the help of the balancing strategy, B-DArch enables the network load among storage nodes running archival tasks in a cluster to be judiciously balanced, thereby maximizing the utilization of the network bandwidth of the entire cluster.

1.2. Contributions
The contributions of this study are summarized as follows:

We thoroughly analyze the I/O processes of existing archiving techniques and pinpoint the performance bottlenecks of archiving randomly distributed data in a replica-based storage cluster. That is, non-sequential disk reads coupled with imbalanced I/O loads suppress overall archiving performance.

We incorporate a prefetching mechanism into the archival process for a storage cluster with a random data layout. Such a prefetching-enabled archival scheme leverages data prefetching to achieve sequential reads, solving the random-read problem that occurs in archiving processes.

We integrate a balancing strategy into archival schemes to address the imbalanced-load issue in distributed archival solutions. The balancing strategy enables archival schemes to judiciously balance the network load among storage nodes running archival tasks in a cluster, thereby maximizing the cluster's network-bandwidth utilization.

We implement a proof-of-concept prototype where four archival schemes are deployed and quantitatively evaluated. Experimental results confirm that both the prefetching mechanism and the balancing strategy effectively improve archival performance. In a (12,9) RS-coded archival scenario, P-DArch, B-DArch and PB-DArch outperform DArch by factors of 2.95, 1.72 and 3.85, respectively.

1.3. Organization
The rest of this paper is organized as follows. Section 2 introduces the background of this study.
A prefetching-enabled archival scheme (i.e. P-DArch), a balancing-enabled archival scheme (i.e. B-DArch) and a prefetching-and-balancing-enabled archival scheme (i.e. PB-DArch) are detailed in Sections 3–5, respectively. Performance evaluation can be found in Section 6. Section 7 outlines prior studies related to erasure-coded archival techniques. Section 8 discusses a few important applicability issues. We conclude this study in Section 9.

2. BACKGROUND

2.1. Erasure codes for storage systems
Erasure-coded data help in reducing storage consumption in data centers. RS codes have been employed to build erasure-coded storage clusters [6, 8, 16–18, 10]. In particular, (k+r,k) RS codes encode source data with a k×(k+r) generator matrix, which comprises a k×k identity matrix and a k×r redundancy matrix [19]. The source data are embedded in the encoded data because of the identity matrix. RS encoding relies on simple linear algebra: parity blocks are generated by multiplying the k data blocks with the k×r redundancy matrix. In Vandermonde Reed–Solomon codes [21, 20], for example, the coefficient α1,i is 1, where i∈{1,2,…,k}. Each archival unit is a stripe, which includes k data strips and r parity strips [22]. In RAIDs, each disk can be viewed as a collection of strips, and each strip can be partitioned into multiple blocks [23]; the mapping is more complex in distributed storage systems. For example, HDFS usually divides files into fixed-size logical byte ranges called logical blocks, and these logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster.

Let us consider a (k+r,k) RS-coded storage cluster, which consists of an array of k+r storage nodes, a manager node and a collection of client nodes. All these machines are connected via a fast switch. As maximum distance separable (MDS) codes, (k+r,k) RS codes have the property that the source data can be reconstructed even when any r nodes fail. Specifically, each file with k data blocks in the (k+r,k) RS-coded storage cluster is encoded into k+r blocks, such that any k of the total k+r blocks can recover all data blocks. To guarantee fault tolerance, the k data blocks and r parity blocks are exclusively stored on the k data nodes and r parity nodes, respectively (see Fig. 2). In practice, each of the k+r storage nodes may keep a data block of one stripe and a parity block of another stripe. That is, a parity block of a stripe can either (i) be stored on a separate and dedicated node or (ii) be mixed with data blocks of other stripes across all nodes in the cluster.

Figure 2. Each archival unit in a (k+r,k) RS-coded storage cluster is a stripe, which includes k data strips and r parity strips. A strip is set to be a block. Parameter row serves as a stripe ID.

2.2. Replication redundancy
Replication schemes are widely used to ensure high data reliability and to facilitate I/O parallelism in distributed storage clusters such as GFS [1], HDFS [2] and QFS [24]. A cluster file system maintains two or three replicas for each data block. However, replicated data improve data reliability and I/O performance at the cost of high storage consumption. For example, the storage consumption of two-way replication and triplication is as high as 200% and 300% of the source data size, respectively.
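To make this storage-cost comparison concrete, the following minimal sketch (ours, not part of the original paper; the (14,10) and (12,9) parameters are taken from the examples cited in this paper) computes the storage consumed per byte of user data under n-way replication and under a (k+r,k) erasure code.

```python
def replication_ratio(copies: int) -> float:
    """Bytes stored per byte of user data under n-way replication."""
    return float(copies)

def erasure_ratio(k: int, r: int) -> float:
    """Bytes stored per byte of user data under a (k+r, k) erasure code."""
    return (k + r) / k

if __name__ == "__main__":
    print(replication_ratio(3))       # 3.0   -> triplication (300% consumption)
    print(erasure_ratio(k=10, r=4))   # 1.4   -> Facebook-style (14,10) RS code
    print(erasure_ratio(k=9, r=3))    # ~1.33 -> the (12,9) code evaluated later
```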
Archiving infrequently accessed data (a.k.a. unpopular data) can improve storage utilization. Additionally, such an archiving operation has little impact on I/O access services in clusters, since the data to be archived exhibit decreased access frequency. In Facebook's BLOB storage clusters, the BLOB storage workload changes along with Facebook's growth. Consequently, newly created BLOBs (i.e. hot data) have three replicas to support a high request rate; week-old BLOBs (i.e. warm data) have two copies using geo-replicated XOR coding; and BLOBs older than two months (i.e. cold data) are converted into the format of (14,10) RS-coded data [18, 25].

Replica placement plays a significant role in data reliability. A typical example is the rack-aware replica placement policy adopted by HDFS [2]. With rack-aware replica placement in place, the three replicas of a data block are organized as follows: one replica is stored on a node in the local rack, another replica is placed on a node in a remote rack and the last one is kept on a separate node in the same remote rack. In two-way replication, each primary block and its replica block are stored on two distinct nodes. When an erasure-coded data archival procedure is launched, it is challenging to determine the set of source data blocks for an archival stripe from a group of data replicas, because the replicas of any two distinct data blocks are randomly distributed.

2.3. Distributed erasure-coded archival
Decentralized encoding [12], parallel data archiving [15] and pipelined archival [14] belong to the camp of distributed data archival schemes (i.e. DArch). In the DArch cases, two or more nodes—acting as encoding nodes—accomplish the parity generation. In particular, each encoding node retrieves k different data blocks from existing data replicas and computes r parity blocks. In this section, we use a concrete example to illustrate the DArch scheme. Figure 3 demonstrates how to archive two-replica data with a random distribution using RS codes with coding parameter k = 4 in a storage cluster, where {D1,1,D1,2,D1,3,D1,4} and {D2,1,D2,2,D2,3,D2,4} are the source data blocks of stripes 1 and 2, respectively. Data blocks are randomly distributed on eight data nodes, as long as the two replicas of any data block are stored on two different data nodes.

Figure 3. The distributed archiving scheme (DArch) enables multiple encoding nodes (e.g. SN1 and SN5) to share the data archival responsibility for all archival stripes.

DArch embraces the idea of parallel archiving, namely, multiple encoding nodes share the responsibility of parity-block generation. For each archival stripe, a master node and some provider nodes are chosen according to data locality. A master node acts as the encoding node for a specific archival stripe. Specifically, given a stripe, the master node should be a node that contains the largest number of source data blocks of the stripe. For example, Fig. 3 illustrates that the two archival stripes are processed by two encoding nodes, SN1 and SN5.
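As a concrete illustration of this locality-driven selection, the sketch below is our own simplification, not the authors' implementation; the replica layout in the usage example is hypothetical. It picks, for one stripe, the node that stores the most of the stripe's source blocks.

```python
from collections import defaultdict

def choose_master_by_locality(stripe_blocks, replica_map):
    """DArch-style master selection: the node holding the most source blocks
    of the stripe becomes its encoding (master) node.

    stripe_blocks: block IDs forming one archival stripe.
    replica_map:   dict mapping a block ID to the set of nodes storing it.
    """
    locality = defaultdict(int)
    for block in stripe_blocks:
        for node in replica_map[block]:
            locality[node] += 1
    # Ties go to the lexicographically smallest node ID, purely for determinism.
    return max(sorted(locality), key=lambda node: locality[node])

# Hypothetical two-replica layout (not the exact layout of Fig. 3):
replicas = {"D1,1": {"SN1", "SN3"}, "D1,2": {"SN1", "SN4"},
            "D1,3": {"SN2", "SN6"}, "D1,4": {"SN5", "SN7"}}
print(choose_master_by_locality(["D1,1", "D1,2", "D1,3", "D1,4"], replicas))  # SN1
```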
After a master node is selected for an archival stripe, DArch nominates the associated data-block provider nodes as follows: a higher priority is assigned to a node storing a larger number of source data blocks; the master node fetches data blocks from the highest-priority remote node first, and so forth. In this study, we investigate optimization approaches on the basis of DArch, which is capable of accomplishing erasure-coded archival for replicas managed by random data placement strategies. After a thorough analysis, we observe that non-sequential disk reads coupled with imbalanced I/O loads suppress overall archiving performance. To alleviate the above bottlenecks, we propose to incorporate a prefetching mechanism into the data-block-reading stage to enhance disk I/O throughput (see Section 3), and to adopt a load-balancing strategy that evenly distributes archival tasks among multiple nodes to increase the utilization of network bandwidth (see Section 4).

3. PREFETCHING-ENABLED ARCHIVING

3.1. Performance bottlenecks in data archival
In a handful of replica-based storage clusters (e.g. GFS [8], HDFS [9] and QFS [24]), files are organized in the form of large blocks to reduce metadata management overhead. The default block size is 64 MB in GFS, HDFS and QFS, and the size is configurable (e.g. 128 MB, 256 MB and 512 MB). If an encoding unit is a large block of 64 MB, then the large encoding unit may cause the following two problems. First, the large unit requires a large amount of main memory to keep all k data blocks and the r generated parity blocks, thereby limiting archival parallelism. Taking (9,6) RS codes as an example, an encoding node needs 576 MB (i.e. 64 MB × 9) to archive a stripe. Second, the entire archival process comprises three stages, namely, the data-block-reading/data-block-retrieving, parity-block-generating and parity-block-distributing stages. All three stages run in a serialized manner for a stripe, consuming a considerable amount of archiving time. Therefore, it is reasonable to keep encoding units as small data chunks (e.g. 64 KB). In this case, chunks belonging to different data blocks are stored non-sequentially on the disk. Recall that the replicas of any two distinct data blocks are randomly distributed and that the DArch scheme lets a master node retrieve data blocks from associated provider nodes; chunks of different data blocks stored on the same disk are therefore read in a non-sequential manner.

In what follows, we qualitatively analyze the weaknesses of DArch in terms of archival performance. In DArch, multiple master nodes are judiciously chosen in the cluster, and stripe-archiving tasks are assigned to the master nodes. In particular, given an archival stripe, a master node is selected according to the stripe's data locality. The master node reads some data blocks from local disks while receiving other data blocks from remote provider nodes. DArch suffers from non-sequential disk I/Os because: (i) a master node may read multiple chunks from different data blocks for a stripe (see Fig. 4a); (ii) a master node for one stripe may provide data blocks for another stripe; and (iii) a provider node for one stripe may become a parity-block-receiver node of another stripe.
Figure 4. Two reading patterns in a data-provider node: (a) in a normal per-chunk read, chunks in multiple data blocks are fetched to main memory, and only one chunk is read from each data block at a time; (b) in a prefetching-enabled read, two sequential data chunks are fetched to a corresponding memory region.

3.2. A prefetching mechanism for archival
Prefetching schemes are widely employed in caching systems in order to (i) bridge the performance gap between processors and main memory [26, 27] or (ii) obtain metadata information in advance for storage systems [28, 29]. In this study, we deploy a prefetching mechanism to accelerate data-block-retrieving operations in the erasure-coded data archival process.

Figure 4a shows that each data block (e.g. D1) is read from a disk in the form of chunks {C[1,D1], C[2,D1],…,C[Nc,D1]}, where parameter Nc is the number of chunks in a block. If three data blocks (e.g. D1, D2 and D3) located on a node belong to a stripe, then the chunks are read in the following order: {C[1,D1], C[1,D2], C[1,D3], C[2,D1], C[2,D2], C[2,D3], …, C[Nc,D1], C[Nc,D2], C[Nc,D3]}. Such a reading pattern has a strong likelihood of leading to an excessive number of random disk reads. If chunk C[2,Di] is accessed immediately after reading chunk C[1,Di] from the disk (where i can be 1, 2 or 3), then reading C[1,Di] along with C[2,Di] potentially forms sequential reads, thereby improving disk read performance. Note that chunks C[1,Di] and C[2,Di] can be written contiguously to disk for block Di, for two reasons: (i) each data block in GFS, HDFS or QFS is stored as a separate file in a native file system (e.g. ext3) on a storage node and (ii) the space reservation feature of the underlying file system can be used to ensure that chunks are written contiguously to the disk.

The prefetching mechanism avoids random reads of data chunks from multiple data blocks stored on a single node. Figure 4b demonstrates that after two sequential data chunks are fetched into the corresponding memory region (e.g. Mem[D1] for chunks C[1,D1] and C[2,D1]), the master node directly retrieves the buffered data chunks from the memory regions of the provider nodes, and then carries out the calculation using k data chunks. It is noteworthy that, in this example, the master node obtains the three blocks (i.e. D1, D2 and D3) from a local disk. With the prefetching mechanism in place, the master node uses three memory regions Mem[D1], Mem[D2] and Mem[D3] to keep chunks of data blocks D1, D2 and D3, respectively. After two or more sequential chunks are read from Di, the chunks are buffered into memory region Mem[Di], where i∈{1,2,3}.

3.3. Incorporating prefetching into archival
To overcome the non-sequential-I/O problem in the DArch scheme, we integrate the prefetching mechanism (see Section 3.2) into the data-block-reading stage of the erasure-coded data archival procedure, and design a prefetching-based archival algorithm tailored for DArch. We refer to the new algorithm as P-DArch (see Algorithm 1).
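To make the two read patterns of Fig. 4 concrete, the short sketch below (ours; the block names and chunk counts are illustrative) enumerates the order in which chunks are requested from the disk without prefetching (Fig. 4a) and with a prefetch window of Nprefetch chunks (Fig. 4b).

```python
def per_chunk_read_order(blocks, n_chunks):
    """Fig. 4a: one chunk per block in round-robin order -> frequent seeks."""
    return [f"C[{j},{b}]" for j in range(1, n_chunks + 1) for b in blocks]

def prefetch_read_order(blocks, n_chunks, n_prefetch):
    """Fig. 4b: read N_prefetch consecutive chunks of one block before switching."""
    order = []
    for start in range(1, n_chunks + 1, n_prefetch):
        stop = min(start + n_prefetch, n_chunks + 1)
        for b in blocks:
            order.extend(f"C[{j},{b}]" for j in range(start, stop))
    return order

print(per_chunk_read_order(["D1", "D2", "D3"], 4)[:6])
# ['C[1,D1]', 'C[1,D2]', 'C[1,D3]', 'C[2,D1]', 'C[2,D2]', 'C[2,D3]']
print(prefetch_read_order(["D1", "D2", "D3"], 4, 2)[:6])
# ['C[1,D1]', 'C[2,D1]', 'C[1,D2]', 'C[2,D2]', 'C[1,D3]', 'C[2,D3]']
```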
Algorithm 1. The P-DArch algorithm.

Before presenting the algorithms in pseudo-code, we summarize the symbols and notation used in the algorithms in Table 1.

Table 1. Symbols and definitions.
Symbol            Definition
EN                Encoding node, or master node
DPN               Data-provider node
Nnode             Number of nodes keeping replicas in a cluster
k, r              Redundancy parameters in (k+r,k) RS codes
Srow              The row-th stripe
Drow,i            The i-th data block in stripe Srow, with i∈{1,…,k}
Di                Shorthand for Drow,i
Nc                Number of chunks in a data block
Nprefetch         Number of prefetched chunks
C[j,Di]           The j-th chunk in block Di, with j∈{1,2,…,Nc}
Mem[Di]           Memory region allocated to data block Di
Locality[row,j]   Locality of stripe Srow on node SNj
Taskj             Number of archival tasks of node SNj

In the P-DArch scheme, each data stripe is archived by carrying out the following five steps. First, data locality is assessed for each archival stripe by counting the number Locality[row,j] of data blocks residing on storage node SNj, with 1≤j≤Nnode. Second, master node EN is designated as the node with the best locality measure (i.e. the largest value of Locality[row,j]). Third, provider nodes are selected in light of the data locality of the storage nodes. The master node EN also serves as a provider node. Fourth, each provider node retrieves data blocks from its local disk using the prefetching mechanism. Finally, the master node computes the r parity blocks, which are delivered to r separate nodes. Both the data-block-retrieving (see Lines 12–18) and parity-block-generating (see Lines 19–23) operations are performed at the granularity of a chunk, and they are handled in an asynchronous manner.
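The chunk-granularity pipeline described above can be pictured with the sketch below. It is our own simplification of Algorithm 1, not the authors' code: read_chunk stands in for a provider node's prefetch buffer, and a plain XOR stands in for the Reed–Solomon multiplication by the redundancy matrix, so all r parity chunks produced here are identical.

```python
def encode_stripe_chunkwise(read_chunk, k, r, n_chunks):
    """Generate r parity block byte-strings for one stripe, one chunk index at a time.

    read_chunk(i, j) returns the j-th chunk (bytes) of the i-th data block; in
    P-DArch it would be served from the provider's prefetch buffer Mem[Di].
    """
    parity_chunks = [[] for _ in range(r)]
    for j in range(n_chunks):
        data = [read_chunk(i, j) for i in range(k)]   # the k corresponding chunks
        acc = bytearray(len(data[0]))
        for chunk in data:                            # XOR placeholder for RS math
            for idx, byte in enumerate(chunk):
                acc[idx] ^= byte
        for p in range(r):                            # a real code would differ per parity
            parity_chunks[p].append(bytes(acc))
    return [b"".join(chunks) for chunks in parity_chunks]

# Tiny usage example: k=3 data blocks of 2 chunks, 4 bytes per chunk.
fake_blocks = [bytes([i] * 8) for i in range(3)]
parities = encode_stripe_chunkwise(lambda i, j: fake_blocks[i][j*4:(j+1)*4], 3, 2, 2)
print(parities)  # two identical 8-byte parity blocks (XOR of the three data blocks)
```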
3.4. Features of prefetching-enabled archival
In what follows, we summarize the features of the prefetching-enabled archival scheme (i.e. P-DArch) from the perspectives of sequential I/Os, data locality and network load. The incorporated prefetching technique optimizes data-block-reading performance by virtue of sequential I/Os, which sequentially fetch data chunks from a disk into main memory before the encoding process. Similar to the baseline archival scheme (i.e. DArch), P-DArch appoints data-provider nodes that exhibit high data locality for the stripe, so as to reduce the overall network traffic of the cluster. Owing to data locality, there is a big network-traffic discrepancy among the involved data-provider nodes in a storage cluster where data layouts are random. Such a data-locality problem leads to imbalanced network load.

4. BALANCING-ENABLED ARCHIVING
Due to the uneven distribution of data-block replicas, DArch may assign a large number of archival tasks, each of which is responsible for archiving one stripe, to a small group of storage nodes. As a result, DArch is likely to experience unbalanced network I/Os. In this section, we design a balancing-enabled archival scheme—B-DArch—by applying a load-balancing strategy.

4.1. A balancing strategy for archival
In DArch (see Section 2.3) and P-DArch (see Section 3.3), both master and provider nodes are chosen based on the data locality of archival stripes, potentially distributing the network load among all the involved nodes. However, there is still a big discrepancy between the network traffic of data-provider nodes and that of non-data-provider nodes in a storage cluster exhibiting a random data layout. Figure 5 shows a concrete example of the load-balancing strategy. For the sake of simplicity, network load is indicated by the archival tasks of master and provider nodes. An archival task refers to a master node's request to receive one data block or send one parity block, or a provider node's request to send one data block or receive one parity block. When determining master and provider nodes, B-DArch gives top priority to achieving balanced load among the involved nodes for a specific archival stripe, and then takes data locality into account to select an appropriate node if two or more nodes have similar archival task counts.

Figure 5. B-DArch lets multiple archival stripes be processed by different encoding nodes (e.g. SN2 and SN6), which are picked according to the nodes' archival task counts shown in Table 2.

Given an archival stripe, the node running the fewest archival tasks is elected to be the master node. Table 2 depicts the number of archival tasks assigned to the eight storage nodes {SN1,SN2,…,SN8}. For archival stripe 1, node SN6 is designated as the master node, since it has the lightest archival load. Because SN6 holds data block D1,3, it is expected to retrieve the other three data blocks {D1,1,D1,2,D1,4} from the other candidate storage nodes. Among the three nodes {SN1,SN3,SN4} with an identical number of archival tasks, node SN1 has the best data locality. However, node SN1 only supplies data block D1,2 for archival stripe 1, for the purpose of load balancing.

Table 2. Number of archival tasks managed by nodes.
Node          SN1  SN2  SN3  SN4  SN5  SN6  SN7  SN8
For stripe 1    2    3    2    2    3    1    3    5
For stripe 2    4    2    3    3    3    2    3    3
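A minimal sketch of this selection rule is shown below (ours, not the pseudo-code of Algorithm 2): the candidate with the fewest pending archival tasks wins, and the stripe's data locality breaks ties. The usage example reuses the stripe-1 row of Table 2 together with a partial, hypothetical locality map.

```python
def choose_master_balanced(stripe, nodes, tasks, locality):
    """B-DArch-style master selection: fewest archival tasks first,
    then the stripe's data locality as the tie-breaker.

    tasks:    dict node -> number of archival tasks currently assigned.
    locality: dict (stripe, node) -> number of the stripe's data blocks on that node.
    """
    return min(nodes, key=lambda n: (tasks[n], -locality.get((stripe, n), 0)))

# Task counts for stripe 1 from Table 2; the locality map only records that
# SN6 holds D1,3 of stripe 1 (other entries omitted for brevity).
tasks = {"SN1": 2, "SN2": 3, "SN3": 2, "SN4": 2,
         "SN5": 3, "SN6": 1, "SN7": 3, "SN8": 5}
locality = {(1, "SN6"): 1}
print(choose_master_balanced(1, tasks.keys(), tasks, locality))  # 'SN6'
```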
When two or more candidate nodes have the same number of archival tasks, data locality is taken into consideration to break the tie when nominating a master node. For example, for stripe 2, nodes SN2 and SN6 are identical in terms of the number of archival tasks. In this case, SN2 rather than SN6 is the preferred master node, because a data block (i.e. D2,4) of stripe 2 resides on node SN2. By determining both master and provider nodes according to the number of archival tasks as well as data locality, B-DArch further boosts disk and network bandwidth utilization compared with DArch. That is, B-DArch can achieve higher disk and network utilization than DArch from the entire cluster's perspective, thereby speeding up the overall archival process.

4.2. Integrating balancing into archival
To address the imbalanced-network-I/O issue in DArch, we apply the balancing strategy (see Section 4.1) to determine both master and provider nodes during the course of erasure-coded data archiving, and obtain a balancing-based data archival algorithm for DArch. We refer to this new algorithm as B-DArch (see Algorithm 2).

Algorithm 2. The B-DArch algorithm.

In the B-DArch scheme, each data stripe is archived by carrying out the following five steps. First, B-DArch counts the number Locality[row,j] of data blocks residing on storage node SNj (1≤j≤Nnode). Second, the node with the lightest archival task measure (i.e. the smallest value of Taskj) is selected to be master node EN. If two or more nodes have the lightest archival load, the one with higher data locality is designated as EN to break the tie. Third, provider nodes are elected in light of the archival task counts of the storage nodes. Fourth, each provider node reads data blocks into its main memory in the normal per-chunk manner (see also Fig. 4a). Finally, the master node computes the r parity blocks, which are delivered to r separate nodes. Note that the number of archival tasks is updated accordingly whenever a master or provider node undertakes a job (e.g. data-block receiving, data-block sending or parity-block sending), so that the task counts accurately reflect the master and provider nodes' current workloads.

4.3. Features of balancing-enabled archival
The main features of the balancing-enabled archival scheme are summarized as follows. The integrated balancing strategy enables data-block-retrieving jobs to be distributed to provider nodes as evenly as possible, maximizing the network utilization of storage clusters. Apart from the archival task measure, data locality is taken into consideration when determining the group of data-provider nodes.

5. THE PB-DARCH ALGORITHM
The P-DArch and B-DArch schemes improve the archival performance of DArch by deploying a prefetching mechanism and a balancing strategy, respectively. To address both the non-sequential-read and imbalanced-network-I/O issues of DArch, we develop a prefetching-and-balancing-enabled archival algorithm for DArch. We refer to this new algorithm as PB-DArch. The pseudo-code of PB-DArch can be found in Algorithm 3. It is noted that all the steps except the fourth one are similar to those in B-DArch.
In the fourth step, each provider node retrieves data blocks from its local disk using the prefetching mechanism (see Fig. 4b). With the prefetching mechanism in place, a provider node sequentially reads Nprefetch chunks of a data block and buffers them in a pre-allocated memory region; chunks of different data blocks are buffered in their corresponding memory regions. Intuitively, the PB-DArch scheme is superior to both P-DArch and B-DArch, because it not only adopts prefetching to improve the data-block-reading stage of each provider node but also deploys the load-balancing strategy to balance the network load among provider nodes.

Algorithm 3. The PB-DArch algorithm.

6. PERFORMANCE EVALUATION
To quantitatively evaluate the performance improvements offered by the prefetching mechanism and the balancing strategy, we implement four archival schemes in a real-world storage cluster: a prefetching-enabled scheme (i.e. P-DArch), a balancing-enabled scheme (i.e. B-DArch), a prefetching-and-balancing-enabled scheme (i.e. PB-DArch) and a baseline solution (i.e. DArch). In this section, we describe the experimental environment and methodology; then we evaluate the impact of system parameters such as k, Nnode and Nprefetch on archival performance.

6.1. Experimental environment

6.1.1. System setup
We build an experimental storage cluster, which consists of 1 management node and 15 storage nodes. The management node is configurable in terms of data archival parameters. Storage nodes are responsible for storing and archiving data stripes. All the nodes are connected through a Cisco WS-C3750G-24TS-S Ethernet switch. The management node contains an Intel(R) E5-2650 CPU @ 2.0 GHz and 16 GB of DDR3 memory. Each storage node comprises an Intel(R) E5800 CPU @ 3.2 GHz, a Gigabit Ethernet card and a Western Digital enterprise SATA 2.0 disk (model WD1002FBYS); we install Linux (kernel 2.6.32) on the storage cluster.

6.1.2. Prototype design
We implement a prototype system, where the manager node adopts one of the four candidate archival schemes (a.k.a. coding algorithms) to guide the data nodes in accomplishing archival tasks. The management node judiciously nominates a data node to perform the master-node duty, which is in charge of archiving a data stripe using the coding algorithm. After completing the archival process of one stripe, the master node notifies the management node to measure the data archiving time. The total archiving time is the interval between the archival task's start time and its completion time. Data blocks are randomly distributed across the data nodes. The metadata (e.g. data layout information) of the data blocks are accessible to the manager node.

6.1.3. Testing methodology
The evidence shows that a configuration of r=3 achieves a sufficiently large mean time to data loss (MTTDL) for archival storage systems [7]. For example, code parameter r=3 is adopted in Google's new GFS [8] and the HYDRAStor storage system [30]. We adopt r=3 in our experiments to resemble real-world storage clusters. To guarantee the reliability of each stripe, the k data blocks and r parity blocks are exclusively distributed to k+r nodes. In the following tests, three (i.e. r=3) separate storage nodes are used to keep the generated parity blocks, and Nnode storage nodes are employed to store the source data blocks (i.e. data replicas) for data archival.
One hundred and forty-four source data blocks are distributed to the Nnode nodes in three-way replication; the data block size is set to 64 MB. Additionally, the configuration 'Nnode=8, k=6, Nprefetch=4, Nc=1024, SRU=64 KB' is adopted in the following experiments, unless otherwise specified. The performance metric used in our experiments is archiving time; a small archiving time means high archival performance. Before running each experiment, we clear the data buffered in main memory to ensure that the cache is empty. We repeat each experiment five times to calculate the average archiving time. During the data archiving process, the master node retrieves multiple data blocks from the provider nodes through the network interconnect; such a data-block-retrieving process is a synchronized read, which is the root cause of TCP Incast [31]. To overcome the Incast problem in our cluster, we enable high-resolution retransmission by setting the RTO to 200 μs.

In order to quantitatively compare the four archival schemes, we conduct four groups of experiments to examine the sensitivity of the archival schemes to the following factors: the number of data blocks in (k+r,k) erasure codes; the number of nodes keeping source data blocks; the number of prefetched data chunks; and the size of the storage request unit (SRU).

6.2. Experimental results

6.2.1. k—number of data blocks in erasure encoding
In a storage cluster applying (k+r,k) erasure codes, the k data blocks of each data stripe generate r parity blocks. We investigate the impact of the number k of data blocks on archiving performance by setting k to 6, 9 and 12, respectively. Figure 6 shows the archiving times of the four schemes when we set parameters SRU, r and Nnode to 64 KB, 3 and 8, respectively.

Figure 6. The impact of the number of data blocks on the archiving times of the four schemes, with k = 6, 9, 12, SRU = 64 KB, r = 3 and Nnode = 8.

We observe from the experimental results that, regardless of parameter k, our P-DArch, B-DArch and PB-DArch outperform the baseline archival scheme. That is, it is confirmed that both the prefetching mechanism and the balancing strategy help to optimize the performance of random-layout archival schemes. For example, when k is set to 9, P-DArch, B-DArch and PB-DArch speed up the archiving performance of DArch by factors of 2.95, 1.72 and 3.85, respectively. The performance improvements offered by P-DArch and B-DArch are attributed to prefetching and balancing, respectively.

Surprisingly, parameter k has a marginal performance impact on the four schemes. Take the DArch scheme as an example: the archival performance of DArch is flat when k increases. The reasons are two-fold. First, as k grows, the number of local data blocks on each master node increases, since DArch selects a master node for each stripe according to data locality; a large k thus exacerbates non-sequential reads and increases the archiving time of one stripe. Second, increasing k decreases the number of archival stripes. The two effects roughly offset each other, so k has little impact on the performance of DArch.

From Fig. 6, it is observed that P-DArch and PB-DArch outperform DArch and B-DArch, respectively, because the prefetching mechanism enables the data-block provider nodes in P-DArch and PB-DArch to sequentially fetch correlated data chunks.
In particular, P-DArch speeds up the archival process of DArch by factors of 3.44, 2.95 and 2.93 when k is set to 6, 9 and 12, respectively; PB-DArch outperforms B-DArch by factors of 2.43, 2.24 and 2.01 when k is set to 6, 9 and 12, respectively.

From the experimental results, we also observe that B-DArch and PB-DArch outperform DArch and P-DArch, respectively, because the load-balancing strategy enables B-DArch and PB-DArch to evenly distribute archival tasks to the master and provider nodes for each archival stripe. In particular, B-DArch speeds up the archival process of DArch by factors of 1.67, 1.72 and 1.60 when k is set to 6, 9 and 12, respectively; PB-DArch outperforms P-DArch by factors of 1.18, 1.30 and 1.10 when k is set to 6, 9 and 12, respectively.

Interestingly, we discover that the room for performance improvement offered by prefetching shrinks when the load-balancing strategy is incorporated first. For instance, P-DArch speeds up DArch by a factor of 2.95 when k is 9, whereas the archival performance of PB-DArch is only 2.24 times that of B-DArch. The reason lies in the fact that load balancing allows more nodes to be designated as provider nodes, decreasing the number of each provider's local data blocks and weakening the advantage of the prefetching mechanism. Nevertheless, load balancing does optimize the archiving process.

6.2.2. Nnode—number of source data nodes
In a (k+r,k) erasure-coded archival system, data replicas are dispersed across Nnode data nodes. In this group of experiments, we examine the sensitivity of the four archival schemes to the number Nnode of data nodes in the storage cluster (Nnode=6, 8, 10), where parameters SRU, k and r are set to 64 KB, 6 and 3, respectively.

Figure 7 shows that with the increasing value of Nnode, the archiving times of all the archiving schemes are reduced. Changing Nnode from 6 to 8, we observe that the archiving times of DArch, P-DArch, B-DArch and PB-DArch decrease by 36.6%, 53.3%, 47.9% and 54.5%, respectively. The fundamental reason is that both disk and network I/O resources become more abundant as the number of data nodes increases, which in turn brings about high archival performance.

Figure 7. The impact of the number of data nodes (Nnode=6, 8, 10) on the archiving times of the four schemes, with SRU = 64 KB, k = 6 and r = 3.

The bottleneck of the non-prefetching-enabled schemes (i.e. DArch and B-DArch) is non-sequential disk reads. When the number of data nodes is large, the available disk bandwidth of the entire system increases and the data-block-reading stage is in turn accelerated; thus, the bottleneck of the non-prefetching-enabled solutions is alleviated. For the non-balancing-enabled archival schemes (i.e. DArch and P-DArch), network bandwidth usually dominates the overall archival performance. To address this issue, the two balancing-enabled schemes (i.e. B-DArch and PB-DArch) select a master node for each archival stripe according to archival tasks and data locality, making full use of the disk and network I/O of all the nodes. Consequently, when the value of Nnode becomes large, B-DArch and PB-DArch deliver higher archival performance than DArch and P-DArch, respectively.
For example, when Nnode is set to 6, the archiving times of DArch and B-DArch are 217.6 and 179.0 s, respectively; whereas the times become 138.0 and 93.2 s when Nnode is 8. When the number of data nodes increases, the performance bottleneck of the archival schemes is alleviated; however, the room for further performance improvement gradually shrinks, because there is less to gain from optimizing non-sequential disk reads and imbalanced network load. In particular, the archiving-time reduction is usually smaller when Nnode varies from 8 to 10 than when it varies from 6 to 8. Take P-DArch as an example: the archiving time is reduced by 53.3% when Nnode increases from 6 to 8, whereas the reduction is 21.2% when Nnode increases from 8 to 10.

6.2.3. Nprefetch—number of prefetched chunks
In the prefetching-enabled schemes, Nprefetch chunks of a data block are fetched into main memory to mitigate the random-chunk-reading effect. Figure 8 shows the impact of the prefetching granularity on PB-DArch's archiving performance.

Figure 8. Impact of the number of prefetched chunks on PB-DArch's archival performance, with Nprefetch={1,2,4,8,16,32,64,128,256,512,1024}, k=6, r=3 and SRU=64 KB.

The number Nc of data chunks is 1024 when the data block and chunk unit are set to 64 MB and 64 KB, respectively. To align prefetched chunks, it is necessary to ensure that the value of Nc is a multiple of the number Nprefetch of prefetched chunks. Therefore, we set the prefetching granularity Nprefetch to {1,2,4,8,16,32,64,128,256,512,1024}. If Nprefetch is 1, PB-DArch degrades to B-DArch; when Nprefetch equals Nc, PB-DArch lets a provider node fetch an entire data block into memory without interruption.

We draw two intriguing observations from Fig. 8. First, the archiving time of PB-DArch decreases when the number Nprefetch of prefetched chunks increases from 1 to 16. This trend is reasonable because, when Nprefetch is relatively small, the prefetching-enabled archival schemes achieve higher performance if more sequential data chunks are prefetched. Note that prefetching reads are able to hide disk seek times. Second, the archiving time increases when the value of Nprefetch varies from 128 to 1024. The reason is that a master node can compute the r parity chunks only after the k corresponding data chunks of the k data blocks have been retrieved; when Nprefetch is very large, prefetching reads on a provider node with high data locality may delay the subsequent encoding operations, resulting in a long waiting time and deteriorating the overall archiving performance.

6.2.4. SRU—size of SRU
Data accesses to storage devices are issued in units of an SRU. During the course of erasure-coded data archiving, data blocks are retrieved chunk by chunk, so the SRU is equivalent to the data chunk in the archival schemes. To examine the impact of the SRU size, we conduct (9,6) RS-coded archival experiments where the SRU is set to {32 KB, 64 KB, …, 64 MB} for DArch and B-DArch, and to {32 KB, 64 KB, …, 16 MB} for P-DArch and PB-DArch. The maximum value is 16 MB in the prefetching-enabled archival schemes, because the default value of Nprefetch is 4.
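The two parameter constraints just mentioned (Nc must be a multiple of Nprefetch, which in turn caps the SRU at the block size divided by Nprefetch) can be checked mechanically. The snippet below is our own illustration, not part of the evaluation harness.

```python
BLOCK = 64 * 1024 * 1024   # 64 MB data block, as configured in Section 6.1.3

def valid_sru_sizes(n_prefetch, candidates):
    """Keep SRU sizes whose chunk count Nc = BLOCK/SRU is a positive multiple
    of N_prefetch; this also bounds the SRU by BLOCK / N_prefetch."""
    keep = []
    for sru in candidates:
        if BLOCK % sru:
            continue
        n_chunks = BLOCK // sru
        if n_chunks >= n_prefetch and n_chunks % n_prefetch == 0:
            keep.append(sru)
    return keep

candidates = [2 ** i * 1024 for i in range(5, 17)]          # 32 KB ... 64 MB
print([s // 1024 for s in valid_sru_sizes(4, candidates)])  # up to 16384 KB (16 MB)
```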
Due to the space limit, we only present the experimental results of B-DArch and PB-DArch (see Fig. 9), because DArch exhibits a performance trend similar to that of B-DArch, and P-DArch exhibits a trend similar to that of PB-DArch.

Figure 9. A comparison of the archiving times of B-DArch and PB-DArch with respect to different SRU sizes, with k=6 and r=3. (a) The B-DArch scheme and (b) the PB-DArch scheme.

From Fig. 9a, we observe that B-DArch's archiving time decreases with increasing SRU when the SRU is anywhere between 32 KB and 2 MB, and its archiving time increases with increasing SRU when the SRU is in the range between 4 MB and 64 MB. On one hand, when the SRU is smaller than 2 MB, a larger SRU can strengthen read sequentiality, thereby boosting data-block-retrieving performance. On the other hand, when the SRU is larger than 4 MB, the time-consuming data-block-reading operation on provider nodes keeps the next encoding operation waiting, which in turn enlarges the overall archiving time. PB-DArch and B-DArch share a similar performance trend, although both the decrease and the increase in archiving time caused by varying the SRU are smaller in PB-DArch than in B-DArch. Figure 9a and b reveal that PB-DArch outperforms B-DArch when the SRU is smaller than 512 KB, implying that a non-sequential-chunk-reading problem exists in erasure-coded data archival for a storage cluster with a random data layout.

6.3. A summary of observations
We summarize the important observations drawn from Figs 6–9 as follows:

The prefetching mechanism helps to optimize the performance of the random-layout archival schemes, because fetching correlated data chunks mitigates non-sequential I/Os. For example, when k is set to 9, P-DArch and PB-DArch speed up the archival processes of DArch and B-DArch by factors of 2.95 and 2.24, respectively.

The load-balancing strategy is conducive to the performance improvement of the distributed archival schemes, since it enables archival tasks to be evenly distributed to both master and provider nodes for each archival stripe. For instance, B-DArch outperforms DArch by factors of 1.67, 1.72 and 1.60 when k is set to 6, 9 and 12, respectively.

The archiving time decreases with the increasing value of Nnode for all the archiving schemes, because more data nodes offer more abundant disk and network I/O resources, which in turn boost archival performance.

Both the non-prefetching-enabled and prefetching-enabled archival schemes gain performance benefits from a large prefetching granularity or large data chunks, which strengthen read sequentiality in the data-block-retrieving stage.

7. RELATED WORK
Replica-based redundancy schemes introduce expensive storage overhead to storage systems. In contrast, erasure-coded redundancy schemes can tolerate multiple failures while saving storage; thus, erasure-coded schemes have become increasingly popular in the area of archival storage. Nowadays, archival storage can be divided into two categories: (i) archival storage systems and (ii) data archival techniques. As to the former, the archival source is newly created data buffered in main memory; such archival storage systems focus on fault tolerance.
That is, archival storage systems are designed to store newly created data using erasure codes. As to the latter, the archival source is existing replicas stored on disks; data archival techniques aim to improve the performance of migrating the existing data replicas into erasure-coded archives.

A wide range of erasure-coded archival systems have been proposed. For example, erasure-coded storage is adopted by the following archival storage systems: (i) Pergamum, an energy-efficient disk-based archival storage system [7]; (ii) POTSHARDS, a secure, recoverable, long-term archival storage system [10]; (iii) Tahoe-LAFS, a decentralized storage system with provider-independent security for long-term storage [32]; and (iv) HDFS-RAID, an HDFS module deploying RS codes to store old data sets that are accessed by few jobs in Hadoop [9].

Apart from erasure-coded archival systems, erasure coding schemes have been designed for the efficient archival of low-access-frequency replicas. Some storage systems (e.g. WAS and Facebook) concurrently support both three-way replication and erasure codes, which are used to keep frequently and rarely accessed data sets, respectively. WAS and Facebook adopt local reconstruction codes (LRC) and locally repairable codes (LRCs), respectively, to reap storage capacity benefits over three-way replication [16, 25]. LRC in WAS generates r global parities and l local parities from k data blocks, so as to reduce the bandwidth and I/Os required for repair reads; LRCs in Facebook are constructed on top of MDS codes, with additional local parities added to provide efficient repair in the case of single-block failures. Furthermore, some prior projects focused on novel erasure codes for archival storage. For example, Paris et al. implemented three-dimensional redundancy codes for archival storage [33]; Pha et al. investigated an extension of traditional erasure coding for distributed archival storage [34]; Thomas et al. introduced two layouts for two-failure-tolerant archival storage systems [35]. However, these studies are mainly focused on building new erasure codes with good repairability properties for archival storage, without considering the issue of efficiently migrating replicas into erasure-coded archives.

In the realm of data archival techniques, the study of migrating existing data replicas into erasure-coded archives is in its infancy. Among all the related work found in the literature, synchronous encoding [4], decentralized erasure coding [12], pipelined erasure codes [13], pipelined encoding [14] and parallel data archiving [15] are the solutions most relevant to our approaches. However, there exist remarkable differences between our approaches and the above-mentioned solutions.

The synchronous encoding approach uses a dedicated encoding node to create parity blocks from three-replica redundancy using classical erasure codes. The parity generation procedure in synchronous encoding does not exploit the data locality of existing data replicas.

Pamies-Juarez et al. proposed two families of erasure codes, namely, decentralized erasure codes for efficient data archival and pipelined erasure codes for fast data archival. There are three distinctions between our schemes and the two families of erasure codes. First, their solutions speed up the data archival process by reducing encoding time, whereas ours optimize archival performance by decreasing data-block-retrieving time.
Second, their methods utilize the existence of source replicas to reduce traffic; in contrast, our archival schemes exploit the access characteristics—data locality and data distribution—of source data blocks to take full advantage of a cluster's I/O resources. Third, they construct new erasure codes to accomplish data archival, and new erasure codes may introduce extra I/O overhead. For example, the pipelined erasure codes are non-systematic codes, meaning that decoding computations are required when reading the archived data. Our archival schemes, on the other hand, adopt classical Reed–Solomon codes—systematic codes—so that the archived data contain the source data blocks and can be directly accessed without decoding operations.

Different from the pipelined encoding schemes (i.e. DP and 3X) tailored for non-random placement layouts, our archival schemes (i.e. P-DArch, B-DArch and PB-DArch) are proposed for replica-based storage clusters with random data layouts. In particular, the DP and 3X schemes are only appropriate for two chained-declustering-enabled data layouts—[D+P]cd and [3X]cd—of mirrored RAID-5 and three-way replication, respectively. Our schemes exhibit good flexibility, since source data blocks are allowed to be randomly placed across the replica cluster, which is a common case in real-world production storage clusters.

The parallel data archiving scheme (i.e. aHDFS-Grouping) adopts the MapReduce model to organize the parity-generation procedure for erasure-coded data archival in Hadoop clusters. The aHDFS-Grouping scheme keeps each mapper's intermediate output key-value pairs in a local key-value store, merges all the intermediate key-value pairs with the same key into a single key-value pair and shuffles that pair to reducers to generate the final parity blocks. Unlike our schemes, aHDFS-Grouping takes neither I/O access patterns nor load balance into consideration.

From the methodological perspective, our prefetching-enabled archival schemes are orthogonal to the aforementioned erasure codes used for data archival, such as LRC in WAS, LRCs in Facebook, decentralized erasure codes and pipelined erasure codes. This is because the prefetching mechanism optimizes archiving performance by virtue of I/O scheduling, whereas the existing schemes offer or accelerate data archival by applying new erasure codes. Therefore, the prefetching mechanism can be integrated with these solutions to speed up the data-block-retrieving process.

8. FURTHER DISCUSSION
B-DArch and PB-DArch keep track of the number of archival tasks to measure node load. This simple measurement approach is adopted to demonstrate the feasibility of the balancing-enabled archival schemes. Compared to this simple measurement approach, a comprehensive measurement approach may guide the master/provider node selections more effectively. Apart from the archival task number, a comprehensive measurement approach should consider node configuration, I/O workload, available network bandwidth and the like.

The prefetching mechanism is employed to improve the performance of the data-block-reading stage on provider nodes, whereas the load-balancing strategy serves to determine the group of provider nodes. Therefore, the load-balancing strategy is orthogonal to the prefetching mechanism, and PB-DArch can be viewed as either a prefetching-enabled B-DArch or a balancing-enabled P-DArch.
Both the prefetching mechanism and the balancing strategy help to accelerate archival processes by scheduling the involved archival operations among nodes. In essence, the prefetching mechanism and the balancing strategy are associated with space and computation overheads, respectively. Specifically, the prefetching mechanism improves source-data-block-reading performance by mitigating non-sequential disk reads with pre-allocated main memory buffers. As for the balancing strategy, the total computation overhead caused by parity-block-generating operations is unchanged for a group of archival tasks, but the encoding bottleneck in these operations is eliminated by exploiting the computation capability of all involved nodes, that is, by distributing archival tasks among multiple nodes in a balanced manner.
The four evaluated archival schemes can be implemented in the form of a MapReduce application running on Hadoop clusters. Alternatively, these schemes may be realized as an individual module integrated on top of an existing storage system, with the module managing and scheduling archival tasks across the nodes of a cluster. Furthermore, the prefetching mechanism tailored for erasure-coded data archival can be incorporated into the underlying file systems of storage nodes and thus used to enhance the readahead strategies of existing file systems; a minimal sketch of such a block-level prefetching routine is given below.
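The following is a minimal, illustrative sketch (not the prototype's code) of the block-level prefetching idea on a provider node: all chunks of a source data block are read sequentially into a pre-allocated buffer before the block is handed to the encoder, so that the encoder never triggers per-chunk random reads. The chunk and block sizes, the on-disk layout and the prefetch_block helper are assumptions made for illustration; the posix_fadvise hint is optional and POSIX-only.
```python
# Illustrative sketch of block-level prefetching on a provider node (not the
# prototype's code). A source data block is assumed to be stored as consecutive
# chunks in one file; the whole block is read sequentially into a pre-allocated
# buffer so that encoding does not issue per-chunk random reads.
import os

CHUNK_SIZE = 1 << 20          # assumed 1 MiB chunks written by the file system
BLOCK_SIZE = 64 * CHUNK_SIZE  # assumed 64 MiB data block

def prefetch_block(path: str, block_offset: int, buf: bytearray) -> memoryview:
    """Sequentially read one data block at block_offset into the pre-allocated buf."""
    assert len(buf) >= BLOCK_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint the kernel that the whole block will be needed soon, so its
        # readahead can fetch the chunks sequentially ahead of the encoder.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, block_offset, BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
        view = memoryview(buf)[:BLOCK_SIZE]
        os.lseek(fd, block_offset, os.SEEK_SET)
        read = 0
        while read < BLOCK_SIZE:
            n = os.readv(fd, [view[read:]])
            if n == 0:          # end of file reached before a full block
                break
            read += n
        return view[:read]
    finally:
        os.close(fd)

# Usage (hypothetical paths): the archival worker prefetches the k source blocks
# of a stripe one by one into reusable buffers, then passes them to the encoder.
# buf = bytearray(BLOCK_SIZE)
# block = prefetch_block("/data/replica_0001", 0, buf)
```
In the archival schemes themselves, such buffers would be pre-allocated once and reused across stripes, which bounds the space overhead referred to above.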
9. CONCLUSIONS AND FUTURE WORK
Both non-sequential-read and imbalanced-I/O issues degrade overall data archival performance. To address such issues, we incorporate both the prefetching mechanism and the balancing strategy into erasure-coded archival for replica-based storage clusters. With the prefetching mechanism in place, multiple correlated data chunks are fetched into main memory to mitigate adverse per-chunk reads. By virtue of load balancing, archival tasks tend to be evenly distributed among master and provider nodes. Consequently, load balancing maximizes the network utilization of storage clusters. Four distributed archival schemes are quantitatively evaluated on a proof-of-concept prototype. The experimental results show that the prefetching mechanism coupled with the load-balancing strategy can significantly improve archival performance for a replica-based storage cluster with a random data layout. In the future, we plan to extend the balancing-enabled archival schemes to heterogeneous clusters and adopt a comprehensive measurement approach to determine master and provider nodes. The comprehensive measurement approach may consider such factors as node hardware configuration, I/O workload, available network bandwidth and archival tasks.
FUNDING
This work is supported in part by the National Science Foundation of China under grant nos 61572209, 61300047, 61472152 and 61702004. X.Q.'s work is supported by the U.S. National Science Foundation under grants CCF-0845257 (CAREER), CNS-0917137, CCF-0742187 and IIS-1618669, and the 111 Project under Grant B07038.
Footnotes
1. Hard disks of 6 TB, 8 TB and 10 TB are available on the market, and a single disk of large capacity is likely to hold a large number of cold data blocks to be archived.
REFERENCES
1 Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003) The Google File System. ACM Press, New York, NY, USA.
2 Borthakur, D. (2007) The Hadoop distributed file system: architecture and design. Hadoop Proj. Website, 11, 21–34.
3 Nightingale, E.B., Elson, J., Fan, J., Hofmann, O., Howell, J. and Suzue, Y. (2012) Flat Datacenter Storage. Proc. 10th USENIX Symp. Operating Systems Design and Implementation (OSDI'12), Hollywood, CA, USA, 8–10 October, pp. 1–15. USENIX Association, Berkeley, CA, USA.
4 Fan, B., Tantisiriroj, W., Xiao, L. and Gibson, G. (2011) DiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing. Proc. 2011 Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC'11), Seattle, WA, USA, 12–18 November, pp. 6–10. ACM Press, New York, NY, USA.
5 Weatherspoon, H. and Kubiatowicz, J.D. (2002) Erasure Coding vs. Replication: A Quantitative Comparison. Proc. 2002 Int. Workshop on Peer-to-Peer Systems (IPTPS'02), Cambridge, MA, USA, 7–8 March, pp. 328–337. Springer, Berlin.
6 Calder, B. et al. (2011) Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. Proc. 23rd ACM Symp. Operating Systems Principles (SOSP'11), Cascais, Portugal, 23–26 October, pp. 143–157. ACM Press, New York, NY, USA.
7 Storer, M., Greenan, K., Miller, E. and Voruganti, K. (2008) Pergamum: Replacing Tape with Energy Efficient, Reliable, Disk-Based Archival Storage. Proc. 6th USENIX Conf. File and Storage Technologies (FAST'08), San Jose, CA, USA, 26–29 February, pp. 1–16. USENIX Association, Berkeley, CA, USA.
8 Ford, D., Labelle, F., Popovici, F., Stokely, M., Truong, V., Barroso, L., Grimes, C. and Quinlan, S. (2010) Availability in Globally Distributed Storage Systems. Proc. 9th USENIX Symp. Operating Systems Design and Implementation (OSDI'10), Vancouver, BC, Canada, 4–6 October, pp. 61–74. USENIX Association, Berkeley, CA, USA.
9 Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J., Murthy, R. and Liu, H. (2010) Data Warehousing and Analytics Infrastructure at Facebook. Proc. 2010 ACM SIGMOD Int. Conf. Management of Data, Indianapolis, IN, USA, 6–11 June, pp. 1013–1020. ACM Press, New York, NY, USA.
10 Storer, M.W., Greenan, K.M., Miller, E.L. and Voruganti, K. (2009) POTSHARDS: a secure, recoverable, long-term archival storage system. ACM Trans. Storage (TOS), 5, 1–35.
11 Adams, I., Miller, E.L. and Rosenthal, D.S. (2011) Using storage class memory for archives with DAWN, a durable array of wimpy nodes. Technical Report UCSC-SSRC-11-07, University of California, Santa Cruz.
12 Pamies-Juarez, L., Oggier, F. and Datta, A. (2013) Decentralized Erasure Coding for Efficient Data Archival in Distributed Storage Systems. Proc. 2013 Int. Conf. Distributed Computing and Networking (ICDCN'12), Hong Kong, China, 3–6 January, pp. 42–56. Springer, Berlin.
13 Pamies-Juarez, L., Datta, A. and Oggier, F. (2013) RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems. Proc. 32nd IEEE Int. Conf. Computer Communications (INFOCOM'13), Turin, Italy, 14–19 April, pp. 1294–1302. IEEE Computer Society, Los Alamitos, CA, USA.
14 Huang, J., Wang, Y., Qin, X., Liang, X. and Xie, C. (2015) Exploiting pipelined encoding process to boost erasure-coded data archival. IEEE Trans. Parallel Distrib. Syst., 26, 2984–2996.
15 Chen, Y., Zhou, Y., Taneja, S., Qin, X. and Huang, J. (2017) aHDFS: an erasure-coded data archival system for Hadoop clusters. IEEE Trans. Parallel Distrib. Syst., 28, 3060–3073.
16 Huang, C., Simitci, H., Xu, Y., Ogus, A., Calder, B., Gopalan, P., Li, J. and Yekhanin, S. (2012) Erasure Coding in Windows Azure Storage. Proc. 2012 USENIX Annual Technical Conference (ATC'12), Boston, MA, USA, 13–15 June, pp. 15–26. USENIX Association, Berkeley, CA, USA.
17 Huang, J., Zhang, F., Qin, X. and Xie, C. (2013) Exploiting redundancies and deferred writes to conserve energy in erasure-coded storage clusters. ACM Trans. Storage (TOS), 9, 1–29.
18 Muralidhar, S. et al. (2014) f4: Facebook's Warm BLOB Storage System. Proc. 11th USENIX Conf. Operating Systems Design and Implementation (OSDI'14), Broomfield, CO, USA, 6–8 October, pp. 383–398. USENIX Association, Berkeley, CA, USA.
19 Manasse, M., Thekkath, C. and Silverberg, A. (2009) A Reed-Solomon code for disk storage, and efficient recovery computations for erasure-coded disk storage. Proc. Inform., 1, 1–11.
20 Reed, I. and Solomon, G. (1960) Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math., 8, 300–304.
21 Plank, J. et al. (1997) A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Softw. Pract. Exp., 27, 995–1012.
22 Patterson, D. et al. (1988) A case for redundant arrays of inexpensive disks (RAID). Proc. 1988 ACM SIGMOD Int. Conf. Management of Data, Chicago, IL, USA, 1–3 June, pp. 59–66. ACM Press, New York, NY, USA.
23 Plank, J., Luo, J., Schuman, C., Xu, L. and Wilcox-O'Hearn, Z. (2009) A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage. Proc. 7th USENIX Conf. File and Storage Technologies (FAST'09), San Francisco, CA, USA, 24–27 February, pp. 253–265. USENIX Association, Berkeley, CA, USA.
24 Ovsiannikov, M., Rus, S., Reeves, D., Sutter, P., Rao, S. and Kelly, J. (2013) The Quantcast file system. Proc. VLDB Endowment, 6, 1092–1101.
25 Sathiamoorthy, M., Asteris, M., Papailiopoulos, D., Dimakis, A.G., Vadali, R., Chen, S. and Borthakur, D. (2013) XORing elephants: novel erasure codes for big data. Proc. VLDB Endowment, 6, 325–336.
26 Alexander, T. and Kedem, G. (1996) Distributed Prefetch-Buffer/Cache Design for High Performance Memory Systems. Proc. 2nd Int. Symp. High-Performance Computer Architecture (HPCA'96), San Jose, CA, USA, 3–7 February, pp. 254–263. IEEE Computer Society, Los Alamitos, CA, USA.
27 Panda, R., Gratz, P.V., Jiménez, D. et al. (2012) B-Fetch: branch prediction directed prefetching for in-order processors. Comput. Architecture Lett., 11, 41–44.
28 Gu, P., Zhu, Y., Jiang, H. and Wang, J. (2006) Nexus: A Novel Weighted-Graph-Based Prefetching Algorithm for Metadata Servers in Petabyte-Scale Storage Systems. Proc. 6th IEEE Int. Symp. Cluster Computing and the Grid (CCGRID'06), Singapore, 16–19 May, pp. 8–16. IEEE Computer Society, Los Alamitos, CA, USA.
29 Lin, L., Li, X., Jiang, H., Zhu, Y. and Tian, L. (2008) AMP: An Affinity-Based Metadata Prefetching Scheme in Large-Scale Distributed Storage Systems. Proc. 8th IEEE Int. Symp. Cluster Computing and the Grid (CCGRID'08), Lyon, France, 19–22 May, pp. 459–466. IEEE Computer Society, Los Alamitos, CA, USA.
30 Ungureanu, C., Atkin, B., Aranya, A., Gokhale, S., Rago, S., Calkowski, G., Dubnicki, C. and Bohra, A. (2010) HydraFS: A High-Throughput File System for the HYDRAstor Content-Addressable Storage System. Proc. 8th USENIX Conf. File and Storage Technologies (FAST'10), San Jose, CA, USA, 23–26 February, pp. 225–238. USENIX Association, Berkeley, CA, USA.
31 Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D., Ganger, G. and Gibson, G. (2008) Measurement and Analysis of TCP Throughput Collapse in Cluster-Based Storage Systems. Proc. 6th USENIX Conf. File and Storage Technologies (FAST'08), San Jose, CA, USA, 26–29 February, pp. 1–14. USENIX Association, Berkeley, CA, USA.
32 Wilcox-O'Hearn, Z. and Warner, B. (2008) Tahoe: The Least-Authority Filesystem. Proc. 4th ACM Int. Workshop on Storage Security and Survivability (StorageSS'08), Alexandria, VA, USA, 27–31 October, pp. 21–26. ACM Press, New York, NY, USA.
33 Pâris, J.-F., Long, D.D.E. and Litwin, W. (2013) Three-Dimensional Redundancy Codes for Archival Storage. Proc. 2013 IEEE 21st Int. Symp. Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'13), San Francisco, CA, USA, 14–16 August, pp. 328–332. IEEE Computer Society, Los Alamitos, CA, USA.
34 Cuong, P., Feng, Z. and Duc, A.T. (2011) Maintenance-Efficient Erasure Coding for Distributed Archival Storage. Proc. 20th Int. Conf. Computer Communications and Networks (ICCCN'11), Maui, HI, USA, 31 July–4 August, pp. 1–5. IEEE Computer Society, Los Alamitos, CA, USA.
35 Schwarz, T., Amer, A. and Pâris, J.-F. (2015) Combining Low IO-Operations During Data Recovery with Low Parity Overhead in Two-Failure Tolerant Archival Storage Systems. Proc. 2015 IEEE 21st Pacific Rim Int. Symp. Dependable Computing (PRDC'15), Zhangjiajie, China, 18–20 November, pp. 235–244. IEEE Computer Society, Los Alamitos, CA, USA.