Fast Packet Classification using Recursive Endpoint-Cutting and Bucket Compression on FPGA

Abstract

Packet classification is one of the important functions in today's high-speed Internet routers. Many existing FPGA-based approaches can achieve a high throughput but cannot accommodate the memory required for large rule tables because on-chip memory in FPGA devices is limited. In this paper, we propose a high-throughput and low-cost pipelined architecture using a new recursive endpoint-cutting (REC) decision tree. In a software environment, REC needs only 5–66% of the memory needed by EffiCuts for various rule tables. Since the rule buckets associated with leaf nodes in decision trees consume a large portion of total memory, a bucket compression scheme is also proposed to reduce rule duplication. Based on experimental results on Xilinx Virtex-5/6 FPGAs, the block RAM required by REC is much less than that of existing FPGA-based approaches. The proposed parallel and pipelined architecture can accommodate various tables of 20 K or more rules in FPGA devices containing 1.6 Mb of block RAM. By using dual-ported memory, a throughput beyond 100 Gbps for 40-byte packets can be achieved. The proposed architecture outperforms most FPGA-based search engines for large and complex rule tables.

1. INTRODUCTION

Nowadays, routers play a major role in communication on the Internet. When a packet arrives at a router, its destination address, extracted from the packet header, is used to determine where to send it. Packets pass through different Internet routers until they reach their destination. Routers not only forward packets through the Internet, but also provide various Internet services such as firewalls, Quality of Service (QoS), traffic control and Virtual Private Networks (VPNs). To support these services, packet classification is required. Due to the rapid growth of Internet traffic, routers need to keep up with high-speed links such as OC-768 (40 Gbps) connections. In other words, routers need to finish processing an incoming packet every 8 ns if the packet is of 40 bytes, the minimal packet size. In general, the packet classification function is often considered a bottleneck of the routers. Moreover, not only Internet traffic but also the size of rule tables grows rapidly. It thus becomes an even bigger challenge to design a high-performance and low hardware cost router.

There are two aspects of Internet router design: software and hardware. Software-based routers implement the services on software platforms by using general-purpose CPUs or network processors. Software-based routers have the advantages of easy implementation and modification. Furthermore, their cost is much lower than that of hardware-based routers. But the performance of software-based routers is generally inferior due to the overheads for performing the necessary operations and the bottleneck of accessing memory. On the contrary, hardware-based routers are costly but their higher throughput can keep up with the rapid growth of Internet traffic. In general, hardware-based routers can be implemented in a Field-Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC) by using on-chip Static Random Access Memory (SRAM) to shorten the memory access delays.

Whether a packet classification algorithm is good is judged by the following factors:

Memory: Routers provide effective hardware architectures and high-speed SRAM to sustain the high performance required by packet classification.
To implement the proposed architecture on FPGA, memory is a key factor because the block RAM (BRAM) on an FPGA is limited. Due to the rapid growth of rule table size, reducing memory usage becomes an important issue.

Throughput: Packet classification is the major bottleneck that influences a router's throughput. The routers must achieve a throughput of 40 Gbps (OC-768) based on a parallel and pipelined architecture.

Scalability: A well-designed packet classification scheme should support large rule tables.

Packet classification has been studied widely in the past. There are many packet classification algorithms in the literature. A detailed survey and taxonomy can be found in [1–3]. Many of the existing hardware-based packet classification designs can achieve a throughput of up to 40 Gbps. However, most of these designs are only applicable to small or simple rule tables such as access control list (ACL) tables of 10 K rules. The reason is that if the rule tables become more complicated, the memory storage of their data structures grows sharply due to rule duplication. Hence, these schemes cannot fit their entire data structures into the limited on-chip memory provided by hardware-based routers. In this paper, we overcome this drawback by proposing a pipelined and parallel hardware architecture to be implemented on FPGA. The proposed scheme effectively decreases the hardware cost and also achieves a throughput higher than most of the existing hardware designs.

The rest of the paper is organized as follows. In Section 2, we introduce the background and related work. Section 3 illustrates the proposed data structure and algorithm. Section 4 describes the hardware implementation of our proposed scheme. In Section 5, we present our experimental results. The last section is the conclusion.

2. BACKGROUND AND RELATED WORK

2.1. Packet classification problem statement

The packet classification process (packet classifier) classifies incoming packets into different flows based on a table of pre-defined rules. In general, for a d-field (also called d-dimensional) packet classification, each rule is a d-tuple data structure denoted by R = [F1, …, Fd], where Fi = [li, ri] is a range of values from li to ri for i = 1 to d. The search for an incoming packet p is done by presenting the header fields [f1, …, fd] of p as the keys to match the rules in the classifier, where each fi is a singleton value. The rule R is said to match packet p if, for all dimensions i, the field value fi of packet p lies in the range Fi. The packet classifier determines the least-cost rule that matches the packet's headers. The layer-four switching of the Internet protocol studied in this paper consists of five dimensions (also called fields): the 32-bit source/destination IP addresses (denoted by SA/DA), the 16-bit source/destination port numbers (denoted by SP/DP) and the 8-bit network layer protocol. The SA/DA fields are prefixes usually represented by an IP address and a mask or a prefix length. The SP/DP fields are ranges represented by a pair of numbers called the endpoints. For the source and destination ports, the two endpoints can be arbitrary numbers. The protocol field is either a singleton number or a don't-care value (denoted by *). Each rule is also associated with a priority. The classifier matches the pre-defined rules against the header values of the incoming packet to find the best match with the highest priority (single match) or to find all the matched rules (multiple matches).
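As a concrete illustration of this matching condition, the following minimal C sketch performs a linear search over a 5-field classifier and returns the highest-priority (first) matching rule. The structures and names are illustrative only and are not taken from the paper's implementation; prefix and protocol fields are assumed to have already been expanded to ranges.

#include <stdint.h>

#define NUM_FIELDS 5   /* SA, DA, SP, DP, protocol */

/* A rule stores each field as a closed range [lo, hi]; prefixes and the
   protocol value are assumed to have been converted to ranges beforehand. */
struct rule {
    uint32_t lo[NUM_FIELDS];
    uint32_t hi[NUM_FIELDS];
};

/* Header values of an incoming packet, one singleton value per field. */
struct header {
    uint32_t f[NUM_FIELDS];
};

/* Linear search: return the index of the highest-priority matching rule,
   or -1 when no rule matches.  Rules are assumed to be ordered so that a
   smaller index means a higher priority, as in Table 1.                  */
int classify_linear(const struct rule *rules, int n, const struct header *h)
{
    for (int i = 0; i < n; i++) {
        int match = 1;
        for (int d = 0; d < NUM_FIELDS && match; d++)
            if (h->f[d] < rules[i].lo[d] || h->f[d] > rules[i].hi[d])
                match = 0;
        if (match)
            return i;          /* first match is the best (single) match */
    }
    return -1;
}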
Table 1 shows an example 5D real-life classifier in which, by convention, the first rule R1 has the highest priority and the last rule R4 has the lowest priority. Table 1 also illustrates the classification results for three incoming packets.

Table 1. A real-life 5D classifier.

Rule  SA                 DA                 Protocol  SP  DP     Action
R1    140.116.82.25/32   140.116.236.10/32  *         *   *      Deny
R2    140.116.82.0/24    140.116.20.77/32   tcp       *   80     Deny
R3    140.116.82.0/32    140.116.20.55/32   udp       *   >1023  Permit
R4    0.0.0.0/0          0.0.0.0/0          *         *   *      Permit

Classification examples

Pkt   SA                 DA                 Protocol  SP    DP    Result
P1    140.116.82.25      140.116.236.10     tcp       1222  80    R1, Deny
P2    140.116.82.1       140.116.20.77      tcp       1333  80    R2, Deny
P3    140.116.82.0       140.116.20.55      udp       1024  1055  R3, Permit

2.2. Related work

To match the header values of the packets against the rules, the simplest algorithm is the linear search. For a large number of rules, this approach implies a long query time, but it is very efficient in terms of memory and rule updates. To improve the search performance, it is also straightforward to build a hierarchical trie from the multiple header fields of the rules. A hierarchical trie is a simple extension of the binary trie except that these tries can accommodate more fields. For a set of k-field rules, the k-field hierarchical trie Tk is built recursively as follows. A binary trie called the F1 trie is first constructed based on the distinct F1 field values of all rules. Let NF1 be the node corresponding to a prefix PF1 in the F1 trie. Each node NF1 in the F1 trie is associated with a subset of rules (denoted by SET(NF1)) and all rules in SET(NF1) have the same F1 field value PF1. Since all the rules in SET(NF1) have the same F1 field value, they are stored as a (k−1)-field classifier. We then construct a (k−1)-field hierarchical trie Tk−1(NF1) for SET(NF1) recursively and set a pointer at node NF1 pointing to the hierarchical trie Tk−1(NF1). When k = 1, T1 is simply a binary trie.
As a result, Tk is a k-field hierarchical binary trie, one binary trie per field. Since the 5-field rules we consider in this paper contain only two prefix fields, a 2-field hierarchical trie is constructed. Any rules that have the same first two field values are stored in the corresponding node of the second-field binary trie. Each rule is stored in exactly one node of an F2 trie. In other words, no rule is duplicated. When the search traverses a node in the F1 trie, the F2 trie pointed to by that node must also be traversed. The rules associated with the prefix nodes in the traversed F2 trie have to be checked one by one to find the match against the header values of the packet. Since many F2 tries must be traversed, the search delay is very long.

In order to reduce the search delay of the hierarchical trie, the set pruning trie [4] was proposed, which pushes all the rules associated with the internal nodes down to the leaf nodes of the F1 trie. As a result, in the set pruning trie, only the F2 trie associated with a leaf node in the F1 trie needs to be traversed. However, the rules in the set pruning trie may be duplicated so many times by the leaf-pushing operations that a serious memory explosion problem results. To eliminate the memory explosion problem of the set pruning trie, the Grid of Tries (GoT) [4], which uses switch pointers to avoid backtracking and rule duplication, was proposed. Since GoT cannot be easily extended to more than two fields, the same authors also proposed a more general scheme called Cross-Producting [4]. Unfortunately, the size of the table in the Cross-Producting scheme grows astronomically with the number of rules. Baboescu et al. [5] proposed an extended version of GoT called the extended grid of tries (EGT). They also proposed an improved version of EGT called EGT-PC (Path Compression), which applies a standard compression scheme for tries that removes single-branching paths.

Many hierarchical decision trees [6–11] have also been developed using various divide-and-conquer techniques. There are two goals in building a decision tree. The first goal is, starting from the root node, to partition the box covered by a node (the parent) into many sub-boxes (equal-sized or not) such that some memory usage threshold is satisfied. These sub-boxes are the children of the parent node in the decision tree. The address space covered by the parent node is the union of the subspaces covered by its child nodes. The rules that are completely contained in the sub-box of a child node are stored in the bucket of that child node. There are two ways to store the rules that partially overlap the sub-box of any child node. If rule replication is not permitted, the partially overlapped rules are stored in the parent node. Otherwise, they are duplicated in the buckets of the child nodes with which they partially overlap. The decision tree is built by partitioning the nodes recursively until the bucket size associated with a node is not greater than a pre-defined threshold. The second goal of building a decision tree is that the height of the decision tree should be minimal. However, it is a challenge to determine which dimensions are selected for partitioning at each node and how many sub-spaces are to be obtained on the selected dimensions in order to fulfill these two goals. Usually, the larger the bucket size, the shorter the decision tree but the longer it takes to search the rules in a bucket sequentially.
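The recursive build process just described can be summarized by the following skeleton (a hedged sketch in C, not the code of any particular scheme; the partition() routine is deliberately left abstract because the choice of cut dimensions and cut-points is exactly what distinguishes HiCuts, HyperCuts, HyperSplit and the scheme proposed in this paper).

#include <stdlib.h>

#define BUCKET_THRESHOLD 8        /* pre-defined bucket size threshold (assumed) */

struct node {
    struct node **child;          /* sub-boxes; NULL for a leaf node             */
    int           nchild;
    int          *bucket;         /* rule IDs kept in a leaf bucket              */
    int           nrules;
};

/* Scheme-specific heuristic: selects the cut dimension(s) and cut-points and
   distributes the nrules rules among *nsub sub-boxes (possibly duplicating
   partially overlapping rules).  Left undefined on purpose.                    */
int partition(int *rules, int nrules, int ***sub, int **subn, int *nsub);

/* Generic construction: partition a node recursively until its rule count
   is no greater than the bucket threshold.                                     */
struct node *build(int *rules, int nrules)
{
    struct node *v = calloc(1, sizeof *v);
    if (nrules <= BUCKET_THRESHOLD) {           /* leaf: keep the bucket */
        v->bucket = rules;
        v->nrules = nrules;
        return v;
    }
    int **sub, *subn, nsub;
    partition(rules, nrules, &sub, &subn, &nsub);
    v->nchild = nsub;
    v->child  = calloc(nsub, sizeof *v->child);
    for (int c = 0; c < nsub; c++)              /* recurse on each sub-box */
        v->child[c] = build(sub[c], subn[c]);
    return v;
}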
The decision tree is traversed according to the packet's header values until a leaf is reached. Then, all the rules of the leaf's bucket are matched against the packet's header values sequentially to yield the desired matching result. Hierarchical Intelligent Cuttings (HiCuts) [7] is one such decision tree. Assume a node v covers a k-dimensional box ([l1:r1], …, [lk:rk]) and there are NumRules(v) rules in its bucket. HiCuts selects only one dimension, say i ∈ {1, …, k}, and decides how many sub-boxes (denoted by M) are needed in the space decomposition process, where M is a power of two. When performing M cuts along dimension i, HiCuts evenly partitions the interval [li:ri] and generates M equal-sized sub-boxes ([l1:r1], …, [li + (t − 1) × w : li + t × w − 1], …, [lk:rk]) for t = 1 to M, where w = (ri − li + 1)/M and M = 2^m. Since HiCuts uses the equal-sized partition, the interval [li:ri] can be represented as a prefix. When [li:ri] is partitioned into 2^m equal-sized subintervals, we actually select the m most significant don't-care bits to perform the partition. Thus, we say that HiCuts uses a bit cutting scheme (i.e. selects an appropriate number of cut bits) to perform the space decomposition. These 2^m sub-boxes are connected to node v as its 2^m child nodes in the decision tree. In HiCuts, rules in a node may be duplicated in its child nodes. Taking more cuts may decrease the height of the decision tree at the expense of increasing the memory usage. If the cut dimension is i, HiCuts tries to balance this tradeoff by choosing the largest number of cuts, denoted by $m_i$, such that the following constraint is met for a pre-defined memory usage threshold called the space factor (sf):

$$sf \times NumRules(v) \ge \sum_{j=1}^{m_i} NumRules(child_j) + m_i.$$

How to select a dimension to cut has a major impact on the height of the decision tree and its memory usage. Four heuristics were proposed to select a dimension for cutting a node. Which heuristic performs better depends on the characteristics of the rule table. Figure 1(a) illustrates the decision tree built by HiCuts from a small sample rule table. Since the rule table is small, only one cut bit is used in each stage of the space decomposition. This decision tree consists of 12 internal nodes and 13 leaf nodes and the average tree height is 3.77.

Figure 1. (a) HiCuts with 12 internal nodes and 13 leaves. (b) HyperCuts with 10 internal nodes and 13 leaves. (c) HyperSplit with 10 internal nodes and 11 leaves. (d) The proposed decision tree with 8 internal nodes and 11 leaves.

Because HiCuts cuts a node along only one dimension, it produces a taller decision tree, i.e. a longer search time. HyperCuts [9] solves this problem by selecting multiple dimensions to perform the space decomposition. HyperCuts picks the set of dimensions whose number of distinct field values is larger than the mean number of distinct field values over all dimensions. HyperCuts also modifies HiCuts' heuristics to compute the number of cuts needed for each of the selected cut dimensions. However, it is infeasible to evaluate all possible combinations of cut counts for the selected dimensions because the preprocessing time would be prohibitively long. Hence, HyperCuts uses a greedy approach to compute a locally optimal number of cuts for each selected dimension. Figure 1(b) illustrates the decision tree built by HyperCuts from the same rule table. This decision tree consists of 10 internal nodes and 13 leaf nodes and the average tree height is 2.77.
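Both cutting schemes above bound their memory growth with the space-factor constraint. The following C sketch (illustrative only, with an assumed range representation of the rules and a naive overlap count) shows how the largest power-of-two number of equal-sized cuts satisfying the constraint can be chosen for one dimension.

#include <stdint.h>

struct range { uint32_t lo, hi; };    /* field-i projection of a rule */

/* Count the rules whose field-i range overlaps the sub-box [lo, hi]. */
static int rules_in(const struct range *r, int n, uint32_t lo, uint32_t hi)
{
    int cnt = 0;
    for (int j = 0; j < n; j++)
        if (r[j].lo <= hi && r[j].hi >= lo)
            cnt++;
    return cnt;
}

/* HiCuts-style heuristic: starting from 2 cuts, keep doubling the number of
   equal-sized cuts M along the chosen dimension [l, rgt] while
       sf * nrules >= (sum of child rule counts) + M
   still holds; return the largest M found (1 means no cut is taken). */
int choose_num_cuts(const struct range *r, int n,
                    uint32_t l, uint32_t rgt, double sf)
{
    int best = 1;
    for (uint64_t M = 2; M <= (uint64_t)rgt - l + 1; M *= 2) {
        uint64_t w = ((uint64_t)rgt - l + 1) / M;       /* sub-box width  */
        double cost = (double)M;                        /* one child per cut */
        for (uint64_t t = 0; t < M; t++)
            cost += rules_in(r, n, (uint32_t)(l + t * w),
                                   (uint32_t)(l + (t + 1) * w - 1));
        if (cost > sf * n)
            break;                                      /* constraint violated */
        best = (int)M;
    }
    return best;
}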
Both HiCuts and HyperCuts partition the box of a node into a power-of-two number of equal-sized sub-boxes. Because the rules are distributed unevenly over the address space, HiCuts and HyperCuts may generate many light-weight nodes that hold only a small number of rules. To solve this problem, HyperSplit [8] uses the following heuristics. First, as in HiCuts, HyperSplit selects only one dimension to cut the address space covered by a node. Second, HyperSplit does not cut the box of a node into smaller equal-sized sub-boxes as in HiCuts and HyperCuts but uses a more flexible scheme called the endpoint-cutting scheme. Let [li:ri] be the corresponding interval in dimension i covered by a node v and let array Pti[j] for j = 0 to M − 1 be the M field-i endpoints between li and ri computed from the field-i values of the rules overlapping node v. Notice that Pti[0] = li and Pti[M−1] = ri. HyperSplit does not use the middle point between li and ri but uses one of the endpoints in array Pti[0 … M − 1] as the cut-point, computed by the weighted segment-balanced strategy due to its superior performance. Since a binary decision tree is targeted, only one cut-point is selected to cut the box of a node into two sub-boxes, mostly unequal-sized. Recently, a deterministic cutting algorithm called boundary cutting (BC) has been proposed in [12]. Similar to HyperSplit, BC uses rule boundaries to perform the space decomposition. Additionally, a refined BC called selective BC is proposed to reduce the number of endpoints stored in the internal nodes.

In the weighted segment-balanced strategy, the array Pti[0 … M − 1] and the M − 1 elementary intervals can be constructed [13] for a node v with interval [li:ri] in dimension i. Then, HyperSplit calculates the number of rules that cover the jth elementary interval and stores it in Sri[j] for 1 ≤ j < M. HyperSplit chooses the smallest endpoint m such that

$$\sum_{j=1}^{m} Sr_i[j] > \frac{1}{2}\sum_{j=1}^{M-1} Sr_i[j].$$

This strategy tries to equalize the accumulated number of covering rules over the intervals on the left side and on the right side of endpoint m. HyperSplit selects dimension i to cut if the value of $\frac{1}{M-1}\sum_{j=1}^{M-1} Sr_i[j]$ is the minimum over all dimensions. Figure 1(c) illustrates the decision tree built by HyperSplit from the rule table. This decision tree consists of 10 internal nodes and 11 leaf nodes and the average tree height is 3.55.

In addition, multiple-decision-tree algorithms such as EffiCuts [10] and Decision Forest [14] were proposed to reduce the rule replication problem. Decision Forest employs the HyperCuts algorithm to construct the decision tree in the space decomposition procedure. When constructing a decision tree for a group of rules, Decision Forest moves the rules with heavy replication potential to a new subgroup. This new subgroup is used to construct another decision tree, from which some rules that may incur heavy rule duplication are again moved to another new subgroup. This construction process, along with the new subgroup generation process, is repeated until the predefined number of decision trees is reached. EffiCuts uses a notion of largeness in a dimension to define a large rule as one that covers a large part (e.g. 95%) of the address space in that dimension.
Based on the largeness of each of the five dimensions in a set of 5D rules, EffiCuts divides the rules into 32 subgroups. It uses selective tree merging to reduce the number of subgroups by merging one subgroup into another if the former contains fewer rules. EffiCuts uses the idea of equi-dense cuts to tackle the variation in rule-space density and to eliminate unnecessary pointers pointing to NULL or to the same child nodes. EffiCuts also co-locates parts of the information in a node and its children to achieve fewer memory accesses per node than HiCuts and HyperCuts.

The hardware approaches for packet classification can be implemented in many ways. Ternary Content Addressable Memory (TCAM) is a very simple device that can complete a search operation in one cycle. Song et al. [15] proposed the BV-TCAM architecture that combines TCAM and the Bit Vector algorithm to effectively compress the data representations and boost throughput. However, the major drawbacks of TCAM are high power consumption and a high cost-to-density ratio. Moreover, prefix-to-range expansion exacerbates the problem of TCAMs by significantly decreasing their already limited capacity, as each rule typically has to be converted to multiple rules. Some results proposed to solve the range problem can be found in [16–18]. NLTMC [19] modified the Cross-Producting scheme and divided the rules into multiple subsets to avoid memory overhead. B2PC [20] and 2sBFCE [21] are implemented in ASIC and FPGA, respectively. The number of clock cycles needed for a search varies over a wide range and thus the resulting performance is inferior. As a result, the Bloom-filter-based architectures cannot achieve the throughput of 40 Gbps required by the high-speed OC-768 link.

Many hardware search engines [22] for packet classification use pipelined or parallel architectures to improve the throughput. A set-pruning multi-bit trie data structure was proposed in [23]. To reduce rule duplication, the rules are partitioned into many groups by their lengths or wildcard field values. A search engine is constructed for each group based on the set-pruning multi-bit trie. With the parallel and pipelined architecture, a throughput of 100 Gbps can be achieved with dual-ported memory. Jiang et al. [24] used HyperCuts [9] as the data structure of the decision tree. By using a 2D linear dual-pipeline architecture, their FPGA implementation can achieve a throughput of 80 Gbps, also with dual-ported memory. Wagner et al. [25] proposed a scalable pipeline architecture, named BiConOLP, which also uses HyperCuts as the decision tree. By using dual-ported memory, BiConOLP can achieve a throughput approaching 40 Gbps. Qi et al. also implemented HyperSplit [8] in FPGA [26] and proposed a node merging algorithm to reduce the height of the decision tree. Based on the dual-ported memory of the Xilinx Virtex-6 XC6VSX475T FPGA, a throughput of 74 Gbps for the minimum packet size of 40 bytes can be achieved. Yang et al. [27] targeted better search speed by proposing a decision-tree-based 2D multi-pipeline architecture called Dynamic Discrete Bit Selection (D2BS) that uses the same cut bit for all the nodes at the same tree level. The multi-pipeline architecture proposed in [28] is a hybrid approach that combines the schemes of field decomposition, hierarchical tries and decision trees. Multiple pipelines correspond to the five rule subsets produced by a predefined rule partition scheme based on wildcard field values.
The implementation results on a Virtex-6 FPGA show that a throughput of 340 million packets per second (MPPS) can be achieved. Also, in [29], an FPGA hardware accelerator that uses a modified version of the HyperCuts algorithm is proposed. The maximum throughput that can be achieved by this hardware accelerator is 433 MPPS. However, for some rule sets with more wildcard field values, the achieved throughput may degrade to half of the maximum.

As described above, both EffiCuts and Decision Forest use multiple decision trees. For software approaches, a longer search time is needed to get the best matched rule since all decision trees must be searched for the highest-priority matched rule. In a hardware environment, however, this is no longer a problem because it is easy to design a parallel search architecture. Therefore, the architecture proposed in this paper also employs multiple decision trees along with a parallel architecture designed on FPGA devices.

3. PROPOSED SCHEME

In order to design a high-performance and memory-efficient FPGA-based search engine, the following design issues for selecting appropriate data structures are considered:

Parallel and pipelined architecture: To increase the packet classification throughput, parallel and pipelined architectures are considered to be a better choice. Pipelined architectures allow many incoming packets to be processed in the pipeline stages concurrently, and a classification result can be output in every clock cycle. In this paper, we propose a recursive scheme to divide the rule table into many sub-tables in order to reduce the degree of rule duplication (i.e. memory consumption). We then use a parallel architecture that consists of multiple pipelines, each of which processes a sub-table, to obtain the final search result.

Memory consumption: Our goal is to put the entire data structure of the rule table into the on-chip BRAM of the FPGA so that the search speed will not be dragged down by off-chip memory. For example, the Xilinx Virtex-5/6 devices used in this paper contain 912/1440 BRAM blocks of size 18 Kb, which amounts to approximately 16/26 Mb. Most existing algorithms need more than 16 Mb when dealing with large and complex tables of 10 000 or more rules. Hence we need a memory-efficient data structure that allows large rule tables to fit in the BRAM of FPGA devices. In this paper, the endpoint cutting scheme and a rule bucket compression scheme are proposed to achieve this goal. In addition, the proposed recursive table cutting scheme gives us the flexibility to decide the degree of rule duplication to be allowed in the construction of the decision trees.

A trie or decision-tree-based data structure can be easily implemented in a pipelined architecture. Binary tries have the advantage of simple operations and short processing time in each stage over the decision trees used in HiCuts and HyperCuts. However, binary tries are only suitable for data in prefix format and usually consume more memory than the decision trees. Packet classification algorithms based on decision trees focus on two aspects to determine how to perform the search space decomposition. The first is how to select the cut dimensions and the second is how to decide the cut-points for dividing the address space covered by a node in the decision tree into many subspaces. As stated above, we can select a single dimension or multiple dimensions to perform the space decomposition at a node.
When choosing a single dimension, the resulting decision tree is usually taller than when choosing multiple dimensions at a time, but the node size of the former is smaller than that of the latter. Also, there are two methods to decompose the address space of a node after one or more dimensions are selected: one uses a bit cutting scheme by treating the field values as prefixes and the other uses an endpoint cutting scheme by treating the field values as ranges. Since prefix values can be represented as ranges, it is more flexible to treat all the field values as ranges. In general, we can select many cut-points for the chosen dimensions. But in the hardware architecture, the node size varies when different numbers of cut-points are used and, as a result, the memory design becomes more complicated. Therefore, the proposed endpoint cutting scheme selects one or more cut dimensions and uses only one cut-point for each selected cut dimension for space decomposition. The work presented in this paper is an extension of [30].

3.1. Build the basic decision tree

We first describe the data structure of the basic decision tree based on the proposed endpoint cutting scheme, called NewHypersplits. Then, we describe the recursive endpoint cutting (REC) scheme that recursively uses NewHypersplits to remove the duplicated rules from the decision tree currently being built. The removed duplicated rules are then collected as a second rule table called the recursive table, which is used to build a second decision tree. It is possible that there still exist duplicated rules in the second decision tree; some of them are also removed and used to build a third decision tree. This decision tree building process is performed recursively until no duplicated rule exists in the last decision tree. The REC scheme that generates no duplicated rules is called the basic REC scheme.

We have tried many heuristics for selecting cut dimensions and cut-points designed for the existing decision trees such as HiCuts, HyperCuts and HyperSplit in order to obtain a decision tree that requires a minimal amount of memory. The heuristics that are best suited to our REC scheme are described as follows. The proposed NewHypersplits scheme selects the cut dimensions based on a heuristic similar to the one proposed in HyperCuts [9]. We select the 'larger' dimensions as the cut dimensions. By 'larger', we mean that the number of distinct field values in the selected dimension is greater than or equal to the mean number of distinct field values over all dimensions under consideration. For example, if the numbers of distinct field values for all dimensions in a 5-field rule table are 40, 22, 33, 18 and 12, with a mean of 25, then the first and third dimensions are selected as the cut dimensions. Notice that selecting larger dimensions is simpler in terms of time complexity than selecting the dimension with the minimum value of $\frac{1}{M-1}\sum_{j=1}^{M-1} Sr_i[j]$ as in HyperSplit, described in Section 2. After the cut dimensions are decided, we use the weighted segment-balanced strategy proposed by HyperSplit to perform the space decomposition. If the number of cut dimensions for a node in the decision tree is d, the address space associated with the node is decomposed into 2^d subspaces because only one cut-point is used for each selected cut dimension. Figure 1(d) shows the decision tree with buckets of size 1 built by the heuristics described above.
There are six distinct field values in field X and four distinct field values in field Y, so the average number of distinct field values over all fields is 5. Therefore, field X is selected as the cut dimension at the root node of the decision tree. The cut-point for field X is then calculated as follows. First, the endpoint array is created for field X. The endpoints are calculated by the minus-1 endpoint scheme proposed in [13]. As a result, the sorted endpoints in increasing order are {0, 1, 2} in field X over the address space 0 to 3, and the elementary intervals are {[0,0], [1,1], [2,2], [3,3]}. The rule sets that cover elementary intervals [0,0], [1,1], [2,2] and [3,3] are {R1, R2, R3}, {R1, R3, R4}, {R1, R5} and {R1, R6}, respectively. Therefore, the cut-point at the root is set between endpoints 1 and 2. The final decision tree consists of 8 internal nodes and 11 leaf nodes and the average tree height is 2.91. As the results in Fig. 1 show, the endpoint cutting scheme used by HyperSplit and our NewHypersplits scheme are more memory efficient than the bit cutting scheme used by HiCuts and HyperCuts. This conclusion will be further verified on the larger rule tables used in the performance evaluation section. Similar to the advantage of HyperCuts over HiCuts, the NewHypersplits scheme performs better than HyperSplit because multiple cut dimensions can be used. Although our decision tree also generates many duplicated rules, the recursive endpoint cutting (REC) scheme proposed below can make the tree even shorter by removing the duplicated rules. As a result, the memory storage needed by NewHypersplits can be decreased dramatically.
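The weighted segment-balanced selection used above can be sketched as follows (illustrative C, not the authors' code; the elementary intervals and their covering-rule counts are assumed to have already been built with the minus-1 endpoint scheme). For the example above the counts are {3, 3, 2, 2} with total 10, so the smallest prefix sum exceeding 5 ends at the second interval, which is why the cut falls between endpoints 1 and 2.

/* Weighted segment-balanced strategy: counts[0..n-1] holds the number of
   rules covering each elementary interval of one dimension, left to right.
   Return the 0-based index k of the last interval kept on the left side of
   the cut, i.e. the smallest k such that
       counts[0] + ... + counts[k] > (counts[0] + ... + counts[n-1]) / 2.
   The cut-point is the endpoint separating interval k from interval k+1.  */
int weighted_segment_balanced(const int *counts, int n)
{
    long total = 0;
    for (int j = 0; j < n; j++)
        total += counts[j];

    long acc = 0;
    for (int k = 0; k < n; k++) {
        acc += counts[k];
        if (2 * acc > total)       /* acc > total/2 without rounding loss */
            return k;
    }
    return n - 1;                  /* not reached for n > 0 */
}

In NewHypersplits this selection is applied once per selected 'larger' cut dimension, giving one cut-point per dimension and hence 2^d sub-boxes for d cut dimensions.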
3.2. Recursive endpoint cutting (REC) scheme

Since rules are not uniformly distributed in the address space, many rules in the rule table overlap each other and the number of mutually overlapping rules varies vastly. The problem caused by rule overlapping is very serious for larger rule tables such as IPC and FW tables. No matter what existing decision tree is used, some rules may be replicated many times. It is not easy to avoid rule duplication completely. For example, consider three rules where rules A and B are disjoint and rule C completely covers both A and B. If we want to partition these three rules, no matter how the cutting operation is performed, rule C always needs to be replicated. The proposed REC scheme can solve the rule duplication problem and thus the total required memory is reduced significantly.

We now use the same rule table as in Fig. 1 to show how the rule duplication problem is solved in the proposed REC. In the first step, we apply the push-up technique to move the duplicated rules up to the node at which the rules start to be duplicated. We consider two duplication cases for applying push-up operations. A rule in a node that is duplicated into all of its child nodes is called a fully duplicated rule, and a rule in a node that is duplicated in only some of its child nodes is called a partially duplicated rule. If the partially duplicated rules are also pushed up, no duplicated rule will be generated in the decision tree. As shown in Fig. 2(a), R1 is the fully duplicated rule to be pushed up. At the node associated with address space {[0:1],[0:3]}, R3 is also a fully duplicated rule to be pushed up. By pushing up rules R1 and R3, no rule is duplicated. Normally, the pushed-up rules are stored in the internal nodes. However, searching the rules in the internal nodes along the path from the root to a leaf is a slow process. To speed up this process, all the rules in internal nodes are removed from the tree into a separate table called the recursive-1 table. We then follow the same decision tree building heuristics to construct the decision tree from the recursive-1 table, as shown in Fig. 2(b). Here, we assume only the highest-priority rule is the final match, so the left child node of the root contains only rule R3 in Fig. 2(b). In a more complicated case, we may have to split the recursive-1 table again to avoid rule duplication. This recursive tree building process continues until no more recursive table is generated.

Notice that allowing a minimum number of duplicated rules may be beneficial because it generates fewer decision trees and fewer memory accesses for completing a search operation. In this paper, two additional extensions are proposed. The first extension allows only the last decision tree to have duplications: as long as the last rule table contains no more than ts_threshold rules (called the tree size threshold), the recursion in the tree construction process is not applied. The rationale is that when the rule table is very small, rule duplication is not a big concern. In the second extension, duplications are allowed in all the decision trees with the following restriction: the number of times each rule is duplicated in a decision tree cannot be larger than a rule duplication threshold (dupl_threshold). In other words, when a rule needs to be duplicated more than dupl_threshold times, it is inserted into the next decision tree.

Figure 2. The recursive endpoint based cutting scheme. (a) Push-up of the duplicated rules. (b) The decision trees constructed by REC.

Rule updates are supported in the proposed REC scheme as follows. Figure 3 shows the rule insertion algorithm. Since the REC scheme is implemented in a pipelined architecture, the heights of the decision trees are fixed at that of the tallest decision tree (decision tree 0). Assume rule x is to be inserted in tree i. If tree i is the last tree, the threshold-based rule removal policy is not employed, as in line 1. We temporarily insert rule x into the tree based on the cut dimensions and endpoints already computed in the process of tree construction. If the number of duplicated copies of rule x is greater than dupl_threshold, the insertion of rule x is aborted and tree i is unchanged. Instead, we try to insert rule x into tree i + 1. All other details of the insertion operation are self-explanatory in Fig. 3. The time complexity of the insertion algorithm depends on NumofTrees and on H, the height of the tallest tree (i.e. tree 0). For the rule sets generated by ClassBench, the number of rules that overlap each other is a constant. Therefore, NumofTrees is a constant, which will be verified in the performance evaluation section. Also, since each internal node of the NewHypersplits decision tree has 2–8 children, the tree height H is O(log₂ N), where N is the number of rules in the rule set. As a result, the time complexity of inserting a rule is NumofTrees × O(log₂ N) = O(log₂ N) because NumofTrees is a constant.

Figure 3. Insertion algorithm for the proposed REC scheme.
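The insertion procedure of Fig. 3 can be rendered roughly as the following C sketch. It is a simplified paraphrase of the description above, not the authors' code; count_duplicates() and try_insert() are hypothetical helpers that reuse the cut dimensions and endpoints fixed when the trees were built.

struct rule;                                          /* opaque rule record  */

/* Hypothetical helpers that walk decision tree `tree` using the cut
   dimensions and endpoints fixed at construction time.                      */
int count_duplicates(int tree, const struct rule *x); /* copies x would need */
int try_insert(int tree, const struct rule *x);       /* 1 on success        */

/* Insert rule x into the forest of NumofTrees decision trees.  A rule that
   would be duplicated more than dupl_threshold times in tree i is deferred
   to tree i+1; the last tree accepts it regardless of the threshold.        */
int rec_insert(const struct rule *x, int NumofTrees, int dupl_threshold)
{
    for (int i = 0; i < NumofTrees; i++) {
        int last = (i == NumofTrees - 1);
        if (!last && count_duplicates(i, x) > dupl_threshold)
            continue;                  /* tree i left unchanged; try tree i+1 */
        return try_insert(i, x);       /* fails only if buckets are full      */
    }
    return 0;                          /* no tree could accommodate rule x    */
}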
To delete a rule y, we employ a simple approach: we wipe out the records of all the duplicates of rule y in the associated buckets and leave those bucket slots free to hold newly inserted rules later. Our update design depends on the height of the tallest tree because the number of pipeline stages for the decision tree must be fixed at a predefined number H. Therefore, many rules may accumulate in the last tree, whose height may then become larger than H. If this happens, the hardware pipeline structure for the proposed REC scheme must be rebuilt.

3.3. Compress memory buckets

In addition to the memory reduction obtained from the decision trees, the memory usage of the rule buckets, which is usually ignored in many existing software approaches, must also be reduced. In software approaches, only one bucket is searched at a time, so only the rule IDs, not the contents of the rules, have to be stored in the buckets. Based on our experimental results, the memory needed for all rule buckets accounts for 55–82% of the total memory in the pipelined architecture. Since our goal is to put the entire rule table in the on-chip memory of current FPGA devices, it is very important to develop an efficient data structure for storing the contents of the rules in the buckets.

In decision-tree-based packet classification, buckets are used to store the rules associated with the internal and leaf nodes of the decision tree. By traversing the decision tree from the root to a leaf node based on the header values of the incoming packet, the rules contained in the buckets of the traversed internal and leaf nodes are the possible candidates for a match. After matching every rule in the buckets, the matched rule with the highest priority is output from the memory bucket pipeline as the search result. If a sequential search on a software platform is considered, we only need to store one copy of the rule table and use rule pointers (i.e. IDs) to access the rules in a bucket. However, in a pipelined architecture, one copy of the rule table is insufficient to support concurrent accesses to many rules per cycle. Each pipeline stage needs one independent memory unit so that all the stages can access their data in parallel.

We first review some existing bucket rule mapping schemes. Most FPGA-based packet classification schemes use a direct mapping scheme that maps one leaf node bucket onto one memory bucket. In other words, each rule of the bucket is mapped to a stage of the memory bucket pipeline. Figure 4 shows a rule mapping example in which 9 leaf buckets of size 4 are mapped onto 9 memory buckets based on the direct mapping scheme. By recording the index of the memory bucket in the leaf bucket, the rules of the memory bucket at the recorded index can be accessed in each stage. Direct mapping is straightforward but has the following disadvantages. First, the memory requirement is unacceptable: the number of memory buckets needed is equal to the number of non-empty leaf nodes in the decision tree, and in large complex rule sets such as IPC or FW there are tens of thousands of non-empty buckets in the decision tree. Second, the number of rules in each bucket is not always equal to the bucket size; thus, direct mapping leaves too many unused slots. Third, the problem of duplicated rules is serious. As shown in stage 1 of Fig. 4(a), there are only two distinct rules, R1 and R2, but 4 copies of R1 and 5 copies of R2 are stored.
Figure 4. Mapping schemes for the memory bucket pipeline. (a) Direct mapping. (b) Variable-width mapping. (c) The proposed bucket merge mapping.

Wagner et al. [25] proposed a variable-width mapping to reduce the unused memory slots in the memory bucket pipeline. Figure 4(b) shows the variable-width mapping. The leaf buckets are first sorted in increasing order of their sizes. Instead of always allocating a new memory bucket for a leaf bucket, we select an existing memory bucket whose unused slots can accommodate the rules of the newly arriving leaf bucket. This mapping approach decreases the unused space, but the problem of rule duplication remains the same.

We observe that there are many similar leaf node buckets that share common rules. There is a high probability that these similar leaf node buckets come from the same ancestors in the decision tree. If we can merge these similar buckets, the problem of rule duplication can be mitigated significantly. For this reason, we propose the following greedy bucket compression scheme to reduce the duplicated rules in the memory bucket pipeline. We try to reuse the rules already assigned to the existing memory bucket pipeline. Our heuristic is very simple and efficient. The number of stages in the memory bucket pipeline may be varied. Adding more stages usually decreases the memory requirement and the overall throughput remains the same, but the response time becomes longer and the hardware cost is higher.

Figure 5 is the pseudo code for the proposed bucket compression algorithm, BucketCompression(). First, all the buckets are sorted in decreasing order of their sizes. When adding a new bucket (say A), we check a memory bucket (say B) to which some rules are already assigned. If the number of rules in the union of bucket A and memory bucket B is not larger than the number of pipeline stages (numofstages), the rules of bucket A can be inserted into memory bucket B. We say that memory bucket B is a candidate memory bucket into which the rules can be inserted. In this paper, we select the first candidate memory bucket to perform the compression operation. If no memory bucket can accommodate the rules of the new leaf bucket, we create a new memory bucket to hold them. Figure 4(c) shows the bucket compression result for the same leaf bucket example, where only four memory buckets of size four are needed. More than one leaf bucket may be merged into one memory bucket. The proposed bucket compression scheme resolves the serious rule duplication problem and the total memory usage is much smaller than with the direct mapping and variable-width mapping schemes.

Figure 5. The proposed bucket compression algorithm.
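A software rendering of the BucketCompression() heuristic of Fig. 5 might look as follows (an illustrative C sketch under the assumption that rules are identified by integer IDs and that leaf buckets have already been sorted in decreasing order of size; it is not the authors' pseudo code).

#include <stdlib.h>
#include <string.h>

#define NUMOFSTAGES 16                 /* memory bucket pipeline depth (assumed) */

struct bucket { int ids[NUMOFSTAGES]; int n; };

/* Size of the union of a leaf bucket and a memory bucket (rule IDs). */
static int union_size(const struct bucket *a, const struct bucket *b)
{
    int n = b->n;
    for (int i = 0; i < a->n; i++) {
        int found = 0;
        for (int j = 0; j < b->n && !found; j++)
            found = (a->ids[i] == b->ids[j]);
        if (!found) n++;
    }
    return n;
}

/* Merge leaf bucket a into memory bucket b (shared rules are kept once). */
static void merge_into(const struct bucket *a, struct bucket *b)
{
    for (int i = 0; i < a->n; i++) {
        int found = 0;
        for (int j = 0; j < b->n && !found; j++)
            found = (a->ids[i] == b->ids[j]);
        if (!found) b->ids[b->n++] = a->ids[i];
    }
}

/* Greedy bucket compression: each leaf bucket (pre-sorted by decreasing size)
   is merged into the first memory bucket whose union still fits in the
   pipeline; otherwise a new memory bucket is opened.  Returns the number of
   memory buckets used.                                                       */
int bucket_compression(const struct bucket *leaf, int nleaf,
                       struct bucket *mem /* capacity >= nleaf */)
{
    int nmem = 0;
    for (int i = 0; i < nleaf; i++) {
        int placed = 0;
        for (int b = 0; b < nmem && !placed; b++)
            if (union_size(&leaf[i], &mem[b]) <= NUMOFSTAGES) {
                merge_into(&leaf[i], &mem[b]);   /* first candidate wins */
                placed = 1;
            }
        if (!placed) {                           /* open a new memory bucket */
            memcpy(&mem[nmem], &leaf[i], sizeof leaf[i]);
            nmem++;
        }
    }
    return nmem;
}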
4. PARALLEL AND PIPELINED SEARCH ENGINE

4.1. Architecture overview

Many hardware-based packet classification solutions have been proposed in recent years [23–27]. These solutions can only implement ACL rule tables in their hardware architectures because, for other rule tables, the memory usage of their data structures is larger than the on-chip BRAM that FPGA devices can provide. We reduce the memory requirement by using a parallel and pipelined search engine that implements the multiple decision trees generated by the proposed REC scheme. In order to support high link rates such as OC-768 (40 Gbps), we implement our search engine on modern FPGA devices and achieve a throughput of more than 100 Gbps. In the proposed parallel and pipelined search engine, each decision tree corresponds to a pipeline and all the pipelines are executed in parallel to improve the throughput. Figure 6 shows the block diagram of the proposed search engine. Each pipeline outputs the ID of its matching rule. In this paper, the rule ID is assumed to be the rule priority. All pipelines have the same number of stages so that the outputs of all the pipelines reach the priority encoder at the same time to compute the final match with the highest priority.

Figure 6. The hardware architecture.

4.2. Decision tree pipeline

In the decision tree pipeline, we have to map the nodes of the decision tree onto pipeline stages and design the circuits that compare the selected dimensions and cut-points against the header values of incoming packets to perform the matching process. We map the tree nodes to pipeline stages based on the tree level, so the nodes at the same level are mapped to the same pipeline stage. The number of decision tree pipeline stages is equal to the height of the decision tree. Because our decision tree is not a complete tree, the leaf nodes are not necessarily at the bottom level. We push all the leaf nodes to the bottom tree level, which is then mapped onto the last stage. Each internal node has a flag (called Nop_out) to indicate whether it is the lowest internal node connecting to a leaf node in the decision tree. In other words, the lowest internal node forwards an asserted Nop_out flag to the next stage. When the stage corresponding to an internal node receives an asserted Nop_out flag, this stage does nothing but pass the information from the previous stage to the next stage. When the leaf node stage is reached, the index of the corresponding memory bucket is output. We push all leaf nodes to the last stage for two reasons. First, the data structures of the internal and leaf nodes have different sizes; storing them in the same memory unit would be complicated and would waste memory space. Second, the operations in internal nodes and leaf nodes are different, i.e. internal nodes need to perform the comparison operations on the input packet header values while the leaf nodes only read the index of a memory bucket. If we placed internal and leaf nodes in the same stage, each stage would need two memory units (one for internal nodes and one for leaf nodes) or a complex data structure merging the two, along with two sets of logic.

The data structure of the internal nodes is shown in Fig. 7(a). No more than three dimensions are selected as the cut dimensions and at most three cut-points are needed. The CutDimBitmap field records which dimensions are selected as the cut dimensions. There are at most eight child nodes, each of which can be a leaf or an internal node. Therefore, we use an 8-bit bitmap called ChildTypeBitmap to record the child types. When the ith bit of ChildTypeBitmap is 0, the ith child is an internal node; otherwise, it is a leaf node. Since the sizes of leaf and internal nodes are different and they are stored in two different arrays, we need two individual base addresses, LeafBase and InternalNodeBase, respectively. In order to speed up the computation of the child node address, we use a precomputed array of size 8, PrecomputedOffset[], whose usage can be understood easily by the following example. Assume that we have to go to the ith child node after comparing the selected header values with the three cut-points. The type of this child node can be determined by checking ChildTypeBitmap[i]. If ChildTypeBitmap[i] indicates that the node is a leaf node, we add the precomputed offset PrecomputedOffset[i] to LeafBase to obtain the address of the ith child node. Otherwise, if ChildTypeBitmap[i] indicates an internal node, the address of the ith child node is InternalNodeBase + PrecomputedOffset[i] × LeafSize.
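Putting the matching and address computations together, one internal-node stage behaves roughly as in the following C sketch (a simplified software model of the behaviour described above, not the RTL; the field widths and the strictly-greater comparison against the cut-points are illustrative assumptions, and the address formula simply mirrors the text).

#include <stdint.h>

/* Simplified internal-node record (cf. Fig. 7(a)); widths are illustrative. */
struct inode {
    uint8_t  cut_dim[3];             /* up to three cut dimensions            */
    uint8_t  num_dims;               /* how many of them are valid            */
    uint32_t cut_point[3];           /* one cut-point per selected dimension  */
    uint8_t  child_type_bitmap;      /* bit i = 1: the i-th child is a leaf   */
    uint8_t  precomputed_offset[8];  /* per-child offset, precomputed         */
    uint32_t leaf_base;              /* base address of the leaf array        */
    uint32_t inode_base;             /* base address of the internal nodes    */
};

/* Matching unit: compare the selected header fields against the cut-points
   to form an (up to) 3-bit child index.                                      */
static unsigned child_index(const struct inode *n, const uint32_t hdr[5])
{
    unsigned idx = 0;
    for (int d = 0; d < n->num_dims; d++)
        idx |= (hdr[n->cut_dim[d]] > n->cut_point[d]) << d;
    return idx;
}

/* Address detection unit: compute the next-stage address, following the
   formulas given in the text (leaf: LeafBase + offset; internal node:
   InternalNodeBase + offset * LeafSize).                                     */
static uint32_t next_address(const struct inode *n, unsigned idx,
                             uint32_t leaf_size, int *is_leaf)
{
    *is_leaf = (n->child_type_bitmap >> idx) & 1;
    if (*is_leaf)
        return n->leaf_base + n->precomputed_offset[idx];
    return n->inode_base + n->precomputed_offset[idx] * leaf_size;
}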
The leaf nodes store the index of the memory bucket to be searched. In order to utilize the BRAMs efficiently, we select an appropriate memory bucket size such that the number of memory buckets is not larger than 1024. Hence, only 10 bits are needed for the index of the memory buckets.

Figure 7. Structures of an internal node and a bucket rule. (a) The 147-bit internal node structure. (b) The 162-bit bucket rule structure.

Figure 8 shows the block diagram of the internal node stage. The internal node memory stores the data contents of all the internal nodes. The matching unit computes which branch is to be taken based on CutDimBitmap and the three cut-points. The address detection unit calculates the address of the next-level node by using ChildTypeBitmap, the array PrecomputedOffset[] and the two base addresses InternalNodeBase and LeafBase. Each internal node stage is divided into two sub-stages in order to further reduce the stage delay. Because the proposed decision tree is not too high, adding more stages is acceptable. Figure 9 shows the diagram of the matching unit, which consists of three 5 × 1 multiplexers for extracting the at most three header values of the incoming packet selected by Unit_of_Cut_Dimensions based on CutDimBitmap. The 1-bit results from these three 5 × 1 multiplexers are combined to form a 3-bit index into the array PrecomputedOffset[]. The Address Detection Unit is responsible for calculating the address of the child node based on InternalNodeBase, LeafBase, ChildTypeBitmap and PrecomputedOffset[], as described above.

Figure 8. The block diagram for the internal node stage.

Figure 9. The matching unit block diagram.

4.3. Memory buckets pipeline

The data structure of a memory bucket is shown in Fig. 7(b). The RulePriority field is the priority of the rule and is used to find the matched rule with the highest priority. As described earlier, we use the rule ID as the priority of the rule in this paper. The 13-bit RulePriority field indicates that there are at most 2^13 rules in the rule table under consideration. The source and destination IP fields are prefixes represented in the length format: a 32-bit IP address and a 6-bit length. Obviously, the main task in each memory bucket pipeline stage is to compare the 5-field rule stored in the stage against the packet's header values to find a matched rule. The rules in each memory bucket are sorted in increasing order of their priorities. Therefore, the last matched rule in a memory bucket is the one with the highest priority.
In addition, each stage also outputs its stage ID, called the matched stage ID, to the next stage if a match is found. The usage of the matched stage ID is discussed later. In FPGA devices, the BRAM blocks that we can allocate are restricted to sizes of 1024n × 18m or 512n × 36m bits, where n and m are positive integers. A BRAM block of size 1024n × 18m bits is restricted to being used as 1024n entries of 18m bits each. The smallest BRAM block is of size 18 Kbits (1 K × 18 bits). Therefore, the 162-bit bucket rule causes no BRAM waste when there are 1024 memory buckets, which need a BRAM block of size 1024 × 18 × 9 bits, i.e. 1024 entries of 162 bits. However, for a table of 10 K rules, a memory rule needs 163 bits because the RulePriority field (i.e. the rule ID) needs 14 bits. In this case, we would have to allocate a BRAM block containing 1024 entries of 180 bits, which wastes 17 bits per entry. To solve this problem, the following memory optimization is developed. The rule priority is not used in the memory bucket pipeline when the rule is matched against the header values of the packets. Therefore, we propose a rule priority split scheme that splits the RulePriority field into two sub-fields of 13 bits and 1 bit. The 1-bit sub-fields of all the rules in the same bucket are then aggregated into an additional memory bucket stage called the rule priority combiner stage in the memory bucket pipeline. Figure 10 shows the architecture of the memory bucket pipeline appended with the additional rule priority combiner stage. Each memory bucket stage performs the rule comparison against the header values of the incoming packets. If the packet's header values match a rule, the 13-bit partial rule priority and the matched stage ID are output to the next stage. Otherwise, we just forward the input rule priority and the input stage ID to the next stage. After the rule comparisons in all the memory bucket stages, the 1-bit sub-field corresponding to the last stage with a matched rule is reclaimed from the rule priority combiner stage to complete the computation of the rule priority of the matched rule. In other words, at the rule priority combiner stage, the input stage ID is used to extract the corresponding 1-bit sub-field, which is then concatenated with the input rule priority to form the priority of the matched rule. Finally, the rule priorities (i.e. matched rule IDs) of all search engines converge on the priority encoder to obtain the final matched result. The memory saving of the proposed rule priority split scheme can be understood easily by the following example. If there are n original memory bucket stages, n − 1 BRAM blocks of size 1024 × 18 bits are saved compared with the case without the rule priority split scheme, where n is 17 or less for all the tables of 10 K rules we experimented with. For the large tables of 32 K or 64 K rules, the same optimization technique can be applied accordingly.

Figure 10. The memory bucket pipeline enhanced with one additional rule priority combiner stage.
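The saving can be verified with a short calculation; the following C snippet just reproduces the arithmetic from the text for BRAM blocks organized as 1024 entries of 18 bits (n = 17 stages is the largest value observed for the 10 K-rule tables).

#include <stdio.h>

/* Each 18-Kb BRAM block can be used as 1024 entries x 18 bits, so a stage
   holding 1024 bucket rules of `bits` bits needs ceil(bits / 18) blocks.  */
static int blocks_per_stage(int bits) { return (bits + 17) / 18; }

int main(void)
{
    int n = 17;  /* memory bucket stages (<= 17 for the 10 K-rule tables) */

    /* 163-bit rules (14-bit priority): 10 columns, i.e. 180 bits/entry.  */
    int without_split = n * blocks_per_stage(163);

    /* Split the priority into 13 + 1 bits: every stage stays at 162 bits
       (9 columns) and one extra combiner stage gathers the n 1-bit parts. */
    int with_split = n * blocks_per_stage(162) + blocks_per_stage(n);

    printf("blocks without split: %d\n", without_split);             /* 170 */
    printf("blocks with split:    %d\n", with_split);                /* 154 */
    printf("blocks saved:         %d\n", without_split - with_split); /* 16 = n - 1 */
    return 0;
}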
5. PERFORMANCE EVALUATION

In this section, we conduct two types of performance evaluations to show the superiority of the proposed schemes. First, we conduct a performance evaluation in a simple software environment to compare the proposed bucket compression algorithm with the variable-width scheme and to compare the proposed REC against a state-of-the-art scheme, EffiCuts [10]. Second, we evaluate the performance of the proposed REC scheme in terms of memory consumption and speed through an implementation on FPGA devices and compare it with other packet classification approaches. The rule tables used in the experiments are Access Control List (ACL), Firewall (FW) and IP Chains (IPC) tables of 10 K and 100 K rules, generated by ClassBench [31] with the seeds ACLm, FWm and IPCn for m = 1–5 and n = 1–2. The bucket size of all decision trees built in the experiments is set to 8. In the simple software environment, a PC with an Intel Core i5-4460 @ 3.2 GHz (4-core) CPU is used. The simulated algorithms are implemented in the C language. In the FPGA environment, we use the Xilinx ISE 13.1 development tools, and our target devices are the Xilinx Virtex-5 [32] XC5VFX200T, containing 30 720 slices (each slice contains four LUTs and four flip-flops) and 912 BRAM blocks of size 18 Kb (16 416 Kb), and the Virtex-6 [33] XC6VLX760, containing 118 560 slices and 1440 BRAM blocks of size 18 Kb (25 920 Kb), both with −2 speed grade.

Table 2 shows the number of buckets needed by NewHypersplits without compression and after applying the variable-width mapping compression and the proposed bucket merge mapping compression for tables of 100 K rules. The percentages shown in parentheses in the VWidth and BMerge columns are the bucket compression ratios of the evaluated scheme relative to the original direct mapping scheme. We can see that the compression ratio of the proposed mapping is in the range of 9–15%, which is much better than the 45–70% achieved by the variable-width mapping. Table 3 shows the numbers of trees, memory accesses per lookup, inodes and bucket rules, and the memory usage for tables of 100 K rules. The difference in memory usage between the proposed REC with dupl_threshold set to 0 and to 50 is insignificant. However, the REC with dupl_threshold = 50 needs 4–13 fewer memory accesses on average per lookup than the REC with dupl_threshold = 0. REC with dupl_threshold = 50 needs only 5–66% of the memory needed by EffiCuts. The number of memory accesses per lookup for EffiCuts is 9–10 fewer than for the REC with dupl_threshold = 50. Nonetheless, this disadvantage can be remedied by the pipelined design as implemented in the FPGA devices below.

Table 2. The number of buckets of size 8 for NewHypersplits without compression (Direct), with variable-width mapping (VWidth) and with the proposed bucket merge mapping (BMerge) for tables of various sizes (up to around 100 K rules). Percentages in parentheses are compression ratios relative to Direct.
(columns: acl1 | acl2 | acl3 | acl4 | acl5 | Avg | fw1 | fw2 | fw3 | fw4 | fw5 | avg | ipc1 | ipc2 | avg)
# of rules: 99 193 | 66 280 | 85 845 | 84 514 | 97 310 | 86 628 | 65 742 | 83 659 | 63 977 | 67 484 | 48 376 | 65 848 | 88 396 | 92 678 | 90 537
Direct: 53 074 | 315 961 | 261 053 | 263 352 | 38 715 | 186 431 | 117 534 | 146 715 | 55 211 | 78 086 | 94 229 | 98 355 | 102 897 | 142 571 | 122 734
VWidth (%): 25 349 (48) | 134 057 (42) | 115 634 (44) | 122 995 (47) | 17 463 (45) | 83 100 (45) | 53 228 (45) | 72 051 (49) | 26 389 (48) | 39 814 (51) | 49 677 (53) | 48 232 (49) | 52 076 (51) | 119 901 (84) | 85 988.5 (70)
BMerge (%): 17 753 (33) | 14 125 (4) | 17 311 (7) | 16 967 (6) | 14 537 (38) | 16 139 (9) | 9443 (8) | 15 285 (10) | 8390 (15) | 9745 (12) | 6255 (7) | 9824 (10) | 13 259 (13) | 20 390 (14) | 16 825 (14)

Table 3. Numbers of trees and memory accesses, inodes and bucket rules, and memory usage for tables of 100 K rules.
(columns: acl1 | acl2 | acl3 | acl4 | acl5 | Avg | fw1 | fw2 | fw3 | fw4 | fw5 | avg | ipc1 | ipc2 | avg)

New HyperSplits (dupl_thd = 0)
Trees (max depth): 4 (13) | 8 (14) | 8 (13) | 8 (14) | 2 (14) | 6 (14) | 12 (12) | 4 (13) | 8 (12) | 15 (12) | 11 (14) | 10 (13) | 6 (14) | 2 (16) | 4 (15)
Accesses, inode: 22 | 44 | 43 | 42 | 11 | 32.4 | 64 | 30 | 48 | 76 | 66 | 56.8 | 41 | 26 | 33.5
Accesses, bucket rule: 14 | 28 | 31 | 29 | 2 | 20.8 | 47 | 15 | 25 | 43 | 34 | 32.8 | 25 | 13 | 19
Accesses, total: 36 | 72 | 74 | 71 | 13 | 53.2 | 111 | 45 | 73 | 119 | 100 | 89.6 | 66 | 39 | 52.5
# of inodes (k): 17.6k | 128.5k | 101.6k | 124.4k | 15.55k | 77.5k | 65.3k | 69.97k | 29.2k | 38.2k | 125.1k | 65.546k | 80.1k | 122k | 101k
# of bucket rules (k): 142k | 133k | 158k | 159k | 140k | 146k | 107k | 135k | 90.7k | 103k | 91.2k | 105k | 121k | 155k | 138k
Total Mem (MB): 2.9 | 4.73 | 4.71 | 5.14 | 2.82 | 4.06 | 3.12 | 3.71 | 2.17 | 2.55 | 3.92 | 3.09 | 3.64 | 5.01 | 4.33

New HyperSplits (dupl_thd = 50)
Trees (max depth): 2 (14) | 6 (15) | 5 (15) | 6 (15) | 2 (14) | 4.2 (15) | 9 (12) | 2 (16) | 5 (13) | 10 (13) | 8 (14) | 6.8 (14) | 4 (15) | 2 (16) | 3 (16)
Accesses, inode: 19 | 41 | 37 | 39 | 11 | 29.4 | 59 | 26 | 43 | 68 | 60 | 51.2 | 38 | 26 | 32
Accesses, bucket rule: 10 | 22 | 25 | 26 | 2 | 17 | 38 | 9 | 18 | 34 | 28 | 25.4 | 20 | 13 | 16.5
Accesses, total: 29 | 63 | 62 | 65 | 13 | 46.4 | 97 | 35 | 61 | 102 | 88 | 76.6 | 58 | 39 | 48.5
# of inodes (k): 17.6k | 129k | 102k | 124k | 15.5k | 77.6k | 65.4k | 79.2k | 29.3k | 38.6k | 125k | 67.5k | 80.2k | 122k | 101k
# of bucket rules (k): 143k | 133k | 159k | 160k | 140k | 147k | 107k | 174k | 90.6k | 105k | 91.3k | 114k | 122k | 155k | 138k
Total Mem (MB): 2.9 | 4.73 | 4.73 | 5.15 | 2.82 | 4.07 | 3.13 | 4.59 | 2.17 | 2.6 | 3.92 | 3.28 | 3.66 | 5.01 | 4.34

EffiCuts
Trees (max depth): 5 (11) | 7 (12) | 8 (11) | 9 (11) | 2 (12) | 6.2 (11) | 9 (15) | 7 (12) | 9 (14) | 10 (16) | 9 (16) | 8.8 (15) | 9 (12) | 4 (11) | 6.5 (12)
Accesses, inode: 13 | 18 | 24 | 24 | 4 | 16.6 | 29 | 16 | 20 | 25 | 26 | 23.2 | 28 | 10 | 19
Accesses, bucket rule: 7 | 25 | 33 | 28 | 6 | 19.8 | 43 | 20 | 32 | 78 | 47 | 44 | 35 | 4 | 19.5
Accesses, total: 20 | 43 | 57 | 52 | 10 | 36.4 | 72 | 36 | 52 | 103 | 73 | 67.2 | 63 | 14 | 38.5
# of inodes: 4012 | 20 834 | 9639 | 8955 | 4549 | 9598 | 121 708 | 13 099 | 60 292 | 165 723 | 181 452 | 108 455 | 9573 | 2882 | 6227
# of bucket rules (k): 150k | 772k | 439k | 431k | 99.8k | 378k | 3570k | 1060k | 1945k | 5768k | 5746k | 3618k | 615k | 100k | 357k
Total Mem (MB): 2.79 | 14.42 | 8.16 | 8.00 | 1.90 | 7.05 | 67.24 | 19.47 | 36.50 | 107.99 | 107.93 | 67.83 | 11.34 | 1.87 | 6.61

Subsequently, we evaluate the memory storage and search throughput of the proposed REC scheme on FPGA devices. Table 4(a) lists the detailed memory usage for 12 rule tables. The D-tree and Bucket rows show the overall memory usage of the decision tree pipelines and the memory bucket pipelines for each of these twelve tables, respectively. The numbers of internal nodes (inodes) and bucket rules are also shown. Table 4(b) shows the detailed memory usage of each search engine for tables acl1, fw1 and ipc1. We can see that the total memory required for these rule tables can easily be accommodated in the block RAM (BRAM) of the FPGA devices. The number of decision trees is a trade-off between memory size and hardware cost: if we create more decision trees, the total required memory becomes smaller, but more logic is needed to implement all the search engines. The number of stages in the memory bucket pipeline also needs to be tuned so that the most efficient memory usage is achieved. It is determined by the BRAM restriction stated before; in other words, the number of entries of a BRAM block must be a multiple of 1024 and the width of BRAM entries must be a multiple of 18 bits.
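This allocation rule can be summarized by a small helper that counts how many 18 Kbit blocks a memory of a given depth and width occupies. The sketch below is a software approximation under the stated 1024-entry/18-bit granularity assumption; the function name and the example values (the 162-bit versus 163-bit bucket entries discussed above) are ours, not the authors'.

```c
#include <stdio.h>

/* Number of 18 Kbit BRAM blocks needed for a memory of `entries` words,
 * each `width` bits wide, when depths are rounded up to a multiple of
 * 1024 entries and widths to a multiple of 18 bits. */
static unsigned bram_blocks(unsigned entries, unsigned width)
{
    unsigned depth_units = (entries + 1023) / 1024;  /* multiples of 1 K  */
    unsigned width_units = (width + 17) / 18;        /* multiples of 18 b */
    return depth_units * width_units;
}

int main(void)
{
    /* 1024 bucket entries of 162 bits fit exactly in 9 blocks ...        */
    printf("162-bit entries: %u blocks\n", bram_blocks(1024, 162));
    /* ... while 163-bit entries force a 180-bit wide allocation (10).    */
    printf("163-bit entries: %u blocks\n", bram_blocks(1024, 163));
    return 0;
}
```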
Table 4. Memory usage in KB and the number of stages for tables of 10 K rules.

(a) (columns: acl1 | acl2 | acl3 | acl4 | acl5 | fw1 | fw2 | fw3 | fw4 | fw5 | ipc1 | ipc2)
# of rules: 9603 | 9429 | 9424 | 9643 | 7262 | 9311 | 9652 | 9025 | 8865 | 8815 | 9502 | 10 000
# of D-trees: 2 | 6 | 5 | 6 | 2 | 3 | 4 | 7 | 7 | 7 | 4 | 2
# of inodes: 3197 | 6092 | 16 868 | 12 321 | 1474 | 8362 | 8514 | 8137 | 3960 | 11 917 | 9463 | 10 718
# of bucket rules: 15 380 | 12 328 | 48 100 | 28 222 | 15 740 | 16 384 | 11 933 | 9792 | 11 116 | 12 413 | 24 073 | 15 264
D-tree (max depth): 66.3 (13) | 109 (14) | 303 (15) | 221 (14) | 26 (9) | 266 (15) | 153 (14) | 146 (14) | 71 (12) | 214 (14) | 228 (14) | 192 (13)
Bucket (# of stages): 318.2 (15) | 244 (15) | 951 (13) | 558 (12) | 311 (15) | 327 (10) | 236 (10) | 194 (12) | 220 (12) | 245 (10) | 479 (13) | 302 (13)
Total Mem (KB): 384.5 | 353 | 1 254 | 779 | 337 | 593 | 389 | 340 | 291 | 459 | 707 | 494

(b) (columns: acl1 | fw1 | ipc1)
Engine0, D-tree (depth): 61.0 (13) | 232.1 (12) | 156.6 (14)
Engine0, Bucket (# of stages): 273.6 (15) | 179.1 (10) | 241.4 (13)
Engine0, Total: 334.6 | 411.1 | 398.0
Engine1, D-tree (depth): 5.3 (9) | 8.2 (10) | 19.0 (10)
Engine1, Bucket (# of stages): 44.6 (9) | 55.9 (9) | 48.9 (9)
Engine1, Total: 50.0 | 64.1 | 67.9
Engine2, D-tree (depth): – | 25.8 (7) | 5.7 (7)
Engine2, Bucket (# of stages): – | 92.0 (9) | 5.1 (9)
Engine2, Total: – | 117.8 | 10.8
Engine3, D-tree (depth): – | – | 46.7 (9)
Engine3, Bucket (# of stages): – | – | 183.6 (2)
Engine3, Total: – | – | 230.4
Bucket memory %: 82.3% | 55.2% | 67.8%
Total memory (KB): 384.5 | 592.99 | 707.0

Therefore, we increase the number of stages in the memory bucket pipeline from the leaf node bucket size (i.e. 8) to an appropriate number, denoted numofstages, so that the number of memory buckets is very close to 1024. The chosen values of numofstages are the numbers shown in parentheses in the Bucket rows of Table 4. For example, in Table 4(b), tables acl1, fw1 and ipc1 need 15, 10 and 13 stages, respectively, for the memory bucket pipelines of their first search engines. As a result, each leaf node only needs a 10-bit memory bucket index rather than the eight 14-bit rule pointers used for the sequential search in the software environment. As stated earlier, the ability to choose a variable number of stages for the memory bucket pipeline comes from the flexibility of the proposed bucket compression scheme.
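One plausible way to choose numofstages is sketched below. It assumes, for illustration only, that every memory bucket stores one rule per stage, so the bucket count is roughly the number of compressed bucket rules divided by the stage count; the actual builder may apply further criteria that are not spelled out here, and the rule count used in the example is hypothetical.

```c
#include <stdio.h>

#define MAX_BUCKETS 1024   /* 10-bit bucket index, one 1 K-entry BRAM unit */

/* Smallest number of memory bucket stages (at least the leaf bucket size)
 * that brings the number of memory buckets down to MAX_BUCKETS or fewer. */
static unsigned pick_num_of_stages(unsigned compressed_rules,
                                   unsigned leaf_bucket_size)
{
    unsigned stages = leaf_bucket_size;
    while ((compressed_rules + stages - 1) / stages > MAX_BUCKETS)
        stages++;
    return stages;
}

int main(void)
{
    /* Hypothetical engine holding 10 000 compressed bucket rules. */
    printf("numofstages = %u\n", pick_num_of_stages(10000, 8));  /* 10 */
    return 0;
}
```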
Table 5 shows the rule compression and duplication ratios of the proposed bucket compression algorithm for the tables of 10 K rules. The second and third rows show the numbers of rules in the original leaf node buckets and in the memory buckets, respectively. The compression and duplication ratios shown in Rows 4 and 5 are computed as the number of rules in the memory buckets divided by the number of rules in the leaf buckets, and the number of rules in the memory buckets divided by the number of rules in the rule table, respectively. We can see that the proposed bucket rule compression algorithm is very efficient: for the FW1_10K table the number of stored bucket rules is reduced by a factor of 8.48 (a compression ratio of 0.12). The rule duplication ratio is reduced to 1.60–2.66.

Table 5. Rule compression and duplication ratios for tables of 10 K rules.

(columns: acl1 | fw1 | ipc1)
# of rules in leaf buckets: 35 024 | 139 042 | 149 674
# of rules in memory buckets: 15 380 | 16 384 | 24 073
Compression ratio: 0.44 | 0.12 | 0.16
Duplication ratio: 1.60 | 1.76 | 2.66
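The two ratios in Table 5 are simple quotients of the rule counts. The short check below reproduces the acl1 column (0.44 and 1.60) and is included only as a worked example.

```c
#include <stdio.h>

int main(void)
{
    /* acl1 column of Table 5. */
    double rules_in_leaf_buckets   = 35024.0;
    double rules_in_memory_buckets = 15380.0;
    double rules_in_table          = 9603.0;   /* acl1 table size */

    /* Compression ratio: memory-bucket rules over leaf-bucket rules. */
    printf("compression ratio = %.2f\n",
           rules_in_memory_buckets / rules_in_leaf_buckets);
    /* Duplication ratio: memory-bucket rules over rules in the table. */
    printf("duplication ratio = %.2f\n",
           rules_in_memory_buckets / rules_in_table);
    return 0;
}
```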
Table 6 shows the FPGA resource utilization, clock frequencies and throughputs of the proposed architecture on the Xilinx Virtex-5 XC5VFX200T and Virtex-6 XC6VLX760 FPGAs. The achieved throughput is much higher than the 40 Gbps that OC-768 links provide. As stated before, the smallest BRAM block that can be allocated is of size 1 K × 18 bits = 18 Kbits, used as 1 K 18-bit entries. The numbers of entries needed in the memory modules of some early stages of the pipelined architecture are much smaller than 1 K; in other words, the memory utilization of these early stages is very low. If we use distributed RAM instead of BRAM to implement these stages, we can further decrease the amount of BRAM needed. In our proposed architecture, by replacing the BRAM of these small memory modules with distributed RAM, the required memory is less than 50% of the total BRAM available.

Table 6. FPGA results for tables of 10 K rules (single-ported BRAM).

(columns: # of LUTs (%) | # of 18 Kb BRAM (%) | Frequency (MHz) | Throughput (Gbps))
Virtex-5, acl1: 3290 (10.7%) | 171 (18.8%) | 161.76 | 51.76
Virtex-5, fw1: 3613 (11.7%) | 264 (28.9%) | 161.20 | 51.58
Virtex-5, ipc1: 6041 (19.6%) | 314 (34.4%) | 161.63 | 51.72
Virtex-6, acl1: 3510 (3%) | 171 (11.9%) | 194.11 | 62.08
Virtex-6, fw1: 4197 (3.5%) | 264 (18.3%) | 194.11 | 62.08
Virtex-6, ipc1: 6444 (5.4%) | 314 (21.8%) | 194.11 | 62.08

We also evaluate the maximum number of rules that the proposed pipelined architecture can support on these FPGA devices. Using the Virtex-5 FPGA, 50 K rules for ACL, 20 K rules for IPC and 25 K rules for FW can be supported, and almost twice as many rules can be supported on the Virtex-6 FPGA. The speed and throughput achieved on Virtex-6 are similar to those on Virtex-5. We compare our design with existing state-of-the-art FPGA-based packet classification engines [23–25]. Table 7 shows the logic and BRAM usage, the search engine clock rates and the throughputs achieved for ACL1_10K. All the engines use the Xilinx Virtex-5 XC5VFX200T with −2 speed grade and dual-ported memories. We can see that all slice utilizations are similar. SPMT [23] has the highest throughput, but it consumes almost all the BRAM (94%) available on the Virtex-5 XC5VFX200T FPGA. In other words, SPMT cannot support tables of more than 10 K rules on Virtex-5 because the set-pruning multi-bit trie it uses is not a memory-efficient data structure for 5-field rules. In order to have a fair comparison, we introduce a new performance metric called performance efficiency, defined as the ratio of the throughput to the number of 18 Kbit BRAM blocks used. Our design outperforms the others by a factor of 2.3–3 in terms of performance efficiency.

Table 7. Performance comparison for ACL1_10K on the Virtex-5 XC5VFX200T FPGA with dual-ported BRAM.

(columns: LUTs used/available (utilization) | Block RAMs used/available (utilization) | Frequency (MHz) | Throughput (Gbps) | Efficiency (throughput/block RAMs))
Proposed scheme: 7044/122 880 (5.7%) | 173/456 (37.9%) | 161.76 | 103.53 | 0.584
SPMT [23]: 6584/122 880 (5.4%) | 429/456 (94.1%) | 173.02 | 110.73 | 0.252
2-D Linear Dual-Pipeline [24]: 10 307/122 880 (8.4%) | 407/456 (89.2%) | 125.36 | 80.23 | 0.192
BiConOLP [25]: 6611/122 880 (5.4%) | 208/456 (45.6%) | 143.4 | 45.88 | 0.215
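The throughput figures in Tables 6 and 7 follow directly from the clock frequency: the pipeline accepts one minimum-size (40-byte) packet per clock cycle per BRAM port, so the bit rate is f × 320 bits with single-ported BRAM and twice that with dual-ported BRAM. The helper below is a sketch under this assumption and reproduces the acl1 figures (51.76 and 103.53 Gbps at 161.76 MHz).

```c
#include <stdio.h>

/* Throughput in Gbps for minimum-size (40-byte) packets, assuming one
 * classification per clock cycle per BRAM port in the pipeline. */
static double throughput_gbps(double freq_mhz, int ports)
{
    return freq_mhz * 1e6 * ports * 40 * 8 / 1e9;
}

int main(void)
{
    printf("single-ported: %.2f Gbps\n", throughput_gbps(161.76, 1));
    printf("dual-ported:   %.2f Gbps\n", throughput_gbps(161.76, 2));
    return 0;
}
```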
Table 8 compares the performance of our architecture with other existing schemes on various hardware platforms. All the results for throughput, number of rules, number of LUTs and number of bytes per rule are collected from the original papers. We can see that the proposed scheme outperforms all the schemes except SPMT [23] and Modified HyperCuts [29]. As explained above, SPMT is not suitable for large rule tables due to its memory explosion problem. Modified HyperCuts [29] achieves its maximum throughput of 138 Gbps when its search structure needs only two memory accesses at worst to classify a packet, which holds for some rule sets. However, if the rule tables contain more wildcard field values that incur more than two memory accesses per search, the throughput degrades dramatically to 69 Gbps or less. The multi-pipeline architecture in [28] needs much less memory than the other schemes because it employs a prefix encoding scheme to reduce the memory needed in the decision trees. However, the multi-pipeline architecture needs more LUTs than the proposed scheme, SPMT [23] and HyperSplit [26].

Table 8. Performance comparisons of various FPGA implementations with dual-ported BRAM.

(columns: # of rules | Platform | 6-input LUTs | Bytes per rule | Throughput (Mpps) | Throughput (Gbps))
Proposed scheme: 9603 | Virtex-5 | 7044 | 38.9 | 323.5 | 103.5
Proposed scheme: 9603 | Virtex-6 | 7044 | 38.9 | 388.2 | 124.2
SPMT [23]: 9603 | Virtex-5 | 6584 | 96.5 | 346.0 | 110.7
D2BS [27] (a): 9603 | Virtex-5 | – | – | 264 | 84.5
2D Linear Dual-Pipeline [24]: 9603 | Virtex-5 | 41 228 | 91.6 | 250.7 | 80.2
HyperSplit on FPGA [26]: 9603 | Virtex-6 | 2988 | 46.4 | 230.8 | 73.9
Multi-pipeline [28]: ~9500 | Virtex-6 | 14 400 | 18 | 340 | 108.8
BiConOLP [25]: 9603 | Virtex-5 | 26 444 | 88.7 | 143.4 | 45.9
Modified HyperCuts [29] (b): 10 000 | Stratix III | 16 028 | 28.5 | 216/433 | 69/138

(a) The memory and logic usage were not available for D2BS.
(b) The number of logic elements (LEs) in Stratix III is converted to the number of 6-input LUTs in Virtex-5/6 FPGAs by assuming that one LE is equivalent to 0.8 LUT.
6. CONCLUSIONS

In this paper, we proposed a high-throughput and low-cost parallel and pipelined architecture based on the recursive endpoint-cutting scheme. The bucket memory requirement becomes a serious problem when a pipelined architecture is considered, so we also proposed a bucket compression scheme to reduce rule duplication in the memory bucket pipeline. Based on Xilinx Virtex-5/6 FPGA devices, our experimental results showed that the proposed scheme needs much less BRAM than other FPGA-based approaches. The proposed scheme can accommodate more than 20 K rules with 1.6 Mb of on-chip memory, and a throughput beyond 100 Gbps for minimum-size (40-byte) packets can be achieved if dual-ported BRAM is used.

REFERENCES

1. Chao, H.J. (2002) Next generation routers. Proc. IEEE, 90, 1518–1558.
2. Gupta, P. and McKeown, N. (2001) Algorithms for packet classification. IEEE Netw., 6, 24–132.
3. Taylor, D.E. (2005) Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37, 238–275.
4. Srinivasan, V., Varghese, G., Suri, S. and Waldvogel, M. (1998) Fast and scalable layer four switching. Proc. ACM SIGCOMM, pp. 191–202.
5. Baboescu, F., Singh, S. and Varghese, G. (2003) Packet classification for core routers: is there an alternative to CAMs? Proc. IEEE INFOCOM, 2003.
6. Cohen, E. and Lund, C. (2005) Packet classification in large ISPs: design and evaluation of decision tree classifiers. Proc. ACM SIGMETRICS, pp. 73–84.
7. Gupta, P. and McKeown, N. (1999) Packet classification using hierarchical intelligent cuttings. Proc. Hot Interconnects VII, 1999.
8. Qi, Y., Xu, L., Yang, B., Xue, Y. and Li, J. (2009) Packet classification algorithms: from theory to practice. Proc. IEEE INFOCOM, pp. 648–656.
9. Singh, S., Baboescu, F., Varghese, G. and Wang, J. (2003) Packet classification using multidimensional cutting. Proc. ACM SIGCOMM, pp. 213–224.
10. Vamanan, B., Voskuilen, G. and Vijaykumar, T.N. (2010) EffiCuts: optimizing packet classification for memory and throughput. Proc. ACM SIGCOMM, pp. 207–218.
11. Lee, J., Byun, H., Mun, J.H. and Lim, H. (2017) Utilizing 2D leaf-pushing for packet classification. Comput. Commun., 103, 116–129.
12. Lim, H., Lee, N., Jin, G., Choi, Y. and Yim, C. (2014) Boundary cutting for packet classification. IEEE/ACM Trans. Netw., 22, 443–456.
13. Chang, Y.-K. and Lin, Y.-C. (2007) Dynamic segment trees for ranges and prefixes. IEEE Trans. Comput., 56, 769–784.
14. Jiang, W., Prasanna, V.K. and Yamagaki, N. (2010) Decision forest: a scalable architecture for flexible flow matching on FPGA. Proc. FPL, pp. 394–399.
15. Song, H. and Lockwood, J.W. (2005) Efficient packet classification for network intrusion detection using FPGA. Proc. 13th ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2005.
16. Meiners, C.R., Liu, A.X. and Torng, E. (2011) Topological transformation approaches to TCAM-based packet classification. IEEE/ACM Trans. Netw., 19, 237–250.
17. Pao, D., Li, Y.-K. and Zhou, P. (2006) Efficient packet classification using TCAMs. Comput. Netw., 50, 3523–3535.
18. Chang, Y.-K., Su, C.-C., Lin, Y.-C. and Hsieh, S.-Y. (2013) Efficient gray code based range encoding schemes for packet classification in TCAM. IEEE/ACM Trans. Netw., 21, 1201–1214.
19. Dharmapurikar, S., Song, H., Turner, J. and Lockwood, J. (2006) Fast packet classification using Bloom filters. Proc. ACM/IEEE Symp. Architectures for Networking and Communications Systems, 2006.
20. Papaefstathiou, I. and Papaefstathiou, V. (2007) Memory-efficient 5D packet classification at 40 Gbps. Proc. IEEE INFOCOM, 2007.
21. Nikitakis, A. and Papaefstathiou, I. (2008) A memory-efficient FPGA-based classification engine. Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2008.
22. Kennedy, A., Wang, X., Liu, Z. and Liu, B. (2008) Low power architecture for high speed packet classification. Proc. ACM/IEEE Symp. Architectures for Networking and Communications Systems, 2008.
23. Chang, Y.-K., Lin, Y.-S. and Su, C.-C. (2010) A high-speed and memory efficient pipeline architecture for packet classification. Proc. IEEE Int. Symp. Field-Programmable Custom Computing Machines (FCCM), pp. 215–218.
24. Jiang, W. and Prasanna, V.K. (2012) Scalable packet classification on FPGA. IEEE Trans. Very Large Scale Integration (VLSI) Syst., 20, 1668–1680.
25. Wagner, J.M., Jiang, W. and Prasanna, V.K. (2009) A scalable pipeline architecture for line rate packet classification on FPGAs. Proc. 21st IASTED Int. Conf. Parallel and Distributed Computing and Systems (PDCS), 2009.
26. Qi, Y., Fong, J., Jiang, W., Xu, B., Li, J. and Prasanna, V.K. (2010) Multi-dimensional packet classification on FPGA: 100 Gbps and beyond. Proc. Int. Conf. Field-Programmable Technology (FPT).
27. Yang, B., Fong, J., Jiang, W., Xue, Y. and Li, J. (2012) Practical multi-tuple packet classification using dynamic discrete bit selection. IEEE Trans. Comput., 63, 424–434.
28. Pao, D. and Lu, Z. (2014) A multi-pipeline architecture for high-speed packet classification. Comput. Commun., 54, 84–96.
29. Kennedy, A. and Wang, X. (2014) Ultra-high throughput low-power packet classification on FPGA. IEEE Trans. Very Large Scale Integration (VLSI) Syst., 22, 286–299.
30. Chang, Y.-K. and Chen, H.-C. (2011) Layered cutting scheme for packet classification. Proc. IEEE Int. Conf. Advanced Information Networking and Applications (AINA), pp. 675–681.
31. Taylor, D.E. and Turner, J.S. (2005) ClassBench: a packet classification benchmark. Proc. IEEE INFOCOM, pp. 2068–2079.
32. Xilinx, Virtex-5 Family Overview, product specification, DS100 (v5.0). http://www.xilinx.com (accessed February 21, 2009).
33. Xilinx, Virtex-6 Family Overview, product specification, DS150 (v2.5). http://www.xilinx.com (accessed August 20, 2015).

Author notes: Handling editor: Gerard Parr

© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
Abstract Packet classification is one of the important functions in today’s high-speed Internet routers. Many existing FPGA-based approaches can achieve a high throughput but cannot accommodate the memory required for large rule tables because on-chip memory in FPGA devices is limited. In this paper, we propose a high-throughput and low-cost pipelined architecture using a new recursive endpoint-cutting (REC) decision tree. In the software environment, REC needs only 5–66% of the memory needed in Efficuts for various rule tables. Since the rule buckets associated with leaf nodes in decision trees consume a large portion of total memory, a bucket compression scheme is also proposed to reduce rule duplication. Based on experimental results on Xilinx Virtex-5/6 FPGA, the block RAM required by REC is much less than the existing FPGA-based approaches. The proposed parallel and pipelined architecture can accommodate various tables of 20 K or more rules, in the FPGA devices containing 1.6 Mb block RAM. By using dual-ported memory, throughput of beyond 100 Gbps for 40-byte packets can be achieved. The proposed architecture outperforms most FPGA-based search engines for large and complex rule tables. 1. INTRODUCTION Nowadays, routers play a major role for communication on the Internet. When a packet arrives at a router, its destination address extracted from the packet header will be used to determine where to send. Packet pass through different Internet routers until they reach their destination. Routers not only forward packets through the Internet, but also provide various Internet services such as firewall, Quality of Service (QoS), traffic control and Virtual Private Networks (VPNs). To support these services, packet classification is required. Due to rapid growth of Internet traffics, routers need to adapt to the high-speed links such as OC-768 (40 Gbps) connections. In other words, routers need to finish processing an incoming packet every 8 ns if the packet is of 40 bytes, the minimal packet size. In general, the function of packet classification is often considered as a bottleneck of the routers. However, not only the Internet traffics but also the size of rule tables grows rapidly. It becomes even a bigger challenge to design a high-performance and low hardware cost router. There are two aspects of Internet router design: software and hardware. Software-based router implements the services on software platforms by using the general purpose CPU’s or network processors. Software-based routers have the advantages of easy implementation and modification. Furthermore, their cost is much lower than hardware-based routers. But the performance of software-based routers is generally inferior due to the overheads for performing necessary operations and the bottleneck for accessing memory. On the contrary, hardware-based routers are costly but their higher throughput could afford the rapid growth of Internet traffics. In general, the hardware-based routers could be implemented in Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC) by using the on-chip Static Random Access Memory (SRAM) to shorten the memory access delays. To differentiate whether a packet classification algorithm is good stands on the following factors: Memory: Routers support effective hardware architecture and high-speed SRAM memory to sustain the high performance required by packet classification. 
To implement the proposed architecture on FPGA, the memory is a key factor because the block RAM (BRAM) on FPGA is limited. Due to the rapid growth of rule table size, how to reduce the memory usage becomes an important issue. Throughput: Packet classification is the major bottleneck that influences the router’s throughput. The routers must achieve the throughput of 40 Gbps (OC-768) based on a parallel and pipelined architecture. Scalability: A well-defined packet classification scheme should support large rule tables. Packet classification has been studied widely in the past. There are many packet classification algorithms in the literatures. The detailed survey and taxonomy can be found in [1–3]. Many of the existing hardware-based packet classification designs that can achieve the throughput of up to 40 Gbps. However, most of these designs are only applicable to smaller or simple rule tables such as access control list (ACL) tables of 10 K rules. The reason is that if the rule tables become more complicated, the memory storage of their data structures grows sharply due to rule duplication. Hence, these schemes cannot fit their entire data structures into the limited on-chip memory provided by the hardware-based routers. In this paper, we solve this drawback by proposing a pipeline and parallel hardware architecture to be implemented on FPGA. The proposed scheme effectively decreases the hardware cost and also achieves a very high throughput better than the most of the existing hardware designs. The rest of the paper is organized as follows. In Section 2, we introduce the background and related work. Section 3 illustrates the proposed data structure and algorithm. Section 4 describes the hardware implementation of our proposed scheme. In Section 5, we present our experimental results. The last section is the conclusion. 2. BACKGROUND AND RELATED WORK 2.1. Packet classification problem statement The packet classification process (packet classifier) classifies incoming packets into different flows based on a table pre-defined rule. In general, for a d-field (or called d-dimensional) packet classification, each rule is a d-tuple data structure denoted by R = [F1,…, Fd], where Fi = [li, ri] is a range of values from li to ri for i = 1 to d. The search for an incoming packet p is done by presenting the header fields [f1,…, fd] of p as the keys to match the rules in the classifier, where each fi is a singleton value. The rule R is said to match packet p, if for all dimensions i, the field value fi of packet p lies in the range Fi. The packet classifier determines the least cost rule that matches the packet’s headers. The layer-four switching of the Internet protocol studied in this paper consists of five dimensions (also called fields) that include 32-bit source/destination IP addresses (denoted by SA/DA), 16-bit source/destination port numbers (denoted by SP/DP) and 8-bit network layer protocol. SA/DA fields are prefixes usually represented by an IP address and a mask or a prefix length. SP/DP fields are ranges represented by a pair of numbers called the endpoints. For the source and destination ports, the two endpoints could be arbitrary numbers. The protocol field is either a singleton number or a don't-care value (denoted by*). Each rule is also associated with a priority. The classifier matches the pre-defined rules against header values of the incoming packet to find the best match with the highest priority (single match) or find all the matched rules (multiple matches). 
Table 1 shows an example 5D real-life classifier in which by convention, the first rule R1 has the highest priority and the last rule R4 has the lowest priority. Table 1 also illustrates the classification results for three incoming packets. Table 1. A real-life 5D classifier. Rule Network layer Trans. layer Action SA DA Protocol SP DP R1 140.116.82.25/32 140.116.236.10/32 * * * Deny R2 140.116.82.0/24 140.116.20.77/32 tcp * 80 Deny R3 140.116.82.0/32 140.116.20.55/32 udp * >1023 Permit R4 0.0.0.0/0 0.0.0.0/0 * * * Permit Classification examples Pkt Network layer Tran. layer Action SA DA Protocol SP DP P1 140.116. 82.25 140.116.236.10 tcp 1222 80 R1,Deny P2 140.116.82.1 140.116.20.77 tcp 1333 80 R2,Deny P3 140.116.82.0 140.116.20.55 udp 1024 1055 R3,Permit Rule Network layer Trans. layer Action SA DA Protocol SP DP R1 140.116.82.25/32 140.116.236.10/32 * * * Deny R2 140.116.82.0/24 140.116.20.77/32 tcp * 80 Deny R3 140.116.82.0/32 140.116.20.55/32 udp * >1023 Permit R4 0.0.0.0/0 0.0.0.0/0 * * * Permit Classification examples Pkt Network layer Tran. layer Action SA DA Protocol SP DP P1 140.116. 82.25 140.116.236.10 tcp 1222 80 R1,Deny P2 140.116.82.1 140.116.20.77 tcp 1333 80 R2,Deny P3 140.116.82.0 140.116.20.55 udp 1024 1055 R3,Permit Table 1. A real-life 5D classifier. Rule Network layer Trans. layer Action SA DA Protocol SP DP R1 140.116.82.25/32 140.116.236.10/32 * * * Deny R2 140.116.82.0/24 140.116.20.77/32 tcp * 80 Deny R3 140.116.82.0/32 140.116.20.55/32 udp * >1023 Permit R4 0.0.0.0/0 0.0.0.0/0 * * * Permit Classification examples Pkt Network layer Tran. layer Action SA DA Protocol SP DP P1 140.116. 82.25 140.116.236.10 tcp 1222 80 R1,Deny P2 140.116.82.1 140.116.20.77 tcp 1333 80 R2,Deny P3 140.116.82.0 140.116.20.55 udp 1024 1055 R3,Permit Rule Network layer Trans. layer Action SA DA Protocol SP DP R1 140.116.82.25/32 140.116.236.10/32 * * * Deny R2 140.116.82.0/24 140.116.20.77/32 tcp * 80 Deny R3 140.116.82.0/32 140.116.20.55/32 udp * >1023 Permit R4 0.0.0.0/0 0.0.0.0/0 * * * Permit Classification examples Pkt Network layer Tran. layer Action SA DA Protocol SP DP P1 140.116. 82.25 140.116.236.10 tcp 1222 80 R1,Deny P2 140.116.82.1 140.116.20.77 tcp 1333 80 R2,Deny P3 140.116.82.0 140.116.20.55 udp 1024 1055 R3,Permit 2.2. Related work To match the header values of the packets against the rules, the simplest algorithm is the linear search. For a large number of rules, this approach implies a long query time, but it is very efficient in terms of memory and rule updates. To improve the search performance, it is also straightforward to build a hierarchical trie from the multiple header fields of the rules. A hierarchical trie is a simple extension of binary trie except that these tries can accommodate more fields. For the set of k-field rules, the k-field hierarchical trie Tk is built recursively as follows. A binary trie called F1 trie is first constructed based on distinct F1 field values of all rules. Let NF1 be the node corresponding to a prefix PF1 in the F1 trie. Each node NF1 in the F1 trie is associated with a subset of rules (denoted by SET(NF1)) and all rules in SET(NF1) have the same F1 field value PF1. Since all the rules in SET(NF1) have the same F1 field value, they are stored as a (k−1)-field classifier. We then construct a (k−1)-field hierarchical trie Tk−1(NF1) for SET(NF1) recursively and set a pointer at node NF1 pointing to the hierarchical trie Tk−1(NF1). When k = 1, T1 is actually a binary trie. 
As a result, Tk is a k-field hierarchical binary trie, one binary trie per field. Since the 5-field rules we consider in this paper contain only two prefix fields, a 2-field hierarchical trie is constructed. Any rules that have the same first two field values are stored in the corresponding node of the second field binary trie. Each rule is only stored in exactly one node of a F2 trie. In other words, no rule is duplicated. When the search traverses a node in the F1 trie, the F2 trie pointed to by the node must also be traversed. The rules that are associated with the prefix node in the traversed F2 trie have to be checked one by one to find the match against the header values of the packets. Since many F2 tries must be traversed, the search delay is very long. In order to reduce the search delay of the hierarchical trie, set pruning trie [4] was proposed by pushing all the rules associated with the internal nodes to the leaf nodes of the F1 trie. As a result, in the set pruning trie, only the F2 trie associated with a leaf node in the F1 trie needs to be traversed. However, the rules in the set pruning trie may be duplicated so many times due to the leaf pushing operations that result in a serious memory explosion problem. To eliminate the memory explosion problem of the set pruning trie [4], Grid of Trie that uses switch pointers to avoid backtracks and the rule duplications is proposed. Since GoT cannot be easily extended to more than two fields, they also proposed a better generalized scheme called Cross-Producting [4]. Unfortunately, the size of the table in Cross-Producting scheme grows astronomically with the number of rules. Baboescu et al. [5] proposed an extended version of GoT called extended grid of trie (EGT). They also proposed an improved version of EGT called EGT-PC (Path Compression) which is a standard compression scheme for tries that removes single branching paths. Many hierarchical decision trees [6–11] have also been developed using various divide-and-conquer techniques. There are two goals to build the decision tree. The first goal is to, starting from the root node, partition the box covered a node called parent into many sub-boxes (equal-sized or not) such that some memory usage threshold is satisfied. These sub-boxes are the children of the parent node in the decision tree. The address space covered by the parent node is the sum of the subspaces covered by its child nodes. The rules that are completely contained in the sub-box of a child node are stored in the bucket of that child node. There are two ways to store the rules that are partially overlapped with the sub-box of any child node. If the rule replication is not permitted, the partially overlapped rules are stored in the parent node. Otherwise, they will be duplicated in the buckets of the child nodes that are partially overlapped. The decision tree is built by partitioning the nodes recursively until the bucket size associated with the node is not greater than a pre-defined threshold. The second goal of building a decision tree is that the height of decision tree must be minimal. However, it is a challenge to find out which dimensions at each node are selected first for partitioning and how many sub-spaces are to be obtained on the selected dimensions in order to fulfill these two goals. Usually, the larger the bucket size the shorter the decision tree and the longer it takes to search the rules in the bucket sequentially. 
The decision tree is traversed according the packet’s header values until a leaf is reached. Then, all the rules of the leaf’s bucket are matched against the packet’s header values sequentially to yield the desired matching result. Hierarchical Intelligent Cuttings (HiCuts) [7] is one of such decision trees. Assume a node v covers a k-dimensional box ([l1:r1],…, [lk:rk]) and there are NumRules(v) rules in its bucket. HiCuts selects only one dimension say i ∈ {1…k} and decides how many sub-boxes (denoted by M) are needed in the space decomposition process, where M is a power of two. When performing M cuts along dimension i, HiCuts evenly partitions the interval [li:ri] and generate M equal-sized subboxes ([l1:r1],…, [li:li + t × w − 1],…, [lk:rk]) for t = 1 to M, where w = (ri − li + 1)/M and M = 2m. Since HiCuts uses the equal-sized partition, the interval [li:ri] can be represented as a prefix. When [li:ri] is partitioned into 2m equal-sized subintervals, we actually select the m most significant don’t-care bits to perform the partition. Thus, we say that HiCuts uses a bit cutting scheme (i.e. selects an appropriate number of cut bits) to perform the space decomposition. These 2m sub-boxes are connected to node v as its 2m child nodes in the decision tree. In HiCuts, rules in a node may be duplicated in its child nodes. Taking more cuts may decrease the height of the decision tree at the expense of increasing the memory usage. If the cut dimension is i, HiCuts tries to balance this tradeoff by choosing the largest number of cuts denoted by mi so that the following constraint is met based on a pre-defined memory usage threshold called space factor (sf). sf×NumRules(v)≥∑j=1miNumRules(childj)+mi How to select a dimension to cut has a major impact on the height of the decision tree and its memory usage. Four heuristics were proposed to select a dimension to cut a node. Which heuristic performs better depends on the characteristics of the rule table. Figure 1(a) illustrates the decision tree built for HiCuts from a small sample rule table. Since the rule table is small, only one cut bit is used in each stage of the space decomposition. This decision tree consists of 12 internal nodes and 13 leaf nodes and the average tree height is 3.77. Figure 1. View largeDownload slide (a) HiCuts with 12 internal nodes and 13 leaves. (b) HyperCuts with 10 internal nodes and 13 leaves. (c) HyperSplit with 10 internal nodes and 11 leaves. (d) The proposed decision tree with 8 internal nodes and 11 leaves. Figure 1. View largeDownload slide (a) HiCuts with 12 internal nodes and 13 leaves. (b) HyperCuts with 10 internal nodes and 13 leaves. (c) HyperSplit with 10 internal nodes and 11 leaves. (d) The proposed decision tree with 8 internal nodes and 11 leaves. HiCuts cutting nodes only along one dimension results in a higher decision tree, i.e. longer search time. HyperCuts [9] solves this problem by selecting multiple dimensions to perform the space decomposition. HyperCuts picks up the set of dimensions with a larger number of distinct field values than the mean number of distinct field values of all dimensions. HyperCuts also modifies HiCuts’s heuristics to compute the number of cuts needed for each of the selected cut dimensions. However, it is infeasible to compute all possible ways to compute the number of cuts needed for each cut dimension because the preprocessing time for doing this will be unimaginably long. 
Hence, HyperCuts uses a greedy approach to compute the local optimal number of cuts for each selected dimension. Figure 1(b) illustrates the decision tree built for HyperCuts from the same rule table. This decision tree consists of 10 internal nodes and 13 leaf nodes and the average tree height is 2.77. Both HiCuts and HyperCuts partition the box of a node into a power of two equal-sized sub-boxes. Because the rules are distributed over the entire address space unevenly, HiCuts and HyperCuts may generate many light-weight nodes that hold only a small number of rules. To solve this problem, HyperSplit [8] uses the following heuristics. First, as in HiCuts, HyperSplit only selects one dimension to cut the address space covered by a node. Second, HyperSplit does not cut the box of a node into smaller equal-sized subboxes as in HiCuts and HyperCuts but uses a more flexible scheme called endpoint-cutting scheme. Let [li:ri] be the corresponding interval in dimension i covered by a node v and array Pti[j] for j = 0 to M − 1 be the M field-i endpoints between endpoints li and ri computed from the field-i values of the rules overlapped with node v. Notice that Pti[0] = li and Pti[M−1] = ri. HyperSplit does not use the middle point between li and ri but uses one of the endpoints in array Pti[0…M − 1] as the cut-point computed by the weighted segment-balanced strategy due to its superior performance. Since a binary decision tree is targeted, only one cut-point is selected to cut the box of a node into two sub-boxes, mostly unequal-sized. Recently, a deterministic cutting algorithm called boundary cutting (BC) has been proposed in [12]. Similar to HyperSplit, BC uses rule boundaries to perform the space decomposition. Additionally, a refined BC called selective BC is proposed to reduce the endpoints in the internal nodes. In the weighted segment-balanced strategy, the array Pti[0…M − 1] and the M − 1 elementary intervals can be constructed [13] for a node v with interval [li:ri] in dimension i. Then, HyperSplit calculates the number of rules that cover the the jth elementary interval and store it in Sri[j] for 1 ≤ j < M. HyperSplit chooses the smallest endpoint m such that ∑j=1mSri[j]>12∑j=1M−1Sri[j]. This strategy tries to equalize the accumulated number of covering rules of all the intervals at the left side and at the right side of the endpoint m. HyperSplit selects the dimension i to cut if the value of 1M−1∑j=1M−1Sri[j] is the minimum for all dimensions. Figure 1(c) illustrates the decision tree built for HyperSplit from the rule table. This decision tree consists of 10 internal nodes and 11 leaf nodes and the average tree height is 3.55. In addition, multiple decision tree algorithms were proposed to reduce the rule replication problem such as EffiCuts [10] and Decision Forest [14]. Decision Forest employs HyperCuts algorithm to construct the decision tree in space decomposition procedure. When constructing a decision tree for a group of rules, Decision Forest moves the rules with heavy replication potential to a new subgroup. This new subgroup is used to construct another decision tree where some rules that may incur heavy rule duplications are again moved to another new subgroup. This construction process along with new subgroup generation process is repeated until the predefined number of decision trees is reached. Efficuts uses a notion of largeness in a dimension to define a large rule that covers a large part (e.g. 95%) of address space in that dimension. 
In addition, multiple-decision-tree algorithms such as EffiCuts [10] and Decision Forest [14] were proposed to reduce the rule replication problem. Decision Forest employs the HyperCuts algorithm to construct the decision tree in its space decomposition procedure. When constructing a decision tree for a group of rules, Decision Forest moves the rules with heavy replication potential to a new subgroup. This new subgroup is used to construct another decision tree, from which some rules that may incur heavy rule duplication are again moved to another new subgroup. This construction process, along with the new subgroup generation process, is repeated until the predefined number of decision trees is reached. EffiCuts uses a notion of largeness in a dimension to define a large rule that covers a large part (e.g. 95%) of the address space in that dimension. Based on the largeness of each of the five dimensions in a set of 5D rules, EffiCuts divides rules into 32 subgroups. It uses selective tree merging to reduce the number of subgroups by merging one subgroup into another if the former contains fewer rules. EffiCuts uses the idea of equi-dense cuts to tackle the variation in rule-space density and to eliminate unnecessary pointers pointing to NULL or to the same child nodes. EffiCuts also co-locates parts of the information in a node and its children to achieve fewer memory accesses per node than HiCuts and HyperCuts. The hardware approaches for packet classification can be implemented in many ways. Ternary Content Addressable Memory (TCAM) is a very simple device that can complete a search operation in one cycle. Song et al. [15] proposed the BV-TCAM architecture that combines TCAM and the Bit Vector algorithm to effectively compress the data representations and boost throughput. However, the major drawbacks of TCAM are high power consumption and a high cost-to-density ratio. In addition, prefix-to-range expansion exacerbates the problem of TCAMs by significantly decreasing their already limited capacity, as each rule typically has to be converted into multiple rules. Some results proposed to solve the range problem can be found in [16–18]. NLTMC [19] modified the cross-producting scheme and divided the rules into multiple subsets to avoid memory overhead. B2PC [20] and 2sBFCE [21] are implemented in ASIC and FPGA, respectively. The number of clock cycles needed for a search varies over a wide range, and thus the resulting performance is inferior. As a result, these Bloom-filter-based architectures cannot achieve the 40 Gbps throughput required by the high-speed OC-768 link. Many hardware search engines [22] for packet classification use pipelined or parallel architectures to improve the throughput. A set-pruning multi-bit trie data structure was proposed in [23]. To reduce rule duplication, the rules are partitioned into many groups by their prefix lengths or wildcard field values. A search engine is constructed for each group based on the set-pruning multi-bit trie. With this parallel and pipelined architecture, a throughput of 100 Gbps can be achieved with dual-ported memory. Jiang et al. [24] used HyperCuts [9] as the data structure of the decision tree. By using a 2D linear dual-pipeline architecture, its FPGA implementation can achieve a throughput of 80 Gbps, also with dual-ported memory. Wagner et al. [25] proposed a scalable pipeline architecture named BiConOLP. They also used HyperCuts as the decision tree. By using dual-ported memory, BiConOLP can achieve a throughput approaching 40 Gbps. Qi et al. also implemented HyperSplit [8] in FPGA [26] and proposed a node merging algorithm to reduce the height of the decision tree. Based on the dual-ported memory of a Xilinx Virtex-6 XC6VSX475T FPGA, a throughput of 74 Gbps for the minimum packet size of 40 bytes can be achieved. Yang et al. [27] targeted better search speed by proposing a decision-tree-based 2D multi-pipeline architecture called Dynamic Discrete Bit Selection (D2BS) that uses the same cut bit for all the nodes in the same tree level. The multi-pipeline architecture proposed in [28] is a hybrid approach that combines the schemes of field decomposition, hierarchical tries and decision trees. Multiple pipelines correspond to 5 rule subsets obtained by a predefined rule partition scheme based on wildcard field values.
The implementation results on a Virtex-6 FPGA show that a throughput of 340 Mpps can be achieved. Also, in [29], an FPGA hardware accelerator that uses a modified version of the HyperCuts algorithm is proposed. The maximum throughput achieved by this hardware accelerator is 433 Mpps. However, for some rule sets with more wildcard field values, the achieved throughput may degrade to half of the maximum throughput. As described above, both EffiCuts and Decision Forest use multiple decision trees. For software approaches, a longer search time is needed to obtain the best-matched rule since all decision trees must be searched for the highest-priority matching rule. However, in a hardware environment this is no longer a problem because a parallel search architecture can be designed easily. Therefore, the architecture proposed in this paper also employs multiple decision trees along with a parallel architecture designed on FPGA devices.
3. PROPOSED SCHEME
In order to design a high-performance and memory-efficient FPGA-based search engine, some design issues for selecting appropriate data structures are considered as follows:
Parallel and Pipelined architecture: To increase the packet classification throughput, parallel and pipelined architectures are considered to be the better choice. Pipelined architectures allow many incoming packets to be processed in the pipeline stages concurrently, and a classification result can be output in every clock cycle. In this paper, we propose a recursive scheme that divides the rule table into many sub-tables to reduce the degree of rule duplication (i.e. the memory consumption). We then use a parallel architecture consisting of multiple pipelines, each of which processes a sub-table, to obtain the final search result.
Memory consumption: Our goal is to put the entire data structure of the rule table into the on-chip BRAM of the FPGA so that the search speed is not dragged down by off-chip memory. For example, the Xilinx Virtex-5 and Virtex-6 devices used in this paper contain 912 and 1440 BRAM blocks of size 18 Kb, which amount to approximately 16 Mb and 26 Mb, respectively. Most existing algorithms need more than 16 Mb when dealing with large and complex tables of 10 000 or more rules. Hence we need a memory-efficient data structure so that large rule tables fit in the BRAM of FPGA devices. In this paper, the endpoint cutting scheme and a rule bucket compression scheme are proposed to achieve this goal. In addition, the proposed recursive table cutting scheme gives us the flexibility to decide the degree of rule duplication to be allowed in the construction of the decision trees.
A trie- or decision-tree-based data structure can be easily implemented in a pipelined architecture. Binary tries have the advantage of simple operations and a short processing time in each stage over the decision trees used in HiCuts and HyperCuts. However, binary tries are only suitable for data in prefix format and usually consume more memory than the decision trees. Packet classification algorithms based on decision trees focus on two aspects of the search space decomposition process. The first is how to select the cut dimensions and the second is how to decide the cut-points for dividing the address space covered by a node in the decision tree into many subspaces. As stated above, we can select a single dimension or multiple dimensions to perform the space decomposition at a node.
When a single dimension is chosen, the resulting decision tree is usually taller than when multiple dimensions are chosen at a time, but the node size of the former is smaller than that of the latter. Also, there are two methods to decompose the address space of a node after one or more dimensions are selected: one uses a bit cutting scheme by treating the field values as prefixes and the other uses an endpoint cutting scheme by treating the field values as ranges. Since prefix values can be represented as ranges, it is more flexible to treat all the field values as ranges. In general, we can select many cut-points for the chosen dimensions. But in the hardware architecture, the node size varies when different numbers of cut-points are used and, as a result, the memory design becomes more complicated. Therefore, the proposed endpoint cutting scheme selects one or more cut dimensions, and only one cut-point per selected cut dimension is used for the space decomposition. The work presented in this paper is an extension of [30].
3.1. Build the basic decision tree
We first describe the data structure of the basic decision tree based on the proposed endpoint cutting scheme, called NewHypersplits. Then, we describe the recursive endpoint cutting (REC) scheme that recursively uses NewHypersplits to remove the duplicated rules from the decision tree currently being built. The removed duplicated rules are then collected as a second rule table, called the recursive table, to build the second decision tree. It is possible that duplicated rules still exist in the second decision tree; some of them are also removed and used to build a third decision tree. This decision tree building process is performed recursively until no duplicated rule exists in the last decision tree. The version of REC that generates no duplicated rules is called the basic REC scheme. We have tried many heuristics for selecting cut dimensions and cut-points designed for the existing decision trees such as HiCuts, HyperCuts and HyperSplit in order to obtain a decision tree that requires a minimal amount of memory. The heuristics best suited to our REC scheme are described as follows. The proposed NewHypersplits scheme selects the cut dimensions based on a heuristic similar to the one proposed in HyperCuts [9]. We select the 'larger' dimensions as the cut dimensions. By 'larger', we mean that the number of distinct field values in the selected dimension is greater than or equal to the mean number of distinct field values over all dimensions under consideration. For example, if the numbers of distinct field values for the dimensions of a 5-field rule table are 40, 22, 33, 18 and 12, with a mean of 25, then the first and third dimensions are selected as the cut dimensions. Notice that selecting larger dimensions is simpler in terms of time complexity than selecting the dimension with the minimum value of $\frac{1}{M-1}\sum_{j=1}^{M-1} Sr_i[j]$ as in HyperSplit, described in Section 2.
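As a simple illustration (not the authors' code), the 'larger dimension' selection can be sketched in C as follows, using the distinct-value counts from the example above:

#include <stdio.h>

#define NUM_FIELDS 5

/* A dimension is selected if its number of distinct field values is greater
 * than or equal to the mean over all dimensions.  Returns a bitmap of the
 * selected dimensions (bit i set means dimension i is a cut dimension).
 * Note: the hardware architecture described in Section 4 restricts the cut
 * dimensions to at most three; that restriction is not enforced here. */
static unsigned select_cut_dimensions(const int distinct[NUM_FIELDS])
{
    double mean = 0.0;
    unsigned bitmap = 0;
    for (int i = 0; i < NUM_FIELDS; i++)
        mean += distinct[i];
    mean /= NUM_FIELDS;
    for (int i = 0; i < NUM_FIELDS; i++)
        if (distinct[i] >= mean)
            bitmap |= 1u << i;
    return bitmap;
}

int main(void)
{
    /* The example from the text: counts 40, 22, 33, 18, 12 with mean 25. */
    int distinct[NUM_FIELDS] = {40, 22, 33, 18, 12};
    printf("selected dimension bitmap: 0x%x\n", select_cut_dimensions(distinct));
    return 0; /* prints 0x5: the first and third dimensions (bits 0 and 2) */
}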
After the cut dimensions are decided, we use the weighted segment-balanced strategy proposed by HyperSplit to perform the space decomposition. If the number of cut dimensions for a node in the decision tree is d, the address space associated with the node is decomposed into $2^d$ subspaces because only one cut-point is used for each selected cut dimension. Figure 1(d) shows the decision tree with buckets of size 1 built by the heuristics described above. There are six distinct field values in field X and four distinct field values in field Y, and thus the average number of distinct field values over all fields is 5. Therefore, field X is selected as the cut dimension at the root node of the decision tree. Then, the cut-point for field X is calculated as follows. First, the endpoint array is created for field X. The endpoints are calculated by the minus-1 endpoint scheme proposed in [13]. As a result, the sorted endpoints of field X in increasing order are {0, 1, 2} over the address space 0 to 3, and the elementary intervals are {[0,0], [1,1], [2,2], [3,3]}. The rule sets covering the elementary intervals [0,0], [1,1], [2,2] and [3,3] are {R1, R2, R3}, {R1, R3, R4}, {R1, R5} and {R1, R6}, respectively. Therefore, the cut-point at the root can be set at endpoint 1, i.e. between the elementary intervals [1,1] and [2,2]. The final decision tree consists of 8 internal nodes and 11 leaf nodes and the average tree height is 2.91. As the results in Fig. 1 show, the endpoint cutting scheme used by HyperSplit and by our NewHypersplits scheme is more memory efficient than the bit cutting scheme used by HiCuts and HyperCuts. This conclusion will be further verified with the larger rule tables used in the performance evaluation section. Similar to the advantage of HyperCuts over HiCuts, the NewHypersplits scheme performs better than HyperSplit because multiple cut dimensions can be used. Although our decision tree also generates many duplicated rules, the recursive endpoint cutting (REC) scheme proposed below can make the tree even shorter by removing the duplicated rules. As a result, the memory storage needed by NewHypersplits can be decreased dramatically.
3.2. Recursive endpoint cutting (REC) scheme
Since rules are not uniformly distributed in the address space, many rules in the rule table overlap each other and the number of mutually overlapping rules varies vastly. The problem caused by rule overlapping is very serious for larger rule tables like the IPC and FW tables. No matter what existing decision tree is used, some rules may be replicated many times, and it is not easy to avoid rule duplication completely. For example, consider three rules where rules A and B are disjoint and rule C completely covers both A and B. No matter how the cutting operation partitions these three rules, rule C always needs to be replicated. The proposed REC scheme can solve the rule duplication problem and thus the total required memory is reduced significantly. We now use the same rule table as in Fig. 1 to show how the rule duplication problem is solved by the proposed REC. In the first step, we apply the push-up technique to move the duplicated rules up to the node at which the rules start to be duplicated. We consider two duplication cases for applying push-up operations. A rule in a node that is duplicated into all of its child nodes is called a fully duplicated rule, and a rule in a node that is duplicated in only some of its child nodes is called a partially duplicated rule. If the partially duplicated rules are also pushed up, no duplicated rule will be generated in the decision tree. As shown in Fig. 2(a), R1 is a fully duplicated rule to be pushed up. At the node associated with the address space {[0:1],[0:3]}, R3 is also a fully duplicated rule to be pushed up. By pushing up rules R1 and R3, no rule is duplicated. Normally, the pushed-up rules are stored in the internal nodes.
However, searching the rules in the internal nodes along the path from the root to a leaf is a slow process. To speed up this process, all the rules in internal nodes are removed from the tree into a separate table called the recursive-1 table. We then follow the same decision tree building heuristics to construct a decision tree from the recursive-1 table, as shown in Fig. 2(b). Here, we assume that only the highest-priority rule is the final match, and so the left child node of the root contains only rule R3 in Fig. 2(b). In a more complicated case, we may have to split the recursive-1 table again to avoid rule duplication. This recursive tree building process continues until no more recursive tables are generated. Notice that allowing a small number of duplicated rules may be beneficial because fewer decision trees are generated and fewer memory accesses are needed to complete a search operation. In this paper, two additional extensions are proposed. The first extension only allows the last decision tree to have duplications: as long as the last rule table contains no more than ts_threshold rules (called the tree size threshold), the recursion in the tree construction process is not applied. The rationale is that when the rule table is very small, rule duplication is not a big concern. In the second extension, duplications are allowed in all the decision trees with the following restriction: the number of times each rule is duplicated in a decision tree cannot be larger than a rule duplication threshold (dupl_threshold). In other words, when a rule needs to be duplicated more than dupl_threshold times, it will be inserted into the next decision tree.
Figure 2. The recursive endpoint-based cutting scheme. (a) Push-up the duplicated rules. (b) The decision trees constructed by REC.
Rule updates are supported by the proposed REC scheme as follows. Figure 3 shows the rule insertion algorithm. Since the REC scheme is implemented in a pipelined architecture, the heights of the decision trees are fixed at that of the tallest decision tree (decision tree 0). Assume rule x is to be inserted into tree i. If tree i is the last tree, the threshold-based rule removal policy is not employed, as in line 1. We temporarily insert rule x into the tree based on the cut dimensions and endpoints already computed in the process of tree construction. If the number of duplicated copies of rule x is greater than dupl_threshold, the insertion of rule x is aborted and tree i is unchanged; instead, we try to insert rule x into tree i + 1. All other details of the insertion operation are self-explanatory in Fig. 3. The time complexity of the insertion algorithm depends on NumofTrees and on H, the height of the tallest tree (i.e. tree 0). For the rule sets generated by ClassBench, the number of rules that overlap with each other is a constant; therefore, NumofTrees is a constant, which will be verified in the performance evaluation section. Also, since each internal node of the NewHypersplits decision tree has 2–8 children, the tree height H is $O(\log_2 N)$, where N is the number of rules in the rule set. As a result, the time complexity of inserting a rule is NumofTrees × $O(\log_2 N)$ = $O(\log_2 N)$ because NumofTrees is a constant.
Figure 3. Insertion algorithm for the proposed REC scheme.
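The insertion flow of Fig. 3 can be sketched in C as follows; this is a simplified illustration only, with toy stand-ins for the tree traversal and duplication counting (the data structures and the numeric values in main() are hypothetical):

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for this sketch only; the real data structures are the
 * NewHypersplits trees and rule buckets described in the text. */
struct rule  { int id; };
struct dtree { int would_duplicate; /* copies the rule would need in this tree */ };

static int count_duplications(const struct dtree *t, const struct rule *r)
{
    (void)r;
    return t->would_duplicate;      /* placeholder for a real tree traversal */
}

static bool insert_into_tree(struct dtree *t, const struct rule *r)
{
    printf("rule %d inserted (duplicated %d times)\n", r->id, t->would_duplicate);
    return true;
}

/* Try the trees in order: a rule that would be duplicated more than
 * dupl_threshold times in tree i is deferred to tree i+1.  The last tree
 * accepts the rule regardless of the threshold, as in line 1 of Fig. 3. */
static bool rec_insert(struct dtree *trees, int num_trees, int dupl_threshold,
                       const struct rule *r)
{
    for (int i = 0; i < num_trees; i++) {
        bool last = (i == num_trees - 1);
        if (last || count_duplications(&trees[i], r) <= dupl_threshold)
            return insert_into_tree(&trees[i], r);
    }
    return false;
}

int main(void)
{
    struct dtree trees[3] = { {120}, {40}, {0} };   /* hypothetical counts */
    struct rule  r = { 7 };
    return rec_insert(trees, 3, 50, &r) ? 0 : 1;    /* lands in tree 1 */
}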
In order to delete a rule y, we employ a simple approach as follows. We invalidate all the copies of rule y in the associated buckets and leave those bucket slots available for newly inserted rules. Our update design depends on the height of the tallest tree because the number of pipeline stages for the decision trees must be fixed at a predefined number H. Therefore, many rules may accumulate in the last tree so that its height becomes larger than H. If this happens, the hardware pipeline structure for the proposed REC scheme must be rebuilt.
3.3. Compress memory buckets
In addition to the memory reduction from the decision trees, the memory usage of the rule buckets, which is usually ignored in many existing software approaches, must also be reduced. In software approaches, only one bucket is searched at a time, so only the rule IDs, not the contents of the rules, have to be stored in the buckets. Based on our experimental results, the memory needed for all rule buckets accounts for 55–82% of the total memory in the pipelined architecture. Since our goal is to put the entire rule table in the on-chip memory of current FPGA devices, it is very important to develop an efficient data structure for storing the contents of the rules in buckets. In decision-tree-based packet classification, buckets are used to store the rules associated with the internal and leaf nodes of the decision tree. By traversing the decision tree from the root to a leaf node based on the header values of the incoming packet, the rules contained in the buckets of the traversed internal and leaf nodes are the possible candidates for a match. After matching every rule in the buckets, the matching rule with the highest priority is output from the memory bucket pipeline as the search result. If a sequential search on a software platform is considered, we only need to store one copy of the rule table and use rule pointers (i.e. IDs) to access the rules in a bucket. However, in a pipelined architecture, one copy of the rule table is insufficient to support concurrent accesses to many rules per cycle. Each pipeline stage needs one independent memory unit so that all the stages can access their data in parallel. We first review some existing bucket rule mapping schemes. Most FPGA-based packet classification schemes use a direct mapping scheme to map a leaf node bucket onto one memory bucket. In other words, each rule of the bucket is mapped to a stage of the memory bucket pipeline. Figure 4 shows a rule mapping example in which 9 leaf buckets of size 4 are mapped onto 9 memory buckets based on the direct mapping scheme. By recording the index of the memory bucket in the leaf bucket, the rules of the memory bucket at the recorded index can be accessed in each stage. Direct mapping is straightforward but has the following disadvantages. First, the memory requirement is unacceptable: the number of memory buckets needed equals the number of non-empty leaf nodes in the decision tree, and in large complex rule sets such as IPC or FW there are tens of thousands of non-empty buckets in the decision tree. Second, the number of rules in each bucket is not always equal to the bucket size; thus, direct mapping leaves too many unused slots. Third, the problem of duplicated rules is serious. As shown in stage 1 of Fig. 4(a), there are only two distinct rules, R1 and R2, but 4 copies of R1 and 5 copies of R2 are stored.
Figure 4. Mapping schemes for the memory bucket pipeline. (a) Direct mapping. (b) Variable-width mapping. (c) The proposed bucket merge mapping.
Wagner et al. [25] proposed a variable-width mapping to reduce the unused memory slots in the memory bucket pipeline. Figure 4(b) shows the variable-width mapping. The leaf buckets are first sorted in increasing order of their sizes. Instead of always allocating a new memory bucket for a leaf bucket, we select an existing memory bucket whose unused slots can accommodate the rules of the newly arriving leaf bucket. This mapping approach decreases the unused space, but the problem of rule duplication remains the same. We observe that many leaf node buckets are similar and share common rules, and there is a high probability that these similar leaf node buckets descend from the same ancestors in the decision tree. If we can merge these similar buckets, the problem of rule duplication can be mitigated significantly. For this reason, we propose the following greedy bucket compression scheme to reduce the duplicated rules in the memory bucket pipeline. We try to reuse the rules already assigned to the existing memory bucket pipeline. Our heuristic is very simple and efficient. The number of stages in the memory bucket pipeline may be varied: adding more stages usually decreases the memory requirement while the overall throughput remains the same, but the response time becomes longer and the hardware cost increases. Figure 5 gives the pseudo code of the proposed bucket compression algorithm, BucketCompression(). First, all the leaf buckets are sorted in decreasing order of their sizes. When adding a new leaf bucket (say A), we check a memory bucket (say B) to which some rules have already been assigned. If the number of distinct rules that are in bucket A or in memory bucket B (i.e. in their union) is not larger than the number of pipeline stages (numofstages), the rules of bucket A can be inserted into memory bucket B; we say that memory bucket B is a candidate memory bucket into which the rules can be inserted. In this paper, we select the first candidate memory bucket to perform the compression operation. If no memory bucket can accommodate the rules of the new leaf bucket, we create a new memory bucket to hold them. Figure 4(c) shows the bucket compression result for the same leaf bucket example, where only four memory buckets of size 4 are needed. More than one leaf bucket may be merged into a single memory bucket. The proposed bucket compression scheme resolves the serious rule duplication problem, and the total memory usage is much smaller than with the direct mapping and variable-width mapping schemes.
Figure 5. The proposed bucket compression algorithm.
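A minimal C sketch of this greedy heuristic is given below. It is not the pseudo code of Fig. 5: buckets are modeled simply as arrays of rule IDs, MAX_STAGES is an illustrative capacity (numofstages is assumed to be at most MAX_STAGES and no leaf bucket is assumed to exceed it), and the bookkeeping that records the chosen memory bucket index in each leaf node is omitted.

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STAGES 16                 /* slots per memory bucket in this sketch */

struct bucket { int n; int rule[MAX_STAGES]; };

static bool contains(const struct bucket *b, int rule)
{
    for (int i = 0; i < b->n; i++)
        if (b->rule[i] == rule) return true;
    return false;
}

/* Number of slots the union of leaf bucket a and memory bucket b would occupy. */
static int union_size(const struct bucket *a, const struct bucket *b)
{
    int n = b->n;
    for (int i = 0; i < a->n; i++)
        if (!contains(b, a->rule[i])) n++;
    return n;
}

static void merge_into(const struct bucket *a, struct bucket *b)
{
    for (int i = 0; i < a->n; i++)
        if (!contains(b, a->rule[i])) b->rule[b->n++] = a->rule[i];
}

static int by_size_desc(const void *x, const void *y)
{
    return ((const struct bucket *)y)->n - ((const struct bucket *)x)->n;
}

/* Greedy compression: leaf buckets are taken in decreasing order of size and
 * merged into the first memory bucket whose union still fits in numofstages;
 * otherwise a new memory bucket is created.  Returns the memory bucket count. */
int bucket_compression(struct bucket *leaf, int nleaf,
                       struct bucket *mem, int numofstages)
{
    int nmem = 0;
    qsort(leaf, nleaf, sizeof(*leaf), by_size_desc);
    for (int i = 0; i < nleaf; i++) {
        int j = 0;
        while (j < nmem && union_size(&leaf[i], &mem[j]) > numofstages) j++;
        if (j == nmem) { memset(&mem[nmem], 0, sizeof(mem[nmem])); nmem++; }
        merge_into(&leaf[i], &mem[j]);
    }
    return nmem;
}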
4. PARALLEL AND PIPELINED SEARCH ENGINE
4.1. Architecture overview
Many hardware-based packet classification solutions have been proposed in recent years [23–27]. These solutions can only implement the ACL rule table in their hardware architectures because the memory usage of their data structures is larger than the on-chip BRAM that FPGA devices can provide. We can reduce the memory requirement by using a parallel and pipelined search engine to implement the multiple decision trees generated by the proposed REC scheme. In order to sustain high link rates such as OC-768 (40 Gbps), we implement our search engine on modern FPGA devices, on which a throughput of more than 100 Gbps can be achieved. In the proposed parallel and pipelined search engine, each decision tree corresponds to a pipeline and all the pipelines are executed in parallel to improve the throughput. Figure 6 shows the block diagram of the proposed search engine. Each pipeline outputs the ID of the matching rule. In this paper, the rule ID is assumed to be the rule priority. All pipelines have the same number of stages so that the outputs of all the pipelines reach the priority encoder at the same time to compute the final match with the highest priority.
Figure 6. The hardware architecture.
4.2. Decision tree pipeline
In the decision tree pipeline, we have to map the nodes of the decision tree onto pipeline stages and design the circuits that compare the selected dimensions and cut-points against the header values of incoming packets to perform the matching process. We map the tree nodes to pipeline stages based on the tree level, so the nodes at the same level are mapped to the same pipeline stage. The number of decision tree pipeline stages is equal to the height of the decision tree. Because our decision tree is not a complete tree, the leaf nodes are not necessarily at the bottom level. We push all the leaf nodes to the bottom tree level, which is then mapped onto the last stage. Each internal node has a flag (called Nop_out) to indicate whether it is the lowest internal node connecting to a leaf node in the decision tree. In other words, the lowest internal node forwards an asserted Nop_out flag to the next stage. When the stage corresponding to an internal node receives an asserted Nop_out flag, this stage does nothing but pass the information from the previous stage to the next stage. When the leaf node stage is reached, the index of the corresponding memory bucket is output. We push all leaf nodes to the last stage for two reasons. First, the data structures of the internal and leaf nodes have different sizes; storing them in the same memory unit would be complicated and would waste memory space. Second, the operations in internal nodes and leaf nodes are different: internal nodes perform comparison operations on the input packet header values, whereas leaf nodes only read the index of a memory bucket. If we placed internal and leaf nodes in the same stage, each stage would need two memory units (one for internal nodes and one for leaf nodes), or a complex data structure merging the two, along with two sets of logic. The data structure of the internal nodes is shown in Fig. 7(a). No more than three dimensions are selected as cut dimensions, so at most three cut-points are needed. The CutDimBitmap field records which dimensions are selected as the cut dimensions. There are at most eight child nodes, each of which can be a leaf or an internal node. Therefore, we use an 8-bit bitmap called ChildTypeBitmap to record the child types. When the ith bit of ChildTypeBitmap is 0, the ith child is an internal node; otherwise, it is a leaf node. Since the sizes of leaf and internal nodes are different and they are stored in two different arrays, we need two individual base addresses, LeafBase and InternalNodeBase, respectively.
In order to speed up the computation of the child node address, we use a precomputed array of size 8, PrecomputedOffset[], whose usage can be understood easily from the following example. Assume that we have to go to the ith child node after comparing the selected header values with the three cut-points. The type of this child node is determined by checking ChildTypeBitmap[i]. If ChildTypeBitmap[i] indicates that the child is a leaf node, the precomputed offset PrecomputedOffset[i] is added to LeafBase to obtain the address of the ith child node. Otherwise, if ChildTypeBitmap[i] indicates an internal node, the address of the ith child node is InternalNodeBase + PrecomputedOffset[i] × LeafSize. Each leaf node stores the index of the memory bucket to be searched. In order to utilize the BRAMs efficiently, we select an appropriate memory bucket size such that the number of memory buckets is not larger than 1024. Hence, only 10 bits are needed for the index of the memory buckets.
Figure 7. Structures of internal node and bucket rule. (a) 147-bit structure. (b) 162-bit structure.
Figure 8 shows the block diagram of the internal node stage. The internal node memory stores the data contents of all the internal nodes. The matching unit computes which branch to take based on CutDimBitmap and the three cut-points. The address detection unit calculates the address of the next-level node by using ChildTypeBitmap, the array PrecomputedOffset[] and the two base addresses InternalNodeBase and LeafBase. Each internal node stage is divided into two sub-stages in order to further reduce the stage delay; because the proposed decision tree is not too high, adding more stages is acceptable. Figure 9 shows the diagram of the matching unit, which consists of three 5 × 1 multiplexers for extracting at most three header values of the incoming packet, selected by Unit_of_Cut_Dimensions based on CutDimBitmap. The selected header values are compared with the corresponding cut-points, and the three 1-bit results are combined to form a 3-bit index into array PrecomputedOffset[]. The address detection unit is responsible for calculating the address of the child node based on InternalNodeBase, LeafBase, ChildTypeBitmap and PrecomputedOffset[], as described above.
Figure 8. The block diagram for the internal node stage.
Figure 9. The matching unit block diagram.
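The child-address calculation performed by the address detection unit can be sketched in C as follows; the field widths, LEAF_SIZE and the unpacked layout are illustrative assumptions, the real record being the packed 147-bit structure of Fig. 7(a).

#include <stdint.h>

#define LEAF_SIZE 1u   /* size of a leaf-node record, in address units (assumed) */

/* Unpacked view of the internal-node record of Fig. 7(a); the individual
 * field widths here are illustrative. */
struct internal_node {
    uint8_t  cut_dim_bitmap;        /* which of the 5 fields are cut dimensions   */
    uint32_t cut_point[3];          /* at most three cut-points                   */
    uint8_t  child_type_bitmap;     /* bit i = 1: the ith child is a leaf node    */
    uint32_t leaf_base;             /* base address of the leaf-node array        */
    uint32_t internal_node_base;    /* base address of the internal-node array    */
    uint8_t  precomputed_offset[8]; /* per-child offsets, one per possible branch */
};

/* 'branch' is the 3-bit index formed from the three cut-point comparisons.
 * Returns the address of the selected child node, as described in the text. */
static uint32_t child_address(const struct internal_node *n, unsigned branch)
{
    unsigned i = branch & 7u;
    uint8_t  off = n->precomputed_offset[i];
    if (n->child_type_bitmap & (1u << i))             /* leaf child     */
        return n->leaf_base + off;
    return n->internal_node_base + off * LEAF_SIZE;   /* internal child */
}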
4.3. Memory bucket pipeline
The data structure of a memory bucket is shown in Fig. 7(b). The RulePriority field is the priority of the rule, used to find the matching rule with the highest priority. As described earlier, we use the rule ID as the priority of the rule in this paper. The 13-bit RulePriority field indicates that there are $2^{13}$ rules in the rule table under consideration. The source and destination IP fields are prefixes represented in length format: a 32-bit IP address and a 6-bit length. The main task in each memory bucket pipeline stage is to compare the 5-field rule stored in the stage against the packet's header values to find a match. The rules in each memory bucket are sorted in increasing order of their priorities; therefore, the last matching rule in a memory bucket is the one with the highest priority. In addition, each stage also outputs its stage ID, called the matched stage ID, to the next stage if a match is found. The usage of the matched stage ID will be discussed later. In FPGA devices, the BRAM blocks that we can allocate are restricted to sizes of 1024n × 18m or 512n × 36m bits, where n and m are positive integers. A BRAM block of size 1024n × 18m bits is also restricted to be used as 1024n entries of 18m bits each. The smallest BRAM block is of size 18 Kbits (1 K × 18 bits). Therefore, the 162-bit bucket rule causes no BRAM waste when there are 1024 memory buckets, which need a BRAM block of size 1024 × 18 × 9 bits, i.e. 1024 entries of 162 bits. However, for a table of 10 K rules, the memory rule needs 163 bits because the RulePriority field (i.e. the rule ID) needs 14 bits. In this case, we would have to allocate a BRAM block containing 1024 entries of 180 bits, which wastes 17 bits per entry. To solve this problem, the following memory optimization is developed. The rule priority is not used in the memory bucket pipeline when the rule is matched against the header values of the packets. Therefore, we propose a rule priority split scheme that splits the RulePriority field into two sub-fields of 13 bits and 1 bit. The 1-bit sub-fields of all the rules in the same bucket are then aggregated into an additional memory bucket stage, called the rule priority combiner stage, in the memory bucket pipeline. Figure 10 shows the architecture of the memory bucket pipeline appended with the additional rule priority combiner stage. Each memory bucket stage performs the rule comparison against the header values of the incoming packet. If the packet's header values match a rule, the 13-bit partial rule priority and the matched stage ID are output to the next stage; otherwise, we just forward the input rule priority and the input stage ID to the next stage. After the rule comparisons in all the memory bucket stages, the 1-bit sub-field corresponding to the last stage with a matching rule is reclaimed from the rule priority combiner stage to complete the computation of the rule priority of the matching rule. In other words, at the rule priority combiner stage, the input stage ID is used to extract the corresponding 1-bit sub-field, which is then concatenated with the input rule priority to form the priority of the matching rule. Finally, the rule priorities (i.e. matching rule IDs) of all search engines converge on the priority encoder to obtain the final match. The memory saving of the proposed rule priority split scheme can be understood easily from the following example: if there are n original memory bucket stages, n − 1 BRAM blocks of size 1024 × 18 bits are saved compared to the case without the rule priority split scheme, where n is 17 or less for all the tables of 10 K rules we experimented with. For the large tables of 32 K or 64 K rules, the same optimization technique can be applied accordingly.
Figure 10. The memory bucket pipeline enhanced with one additional rule priority combiner stage.
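The split-and-recombine step can be illustrated by the following C sketch. Which of the 14 priority bits is split off is not specified in the text; the low-order bit is assumed here, and the values in main() are hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Re-attach the 1-bit sub-field kept in the combiner stage to the 13-bit
 * partial priority carried through the memory bucket pipeline. */
static uint16_t combine_priority(uint16_t partial13, const uint8_t *combiner_bits,
                                 unsigned matched_stage)
{
    return (uint16_t)((partial13 << 1) | (combiner_bits[matched_stage] & 1u));
}

int main(void)
{
    uint16_t full = 10941 & 0x3FFF;     /* a hypothetical 14-bit rule ID       */
    uint8_t  bits[16] = {0};
    unsigned stage = 5;                 /* stage in which the rule is stored   */
    uint16_t partial = full >> 1;       /* 13 bits kept in the bucket-rule entry */
    bits[stage] = full & 1u;            /* 1 bit kept in the combiner stage    */
    printf("reconstructed: %u (original %u)\n",
           combine_priority(partial, bits, stage), full);
    return 0;
}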
5. PERFORMANCE EVALUATION
In this section, we conduct two types of performance evaluations to show the superiority of the proposed schemes. First, we conduct a performance evaluation in a simple software environment to compare the proposed bucket compression algorithm with the variable-width scheme and to compare the proposed REC against a state-of-the-art scheme, EffiCuts [10]. Second, we evaluate the performance of the proposed REC scheme in terms of memory consumption and speed by implementing it on FPGA devices and comparing it with other packet classification approaches. The rule tables used in the experiments are Access Control List (ACL), Firewall (FW) and IP Chains (IPC) tables of 10 K and 100 K rules, generated by ClassBench [31] with the seeds ACLm, FWm and IPCn for m = 1–5 and n = 1–2. The bucket size of all decision trees built in the experiments is set to 8. In the software environment, a PC with an Intel Core i5-4460 @ 3.2 GHz (4 cores) CPU is used. The simulated algorithms are implemented in C. In the FPGA environment, we use the Xilinx ISE 13.1 development tools, with the Xilinx Virtex-5 [32] XC5VFX200T containing 30 720 slices (each slice contains four LUTs and four flip-flops) and 912 BRAM blocks of size 18 Kb (16 416 Kb), and the Virtex-6 [33] XC6VLX760 containing 118 560 slices and 1440 BRAM blocks of size 18 Kb (25 920 Kb), both with −2 speed grade, as our target devices. Table 2 shows the number of buckets needed by NewHypersplits without compression and after applying the variable-width mapping compression and the proposed bucket merge mapping compression for tables of 100 K rules. The percentages shown in parentheses in the VWidth and BMerge columns are the bucket compression ratios of the evaluated scheme relative to the original direct mapping scheme. We can see that the compression ratio of the proposed mapping is in the range of 9–15%, which is much better than the 45–70% achieved by the variable-width mapping. Table 3 shows the numbers of trees, memory accesses per lookup, inodes and bucket rules, and the memory usage for tables of 100 K rules. The difference in memory usage between the proposed REC with dupl_threshold set to 0 and to 50 is insignificant. However, the REC with dupl_threshold = 50 needs 4–13 fewer memory accesses on average per lookup than the REC with dupl_threshold = 0. REC with dupl_threshold = 50 needs only 5–66% of the memory needed by EffiCuts. The number of memory accesses per lookup for EffiCuts is 9–10 fewer than for the REC with dupl_threshold = 50; nonetheless, this disadvantage can be remedied by the pipelined design implemented on the FPGA devices below.
Table 2. The number of buckets of size 8 for NewHypersplits without compression (Direct), with variable-width mapping (VWidth) and with the proposed bucket merge mapping (BMerge) for tables of various sizes (up to around 100 K rules).
Rule set   # of rules   Direct     VWidth (%)      BMerge (%)
acl1       99 193       53 074     25 349 (48)     17 753 (33)
acl2       66 280       315 961    134 057 (42)    14 125 (4)
acl3       85 845       261 053    115 634 (44)    17 311 (7)
acl4       84 514       263 352    122 995 (47)    16 967 (6)
acl5       97 310       38 715     17 463 (45)     14 537 (38)
acl avg    86 628       186 431    83 100 (45)     16 139 (9)
fw1        65 742       117 534    53 228 (45)     9443 (8)
fw2        83 659       146 715    72 051 (49)     15 285 (10)
fw3        63 977       55 211     26 389 (48)     8390 (15)
fw4        67 484       78 086     39 814 (51)     9745 (12)
fw5        48 376       94 229     49 677 (53)     6255 (7)
fw avg     65 848       98 355     48 232 (49)     9824 (10)
ipc1       88 396       102 897    52 076 (51)     13 259 (13)
ipc2       92 678       142 571    119 901 (84)    20 390 (14)
ipc avg    90 537       122 734    85 988.5 (70)   16 825 (14)
Table 3. Numbers of trees, memory accesses per lookup, inodes and bucket rules, and memory usage for tables of 100 K rules.
NewHypersplits (dupl_thd = 0)
Rule set   Trees (max depth)   Inode accesses   Bucket-rule accesses   Total accesses   # of inodes   # of bucket rules   Total Mem (MB)
acl1       4 (13)              22               14                     36               17.6k         142k                2.9
acl2       8 (14)              44               28                     72               128.5k        133k                4.73
acl3       8 (13)              43               31                     74               101.6k        158k                4.71
acl4       8 (14)              42               29                     71               124.4k        159k                5.14
acl5       2 (14)              11               2                      13               15.55k        140k                2.82
acl avg    6 (14)              32.4             20.8                   53.2             77.5k         146k                4.06
fw1        12 (12)             64               47                     111              65.3k         107k                3.12
fw2        4 (13)              30               15                     45               69.97k        135k                3.71
fw3        8 (12)              48               25                     73               29.2k         90.7k               2.17
fw4        15 (12)             76               43                     119              38.2k         103k                2.55
fw5        11 (14)             66               34                     100              125.1k        91.2k               3.92
fw avg     10 (13)             56.8             32.8                   89.6             65.546k       105k                3.09
ipc1       6 (14)              41               25                     66               80.1k         121k                3.64
ipc2       2 (16)              26               13                     39               122k          155k                5.01
ipc avg    4 (15)              33.5             19                     52.5             101k          138k                4.33
NewHypersplits (dupl_thd = 50)
Rule set   Trees (max depth)   Inode accesses   Bucket-rule accesses   Total accesses   # of inodes   # of bucket rules   Total Mem (MB)
acl1       2 (14)              19               10                     29               17.6k         143k                2.9
acl2       6 (15)              41               22                     63               129k          133k                4.73
acl3       5 (15)              37               25                     62               102k          159k                4.73
acl4       6 (15)              39               26                     65               124k          160k                5.15
acl5       2 (14)              11               2                      13               15.5k         140k                2.82
acl avg    4.2 (15)            29.4             17                     46.4             77.6k         147k                4.07
fw1        9 (12)              59               38                     97               65.4k         107k                3.13
fw2        2 (16)              26               9                      35               79.2k         174k                4.59
fw3        5 (13)              43               18                     61               29.3k         90.6k               2.17
fw4        10 (13)             68               34                     102              38.6k         105k                2.6
fw5        8 (14)              60               28                     88               125k          91.3k               3.92
fw avg     6.8 (14)            51.2             25.4                   76.6             67.5k         114k                3.28
ipc1       4 (15)              38               20                     58               80.2k         122k                3.66
ipc2       2 (16)              26               13                     39               122k          155k                5.01
ipc avg    3 (16)              32               16.5                   48.5             101k          138k                4.34
EffiCuts
Rule set   Trees (max depth)   Inode accesses   Bucket-rule accesses   Total accesses   # of inodes   # of bucket rules   Total Mem (MB)
acl1       5 (11)              13               7                      20               4012          150k                2.79
acl2       7 (12)              18               25                     43               20 834        772k                14.42
acl3       8 (11)              24               33                     57               9639          439k                8.16
acl4       9 (11)              24               28                     52               8955          431k                8.00
acl5       2 (12)              4                6                      10               4549          99.8k               1.90
acl avg    6.2 (11)            16.6             19.8                   36.4             9598          378k                7.05
fw1        9 (15)              29               43                     72               121 708       3570k               67.24
fw2        7 (12)              16               20                     36               13 099        1060k               19.47
fw3        9 (14)              20               32                     52               60 292        1945k               36.50
fw4        10 (16)             25               78                     103              165 723       5768k               107.99
fw5        9 (16)              26               47                     73               181 452       5746k               107.93
fw avg     8.8 (15)            23.2             44                     67.2             108 455       3618k               67.83
ipc1       9 (12)              28               35                     63               9573          615k                11.34
ipc2       4 (11)              10               4                      14               2882          100k                1.87
ipc avg    6.5 (12)            19               19.5                   38.5             6227          357k                6.61
Subsequently, we evaluate the memory usage and search throughput of the proposed REC scheme on FPGA devices. Table 4(a) lists the detailed memory usage for 12 rule tables. The D-tree and Bucket rows show the overall memory usage of the decision tree pipelines and the memory bucket pipelines for each of these twelve tables, respectively. The numbers of internal nodes (inodes) and bucket rules are also shown. Table 4(b) shows the detailed memory usage of each search engine for tables acl1, fw1 and ipc1. We can see that the total memory required for these rule tables can easily be accommodated in the block RAM (denoted by BRAM) of the FPGA devices. The number of decision trees is a trade-off between memory size and hardware cost: if we create more decision trees, the total required memory becomes smaller, but more logic is needed to implement all the search engines. The number of stages in the memory bucket pipeline also needs to be tuned so that the most efficient memory usage is achieved. This number of stages is decided by the BRAM restrictions stated before; in other words, the number of entries of a BRAM block must be a multiple of 1024 and the width of the BRAM entries must be a multiple of 18 bits.
Table 4. Memory usage in KB and the number of stages for tables of 10 K rules.
(a)
                        acl1        acl2      acl3      acl4      acl5      fw1       fw2       fw3       fw4       fw5       ipc1      ipc2
# of rules              9603        9429      9424      9643      7262      9311      9652      9025      8865      8815      9502      10 000
# of D-trees            2           6         5         6         2         3         4         7         7         7         4         2
# of inodes             3197        6092      16 868    12 321    1474      8362      8514      8137      3960      11 917    9463      10 718
# of bucket rules       15 380      12 328    48 100    28 222    15 740    16 384    11 933    9792      11 116    12 413    24 073    15 264
D-tree (max depth)      66.3 (13)   109 (14)  303 (15)  221 (14)  26 (9)    266 (15)  153 (14)  146 (14)  71 (12)   214 (14)  228 (14)  192 (13)
Bucket (# of stages)    318.2 (15)  244 (15)  951 (13)  558 (12)  311 (15)  327 (10)  236 (10)  194 (12)  220 (12)  245 (10)  479 (13)  302 (13)
Total Mem (KB)          384.5       353       1 254     779       337       593       389       340       291       459       707       494
(b)
                          acl1        fw1         ipc1
Engine0
  D-tree (depth)          61.0 (13)   232.1 (12)  156.6 (14)
  Bucket (# of stages)    273.6 (15)  179.1 (10)  241.4 (13)
  Total                   334.6       411.1       398.0
Engine1
  D-tree (depth)          5.3 (9)     8.2 (10)    19.0 (10)
  Bucket (# of stages)    44.6 (9)    55.9 (9)    48.9 (9)
  Total                   50.0        64.1        67.9
Engine2
  D-tree (depth)          –           25.8 (7)    5.7 (7)
  Bucket (# of stages)    –           92.0 (9)    5.1 (9)
  Total                   –           117.8       10.8
Engine3
  D-tree (depth)          –           –           46.7 (9)
  Bucket (# of stages)    –           –           183.6 (2)
  Total                   –           –           230.4
Bucket memory %           82.3%       55.2%       67.8%
Total memory (KB)         384.5       592.99      707.0
Therefore, we increase the number of stages in the memory bucket pipeline from the leaf node bucket size (i.e. 8) to an appropriate number, denoted by numofstages, so that the number of memory buckets is very close to 1024. The chosen values of numofstages are the numbers shown in parentheses in the Bucket rows of Table 4. For example, in Table 4(b), tables acl1, fw1 and ipc1 need 15, 10 and 13 stages, respectively, for the memory bucket pipelines of their first search engines. As a result, we only need a 10-bit memory bucket index rather than the eight 14-bit rule pointers used for a sequential search in a software environment. As stated earlier, the ability to choose a variable number of stages for the memory bucket pipeline is due to the flexibility of the proposed bucket compression scheme. Table 5 shows the rule compression and duplication ratios of the proposed bucket compression algorithm for the tables of 10 K rules. The second and third rows show the numbers of rules in the original leaf node buckets and in the memory buckets, respectively.
The compression and duplication ratios shown in Rows 4 and 5 are computed as the number of rules in the memory buckets divided by the number of rules in the leaf buckets, and the number of rules in the memory buckets divided by the number of rules in the rule table, respectively. We can see that the proposed bucket rule compression algorithm is very efficient: for the FW1_10K table, the number of bucket rules is reduced by a factor of 8.48 (a compression ratio of 0.12). The rule duplication ratio is reduced to 1.6–2.66.
Table 5. Rule compression and duplication ratios for tables of 10 K rules.
                                  acl1     fw1       ipc1
# of rules in leaf buckets        35 024   139 042   149 674
# of rules in memory buckets      15 380   16 384    24 073
Compression ratio                 0.44     0.12      0.16
Duplication ratio                 1.60     1.76      2.66
Table 6 shows the FPGA resource utilization, clock frequencies and throughputs of the proposed architecture on the Xilinx Virtex-5 and Virtex-6 FPGAs described above. The achieved throughput is much higher than the 40 Gbps required by OC-768 links. As stated before, the smallest BRAM block that can be allocated is of size 18 × 1 Kb = 18 Kbits, used as 1 K 18-bit entries. The numbers of entries needed in the memory modules of some of the earlier stages of the pipelined architecture are much smaller than 1 K; in other words, the memory utilization of these early stages is very low. If we use distributed RAM instead of BRAM to implement these stages, we can further decrease the amount of BRAM needed. In our proposed architecture, by replacing BRAM with distributed RAM for the small memory modules, the required memory is less than 50% of the total BRAM available.
Table 6. FPGA results for tables of 10 K rules (single-ported BRAM).
Rule table    # of LUTs (%)    # of 18 K-bit BRAM (%)    Frequency (MHz)    Throughput (Gbps)
Virtex-5
  acl1        3290 (10.7%)     171 (18.8%)               161.76             51.76
  fw1         3613 (11.7%)     264 (28.9%)               161.20             51.58
  ipc1        6041 (19.6%)     314 (34.4%)               161.63             51.72
Virtex-6
  acl1        3510 (3%)        171 (11.9%)               194.11             62.08
  fw1         4197 (3.5%)      264 (18.3%)               194.11             62.08
  ipc1        6444 (5.4%)      314 (21.8%)               194.11             62.08
We also evaluate the maximum number of rules that the proposed pipelined architecture can support on the FPGA devices used. On the Virtex-5 FPGA, 50 K rules for ACL, 20 K rules for IPC and 25 K rules for FW can be supported, and the Virtex-6 FPGA can support almost twice as many rules as the Virtex-5. The clock rates and throughputs achieved on Virtex-6 are similar to those on Virtex-5.

We compare our design with the existing state-of-the-art FPGA-based packet classification engines [23–25]. Table 7 shows the LUT usage, BRAM usage, search engine clock rates and throughputs achieved for ACL1_10K. All the engines use the Xilinx Virtex-5 XC5VFX200T with −2 speed grade and dual-ported memories. The LUT utilizations are all similar. SPMT [23] has the highest throughput, but it consumes almost all (94%) of the BRAM available on the Virtex-5 XC5VFX200T FPGA. In other words, SPMT cannot support tables of more than 10 K rules on Virtex-5 because the set pruning multi-bit trie it uses is not a memory-efficient data structure for 5-field rules. For a fair comparison, we introduce a new performance metric called performance efficiency, defined as the ratio of throughput to the number of 18 Kb BRAM blocks used. Our design outperforms the others by a factor of 2.3 to 3 in terms of performance efficiency.

Table 7. Performance comparison for ACL1_10K on Virtex-5 XC5VFX200T FPGA with dual-ported BRAM.
Approach                       LUTs used/available (util.)   BRAMs used/available (util.)   Frequency (MHz)   Throughput (Gbps)   Efficiency (Gbps/BRAM block)
Proposed scheme                7044/122 880 (5.7%)           173/456 (37.9%)                161.76            103.53              0.584
SPMT [23]                      6584/122 880 (5.4%)           429/456 (94.1%)                173.02            110.73              0.252
2-D Linear Dual-Pipeline [24]  10 307/122 880 (8.4%)         407/456 (89.2%)                125.36            80.23               0.192
BiConOLP [25]                  6611/122 880 (5.4%)           208/456 (45.6%)                143.4             45.88               0.215
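Computing the efficiency metric of Table 7 is straightforward: divide the dual-ported throughput by the number of 18 Kb BRAM blocks used. The sketch below (Python) reproduces the ranking in Table 7; the values computed this way are within a few percent of the published column.

    # Efficiency = throughput (Gbps) per 18 Kb BRAM block, as defined in the text.
    table7 = [
        ("Proposed scheme",               103.53, 173),
        ("SPMT [23]",                     110.73, 429),
        ("2-D Linear Dual-Pipeline [24]",  80.23, 407),
        ("BiConOLP [25]",                  45.88, 208),
    ]
    for name, gbps, brams in table7:
        print(f"{name}: {gbps / brams:.3f} Gbps per BRAM block")

The advantage of the proposed scheme comes from needing far fewer BRAM blocks at a comparable clock rate.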
Table 8 compares the performance of our architecture with other existing schemes on various hardware platforms. All the figures (throughput, number of rules, number of LUTs and bytes per rule) are taken from the original papers. The proposed scheme outperforms all the schemes except SPMT [23] and Modified Hypercuts [29]. As explained above, SPMT is not suitable for large rule tables due to its memory explosion problem. Modified Hypercuts [29] achieves its maximum throughput of 138 Gbps only for rule sets in which its search structure needs at most two memory accesses to classify a packet. However, if the rule tables contain more wildcard field values, which incur more than two memory accesses per search, the throughput degrades dramatically to 69 Gbps or less. The multi-pipeline architecture in [28] needs much less memory than the other schemes because it employs a prefix encoding scheme to reduce the memory needed in the decision trees. However, it needs more LUTs than the proposed scheme, SPMT [23] and HyperSplit [26].

Table 8. Performance comparisons of various FPGA implementations with dual-ported BRAM.
Approach                       # of rules   Platform     6-input LUTs   Bytes per rule   Throughput (Mpps)   Throughput (Gbps)
Proposed scheme                9603         Virtex-5     7044           38.9             323.5               103.5
                                            Virtex-6     7044           38.9             388.2               124.2
SPMT [23]                      9603         Virtex-5     6584           96.5             346.0               110.7
D2BS [27] (a)                  9603         Virtex-5     –              –                264                 84.5
2-D Linear Dual-Pipeline [24]  9603         Virtex-5     41 228         91.6             250.7               80.2
HyperSplit on FPGA [26]        9603         Virtex-6     2988           46.4             230.8               73.9
Multi-pipeline [28]            ~9500        Virtex-6     14 400         18               340                 108.8
BiConOLP [25]                  9603         Virtex-5     26 444         88.7             143.4               45.9
Modified Hypercuts [29] (b)    10 000       Stratix III  16 028         28.5             216/433             69/138
(a) The memory and logic usage are not available for D2BS.
(b) The number of logic elements (LEs) in Stratix III is converted to the number of 6-input LUTs in Virtex-5/6 FPGAs by assuming that one LE is equivalent to 0.8 LUT.
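The Mpps and Gbps columns of Table 8 are related by the minimum packet size of 40 bytes (320 bits per packet); the same factor reproduces the 69/138 Gbps worst/best figures quoted above for Modified Hypercuts. A brief illustration (Python):

    def mpps_to_gbps(mpps, packet_bytes=40):
        """Convert a packet rate in Mpps to a line rate in Gbps for fixed-size packets."""
        return mpps * packet_bytes * 8 / 1000

    print(mpps_to_gbps(323.5))  # ~103.5 Gbps (proposed scheme, Virtex-5)
    print(mpps_to_gbps(433))    # ~138.6 Gbps (Modified Hypercuts, best case)
    print(mpps_to_gbps(216))    # ~69.1 Gbps (Modified Hypercuts, worst reported case)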
6. CONCLUSIONS

In this paper, we proposed a high-throughput and low-cost parallel and pipelined architecture based on the recursive endpoint-cutting scheme. The bucket memory requirement becomes a serious problem when a pipelined architecture is considered, so we also proposed a bucket compression scheme to reduce rule duplication in the memory bucket pipeline. Our experimental results on Xilinx Virtex-5/6 FPGA devices showed that the proposed scheme needs much less BRAM than other FPGA-based approaches. The proposed scheme can accommodate more than 20 K rules with 1.6 Mb of on-chip memory, and a throughput beyond 100 Gbps for minimum-size (40-byte) packets can be achieved if dual-ported BRAM is used.

REFERENCES

1. Chao, H.J. (2002) Next generation routers. Proc. IEEE, 90, 1518–1558.
2. Gupta, P. and McKeown, N. (2001) Algorithms for packet classification. IEEE Netw., 15, 24–32.
3. Taylor, D.E. (2005) Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37, 238–275.
4. Srinivasan, V., Varghese, G., Suri, S. and Waldvogel, M. (1998) Fast and scalable layer four switching. Proc. ACM SIGCOMM, pp. 191–202.
5. Baboescu, F., Singh, S. and Varghese, G. (2003) Packet classification for core routers: is there an alternative to CAMs? Proc. IEEE INFOCOM.
6. Cohen, E. and Lund, C. (2005) Packet classification in large ISPs: design and evaluation of decision tree classifiers. Proc. ACM SIGMETRICS, pp. 73–84.
7. Gupta, P. and McKeown, N. (1999) Packet classification using hierarchical intelligent cuttings. Proc. Hot Interconnects VII.
8. Qi, Y., Xu, L., Yang, B., Xue, Y. and Li, J. (2009) Packet classification algorithms: from theory to practice. Proc. IEEE INFOCOM, pp. 648–656.
9. Singh, S., Baboescu, F., Varghese, G. and Wang, J. (2003) Packet classification using multidimensional cutting. Proc. ACM SIGCOMM, pp. 213–224.
10. Vamanan, B., Voskuilen, G. and Vijaykumar, T.N. (2010) EffiCuts: optimizing packet classification for memory and throughput. Proc. ACM SIGCOMM, pp. 207–218.
11. Lee, J., Byun, H., Mun, J.H. and Lim, H. (2017) Utilizing 2D leaf-pushing for packet classification. Comput. Commun., 103, 116–129.
12. Lim, H., Lee, N., Jin, G., Choi, Y. and Yim, C. (2014) Boundary cutting for packet classification. IEEE/ACM Trans. Netw., 22, 443–456.
13. Chang, Y.-K. and Lin, Y.-C. (2007) Dynamic segment trees for ranges and prefixes. IEEE Trans. Comput., 56, 769–784.
14. Jiang, W., Prasanna, V.K. and Yamagaki, N. (2010) Decision forest: a scalable architecture for flexible flow matching on FPGA. Proc. FPL, pp. 394–399.
15. Song, H. and Lockwood, J.W. (2005) Efficient packet classification for network intrusion detection using FPGA. Proc. 13th ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA).
16. Meiners, C.R., Liu, A.X. and Torng, E. (2011) Topological transformation approaches to TCAM-based packet classification. IEEE/ACM Trans. Netw., 19, 237–250.
17. Pao, D., Li, Y.-K. and Zhou, P. (2006) Efficient packet classification using TCAMs. Comput. Netw., 50, 3523–3535.
18. Chang, Y.-K., Su, C.-C., Lin, Y.-C. and Hsieh, S.-Y. (2013) Efficient gray code based range encoding schemes for packet classification in TCAM. IEEE/ACM Trans. Netw., 21, 1201–1214.
19. Dharmapurikar, S., Song, H., Turner, J. and Lockwood, J. (2006) Fast packet classification using bloom filters. Proc. ACM/IEEE Symp. Architectures for Networking and Communications Systems.
20. Papaefstathiou, I. and Papaefstathiou, V. (2007) Memory-efficient 5D packet classification at 40 Gbps. Proc. IEEE INFOCOM.
21. Nikitakis, A. and Papaefstathiou, I. (2008) A memory-efficient FPGA-based classification engine. Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM).
22. Kennedy, A., Wang, X., Liu, Z. and Liu, B. (2008) Low power architecture for high speed packet classification. Proc. ACM/IEEE Symp. Architectures for Networking and Communications Systems.
23. Chang, Y.-K., Lin, Y.-S. and Su, C.-C. (2010) A high-speed and memory efficient pipeline architecture for packet classification. Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), pp. 215–218.
24. Jiang, W. and Prasanna, V.K. (2012) Scalable packet classification on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 20, 1668–1680.
25. Wagner, J.M., Jiang, W. and Prasanna, V.K. (2009) A scalable pipeline architecture for line rate packet classification on FPGAs. Proc. 21st IASTED Int. Conf. Parallel and Distributed Computing and Systems (PDCS).
26. Qi, Y., Fong, J., Jiang, W., Xu, B., Li, J. and Prasanna, V.K. (2010) Multi-dimensional packet classification on FPGA: 100 Gbps and beyond. Proc. Int. Conf. Field-Programmable Technology (FPT).
27. Yang, B., Fong, J., Jiang, W., Xue, Y. and Li, J. (2012) Practical multi-tuple packet classification using dynamic discrete bit selection. IEEE Trans. Comput., 63, 424–434.
28. Pao, D. and Lu, Z. (2014) A multi-pipeline architecture for high-speed packet classification. Comput. Commun., 54, 84–96.
29. Kennedy, A. and Wang, X. (2014) Ultra-high throughput low-power packet classification on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 22, 286–299.
30. Chang, Y.-K. and Chen, H.-C. (2011) Layered cutting scheme for packet classification. Proc. IEEE Int. Conf. Advanced Information Networking and Applications (AINA), pp. 675–681.
31. Taylor, D.E. and Turner, J.S. (2005) ClassBench: a packet classification benchmark. Proc. IEEE INFOCOM, pp. 2068–2079.
32. Xilinx, Virtex-5 Family Overview, product specification, DS100 (v5.0). http://www.xilinx.com (accessed February 21, 2009).
33. Xilinx, Virtex-6 Family Overview, product specification, DS150 (v2.5). http://www.xilinx.com (accessed August 20, 2015).

Author notes: Handling editor: Gerard Parr

© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
