TY - JOUR
AU - Devendran, Valliyammai
AB - Software-defined networking (SDN) is an emerging trend in which the control plane and the data plane are separated from each other, culminating in effective bandwidth utilization. This separation also allows multi-vendor interoperability. Link failure is a major problem in networking and must be detected as soon as possible, because when a link fails the path becomes congested and packet loss occurs, delaying the delivery of packets to the destination. Backup paths must be configured immediately when a failure is detected in the network to speed up packet delivery, avoid congestion and packet loss and provide faster convergence. Various SDN segment protection algorithms that efficiently reduce CPU cycles and flow table entries exist, but each has drawbacks. An independent transient plane technique can be used to reduce packet loss but is not as efficient when multiple flows try to share the same link. The proposed work focuses on reducing congestion, providing faster convergence with minimal packet loss and effectively utilizing link bandwidth using bandwidth-sharing techniques. Analysis and comparison with related studies show that, compared to existing schemes, this method performs better and offers a more reliable network without loss, while simultaneously ensuring the swift delivery of data packets toward the destination without congestion.

1. INTRODUCTION

Link failure is a serious issue faced by current networks, given that all active traffic is disrupted whenever a link or node fails. Data transmission between source and destination pairs becomes impossible, since the links connecting the source and destination nodes are disconnected as a result of the failure. Network reliability has become a major concern in all networks and several techniques have been proposed to provide it. Each networking device (router or switch) in a distributed network environment is responsible for handling both control information, such as routing tables, and forwarding information, such as user data. The workload of each device in such a distributed environment is high, since it has to process both the control plane and the data plane. Control information is flooded throughout the network whenever a failure occurs in a distributed environment. This flooding of control information acts as a barrier to data traffic transmission and hence consumes considerable bandwidth. Certain schemes were proposed previously to separate the control traffic and the data traffic by separating the controller from data forwarding. There were also various schemes in multi-protocol label switching (MPLS)-based networks to provide faster convergence when a failure occurs. Studies show that these schemes did not perform well. Each node has to maintain both the active and the backup paths in a distributed environment, consuming a lot of memory space. The active and backup paths are given in the form of flow entries in the node. Hence the size of the memory associated with each node must be large. The increased number of entries in the forwarding information base (FIB) increases the cycle time involved in searching the FIB, consequently prolonging the delay and causing packet loss in the network. A failure can occur in the network for various reasons: a physical-layer cable cut, software errors, hardware failure, overload of data on a transmission link or natural disasters.
The link recovery mechanisms usually involve two methods, protection and restoration. When the backup path entries are stored alongside the primary path entries in advance, the scheme is said to follow the protection method; the backup entries are used when the link fails. When the flow entries are installed in the switch only after the occurrence of a failure, it is a restoration mechanism. The restoration mechanism flushes out the primary path flow entries and installs the backup entries after the failure has occurred. The segment protection scheme uses the protection mechanism to handle link failures. Priorities are assigned in the segment protection scheme to select between the corresponding primary and backup entries. The independent transient plane (ITP) mechanism also follows the protection scheme, in the form of a transient plane that diverts the flows using a VLAN id until the backup path is configured. The transient plane is used to direct packets from any switch in the network to a specific destination switch when a failure occurs. In this manner, ITP reduces the number of flow entries. However, in both the segment protection scheme and the ITP mechanism, when multiple source and destination transmissions are initiated, handling failures with the available backup paths and bandwidth leads to congestion. Also, the stable flow entries required for transmission on the alternate backup path are not maintained in either the segment protection scheme or ITP. Therefore, the proposed congestion-free transient plane (CFTP) in a software-defined networking (SDN) environment detects congestion when a transmission is made from multiple sources to multiple destinations. The congestion is then avoided by sharing the bandwidth among the available links. SDN has been proposed to provide enhanced utilization of the data plane and support multi-vendor interoperability. The control and data planes are separated in SDN. The OpenFlow application-level protocol, running on top of TCP, is responsible for initiating and providing communication between the controller and the data plane using messages. The connection between the OpenFlow controller and a switch depends on the network type, such as LAN or WAN. The ability of the switches and controller to communicate with each other is established through the TCP/IP network model. The bandwidth is determined from the port statistics used by the controller. Placing the controller far away may increase the delay in exchanging messages with the switches. The proposed CFTP model uses a direct physical connection between the controller and the switch, i.e. an out-of-band control plane network; in an in-band control plane, a switch passing control plane traffic between the controller and another switch acts as a transit switch, which is risky and has high complexity. The traffic volume for the proposed model is lower than for the ITP, since the controller overhead is reduced. The nodes in SDN detect failures with the help of timers or keep-alive messages. When keep-alive messages are not received, or the timer expires, the corresponding node concludes that a failure has occurred and immediately sends an OpenFlow message to the controller. On receiving the OpenFlow message, the controller configures the backup path in all the associated nodes. All the nodes can continue to send traffic even after the failure, thereby reducing the load on the network and improving efficiency.
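The keep-alive based detection described above can be illustrated with a minimal, self-contained sketch. The class and function names below (PortMonitor, notify_controller) are hypothetical stand-ins, not the paper's implementation, and the 50 ms timeout is an assumption chosen to match the monitoring interval reported in Section 5.

```python
# Minimal sketch: a switch-side agent tracks the last keep-alive seen on each
# port; if a port goes quiet for longer than the timeout, the link is assumed
# down and the controller is notified so it can install the backup path.
import time

KEEPALIVE_TIMEOUT = 0.05  # 50 ms, an assumed value

class PortMonitor:
    def __init__(self, ports):
        # last time a keep-alive was seen on each port
        self.last_seen = {p: time.monotonic() for p in ports}

    def keepalive_received(self, port):
        self.last_seen[port] = time.monotonic()

    def failed_ports(self):
        now = time.monotonic()
        return [p for p, t in self.last_seen.items()
                if now - t > KEEPALIVE_TIMEOUT]

def notify_controller(port):
    # Placeholder for the OpenFlow notification (e.g. a port-status message)
    # that triggers backup-path configuration at the controller.
    print(f"link on port {port} assumed down; notifying controller")

if __name__ == "__main__":
    mon = PortMonitor(ports=[1, 2, 3])
    time.sleep(0.06)              # no keep-alives arrive for 60 ms
    mon.keepalive_received(1)     # port 1 finally reports in
    for p in mon.failed_ports():  # ports 2 and 3 are declared failed
        notify_controller(p)
```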
Each Open vSwitch in the network has a memory associated with it. The memory is utilized to store the path and flow information, such as the active path (i.e. the path with the shortest metric that is currently used by the flow) and the backup path (i.e. the path with a lower priority that comes into play when the active path is lost). The flow information in SDN is maintained in the centralized controller. When a packet first arrives at a node, the switch checks its flow table to find a match for the particular flow. The switch directly forwards the packet if a match exists. On the other hand, if a matching entry is not found, the switch forwards the packet to the controller. The controller then checks its database and updates the respective switch with the corresponding flow entry. Subsequent packets of that flow arriving at the switch can then be forwarded directly without consulting the controller, which also keeps the size of the ternary content addressable memory (TCAM) associated with the switch small. Thus, SDN overcomes the drawbacks associated with traditional networks. The major contribution of the paper is the proposed CFTP to detect link failures and avoid congestion with minimal recovery time. The manuscript is organized as follows: Section 2 reviews relevant work on link failure recovery mechanisms and states the problem definition, Section 3 describes the proposed system, Section 4 elaborates the techniques and algorithms used in the proposed system, Section 5 describes the implementation results and performance analysis of the proposed system and Section 6 presents the conclusion and future work.

2. RELATED WORKS

Network reliability plays a major role in all networking environments [1]. A loop-free technique was proposed to handle failures. The node that detects the failure forwards the packet back to the node from which it received the packet, creating a forwarding loop. The forwarding node, on receiving the packet in the backward direction, recognizes the link failure and takes an alternate path. One of the major drawbacks of this scheme is that the forwarding node sends the packet in the same direction as the primary path, and the node connected to the failed link sends the packet back to the forwarding node. This process repeats for every packet, causing unnecessary consumption of bandwidth and an increase in the forwarding delay, as each packet is forwarded twice over the same node. Effective utilization of bandwidth and prevention of its unnecessary consumption enhance the efficiency of the network [2]. Multiple spanning trees, each with a unique VLAN-ID, were used to utilize bandwidth efficiently. The links present in one spanning tree must not be members of another. When a link failure occurs, the switches simply switch the VLAN tag and forward the remaining packets through the secondary spanning tree. The advantage of this method is that there is no flooding of information; switching occurs immediately. The drawbacks are that the network must be at least two-edge-connected and that it can handle only single link failures, not multiple ones. Failures can also be identified using a keep-alive mechanism. RSVP hellos were used to identify failures: a node assumes that the network has failed when it does not receive a hello message from the adjacent node [3]. Tunnels were established and packets were re-routed to the destination when a failure occurred.
Fast re-routing was achieved as the backup paths were pre-configured, but this method is effective only on Cisco devices, as it requires the CEF engine, i.e. it works only if Cisco Express Forwarding is supported. MPLS-based networks handle link failures by maintaining primary and secondary label switched paths (LSPs). The topology consists of the ingress LSR (label switched router), core LSRs and egress LSRs [4]. The core LSR connects the ingress and egress LSRs. A fault information signal (FIS) is sent by the core LSR to the ingress LSR when a link failure occurs. Traffic is switched to the secondary LSP when the ingress LSR receives the FIS. This method provides better convergence for smaller networks. Its drawback is that the traffic switching time depends on the distance between the failure point and the ingress LSR. Apart from the increased control traffic owing to the transmission of keep-alive messages, it also requires extra storage space, as both the primary and secondary LSPs must be stored. Reducing the round trip time (RTT) is a task associated with the controller. OpenFlow switches can use the group table feature of OpenFlow to reduce propagation delay [5]. The group table holds primary and backup paths, each associated with a priority. The primary path has the higher priority, whereas the backup path has a lower priority. When the network fails, the higher-priority primary path is removed from the group table, leaving only the backup path in the table. The backup path becomes active as soon as the primary path is removed. A reduction in flow table size is achieved by using a single entry for flows that share the same paths. VLAN tags are used to identify failed links. A drawback of this method is its inability to handle multi-link failures. The RTT between the controller and the switches reduces network efficiency; reducing it allows the OpenFlow switches to maintain their own forwarding rules. BFD, the Bidirectional Forwarding Detection protocol, was used to detect link failures [6]. Each switch takes the alternate path if it detects a failure. The packets crank back towards the source to find an alternate path if no alternate path exists at the current switch. The process of cranking back continues until the controller configures the backup path in the switch. The advantage is that the switches need not wait for control information and can continue forwarding irrespective of the failure, since each switch has its own rule. The packets have to crank back if they do not have a connected node; hence this method scales well only for well-connected networks. A fast failure recovery scheme in OpenFlow-based networks allows each switch to maintain two timers, the idle timeout and the hard timer [7]. The idle timeout value is 20 ms by default, and the timers serve as keep-alive messages. A control message is sent to the controller if either timer expires. The values of the timers can be changed. This method does not flood control messages, but its drawback is that the failure cannot be notified until the timer expires.
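As an aside on the group-table protection of [5], the idea of a pre-installed backup that the switch activates locally can be sketched with an OpenFlow fast-failover group. The snippet below uses the Ryu framework purely for illustration (the paper's own experiments use OpenDaylight); the port numbers and group id are assumed values.

```python
# Sketch only: installs an OpenFlow 1.3 fast-failover group whose first live
# bucket is used; if the primary port loses liveness, the switch fails over to
# the backup bucket locally, without waiting for the controller.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class FastFailoverExample(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser

        primary_port, backup_port, group_id = 1, 2, 50  # assumed values

        buckets = [
            parser.OFPBucket(watch_port=primary_port,
                             actions=[parser.OFPActionOutput(primary_port)]),
            parser.OFPBucket(watch_port=backup_port,
                             actions=[parser.OFPActionOutput(backup_port)]),
        ]
        dp.send_msg(parser.OFPGroupMod(dp, ofp.OFPGC_ADD,
                                       ofp.OFPGT_FF, group_id, buckets))

        # Steer IPv4 traffic into the group; failover happens in the switch.
        match = parser.OFPMatch(eth_type=0x0800)
        actions = [parser.OFPActionGroup(group_id)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))
```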
2.1. Segment protection scheme

A segment protection scheme allows each link to have a primary and a backup path. The backup link directs packets to the destination whenever a failure in the link occurs [8]. The controller has to push control messages to each switch to update the flow table using OpenFlow messages when a link failure occurs. The switches use a Packet_In OpenFlow message to send information to the controller, and the controller uses a Packet_Out OpenFlow message to communicate with the switches. The number of addition and deletion operations associated with each flow table increases, and packets are lost during the period in which the update occurs. The switch has to wait for proper flow information from the controller and is unable to transmit packets until then. The primary and backup paths have a mutual relationship with each other. The number of backup paths (Nbp) is calculated by adding one to the number of intermediate switches (Ns), i.e. Nbp = Ns + 1. For large networks, the number of intermediate switches increases, proportionately increasing the number of backup paths as well as the size of the flow table. The number of entries in the TCAM table is maximized, rendering the search for entries expensive. Handling low-priority traffic in MPLS-TE networks provides a solution for handling bandwidth efficiently [9]. Bandwidth is allocated only if the link has sufficient bandwidth. The advantage is that congestion in the network is reduced and link failures caused by congestion can be averted, because the bandwidth is allocated based on availability. To allow continuous transmission of packets in case of failure, each switch is provided with a backup path associated with each destination [10]. Each switch has an idle timer and a hard timer. The idle timer expires when there is no active flow for a particular entry in the flow table. The backup paths, when stored in the flow table along with the active paths, will be flushed if they are not used by any flow and the idle timer expires. Renewal packets are prepared and periodically sent over the network to prevent the flushing of entries. The switches send a restoration message to the controller, which subsequently computes the backup and working paths and installs them in the switches when the failed link comes back up. The advantages include protecting the backup path from flushing and providing fast re-routing. The drawback is that renewal packets are flooded unnecessarily; further, as the number of nodes increases, the flow entries increase likewise. To protect the link between the controller and a switch, the controller sends a probe to the switch [11]. The probe traverses all the links in the topology and finally returns to the controller. If there is a failure, the probe does not reach the controller within the stipulated time. The probe also triggers the switch to send a reply to the controller as it visits each link, which helps the controller precisely identify the failure point, providing a more effective method of fault localization. On the other hand, if the network is congested, the probe cycle is delayed, making the controller time out and initiate the recovery process. The control messages travel through the data plane, which is a major drawback; the VLAN tag switching also causes additional delay. A congestion avoidance mechanism in SDN uses congestion detection and control [12]. Congestion detection is responsible for collecting statistical information per flow, per port and per table from all the switches and storing it. The controller thereafter uses the information to prevent congestion. The controller switches traffic onto lightly loaded paths if a link is utilized to more than 70% of its capacity. The advantage is that it proactively detects congestion in the network.
Monitoring port statistics is not a trivial task, as the stats_request and reply messages are large, which is a significant drawback. Port statistics messages provide an improved way to identify congestion, where the controller continuously monitors the port status of all the switches [13]. Traffic is switched if the link utilization of the port reaches 70% of the link capacity, after which the utilization of the switch is checked. Traffic is restored if the utilization drops to 50%. The drawback is that if the traffic is busy only for a short time, the controller still detects congestion and switches the traffic, an unnecessary manoeuvre as the congestion is temporary. An enhanced method, compared to IP fast reroute and shortest path re-calculation, uses fast reroute and load balancing [14]. An SDN-designated switch is selected; it encapsulates packets from the failure point and re-directs them through the alternate path to provide effective load balancing and fast re-routing. It provides faster delivery to the destination by using a load-balancing mechanism; however, the procedure to select the SDN-designated switch is tedious. Local re-routing to reduce congestion in cloud data centers connects all the switches to the controller, which continuously probes the switches to find the link utilization [15]. The flow is re-routed through the alternate path if the link utilization exceeds 75%; otherwise it is routed through the normal path. Efficient utilization of a link is achieved and congestion is prevented, but only after a link fails does the routing engine begin the search for the alternate path and transmit packets. An SDN-based multipath QoS solution allows each port of the OpenFlow switch to maintain two queues [16]. The controller continuously probes the switches to check the status of the queues. Once the queue length exceeds a certain threshold, the controller assumes that congestion is about to occur, immediately calculates multiple paths to reach the destination and updates the flow tables so that congestion can be avoided. The advantages include congestion avoidance, packet-loss avoidance and load balancing across multiple backup paths. There are two drawbacks: if a single link is overloaded for only a few seconds, the controller starts calculating alternate paths, which is unnecessary; and when congestion is about to occur, the controller detects it, calculates alternate flows and updates the flow tables by flooding the information, consequently increasing the number of control messages. Segment routing (SR) is a technology that enforces effective routing strategies [17]. Two implementations of SR are as follows: 1. OpenFlow-based software-defined networks (SDN based); 2. Path computation element (PCE based). Different segment-list configurations ensure that rerouting is achieved without packet loss. SR uses a centralized control plane that performs the path computation and sends the computed segment list back to the node that issued the request; SR can thus re-route in negligible time without loss of traffic. A request is received by the request handler, SR collects the required information from its databases, and the controller discovers the locations of the source and destination networks using the information from the TED. No specific algorithm is proposed; rather, SR itself is used and the least congested path is selected among the candidate paths. If the path reaches the destination, the computed segment list is returned.
If not, a shortest-path element is inserted until the path reaches the destination. An experimental demonstration of congestion in software-defined networks has also been realized [18]. Links become congested when the number of flows between a source and destination pair increases and there are no proper QoS or scheduling algorithms to manage the flows. The increased congestion leads to packet loss and reduces throughput in the network. A loss-free multi-pathing solution for data center networks using a software-defined networking approach combines congestion control and multi-pathing [19]. The congested node sends information to the source whenever congestion occurs in the network. Each egress access switch has a column in the routing table that contains the path load and sends information to the source switch when the path load exceeds the link capacity. The controller collects information from the switches and provides an alternate path if it finds congestion. The drawback is that it requires modifications to the flow table.

Table 1. Classification of related works.
Mechanism | Time efficiency | Space efficiency | Topological adaptability
Link failure handling [5, 6, 10] | Recovery time reduction | High memory space | Multiple topologies
Tunnel-based approach [3, 23] | Delay caused | High memory space | Random topology
Rerouting and bandwidth utilization for congestion [12, 13, 19] | Depends on rerouting mechanism | High memory space | Random topology
Segment protection scheme [8] | High recovery time | Memory space efficiency is less | Ring topology
SR scheme [17] | Efficient in controller setup time | — | Random topology
ITP [20] | Recovery time is reduced | Flow entries are reduced | Grid topology
Shortest path algorithms [21, 22, 24] | Time efficiency is less | Depends on mechanism used | Random topology

Table 2. Comparison of CFTP with segment protection and ITP.
Schemes | Average recovery time (ms) | Max. no. of packets stored in queue | Max. stable flow entries | Max. link utilization (%) | Max. no. of bytes transmitted (kbps)
Segment protection | 79 | 1100 | 99 | 90 | 8
ITP | 64 | 1000 | 140 | 85 | 10
CFTP | 50 | 200 | 280 | 25 | 39

2.2. ITP

The ITP design uses two planes, the working plane and the transient plane, to reduce packet loss during the switchover from the primary to the backup path [20]. The working plane holds the primary path for each source–destination pair, whereas the transient plane temporarily switches packets during link failure. Packets immediately follow the transient plane to reach the destination when a link fails. The controller sends a control signal to all the switches to configure the backup path when it learns that there has been a failure. Once the backup path has been configured, packets can be routed through it. The drawback of this method is that the transient plane switchover is triggered by the controller, which causes additional delay. When two flows share the same link, with the link serving as the primary path for one flow and the transient path for the other, packets going through the transient path are dropped.
Transient paths are flushed out by the switches if they remain idle for too long. A comparison between Open Shortest Path First (OSPF) and the Routing Information Protocol (RIP) in both single and multiple link failure recovery scenarios was presented [21]. The simulations were categorized using several metrics: cost, hop count, routing convergence activity, convergence duration and routing traffic sent. The results showed that OSPF should be preferred over RIP under frequent single and multiple link failures. However, the study is based on a metropolitan area network with a mesh topology, where redundant connections are common, which limits the efficiency gains. An algorithm named Dijkstra's disjoint spanning trees edge cuts was proposed to reduce the recovery time when rerouting the network [22]. The algorithm is based on the ideas of two well-known techniques, Dijkstra's algorithm and disjoint spanning tree edge cuts. Shortest-path links among the nodes were created from their distance vectors using Dijkstra's algorithm and a minimum spanning tree was built, which helps the system rebuild the network despite failures. Compared to the arc-disjoint spanning tree approach, the algorithm performs well under multiple link failures in networks. Although the system effectively offsets the network traffic, it considers only a small network and a single performance metric for comparison. A methodology addressing multi-link failures, which may result in heavy packet loss and network performance degradation, was also presented; the idea was to optimize the efficiency of the fast rerouting mechanism when there are multi-link or node failures in the network [23]. The authors proposed a model for interface-specific routing (ISR) and used a tunneling-on-demand (TOD) approach to minimize the number of tunnels, developing a basic ISR algorithm for path computation and an ESC Tunnel algorithm for tunnel protection. Although TOD can achieve a 100% protection ratio, rerouting is delayed since the routing is not monitored at periodic intervals. A new SDN-based architecture for the Telkom University topology was proposed to deal with network link failures [24]. The alternate link to be chosen after a failure is determined by the Dijkstra and Bellman-Ford algorithms, which find the shortest path among the nodes of a network; one considers cost while the other focuses on hop count. Performance metrics were calculated from the highest bandwidth value, the number of hops, the cost of traveling from one node to another and the lowest delay value, with Dijkstra's algorithm obtaining slightly better results than Bellman-Ford, which must additionally check for negative-weight cycles among the nodes. Although Dijkstra's algorithm gives better results in finding alternatives in the Telkom University topology, no further solutions were discussed for multi-link failure restoration during disasters. The classification of related works is shown in Table 1.

2.3. Problem definition

When multiple flows are transmitted from sources to destinations in a transient-plane environment and a link fails, the alternate path of the first transmission comes to occupy the link carrying the primary path of the second transmission, resulting in congestion. The congested network may in turn lead to a temporary link failure. Detecting and avoiding this congestion is required to handle and recover from link failures in such an environment.
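As an illustrative calculation with assumed numbers (not taken from the paper), suppose a 6 Mbps flow that loses its primary path is diverted onto a 10 Mbps link that already carries an 8 Mbps primary flow. Anticipating the utilization factor used in Section 4, the offered load then exceeds the link capacity,
$$\rho=\frac{\lambda}{\mu}=\frac{8+6}{10}=1.4>1,$$
so the shared link congests and packets are dropped unless the bandwidth is shared between the two flows, which is exactly the situation CFTP targets.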
FIGURE 1. CFTP system architecture.

3. PROPOSED SYSTEM

The CFTP system architecture is shown in Fig. 1. It consists of seven mechanisms, each of which is triggered when an action occurs in the network. Each mechanism executes certain tasks and hands over control to the next mechanism, which executes other tasks. This process of monitoring and executing tasks continues until the network is free from link failures and congestion.

3.1. Path configuration

The path configuration mechanism is responsible for configuring the active and the backup paths. The active path holds all the working paths between every source–destination pair and is the path used by each switch in the network to forward traffic to the corresponding destination. The switch holds this information until the path is affected by a failure. The backup path is used to reach the destination in case the active path fails. This information is stored in the controller, and if a switch reports a failure, the controller configures the switch with the backup path. The backup path is not configured in the switch unless a failure occurs.

3.2. Transient plane construction

The transient plane, which uses a tree topology, is referred to as a transient tree. The transient path is assumed to be always available. The transient plane construction mechanism is responsible for constructing the transient plane by considering all the source–destination pairs; it contains all the paths needed to reach a destination. The transient tree is a unique tree that connects all the source and destination pairs and is also capable of supporting multiple link failures. The controller computes the transient plane and places it in all the switches. The switches do not use this plane until they detect a failure; only in case of failure is the transient plane used. Whenever a failure occurs, the transient tree guides packets from any switch to the specific destination switch, called the root, in the network.

3.3. Link failure detection

The link failure detection mechanism is responsible for detecting and reporting a link failure in the network. Once the network is up and running, each Open vSwitch sends keep-alive messages along each link to its neighbor switch. The switch assumes that there is a failure if it does not receive a keep-alive message within the stipulated time on the designated port. The switches detect the failure and send a control message to the controller to switch the traffic onto the alternate path. The alternate path is the transient plane path. If the alternate path is occupied, the controller takes the necessary steps to prevent congestion and packet loss in the network.

3.4. Transient plane switchover

The transient plane switchover mechanism is responsible for providing temporary backup paths until the controller configures the backup path in the switch. This mechanism prevents congestion and packet loss.

FIGURE 2. Congestion detection.
FIGURE 3. Bandwidth sharing.

3.5. Transient paths vs backup paths

Once the failure is detected, the transient plane switchover diverts the traffic through the transient plane paths using the VLAN link id until the backup paths are configured.
The transient plane helps avoid packet loss and handles multiple link failures on a temporary basis. The transient plane is used immediately after the failure, similar to a protection mechanism, whereas the backup path is used only after the failure, once it has been configured.

3.6. Recovery

The recovery mechanism is responsible for preventing congestion and packet loss in the event of failure by triggering the other mechanisms. The link failure detection triggers recovery when a failure occurs. The recovery then triggers congestion detection, which in turn triggers bandwidth sharing if congestion is detected. The recovery mechanism also calls the path configuration mechanism to install backup paths in the switch.

3.7. Congestion detection

The congestion detection mechanism, shown in Fig. 2, is invoked by the recovery mechanism. Congestion may occur whenever traffic is re-routed through the transient plane, as the path used by the transient plane may be occupied by other flows. The congestion detection mechanism overcomes this and eliminates packet loss by detecting the congestion and reporting it. If the path is suspected to be congested, this mechanism invokes bandwidth sharing, which allows both flows to share the link. Alternatively, when the link is not capable of supporting multiple flows, it sends a control message to the controller to take the necessary steps to divert the traffic.

3.8. Bandwidth sharing

The bandwidth sharing mechanism, shown in Fig. 3, is responsible for allocating equal bandwidth when there are two or more flows. The congestion detection mechanism triggers bandwidth sharing if it detects congestion. The purpose of this mechanism is to share bandwidth equally between the flows and eliminate packet drops.

4. TECHNIQUES AND ALGORITHMS

The Weighted Node Dijkstra's Algorithm is used to construct the paths. The algorithm takes into account not only the link cost but also the cost associated with the nodes. The cost varies depending upon the type of links and OpenFlow switches being used. Preference is given to high-end switches and links (optical cables) capable of handling large traffic volumes. The high-end switches are OpenFlow switches capable of handling a large number of flows and extended functionalities. The algorithm calculates the shortest path tree for each node; the tree can then be used to install the primary path flow entry in each OpenFlow switch. The complexity of the algorithm is O(n²), since the iteration has to be carried out for each node 'n' in the graph. The input to this algorithm is a graph 'G' with edges 'E' and vertices 'V', together with link_cost, the cost of each link, and node_cost, the cost of each node. The algorithm produces the shortest path tree. Dist[src] is the distance of the source, dist[n] the distance of every node other than the source, parent[v] the predecessor of the current node and min_element the element with the minimum cost in the priority queue. Let n1, n2, n3, …, nn be the set of all nodes and l1, l2, l3, …, ln be the set of all links that belong to the shortest path between the source 's' and destination 'd'. The total cost to reach the destination from the source is given by equation 1. $$\begin{equation} C=\sum_{i=1}^{n-1} l_i + \sum_{j=1}^{n} n_j \end{equation}$$ (1)
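A minimal sketch of the weighted-node variant of Dijkstra's algorithm described above is shown below. The graph encoding and function names are assumptions for illustration, but the relaxation step follows the stated idea: reaching a neighbour adds that neighbour's node cost to the usual link cost, so the resulting tree realizes the total cost C of equation 1.

```python
# Sketch of the Weighted Node Dijkstra's Algorithm: each relaxation adds the
# node cost of the neighbour as well as the link cost, so high-end switches
# (low node_cost) and high-capacity links (low link_cost) are preferred.
# Graph encoding and names are illustrative, not the paper's code.
import heapq

def weighted_node_dijkstra(graph, node_cost, src):
    """graph: {u: {v: link_cost}}, node_cost: {v: cost}. Returns (dist, parent)."""
    dist = {v: float("inf") for v in graph}
    parent = {v: None for v in graph}
    dist[src] = node_cost[src]              # the source node's own cost
    pq = [(dist[src], src)]                 # priority queue of (cost, node)
    while pq:
        d, u = heapq.heappop(pq)            # min_element
        if d > dist[u]:
            continue                        # stale entry
        for v, link_cost in graph[u].items():
            cand = d + link_cost + node_cost[v]
            if cand < dist[v]:
                dist[v], parent[v] = cand, u
                heapq.heappush(pq, (cand, v))
    return dist, parent                     # shortest-path tree via parent[]

if __name__ == "__main__":
    # Toy 4-switch topology with assumed link and node costs.
    graph = {"s1": {"s2": 1, "s3": 4}, "s2": {"s1": 1, "s4": 2},
             "s3": {"s1": 4, "s4": 1}, "s4": {"s2": 2, "s3": 1}}
    node_cost = {"s1": 1, "s2": 1, "s3": 3, "s4": 1}
    dist, parent = weighted_node_dijkstra(graph, node_cost, "s1")
    print(dist["s4"], parent["s4"])  # 6 (links 1+2 plus nodes 1+1+1), via s2
```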
The transient plane is constructed using the output of the path construction. The shortest path between each source–destination pair is taken into account and the alternate path is computed by assuming, in turn, that each link in the path has failed. A unique transient path for the entire topology is computed by repeating this process for all the paths. The transient tree is constructed to reduce packet loss and recovery time. If there is no transient tree, the alternate path has to be taken either through a protection or a restoration mechanism. For CFTP to work, the network must construct the transient tree after learning the topology and then perform the multiple flow transmissions. Failure detection uses the BFD protocol to find link failures in the network. BFD is configured on all the links in the topology with the desired hello interval. The link is assumed to be down when the hello packet fails to reach the peer; traffic is then diverted through the alternate path and a control message indicating the failure is sent to the controller. The transient plane switchover continuously monitors the port states. If the link state is down, it immediately uses the alternate path, i.e. the transient path, to switch the traffic without packet loss. The port states are monitored by enabling BFD on each link. The 'link state down' event indicates that the link is down and triggers the transient plane switchover, which switches traffic through the alternate path as shown in Fig. 1. The congestion detection mechanism is invoked when the session state of the BFD is down. The chief task of this mechanism is to check the port statistics to determine the difference between the transmitted and the received bytes. Congestion in the network is inferred if the transmitted bytes exceed the received bytes. Once congestion is detected, the bandwidth sharing mechanism is triggered. Let λ be the arrival rate and μ the service rate. From queueing theory, $$\begin{equation} \rho =\frac{\lambda}{\mu} \end{equation}$$ (2) Equation 2 gives the utilization factor, which is less than 1 for a stable system; when ρ > 1, congestion occurs. The bandwidth-sharing algorithm creates a queue for each flow and maps each flow to a queue, where β represents the total bandwidth of the link, βf denotes the bandwidth allocated to each flow, F denotes the set of all flows and Q denotes the queue created for each flow. The algorithm takes the set of flows and the total bandwidth as inputs and produces the queues and the bandwidth for each flow as the output.
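A minimal sketch of the bandwidth-sharing step is given below, assuming (as Section 3.8 states) that the link bandwidth β is divided equally among the flows in F. The function and variable names are hypothetical; on the testbed the queues would be realized as rate-limited egress queues on the Open vSwitch ports rather than Python objects.

```python
# Sketch: map each flow to its own queue and give every flow an equal share
# of the link bandwidth (beta). Names are illustrative; in practice the
# queues would be OVS egress queues with a configured maximum rate.
from collections import deque

def share_bandwidth(flows, beta):
    """flows: iterable of flow ids F; beta: total link bandwidth in Mbps.
    Returns ({flow: queue Q}, {flow: per-flow bandwidth beta_f})."""
    flows = list(flows)
    if not flows:
        return {}, {}
    beta_f = beta / len(flows)               # equal share per flow
    queues = {f: deque() for f in flows}     # one queue Q per flow
    rates = {f: beta_f for f in flows}
    return queues, rates

if __name__ == "__main__":
    # Two flows contend for a 10 Mbps link after a failover (assumed figures).
    queues, rates = share_bandwidth(["h1->h2", "h3->h4"], beta=10.0)
    print(rates)                      # {'h1->h2': 5.0, 'h3->h4': 5.0}
    queues["h1->h2"].append("pkt-1")  # packets of a flow wait in their own queue
```

Mapping each flow to its own rate-limited queue is what makes the multi-queue analysis of equations 7 to 10 applicable.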
Let λ1, λ2, …, λn be the arrival rates of the flows, assumed to follow a Poisson distribution, and let μ denote the exponentially distributed rate at which the flows are served. Considering the case of a single-queue system, the probability of 'n' packets in the queue is given by equation 3. $$\begin{equation} {\mathrm{P}}_{\mathrm{n}}=\left(1-\frac{\sum_{i=1}^{n}{\lambda}_{i}}{\mu}\right){\left(\frac{\sum_{i=1}^{n}{\lambda}_{i}}{\mu}\right)}^{\mathrm{n}} \end{equation}$$ (3) The equation shows that when the number of flow arrivals increases, the probability of packets waiting in the queue also increases. $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}=\frac{\lambda^2}{\mu \left(\mu -\lambda \right)} \end{equation}$$ (4) where $\lambda=\sum_{i=1}^{n}\lambda_i$. Equation 4 gives the number of packets waiting in the queue. For a stable system λ < μ; approximating μ − λ ≈ μ, equation 4 becomes $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}=\frac{\lambda^2}{\mu^2} \end{equation}$$ (5) Consider two arrivals $\lambda_1$ and $\lambda_2$; since everything is buffered in the same queue, $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}=\frac{{\left({\lambda}_1+{\lambda}_2\right)}^2}{\mu^2}=\frac{{\lambda_1}^2+2{\lambda}_1{\lambda}_2+{\lambda_2}^2}{\mu^2} \end{equation}$$ (6) Let q1, q2, …, qn be the queues created in the Open vSwitch, each queue capable of serving a single flow. The probability of 'n' packets in each queue is given by equation 7. $$\begin{equation} {\mathrm{P}}_{\mathrm{n}}=\left(1-\frac{\lambda}{\mu}\right){\left(\frac{\lambda}{\mu}\right)}^{\mathrm{n}} \end{equation}$$ (7) The total number of packets waiting across the queues is $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}={\sum}_{\mathrm{i}=1}^{\mathrm{n}}{\mathrm{L}}_{\mathrm{q}\mathrm{i}} \end{equation}$$ (8) Let q1 and q2 be the two queues in switch s1. Equations 9 and 10 follow: $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}={\mathrm{L}}_{\mathrm{q}1}+{\mathrm{L}}_{\mathrm{q}2} \end{equation}$$ (9) $$\begin{equation} {\mathrm{L}}_{\mathrm{q}}=\left(\frac{{\lambda_1}^2}{{\mu_1}^2}\right)+\left(\frac{{\lambda_2}^2}{{\mu_2}^2}\right) \end{equation}$$ (10) Comparing equations 6 and 10, it can be shown that the packet loss for a single-queue system is higher than for a multi-queue system.

5. IMPLEMENTATION AND RESULTS

The simulation setup is created using Mininet and the OpenDaylight controller. Mininet is used to create an environment that consists of OpenFlow switches connected by 10 Gb fiber links. The Mininet environment is then interfaced to the controller through port 6633, the default port for connecting to the controller. The topology is created using a Python script. The experimental topology consists of five switches (s1, s2, s3, s4, s5), four hosts (h1, h2, h3, h4) and a controller C0, which are interconnected and interfaced to the OpenDaylight controller as shown in Fig. 3. Three tables are maintained in the Mininet emulation environment: the link_id, transient path and routes tables. Traffic is initiated from h1 to h2 and simultaneously from h3 to h4. Initially, the two flows are independent of each other as they take individual primary paths. The link between s2 and s4 is then failed, which pushes transient entries into the switches. The transient entries create queues in the switches and allow the link to be shared between both flows without loss or congestion in the network. When a link failure is identified, an OFPPR_DELETE port status message is sent to the controller. The failed link and port number are identified by the controller using the link_id table. The flow is rerouted onto the transient plane. If congestion is identified through the port statistics, the bandwidth sharing mechanism is enabled.

FIGURE 4. Stable flow entries.
FIGURE 5. Number of flows vs bytes transmitted.
FIGURE 6. Recovery time.
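The emulated setup can be reproduced with a short Mininet script along the following lines. The specific inter-switch links and the controller address below are assumptions for illustration, since the paper specifies only the numbers of switches and hosts and the controller port; only the s2-s4 link failure is taken directly from the description above.

```python
#!/usr/bin/env python
# Sketch of the emulated topology: five OVS switches, four hosts and a remote
# OpenDaylight controller on the default OpenFlow port 6633. The inter-switch
# links chosen here are assumptions made so that a backup path exists.
from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.link import TCLink
from mininet.cli import CLI

def build():
    net = Mininet(controller=None, switch=OVSSwitch, link=TCLink)
    net.addController('c0', controller=RemoteController,
                      ip='127.0.0.1', port=6633)   # OpenDaylight (placeholder IP)
    s = {i: net.addSwitch('s%d' % i) for i in range(1, 6)}
    h = {i: net.addHost('h%d' % i) for i in range(1, 5)}
    # Hosts on edge switches; a redundant core so an alternate path exists.
    net.addLink(h[1], s[1]); net.addLink(h[3], s[1])
    net.addLink(h[2], s[5]); net.addLink(h[4], s[5])
    for a, b in [(1, 2), (2, 4), (1, 3), (3, 4), (4, 5), (2, 5)]:
        net.addLink(s[a], s[b], bw=10)              # rate-limited links via TCLink
    net.start()
    return net

if __name__ == '__main__':
    net = build()
    net.configLinkStatus('s2', 's4', 'down')        # emulate the s2-s4 link failure
    CLI(net)
    net.stop()
```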
The experimental demonstration of a real implementation is carried out by replacing the server hosting the emulated network with a real OpenFlow-based Ethernet network with five switches and four hosts. The five switches are implemented on five personal computers (Intel Core 4 CPU, 2.40 GHz, 2 GB RAM, Ubuntu 10.04, kernel 2.6.32-25-generic) running Open vSwitch to support the proposed mechanism. Each computer is equipped with a network interface card (Intel Quad Port server adapter, PCI Express) providing four Ethernet interfaces, i.e. two for the links, one for the controller and one for the connection to an end host. The stable flow entries are those that are actually required for the alternate path during link failures. During multi-source to multi-destination transmission, there is a chance that a flow entry that could be used for an alternate path gets flushed. The ITP has a transient path independent of the primary path and the backup path, which is used to handle link failures on a temporary basis. Figure 4 shows that the number of stable flow entries in the segment protection and ITP schemes is lower than the number of stable entries in the CFTP. The difference arises because the ITP uses no renewal packets, which are what protect the flow entries against flushing. The segment protection scheme uses the predefined backup path entries installed in the switch and deletes the primary entries during failure. The CFTP requires no renewal packets because the flow entries are installed at the point of failure; therefore it can clearly identify the stable flow entries of the alternate path. The 'ovs-ofctl dump-flows xx' command is used to inspect the flow entries. Figure 5 shows the number of bytes transmitted with respect to the number of flows. An increase in the number of flows in the segment protection and ITP schemes causes a decrease in the number of bytes transmitted. This is because multiple flows trying to use the same link cause congestion in the network and obstruct the traffic flow. Neither of the existing techniques makes use of queues; hence the number of bytes transmitted is low compared to CFTP. The CFTP has a constant transmission rate because the flows are rate-limited and stored in queues, thereby sharing the bandwidth of the link. The statistics module is enabled at the controller, and port stats messages are used at regular intervals to compute the bandwidth. Figure 6 shows the time taken to recover from failure. The average recovery time is 79 ms for the segment protection scheme and 64 ms for the ITP design, whereas it is 50 ms for the CFTP design. The reduction in recovery time is achieved by allowing the controller to monitor the network every 50 ms.

FIGURE 7. Packets stored in queue.
FIGURE 8. Link utilization.
FIGURE 9. Probability of failure with link utilization.

Figure 7 shows the number of packets in the queue with respect to the number of flows. As the number of flows increases, the number of packets stored in the queue grows in the case of segment protection and ITP, because the default queue is used to store the packets. The number of packets stored in the queue is constant in the case of the CFTP because each flow is mapped to its own queue that stores the incoming packets.
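The link utilization plotted in Figures 8 and 9 can be derived from successive port-statistics samples. The sketch below shows one way this might be computed at the controller; the function names, sample values and the transmitted-versus-received check are illustrative rather than taken from the paper's implementation.

```python
# Sketch: compute link utilization from two consecutive port-statistics
# samples (cumulative byte counters), as a controller polling every 50 ms
# might do, plus the CFTP-style congestion check on a monitored port pair.
def link_utilization(prev_tx_bytes, curr_tx_bytes, interval_s, capacity_bps):
    """Fraction of link capacity used over the sampling interval."""
    bits_sent = (curr_tx_bytes - prev_tx_bytes) * 8
    return bits_sent / (interval_s * capacity_bps)

def congested(tx_bytes, rx_bytes):
    """Transmitted bytes exceeding received bytes is taken as congestion."""
    return tx_bytes > rx_bytes

if __name__ == "__main__":
    # Two samples 50 ms apart on an (assumed) 10 Mbps link.
    util = link_utilization(prev_tx_bytes=1_000_000, curr_tx_bytes=1_025_000,
                            interval_s=0.05, capacity_bps=10_000_000)
    print(f"utilization = {util:.0%}")                      # 25000 B * 8 / 500000 b = 40%
    print(congested(tx_bytes=1_025_000, rx_bytes=990_000))  # True
```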
Figure 8 shows link utilization with respect to the traffic passing through the link. The queue configuration messages, OFPT_QUEUE_GET_CONFIG_REQUEST for the request and OFPT_QUEUE_GET_CONFIG_REPLY for the reply, are used. The possibility of congestion is high in both the segment protection and ITP schemes when two different flows are transmitted and a link failure occurs in the network topology. The segment protection and ITP show increasing link utilization with respect to traffic, whereas the link utilization in the CFTP remains constant. The variation arises because the CFTP limits traffic by setting the maximum transmission rate of the queue; the CFTP maintains an average of 40% link utilization. Figure 9 shows the probability of failure with respect to link utilization. The sFlow counters are used to determine the link utilization. The segment protection and ITP have a higher probability of failure than the CFTP, since they do not maintain any queue to regulate the traffic. The difference exists because the CFTP proactively determines link utilization and prevents the link from being over-utilized by keeping incoming flows away from links that already have high utilization. The summary of the results obtained by comparing CFTP with segment protection and ITP is shown in Table 2.

6. CONCLUSION AND FUTURE WORK

The time taken to switch packets over to the transient plane is reduced with this method, thereby reducing congestion and allowing multiple flows to share the same path by using the bandwidth sharing method. The number of packets lost during the transition from the primary to the backup path is also minimized, and the packet loss that occurs in the existing methods when multiple flows share the same path is likewise decreased drastically. The proposed work concentrates on reducing switchover time, packet loss and congestion. Furthermore, this work could be extended to impart QoS in the network, handling high-priority packets with marginal delay, lessening recovery time and providing enhanced switching and network reliability, which would allow all the flows to continue transmission with minimal propagation delays during network failures. The work could be further extended to a wireless environment to reduce congestion when data are shared between multiple sources and destinations over a shared transmission medium. The proposed solution also has an impact in the cloud environment for the scheduling and allocation of resources from the service provider to the end user. An SR-based SDN controller can be used to test the proposed system in the future.

References

1. Nelakuditi, S., Lee, S., Yu, Y., Zhang, Z.L. and Chuah, C.N. (2007) Fast local rerouting for handling transient link failures. IEEE/ACM Trans. Netw., 15, 359–372.
2. Gopalan, A. and Ramasubramanian, S. (2014) Fast recovery from link failures in ethernet networks. IEEE Trans. Rel., 63, 412–426.
3. Naveed, S. and Kumar, S.V. (2014) MPLS traffic engineering–fast reroute. Int. J. Sci. Res., 3, 1796–1801.
4. Arunkumar, C.K. (2014) An efficient fault tolerance model for path recovery in MPLS networks. Int. J. Innov. Res. Comput. Commun. Eng., 2, 4546–4551.
5. Thorat, P., Challa, R., Raza, S.M., Kim, D.S. and Choo, H. (2016) Proactive failure recovery scheme for data traffic in software defined networks. In Proc. IEEE NetSoft Conference and Workshops (NetSoft), Seoul, South Korea, 6–10 June, pp. 219–225. IEEE, New Jersey.
6. Van Adrichem, N.L., Van Asten, B.J. and Kuipers, F.A. (2014) Fast recovery in software-defined networks.
In Third European Workshop on Software Defined Networks (EWSDN), London, UK, 1–3 September, pp. 61–66. IEEE, New Jersey.
7. Sharma, S., Staessens, D., Colle, D., Pickavet, M. and Demeester, P. (2011) Enabling fast failure recovery in OpenFlow networks. In Proc. 8th International Workshop on Design of Reliable Communication Networks (DRCN), Krakow, Poland, 10–12 October, pp. 164–171. IEEE, New Jersey.
8. Sgambelluri, A., Giorgetti, A., Cugini, F., Paolucci, F. and Castoldi, P. (2013) OpenFlow-based segment protection in Ethernet networks. J. Opt. Commun. Netw., 5, 1066–1075.
9. Lu, Z., Jayabal, Y., Fei, Y., Fumagalli, A., Galimberti, G. and Martinelli, G. (2015) Effects of multi-link failures on low priority traffic in MPLS-TE networks. In Proc. 11th Int. Conf. Design of Reliable Communication Networks (DRCN), Kansas City, MO, USA, 24–27 March, pp. 103–106. IEEE, New Jersey.
10. Padma, V. and Yogesh, P. (2015) Proactive failure recovery in OpenFlow based software defined networks. In Proc. 3rd International Conference on Signal Processing, Communication and Networking (ICSCN), Chennai, India, 26–28 March, pp. 1–6. IEEE, New Jersey.
11. Lee, S.S., Li, K.Y., Chan, K.Y., Lai, G.H. and Chung, Y.C. (2015) Software-based fast failure recovery for resilient OpenFlow networks. In Proc. 7th International Workshop on Reliable Networks Design and Modeling (RNDM), Munich, Germany, 5–7 October, pp. 194–200. IEEE, New Jersey.
12. Gholami, M. and Akbari, B. (2015) Congestion control in software defined data center networks through flow rerouting. In Proc. 23rd Iranian Conference on Electrical Engineering (ICEE), Tehran, Iran, 10–14 May, pp. 654–657. IEEE, New Jersey.
13. Song, S., Lee, J., Son, K., Jung, H. and Lee, J. (2016) A congestion avoidance algorithm in SDN environment. In Proc. Int. Conf. Information Networking (ICOIN), Kota Kinabalu, Malaysia, 13–15 January, pp. 420–423. IEEE, New Jersey.
14. Chu, C.Y., Xi, K., Luo, M. and Chao, H.J. (2015) Congestion-aware single link failure recovery in hybrid SDN networks. In Proc. IEEE Conference on Computer Communications (INFOCOM), Kowloon, Hong Kong, 26 April–1 May, pp. 1086–1094. IEEE, New Jersey.
15. Kanagevlu, R. and Aung, K.M.M. (2015) SDN controlled local re-routing to reduce congestion in cloud data center. In Proc. International Conference on Cloud Computing Research and Innovation (ICCCRI), Singapore, 26–27 October, pp. 80–88. IEEE, New Jersey.
16. Jinyao, Y., Hailong, Z., Qianjun, S., Bo, L. and Xiao, G. (2015) HiQoS: An SDN-based multipath QoS solution. China Commun., 12, 123–133.
17. Sgambelluri, A., Paolucci, F., Giorgetti, A., Cugini, F. and Castoldi, P. (2016) Experimental demonstration of segment routing. J. Light. Technol., 34, 205–212.
18. Jha, N.K., Agarwal, N. and Singh, P. (2015) Realization of congestion in software defined networks. In Proc. International Conference on Computing Communication & Automation (ICCCA), Noida, India, 15–16 May, pp. 535–539. IEEE, New Jersey.
19. Kitsuwan, N., McGettrick, S., Slyne, F., Payne, D.B. and Ruffini, M. (2015) Independent transient plane design for protection in OpenFlow-based networks. IEEE/OSA J. Opt. Commun. Netw., 7, 264–275.
20. Fang, S., Yu, Y., Foh, C.H. and Aung, K.M.M. (2013) A loss-free multipathing solution for data center network using software-defined networking approach. IEEE Trans. Magn., 49, 2723–2730.
21. Ajani, A.A., Ojuolape, B.J., Ahmed, A.A., Aduragba, T. and Balogun, M. (2017) Comparative performance evaluation of open shortest path first, OSPF and routing information protocol, RIP in network link failure and recovery cases. In Proc. IEEE 3rd International Conference on Electro-Technology for National Development (NIGERCON), Owerri, Nigeria, 7–10 November, pp. 280–288. IEEE, New Jersey.
22. Bhor, M. and Karia, D. (2017) Network recovery using IP fast rerouting for multi link failures. In Proc. Int. Conf. Intelligent Computing and Control (I2C2), Coimbatore, India, 23–24 June, pp. 1–5. IEEE, New Jersey.
23. Yang, Y., Xu, M. and Li, Q. (2018) Fast rerouting against multi-link failures without topology constraint. IEEE/ACM Trans. Netw., 26, 384–397.
24. Nastiti, A., Rakhmatsyah, A. and Nugroho, M.A. (2018) Link failure emulation with Dijkstra and Bellman-Ford algorithm in software defined network architecture (case study: Telkom University topology). In Proc. 6th Int. Conf. Information and Communication Technology (ICoICT), Bandung, Indonesia, 3–5 May, pp. 135–140. IEEE, New Jersey.

© The British Computer Society 2019. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

TI - Congestion-Free Transient Plane (CFTP) Using Bandwidth Sharing During Link Failures in SDN
JF - The Computer Journal
DO - 10.1093/comjnl/bxz137
DA - 2020-06-18
UR - https://www.deepdyve.com/lp/oxford-university-press/congestion-free-transient-plane-cftp-using-bandwidth-sharing-during-xcmxBNKl30
SP - 1
VL - Advance Article
IS -
DP - DeepDyve
ER -