Author: Yi-Fan, Lee

Abstract

People are constantly using mobile technologies to exchange perspectives across the world. The search services they use, however, belong to centralized systems that may be easily censored. The Peer-to-Peer retrieval system was created to impede censorship of online information, but the decentralized nature of P2P makes it difficult to infer information that cannot be measured directly, such as the proportion of subversion, selfish nodes, network size, or churn rate. Recent advances have pushed providers toward large-scale wireless networks where data retrieval is difficult. Thus, we propose a defense mechanism that can: (1) tackle censorship issues; (2) employ a probability density function, an exponentially weighted moving average and modified Chi-squared tests to estimate the proportion of malicious and selfish nodes; (3) defend against malicious and selective-forwarding attacks by adjusting the number of forwarding levels and requests to ensure high-match probability; (4) maintain high-retrieval rates even in large and highly mobile networks; and (5) guarantee robustness compared to other search systems. A series of experiments demonstrated our algorithm’s high-retrieval rate, reasonable costs, mobility resilience, and robustness, showing that the algorithm can work well when the network size is large and/or has a large proportion of selfish nodes, malicious nodes and mobile nodes.

1. INTRODUCTION

The Internet is the pre-eminent source of information on this planet. The Internet, which is distributed, uncontrolled, unbiased and dispassionate, smooths the turbulent flow of data. It has changed how we live and how we perceive. As networked devices become ubiquitous, people across the world use them to share information and compare perspectives.
As of this writing, online information retrieval is highly centralized for efficiency and economy of scale, and the public’s trust in that access depends on the benign administration of centralized search engines. Unfortunately, history teaches that such administrators cannot be trusted to stay benign forever. These centralized search systems depend on a small number of nodes that can be easily censored or can easily become malicious, as in the case of widespread Internet censorship in China [1]. Some decentralized search and retrieval systems [2–8] have been proposed with an eye to impeding censorship of information accessed over the Internet. These systems are more trustworthy than centralized search, since they have cut out the middleman and allowed people to share information directly without depending on a central node. Decentralized search systems offer peace of mind, since a small number of administrators cannot prevent users from exchanging their information with others. However, the very decentralization of these systems makes it fiendishly difficult to infer any information that cannot be measured directly, such as the proportion of malicious nodes that do not reply when they have matches. Traditionally, a source node distributes its message by multicasting directly to multiple nodes. However, this technique has become infeasible in a large network. Therefore, most systems encourage the source node to first transmit the message to a set of selected nodes, each of which forwards the messages to another set of nodes, and so on. This strategy allows nodes to share the load of processing and buffering, as well as the communication costs, with others, rather than placing the entire load on the source node. Yet the nature of this system makes it very difficult to deduce any information that cannot be measured directly, such as the number and prevalence of selfish nodes that fail to forward the message. 
Recent technical advancements have pushed providers to use large-scale wireless networks, such as wireless sensor networks (WSNs) and mobile ad-hoc networks (MANETs), with many mobile nodes distributed over a wide area. In a WSN, sensors are usually gathered to detect, collect and share environmental data. It is now reasonable to suppose that wireless devices are everywhere, particularly in densely populated areas. However, the problem with such wireless networks is the difficulty of data distribution and retrieval in a large, highly mobile system. Some scholars [9–15] have proposed systems to address these issues in both WSN and MANET environments, but their systems cannot guarantee high-retrieval rates and can raise network overhead too high to allow efficient data distribution and maintenance. In light of these problems, it is clearly very difficult to maintain a high-retrieval rate in the face of a large proportion of malicious, selfish, and mobile nodes in a highly mobile, large-scale network. First of all, there is no way for individual nodes in such a network to maintain a full view of the system, which lowers the overall retrieval rate. Next, direct communication is infeasible in such a large network, and the very process of forwarding creates vulnerabilities. Selfish nodes and malicious nodes can have a particularly profound effect on the retrieval rate. In a previous paper [16], we presented the design and implementation of an adaptive algorithm that can protect against malicious and selective-forwarding attacks in fixed-size and static networks, but did not consider the case of nodes that reside in a highly mobile and large-scale network. As recent technical advancements have pushed more and more users to use large-scale wireless and mobile ad-hoc networks, we must consider just such a case: the situation of many mobile nodes distributed over a large network.
As a result, in [17], we addressed the mobility issues that arise in a small mobile network, but did not consider the problems of malicious nodes, selfish nodes, or enormous numbers of nodes in a large-scale network. The difference between this paper and previous works is substantial, because this paper addresses the problem of selfish and malicious nodes not only in the smaller neighborhoods but in large mobile networks as well. Furthermore, we also investigated and compared our work with other search systems to demonstrate the effectiveness and robustness of our work in detecting malicious and selfish nodes, even when the network is large and contains enormous numbers of mobile nodes. Therefore, in this paper, we present a Defense mechanism Against Malicious & Selective forwarding attacks in large and mobile wireless networks, also called DAMS, to address all the problems mentioned above. The goal of DAMS is to address a scenario involving any or all of the following factors: (1) a decentralized large-scale network; (2) high churn; (3) malicious nodes; and (4) selfish nodes. Our system first divides the entire network into a number of regions. Next, our system generates metadata, applies locality sensitive hash (LSH) functions to calculate the mapped region, routes metadata to a mapped region, and finally stores it in $2\sqrt{N}$ nodes in that mapped region, thus minimizing the number of data replicas in a region while maintaining a high-retrieval rate. DAMS allows nodes to join or leave a region, distribute metadata, make requests, or move to other regions at will. Moreover, our system allows every node to maintain and update its membership view during metadata or query distribution in order to address a high churn rate.
In addition, DAMS allows nodes to estimate the proportion of malicious and selfish nodes during metadata or query distribution, and adaptively adjusts the number of forwarding levels and requests as needed to achieve a high probability of a match in large and mobile networks. After the requesting node has successfully retrieved the file, it becomes a source node, and thus publishes its metadata of the file to its mapped region. Furthermore, our system applies a relocation method so that when a node moves to another region, the nodes from the source region still know where to forward subsequent messages to the moving node. Finally, we compare our system to other search systems in terms of match probability, proportion of malicious nodes, proportion of selfish nodes, network size, moving speed, message costs and search times. We also demonstrate that our system can achieve high-retrieval rates, better scalability, reasonably low overhead, greater mobility resilience and higher overall system robustness. We summarize the contributions of this paper as follows: Our algorithm tackles censorship issues by creating a system in which a small number of administrators cannot prevent users from distributing or retrieving messages and information. Our algorithm employs a probability density function (pdf), an exponentially weighted moving average (EWMA) and modified Chi-squared testing to estimate the proportion of malicious and selfish nodes. Our algorithm defends against malicious and selective-forwarding attacks by appropriately adjusting the number of forwarding levels and messages to ensure high-match probability. Our algorithm employs LSH, relocation methods and update-membership methods to guarantee robustness and scalability in a large and high-churn network.
Our algorithm does not require nodes to communicate with others and can still achieve high-match probability with reasonably low overhead, even with a large proportion of malicious and selfish nodes in a large-scale and highly mobile network. Because energy savings and overhead reduction are relatively important, especially in WSNs, our algorithm is designed to reduce overall overhead as much as possible, particularly when compared to other systems. Multiple equivalent responses from different matching nodes create redundancy, in other words, waste. DAMS is the first work that uses redundant responses as valuable information to manage network churn, to address malicious and selfish nodes, and to adaptively determine the number of forwarding levels and the fanout, such that it can guarantee low overhead with high-match rates even in a large-scale and highly mobile network. The goal of this paper is to protect users across the world from censorship by centralized administrators. We hope that such infrastructure will ultimately assure the free flow of information over the Internet. We know that it is important to develop such a system before it is needed, and to ensure that it is available when the need arises.

2. RELATED WORK

2.1. Distributed search in peer-to-peer (P2P) networks

Refs [18–20] have conducted extensive and elaborate comparisons of distributed search methods in P2P networks. The structured approach [21, 22] demands that nodes be organized in an overlay network based on distributed hash tables, trees, rings and the like. The efficiency of this approach is counterbalanced by the fact that it leaves the network vulnerable to manipulation by untrustworthy nodes. In contrast, the unstructured approach [2–8, 23, 11, 14, 24] uses randomization and gossiping, where nodes find one another by exchanging messages over existing links.
This approach leaves the network considerably less vulnerable to manipulation, which is why our DAMS uses the unstructured approach. Gnutella [4], one of the first unstructured networks, uses flooding of requests to find information. Ferreira et al. [3] use a random-walk strategy to replicate both queries and data in an unstructured network. BubbleStorm [6] replicates both queries and data and combines random walks with flooding. Ref. [2] is more sophisticated and efficient than Gnutella because it learns from previous requests. However, it presents a clearer opportunity for a group of untrustworthy nodes to hoard searches. Ref. [7] distributes subscription and publication messages to nodes via directed routing. The authors of both [8, 5] are concerned with trust, much as we are with DAMS; they have addressed these concerns with the facts of privacy and scalability, making centralized systems undesirable to the users of social networks. Ref. [11] employs geographic routing for both file distribution and retrieval; a source node sends out a location update message to periodically identify the closest node to the location of the file, and transfers the file if it finds a closer node. In [14], the entire geographic area is separated into many smaller squares, metadata is mapped and stored from the squares with IDs closest to the virtual ID, and the location is updated if a node’s moving distance reaches a predefined updating threshold. Ref. [24] is an on-demand multi-hop routing system in which nodes first apply flooding to build a route from the source to the destination, and then send the message along this route. Ref. [23] introduces a scalable and mobility-resilient distributed data search system for extremely dense and highly mobile wireless networks. 
DAMS uses a strategy similar to [23] for applying LSH [25, 26] to the mapped regions of messages, but our methods for network division, metadata distribution, query retrieval, message routing and node relocation are quite different from those of [23]. Ref. [27] adopts a hierarchical proactive routing framework, which provides reliable and timely data transmission for efficient routing updates and maintenance in large-scale wireless sensor networks. This work addresses the problem of large-scale networks but fails to address the issues of mobility and attacks. Ref. [28] uses a content-centric cooperative service paradigm via D2D communications to reduce cellular traffic and provides an efficient cooperative content-retrieval framework for mobile users. However, this approach is quite different from our work since our work focuses on location-aware content searching. In addition, it does not consider any issues of malice or mobility. Ref. [29] proposes two distributed data-storage algorithms using compressive sensing for wireless sensor networks. Its approach to data-storage problems is quite different from our own work. In addition, it does not address any of the problems created by mobility or malice.

2.2. WSNs and MANET

WSNs are used for a wide range of applications. In WSNs, sensors must constantly cooperate to collect, share and exchange data on environmental phenomena. Refs [30–33] propose flooding and local broadcasting-based methods in WSNs; however, these methods still generate too much transmission overhead to be energy-efficient and, because they rely on local broadcasting, they cannot guarantee data discovery. Similarly, in [9, 10], the authors propose a topological routing-based method in MANETs, where nodes distribute their data, develop tables for their received messages and forward data queries to the nodes that are most willing to process the data.
Unfortunately, these proposed methods still require networks to invest too much overhead in data distribution and data maintenance. Moreover, they cannot guarantee data discovery in a highly mobile network. MANETs contain several mobile nodes with no centralized node to control the whole infrastructure, where each node is responsible for sending messages, receiving messages and maintaining its routing table. The challenges in MANETs include mobility issues, network churn, short transmission range, lack of centralized administration and dynamic changes in network topology. A comprehensive survey by Abid et al. [34] classified the routing methods for MANETs into the following categories: (1) flat-based routing, which is best for small-scale jobs, and can be further classified into reactive routing [24, 35] (discover a path when it is needed) and proactive routing [36, 27] (constantly update routing tables); (2) cluster or hierarchical-based routing [37, 38], in which a cluster of nodes such as root-peer manages messages; (3) geographical-based routing [11, 14, 13, 15], where a node applies a data-mapping policy to map a file into a geographic location and stores it at the closest node for high scalability; (4) virtual coordinate-based routing [39, 40], where a node obtains its virtual coordinates and updates these coordinates when it moves from one region to another; and (5) DHT-based routing [41–44], where a node needs to have a logical network and its corresponding physical network, and the logical network and routing algorithm are constructed from these logical identifiers. Ref. [45] indicates that routing methods in MANETs should address the dynamic topology, and guarantee that the message can be quickly delivered without excessive overhead.
As a result, we use geographical-based routing for our DAMS system, which maintains high-retrieval rates and low overhead in large-scale and mobile wireless networks, and simultaneously allows nodes to publish their metadata, retrieve information and protect against malicious and selfish nodes.

2.3. Chi-squared statistics and EWMA

Several network security researchers use the EWMA algorithm and Chi-squared statistics. Refs [46, 47] employ EWMA and Chi-squared statistics in their anomaly-detection techniques for intrusion detection. Likewise, Refs [48, 49] use a known window size to appropriately estimate the smoothing factor for detecting network anomalies, and Ref. [50] applies the Chi-squared test to detect intrusions. Refs [51, 52] apply modified Chi-squared statistics to identify similarities between attribute couples of a dataset and a projected subset. Press et al. [53] use a modified Chi-squared statistic to balance the weights of the buckets in comparing two datasets. Our defense mechanism considers all results from a node, beginning when it joins the network and ending when it leaves, and employs the EWMA method with a modified Chi-squared test to determine the proportion of malicious and selfish nodes, increasing the number of forwarding levels and messages in the network in order to maintain high-match probability.

2.4. Message forwarding

Probabilistic forwarding is an alternative solution to multicast routing that has been widely applied for sensors and ad-hoc networks [54–57]. Ref. [58] presents an overview of gossiping and broadcasting in communication networks. Likewise, Refs [59, 60] use forwarding to reduce overall performance cost in delay-tolerant networks. Ref. [61] develops algorithms that require nodes to broadcast with a minimum number of messages and determines upper and lower bounds for broadcasting [62]. Similarly, Ref. [63] presents algorithms that achieve greater efficiency when messages are forwarded in both inter-networks and extended LANs.
Ref. [64] derives the pdf for the number of distinct nodes reached, the expected number of distinct nodes to which a message is forwarded within a small and fixed-size network, the pdf for the number of new nodes at a given forwarding level and the expected number of new nodes at a given forwarding level. Our defense mechanism applies the pdf in the same manner as [64] to estimate the number of distinct nodes that would receive the metadata and requests. Beyond that, our algorithm estimates the proportion of malicious nodes and selfish nodes, and can determine the appropriate number of forwarding levels and messages for achieving high-match probabilities in a large and highly mobile network.

2.5. Defense mechanism for malicious attacks

Ref. [65] takes a unique approach to detecting malicious behavior: it detects nodes that do not respond and then reduces their workloads [66]. Ref. [67] creates a blacklist of malicious nodes in an overlay network with regard to gossiping. Ref. [68] uses a statistical method for determining how often the sequence should occur in normal traces or in intrusions. Ref. [69] employs neural networks or training decision trees [70] to detect novel attacks. Ref. [71] describes an adaptive replication protocol based on random walks whose feedback mechanism adjusts the number of replicas based on search length for sufficient replications. Ref. [72] presents an adaptive P2P protocol to protect against malicious peers that upload corrupt, inauthentic or misnamed content.

2.6. Defense mechanism for selective-forwarding attacks

According to [73], sensor nodes might refuse to forward or drop messages due to limited signal, resulting in a failure to propagate the messages. Ref. [74] obtains safe data transmission between nodes, effectively detects selective-forwarding attacks and manages design issues in a reasonable manner.
Two noteworthy approaches to detecting selective-forwarding attacks can be found in [75], which uses Lightweight Detection, and [76], which uses watermark technology. Ref. [77] detects black-hole and selective-forwarding attacks and employs cryptographic hashes, neighborhood watch and threshold-based analysis to defend against such attacks. Refs [78, 79] use a lightweight security scheme to detect selective-forwarding attacks in WSNs, and use multi-hop acknowledgment from intermediate nodes to create responses and launch alarms. Ref. [80] proposes a defense against selective-forwarding attacks in the form of a fuzzy-based reliable data-delivery method, and Refs [81, 82] apply a multipath routing scheme to defend against such attacks in WSNs.

3. DAMS: DEFENSE MECHANISM AGAINST MALICIOUS, SELFISH AND MOBILE NODES

In this section, we explain our DAMS defense mechanism, which is designed to deal with mobile nodes, malicious nodes and selfish nodes. Our defense mechanism is engineered to address situations where any given node cannot have knowledge of all the nodes in its membership, especially in a large-scale network. The aims of this mechanism are to: (1) utilize the pdf, EWMA method and modified Chi-squared test to detect malicious and selfish nodes; (2) defend against malicious and selfish nodes by adjusting the number of requests so as to achieve high-match probability; and (3) ensure network robustness and scalability in a large network, even with a large proportion of nodes moving to other regions. Figure 1 shows the basic concepts of the DAMS defense mechanism. In DAMS, the entire network is first divided into multiple regions (Section 3.1), as shown in Fig. 1. When a node joins a network, it follows Section 3.2 to enter the region. When a source node wishes to distribute its files, it follows Section 3.4 to do so. Similarly, the requesting node follows Section 3.5 to distribute its requests.
When a node is about to transfer to another region, it applies the steps described in Section 3.9. When a node needs to relay its received messages to other nodes, it employs the steps described in Section 3.6. Lastly, when a node wishes to leave a region, it follows the steps described in Section 3.3. Figure 2 shows the flowchart of the DAMS defense mechanism, and the following subsections lay out a detailed design of the system.

FIGURE 1. Basic concepts of DAMS network.

FIGURE 2. Flowchart of DAMS system.

3.1. Network separation

Due to the range limitations inherent in wireless systems, it is impossible to have all the nodes communicate with all other nodes in a wireless network. Therefore, our system uses static landmarks or wireless access points within the network. We first divide the whole network into multiple regions, where each region has a central landmark whose transmission range can cover the entire region. For example, we may divide the whole network into four smaller regions, as shown in Fig. 1. Every node periodically senses the signal strength of its nearby landmarks and re-identifies its region based on the strongest signal received from these landmarks. Assuming that the network has area $NS$ and that $TR$ is the transmission range, the number of regions $NR$ can be calculated as follows:

$$NR = \frac{NS}{\pi \times TR^{2}} \qquad (1)$$

Please note that $TR$ has to be larger than the diameter $D$ of each region in order to allow every region to cover the transmission range of all of its nodes. Pseudocode for the network-dividing algorithm is given in Fig. 3.

FIGURE 3. Finding the appropriate number of regions NR.

3.2. Entering the network

When a node joins the network, it first selects a node from the new region to serve as its bootstrapping node and then sends a membership request to the selected node. The bootstrapping node replies to the joining node with a list of all of its members. The bootstrapping node also determines whether the joining node should hold metadata. If so, the bootstrapping node sends a source’s URL, along with its membership information, back to the newly joined node. The joining node then saves all of the membership information to its database. If the bootstrapping node also replies with the source node’s URL, the joining node contacts the source node to obtain the metadata. The joining node distributes a join message to all the nodes in its view, each of which then adds the joining node to its local view.

3.3. Departing the network

When a node leaves a network, it first determines whether it has metadata in its database. If so, the leaving node asks a bootstrapping node whether it should transfer its metadata. If so instructed, the departing node randomly selects a node to receive the metadata and notifies the source node of this transfer. The leaving node then distributes a leaving message to all the nodes in its view, deletes its metadata and departs the network. If a node simply leaves the network without informing other nodes, the other nodes will still discover its departure during their join, metadata distribution or query distribution.

3.4. Metadata distribution

When a source node wishes to distribute its metadata, it first applies the alDistribution() algorithm from [16] to calculate the pdf [83, 84] for the number of nodes that receive the metadata in terms of forwarding level $l$, forwarding fanout $a$, and forwarding probability. According to [85], if the source and requesting nodes distribute the metadata and requests to $m = 2\sqrt{n}$ nodes each, and a proportion $x = 1$ of the nodes is operational, then $P(k \ge 1) > 1 - \exp\left(-\frac{2\sqrt{n} \cdot 2\sqrt{n}}{n}\right) \ge 1 - e^{-4} \approx 0.9817$.
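The bound quoted above can be traced with a short derivation. This is a sketch under stated assumptions, not the derivation given in [85]: it assumes the metadata and the request are each distributed uniformly at random to $m = r = 2\sqrt{n}$ of the $n$ nodes, that all nodes are operational ($x = 1$), and that $m + r \le n$ (that is, $n \ge 16$). The number of matches $k$ is then hypergeometric, and the no-match probability $P(0)$ is bounded as follows:

```latex
\begin{align*}
P(0) &= \frac{\binom{n-m}{r}}{\binom{n}{r}}
      = \prod_{i=0}^{r-1} \frac{n-m-i}{n-i}
      \le \left(1 - \frac{m}{n}\right)^{r}
      < e^{-mr/n}, \\
P(k \ge 1) &= 1 - P(0) > 1 - e^{-mr/n}
      = 1 - e^{-(2\sqrt{n})(2\sqrt{n})/n}
      = 1 - e^{-4} \approx 0.9817 .
\end{align*}
```

Each factor satisfies $\frac{n-m-i}{n-i} = 1 - \frac{m}{n-i} \le 1 - \frac{m}{n}$, and $1 - t < e^{-t}$ for $t > 0$, which gives the two inequalities in the first line.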
Likewise, we determine appropriate values of $a$ and $l$ such that the overall number of distributions still reaches $m = 2\sqrt{n}$ nodes that hold the metadata. After obtaining the forwarding fanout $a$ and the number of forwarding levels $l$, the source node produces metadata and applies an LSH (Section 3.10) algorithm to identify the most appropriate destination for the metadata. The source node first chooses an LSH function and hashes its metadata to the $x$ coordinate, then applies another LSH function and hashes its metadata to the $y$ coordinate. Our system then sends this message to the region that contains the coordinates $x$ and $y$. After this is done, the source node uses the routing algorithm, as described in Section 3.7, to send its metadata to the designated region. The first node in a destination region, called the relay node, chooses $2\sqrt{view}$ random nodes in its local view and distributes this metadata to these nodes. Based on the chosen forwarding probability, some nodes receiving such metadata might further forward this metadata to other nodes in the same region. Lastly, the relay node updates its membership and then notifies the source node about these metadata nodes.

3.5. Query distribution

When a requesting node wishes to distribute its query, it first applies the alGet() algorithm from [16] to calculate the pdf [83, 84] for the number of nodes that receive the query messages in terms of forwarding level $l$, forwarding fanout $a$, and forwarding probability. Next, the requesting node produces a query, and applies an LSH (Section 3.10) algorithm to determine the appropriate destination for the query. The requesting node first chooses an LSH function and hashes its query to the $x$ coordinate, and then applies another LSH function and hashes its query to the $y$ coordinate. Our system then sends this query to the region that contains the coordinates $x$ and $y$.
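The mapping from metadata or a query to a destination region can be illustrated with a minimal Python sketch. The text does not specify which LSH family DAMS uses, so this sketch substitutes min-hashing over the keyword set (a standard LSH for Jaccard similarity); the function names `minhash` and `mapped_region`, the salts `"x"` and `"y"`, and the `cols` by `rows` grid of regions are all hypothetical and chosen only for illustration.

```python
import hashlib


def _salted_hash(salt: str, token: str) -> int:
    """Deterministic 64-bit hash of a token under a salt (stand-in hash function)."""
    digest = hashlib.sha256(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(keywords, salt: str) -> int:
    """Min-hash of a keyword set: the smallest salted hash over its members.

    Two keyword sets share the same min-hash with probability equal to their
    Jaccard similarity, which is the locality-sensitive property relied on here.
    """
    tokens = {kw.lower() for kw in keywords}
    if not tokens:
        raise ValueError("at least one keyword is required")
    return min(_salted_hash(salt, t) for t in tokens)


def mapped_region(keywords, cols: int, rows: int):
    """Map a keyword set to grid coordinates (x, y) of its destination region.

    One salted min-hash gives the x coordinate and a second, independently
    salted min-hash gives the y coordinate (an assumed reading of the two
    LSH functions described in the text).
    """
    x = minhash(keywords, "x") % cols
    y = minhash(keywords, "y") % rows
    return x, y


# Metadata and a query with the same keywords land in the same region.
metadata_region = mapped_region(["censorship", "P2P", "retrieval"], cols=2, rows=2)
query_region = mapped_region(["retrieval", "p2p", "Censorship"], cols=2, rows=2)
```

Because the source node and the requesting node apply the same two functions, identical keyword sets are always routed to the same region, and similar keyword sets are routed there with a probability that grows with their overlap.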
The requesting node uses the routing algorithm, as described in Section 3.7, to forward the query to the calculated region. The first node, called the relay node, chooses $2\sqrt{view}$ random nodes in its local view and distributes this query to these nodes. Based on the chosen forwarding probability, some nodes receiving such queries might forward this request to other nodes in the same region. The nodes that receive such a query compare the keywords in the request with the metadata they hold. If a node finds a match, the node responds to the relay node with the URLs of the source nodes. After obtaining the responses, the relay node first updates its view from its membership and then sends its response to the requesting node. The requesting node then randomly selects one of the source nodes to retrieve its desired file, and applies a protection algorithm (Section 3.8) to discover newly joined and leaving nodes, to estimate the proportion of malicious nodes, to determine the proportion of selfish nodes, to estimate the churn and to decide its next update time based on the current churn. The requesting node further applies the steps described in Section 3.4 to publish its metadata to some randomly chosen nodes in its mapped region.

3.6. Relaying messages

There are six different ways in which a relay node relays its messages. In the first, the relay node receives metadata and simply forwards that metadata to $2\sqrt{view}$ randomly selected nodes. In the second, the relay node receives a metadata reply message listing $2\sqrt{view}$ nodes, and it forwards this reply back to the source node. In the third, the relay node receives a query, and it simply forwards this query to $2\sqrt{view}$ randomly selected nodes. In the fourth, the relay node receives a query-matching response, and it simply forwards it to the requesting node.
In the fifth, the relay node receives a relocation message from the distributor and forwards this message to all the nodes that have the metadata. In the last, the relay node receives a relocation message from the requesting node, and it simply updates the location in its database.

3.7. Distance–direction-based geographic routing algorithm

In DAMS, we propose a distance–direction-based geographic routing method (DGR). In DGR, the sender first determines the recipient’s $x$ coordinate, $y$ coordinate and direction. Next, the sender detects its nearby neighbors within a certain range and along a certain vector and calculates the distances between the neighbors’ coordinates and the recipient’s coordinates. Finally, the sender selects the neighbor along the recipient’s vector that is the closest to the recipient’s coordinates. Thus, the message can be quickly delivered to the destination region in a small number of hops. For example, suppose that a sender is located at coordinates $s = (1,2)$ and a receiver is located at coordinates $r = (8,10)$. The sender identifies four nearby neighbors: $n_1 = (0,3)$, $n_2 = (3,3)$, $n_3 = (5,5)$ and $n_4 = (0,5)$. The sender first determines that the receiver is to the northeast (both of its coordinates are larger than the sender’s), and notices that $n_2$ and $n_3$ also lie in that direction. The sender then determines the distances from $n_2$ and $n_3$ to the receiver (about 8.60 and 5.83, respectively) and notices that the distance between the receiver and $n_3$ is smallest. Thus, the sender forwards the message to $n_3$. The algorithm is given in Fig. 4.

FIGURE 4. Message routing algorithm.

3.8. Protection algorithm

Those nodes that receive both metadata and requests but do not report a match are considered malicious. Similarly, those nodes that do not forward a message are considered selfish.
Moreover, when the network has high churn, enormous numbers of nodes are continuously arriving at and departing from the network, which might cause disconnection in the network, or limit other nodes’ view of it. Thus, our protection algorithm is designed to allow every node to discover newly joined and leaving nodes, to estimate the proportion of malicious and selfish nodes, to estimate the churn and to allow nodes to manage and decide their next update time based on the estimated churn of the network. The sender first sends its messages to a few nodes chosen at random, and waits for responses from those nodes. The sender then removes the unresponsive nodes, and adds newly joined nodes from the received response messages to its view. Our algorithm further collects data on the number of responses that a requesting node receives using the probability density function from [16], the EWMA method, and the modified Chi-squared test. The EWMA method is used to decrease the noise of the observations and to average the observations over a sequence of requests. The EWMA method is defined by $s_t = c \times v_t + (1 - c) \times s_{t-1}$, where $c$ is the smoothing factor with $0 \le c \le 1$, $s_t$ is the output of the EWMA at time $t$, and $v_t$ is the current non-normalized observed probability. The modified Chi-squared goodness-of-fit test [53] is used to determine which of the analytical curves best matches the observed probabilities, and to estimate the proportion of non-malicious nodes in its membership view. The modified Chi-squared statistic is defined by $\chi^2 = \sum_{k=1}^{kMax} \frac{(o_k - e_k)^2}{o_k + e_k}$, where $o_k$ is the actual number of observations in the $k$th bucket, $e_k$ is the expected number of observations in the $k$th bucket, and $kMax$ is the maximum bucket in which the observations might be located. After calculating the observed probabilities from the collected responses, the sender further calculates the analytical match probabilities with message forwarding.
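The two statistics defined above can be written down directly. The following Python sketch is illustrative only: the function names `ewma` and `modified_chi_squared` are hypothetical, seeding the EWMA with the first observation is an assumption (the text does not state the initial value $s_0$), and skipping buckets where $o_k + e_k = 0$ is an assumed convention to avoid division by zero.

```python
def ewma(values, c, s_prev=None):
    """Exponentially weighted moving average: s_t = c * v_t + (1 - c) * s_{t-1}.

    `values` is the sequence of observations v_t and `c` is the smoothing
    factor with 0 <= c <= 1. If no previous output `s_prev` is supplied, the
    first observation seeds the average (an assumption of this sketch).
    Returns the list of smoothed outputs s_t.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("smoothing factor c must satisfy 0 <= c <= 1")
    smoothed = []
    for v in values:
        s_prev = v if s_prev is None else c * v + (1.0 - c) * s_prev
        smoothed.append(s_prev)
    return smoothed


def modified_chi_squared(observed, expected):
    """Modified Chi-squared statistic: sum over buckets of (o - e)^2 / (o + e).

    Buckets where both the observed and expected values are zero contribute
    nothing and are skipped.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if o + e > 0:
            total += (o - e) ** 2 / (o + e)
    return total


# Smooth a noisy sequence of per-request observations for one bucket.
smoothed = ewma([1.0, 0.0, 1.0, 1.0], c=0.5)
```

In the setting described in the text, `ewma` would be applied per bucket $k$ to the observed probabilities across successive requests, and `modified_chi_squared` would then compare the smoothed curve with each candidate analytical curve.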
For the analytical match probabilities, we employ the hypergeometric distribution [83], which describes the number of successes in a sequence of random draws from a finite population without replacement. It is calculated as $P(k) = \binom{mx}{k} \binom{n - mx}{r - k} / \binom{n}{r}$, for $mx + r \le n$ and $k \le \min\{mx, r\}$. If either of those two conditions is not satisfied, then $P(k) = 0$. In addition, the match probability $P(k \ge 1)$ of one or more matches is given by $P(k \ge 1) = 1 - P(0)$, where $mx + r \le n$. In essence, we employ the hypergeometric distribution to calculate the probability $P(k)$ of $k$ matches, and probabilistic analysis with the probability density function (pdf), for both pdf(n,a,l,F).getM() and pdf(n,a,l,F).getR(), for the number of nodes reached when forwarding to $m$ and to $r$ nodes. The match probability is the sum of the products of these probabilities over all values of $m$ and $r$. After obtaining the analytical probability curve, the algorithm applies normalization, since there might be no response at all: metadata and requests might be distributed to disjoint subsets of nodes, malicious or selfish nodes might interfere with distribution, and there might be no metadata that match a request. Therefore, we exclude requests for which all responses indicate no match. In addition, we exclude requests that result in matches for large values of $k_{\text{Max}}$, since the probability of these matches is negligibly small. We normalize the probabilities as $Q(k) = P(k) / \sum_{i=1}^{k_{\text{Max}}} P(i)$. Finally, our algorithm compares the normalized observed probabilities with the normalized analytical probabilities for various values of F and X, and selects the values of x and f for which the modified Chi-squared value between the observed and analytical probability curves is the smallest, since the smallest value corresponds to the curve with the best fit.
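The hypergeometric match probability, the normalization step and the best-fit selection described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors’ implementation. Here `mx` is taken to be an integer count (the product of m and x, rounded by the caller), the candidate curves are supplied by the caller rather than derived from pdf(n,a,l,F), and all function names are hypothetical.

```python
from math import comb


def match_probability(k, n, mx, r):
    """P(k) = C(mx, k) * C(n - mx, r - k) / C(n, r).

    Returns 0 when mx + r > n or k > min(mx, r), as stated in the text.
    Assumption: n, mx, r and k are non-negative integers.
    """
    if k < 0 or mx + r > n or k > min(mx, r):
        return 0.0
    return comb(mx, k) * comb(n - mx, r - k) / comb(n, r)


def at_least_one_match(n, mx, r):
    """P(k >= 1) = 1 - P(0), intended for mx + r <= n as in the text."""
    return 1.0 - match_probability(0, n, mx, r)


def normalize(probs):
    """Q(k) = P(k) / sum_i P(i) over the buckets k = 1..kMax."""
    total = sum(probs)
    return [p / total for p in probs] if total > 0 else list(probs)


def best_fit(observed, candidate_curves):
    """Return the key whose curve has the smallest modified Chi-squared value.

    candidate_curves maps a hypothetical (x, f) pair to a normalized curve.
    """
    def chi2(o_vals, e_vals):
        return sum((o - e) ** 2 / (o + e)
                   for o, e in zip(o_vals, e_vals) if o + e > 0)

    return min(candidate_curves,
               key=lambda key: chi2(observed, candidate_curves[key]))


if __name__ == "__main__":
    n, mx, r = 10, 3, 2
    curve = normalize([match_probability(k, n, mx, r) for k in (1, 2)])
    print(curve)  # approximately [0.875, 0.125]
    print(best_fit([0.8, 0.2], {(1.0, 1.0): curve, (0.5, 0.5): [0.5, 0.5]}))
```

With n = 10, mx = 3 and r = 2, the probabilities of zero, one and two matches are 21/45, 21/45 and 3/45, so the probability of at least one match is 24/45.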
Next, once the algorithm has obtained the values of x and f, it computes the appropriate a and l needed to satisfy the match probability of one or more matches, $y_0 = P(k \ge 1)$. The algorithm then increases the number of levels l and the fanout a with which the query is distributed, to achieve the same match probability as when $X = 1.0$ and $F = 1.0$. Finally, the algorithm calculates the next update time as the sum of the currently leaving nodes and the newly joined nodes, multiplied by the maximum update time and divided by the total number of request messages. Pseudocode for our protection algorithm is given in Fig. 5.
FIGURE 5. Protecting algorithm.
3.9. Region transfer algorithm
Four types of nodes might move to other regions: (1) source nodes, (2) requesting nodes, (3) nodes with metadata and (4) relay nodes. First, when the network has high churn, the source node might move to another region at any time. Our algorithm ensures that all the nodes that have metadata know the current location of the source node, so that they can later inform the requesting node of that location. Second, the requesting node might move to another region before it obtains a match reply. Therefore, our algorithm ensures that the relay node in the original region knows the current location of the requesting node, so that it knows where to forward an incoming response. The requesting node has a timer, so that if it does not receive any response for a while, it initiates another request from the new region. Third, if a metadata node moves out of its current region, it transfers its information to another node chosen at random from that region, and informs the source node about this transfer.
Lastly, if a relay node moves to another region, it transfers the information in its database to a node selected at random in its region.
3.10. LSH algorithm
The goal of LSH is to hash items several times, so that similar documents are more likely to appear in the same bucket [25, 26]. In our experiment, we ignore all of the punctuation in the document, convert all uppercase characters to lowercase, apply the stemming algorithm to convert the various forms of a word to its base form, remove any stop words, and group each run of k consecutive words into a set. According to [86], k is usually set to 2 or 3 for small documents such as emails, and to 3 or 4 for large documents such as news and research articles. We likewise set $k = 3$ for our experiments. It is difficult to use Jaccard similarity to calculate the similarities between sets when the network holds millions of documents, since we cannot store all the sets in a node’s main memory. As a result, we apply the “minhashing” method, which estimates the similarities between documents with relatively little computation time. The minhashing process begins with the conversion of a series of sets into a single matrix. If the element $j$ is in the set $S_i$, we set 1 in location $(i, j)$; otherwise, 0. Next, we construct a minhash (or permutation) function $h$ to reorder all the rows of the matrix. The permutation function is defined as follows [87]: $h(x) = (ax + b) \bmod p$, where $p$ is any prime number, $a$ is any integer that satisfies 0