Speeding Up GDL-Based Message Passing Algorithms for Large-Scale DCOPs

Abstract

This paper develops a new approach to speed up Generalized Distributive Law (GDL) based message passing algorithms that are used to solve large-scale Distributed Constraint Optimization Problems (DCOPs) in multi-agent systems. In particular, we significantly reduce computation and communication costs, in terms of convergence time, for algorithms such as Max-Sum, Bounded Max-Sum, Fast Max-Sum, Bounded Fast Max-Sum, BnB Max-Sum, BnB Fast Max-Sum and Generalized Fast Belief Propagation. This is important because the outcome obtained from such algorithms often becomes outdated or unusable if the optimization process takes too much time. The issue of taking too long to complete the internal operation of a DCOP algorithm is even more severe and commonplace in a system where the algorithm has to deal with a large number of agents, tasks and resources. This, in turn, limits the practical scalability of such algorithms. In other words, an optimization algorithm can be used in larger systems only if its completion time can be reduced. However, it is challenging to maintain the solution quality while minimizing the completion time. Considering this trade-off, we propose a generic message passing protocol for GDL-based algorithms that combines clustering with domain pruning, as well as the use of a regression method to determine the appropriate number of clusters for a given scenario. We empirically evaluate the performance of our method in a number of settings and find that it brings down the completion time by around 37–85% (1.6–6.5 times faster) for 100–900 nodes, and by around 47–91% (1.9–11 times faster) for 3000–10,000 nodes compared to the current state-of-the-art.

1. INTRODUCTION

Distributed Constraint Optimization Problems (DCOPs) are a widely studied framework for solving constraint handling problems of cooperative multi-agent systems (MAS) [1]. They have been applied to many real world applications such as disaster response [2], sensor networks [3, 4], traffic control [5], meeting scheduling [6] and coalition formation [7]. DCOPs have received considerable attention from the multi-agent research community due to their ability to optimize a global objective function of problems that can be described as the aggregation of distributed constraint cost functions. To be precise, DCOP algorithms are distributed in that agents negotiate a joint solution through local message exchange, and they exploit the structure of the application domain by encoding it into constraints in order to tackle hard computational problems. In DCOPs, such problems are formulated as constraint networks that are often represented graphically. In particular, the agents are represented as nodes, and the constraints that arise between the agents, depending on their joint choice of action, are represented by the edges [8]. Each constraint can be defined by a set of variables held by the corresponding agents related to that constraint. In more detail, each agent holds one or more variables, each of which takes values from a finite domain. The agent is responsible for setting the value of its own variable(s) but can communicate with other agents to potentially influence their choice. The goal of a DCOP algorithm is to set every variable to a value from its domain so as to minimize the constraint violation.
Over the last decade, a number of algorithms have been developed to solve DCOPs under two broad categories: exact and non-exact algorithms. The former always finds an optimal solution, and can be further classified into fully decentralized and partially centralized approaches. In fully decentralized approaches (e.g. ADOPT [1], BnB ADOPT [9] and DPOP [10]), each agent has complete control over its variables and is aware of only local constraints. However, such approaches often require an excessive amount of communication when applied to complex problems [11]. On the other hand, centralizing parts of the problem can often reduce the effort required to find a globally optimal solution (e.g. OptAPO [12] and PC-DPOP [11]). In both cases, finding an optimal solution for a DCOP is an NP-hard problem that exhibits an exponentially increasing coordination overhead as the system grows [1]. Consequently, exact approaches are often impractical for application domains with larger constraint networks. In contrast, non-exact algorithms sacrifice some solution quality for better scalability. These algorithms are further categorized into local greedy and Generalized Distributive Law (GDL) based inference methods. In general, local greedy algorithms (e.g. DSA [13] and MGM [14]) begin with a random assignment of all the variables within a constraint network, and go on to perform a series of local moves that try to greedily optimize the objective function. They often perform well on small problems whose constraints have low arity. There has been some work on providing guarantees on the performance of such algorithms in larger DCOP settings [15–18]. However, for large and complex problems, consisting of hundreds or thousands of nodes, this class of algorithms often produces a global solution that is far from optimal [19–22]. This is because agents do not explicitly communicate their utility for being in any particular state. Rather, they only communicate their preferred state (i.e. the one that will maximize their own utility) based on the current preferred state of their neighbours. GDL-based inference algorithms are the other popular class of non-exact approaches [23]. Rather than exchanging self-states with their neighbours as in local search, agents in a GDL-based algorithm explicitly share the consequences of choosing non-preferred states, as well as the preferred one, during inference through the graphical representation of a DCOP [2, 20, 24]. Thus, agents can obtain global utilities for each possible value assignment. In other words, in contrast to the greedy local search algorithms, agents do not propagate assignments. Instead, they calculate utilities for each possible value assignment of their neighbouring agents' variables. Eventually, this information helps this class of algorithms to achieve good solution quality for large and complex problems. Hence, GDL-based inference algorithms perform well in practical applications and provide optimal solutions for cycle-free constraint graphs (e.g. Max-Sum [24]) and acceptable bounded approximate solutions for their cyclic counterparts (e.g. Bounded Max-Sum (BMS) [22]). Therefore, these algorithms have received an increasing amount of attention from the research community. In this paper, we focus on GDL-based DCOP algorithms and utilize the aforementioned advantages of partial centralization.
GDL-based DCOP algorithms follow a message passing protocol where agents continuously exchange messages to compute an approximation of the impact that each of the agents' actions has on the global optimization function, by building a local objective function (expounded in Section 2). Once the function is built, each agent picks the value of a variable that maximizes the function. Thus, this class of non-exact algorithms makes efficient use of constrained computational and communication resources, and effectively represents and communicates complex utility relationships through the network. Despite these advantages, scalability remains a challenge for GDL-based algorithms [2, 25, 26]. Specifically, they perform repetitive maximization operations for each constraint to select the locally best configuration of the associated variables, given the local function and a set of incoming messages. To be precise, a constraint that depends on n variables, each having a domain composed of d values, needs to perform d^n computations for a maximization operation. As the system scales up, the complexity of this step grows exponentially and makes this class of algorithms computationally expensive. While several attempts have been made to reduce the cost of the maximization operation, most of them typically limit the general applicability of such algorithms (see Section 7). In more detail, previous attempts at scaling up GDL-based algorithms have focused on reducing the overall cost of the maximization operator. However, they overlook an important concern: all the GDL-based algorithms follow a Standard Message Passing (SMP) protocol to exchange messages among the nodes of the corresponding graphical representation of a DCOP. In the SMP protocol, a message is sent from a node v on an edge e to its neighbouring node w if and only if all the messages have been received at v on edges other than e, summarized for the node associated with e [23, 27]. This means that a node in the graphical representation is not permitted to send a message to its neighbouring node until it receives messages from all its other neighbours. Here, for w to be able to generate and send messages to all its other neighbours, it depends on the message from v. To be exact, w cannot compute and transmit messages to its neighbours other than v until it has received all essential messages, including the message from v. Because of this dependency, which is common to all the nodes, the total time required to finish message passing (the so-called completion time) increases. Now, there is an asynchronous version of message passing where nodes are initialized randomly, and outgoing messages can be updated at any time and in any sequence [23, 24]. Thus, the asynchronous protocol minimizes the waiting time of the agents, but there is no guarantee about how consistent their local views (i.e. the local objective functions) are. In other words, agents can take decisions from an inconsistent view and may need to revise their action. Therefore, unlike SMP, even in an acyclic constraint graph, this asynchronous version does not guarantee convergence after a fixed number of message exchanges. Thus, it incurs more communication and computational cost, as redundant messages are generated and sent regardless of the structure of the graph. Significantly, even in the asynchronous version, the expected result for a particular node can be achieved only when all the received messages for the node have been computed by following the regulation of the SMP protocol.
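To make the SMP sending rule concrete, the following minimal sketch (our own illustrative Python, not code from the paper) checks whether a node may generate a message for a given neighbour: under SMP it may do so only once messages have arrived on all of its other edges.

```python
# Minimal sketch of the SMP sending rule, assuming messages are tracked
# per (sender, receiver) pair. Names are illustrative, not from the paper.

def can_send(node, neighbour, neighbours, received):
    """Return True if `node` may send a message to `neighbour` under SMP.

    node       -- identifier of the sending node
    neighbour  -- identifier of the intended receiver
    neighbours -- dict mapping each node to the set of its neighbours
    received   -- set of (sender, receiver) pairs already delivered
    """
    others = neighbours[node] - {neighbour}
    # SMP: messages must have arrived from every neighbour except the receiver.
    return all((other, node) in received for other in others)


if __name__ == "__main__":
    # Tiny path factor graph: x0 -- F0 -- x1 -- F1 -- x2
    neighbours = {
        "x0": {"F0"}, "F0": {"x0", "x1"},
        "x1": {"F0", "F1"}, "F1": {"x1", "x2"}, "x2": {"F1"},
    }
    received = {("x0", "F0")}          # x0 has initiated message passing
    print(can_send("F0", "x1", neighbours, received))  # True: only x0's message matters
    print(can_send("F0", "x0", neighbours, received))  # False: F0 still waits on x1
```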
Building on this insight, [28] introduces an asynchronous propagation algorithm that schedules messages in an informed way. Moreover, [29] demonstrates that, with respect to the total completion time, the impact of inconsistent views is worse than the waiting time of the agents, due to the effort required to revise an action in the asynchronous protocol. Thus, the completion time in both cases is proportional to the diameter of the factor graph, and the asynchronous version never outperforms SMP in terms of the completion time [20, 27, 29]. In light of the aforementioned observations, this paper develops a new message passing protocol that we call Parallel Message Passing (PMP). PMP can be used to obtain the same overall results as SMP in significantly less time, when applied to all the existing GDL-based message passing algorithms. In this paper, we use SMP as a benchmark in evaluating PMP, because SMP is faster than (or in the worst case equal to) its asynchronous counterpart. It is noteworthy that the GDL-based algorithms which deal with cyclic graphical representations of DCOPs (e.g. BMS, Bounded Fast Max-Sum (BFMS)) initially remove the cycles from the original constraint graph, then apply the SMP protocol on the acyclic graph to provide a bounded approximate solution of the problem. Our protocol can be applied to cyclic DCOPs in the same way: once the cycles have been removed, PMP can be applied in place of SMP on the transformed acyclic graph. In more detail, this work advances the state-of-the-art in the following ways:

- PMP provides the same result as its SMP counterpart, but takes significantly less time. Here, we do not change the computation method of the messages. Rather, we efficiently distribute the overhead of message passing to the agents to exploit their computational and communication power concurrently. Thus, we reduce the average waiting time of the agents. To do so, we split the graphical representation of a DCOP into several parts (i.e. clusters) and execute message passing on them in parallel. As a consequence of this cluster formation, we have to ignore inter-cluster links. Therefore, PMP requires two rounds of message passing, and an intermediate step (done by a representative agent on behalf of the cluster, namely the cluster head) to recover the values of the ignored links. However, this overall process is still significantly quicker than SMP. In addition, we introduce a domain pruning algorithm to further reduce the time required to complete the intermediate step.
- Our approach is mostly decentralized, apart from the intermediate step performed by the cluster heads. However, unlike existing partially centralized techniques such as OptAPO and PC-DPOP, which require extensive effort to find and distribute the centralized parts of a DCOP to the cluster heads (also known as mediator agents), no such additional effort is required in PMP. Thus, we effectively take advantage of partial centralization without being affected by its major shortcomings (see Section 7).
- We empirically evaluate the performance of our protocol, in terms of completion time, and compare it with the GDL-based benchmark DCOP algorithms that follow the SMP protocol in different settings (up to 10,000 nodes). Our results show a speed-up of 37–91% with no reduction in solution quality, meaning a GDL-based DCOP algorithm can generate the same solution quality 1.6–11 times faster by using PMP.
By doing so, PMP makes GDL-based algorithms more scalable in that either they take less time to complete the internal operation of a given size of DCOP, or they can handle a larger DCOP in the same completion time as a smaller one that uses SMP. We observe from our empirical studies that it is non-trivial to determine the appropriate number of clusters for a certain scenario. Therefore, it is important to find out how many clusters should be formed for a particular scenario before initiating the message passing. To address this issue, our final contribution is to use a linear regression numerical prediction model to determine the appropriate number of clusters for a specific problem instance. Our empirical evidence suggests that we can achieve at least 98.5% of the possible performance gain from PMP by using the linear regression method. The rest of this paper is structured as follows: Section 2 formally defines the generic DCOP framework and details how the SMP protocol operates on the corresponding graphical representation of a DCOP. Then, in Section 3, we discuss the technical details of our PMP protocol with worked examples. Next, we present the performance of our approach through empirical evaluation in Section 4. Afterwards, Section 5 demonstrates the details and the performance of applying the linear regression model on PMP. Subsequently, Section 6 provides some supplementary empirical evidence. Section 7 puts our work in perspective with previous approaches, and Section 8 concludes.

2. THE STANDARD MESSAGE PASSING PROTOCOL

In general, a DCOP can be formally defined as follows [1, 19, 20, 31]:

Definition 2.1 (DCOP). A DCOP can be defined by a tuple ⟨A, X, D, F, M⟩, where:

- A is a set of agents {A_0, A_1, …, A_k}.
- X is a set of finite and discrete variables {x_0, x_1, …, x_m}, which are held by the set of agents A.
- D is a set of domains {D_0, D_1, …, D_m}, where each D_i ∈ D is a finite set containing the values to which its associated variable x_i may be assigned.
- F is a set of constraints {F_1, F_2, …, F_L}, where each F_i ∈ F is a function dependent on a subset of variables x_i ⊆ X, defining the relationship among the variables in x_i. Thus, the function F_i(x_i) denotes the value for each possible assignment of the variables in x_i and represents the joint pay-off that the corresponding agents achieve. Note that this setting is not limited to pairwise (binary) constraints, and the functions may depend on any number of variables.
- M is a function M: η → A that represents the mapping of variables and functions, jointly denoted by η, to their associated agents. Each variable/function is held by a single agent; however, each agent can hold several variables and/or functions.

Notably, the dependencies (i.e. constraints) between the variables and the functions generate a bipartite graph, called a factor graph, which is commonly used as a graphical representation of such DCOPs [27]. Within this model, the objective of a DCOP algorithm is to have each agent assign values to its associated variables from their corresponding domains in order to either maximize or minimize the aggregated global objective function, which eventually produces the assignment of all the variables, X* (Equation (1)).

X^* = \arg\max_{X} \sum_{i=1}^{L} F_i(x_i) \quad \text{or} \quad X^* = \arg\min_{X} \sum_{i=1}^{L} F_i(x_i)   (1)
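As a concrete illustration of Definition 2.1 and Equation (1), the following minimal Python sketch (our own illustrative code; the constraint names, scopes and utility tables are arbitrary toy values, not taken from the paper) represents a tiny DCOP as a set of constraints over binary domains and evaluates the global objective by brute force.

```python
# Minimal sketch of a factor-graph DCOP in the sense of Definition 2.1,
# assuming each constraint stores its scope (an ordered tuple of variables)
# and a utility table keyed by joint assignments. The two constraints and
# their utilities are arbitrary toy values, unrelated to the paper's figures.
from itertools import product

domains = {"x0": ["R", "B"], "x1": ["R", "B"], "x2": ["R", "B"]}

constraints = {
    "Fa": (("x0", "x1"), {("R", "R"): 3, ("R", "B"): 9,
                          ("B", "R"): 7, ("B", "B"): 1}),
    "Fb": (("x1", "x2"), {("R", "R"): 5, ("R", "B"): 2,
                          ("B", "R"): 8, ("B", "B"): 6}),
}

def global_objective(assignment):
    """Aggregate all constraint utilities for a complete assignment (Equation (1))."""
    return sum(table[tuple(assignment[v] for v in scope)]
               for scope, table in constraints.values())

# Brute-force arg max over the joint domain: only feasible for toy problems.
# GDL-based message passing computes the same maximiser without explicitly
# enumerating every joint assignment.
best = max((dict(zip(domains, values)) for values in product(*domains.values())),
           key=global_objective)
print(best, global_objective(best))   # {'x0': 'R', 'x1': 'B', 'x2': 'R'} 17
```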
For example, Fig. 1 depicts the relationship among variables, functions and agents of a factor graph representation of a sample DCOP. Here, we have a set of nine variables X = {x_0, x_1, …, x_8}, a set of six functions/factors F = {F_0, F_1, …, F_5} and a set of three agents A = {A_1, A_2, A_3}. Moreover, D = {D_0, D_1, …, D_8} is a set of discrete and finite variable domains, where each variable x_i ∈ X can take its value from the domain D_i ∈ D. In this example, agent A_1 holds two function nodes (F_0, F_1) and three variable nodes (x_0, x_1, x_2). Similarly, nodes F_2, F_3, x_3, x_4 and x_5 are held by agent A_2, whereas functions F_4, F_5 and variables x_6, x_7 and x_8 are held by agent A_3. Now, these three agents participate in the optimization process in order to either maximize or minimize a global objective function F(x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8). Here, the global objective function is an aggregation of six local functions F_0(x_0, x_1), F_1(x_1, x_2, x_3), F_2(x_3, x_4), F_3(x_4, x_5), F_4(x_5, x_6, x_7) and F_5(x_7, x_8).

Figure 1. A sample factor graph representation of a DCOP, with six function/factor nodes {F_0, F_1, …, F_5} and nine variable nodes {x_0, x_1, …, x_8}, representing a global objective function F(x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8). It also illustrates the relationship among variables, factors and agents of a DCOP. In the figure, variables are denoted by circles, factors by squares and agents by octagons.

To date, GDL-based message passing algorithms operating on the factor graph representation of the aforementioned DCOP formulation have followed the SMP protocol to exchange messages. Notably, both the Max-Sum and the BMS algorithms (two key algorithms based on GDL) use Equations (2) and (3) for their message passing, and they can be directly applied to the factor graph representation of a DCOP. Specifically, the variable and the function nodes of a factor graph continuously exchange messages (variable x_i to function F_j (Equation (2)) and function F_j to variable x_i (Equation (3))) to compute an approximation of the impact that each of the agents' actions has on the global objective function, by building a local objective function Z_i(x_i). In Equations (2)–(4), M_i stands for the set of functions connected to x_i and N_j represents the set of variables connected to F_j. Once the function is built (Equation (4)), each agent picks the value of a variable that maximizes the function by finding \arg\max_{x_i} Z_i(x_i). Even though some extensions of the Max-Sum and the BMS algorithms (e.g. Fast Max-Sum (FMS) [2], BFMS [30] and BnB-FMS [32]) modify these equations slightly, the SMP protocol still underpins these algorithms. The reason behind this is that a message passing protocol, by definition, does not depend on how the messages are generated; rather, it determines when a message should be computed and exchanged [23, 27].

Q_{x_i \to F_j}(x_i) = \sum_{F_k \in M_i \setminus F_j} R_{F_k \to x_i}(x_i)   (2)

R_{F_j \to x_i}(x_i) = \max_{x_j \setminus x_i} \big[ F_j(x_j) + \sum_{x_k \in N_j \setminus x_i} Q_{x_k \to F_j}(x_k) \big]   (3)

Z_i(x_i) = \sum_{F_j \in M_i} R_{F_j \to x_i}(x_i)   (4)

Algorithm 1 gives an overview of how SMP operates on a factor graph in a multi-agent system. Here, a number of variable and function nodes of a factor graph FG are held by a set of agents A. The corresponding agents act (i.e. generate and transmit messages) on behalf of the nodes they hold. Initially, only the variable and the function nodes that are connected to a minimum number of neighbouring nodes in FG, denoted by iNodes, are permitted to send messages to their neighbours.
Line 1 of Algorithm 1 finds the set of agents A_m ∈ A that hold iNodes. Specifically, the function messageUpdate() represents the messages sent by the agents, on behalf of the permitted nodes they hold, to their permitted neighbours in a particular time step within a factor graph. Notably, the SMP protocol ensures that a node, variable or function, within a factor graph cannot generate and transmit a message to a particular neighbour before receiving messages from the rest of its neighbour(s). According to this regulation of SMP, the permitted nodes (pNodes) and their corresponding permitted neighbours (pNodes.pNeighbours) for a particular time step are determined. At the very first time step, agents A_m, on behalf of iNodes (also denoted as A_m.iNodes), send NULL values to all their neighbouring nodes (iNodes.allNeighbours) within FG (Line 2). Now, following the SMP protocol, a set of agents A_m′, on behalf of pNodes (namely A_m′.pNodes), compute messages (generatedMessages) using Equation (2) or (3) for those neighbours (pNodes.pNeighbours) they are allowed to send to (Line 4). The while loop in Lines 3–7 ensures that this will continue until each of the nodes has sent messages to all its neighbours. Within this loop, once a variable x_i receives messages from all of its neighbours, it can build a local objective function Z_i(x_i), and the corresponding agent chooses the value that maximizes it by finding \arg\max_{x_i} Z_i(x_i) (Lines 5–7).

Algorithm 1. Overview of the SMP protocol on a factor graph.

Figure 2 demonstrates a worked example of how SMP works on the factor graph shown in Fig. 1. Here, Equation (5) and Equations (6)–(9) illustrate two samples of how the variable-to-function (e.g. x_5 to F_4, or Q_{x_5 \to F_4}(x_5)) and the function-to-variable (e.g. F_4 to x_6, or R_{F_4 \to x_6}(x_6)) messages are computed based on Equations (2) and (3), respectively. All the messages are generated considering the local utilities depicted at the bottom of Fig. 2 for the domain {R, B}. During the computation of the messages, red and blue colours are used to distinguish the values of the domain states R and B, respectively. In the former example, apart from the receiving node F_4, the sending node x_5 has only one other neighbouring node (i.e. F_3). Therefore, x_5 only needs to forward the message it received from F_3 to the node F_4 (Equation (5)). In the latter example, the computation of the message from F_4 to x_6 includes a maximization operation on the summation of the local utility function F_4 and the messages received by F_4 from its neighbours other than x_6 (i.e. x_5, x_7). In Equations (7) and (8), the eight entries are ordered over the joint assignments of (x_5, x_6, x_7), with x_7 varying fastest; the maximization for x_6 = R is therefore taken over the 1st, 2nd, 5th and 6th entries, and for x_6 = B over the remaining four. Given that the messages are sent by the corresponding agents on behalf of the nodes they hold, for simplicity we omit the agents from the worked examples in this paper.

Q_{x_5 \to F_4}(x_5) = R_{F_3 \to x_5}(x_5) = \{81, 82\}   (5)

R_{F_4 \to x_6}(x_6) = \max_{\{x_5, x_6, x_7\} \setminus \{x_6\}} \big[ F_4(x_5, x_6, x_7) + Q_{x_5 \to F_4}(x_5) + Q_{x_7 \to F_4}(x_7) \big]   (6)

= \max_{\{x_5, x_7\}} \big[ (6, 9, 25, 7, 6, 27, 3, 2) + (81, 81, 81, 81, 82, 82, 82, 82) + (40, 45, 40, 45, 40, 45, 40, 45) \big]   (7)

= \max_{\{x_5, x_7\}} \big[ (127, 135, 146, 133, 128, 154, 125, 129) \big]   (8)

= \{154, 146\}   (9)

Z_1(x_1) = R_{F_0 \to x_1}(x_1) + R_{F_1 \to x_1}(x_1) = \{12, 10\} + \{135, 144\} = \{147, 154\}   (10)

Figure 2. Worked example of SMP on the factor graph of Fig. 1. In the factor graph, each of the tables represents the corresponding local utility of a function for the domain {R, B}. The values within a curly bracket represent a message computed based on these local utilities, and each arrow indicates the sending direction of the message.
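The message computations of Equations (2) and (3), as used in the worked example above, can be sketched in code as follows. The utility table for F_4 is taken from the first vector in Equation (7); everything else (function names, data structures) is our own illustration rather than the paper's implementation.

```python
# Minimal sketch of Equations (2) and (3) for the example around Equations
# (5)-(9). The utility table for F4 is read off Equation (7); the code itself
# is illustrative and not taken from the paper.
from itertools import product

DOMAIN = ["R", "B"]

# F4(x5, x6, x7), keyed by the joint assignment (x5, x6, x7).
F4 = {
    ("R", "R", "R"): 6,  ("R", "R", "B"): 9,
    ("R", "B", "R"): 25, ("R", "B", "B"): 7,
    ("B", "R", "R"): 6,  ("B", "R", "B"): 27,
    ("B", "B", "R"): 3,  ("B", "B", "B"): 2,
}

def variable_to_function(incoming):
    """Equation (2): sum the R-messages from all factors except the receiver."""
    return {d: sum(msg[d] for msg in incoming) for d in DOMAIN}

def function_to_variable(table, scope, target, incoming):
    """Equation (3): maximise utility plus incoming Q-messages over all
    variables other than `target`, for each value of `target`."""
    out = {}
    for d in DOMAIN:
        best = float("-inf")
        for joint in product(DOMAIN, repeat=len(scope)):
            if joint[scope.index(target)] != d:
                continue
            value = table[joint] + sum(
                incoming[v][joint[scope.index(v)]] for v in incoming)
            best = max(best, value)
        out[d] = best
    return out

# Q_{x5->F4} = R_{F3->x5} = {81, 82} (Equation (5)); Q_{x7->F4} = {40, 45}.
q_x5 = variable_to_function([{"R": 81, "B": 82}])
q_x7 = {"R": 40, "B": 45}
print(function_to_variable(F4, ("x5", "x6", "x7"), "x6",
                           {"x5": q_x5, "x7": q_x7}))   # {'R': 154, 'B': 146}
```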
In the example of Fig. 2, initially, only x_0, x_2, x_6 and x_8 (i.e. iNodes) can send messages to their neighbours at the very first time step (Line 2 of Algorithm 1). According to the SMP protocol, in time step 2, only F_0 and F_5 can generate and send messages, to x_1 and x_7, respectively. It is worth mentioning that, despite receiving a message from x_2 and x_6, respectively, in time step 1, F_1 and F_4 cannot send messages in time step 2, as they have to receive messages from at least two of their three neighbours, according to the regulation of SMP (Line 4). Hence, F_1 has to wait for the message from x_1 to be able to generate and send a message to x_3. Similarly, F_1 cannot generate a message for x_2 until it receives messages from x_1 and x_3. Subsequently, x_3 cannot send a message to F_1 until it receives the message from F_2. In this process, a variable x_1 can only build its local objective function Z_1(x_1) when it receives messages from all of its neighbours, F_0 and F_1 (Equation (10)), and this is common for all the variables. In Equation (10), from the value {R, B} = {147, 154} generated by Z_1(x_1) for the variable x_1, the holding agent of x_1 chooses B = 154, that is \arg\max_{x_1} Z_1(x_1). Following the SMP protocol, the complete message passing procedure ends after each node has received messages from all of its neighbours. These are the dependencies we discussed in Section 1, which make GDL-based algorithms less scalable in practice for large real world problems. Formally, the total time required to complete the message passing procedure for a particular factor graph is termed the completion time T, and the ultimate objective is to reduce the completion time while maintaining the same solution quality from a message passing algorithm. To address this issue, we introduce a new message passing protocol in the following section.

3. THE PARALLEL MESSAGE PASSING PROTOCOL

PMP computes messages in the same way as its SMP counterpart. For example, Max-Sum messages are used when PMP is applied to the Max-Sum algorithm, and FMS messages are used when it is applied to the FMS algorithm. Even so, PMP reduces the completion time by splitting the factor graph into a number of clusters (Definition 3.1), and independently running the message passing on those clusters in parallel. As a result, the average waiting time of the nodes during the message passing is lessened. In particular, the completion time of each round of PMP is reduced to T_smp / N_C, where T_smp is the completion time of the algorithm that follows the SMP protocol, and N_C is the number of clusters. However, PMP ignores inter-cluster links (i.e. messages) during the formation of clusters. Hence, it is not possible to obtain the same solution quality as the original algorithm by executing only one round of message passing. This is why PMP requires two rounds of message passing and an additional intermediate step. The role of the intermediate step is to generate the ignored messages (Definition 3.2) for the split node(s) of a cluster, so that the second round can use these as initial values for those split node(s) to compute the same messages as the original algorithm.
To be precise, a representative agent (or cluster head) takes the responsibility of performing the operation of the intermediate step for the corresponding cluster. To make this possible, we assume that each cluster head retains full knowledge of its cluster and can communicate with its neighbouring clusters, making PMP a partially centralized approach. As a consequence of two rounds of message passing and an intermediate step, the total completion time of PMP (i.e. T_pmp) becomes 2 × T_smp/N_C + T_intm, where T_intm is the time required to complete the intermediate step. As the sizes of the clusters can differ in PMP, a more precise way to compute T_pmp is through Equation (11). Here, T_c_largest stands for the time required to complete the message passing process of the largest cluster in PMP. Having discussed how to compute the completion time of PMP, we explain the details of our proposed algorithm in the remainder of the section.

T_pmp = 2 × T_c_largest + T_intm   (11)

3.1. Algorithm overview

Algorithm 2 gives an overview of PMP. Similar to SMP, it works on a factor graph FG, and the variable and the function nodes of FG are held by a set of agents A. To form the clusters in a decentralized manner, PMP finds an agent A_c ∈ A that holds a special function node (firstFunction), which initiates the cluster formation procedure (Line 1). Specifically, firstFunction is a function node that shares variable(s) with only one other function node. As PMP operates on acyclic or transformed acyclic factor graphs, such node(s) will always be found. Now, each agent that holds a function node with this property broadcasts an initiator message, and any agent can be picked if more than one agent is found. Then, in Line 2, agent A_c initiates the procedure distributeNodes(FG, A_c), which distributes the nodes of FG to the clusters {c_0, c_1, …, c_NC} in a decentralized way; the details of the cluster formation procedure are explained shortly (Algorithm 3). Note that in PMP all the operations within each cluster are performed in parallel.

Algorithm 2. Overview of the PMP protocol on a factor graph.

Algorithm 3. Parallel Message Passing.

After the cluster formation procedure has completed, PMP starts the first round of message passing (Lines 3–6). Line 3 finds the set of agents A_m ∈ A that hold the variable and function nodes iNodes_ci that are connected to the minimum number of neighbouring nodes within each cluster c_i. Then, messageUpdate() in Line 4 represents those messages with NULL values sent by A_m, on behalf of iNodes_ci (also denoted as A_m.iNodes_ci), to all their neighbouring nodes (iNodes_ci.allNeighbours) within the cluster c_i. Afterwards, following the same procedure as SMP, a set of nodes A_m′.pNodes generate the messages (generatedMessages) for the neighbours (pNodes.pNeighbours) they are allowed to send messages to (Line 6). However, unlike SMP, where the message passing procedure operates on the entire FG, PMP executes the first round of message passing, in parallel, only on the clusters having a single neighbouring cluster (Line 5). This is because, in the first round, it is redundant to run message passing on a cluster having more than one neighbouring cluster, as the second round will re-compute its messages (see the explanation in Section 3.3).
The while loop in Lines 5 and 6 ensures that this procedure will continue until each of the nodes has sent messages to all its neighbouring nodes within the participating clusters of the first round. Next, a representative agent from each cluster c_i computes the values (Definition 3.2) ignored during the cluster formation procedure for that particular cluster (Line 7). Note that these ignored values, represented by ignVal(c_i), are the same values those edges would carry if we ran SMP on the complete factor graph FG. Finally, the second round of message passing is started on all of the clusters in parallel, by considering ignVal(c_i) as initial values for those ignored edges (Line 8). Similar to the first round, a while loop ensures that this procedure will continue until each of the nodes has sent messages to all its neighbouring nodes within c_i (Lines 9–13). Within the second round of message passing, once a variable x_i receives messages from all of its neighbours within c_i, it can build a local objective function Z_i(x_i). Then, the corresponding agent chooses the value that maximizes it by finding \arg\max_{x_i} Z_i(x_i) (Lines 12 and 13). By considering the ignVal(c_i) values in the second round, PMP generates the same solution quality as SMP. We give a more detailed description of each part of PMP, with a worked example, in the remainder of this section. To be exact, Section 3.2 concentrates on the cluster formation and the message passing procedure. Then, Section 3.3 presents the intermediate step. Finally, Section 3.4 ends this section with a complete comparative example, SMP vs. PMP, in terms of the completion time.

Definition 3.1 (Cluster, Neighbouring Clusters and Split Node). A cluster c_i is a sub factor graph of a factor graph FG. Two clusters c_i and c_j are neighbours if and only if they share a common variable node (i.e. a split node) x_p. For instance, c_1 and c_2 of Fig. 3 are two sub factor graphs of the entire factor graph shown in Fig. 2. Here, c_1 and c_2 are neighbouring clusters as they share variable x_3 as a split node.

Definition 3.2 (Ignored Values of a Cluster, ignVal(c_i)). The value(s) overlooked, through the split node(s) of each cluster c_i, during the first round of message passing. In other words, these are the incoming messages through the split node(s) that would have been received had the SMP protocol been followed. The intermediate step of PMP takes the responsibility of computing these ignored values, so that they can be used in the second round in order to obtain the same solution quality from an algorithm as its SMP counterpart. In the example of Fig. 3, the intermediate step recovers {R, B} = {119, 126} for the split node x_3 of cluster c_1, which is then used as an initial value for x_3 in the second round of message passing, instead of {R, B} = {0, 0}.

Figure 3. Worked example of PMP (participating clusters: first round — (c_1, c_3); second round — (c_1, c_2, c_3)) on the same factor graph and local utilities as in Fig. 2. In this figure, blue circles represent the split variables of each cluster, and coloured messages show the ignored values (ignVal()) recovered during the intermediate step, where yellow messages require synchronous computations but green underlined ones are ready after the first round.
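Putting the steps of Algorithm 2 together, the overall PMP pipeline can be outlined as below. This is an illustrative sketch under our own naming, not the paper's implementation: the per-cluster message passing and the intermediate step are passed in as callables so that the two parallel rounds and the recovery of the ignored values stand out.

```python
# Illustrative outline of the PMP pipeline of Algorithm 2. The per-cluster
# message passing and the intermediate step are passed in as callables, and
# clusters are plain identifiers; all names are ours, not the paper's.
from concurrent.futures import ThreadPoolExecutor

def run_pmp(clusters, neighbour_count, run_message_passing, intermediate_step):
    """clusters            -- iterable of cluster identifiers
    neighbour_count     -- dict: cluster -> number of neighbouring clusters
    run_message_passing -- callable(cluster, initial): one SMP-style round
                           restricted to the cluster, seeded with `initial`
    intermediate_step   -- callable(cluster, round1_results): recovers the
                           ignored values (Definition 3.2) for its split nodes
    """
    clusters = list(clusters)
    with ThreadPoolExecutor() as pool:
        # Round 1: only clusters with a single neighbouring cluster take part.
        single = [c for c in clusters if neighbour_count[c] == 1]
        round1 = dict(zip(single, pool.map(
            lambda c: run_message_passing(c, initial={}), single)))

        # Intermediate step: each cluster head recovers its ignored values,
        # either directly from the round-1 results (Mr) or via a chain of
        # synchronous operations over its dependent acyclic graphs.
        ignored = dict(zip(clusters, pool.map(
            lambda c: intermediate_step(c, round1), clusters)))

        # Round 2: all clusters run in parallel, seeded with the ignored
        # values of their split nodes; this reproduces the SMP messages.
        return dict(zip(clusters, pool.map(
            lambda c: run_message_passing(c, initial=ignored[c]), clusters)))
```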
3.2. Cluster formation and message passing

PMP operates on a factor graph FG of a set of variables X and a set of functions F. Specifically, Lines 1–12 of Algorithm 3 generate N_C clusters by splitting FG. In the process, Lines 1 and 2 compute the maximum number of function nodes per cluster (N), and the variable nodes associated with a function go to the same cluster as that function. Now, Line 3 gets a special function node firstFunction, which is a function node that shares variable node(s) with only one other function node (i.e. min_nFunction(F)). Any such node can be chosen in case of a tie. Then, Line 4 initializes the variable 'node' with the chosen firstFunction, which is the first member node of the cluster c_1. The for loop of Lines 5–12 iteratively adds the member nodes to each cluster c_i. In order to do this in a decentralized manner, a special variable count is used as a token to keep track of the current number of function nodes belonging to c_i. When a new node added to c_i is held by a different agent, the variable count is passed to the new holding agent. The while loop (Lines 7–11) iterates as long as the member count for a cluster c_i remains less than the maximum number of nodes per cluster (N); once the limit is reached, the next node becomes a member of the next cluster (see the sketch at the end of this discussion). In the worked example of Fig. 3, we use the same factor graph shown in Fig. 2, which consists of six function nodes {F_0, F_1, …, F_5} and nine variable nodes {x_0, x_1, …, x_8}. Here, F_0 and F_5 satisfy the requirement to become the firstFunction node, as both of them share variable nodes with only one other function node (F_1 and F_4, respectively). We pick F_0 randomly as the firstFunction, which becomes the first node of the first cluster c_1; therefore, the holding agent of node F_0 now holds the variable count as long as the newly added nodes are held by the same agent. Assume the number of clusters (N_C) for this example is 3: c_1, c_2 and c_3 (see Section 5 for more details about the appropriate number of clusters). In that case, each cluster can retain a maximum of two function nodes. According to the cluster formation process of PMP, the sets of function nodes {F_0, F_1}, {F_2, F_3} and {F_4, F_5} belong to the clusters c_1, c_2 and c_3, respectively. Moreover, c_1 and c_2 are neighbouring clusters as they share a common split node x_3. Similarly, split node x_5 is shared by the neighbouring clusters c_2 and c_3. At this point, as the cluster formation procedure has completed, the first of the two rounds of message passing is initiated. The for loop in Lines 13–16 acts as the first round of message passing. This involves computing and sending the variable-to-function (Equation (2)) and the function-to-variable (Equation (3)) messages, in parallel, only within the clusters (SN) having a single neighbouring cluster. It can be seen from our worked example of Fig. 3 that, among the clusters c_1, c_2 and c_3, only c_2 has more than one neighbouring cluster. Therefore, only c_1 and c_3 participate in the first round of message passing.
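A rough, sequential rendering of the token-based cluster formation of Algorithm 3 (Lines 1–12) is sketched below, assuming an acyclic factor graph given as adjacency lists and a traversal order that starts from a valid firstFunction. In the paper the same bookkeeping is done in a decentralized way by handing the count token from agent to agent; all names here are illustrative.

```python
# Rough, sequential sketch of the cluster formation in Algorithm 3 (Lines 1-12),
# assuming the function nodes are visited in a chain order starting from a
# firstFunction. Names are illustrative, not from the paper.
import math

def form_clusters(functions, variables_of, num_clusters):
    """functions: list of function-node names in traversal order, starting
    from a firstFunction that shares variables with only one other function.
    variables_of: dict mapping each function to the variables in its scope.
    Returns a list of clusters, each with its function and variable nodes."""
    per_cluster = math.ceil(len(functions) / num_clusters)   # N in Algorithm 3
    clusters, current, count = [], {"functions": [], "variables": set()}, 0
    for f in functions:
        if count == per_cluster:                  # cluster c_i is full
            clusters.append(current)
            current, count = {"functions": [], "variables": set()}, 0
        current["functions"].append(f)            # add the function node
        current["variables"] |= set(variables_of[f])   # and its variables
        count += 1
    clusters.append(current)
    return clusters

# The factor graph of Fig. 1, walked from F0 (a valid firstFunction):
scopes = {"F0": ["x0", "x1"], "F1": ["x1", "x2", "x3"], "F2": ["x3", "x4"],
          "F3": ["x4", "x5"], "F4": ["x5", "x6", "x7"], "F5": ["x7", "x8"]}
for c in form_clusters(["F0", "F1", "F2", "F3", "F4", "F5"], scopes, 3):
    print(c["functions"], sorted(c["variables"]))
# Variables shared by consecutive clusters (x3 and x5) are the split nodes.
```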
Unlike the first round, all the clusters participate in the second round of message passing in parallel (Lines 19–23). In the second round, instead of using the null values (i.e. predefined initial values) for initializing all the variable-to-function messages, we exploit the recovered ignored values (Definition 3.2) from the intermediate step (Lines 17 and 18) as initial values for the split variable nodes, as shown in Line 21. Here, all the ignored messages from the split nodes of a cluster c_i are denoted as Q_ignEdge(c_i). The rest of the messages are then initialized as null (Lines 20 and 22). Here, ∀Q(c_i) and ∀R(c_i) represent all the variable-to-function and the function-to-variable messages within a cluster c_i, respectively. For example, in cluster c_3 all the messages are initialized as zeros for the first round of message passing. Therefore, during the first round, the variable nodes x_5 and x_8 start the message passing by sending the values {0, 0} and {0, 0} to the function nodes F_4 and F_5, respectively, in Fig. 3. However, in the second round, the split node x_5 starts by sending the value {81, 82} instead of {0, 0} to the function node F_4. Note that this value {81, 82} is the ignored value for the split node x_5 of cluster c_3, computed during the intermediate step of PMP. Significantly, this is the same value transmitted by the variable node x_5 to the function node F_4 if we follow the SMP protocol (see Fig. 2), which ensures the same solution quality from both protocols. We describe this intermediate step of PMP shortly. Finally, PMP converges with Equation (4) by computing the value Z_i(x_i) and hence finding \arg\max_{x_i} Z_i(x_i).

3.3. Intermediate step

A key part of PMP is the intermediate step (Algorithm 4). It takes a cluster (c_i), provided by Line 18 of Algorithm 3, as an input, and returns the ignored values (Definition 3.2) for each of the ignored links of that cluster. A representative of each cluster c_i (the cluster head ch_i) performs the operation of the intermediate step for that cluster. Note that the cluster heads operate in parallel. Initially, each cluster head needs to receive the StatusMessages from the rest of the cluster heads (Line 1 of Algorithm 4). Each StatusMessage contains the factor graph structure of the sending cluster along with the utility information. Notably, the StatusMessages can be formed and exchanged during the first round, so they do not incur an additional delay. The for loop in Lines 2–14 computes the ignored values for each of the split nodes S_j ∈ S (where S = {S_1, S_2, …, S_k}) of the cluster c_i by generating a Dependent Acyclic Graph, DG(S_j) (Definition 3.3). In addition to the StatusMessages, a cluster head also requires a factor-to-split-variable message (M_r) from each of the participating clusters of the first round. This is significant, as only clusters with one neighbouring cluster can participate in the first round, and the M_r message is prepared for the split node S_j shared with that neighbouring cluster. The content of M_r will not change, as a participating cluster of the first round has no other clusters on which it depends. As a consequence, if a neighbouring cluster of c_i has participated in the first round, the Dependent Acyclic Graph DG(S_j) for S_j comprises only one edge, carrying the message M_r. In more detail, c_p stands for the neighbouring cluster of c_i that shares the split node S_j (i.e. adjCluster(c_i, S_j)), and the variable dCount_cp holds the total number of clusters adjacent to c_p, obtained from the function totalAdjCluster(c_p) (Lines 3 and 4).
If the cluster c_p has no cluster to depend on apart from c_i (i.e. c_p has participated in the first round of message passing), there is no need for further computation, as the ignored value for S_j (i.e. S_j.values) is immediately ready (READY.DG(S_j)) after the first round (Lines 6 and 7). Here, the function append(S_j.values) appends the ignored value for S_j to ignVal(c_i).

Algorithm 4. intermediateStep(Cluster c_i).

On the other hand, if the cluster c_p has other clusters to depend on, further computations in the graph are required. This creates the need to find each node of the graph DG(S_j) (Lines 8–14). Line 9 initializes the first function node dNode of DG(S_j), which is connected to the split node S_j and is a member of the cluster c_p (i.e. adjNode(S_j, c_p)). The while loop (Lines 10–12) repeatedly forms the graph by extracting the adjacent nodes, starting from the first selected node dNode. Finally, synchronous executions (explained shortly) from the farthest node to the start node (i.e. the split node S_j) of DG(S_j) produce the desired value S_j.values for S_j (Line 13), which eventually becomes the ignored value for that split node of the cluster c_i (Line 14). This value will be used as an initial value for the corresponding split node during the second round of message passing.

Definition 3.3 (Dependent Acyclic Graph, DG(S_j)). A DG(S_j) is an acyclic directed graph for a split node S_j of a cluster c_i, directed from the node of the factor graph FG that is furthest from S_j towards S_j. Note that, apart from the node S_j, none of the nodes of DG(S_j) can belong to the cluster c_i. During the intermediate step, synchronous operations are performed on the edges of this graph, in the same direction, to compute each ignored value of a cluster, ignVal(c_i). In the example of Fig. 3, F_3 → F_2 → x_3 is the dependent acyclic graph for the split node x_3 of cluster c_1 in the intermediate step.

R_{F_8 \to x_0}(x_0) = Q_{x_0 \to F_7}(x_0) = D_{F_8 \to F_7}(x_0)   (12)

R_{F_8 \to x_0}(x_0) = \max_{\{x_0, x_1, x_2, x_3, x_4\} \setminus \{x_0\}} \big[ F_8(x_0, x_1, x_2, x_3, x_4) + Q_{x_1 \to F_8}(x_1) + Q_{x_2 \to F_8}(x_2) + Q_{x_3 \to F_8}(x_3) + Q_{x_4 \to F_8}(x_4) \big]
= \max_{\{x_1, x_2, x_3, x_4\}} \big[ F_8(x_0, x_1, x_2, x_3, x_4) + \{Q_{x_1 \to F_8}(x_1) + Q_{x_4 \to F_8}(x_4)\} + \{R_{F_9 \to x_2}(x_2)\} + \{R_{F_3 \to x_3}(x_3) + R_{F_4 \to x_3}(x_3)\} \big]
= \max_{\{x_1, x_2, x_3, x_4\}} \big[ F_8(x_0, x_1, x_2, x_3, x_4) + \{R_{F_9 \to x_2}(x_2) + R_{F_3 \to x_3}(x_3) + R_{F_4 \to x_3}(x_3)\} \big]
= \max_{\{x_1, x_2, x_3, x_4\}} \big[ F_8(x_0, x_1, x_2, x_3, x_4) + \{D_{F_9 \to F_8}(x_2) + D_{F_3 \to F_8}(x_3) + D_{F_4 \to F_8}(x_3)\} \big]   (13)

D_{F_8 \to F_7}(x_0) = \max_{\{x_1, x_2, x_3, x_4\}} \big[ F_8(x_0, x_1, x_2, x_3, x_4) + \{D_{F_9 \to F_8}(x_2) + D_{F_3 \to F_8}(x_3) + D_{F_4 \to F_8}(x_3)\} \big]   (14)

D_{F_j \to F_p}(x_i) = \max_{x_j \setminus x_i} \big[ F_j(x_j) + \sum_{k \in C_j \setminus \{p\}} D_{F_k \to F_j}(x_t) \big]   (15)

As discussed, the entire operation of the intermediate step is performed by the corresponding cluster head for each cluster. Therefore, apart from receiving the M_r values, each of which is a single message from a participating cluster of the first round, there is no communication cost in this step. This produces a significant reduction of the communication cost (time) in PMP. Moreover, we can avoid the computation of variable-to-factor messages in the intermediate step, as they are redundant in this step. In the example of Fig. 4, we consider every possible scenario while computing the message F_8 → x_0 (i.e. F_8 → F_7), and show that the variable-to-factor messages (x_4 → F_8, x_1 → F_8, x_2 → F_8, x_3 → F_8) are redundant during the intermediate step of PMP (Equations (12)–(14)). Here, Q_{x_1 \to F_8}(x_1) = {0, 0, …, 0} and Q_{x_4 \to F_8}(x_4) = {0, 0, …, 0}, as x_1 and x_4 do not have any neighbours apart from F_8. As a result, we get Equation (14) from Equations (12) and (13), and Equation (15) is the generalization of Equation (14).
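A synchronous operation of the intermediate step (Equation (15)) differs from Equation (3) only in that its result is handed to the next function of DG(S_j) rather than to a variable node. The sketch below chains two such operations, as in the F_3 → F_2 → x_3 example of Fig. 3 that is walked through below; the local utility tables are invented for illustration only (the real ones are in Fig. 2, which is not reproduced in the text), but they are chosen so that the chain reproduces the values {102, 99} and {119, 126} quoted in that example.

```python
# Sketch of a synchronous operation D_{Fj->Fp}(xi) (Equation (15)): like a
# function-to-variable message, except the result is passed to the next
# function in DG(Sj). The utility tables below are invented for illustration
# only; they are merely chosen so that the chain reproduces the {102, 99} and
# {119, 126} values quoted in the worked example.
from itertools import product

DOMAIN = ["R", "B"]

def d_message(table, scope, keep, incoming):
    """Maximise table + incoming D/Mr messages over all variables except `keep`."""
    out = {}
    for d in DOMAIN:
        candidates = []
        for joint in product(DOMAIN, repeat=len(scope)):
            if joint[scope.index(keep)] != d:
                continue
            candidates.append(table[joint] + sum(
                incoming[v][joint[scope.index(v)]] for v in incoming))
        out[d] = max(candidates)
    return out

F3 = {("R", "R"): 37, ("R", "B"): 20, ("B", "R"): 10, ("B", "B"): 27}  # F3(x4, x5)
F2 = {("R", "R"): 17, ("R", "B"): 5,  ("B", "R"): 24, ("B", "B"): 10}  # F2(x3, x4)

mr = {"R": 65, "B": 72}                                       # Mr = R_{F4->x5} from c3
d_f3_f2 = d_message(F3, ("x4", "x5"), "x4", {"x5": mr})       # -> {'R': 102, 'B': 99}
d_f2_x3 = d_message(F2, ("x3", "x4"), "x3", {"x4": d_f3_f2})  # -> {'R': 119, 'B': 126}
print(d_f3_f2, d_f2_x3)
```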
Figure 4. Single computation within the intermediate step. In the figure, directed dashed arrows indicate the dependent messages needed to generate the desired message from F_8 to x_0, or F_8 to F_7 (directed straight arrows).

Despite the aforementioned advantages, each synchronous execution (i.e. D_{F_j \to F_p}(x_i)) within DG(S_j) is still as expensive as a factor-to-variable message, and can be computed using Equation (15). In this context, C_j denotes the set of indexes of the functions connected to function F_j in the dependent acyclic graph (DG(S_j)) of the intermediate step, and x_t stands for a variable connected to both functions F_k and F_j. Notably, Equation (15) retains properties similar to those of Equation (3), but the receiving node is a function node (or the split node) instead of only a variable node. For example, x_3 is a split node for cluster c_1 in Fig. 3, where F_2 and F_3 are the nodes of the dependent acyclic graph for x_3. Here, the cluster head of c_1 receives the M_r value {65, 72}. Then the first operation on the graph produces {102, 99} for the edge F_3 → F_2. Afterwards, by taking {102, 99} as the input, the cluster head of c_1 generates {119, 126}, which is the ignored value for x_3 ∈ c_1 generated during the intermediate step. Now, instead of {0, 0}, the second round uses {119, 126} as the initial value for node x_3 in cluster c_1. On the other hand, cluster c_2 has two neighbouring clusters, c_1 and c_3, and neither of them has other clusters to depend on. Therefore, there is no need for further computation in the intermediate step for the split nodes x_3 and x_5 of cluster c_2. The M_r values {27, 28} and {65, 72} are used directly as the ignored values for the split nodes x_3 and x_5, respectively, of cluster c_2. The values related to the split nodes of the clusters c_1, c_2 and c_3 during the different steps of PMP are shown in Equations (16)–(18), respectively:

Cluster c_1:
Before the first round of message passing: split nodes of c_1: S = {S_1 = x_3}; initial value for x_3 = {0, 0}.
Before the intermediate step: ignored value for x_3 = {R_{F_2 \to x_3}(x_3)}.
After the intermediate step: initial value for x_3 = S_1.values = {119, 126}.   (16)

Cluster c_2:
Before the first round of message passing: split nodes of c_2: S = {S_1 = x_3, S_2 = x_5}; initial values for x_3 = {0, 0} and x_5 = {0, 0}.
Before the intermediate step: ignored value for x_3 = {R_{F_1 \to x_3}(x_3)}; ignored value for x_5 = {R_{F_4 \to x_5}(x_5)}.
After the intermediate step: initial value for x_3 = S_1.values = {27, 28}; initial value for x_5 = S_2.values = {65, 72}.   (17)

Cluster c_3:
Before the first round of message passing: split nodes of c_3: S = {S_1 = x_5}; initial value for x_5 = {0, 0}.
Before the intermediate step: ignored value for x_5 = {R_{F_3 \to x_5}(x_5)}.
After the intermediate step: initial value for x_5 = S_1.values = {81, 82}.   (18)

Note that each synchronous operation (i.e. Equation (15)) on each edge of DG(S_j) in the intermediate step still requires a significant amount of computation, due to the potentially large domain size and constraints with high arity. Considering this, in order to improve the computational efficiency of this step, we propose an algorithm to reduce the domain size over which the maximization needs to be computed (Algorithm 5). In other words, Algorithm 5 operates on Equation (15), which represents a synchronous operation of the intermediate step, to reduce its computational cost.
Algorithm 5 requires the incoming messages from the neighbour(s) of a function in DG(S_j), and each local utility must be sorted independently for each state of the domain. Specifically, this sorting can be done before computing the StatusMessage, during the first round of message passing; therefore, it does not incur an additional delay. Finally, the algorithm returns a pruned range of values for each state of the domain (i.e. {d_1, d_2, …, d_r}) over which the maximization will be computed.

Algorithm 5. Domain pruning to compute D_{F_j \to F_p}(x_i) in the intermediate step of PMP.

As discussed in the previous section, D_{F_j \to F_p}(x_i) stands for a synchronous operation in which F_j computes a value for F_p within DG(S_j). Initially, Line 2 computes m, which is the summation of the maximum values of the messages received by the sending function F_j, other than from F_p. In the worked example of Fig. 5, we illustrate the complete process of domain pruning for the state B while computing a sample message from F_4 to F_5. Notably, this is the same example we previously used in Section 2 to explain the function-to-variable message computation process (see Equations (6)–(9)), and it can be seen that the synchronous operation (i.e. F_4 to F_5) in the intermediate step is similar to the function-to-variable (F_4 to x_6) computation. Here, the messages received by the sending node F_4 are {81, 82} and {40, 45}. As the maxima of the received messages are 82 and 45, the value of m = 82 + 45 = 127. Now, the for loop in Lines 3–13 generates, for each state d_i ∈ {d_1, d_2, …, d_r} of the domain, the range of values within which the maximum for the function F_j will always be found, and discards the rest. To do so, Line 4 of the algorithm initially obtains the maximum value p for the state d_i of the function F_j (i.e. max_{d_i}(F_j(x_j))). Then, Line 5 computes b, which is the summation of the incoming-message values of F_j that correspond to p. In the example of Fig. 5, the sorted local utility for B is {25, 7, 3, 2}, from which we get b = 81 + 40 = 121 for the maximum value p = 25. Afterwards, Line 6 computes the base case t, which is the subtraction of b from m (i.e. t = m − b = 127 − 121 = 6).

Figure 5. Worked example of domain pruning during the intermediate step of PMP. In this example, red and blue colours are used to distinguish the domain states R and B while performing the domain pruning.

At this point, a value s, which is less than p, is picked from the sorted list of that state d_i (Line 8). Here, the function getVal(j) finds the value of s, with j representing the number of attempts. In the first attempt (i.e. j = 1), it randomly picks a value of s from the range of the top log_2 |d_i| values of d_i, where |d_i| is the size of d_i. Finally, if the value of t is less than or equal to p − s, the desired maximum must be found within the range [p, s). Otherwise, we need to pick a smaller value of s from the next top log_2 |d_i| values, and repeat the check (Lines 9–12). In the worked example, we pick the value s = 3, which is among the top 2 (i.e. log_2 4 = 2) values of B. Here, t is smaller than p − s, that is 6 < (25 − 3), and it satisfies the condition of Line 9. As a result, the maximum value for the state B will definitely be found in the range [25, 3), that is, among the values {25, 7}. Hence, it is not required to consider the smaller values of s for this particular scenario.
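A rough sketch of this pruning test follows, using the numbers from the example above. The randomized getVal() selection of Algorithm 5 is simplified here to a scan over log2(|d_i|)-sized chunks of the sorted utilities, so this is an approximation of the idea rather than the algorithm itself; all names are ours.

```python
# Rough sketch of the domain-pruning test described for Algorithm 5, assuming
# the local utilities for one domain state are already sorted in descending
# order. The randomized getVal() selection is simplified to a scan over
# log2(|d_i|)-sized chunks; names and structure are illustrative, not the paper's.
import math

def pruned_range(sorted_utils, incoming_max_sum, paired_msg_sum):
    """Return the prefix of `sorted_utils` that is guaranteed to contain the
    maximiser of utility + incoming messages for this domain state.

    sorted_utils     -- local utilities for the state, descending (e.g. [25, 7, 3, 2])
    incoming_max_sum -- m: sum of the maxima of the incoming messages (82 + 45)
    paired_msg_sum   -- b: sum of the incoming-message values paired with the
                        top utility p (81 + 40 in the example)
    """
    p = sorted_utils[0]
    t = incoming_max_sum - paired_msg_sum        # base case t = m - b
    chunk = max(1, int(math.log2(len(sorted_utils))))
    for j in range(chunk, len(sorted_utils), chunk):
        s = sorted_utils[j]                      # candidate cut-off value
        if t <= p - s:                           # the maximum cannot lie at or below s
            return sorted_utils[:j]
    return sorted_utils                          # no pruning possible

# Example from the text: state B of F4, m = 127, b = 121, t = 6.
print(pruned_range([25, 7, 3, 2], 82 + 45, 81 + 40))   # -> [25, 7]
```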
Eventually, introducing the domain pruning technique allows PMP to ignore these redundant operations during the intermediate step, thus reducing the computational cost in terms of completion time. Even for such a small example, this approach halves the search space. Therefore, the overall completion time of the intermediate step can be reduced significantly by employing the domain pruning algorithm.

3.4. Comparative example

Having discussed each individual step of PMP separately, Fig. 6 illustrates a complete worked example that compares the performance of SMP and PMP in terms of completion time. In so doing, we use the same factor graph shown in Fig. 1. Additionally, the message computation and transmission costs for the nodes, based on which the completion time is generated, are given in the figure. In general, a function-to-variable message is computationally significantly more expensive to generate than a variable-to-function message, and a node with a higher degree requires more time to compute a message than a node with a lower degree [19, 25, 27]. In this example, the values were chosen to reflect this observation. For instance, the function node F_1, with degree 3, requires 20 ms to compute a message for any of its neighbouring nodes, while F_2 (with degree 2) requires 10 ms to compute a message. On the other hand, for the variable-to-function messages, if a variable has only one neighbouring node (e.g. x_0, x_2), it generally sends a predefined initial message to initiate the message passing process; therefore, the time required to generate such a message is negligible. In contrast, we consider 2 ms as the time it takes to produce a variable-to-function message when the variable has degree 2 (e.g. x_1, x_4). Moreover, we consider 5 ms as the time it takes to transmit a message from one node to another in this example. Furthermore, each edge weight within the first parentheses represents the time required to compute and transmit a message from a node to its corresponding neighbouring node. For instance, the edge weight from F_0 to x_0 in SMP is 136–150 ms. That means F_0 starts computing a message for x_0 135 ms after the message passing process is initiated, and the receiving node x_0 receives the message after 150 ms.

Figure 6. Comparative example of SMP (top) and PMP (bottom), in terms of completion time, based on the factor graph shown in Fig. 1. In the figure, each edge weight within the first parentheses represents the time required to compute and transmit a message from a node to its corresponding neighbouring node. For instance, the edge weight from F_0 to x_0 in SMP is 136–150 ms: F_0 starts computing a message for x_0 135 ms after the message passing process is initiated, and the receiving node x_0 receives the message after 150 ms.
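The SMP timings of Fig. 6 can be reproduced with a small schedule simulation built from the per-message costs quoted above. The code below is our own reconstruction (the paper does not list its simulator); the prose that follows walks through the same schedule step by step.

```python
# Small schedule simulation that reproduces the SMP timings of Fig. 6 from the
# per-message costs quoted above (function nodes: 20 ms per message if degree 3,
# 10 ms if degree 2; variables: 2 ms if degree 2, negligible if degree 1; 5 ms
# transmission). This is our own reconstruction, not the paper's code.

def smp_schedule(neighbours, compute_cost, transmit=5):
    """Return arrival times for every directed edge (sender, receiver) under
    SMP: an edge fires once all other edges into the sender have arrived."""
    pending = {(u, v) for u in neighbours for v in neighbours[u]}
    arrival = {}
    while pending:
        progressed = False
        for (u, v) in sorted(pending):
            others = neighbours[u] - {v}
            if all((w, u) in arrival for w in others):
                ready = max((arrival[w, u] for w in others), default=0)
                arrival[u, v] = ready + compute_cost(u, others) + transmit
                pending.discard((u, v))
                progressed = True
        if not progressed:
            raise ValueError("the factor graph must be acyclic")
    return arrival

# Factor graph of Fig. 1.
scopes = {"F0": {"x0", "x1"}, "F1": {"x1", "x2", "x3"}, "F2": {"x3", "x4"},
          "F3": {"x4", "x5"}, "F4": {"x5", "x6", "x7"}, "F5": {"x7", "x8"}}
neighbours = {f: set(vs) for f, vs in scopes.items()}
for f, vs in scopes.items():
    for v in vs:
        neighbours.setdefault(v, set()).add(f)

def compute_cost(node, others):
    if node.startswith("F"):
        return 20 if len(neighbours[node]) == 3 else 10
    return 0 if not others else 2   # leaf variables send predefined initial messages

arrival = smp_schedule(neighbours, compute_cost)
print(arrival["F0", "x0"], max(arrival.values()))   # 150 150: SMP finishes at 150 ms
```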
The total calculation of the completion time following the SMP protocol is depicted at the top of the figure. At the beginning, nodes x_0, x_2, x_6 and x_8 initiate the message passing, and their corresponding receiving nodes F_0, F_1, F_4 and F_5 receive messages after 5 ms. Then, x_1 and x_7 receive messages from F_0 and F_5, respectively, after 20 ms. Although F_1 has already received a message from x_2, it cannot generate a message, as it requires at least two messages to produce one. In this process, the message passing procedure completes when all of the nodes have received messages from all of their neighbours. In this particular example, this is when x_0 and x_8 receive messages from F_0 and F_5, respectively, after 150 ms. Thus, the completion time of SMP is 150 ms. On the other hand, PMP splits the original factor graph into three clusters in this example. Each of the clusters executes the message passing in parallel, following the same rules as its SMP counterpart. To be precise, the first round of message passing is completed after 52 ms. Afterwards, the intermediate step to recover the ignored values for the split nodes is initiated. For cluster c_1, x_3 is the split node that ignored the message coming from F_2 during the first round. Two synchronous operations, D_{F_3 \to F_2}(x_4) and D_{F_2 \to x_3}(x_3), are required to obtain the desired value for the split node x_3. Each of these operations is as expensive as the corresponding function-to-variable message. However, Algorithm 5 can be used to reduce the cost of these operations, and we consider a reduction of 40%, since this is the minimum reduction we obtain in the empirical evaluation (see Section 4). In this process, the intermediate step of clusters c_1 and c_3 is completed after 64 ms. Unlike those two clusters, cluster c_2 shares its split nodes x_3 and x_5 with clusters c_1 and c_3, respectively, neither of which has any other cluster to depend on apart from c_2. Therefore, the ignored values for x_3 and x_5 are ready immediately after the completion of the first round (see Algorithm 4). As a result, cluster c_2 can start its second round after 52 ms. In any case, the second round utilizes the recovered ignored values as the initial values for the split nodes to produce the same outcome as its SMP counterpart. We can observe that the second round of message passing completes after 116 ms. Thus, even for such a small factor graph of six function nodes and nine variable nodes, we save around 23% of the completion time by replacing SMP with PMP.

4. EMPIRICAL EVALUATION

Given the detailed description in the previous section, we now evaluate the performance of PMP to show how effective it is, in terms of completion time, compared with the benchmarking algorithms that follow the SMP protocol. To maintain the distributed nature, all the experiments were performed on a simulator in which we generated different instances of factor graph representations of DCOPs having 100–10,000 function nodes. Hence, the completion time reported in this section is a simulated distributed metric, and the factor graphs are generated by randomly connecting 2–7 variable nodes per function. Although we use these ranges to generate factor graphs for this paper, the results are comparable for larger settings.
Now, to evaluate the performance of PMP on different numbers of clusters, we report results for 2–99 clusters for the factor graphs of 100–900 function nodes, and for 2–149 clusters for the rest. These ranges were chosen because the best performances are invariably found within them, and the performance steadily gets worse for larger numbers of clusters. As both SMP and PMP are generic protocols that can be applied to any GDL-based DCOP formulation, we run our experiments on the generic factor graph representation, rather than on an application-specific setting. To be exact, we focus solely on evaluating the performance of PMP compared with the SMP protocol in terms of the completion time of the message passing process, rather than concentrating on the overall algorithm of a DCOP solution approach. Notably, the completion time of such algorithms mainly depends on the following three parameters:
- Tp1: the average time to compute a potentially expensive function-to-variable message within a factor graph.
- Tp2: the average time to compute an inexpensive variable-to-function message within a factor graph.
- Tcm: the average time to transmit a message between nodes of a factor graph.
In the MAS literature, a number of extensions of the Max-Sum/BMS algorithms have been developed. Significantly, each of them can be characterized by different ratios of the above parameters. For example, the value of Tp1/Tp2 is close to 1 for algorithms such as FMS, BFMS or BnB-FMS, because they restrict the domain size of every variable to 2 [2, 32]. In contrast, in a DCOP setting with a large domain size, the value of Tp1/Tp2 is much higher for a particular application of the Max-Sum or BMS algorithm [24, 22, 25]. Additionally, the communication cost, or the average message transmission cost (Tcm), can vary for different reasons, such as environmental hazards in disaster response or climate monitoring application domains [33, 34]. To reflect all these issues in evaluating the performance of PMP, we consider different ratios of those parameters to show the effectiveness of PMP over its SMP counterpart in a wide range of conceivable settings. To be exact, we run our experiments on seven different settings, each of which fixes the ratios of the parameters Tp1, Tp2 and Tcm. Note that, once the values of these parameters have been fixed for a particular setting, the outcome remains unchanged for both SMP and the different versions of PMP even if we repeat the experiments for that setting. This is because we run both protocols on the acyclic or transformed acyclic version of a factor graph, so they always provide a deterministic outcome. Hence, there is no need to perform an analysis of statistical significance for this set of experiments. Note that all of the following experiments are performed on a simulator implemented on an Intel i7 quad-core 3.4 GHz machine with 16 GB of RAM.

4.1. Experiment E1: Tp1 > Tp2 AND Tp1 ≈ Tcm

Figures 7(a) and (b) illustrate the comparative completion times of SMP and PMP under experimental setting E1 for factor graphs with 100–900 and 3000–10 000 function nodes, respectively. Each line of the figures shows the result of both SMP (Number of Clusters = 1) and PMP (Number of Clusters > 1).
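Before characterizing E1 in detail, it may help to view each experimental setting as nothing more than a small configuration of the three parameters above. The sketch below encodes such a configuration for an E1-like setting; the concrete magnitudes are our own illustrative choices (the paper fixes only the ratios, e.g. Tp2 being 100 times cheaper than a randomly drawn Tp1 in E1), and the type and function names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class TimingSetting:
    """Average per-message costs (in ms) characterising one experimental setting."""
    t_p1: float  # function-to-variable message computation time
    t_p2: float  # variable-to-function message computation time
    t_cm: float  # message transmission time between nodes

def sample_e1_setting(seed=0):
    # E1-like relationships: Tp1 > Tp2 and Tp1 ~ Tcm, with Tp2 taken to be
    # 100 times cheaper than a randomly drawn Tp1. Magnitudes are illustrative.
    rng = random.Random(seed)
    t_p1 = rng.uniform(10.0, 100.0)
    return TimingSetting(t_p1=t_p1, t_p2=t_p1 / 100.0, t_cm=t_p1)

print(sample_e1_setting())
```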
The setting E1 characterizes a scenario where the average computation cost (time) of a function-to-variable message (Tp1) is moderately more expensive than that of a variable-to-function message (Tp2), and the average time to transmit a message between nodes (Tcm) is approximately similar to Tp1. To be precise, we consider Tp2 to be 100 times less expensive than a randomly taken Tp1 for this particular experiment. The scenario E1 is commonly seen in the following GDL-based DCOP algorithms: Max-Sum, BMS and FMS. Once these three parameters have been determined, the completion times of SMP (i.e. Tsmp) and PMP (i.e. Tpmp) can be generated using Equations (19) and (20), respectively. Here, the function requiredTime() takes an acyclic factor graph together with Tp1, Tp2 and Tcm as input, and computes the time needed to finish the message passing by following the rules of SMP.

Tsmp = requiredTime(FG, Tp1, Tp2, Tcm)    (19)

Tpmp = 2 × requiredTime(c_largest, Tp1, Tp2, Tcm) + Tintm    (20)

Figure 7. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E1 (Tp1 > Tp2 AND Tp1 ≈ Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

As discussed in Section 3, due to the parallel execution on each cluster, for PMP we only need to consider the largest cluster of FG (i.e. c_largest) instead of the complete factor graph FG. Altogether, the completion time of PMP comprises the time required to complete the two rounds of message passing on the largest cluster plus the time it takes to complete the intermediate step (Tintm). In the intermediate step, each synchronous operation is as expensive as a factor-to-variable message (Tp1). However, during the intermediate step, the proposed domain pruning technique of PMP (i.e. Algorithm 5) minimizes the cost of Tp1 by reducing the size of the domain (i.e. search space) over which the maximization needs to be computed. To empirically evaluate the performance of the domain pruning technique, we independently test it on randomly generated local utility tables with domain sizes varying from 2 to 20. In general, we observe a significant reduction of the search space, ranging from 40% to 75%, by using this technique, and as expected the results improve as the domain size increases (see Section 3.3). Hence, to reflect the worst-case scenario, we consider only a 40% reduction for each operation of the intermediate step while computing the completion time of PMP for all the results reported in this paper. According to Fig. 7(a), the best performance of PMP compared to the SMP protocol is found when the number of clusters is picked from the range {5–25}. In particular, for smaller factor graphs this range becomes narrower. For example, when we are dealing with a factor graph of 100 function nodes, the best results are found within the range of {5–18} clusters; afterwards, the performance of PMP gradually decreases. This is because the time required to complete the intermediate step increases steadily when the cluster size gets smaller (i.e. the number of clusters gets larger).
On the other hand, the time it takes to complete the two rounds of message passing increases when the cluster size becomes larger. As a consequence, it is observed from the results that the performance of PMP drops steadily as the number of clusters increases beyond its peak at a certain number of clusters. Generally, we observe a similar trend in each scenario. Therefore, a proper balance is necessary to obtain the best possible performance from PMP (see Section 5). Notably, for the larger factor graphs, the comparative performance gain of PMP is more substantial in terms of completion time, as a consequence of parallelism. As observed, PMP running over a factor graph with 100–300 function nodes achieves around a 53–59% performance gain (Fig. 7(a)) over its SMP counterpart. On the other hand, PMP takes 61–63% less time than SMP when larger factor graphs (600–900 functions) are considered. Finally, Fig. 7(b) depicts that this performance gain reaches around 61–65% for factor graphs having 3000–10 000 function nodes. Here, this performance gain of PMP is achieved when the number of clusters is chosen from the range of {25–44}.

4.2. Experiment E2: Tp1 ≫ Tp2 AND Tp1 ≫ Tcm

In experimental setting E2, we generated the results based on the same comparative measures and representations as setting E1 (Fig. 8). However, E2 characterizes the scenario where the average computation cost (time) of a function-to-variable message (Tp1) is extremely expensive compared to a variable-to-function message (Tp2), and the average time to transmit a message between nodes (Tcm) is considerably less expensive than Tp1. To be exact, we consider Tp2 to be 10 000 times less expensive than a randomly taken Tp1 for this particular setting. Here, Tp1 is considered 200 times more time consuming than Tcm. Max-Sum and BMS are two exemplary GDL-based algorithms where E2 is commonly seen. More specifically, this particular setting reflects applications that contain variables with a high domain size. For example, assume the domain size is 15 for all five variables involved in a function. In this case, to generate each of the function-to-variable messages, the corresponding agent needs to perform 15^5, that is 759 375, operations. Since Tp1 is extremely expensive in this experimental setting, the performance of PMP largely depends on the performance of the domain pruning technique. Similar to the previous experiment, Fig. 8(a) shows the results for factor graphs having 100–900 function nodes, and the results obtained by applying the protocols to larger factor graphs (3000–10 000 function nodes) are shown in Fig. 8(b). This time, the best performance of PMP for those two cases is observed when the number of clusters is picked from the ranges {15–41} and {45–55}, respectively. Afterwards, the performance of PMP drops gradually for the same reason as witnessed in E1. Notably, the performance gain reaches around 37–42% for factor graphs having 100–900 function nodes, and 41–43% for 3000–10 000 function nodes.

Figure 8. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E2 (Tp1 ≫ Tp2 AND Tp1 ≫ Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.
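To put the numbers of setting E2 in perspective, the following toy calculation (illustrative only; the function name is ours) shows how the cost of a single maximization grows with the domain size d and arity n, and how much of that work remains if the intermediate step's search space is reduced by the worst-case 40% assumed throughout this paper.

```python
# Illustrative cost of one function-to-variable maximization: d^n joint
# assignments must be scanned for a function over n variables of domain size d.
def maximization_ops(d, n):
    return d ** n

for d, n in [(2, 5), (15, 5)]:
    full = maximization_ops(d, n)
    pruned = int(full * 0.6)   # assuming the worst-case 40% search-space reduction
    print(f"d={d}, n={n}: {full} operations without pruning, ~{pruned} with pruning")
```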
4.3. Experiment E3: Tp1 ≫ Tp2 AND Tp1 > Tcm

Experimental setting E3 possesses similar properties and scenarios to E2, apart from the fact that here Tp1 is moderately more expensive than Tcm instead of extremely more expensive. Similar to the previous experiment, we consider Tp2 to be 10 000 times less expensive than a randomly taken Tp1. However, Tcm is taken to be only 10 times less expensive than Tp1. It is observed from the results that, even without the domain pruning technique, PMP minimizes the cost of Tcm and Tp2 significantly in E3. This is because Tcm is not negligible here, and the operations of the intermediate step do not include any communication cost. Moreover, given that Tp1 is also very expensive, PMP produces better performance than we observed in E2 by utilizing the domain pruning technique. Altogether, Figs. 9(a) and (b) show that PMP consumes 45–49% less time than SMP for this setting when the number of clusters is chosen from the range {17–47}. Max-Sum and BMS are the exemplary algorithms where settings similar to E3 are commonly seen.

Figure 9. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E3 (Tp1 ≫ Tp2 AND Tp1 > Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

4.4. Experiment E4: Tp1 ≫ Tp2 AND Tp1 ≈ Tcm

Figure 10 shows the comparative results of PMP over SMP for experimental setting E4. E4 characterizes the scenarios where Tp1 is extremely more expensive than Tp2, and approximately equal to Tcm. To be exact, we consider Tp2 to be approximately 5000 times less expensive than the randomly taken values of Tp1 and Tcm. Here, both Tp1 and Tcm are substantial, and hence PMP achieves notable performance gains over SMP compared with the previous experiments. According to Figs. 10(a) and (b), PMP takes 59–73% less time than its SMP counterpart. The preferable range of the number of clusters for setting E4 is {15–55}.

Figure 10. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E4 (Tp1 ≫ Tp2 AND Tp1 ≈ Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

4.5. Experiment E5: Tp1 ≈ Tp2 AND Tp1 ≈ Tcm

Experiment E5 characterizes the scenarios where Tp1 and Tcm are approximately equal to the inexpensive Tp2.
Such a scenario normally occurs when message passing (SMP or PMP) is applied to the following algorithms: FMS, BnB Max-Sum [33], Generalized Fast Belief Propagation [26] or Max-Sum/BMS with a small domain size and inexpensive communication cost in terms of time. This is a relatively undemanding setting in which none of the three parameters is particularly expensive. Specifically, as Tp1 is inexpensive, the domain pruning technique has less impact on reducing the completion time of PMP. However, the effect of parallelism from the clustering process, coupled with the avoidance of redundant variable-to-function messages during the intermediate step, allows PMP to take 55–67% less time than its SMP counterpart (Figs. 11(a) and (b)). The preferable range of the number of clusters is the same as for setting E4.

Figure 11. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E5 (Tp1 ≈ Tp2 AND Tp1 ≈ Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

4.6. Experiment E6: Tp1 ≈ Tp2 AND Tp1 ≪ Tcm

Figure 12 illustrates the comparative results of PMP over SMP for experimental setting E6, which possesses similar properties, scenarios and applied algorithms to E5. However, in E6, the average message transmission cost Tcm is considerably more expensive than Tp1 and Tp2. To be exact, we consider Tp1 to be 15 times less expensive than a randomly taken value of Tcm. As Tcm is markedly more expensive and Tp2 is approximately equal to Tp1, the performance gain of PMP increases to its highest level (70–91%). To be precise, the reduction of communication achieved by avoiding the variable-to-function messages during the intermediate step, which is extremely expensive in this setting, helps PMP achieve this performance. This result signifies that PMP performs best in those settings where the communication cost is expensive. Note that the preferable range of the number of clusters for setting E6 is {15–61}.

Figure 12. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E6 (Tp1 ≈ Tp2 AND Tp1 ≪ Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

4.7. Experiment E7: Tp1 ≈ Tp2 AND Tp1 < Tcm

Experiment E7 possesses similar properties, scenarios and exemplary algorithms to setting E6, with the following exception: Tcm in E7 is moderately more expensive than Tp1 instead of considerably more expensive. To be precise, we consider Tp1 to be four times less expensive than a randomly taken value of Tcm. Due to the less substantial value of Tcm, and unlike E6 where the performance gain reaches its maximum level, PMP consumes 65–82% less time than its SMP counterpart (Figs. 13(a) and (b)).
The preferable range of the number of clusters for E7 is {17–50}.

Figure 13. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for experimental setting E7 (Tp1 ≈ Tp2 AND Tp1 < Tcm). (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

4.8. Total number of messages

The most important finding to emerge from the results of the experiments is that PMP significantly reduces the completion time of the GDL-based message passing algorithms in all the settings. However, PMP requires more messages to be exchanged than its SMP counterpart due to its two rounds of message passing. To explore this trade-off, Fig. 14 illustrates the comparative results of PMP and SMP in terms of the total number of messages for factor graphs with 50–1200 function nodes and an average of five variables connected to each function node. The results are comparable for settings with higher arities. Specifically, we find that PMP needs 27–45% more messages than SMP for a factor graph having fewer than 500 function nodes, and 15–25% more messages for a factor graph having more than 500 nodes. As more messages are exchanged at the same time in PMP due to the parallel execution, this phenomenon does not affect the performance gain in terms of completion time.

Figure 14. Total number of messages: SMP vs. PMP.

Now, based on the extensive empirical results, we can claim that, in PMP, even randomly splitting a factor graph into a number of clusters within the range of around 10–50 always produces a significant reduction in the completion time of GDL-based DCOP algorithms. However, this performance gain is neither guaranteed to be optimal, nor deterministic for a given DCOP setting. Therefore, we need an approach to predict how many clusters would produce the best performance from PMP for a given scenario. At this point, we only have a range from which we should pick the number of clusters for a certain factor graph representation of a DCOP.

5. APPROXIMATING THE APPROPRIATE NUMBER OF CLUSTERS FOR A DCOP

In this section, we turn to the challenge of determining the appropriate number of clusters for a given scenario in PMP. The ability to predict a specific number in this regard would allow PMP to split the original factor graph representation of a DCOP accurately into a certain number of clusters, prior to executing the message passing. In other words, this information allows PMP to be applied more precisely in different multi-agent DCOPs. However, it is not possible to predict the optimal number of clusters, due to the diverse nature of the application domains and the fact that a graphical representation of a DCOP can be altered at runtime. Therefore, we use an approximation. To be precise, we use a linear regression method, and run it off-line to approximate a specific number of clusters for a DCOP before initiating the message passing of PMP.
In this context, logistic regression, Poisson regression and a number of classification models could be used to predict information from a given finite data set. However, they are better suited to estimating categorical information than to predicting the specific numerical values required by our model. Therefore, we choose the linear regression method for our setting. Moreover, this method is time efficient in terms of computational cost because, as an input, it only requires an approximate number of function nodes of the corresponding factor graph representation of a DCOP in advance. The remainder of this section is organized as follows. In Section 5.1, we explain the linear regression method, and detail how it can be used along with the PMP protocol to predict the number of clusters for a specific problem instance. Then, Section 5.2 presents our empirical results from using this method on the different experimental settings (i.e. E1, E2, …, E7) defined and used in the previous section. Specifically, we show the difference in performance of PMP using the prediction method compared with its best possible results in terms of completion time. Notably, PMP's performance gain, for each value within the preferred range of the number of clusters, is shown in the graphs of the previous section. Here, we run a similar experiment to obtain the best possible performance gain for a certain problem instance, and then compare this with the gain obtained by using the predicted number of clusters. Finally, we end this section by evaluating the performance of PMP compared with SMP on two explicit implementations of GDL-based algorithms.

5.1. Determining the appropriate number of clusters

Regression analysis is one of the most widely used approaches for numeric prediction [35, 36]. A regression method can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable that is continuous valued. Many problems can be solved by linear regression, and even more can be handled by applying transformations to the variables so that a non-linear problem can be converted to a linear one. Specifically, linear regression with a single predictor variable is known as straight-line linear regression, meaning it only involves a response variable Y and a single predictor variable X. Here, the response variable Y is modelled as a linear function of the predictor variable X (Equation (21)).

Y = W0 + W1 X    (21)

W1 = [Σ_{i=1..|D|} (X_i − X̄)(Y_i − Ȳ)] / [Σ_{i=1..|D|} (X_i − X̄)²]    (22)

W0 = Ȳ − W1 X̄    (23)

In Equation (21), the variance of Y is assumed to be constant, and W0 and W1 are regression coefficients which can be thought of as weights. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. Let D be the training set consisting of values of the predictor variable X and their associated values for the response variable Y. This training set contains |D| data points of the form (X1, Y1), (X2, Y2), …, (X|D|, Y|D|). Equations (22) and (23) are used to generate the regression coefficients W1 and W0, respectively. Now, the linear regression analysis can be used to predict the number of clusters for a certain application, given that continuously updated training data from the experimental results of PMP exists. To this end, Table 1 contains the sample training data taken from the results shown in Section 4.
Here, we formulate this training data D so that straight-line linear regression can be applied, where D consists of the values of a predictor variable X (number of function nodes) and their associated values for a response variable Y (number of clusters). In more detail, this training set contains |D| (number of nodes, number of clusters) data points of the form (X1, Y1), (X2, Y2), …, (X|D|, Y|D|). Initially, Equations (22) and (23) are used to generate the regression coefficients W1 and W0, respectively, which are then used to predict the appropriate number of clusters (response variable Y) for a factor graph with a certain number of function nodes (predictor variable X) using Equation (21). For instance, based on the training data of Table 1, we can predict that for factor graphs with 4500 and 9200 function nodes PMP should split the graphs into 43 and 51 clusters, respectively (Table 2). As we need to deal with only a single predictor variable, we use the terms linear regression and straight-line linear regression interchangeably. In the remainder of this section, we evaluate the performance of this extension through extensive empirical evidence.

Table 1. Sample training data from Figs. 7–13.

Number of nodes (X)   Number of clusters (Y)   Experimental setting
3000                  25                       E1
5000                  33                       E1
8000                  38                       E1
10 000                40                       E1
3000                  47                       E2
5000                  50                       E2
8000                  52                       E2
10 000                55                       E2
3000                  32                       E3
5000                  36                       E3
8000                  44                       E3
10 000                47                       E3
3000                  40                       E4
5000                  50                       E4
8000                  52                       E4
10 000                55                       E4
3000                  38                       E5
5000                  46                       E5
8000                  50                       E5
10 000                52                       E5
3000                  50                       E6
5000                  55                       E6
8000                  58                       E6
10 000                61                       E6
3000                  40                       E7
5000                  46                       E7
8000                  49                       E7
10 000                50                       E7

Table 2. Predicted number of clusters obtained by applying the straight-line linear regression (Equations (21)–(23)) to the training data of Table 1.

Number of nodes (X)   Predicted number of clusters (Y)
3050                  41
4500                  43
5075                  44
6800                  47
7500                  48
8020                  49
8050                  49
9200                  51
9975                  52
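As an illustration of Equations (21)–(23), the following minimal sketch fits the straight-line regression to the training data of Table 1 and predicts the number of clusters for previously unseen graph sizes. The helper names are ours, not the paper's; the rounded predictions come out close to the values reported in Table 2.

```python
# Straight-line linear regression (Equations (21)-(23)) on the Table 1 training data.
# X: number of function nodes, Y: best-performing number of clusters.
data = [
    (3000, 25), (5000, 33), (8000, 38), (10000, 40),   # E1
    (3000, 47), (5000, 50), (8000, 52), (10000, 55),   # E2
    (3000, 32), (5000, 36), (8000, 44), (10000, 47),   # E3
    (3000, 40), (5000, 50), (8000, 52), (10000, 55),   # E4
    (3000, 38), (5000, 46), (8000, 50), (10000, 52),   # E5
    (3000, 50), (5000, 55), (8000, 58), (10000, 61),   # E6
    (3000, 40), (5000, 46), (8000, 49), (10000, 50),   # E7
]

xs, ys = zip(*data)
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

# Equation (22): least-squares slope; Equation (23): intercept.
w1 = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

def predict_clusters(num_function_nodes):
    """Equation (21): predicted number of clusters for a factor graph of this size."""
    return round(w0 + w1 * num_function_nodes)

for n in (4500, 9200):
    print(n, "function nodes ->", predict_clusters(n), "clusters")
```

Running this sketch yields 43 and 51 clusters for 4500 and 9200 function nodes, matching the two predictions quoted in the text above for those sizes.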
5.2. Empirical evaluation

In this section, we evaluate the performance of PMP, in terms of completion time, when the number of clusters is predicted using the linear regression method, and compare this with the highest possible performance gain from PMP, which is the best-case outcome of PMP using a certain number of clusters. In so doing, we use the same experimental settings (E1, E2, …, E7) used in Section 4. Specifically, Table 3 illustrates the comparative performance gain of PMP using the straight-line linear regression and the highest possible gain for five factor graphs having 3050, 5075, 6800, 8050 and 9975 function nodes, based on the experimental settings E1 and E6. We repeat the experiments of Section 4 for each of these factor graphs to obtain the highest possible performance gain from PMP. That is, we record the performance of PMP for all cluster counts ranging from 2 to 150. From these results, we obtain the highest possible gain and the performance based on the predicted number of clusters of PMP. It can be seen that for the factor graph with 3050 function nodes the highest possible gain of PMP reaches 60.69%, meaning PMP takes 60.69% less time than its SMP counterpart to complete the message passing operation for the DCOP represented by that factor graph. Now, if PMP is applied using the predicted number of clusters (i.e. 41) obtained from the straight-line linear regression (Table 2), the gain reaches 60.21%. This indicates that PMP attains 98.7% of its possible performance gain by applying the straight-line linear regression. Similarly, PMP attains 99.64% of the possible performance gain for a factor graph with 9975 function nodes in experimental setting E6 when applied with the number of clusters obtained from the linear regression method. Notably, this trend is common for the rest of the factor graphs: in each case, more than 98.42% of the best possible results are assured by applying the straight-line linear regression, according to the results shown in Table 3, and the results are comparable for all the other experimental settings. Significantly, it can be ascertained from our experiments that a minimum of around 98.5% of the best possible results of PMP can be achieved if the number of clusters is predicted by the straight-line linear regression method. Notably, a common phenomenon noticed in the empirical evaluation of Section 4 is that the performance of PMP falls very slowly after it reaches its peak at a certain number of clusters, whether that number is then increased or decreased. This is why approximating the number of clusters produces such good results.

Table 3. Performance gain of PMP using the linear regression method compared to the highest possible gain from PMP.
                           Experiment E1                                     Experiment E6
Number of function nodes   Best possible (%)   Using linear regression (%)   Best possible (%)   Using linear regression (%)
3050                       60.69               60.21                         88.82               88.12
5075                       63.08               62.50                         89.45               89.10
6800                       63.95               59.62                         89.80               89.35
8050                       64.82               59.62                         90.17               89.80
9975                       64.83               64.18                         90.25               89.93
('Best possible' denotes the best possible performance gain from PMP; 'Using linear regression' denotes the performance of PMP with the predicted number of clusters.)

6. SUPPLEMENTARY EMPIRICAL RESULTS

In the final experiment, we analyse the performance of PMP (compared with SMP) based on two GDL-based algorithms: Max-Sum and Fast Max-Sum. We do this to observe the performance of PMP on an actual runtime metric that complements our controlled and systematic experiments of Section 4. In this experiment, we consider the predicted number of clusters obtained using the linear regression method. Here, the factor graphs are generated in the same way as for the experiments of Section 4. Additionally, we make use of the Frodo framework [37] to generate the local utility tables (i.e. cost functions) for the function nodes of the factor graphs. On the one hand, we use two ranges of the variables' domain size, 5–8 and 12–15, to generate the utility tables for Max-Sum. In doing so, we are able to observe the comparative results for different ratios of the parameters Tp1 and Tp2. To be precise, these two ranges reflect the scenarios Tp1 > Tp2 and Tp1 ≫ Tp2, respectively. On the other hand, we restrict the domain size to exactly 2 for all the variables in the case of Fast Max-Sum, so that it reflects the characteristic of that algorithm (i.e. Tp1 ≈ Tp2). Notably, it is not possible to emulate a realistic application, such as disaster response or climate monitoring, in a simulated environment that provides the actual value of Tcm [38]. Consequently, we observe the value of Tcm to be very small in this experiment. It can be seen from the solid-gray line of Fig. 15(a) that PMP takes around 55–60% less time than SMP to complete the message passing process for Fast Max-Sum on factor graphs having 100–900 function nodes. Meanwhile, PMP reduces SMP's completion time by 35–42% for Max-Sum where the variables' domain size is picked from the range 5–8 (dashed-black line). This is because, in the former case, all three parameters (i.e. Tp1, Tp2 and Tcm) are small and comparable.
Therefore, the parallel execution of message passing, along with the avoidance of variable-to-factor messages in the intermediate step, allows PMP to attain this performance in this case. In contrast, its performance in the latter case mainly depends on the impact of domain reduction in the intermediate step, given that the values of Tp2 and Tcm are negligible when compared to Tp1. The same holds true for Max-Sum with a larger domain size, where we observe a 67–72% reduction in completion time by PMP, as opposed to its SMP counterpart (dashed-gray line). However, the observed outcome in this case indicates that the impact of domain reduction in the intermediate step gets better with an increase in domain size. Figure 15(b) illustrates a similar trend in the performance of PMP, wherein we take larger factor graphs of 3000 to around 10 000 function nodes into consideration. Here, we observe an even better performance in each of the cases, due to the impact of parallelism in larger settings.

Figure 15. Empirical performance of PMP vs. SMP running on two GDL-based algorithms. Error bars are calculated using the standard error of the mean. (a) Number of function nodes (factors): 100–900; (b) number of function nodes (factors): 3000–10 000.

7. RELATED WORK

DCOP algorithms often find a solution in a completely decentralized way. However, centralizing part of the problem can reduce the effort required to find globally optimal solutions. Although PMP is based on the GDL framework, which provides DCOP solutions in a decentralized manner, a representative agent from each cluster (i.e. the cluster head) takes the responsibility for working out a number of synchronous operations during its intermediate step.
In other words, PMP uses only the cluster heads to complete the operation of the intermediate step, instead of using all cooperating agents in the system. This makes PMP a partially centralized approach. In the multi-agent systems literature, a number of approaches utilize the computational power of comparatively more powerful agents to find and solve the hard portions of a DCOP. In particular, Optimal Asynchronous Partial Overlay (OptAPO) is an exact algorithm that discovers the complex parts of a problem through trial and error, and centralizes these sub-problems into mediating agent(s) [12]. The message complexity of OptAPO is significantly smaller than that of the benchmark exact algorithm ADOPT. However, in order to guarantee that an optimal solution has been found, one or more agents may end up centralizing the entire problem, depending on the difficulty of the problem and the tightness of the interdependence between the variables. As a consequence, it is impossible to predict where and what portion of the problem will eventually be centralized, or how much computation the mediators have to perform. Furthermore, it is possible that several mediators needlessly duplicate effort by solving overlapping problems. To address these issues, [11] introduces a partial centralization technique (PC-DPOP) based on another benchmark exact algorithm, DPOP. Unlike OptAPO, it also offers an exact prior prediction of the communication, computation and memory requirements. However, although PC-DPOP provides better control than OptAPO over which parts of the problem are centralized, this control cannot be guaranteed to be optimal. Our approach originates from the non-exact GDL framework, which is suitable for larger settings. Moreover, our approach is mostly decentralized, and the specific part (i.e. the intermediate step) of the algorithm which needs to be carried out by the cluster heads is known in advance. Therefore, no effort is required to find these parts of a DCOP. As a consequence, there is no ambiguity in deciding which part of a problem should be handled by which agent, nor is there any possibility of duplicating effort by solving overlapping problems during this step of PMP. As far as the splitting of the graphical representation is concerned, [18] proposes the use of a divide-and-coordinate approach in order to provide quality guarantees for non-exact DCOP algorithms. Nevertheless, their approach is neither targeted at reducing the completion time of DCOP algorithms, nor is it specifically based on the GDL framework. Additionally, [28, 39] utilize greedy methods to maximize the 'residual' at each iteration of the classical belief propagation method, so that the overall convergence property (i.e. solution quality) can be improved. On the contrary, in this paper we do not aim to improve the solution quality or convergence property. Rather, our objective is to minimize the completion time of GDL-based DCOP algorithms while maintaining the same solution quality. Over the past few years, a number of efforts have sought to improve the scalability of GDL-based message passing algorithms. They are mainly built upon the Max-Sum or BMS algorithm, and focus on reducing the cost of the maximization operator of those algorithms. However, most of them typically limit the general applicability of such algorithms.
For instance, FMS, BFMS and BnB-FMS can only be applied to a specific problem formulation of the task allocation domain, and [40] proposes an explorative version of Max-Sum to solve a modified DCOP formulation specifically designed for mobile sensor agents [41]. On the other hand, other approaches rely on a pre-processing step, thus denying the opportunity of obtaining local utilities at runtime [26, 33, 42]. Despite these criticisms, these extensions perform well in certain DCOP settings. Moreover, given that all of the extensions of Max-Sum and BMS follow the SMP protocol (or its asynchronous version), PMP can easily be applied to those algorithms in place of SMP.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a generic framework which significantly reduces the completion time of GDL-based message passing algorithms while maintaining the same solution quality. To be precise, our approach is applicable to all the GDL-based algorithms which use a factor graph as the graphical representation of a DCOP. In particular, we provide a significant reduction in completion time for such algorithms, ranging from 37% to 91% depending on the scenario. To achieve this performance, we introduced a cluster-based method to parallelize the message passing procedure. Additionally, a domain reduction algorithm is proposed to further minimize the cost of the expensive maximization operation. Afterwards, we addressed the challenge of determining the appropriate number of clusters for a given scenario. In so doing, we propose the use of a linear regression prediction method for approximating the appropriate number of clusters for a DCOP. Remarkably, through the empirical results, we observe that more than 98% of the best possible outcomes can be achieved if PMP is applied with the number of clusters predicted by the straight-line linear regression. This means that, if we know the size of a factor graph representation of a DCOP prior to performing the message passing, we can utilize the straight-line linear regression method to find out how many clusters should be created from that factor graph. Thus, we make PMP a deterministic approach. Given this, by using the PMP approach, we can now indeed use GDL-based algorithms to efficiently solve larger DCOPs. Notably, similar to the DCOP algorithms based on the Standard Message Passing protocol, PMP-based algorithms always find the optimal solution for acyclic factor graphs and a bounded approximate solution for cyclic factor graphs. The sacrifice in solution quality for cyclic graphical representations of DCOPs still limits the applicability of GDL-based approaches. Moreover, when the number of cycles in a factor graph is higher, the complexity of obtaining a transformed acyclic factor graph from that factor graph increases significantly [22]. At the same time, it is challenging to maintain an acceptable solution quality. In future work, we intend to investigate whether the clustering process of PMP can be utilized to obtain better solution quality for cyclic factor graphs, to further extend the applicability of GDL-based approaches. We also intend to investigate the influence of a good cluster-head selection strategy on PMP. Moreover, partially centralized approaches often trade privacy for higher scalability. In the future, we also intend to analyse PMP in the context of DCOP privacy. Furthermore, as discussed in Section 2, the function and variable nodes of a factor graph are distributed among a number of agents.
To date, there is no approach that determines how many agents should participate in this process for a particular situation. We intend to develop a model that addresses this issue for a given scenario, thereby avoiding the involvement of unnecessary agents in the message passing procedure and, in turn, reducing communication costs. As a result, such algorithms will be able to cope with even larger multi-agent settings.

References

1 Modi, P.J., Shen, W., Tambe, M. and Yokoo, M. (2005) Adopt: asynchronous distributed constraint optimization with quality guarantees. Artif. Intell., 161, 149–180.
2 Ramchurn, S.D., Farinelli, A., Macarthur, K.S. and Jennings, N.R. (2010) Decentralized coordination in RoboCup rescue. Comput. J., 53, 1447–1461.
3 Zivan, R., Yedidsion, H., Okamoto, S., Glinton, R. and Sycara, K. (2014) Distributed constraint optimization for teams of mobile sensing agents. Auton. Agent Multi Agent Syst., 29, 495–536.
4 Farinelli, A., Rogers, A. and Jennings, N.R. (2014) Agent-based decentralised coordination for sensor networks using the max-sum algorithm. Auton. Agent Multi Agent Syst., 28, 337–380.
5 Junges, R. and Bazzan, A.L. (2008) Evaluating the Performance of DCOP Algorithms in a Real World, Dynamic Problem. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 599–606. IFAAMAS.
6 Maheswaran, R.T., Tambe, M., Bowring, E., Pearce, J.P. and Varakantham, P. (2004) Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-event Scheduling. Proc. 3rd Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, New York, USA, July 9–23, pp. 310–317. ACM.
7 Cerquides, J.B., Farinelli, A., Meseguer, P. and Ramchurn, S.D. (2013) A tutorial on optimization for multi-agent systems. Comput. J., 57, 799–824.
8 Dechter, R. (2003) Constraint Processing (1st edn). Morgan Kaufmann, Massachusetts, USA.
9 Yeoh, W., Felner, A. and Koenig, S. (2008) BnB-ADOPT: An Asynchronous Branch-and-Bound DCOP Algorithm. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 591–598. IFAAMAS.
10 Petcu, A. and Faltings, B. (2005) A Scalable Method for Multiagent Constraint Optimization. Proc. 19th Int. Joint Conf. Artificial Intelligence, Edinburgh, Scotland, July 30–August 5, pp. 266–271. AAAI Press.
11 Petcu, A., Faltings, B. and Mailler, R. (2007) PC-DPOP: A New Partial Centralization Algorithm for Distributed Optimization. Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, January 6–12, pp. 167–172. AAAI Press.
12 Mailler, R. and Lesser, V. (2004) Solving Distributed Constraint Optimization Problems using Cooperative Mediation. Proc. 3rd Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, New York, USA, July 9–23, pp. 438–445. ACM.
13 Fitzpatrick, S. and Meertens, L. (2003) Distributed Coordination through Anarchic Optimization. In Lesser, V., Ortiz, C.L. and Tambe, M. (eds.) Distributed Sensor Networks: A Multiagent Perspective. Springer, Boston, USA.
14 Maheswaran, R.T., Pearce, J.P. and Tambe, M. (2004) Distributed Algorithms for DCOP: A Graphical-Game-based Approach. Proc. ISCA 17th Int. Conf. Parallel and Distributed Computing Systems (ISCA PDCS), San Francisco, USA, September 15–17, pp. 432–439. ACTA Press.
15 Kiekintveld, C., Yin, Z., Kumar, A. and Tambe, M. (2010) Asynchronous Algorithms for Approximate Distributed Constraint Optimization with Quality Bounds. Proc. 9th Int. Conf. Autonomous Agents and Multiagent Systems, Toronto, Canada, May 10–14, pp. 133–140. IFAAMAS.
16 Pearce, J.P. and Tambe, M. (2007) Quality Guarantees on k-Optimal Solutions for Distributed Constraint Optimization Problems. Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, January 6–12, pp. 1446–1451. AAAI Press.
17 Bowring, E., Pearce, J.P., Portway, C., Jain, M. and Tambe, M. (2008) On k-Optimal Distributed Constraint Optimization Algorithms: New Bounds and Algorithms. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 607–614. IFAAMAS.
18 Vinyals, M., Pujol, M., Rodriguez-Aguilar, J. and Cerquides, J. (2010) Divide-and-Coordinate: DCOPs by Agreement. Proc. 9th Int. Conf. Autonomous Agents and Multi-Agent Systems, Toronto, Canada, May 10–14, pp. 149–156. IFAAMAS.
19 Farinelli, A., Vinyals, M., Rogers, A. and Jennings, N.R. (2013) Distributed Constraint Handling and Optimization. In Weiss, G. (ed.) Multiagent Systems. MIT Press, Cambridge, Massachusetts, USA.
20 Leite, A.R., Enembreck, F. and Barthès, J.A. (2014) Distributed constraint optimization problems: review and perspectives. Expert Syst. Appl., 41, 5139–5157.
21 Zivan, R. and Peled, H. (2012) Max/Min-Sum Distributed Constraint Optimization through Value Propagation on an Alternating DAG. Proc. 11th Int. Conf. Autonomous Agents and Multi-Agent Systems, Valencia, Spain, June 4–8, pp. 265–272. IFAAMAS.
22 Rogers, A., Farinelli, A., Stranders, R. and Jennings, N. (2011) Bounded approximate decentralised coordination via the max-sum algorithm. Artif. Intell., 175, 730–759.
23 Aji, S.M. and McEliece, R. (2000) The generalized distributive law. IEEE Trans. Inf. Theory, 46, 325–343.
24 Farinelli, A., Rogers, A., Petcu, A. and Jennings, N.R. (2008) Decentralised Coordination of Low-Power Embedded Devices using the Max-Sum Algorithm. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 639–646. IFAAMAS.
25 Lesser, V. and Corkill, D. (2014) Challenges for Multi-agent Coordination Theory Based on Empirical Observations. Proc. 13th Int. Conf. Autonomous Agents and Multi-Agent Systems, Paris, France, May 5–9, pp. 1157–1160. IFAAMAS.
26 Kim, Y. and Lesser, V. (2013) Improved Max-Sum Algorithm for DCOP with n-Ary Constraints. Proc. 12th Int. Conf. Autonomous Agents and Multi-Agent Systems, Minnesota, USA, May 6–10, pp. 191–198. IFAAMAS.
27 Kschischang, F.R., Frey, B.J. and Loeliger, H. (2001) Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory, 47, 498–519.
28 Elidan, G., McGraw, I. and Koller, D. (2006) Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing. Proc. 22nd Conf. Uncertainty in AI, Massachusetts, USA, July 13–16, pp. 200–208. AUAI.
29 Peri, O. and Meisels, A. (2013) Synchronizing for Performance: DCOP Algorithms. Proc. 5th Int. Conf. Agents and Artificial Intelligence, Barcelona, Spain, February 15–18, pp. 5–14. Springer.
30 Macarthur, K. (2011) Multi-agent Coordination for Dynamic Decentralised Task Allocation. PhD thesis, University of Southampton, Southampton, UK.
31 Fioretto, F., Pontelli, E. and Yeoh, W. (2016) Distributed constraint optimization problems and applications: a survey. CoRR, abs/1602.06347.
32 Macarthur, K.S., Stranders, R., Ramchurn, S.D. and Jennings, N.R. (2011) A Distributed Anytime Algorithm for Dynamic Task Allocation in Multi-Agent Systems: Fast-Max-Sum. Proc. 25th AAAI Conf. Artificial Intelligence, San Francisco, USA, May 12–16, pp. 701–706. AAAI Press.
33 Stranders, R., Farinelli, A., Rogers, A. and Jennings, N.R. (2009) Decentralised Coordination of Mobile Sensors using the Max-Sum Algorithm. Proc. 21st Int. Joint Conf. Artificial Intelligence, California, USA, July 13–17, pp. 299–304. AAAI Press.
34 Vinyals, M., Rodriguez-Aguilar, J.A. and Cerquides, J. (2011) A survey on sensor networks from a multiagent perspective. Comput. J., 54, 455–470.
35 Kutner, M.H., Nachtsheim, C. and Neter, J. (2004) Applied Linear Regression Models (4th edn). McGraw-Hill/Irwin.
36 Han, J., Pei, J. and Kamber, M. (2011) Data Mining: Concepts and Techniques (3rd edn). Elsevier, Amsterdam, Netherlands.
37 Léauté, T., Ottens, B. and Szymanek, R. (2009) FRODO 2.0: An Open-source Framework for Distributed Constraint Optimization. Proc. IJCAI'09 Distributed Constraint Reasoning Workshop (DCR'09), Pasadena, California, USA, July 13, pp. 160–164. https://frodo-ai.tech.
38 Sultanik, E.A., Lass, R.N. and Regli, W.C. (2008) DCOPolis: A Framework for Simulating and Deploying Distributed Constraint Reasoning Algorithms. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems: Demo Papers, Estoril, Portugal, May 12–16, pp. 1667–1668. IFAAMAS.
39 Knoll, C., Rath, M., Tschiatschek, S. and Pernkopf, F. (2015) Message Scheduling Methods for Belief Propagation. Proc. Joint Eur. Conf. Machine Learning and Knowledge Discovery in Databases, Porto, Portugal, September 7–11, pp. 295–310. Springer.
40 Yedidsion, H., Zivan, R. and Farinelli, A. (2014) Explorative Max-Sum for Teams of Mobile Sensing Agents. Proc. 13th Int. Conf. Autonomous Agents and Multi-Agent Systems, Paris, France, May 5–9, pp. 549–556. IFAAMAS.
41 Zivan, R., Glinton, R. and Sycara, K. (2009) Distributed Constraint Optimization for Large Teams of Mobile Sensing Agents. Proc. IEEE/WIC/ACM Int. Joint Conf. Web Intelligence and Intelligent Agent Technology, Milan, Italy, September 15–18, pp. 347–354. IEEE.
42 Zivan, R., Parash, T. and Naveh, Y. (2015) Applying Max-Sum to Asymmetric Distributed Constraint Optimization. Proc. 24th Int. Joint Conf. Artificial Intelligence, Buenos Aires, Argentina, July 25–August 1, pp. 432–438. AAAI Press.

Footnotes

1. Here, we consider both the computation and communication cost of an algorithm in terms of time.
2. BMS has been proposed as a generic approach that can be applied to all DCOP settings, while BFMS can only be applied to a specific formulation of the task allocation domain (see [30] for more detail of the formulation).
3. It can either be SMP or its asynchronous alternative; without loss of generality, we use SMP from now on (see Section 1).

Author notes
Handling editor: Franco Zambonelli

© The British Computer Society 2018. All rights reserved.
For permissions, please e-mail: journals.permissions@oup.com
Abstract This paper develops a new approach to speed up Generalized Distributive Law (GDL) based message passing algorithms that are used to solve large-scale Distributed Constraint Optimization Problems (DCOPs) in multi-agent systems. In particular, we significantly reduce computation and communication costs in terms of convergence time for algorithms such as Max-Sum, Bounded Max-Sum, Fast Max-Sum, Bounded Fast Max-Sum, BnB Max-Sum, BnB Fast Max-Sum and Generalized Fast Belief Propagation. This is important since it is often observed that the outcome obtained from such algorithms becomes outdated or unusable if the optimization process takes too much time. Specifically, the issue of taking too long to complete the internal operation of a DCOP algorithm is even more severe and commonplace in a system where the algorithm has to deal with a large number of agents, tasks and resources. This, in turn, limits the practical scalability of such algorithms. In other words, an optimization algorithm can be used in larger systems if the completion time can be reduced. However, it is challenging to maintain the solution quality while minimizing the completion time. Considering this trade-off, we propose a generic message passing protocol for GDL-based algorithms that combines clustering with domain pruning, as well as the use of a regression method to determine the appropriate number of clusters for a given scenario. We empirically evaluate the performance of our method in a number of settings and find that it brings down the completion time by around 37–85% (1.6–6.5 times faster) for 100–900 nodes, and by around 47–91% (1.9–11 times faster) for 3000–10 000 nodes compared to the current state-of-the-art. 1. INTRODUCTION Distributed Constraint Optimization Problems (DCOPs) are a widely studied framework for solving constraint handling problems of cooperative multi-agent systems (MAS) [1]. They have been applied to many real world applications such as disaster response [2], sensor networks [3, 4], traffic control [5], meeting scheduling [6] and coalition formation [7]. DCOPs have received considerable attention from the multi-agent research community due to their ability to optimize a global objective function of problems that can be described as the aggregation of distributed constraint cost functions. To be precise, DCOP algorithms are distributed by having agents negotiate a joint solution through local message exchange, and the algorithms exploit the structure of the application domain by encoding this into constraints to tackle hard computational problems. In DCOPs, such problems are formulated as constraint networks that are often represented graphically. In particular, the agents are represented as nodes, and the constraints that arise between the agents, depending on their joint choice of action, are represented by the edges [8]. Each constraint can be defined by a set of variables held by the corresponding agents related to that constraint. In more detail, each agent holds one or more variables, each of which takes values from a finite domain. The agent is responsible for setting the value of its own variable(s) but can communicate with other agents to potentially influence their choice. The goal of a DCOP algorithm is to set every variable to a value from its domain, to minimize the constraint violation. Over the last decade, a number of algorithms have been developed to solve DCOPs under two broad categories: exact and non-exact algorithms. 
GDL-based DCOP algorithms follow a message passing protocol where agents continuously exchange messages to compute an approximation of the impact that each of the agents’ actions has on the global optimization function, by building a local objective function (expounded in Section 2).
Once the function is built, each agent picks the value of a variable that maximizes the function. Thus, this class of non-exact algorithms makes efficient use of constrained computational and communication resources, and effectively represents and communicates complex utility relationships through the network. Despite these advantages, scalability remains a challenge for GDL-based algorithms [2, 25, 26]. Specifically, they perform repetitive maximization operations for each constraint to select the locally best configuration of the associated variables, given the local function and a set of incoming messages. To be precise, a constraint that depends on n variables, each with a domain of d values, requires d^n computations for a single maximization operation. As the system scales up, the complexity of this step grows exponentially and makes this class of algorithms computationally expensive. While several attempts have been made to reduce the cost of the maximization operation, most of them typically limit the general applicability of such algorithms (see Section 7). In more detail, previous attempts at scaling up GDL-based algorithms have focused on reducing the overall cost of the maximization operator. However, they overlook an important concern: all GDL-based algorithms follow a Standard Message Passing (SMP) protocol to exchange messages among the nodes of the corresponding graphical representation of a DCOP. In the SMP protocol, a message is sent from a node v on an edge e to its neighbouring node w if and only if all the messages are received at v on edges other than e, summarized for the node associated with e [23, 27]. This means that a node in the graphical representation is not permitted to send a message to its neighbouring node until it receives messages from all its other neighbours. Here, for w to be able to generate and send messages to all its other neighbours, it depends on the message from v. To be exact, w cannot compute and transmit messages to its neighbours other than v until it has received all essential messages, including the message from v. This dependency is common to all the nodes, and as a consequence the total time required to complete the overall message passing process (the so-called completion time) increases. Now, there is an asynchronous version of message passing where nodes are initialized randomly, and outgoing messages can be updated at any time and in any sequence [23, 24]. Thus, the asynchronous protocol minimizes the waiting time of the agents, but there is no guarantee about how consistent their local views (i.e. the local objective function) are. In other words, agents can take decisions from an inconsistent view and they may need to revise their action. Therefore, unlike SMP, even in an acyclic constraint graph, this asynchronous version does not guarantee convergence after a fixed number of message exchanges. Thus, it experiences more communication and computational cost as redundant messages are generated and sent, regardless of the structure of the graph. Significantly, even in the asynchronous version, the expected result for a particular node can be achieved only when all the received messages for the node are computed by following the regulation of the SMP protocol. Building on this insight, [28] introduces an asynchronous propagation algorithm that schedules messages in an informed way.
Moreover, [29] demonstrates that the impact of inconsistent views is worse than the waiting time of the agents regarding the total completion time, due to the effort required to revise an action in the asynchronous protocol. Thus, the completion time for both cases is proportional to the diameter of the factor graph, and the asynchronous version never outperforms SMP in terms of the completion time [20, 27, 29]. In light of the aforementioned observations, this paper develops a new message passing protocol that we call Parallel Message Passing (PMP). PMP can be used to obtain the same overall results as SMP in significantly less time, when applied to all the existing GDL-based message passing algorithms. In this paper, we use SMP as a benchmark in evaluating PMP, because SMP is faster than (or, in the worst case, equal to) its asynchronous counterpart. It is noteworthy that the GDL-based algorithms which deal with cyclic graphical representations of DCOPs (e.g. BMS and Bounded Fast Max-Sum (BFMS)) initially remove the cycles from the original constraint graph, and then apply the SMP protocol on the acyclic graph to provide a bounded approximate solution of the problem. Our protocol can be applied to cyclic DCOPs in the same way. Thus, once the cycles have been removed, PMP can be applied in place of SMP on the transformed acyclic graph. In more detail, this work advances the state-of-the-art in the following ways:

PMP provides the same result as its SMP counterpart, but takes significantly less time. Here, we do not change the computation method of the messages. Rather, we efficiently distribute the overhead of message passing to the agents to exploit their computational and communication power concurrently. Thus, we reduce the average waiting time of the agents. To do so, we split the graphical representation of a DCOP into several parts (i.e. clusters) and execute message passing on them in parallel. As a consequence of this cluster formation, we have to ignore inter-cluster links. Therefore, PMP requires two rounds of message passing, and an intermediate step (done by a representative agent on behalf of the cluster, namely the cluster head) to recover the values of the ignored links. However, this overall process is still significantly quicker than SMP. In addition, we introduce a domain pruning algorithm to further reduce the time required to complete the intermediate step.

Our approach is mostly decentralized, apart from the intermediate step performed by the cluster heads. However, unlike existing partially centralized techniques such as OptAPO and PC-DPOP, which require extensive effort to find and distribute the centralized parts of a DCOP to the cluster heads (also known as mediator agents), no such additional effort is required in PMP. Thus, we effectively take advantage of partial centralization without being affected by its major shortcomings (see Section 7).

We empirically evaluate the performance of our protocol, in terms of completion time, and compare it with the GDL-based benchmark DCOP algorithms that follow the SMP protocol in different settings (up to 10,000 nodes). Our results show a speed-up of 37–91% with no reduction in solution quality, meaning a GDL-based DCOP algorithm can generate the same solution quality 1.6–11 times faster by using PMP.
By doing so, PMP makes GDL-based algorithms more scalable in that either they take less time to complete the internal operation of a given size of DCOP or they can handle a larger DCOP in the same completion time as a smaller one that uses SMP. We observe from our empirical studies that it is non-trivial to determine the appropriate number of clusters for a certain scenario. Therefore, it is important to find out the number of clusters to be formed for a particular scenario before initiating the message passing. To address this issue, our final contribution is to use a linear regression numerical prediction model to determine the appropriate number of clusters for a specific problem instance. Our empirical evidence suggests that we can achieve at least 98.5% of the possible performance gain from PMP by using the linear regression method.

The rest of this paper is structured as follows: Section 2 formally defines the generic DCOP framework and details how the SMP protocol operates on the corresponding graphical representation of a DCOP. Then, in Section 3, we discuss the technical details of our PMP protocol with worked examples. Next, we present the performance of our approach through empirical evaluation in Section 4. Afterwards, Section 5 demonstrates the details and the performance of applying the linear regression model on PMP. Subsequently, Section 6 provides some supplementary empirical evidence. Section 7 puts our work in perspective with previous approaches, and Section 8 concludes.

2. THE STANDARD MESSAGE PASSING PROTOCOL

In general, a DCOP can be formally defined as follows [1, 19, 20, 31]:

Definition 2.1 (DCOP). A DCOP can be defined by a tuple ⟨A, X, D, F, M⟩, where

A is a set of agents {A0, A1, …, Ak}.

X is a set of finite and discrete variables {x0, x1, …, xm}, which are held by the set of agents A.

D is a set of domains {D0, D1, …, Dm}, where each Di ∈ D is a finite set containing the values to which its associated variable xi may be assigned.

F is a set of constraints {F1, F2, …, FL}, where each Fi ∈ F is a function dependent on a subset of variables xi ⊆ X, defining the relationship among the variables in xi. Thus, the function Fi(xi) denotes the value for each possible assignment of the variables in xi and represents the joint pay-off that the corresponding agents achieve. Note that this setting is not limited to pairwise (binary) constraints, and the functions may depend on any number of variables.

M is a function M : η → A that represents the mapping of variables and functions, jointly denoted by η, to their associated agents. Each variable/function is held by a single agent; however, each agent can hold several variables and/or functions.

Notably, the dependencies (i.e. constraints) between the variables and the functions generate a bipartite graph, called a factor graph, which is commonly used as a graphical representation of such DCOPs [27]. Within this model, the objective of a DCOP algorithm is to have each agent assign values to its associated variables from their corresponding domains in order to either maximize or minimize the aggregated global objective function, which eventually produces the assignment of all the variables, X* (Equation (1)).

X^* = \arg\max_X \sum_{i=1}^{L} F_i(x^i) \quad \text{or} \quad X^* = \arg\min_X \sum_{i=1}^{L} F_i(x^i) \quad (1)

For example, Fig. 1 depicts the relationship among variables, functions and agents of a factor graph representation of a sample DCOP. Here, we have a set of nine variables X = {x0, x1, …, x8}, a set of six functions/factors F = {F0, F1, …, F5} and a set of three agents A = {A1, A2, A3}.
Moreover, D = {D0, D1, …, D8} is a set of discrete and finite variable domains; each variable xi ∈ X can take its value from the domain Di ∈ D. In this example, agent A1 holds two function nodes (F0, F1) and three variable nodes (x0, x1, x2). Similarly, nodes F2, F3, x3, x4 and x5 are held by agent A2, whereas functions F4, F5 and variables x6, x7 and x8 are held by agent A3. Now, these three agents participate in the optimization process in order to either maximize or minimize a global objective function F(x0,x1,x2,x3,x4,x5,x6,x7,x8). Here, the global objective function is an aggregation of six local functions F0(x0,x1), F1(x1,x2,x3), F2(x3,x4), F3(x4,x5), F4(x5,x6,x7) and F5(x7,x8).

Figure 1. A sample factor graph representation of a DCOP, with six function/factor nodes {F0,F1,…,F5} and nine variable nodes {x0,x1,…,x8}, standing for a global objective function F(x0,x1,x2,x3,x4,x5,x6,x7,x8). It also illustrates the relationship among variables, factors and agents of a DCOP. In the figure, variables are denoted by circles, factors as squares and agents as octagons.

To date, the factor graph representation of the aforementioned DCOP formulation follows the SMP protocol to exchange messages in the GDL-based message passing algorithms. Notably, both the Max-Sum and the BMS algorithms (two key algorithms based on GDL) use Equations (2) and (3) for their message passing, and they can be directly applied to the factor graph representation of a DCOP. Specifically, the variable and the function nodes of a factor graph continuously exchange messages (variable xi to function Fj (Equation (2)) and function Fj to variable xi (Equation (3))) to compute an approximation of the impact that each of the agents’ actions has on the global objective function, by building a local objective function Zi(xi). In Equations (2)–(4), Mi stands for the set of functions connected to xi and Nj represents the set of variables connected to Fj. Once the function is built (Equation (4)), each agent picks the value of a variable that maximizes the function by finding argmax_xi(Zi(xi)). Even though some extensions of the Max-Sum and the BMS algorithms (e.g. Fast Max-Sum (FMS) [2], BFMS [30] and BnB-FMS [32]) modify these equations slightly, the SMP protocol still underpins these algorithms. The reason behind this is that a message passing protocol, by definition, does not depend on how the messages are generated; rather, it determines when a message should be computed and exchanged [23, 27].

Q_{x_i \to F_j}(x_i) = \sum_{F_k \in M_i \setminus F_j} R_{F_k \to x_i}(x_i) \quad (2)

R_{F_j \to x_i}(x_i) = \max_{x_j \setminus x_i} \Big[ F_j(x_j) + \sum_{x_k \in N_j \setminus x_i} Q_{x_k \to F_j}(x_k) \Big] \quad (3)

Z_i(x_i) = \sum_{F_j \in M_i} R_{F_j \to x_i}(x_i) \quad (4)

Algorithm 1 gives an overview of how SMP operates on a factor graph in a multi-agent system. Here, a number of variable and function nodes of a factor graph FG are held by a set of agents A. The corresponding agents act (i.e. generate and transmit messages) on behalf of the nodes they hold. Initially, only the variable and the function nodes that are connected to a minimum number of neighbouring nodes in FG, denoted by iNodes, are permitted to send messages to their neighbours.
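Before walking through Algorithm 1 line by line, the following minimal Python sketch shows one way the factor graph of Fig. 1 and the iNodes notion could be represented; the class and method names (FactorGraph, add_function, i_nodes) are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of a factor-graph representation for the DCOP of Fig. 1.
# Names (FactorGraph, add_function, i_nodes) are illustrative only.
from collections import defaultdict

class FactorGraph:
    def __init__(self):
        self.neighbours = defaultdict(set)   # node -> set of adjacent nodes

    def add_function(self, f, variables):
        for x in variables:                  # connect factor f to each variable in its scope
            self.neighbours[f].add(x)
            self.neighbours[x].add(f)

    def i_nodes(self):
        """Nodes with the minimum number of neighbours (the iNodes of Algorithm 1)."""
        d = min(len(n) for n in self.neighbours.values())
        return [v for v, n in self.neighbours.items() if len(n) == d]

# The factor graph of Fig. 1: six factors over nine variables.
fg = FactorGraph()
fg.add_function('F0', ['x0', 'x1'])
fg.add_function('F1', ['x1', 'x2', 'x3'])
fg.add_function('F2', ['x3', 'x4'])
fg.add_function('F3', ['x4', 'x5'])
fg.add_function('F4', ['x5', 'x6', 'x7'])
fg.add_function('F5', ['x7', 'x8'])

print(sorted(fg.i_nodes()))   # ['x0', 'x2', 'x6', 'x8'] -- the leaf variables start the message passing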
Line 1 of Algorithm 1 finds the set of agents Am ⊆ A that hold iNodes. Specifically, the function messageUpdate() represents the messages sent by the agents, on behalf of the permitted nodes they hold, to their permitted neighbours in a particular time step within a factor graph. Notably, the SMP protocol ensures that a node (variable or function) within a factor graph cannot generate and transmit a message to a particular neighbour before receiving messages from the rest of its neighbour(s). According to this regulation of SMP, the permitted nodes (pNodes) and their corresponding permitted neighbours (pNodes.pNeighbours) for a particular time step are determined. At the very first time step, agents Am, on behalf of iNodes (also denoted as Am.iNodes), send NULL values to all their neighbouring nodes (iNodes.allNeighbours) within FG (Line 2). Now, following the SMP protocol, a set of agents Am′, on behalf of pNodes (namely Am′.pNodes), compute messages (generatedMessages) using Equation (2) or (3) for those neighbours (pNodes.pNeighbours) they are allowed to send to (Line 4). The while loop in Lines 3–7 ensures that this will continue until each of the nodes has sent messages to all of its neighbours. Within this loop, once a variable xi receives messages from all of its neighbours, it can build a local objective function Zi(xi), and the corresponding agent chooses the value that maximizes it by finding argmax_xi(Zi(xi)) (Lines 5–7).

Algorithm 1 Overview of the SMP protocol on a factor graph.

Figure 2 demonstrates a worked example of how SMP works on the factor graph shown in Fig. 1. Here, Equation (5) and Equations (6)–(9) illustrate two samples of how the variable-to-function (e.g. x5 to F4, or Qx5→F4(x5)) and the function-to-variable (e.g. F4 to x6, or RF4→x6(x6)) messages are computed based on Equations (2) and (3), respectively. All the messages are generated considering the local utilities depicted at the bottom of Fig. 2 for the domain {R,B}. During the computation of the messages, red and blue colours are used to distinguish the values of the domain states R and B, respectively. In the former example, apart from the receiving node F4, the sending node x5 has only one other neighbouring node (i.e. F3). Therefore, x5 only needs to forward the message it received from F3 to the node F4 (Equation (5)). In the latter example, the computation of the message from F4 to x6 includes a maximization operation on the summation of the local utility function F4 and the messages received by F4 from its neighbours other than x6 (i.e. x5, x7). Given that the messages are sent by the corresponding agents on behalf of the nodes they hold, for simplicity, we omit the agents from the worked examples in this paper.

Q_{x_5 \to F_4}(x_5) = R_{F_3 \to x_5}(x_5) = \{81, 82\} \quad (5)

R_{F_4 \to x_6}(x_6) = \max_{\{x_5,x_6,x_7\} \setminus \{x_6\}} \big[ F_4(x_5,x_6,x_7) + \big( Q_{x_5 \to F_4}(x_5) + Q_{x_7 \to F_4}(x_7) \big) \big] \quad (6)

= \max_{\{x_5,x_7\}} \big[ \{6,9,25,7,6,27,3,2\} + \{81,81,81,81,82,82,82,82\} + \{40,45,40,45,40,45,40,45\} \big] \quad (7)

(here each list enumerates the assignments of (x5, x6, x7), with x7 varying fastest)

= \max_{\{x_5,x_7\}} \{127,135,146,133,128,154,125,129\} \quad (8)

= \{154, 146\} \quad (9)

Z_1(x_1) = R_{F_0 \to x_1}(x_1) + R_{F_1 \to x_1}(x_1) = \{12,10\} + \{135,144\} = \{147,154\} \quad (10)

Figure 2. Worked example of SMP on the factor graph of Fig. 1. In the factor graph, each of the tables represents the corresponding local utility of a function for the domain {R,B}. The values within curly brackets represent a message computed based on these local utilities, and each arrow indicates the sending direction of the message.
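To make Equations (2) and (3) concrete, the following Python sketch recomputes the two sample messages above; it assumes the flattened utility table of F4 is read in the order (x5, x6, x7) with x7 varying fastest, as in the reconstruction of Equation (7), and the helper names are illustrative rather than taken from the paper.

```python
# Sketch of the SMP message computations of Equations (2) and (3), checked against
# the worked example above. The F4 utility table is read from Equation (7) under the
# assumed ordering (x5, x6, x7) with x7 varying fastest.
from itertools import product

DOMAIN = ['R', 'B']

# F4(x5, x6, x7) as reconstructed in Equation (7) (an assumption about the figure's table).
F4 = dict(zip(product(DOMAIN, repeat=3), [6, 9, 25, 7, 6, 27, 3, 2]))

def q_message(incoming):
    """Equation (2): a variable sums the R-messages from its other neighbouring factors."""
    return {d: sum(m[d] for m in incoming) for d in DOMAIN}

def r_message(factor, scope, target, q_msgs):
    """Equation (3): maximise factor + incoming Q-messages over all variables except `target`."""
    out = {}
    for d in DOMAIN:
        best = float('-inf')
        for assignment in product(DOMAIN, repeat=len(scope)):
            if assignment[scope.index(target)] != d:
                continue
            val = factor[assignment] + sum(q_msgs[v][assignment[scope.index(v)]]
                                           for v in scope if v != target)
            best = max(best, val)
        out[d] = best
    return out

q_x5 = q_message([{'R': 81, 'B': 82}])          # Equation (5): x5 simply forwards F3's message
q_x7 = {'R': 40, 'B': 45}
print(r_message(F4, ['x5', 'x6', 'x7'], 'x6',
                {'x5': q_x5, 'x7': q_x7}))       # {'R': 154, 'B': 146}, as in Equation (9)
```

Running the sketch reproduces the values of Equations (5) and (9).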
In the example of Fig. 2, initially, only x0, x2, x6 and x8 (i.e. iNodes) can send messages to their neighbours at the very first time step (Line 2 of Algorithm 1). According to the SMP protocol, in time step 2, only F0 and F5 can generate and send messages to x1 and x7, respectively. It is worth mentioning that, despite receiving messages from x2 and x6 respectively in time step 1, F1 and F4 cannot send messages in time step 2 as they have to receive messages from at least two of their three neighbours, according to the regulation of SMP (Line 4). Hence, F1 has to wait for the message from x1 to be able to generate and send a message to x3. Similarly, F1 cannot generate a message for x2 until it receives messages from x1 and x3. Subsequently, x3 cannot send a message to F1 until it receives the message from F2. In this process, a variable x1 can only build its local objective function Z1(x1) when it receives messages from all of its neighbours F0 and F1 (Equation (10)), and this is common for all the variables. In Equation (10), from the value {R,B} = {147,154} generated by Z1(x1) for the variable x1, the holding agent of x1 chooses B = 154, that is, argmax_x1(Z1(x1)). Following the SMP protocol, the complete message passing procedure will end after each node receives messages from all of its neighbours. These are the dependencies we discussed in Section 1, which make GDL-based algorithms less scalable in practice for large real world problems. Formally, the total time required to complete the message passing procedure for a particular factor graph can be termed the completion time T, and the ultimate objective is to reduce the completion time while maintaining the same solution quality from a message passing algorithm. To address this issue, we introduce a new message passing protocol in the following section.

3. THE PARALLEL MESSAGE PASSING PROTOCOL

PMP computes messages in the same way as its SMP counterpart. For example, Max-Sum messages are used when PMP is applied to the Max-Sum algorithm, and FMS messages are used when it is applied to the FMS algorithm. Even so, PMP reduces the completion time by splitting the factor graph into a number of clusters (Definition 3.1), and independently running the message passing on those clusters in parallel. As a result, the average waiting time of the nodes during the message passing is reduced. In particular, the completion time of PMP is reduced to Tsmp/NC, where Tsmp is the completion time of the algorithm that follows the SMP protocol and NC is the number of clusters. However, PMP ignores inter-cluster links (i.e. messages) during the formation of clusters. Hence, it is not possible to obtain the same solution quality as the original algorithm by executing only one round of message passing. This is why PMP requires two rounds of message passing and an additional intermediate step. The role of the intermediate step is to generate the ignored messages (Definition 3.2) for the split node(s) of a cluster, so that the second round can use these as initial values for those split node(s) to compute the same messages as the original algorithm.
To be precise, a representative agent (or a cluster head) takes the responsibility of performing the operation of the intermediate step for the corresponding cluster. To make it possible, we assume that each of the cluster heads retains full knowledge of that cluster, and it can communicate with its neighbouring clusters, making PMP a partially centralized approach. As a consequence of two rounds of message passing and an intermediate step, the total completion time of PMP (i.e. Tpmp) becomes 2 × Tsmp/NC + Tintm, where Tintm is the time required to complete the intermediate step. As the sizes of the clusters can be different in PMP, a more precise way to compute Tpmp is through Equation (11). Here, Tc_largest stands for the time required to complete the message passing process of the largest cluster in PMP. Having discussed how to compute the completion time of PMP, we explain the details of our proposed algorithm in the remainder of the section.

T_{pmp} = 2 \times T_{c_{largest}} + T_{intm} \quad (11)

3.1. Algorithm overview

Algorithm 2 gives an overview of PMP. Similar to SMP, it works on a factor graph FG, and the variable and the function nodes of FG are being held by a set of agents A. To form the clusters in a decentralized manner, PMP finds an agent Ac ∈ A that holds a special function node (firstFunction), which initiates the cluster formation procedure (Line 1). Specifically, firstFunction is a function node that shares variable(s) with only one function node. As PMP operates on acyclic or transformed acyclic factor graphs, such node(s) will always be found. Now, each agent that holds a function node maintaining this property broadcasts an initiator message, and any agent can be picked if more than one agent is found. Then, in Line 2, agent Ac initiates the procedure distributeNodes(FG,Ac) that distributes the nodes of FG to the clusters {c0,c1,…,cNC} in a decentralized way, and the detail of the cluster formation procedure will be explained shortly in Algorithm 3. Note that, in PMP, all the operations within each cluster are performed in parallel.

Algorithm 2 Overview of the PMP protocol on a factor graph.

Algorithm 3 Parallel Message Passing.

After the cluster formation procedure has completed, PMP starts the first round of message passing (Lines 3–6). Line 3 finds the set of agents Am ∈ A that hold the variable and function nodes iNodes_ci that are connected to the minimum number of neighbouring nodes within each cluster ci. Then, messageUpdate() of Line 4 represents those messages with NULL values sent by Am on behalf of iNodes_ci, also denoted as Am.iNodes_ci, to all their neighbouring nodes (iNodes_ci.allNeighbours) within the cluster ci. Afterwards, following the same procedure as SMP, a set of nodes Am′.pNodes generate the messages (generatedMessages) for the neighbours (pNodes.pNeighbours) they are allowed to send messages to (Line 6). However, unlike SMP, where the message passing procedure operates on the entire FG, PMP executes the first round of message passing in parallel on the clusters having only one neighbouring cluster (Line 5). This is because, in the first round, it is redundant to run message passing on a cluster having more than one neighbouring cluster, as the second round will re-compute the messages (see the explanation in Section 3.3).
The while loop in Lines 5 and 6 ensures that this procedure will continue until each of the nodes sends messages to all its neighbouring nodes within the participating clusters of the first round. Next, a representative agent from each cluster ci computes the values (Definition 3.2) ignored during the cluster formation procedure for that particular cluster (Line 7). Note that these ignored values, represented by ignVal(ci), are the same values that those edges would carry if we ran SMP on the complete factor graph FG. Finally, the second round of message passing is started on all of the clusters in parallel, by considering ignVal(ci) as initial values for those ignored edges (Line 8). Similar to the first round, the while loop ensures that this procedure will continue until each of the nodes sends messages to all its neighbouring nodes within ci (Lines 9–13). Within the second round of message passing, once a variable xi receives messages from all of its neighbours within ci, it can build a local objective function Zi(xi). Then, the corresponding agent chooses the value that maximizes it by finding argmax_xi(Zi(xi)) (Lines 12 and 13). By considering the ignVal(ci) values in the second round, PMP generates the same solution quality as SMP. We give a more detailed description of each part of PMP with a worked example in the remainder of this section. To be exact, Section 3.2 concentrates on the cluster formation and the message passing procedure. Then, Section 3.3 presents the intermediate step. Finally, Section 3.4 ends this section with a complete comparative example, SMP vs. PMP, in terms of the completion time.

Definition 3.1 (Cluster, Neighbouring Clusters and Split Node). A cluster ci is a sub factor graph of a factor graph FG. Two clusters ci and cj are neighbours if and only if they share a common variable node (i.e. a split node) xp. For instance, c1 and c2 of Fig. 3 are two sub factor graphs of the entire factor graph shown in Fig. 2. Here, c1 and c2 are neighbouring clusters as they share variable x3 as a split node.

Definition 3.2 (Ignored Values of a Cluster, ignVal(ci)). The value(s) overlooked, through the split node(s) of each cluster ci, during the first round of message passing. In other words, these are the incoming messages through the split node(s), should the SMP protocol have been followed. The intermediate step of PMP takes the responsibility of computing these ignored values, so that they can be used in the second round in order to obtain the same solution quality from an algorithm as its SMP counterpart. In the example of Fig. 3, the intermediate step recovers {R,B} = {119,126} for the split node x3 of cluster c1, which is going to be used as an initial value for x3 in the second round of message passing, instead of {R,B} = {0,0}.

Figure 3. Worked example of PMP (participating clusters: first round—(c1, c3); second round—(c1, c2, c3)) on the same factor graph and local utility as in Fig. 2. In this figure, blue circles represent split variables for each cluster and coloured messages show the ignored values (ignVal()) recovered during the intermediate step, where yellow messages require synchronous computations but green underlined ones are ready after the first round.
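The following sketch illustrates Definition 3.1 on the clustering of Fig. 3 (c1 = {F0, F1}, c2 = {F2, F3}, c3 = {F4, F5}); the dictionary-based encoding and helper names are illustrative assumptions.

```python
# Sketch of the cluster notions in Definition 3.1, using the clustering of Fig. 3.
SCOPES = {'F0': {'x0', 'x1'}, 'F1': {'x1', 'x2', 'x3'}, 'F2': {'x3', 'x4'},
          'F3': {'x4', 'x5'}, 'F4': {'x5', 'x6', 'x7'}, 'F5': {'x7', 'x8'}}
CLUSTERS = {'c1': {'F0', 'F1'}, 'c2': {'F2', 'F3'}, 'c3': {'F4', 'F5'}}

def variables_of(cluster):
    return set().union(*(SCOPES[f] for f in CLUSTERS[cluster]))

def split_nodes(ci, cj):
    """Variables shared by two clusters (Definition 3.1)."""
    return variables_of(ci) & variables_of(cj)

# Clusters are neighbours iff they share a split node; only clusters with exactly one
# neighbouring cluster participate in the first round of message passing.
neighbours = {c: {d for d in CLUSTERS if d != c and split_nodes(c, d)} for c in CLUSTERS}
first_round = [c for c, ns in neighbours.items() if len(ns) == 1]

print(split_nodes('c1', 'c2'))   # {'x3'}
print(split_nodes('c2', 'c3'))   # {'x5'}
print(sorted(first_round))       # ['c1', 'c3'] -- only these clusters run the first round
```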
3.2. Cluster formation and message passing

PMP operates on a factor graph FG of a set of variables X and a set of functions F. Specifically, Lines 1–12 of Algorithm 3 generate NC clusters by splitting FG. In the process, Lines 1 and 2 compute the maximum number of function nodes per cluster (N), and the associated variable nodes of a corresponding function go to the same cluster. Now, Line 3 gets a special function node firstFunction, which is a function node that shares variable node(s) with only one other function node (i.e. min_nFunction(F)). Any node can be chosen in case of a tie. Then, Line 4 initializes the variable ‘node’ with the chosen firstFunction, which is the first member node of the cluster c1. The for loop of Lines 5–12 iteratively adds the member nodes to each cluster ci. In order to do this in a decentralized manner, a special variable count is used as a token to keep track of the current number of function nodes belonging to ci. When a new node added to ci is held by a different agent, the variable count is passed to the new holding agent. The while loop (Lines 7–11) iterates as long as the member count for a cluster ci remains less than the maximum number of nodes per cluster (N); once the count reaches N, the next new node becomes a member of the next cluster. In the worked example of Fig. 3, we use the same factor graph shown in Fig. 2 that consists of six function nodes {F0,F1,…,F5} and nine variable nodes {x0,x1,…,x8}. Here, F0 and F5 satisfy the requirements to become the firstFunction node as both of them share variable nodes with only one other function node (F1 and F4, respectively). We pick F0 randomly as the firstFunction, which eventually becomes the first node for the first cluster c1; therefore, the holding agent of node F0 now holds the variable count as long as the newly added nodes are held by the same agent. Assume the number of clusters (NC) for this example is 3: c1, c2 and c3 (see Section 5 for more details about the appropriate number of clusters). In that case, each cluster can retain a maximum of two function nodes. According to the cluster formation process of PMP, the sets of function nodes {F0, F1}, {F2, F3} and {F4, F5} belong to the clusters c1, c2 and c3, respectively. Moreover, c1 and c2 are neighbouring clusters as they share a common split node x3. Similarly, split node x5 is shared by the neighbouring clusters c2 and c3. At this point, the first of the two rounds of message passing initiates as the cluster formation procedure has completed. The for loop in Lines 13–16 acts as the first round of message passing. This involves only computing and sending the variable-to-function (Equation (2)) and the function-to-variable (Equation (3)) messages, in parallel, within the clusters (SN) having only a single neighbouring cluster. It can be seen from our worked example of Fig. 3 that, among the clusters c1, c2 and c3, only c2 has more than one neighbouring cluster. Therefore, only c1 and c3 will participate in the first round of message passing. Unlike the first round, all the clusters participate in the second round of message passing in parallel (Lines 19–23). In the second round, instead of using the null values (i.e.
predefined initial values) for initializing all the variable-to-function messages, we exploit the recovered ignored values (Definition 3.2) from the intermediate step (Lines 17 and 18) as initial values for the split variable nodes, as shown in Line 21. Here, all the ignored messages from the split nodes of a cluster ci are denoted as Q_ignEdge(ci). The rest of the messages are then initialized as null (Lines 20 and 22). Here, ∀Q(ci) and ∀R(ci) represent all the variable-to-function and the function-to-variable messages within a cluster ci, respectively. For example, in cluster c3, all the messages are initialized as zeros for the first round of message passing. Therefore, during the first round, the variable nodes x5 and x8 start the message passing with the values {0,0} and {0,0} to the function nodes F4 and F5, respectively, in Fig. 3. However, in the second round, split node x5 starts with the value {81,82} instead of {0,0} to the function node F4. Note that this value {81,82} is the ignored value for the split node x5 of cluster c3 computed during the intermediate step of PMP. Significantly, this is the same value transmitted by the variable node x5 to the function node F4 if we follow the SMP protocol (see Fig. 2), which ensures the same solution quality from both protocols. We describe this intermediate step of PMP shortly. Finally, PMP will converge with Equation (4) by computing the value Zi(xi) and hence finding argmax_xi(Zi(xi)).

3.3. Intermediate step

A key part of PMP is the intermediate step (Algorithm 4). It takes a cluster (ci) provided by Line 18 of Algorithm 3 as an input, and returns the ignored values (Definition 3.2) for each of the ignored links of that cluster. A representative of each cluster ci (the cluster head ch_i) performs the operation of the intermediate step for that cluster. Note that each cluster head operates in parallel. Initially, each cluster head needs to receive the StatusMessages from the rest of the cluster heads (Line 1 of Algorithm 4). Each StatusMessage contains the factor graph structure of the sending cluster along with the utility information. Notably, the StatusMessages can be formed and exchanged during the first round, so they do not incur an additional delay. The for loop in Lines 2–14 computes the ignored values for each of the split nodes Sj ∈ S (where S = {S1,S2,…,Sk}) of the cluster ci by generating a Dependent Acyclic Graph, DG(Sj) (Definition 3.3). In addition to the StatusMessages, a cluster head also requires a factor-to-split-variable message (Mr) from each of the participating clusters of the first round. This is significant, as only clusters with one neighbouring cluster can participate in the first round, and the Mr message is prepared for the split node Sj of that neighbouring cluster. The content of Mr will not change, as the participating cluster of the first round has no other clusters on which it depends. As a consequence, if a neighbouring cluster of ci has participated in the first round, the Dependent Acyclic Graph DG(Sj) for Sj comprises only one edge carrying the message Mr. In more detail, cp stands for the neighbouring cluster of ci that shares the split node Sj (i.e. adjCluster(ci,Sj)), and the variable dCount_cp holds the total number of clusters adjacent to cp, obtained from the function totalAdjCluster(cp) (Lines 3 and 4). If the cluster cp has no cluster to depend on apart from ci (i.e.
cp has participated in the first round of message passing), there is no need for further computation as the ignored value for Sj (i.e. Sj.values) is immediately ready (READY.DG(Sj)) after the first round (Lines 6 and 7). Here, the function append(Sj.values) appends the ignored value for Sj to ignVal(ci).

Algorithm 4 intermediateStep(Cluster ci).

On the other hand, if the cluster cp has other clusters to depend on, further computations in the graph are required. This creates the need to find each node of that graph DG(Sj) (Lines 8–14). Line 9 initializes the first function node dNode of DG(Sj), which is connected to the split node Sj and is a member of the cluster cp (i.e. adjNode(Sj,cp)). The while loop (Lines 10–12) repeatedly forms that graph by extracting the adjacent nodes from the first selected node, dNode. Finally, synchronous executions (explained shortly) from the farthest node to the start node (i.e. the split node Sj) of DG(Sj) produce the desired value Sj.values for Sj (Line 13), which eventually becomes the ignored value for that split node of the cluster ci (Line 14). This value will be used as an initial value during the second round of message passing for the corresponding split node.

Definition 3.3 (Dependent Acyclic Graph, DG(Sj)). A DG(Sj) is an acyclic directed graph for a split node Sj of a cluster ci, directed from the node of the factor graph FG that is furthest from Sj towards Sj. Note that, apart from the node Sj, none of the nodes of DG(Sj) can belong to the cluster ci. During the intermediate step, synchronous operations are performed at the edges of this graph in the same direction to compute each ignored value of a cluster, ignVal(ci). In the example of Fig. 3, F3→F2→x3 is the dependent acyclic graph for the split node x3 of cluster c1 in the intermediate step.

R_{F_8 \to x_0}(x_0) = Q_{x_0 \to F_7}(x_0) = D_{F_8 \to F_7}(x_0) \quad (12)

R_{F_8 \to x_0}(x_0) = \max_{\{x_0,\dots,x_4\} \setminus \{x_0\}} \big[ F_8(x_0,x_1,x_2,x_3,x_4) + Q_{x_1 \to F_8}(x_1) + Q_{x_2 \to F_8}(x_2) + Q_{x_3 \to F_8}(x_3) + Q_{x_4 \to F_8}(x_4) \big]
= \max_{\{x_1,x_2,x_3,x_4\}} \big[ F_8(x_0,x_1,x_2,x_3,x_4) + \{ Q_{x_1 \to F_8}(x_1) + Q_{x_4 \to F_8}(x_4) \} + \{ R_{F_9 \to x_2}(x_2) \} + \{ R_{F_3 \to x_3}(x_3) + R_{F_4 \to x_3}(x_3) \} \big]
= \max_{\{x_1,x_2,x_3,x_4\}} \big[ F_8(x_0,x_1,x_2,x_3,x_4) + \{ R_{F_9 \to x_2}(x_2) + R_{F_3 \to x_3}(x_3) + R_{F_4 \to x_3}(x_3) \} \big]
= \max_{\{x_1,x_2,x_3,x_4\}} \big[ F_8(x_0,x_1,x_2,x_3,x_4) + \{ D_{F_9 \to F_8}(x_2) + D_{F_3 \to F_8}(x_3) + D_{F_4 \to F_8}(x_3) \} \big] \quad (13)

D_{F_8 \to F_7}(x_0) = \max_{\{x_1,x_2,x_3,x_4\}} \big[ F_8(x_0,x_1,x_2,x_3,x_4) + \{ D_{F_9 \to F_8}(x_2) + D_{F_3 \to F_8}(x_3) + D_{F_4 \to F_8}(x_3) \} \big] \quad (14)

D_{F_j \to F_p}(x_i) = \max_{x_j \setminus x_i} \Big[ F_j(x_j) + \sum_{k \in C_j \setminus F_p} D_{F_k \to F_j}(x_t) \Big] \quad (15)

As discussed, the entire operation of the intermediate step is performed by the corresponding cluster head for each cluster. Therefore, apart from receiving the Mr values, which amounts to a single message from each participating cluster of the first round, there is no communication cost in this step. This produces a significant reduction of communication cost (time) in PMP. Moreover, we can avoid the computation of variable-to-factor messages in the intermediate step as they are redundant and contribute nothing further in this step. In the example of Fig. 4, we consider every possible scenario while computing the message F8→x0 (i.e. F8→F7), and show that the variable-to-factor messages (x4→F8, x1→F8, x2→F8, x3→F8) are redundant during the intermediate step of PMP (Equations (12)–(14)). Here, Q_{x1→F8}(x1) = {0,0,…,0} and Q_{x4→F8}(x4) = {0,0,…,0}, as x1 and x4 do not have any neighbours apart from F8. As a result, we get Equation (14) from Equations (12) and (13), and Equation (15) is the generalization of Equation (14).
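To make Equation (15) concrete, the sketch below computes a D-message from F4 towards F5 over their shared variable x7, reusing the F4 table as reconstructed in Equation (7) and the {81, 82} message over x5 from the earlier worked example (the leaf x6 contributes a null message). The function name and the resulting values are only illustrative of the operation, not figures taken from the paper.

```python
# A sketch of the synchronous operation of Equation (15): a D-message from factor Fj
# towards a neighbouring factor Fp over their shared variable. The F4 table is the one
# assumed in Equation (7); incoming values over x5 come from the worked example.
from itertools import product

DOMAIN = ['R', 'B']
F4 = dict(zip(product(DOMAIN, repeat=3), [6, 9, 25, 7, 6, 27, 3, 2]))  # keys: (x5, x6, x7)

def d_message(factor, scope, shared, incoming):
    """Equation (15): maximise factor plus incoming D-messages over all variables
    except the one shared with the receiving factor."""
    out = {}
    for d in DOMAIN:
        out[d] = max(factor[a] + sum(incoming.get(v, {}).get(a[scope.index(v)], 0)
                                     for v in scope if v != shared)
                     for a in product(DOMAIN, repeat=len(scope))
                     if a[scope.index(shared)] == d)
    return out

# D-message F4 -> F5 over the shared variable x7 (illustrative values under the assumed table).
print(d_message(F4, ['x5', 'x6', 'x7'], 'x7', {'x5': {'R': 81, 'B': 82}}))
# {'R': 106, 'B': 109}
```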
Cluster c1: before the first round of message passing, the split node of c1 is S = {S1 = x3}, and the initial values for x3 are {0,0}. Before the intermediate step, the ignored values for x3 are {R_{F2→x3}(x3)}. After the intermediate step, the initial values for x3 become S1.values = {119,126}. (16)

Cluster c2: before the first round of message passing, the split nodes of c2 are S = {S1 = x3, S2 = x5}, and the initial values for both x3 and x5 are {0,0}. Before the intermediate step, the ignored values for x3 are {R_{F1→x3}(x3)} and the ignored values for x5 are {R_{F4→x5}(x5)}. After the intermediate step, the initial values become x3 = S1.values = {27,28} and x5 = S2.values = {65,72}. (17)

Cluster c3: before the first round of message passing, the split node of c3 is S = {S1 = x5}, and the initial values for x5 are {0,0}. Before the intermediate step, the ignored values for x5 are {R_{F3→x5}(x5)}. After the intermediate step, the initial values for x5 become S1.values = {81,82}. (18)

Figure 4. Single computation within the intermediate step. In the figure, directed dashed arrows indicate the dependent messages to generate the desired message from F8 to x0, or F8 to F7 (directed straight arrows).

Despite the aforementioned advantages, each synchronous execution (i.e. D_{Fj→Fp}(xi)) within DG(Sj) is still as expensive as a factor-to-variable message, and can be computed using Equation (15). In this context, Cj denotes the set of indexes of the functions connected to function Fj in the dependent acyclic graph (DG(Sj)) of the intermediate step, and xt stands for a variable connected to both functions Fk and Fj. Notably, Equation (15) retains similar properties to Equation (3), but the receiving node is a function node (or the split node) instead of only a variable node. For example, x3 is a split node for cluster c1 in Fig. 3, where F2 and F3 are the nodes of the dependent acyclic graph for x3. Here, the cluster head of c1 receives the Mr value {65,72}. Then the first operation on the graph produces {102,99} for the edge F3→F2. Afterwards, by taking {102,99} as the input, the cluster head of c1 generates {119,126}, which is the ignored value for x3 ∈ c1 generated during the intermediate step. Now, instead of {0,0}, the second round uses {119,126} as the initial value for node x3 in cluster c1. On the other hand, cluster c2 has two neighbouring clusters, c1 and c3, and neither of them has other clusters to depend on. Therefore, there is no need for further computation in the intermediate step for the split nodes x3 and x5 of cluster c2. The Mr values {27,28} and {65,72} are used directly as the ignored values for the split nodes x3 and x5, respectively, of cluster c2. All the values related to the split nodes of the clusters c1, c2 and c3 during the different steps of PMP are summarized in Equations (16)–(18), respectively. Note that each synchronous operation (i.e. Equation (15)) on each edge of DG(Sj) in the intermediate step still requires a significant amount of computation due to the potentially large domain size and constraints with high arity. Considering this, in order to improve the computational efficiency of this step, we propose an algorithm to reduce the domain size over which the maximization needs to be computed (Algorithm 5). In other words, Algorithm 5 operates on Equation (15), which represents a synchronous operation of the intermediate step, to reduce its computational cost.
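Before stepping through Algorithm 5 in detail, the following sketch captures the pruning test it relies on, traced on the same F4 example used in the walkthrough below; the helper names are illustrative and the utility table is the one assumed in Equation (7).

```python
# A rough sketch of the pruning idea behind Algorithm 5: for one domain state of the
# target variable, any utility entry u with u <= s can be discarded once
# t = m - b <= p - s, because its total can never exceed the total guaranteed by the
# best entry p.
from math import log2
from itertools import product

DOMAIN = ['R', 'B']
F4 = dict(zip(product(DOMAIN, repeat=3), [6, 9, 25, 7, 6, 27, 3, 2]))  # (x5, x6, x7), assumed table
INCOMING = {'x5': {'R': 81, 'B': 82}, 'x7': {'R': 40, 'B': 45}}
SCOPE = ['x5', 'x6', 'x7']

def pruned_utilities(target, state):
    rows = [(a, u) for a, u in F4.items() if a[SCOPE.index(target)] == state]
    rows.sort(key=lambda r: r[1], reverse=True)          # utilities sorted per state
    m = sum(max(msg.values()) for msg in INCOMING.values())
    (best_a, p), rest = rows[0], rows[1:]
    b = sum(INCOMING[v][best_a[SCOPE.index(v)]] for v in INCOMING)
    t = m - b                                            # the base case of Algorithm 5
    window = max(1, int(log2(len(rows))))
    # The paper picks s randomly from each window of log2|di| values; taking the smallest
    # value in the window is the simplest deterministic choice for this sketch.
    for j in range(0, len(rest), window):
        s = min(u for _, u in rest[j:j + window])
        if t <= p - s:
            return [u for _, u in rows if u > s]         # the range [p, s): keep utilities above s
    return [u for _, u in rows]                          # no pruning possible

print(pruned_utilities('x6', 'B'))   # [25, 7] -- half of the entries for state B are discarded
```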
Algorithm 5 requires the incoming messages from the neighbour(s) of a function in DG(Sj), and each local utility must be sorted independently for each state of the domain. Specifically, this sorting can be done before computing the StatusMessage, during the first round of message passing; therefore, it does not incur an additional delay. Finally, the algorithm returns a pruned range of values for each state of the domain (i.e. {d1,d2,…,dr}) over which the maximization will be computed.

Algorithm 5 Domain pruning to compute D_{Fj→Fp}(xi) in the intermediate step of PMP.

As discussed in the previous section, D_{Fj→Fp}(xi) stands for a synchronous operation where Fj computes a value for Fp within DG(Sj). Initially, Line 2 computes m, which is the summation of the maximum values of the messages received by the sending function Fj, other than from Fp. In the worked example of Fig. 5, we illustrate the complete process of domain pruning for the state B while computing a sample message from F4 to F5. Notably, this is the same example we previously used in Section 2 to explain the function-to-variable message computation process (see Equations (6)–(9)), and it can be seen that the synchronous operation (i.e. F4 to F5) in the intermediate step is similar to the function-to-variable (F4 to x6) computation. Here, the messages received by the sending node F4 are {81,82} and {40,45}. As the maxima of the received messages are 82 and 45, the value of m = 82 + 45 = 127. Now, the for loop in Lines 3–13 generates, for each state di ∈ {d1,d2,…,dr} of the domain, the range of values within which the maximum value for the function Fj will always be found, and discards the rest. To do so, Line 4 of the algorithm initially finds the maximum value p for the state di of the function Fj (i.e. max_di(Fj(xj))). Then, Line 5 computes b, which is the summation of the corresponding values of p from the incoming messages of Fj. In the example of Fig. 5, the sorted local utility for B is {25,7,3,2}, from which we get b = 81 + 40 = 121 for the maximum value p = 25. Afterwards, Line 6 gets the base case t, which is the subtraction of b from m (i.e. t = m − b = 127 − 121 = 6).

Figure 5. Worked example of domain pruning during the intermediate step of PMP. In this example, red and blue colours are used to distinguish the domain states R and B while performing the domain pruning.

At this point, a value s, which is less than p, is picked from the sorted list of that state di (Line 8). Here, the function getVal(j) finds the value of s, with j representing the number of attempts. In the first attempt (i.e. j = 1), it will randomly pick a value of s from the range of the top log2|di| values of di, where |di| is the size of di. Finally, if the value of t is less than or equal to p − s, the desired maximum must be found within the range [p,s). Otherwise, we need to pick a smaller value of s from the next top log2|di| values, and repeat the check (Lines 9–12). In the worked example, we pick the value s = 3, which is among the top 2 (i.e. log2 4 = 2) values of B below p. Here, t is smaller than p − s, that is, 6 < (25 − 3), so it satisfies the condition of Line 9.
As a result, the maximum value for the state B will definitely be found within the range [25,3), that is, among the values {25,7}. Hence, it is not necessary to consider smaller values of s for this particular scenario. Eventually, introducing the domain pruning technique allows PMP to ignore these redundant operations during the intermediate step, thus reducing the computational cost in terms of completion time. Even for such a small example, this approach cuts the search space in half. Therefore, the overall completion time of the intermediate step can be reduced significantly by employing the domain pruning algorithm.

3.4. Comparative example

Having discussed each individual step of PMP separately, Fig. 6 illustrates a complete worked example that compares the performance of SMP and PMP in terms of completion time. In so doing, we use the same factor graph shown in Fig. 1. Additionally, the message computation and transmission costs for the nodes, based on which the completion time is generated, are given in the figure. In general, a function-to-variable message is computationally significantly more expensive to generate than a variable-to-function message, and a node with a higher degree requires more time to compute a message than a node with a lower degree [19, 25, 27]. In this example, the values were chosen to reflect this observation. For instance, the function node F1 with degree 3 requires 20 ms to compute a message for any of its neighbouring nodes, and F2 (with degree 2) requires 10 ms to compute a message. On the other hand, for the variable-to-function messages, if a variable has only one neighbouring node (e.g. x0, x2), it generally sends a predefined initial message to initiate the message passing process. Therefore, the time required to generate such a message is negligible. In contrast, we consider 2 ms as the time it takes to produce a variable-to-function message when the variable has degree 2 (e.g. x1, x4). Moreover, we consider 5 ms as the time it takes to transmit a message from one node to another in this example. Furthermore, each edge weight shown in parentheses represents the time required to compute and transmit a message from a node to its corresponding neighbouring node. For instance, the edge weight from F0 to x0 in SMP is 136–150 ms. This means that F0 starts computing a message for x0 135 ms after the message passing process is initiated, and the receiving node x0 receives the message after 150 ms.

Figure 6. Comparative example of SMP (top) and PMP (bottom), in terms of completion time, based on the factor graph shown in Fig. 1. In the figure, each edge weight shown in parentheses represents the time required to compute and transmit a message from a node to its corresponding neighbouring node. For instance, the edge weight from F0 to x0 in SMP is 136–150 ms. This means that F0 starts computing a message for x0 135 ms after the message passing process is initiated, and the receiving node x0 receives the message after 150 ms.
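The SMP completion time in this example can be cross-checked with a small scheduling sketch such as the one below, which uses the per-message costs just stated (20 ms or 10 ms per function message depending on degree, 2 ms per non-leaf variable message, 0 ms for leaf variables and 5 ms transmission); the code structure itself is an illustrative assumption. The prose that follows traces the same schedule by hand.

```python
# Sketch: computing the SMP completion time of the Fig. 6 example by scheduling each
# directed message as soon as its prerequisites have arrived.
from functools import lru_cache

EDGES = {'F0': ['x0', 'x1'], 'F1': ['x1', 'x2', 'x3'], 'F2': ['x3', 'x4'],
         'F3': ['x4', 'x5'], 'F4': ['x5', 'x6', 'x7'], 'F5': ['x7', 'x8']}
NEIGHBOURS = {}
for f, xs in EDGES.items():
    NEIGHBOURS.setdefault(f, set()).update(xs)
    for x in xs:
        NEIGHBOURS.setdefault(x, set()).add(f)

TRANSMIT = 5

def compute_cost(node):
    degree = len(NEIGHBOURS[node])
    if node.startswith('F'):
        return 20 if degree == 3 else 10      # factor messages: 20 ms (degree 3) or 10 ms
    return 2 if degree > 1 else 0             # variable messages: 2 ms; leaves send for free

@lru_cache(maxsize=None)
def arrival(sender, receiver):
    """Time at which the message sender -> receiver reaches the receiver. SMP rule: it may
    only be computed once all the sender's other incoming messages have arrived."""
    ready = max((arrival(w, sender) for w in NEIGHBOURS[sender] if w != receiver),
                default=0)
    return ready + compute_cost(sender) + TRANSMIT

completion = max(arrival(u, v) for u in NEIGHBOURS for v in NEIGHBOURS[u])
print(completion)   # 150 -- matching the SMP completion time of Fig. 6
```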
The total calculation of the completion time following the SMP protocol is depicted at the top of the figure. At the beginning, nodes x0, x2, x6 and x8 initiate the message passing, and their corresponding receiving nodes F0, F1, F4 and F5 receive messages after 5 ms. Then, x1 and x7 receive messages from F0 and F5, respectively, after 20 ms. Although F1 has already received a message from x2, it cannot generate a message as it requires at least two messages to produce one. In this process, the message passing will complete when all of the nodes have received messages from all of their neighbours. In this particular example, this is when x0 and x8 receive messages from F0 and F5, respectively, after 150 ms. Thus, the completion time of SMP is 150 ms. On the other hand, PMP splits the original factor graph into three clusters in this example. Each of the clusters executes the message passing in parallel, following the same rules as its SMP counterpart. To be precise, the first round of message passing is completed after 52 ms. Afterwards, the intermediate step to recover the ignored values for the split nodes is initiated. For cluster c1, x3 is the split node that ignored the message coming from F2 during the first round. Two synchronous operations, D_{F3→F2}(x4) and D_{F2→x3}(x3), are required to obtain the desired value for the split node x3. Each of these operations is as expensive as the corresponding function-to-variable message. However, Algorithm 5 can be used to reduce the cost of these operations, and we consider a reduction of 40%, since this is the minimum reduction we obtain from the empirical evaluation (see Section 4). In this process, the intermediate step of clusters c1 and c3 is completed after 64 ms. Unlike those two clusters, cluster c2 shares its split node x3 (respectively, x5) with a cluster c1 (respectively, c3) that has no other cluster to depend on apart from c2. Therefore, the ignored values for x3 and x5 are ready immediately after the completion of the first round (see Algorithm 4). As a result, cluster c2 can start its second round after 52 ms. In any case, the second round utilizes the recovered ignored values as the initial values for the split nodes to produce the same outcome as the SMP counterpart. We can observe that the second round of message passing completes after 116 ms. Thus, even for such a small factor graph of six function nodes and nine variable nodes, we can save around 23% of the completion time by replacing SMP with PMP.

4. EMPIRICAL EVALUATION

Given the detailed description in the previous section, we now evaluate the performance of PMP to show how effective it is in terms of completion time compared with the benchmarking algorithms that follow the SMP protocol. To maintain the distributed nature, all the experiments were performed on a simulator in which we generated different instances of factor graph representations of DCOPs with varying numbers of function nodes (100–10,000). Hence, the completion time that is reported in this section is a simulated distributed metric, and the factor graphs are generated by randomly connecting a number of variable nodes per function drawn from the range 2–7. Although we use these ranges to generate the factor graphs for this paper, the results are comparable for larger settings.
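The paper does not spell out the exact generator, so the following sketch is only one plausible reading of the construction just described: each factor is connected to between 2 and 7 variables, with one variable shared with the existing graph so that the result stays acyclic. All names are illustrative.

```python
# One plausible reading of the factor-graph generator described above (an assumption):
# each of the n_functions factors gets a random arity in 2-7; reusing exactly one
# existing variable per factor keeps the generated factor graph connected and acyclic.
import random

def random_factor_graph(n_functions, arity_range=(2, 7), seed=0):
    rng = random.Random(seed)
    edges = {}                      # factor id -> list of variable ids in its scope
    next_var = 0
    for f in range(n_functions):
        arity = rng.randint(*arity_range)
        scope = [rng.randrange(next_var)] if next_var else []
        while len(scope) < arity:   # the remaining variables are fresh leaves
            scope.append(next_var)
            next_var += 1
        edges[f'F{f}'] = [f'x{v}' for v in scope]
    return edges

fg = random_factor_graph(100)
print(len(fg), sum(len(s) for s in fg.values()))   # number of factors and their total arity
```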
Now, to evaluate the performance of PMP on different numbers of clusters, we report the results for 2–99 clusters for the factor graphs of 100–900 function nodes, and for 2–149 clusters for the rest. These ranges were chosen because the best performances are invariably found within them, and the performance steadily gets worse for larger numbers of clusters. As both SMP and PMP are generic protocols that can be applied to any GDL-based DCOP formulation, we run our experiments on the generic factor graph representation, rather than an application-specific setting. To be exact, we focus on evaluating the performance of PMP compared with the SMP protocol in terms of the completion time of the message passing process, rather than concentrating on the overall algorithm of a DCOP solution approach. Notably, the completion time of such algorithms mainly depends on the following three parameters:

Average time to compute a potentially expensive function-to-variable message within a factor graph, denoted as Tp1.

Average time to compute an inexpensive variable-to-function message within a factor graph, denoted as Tp2.

Average time to transmit a message between nodes of a factor graph, denoted as Tcm.

In the MAS literature, a number of extensions of the Max-Sum/BMS algorithms have been developed. Significantly, each of them can be characterized by a different ratio of the above mentioned parameters. For example, the value of Tp1/Tp2 is close to 1 for algorithms such as FMS, BFMS or BnB-FMS, because they restrict the domain sizes of the variables to 2 [2, 32]. In contrast, in a DCOP setting with a large domain size, the value of Tp1/Tp2 is much higher for a particular application of the Max-Sum or the BMS algorithm [24, 22, 25]. Additionally, the communication cost, or the average message transmission cost (Tcm), can vary for different reasons, such as environmental hazards in disaster response or climate monitoring application domains [33, 34]. To reflect all these issues in evaluating the performance of PMP, we consider different ratios of those parameters to show the effectiveness of PMP over its SMP counterpart in a wide range of conceivable settings. To be exact, we run our experiments on seven different settings, each of which fixes a particular ratio of the parameters Tp1, Tp2 and Tcm. Note that, once the values of each of the parameters have been fixed for a particular setting, the outcome remains unchanged for both SMP and the different versions of PMP even if we repeat the experiments for that setting. This is because we run both protocols on the acyclic or transformed acyclic version of a factor graph, so they always provide a deterministic outcome. Hence, there is no need to perform an analysis of statistical significance for this set of experiments. Note that all of the following experiments are performed on a simulator implemented on an Intel i7 quad-core 3.4 GHz machine with 16 GB of RAM.

4.1. Experiment E1: Tp1 > Tp2 AND Tp1 ≈ Tcm

Figures 7(a) and (b) illustrate the comparative completion times of SMP and PMP under experimental setting E1 for factor graphs with 100–900 and 3000–10,000 function nodes, respectively. Each line of the figures shows the result of both SMP (Number of Clusters = 1) and PMP (Number of Clusters > 1).
Setting E1 characterizes a scenario where the average computation cost (time) of a function-to-variable message (Tp1) is moderately more expensive than that of a variable-to-function message (Tp2), and the average time to transmit a message between nodes (Tcm) is approximately equal to Tp1. To be precise, we consider Tp2 to be 100 times less expensive than a randomly chosen Tp1 for this particular experiment. Scenario E1 is commonly seen in the following GDL-based DCOP algorithms: Max-Sum, BMS and FMS. Once these three parameters have been determined, the completion time of SMP (i.e. Tsmp) and PMP (i.e. Tpmp) can be computed using Equations (19) and (20), respectively. Here, the function requiredTime() takes an acyclic factor graph together with Tp1, Tp2 and Tcm as input, and computes the time needed to finish the message passing by following the rules of SMP.

$T_{smp} = \mathrm{requiredTime}(FG, T_{p1}, T_{p2}, T_{cm})$   (19)

$T_{pmp} = 2 \times \mathrm{requiredTime}(c_{largest}, T_{p1}, T_{p2}, T_{cm}) + T_{intm}$   (20)

Figure 7. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E1 (Tp1>Tp2 AND Tp1≈Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

As discussed in Section 3, due to the parallel execution on each cluster, for PMP we only need to consider the largest cluster of FG (i.e. clargest) instead of the complete factor graph FG. Altogether, the completion time of PMP comprises the time required to complete the two rounds of message passing on the largest cluster plus the time it takes to complete the intermediate step (Tintm). In the intermediate step, each synchronous operation is as expensive as a function-to-variable message (Tp1). However, during the intermediate step, the proposed domain pruning technique of PMP (i.e. Algorithm 5) reduces the cost of Tp1 by shrinking the domain (i.e. the search space) over which the maximization needs to be computed. To empirically evaluate the performance of the domain pruning technique, we independently test it on randomly generated local utility tables with domain sizes varying from 2 to 20. In general, we observe a significant reduction of the search space, ranging from 40% to 75%, and, as expected, the reduction improves as the domain size increases (see Section 3.3). Hence, to reflect the worst case, we assume only a 40% reduction for each operation of the intermediate step while computing the completion time of PMP for all the results reported in this paper.

According to Fig. 7(a), the best performance of PMP compared with the SMP protocol is found when the number of clusters is picked from the range {5–25}. For smaller factor graphs this range becomes narrower. For example, for a factor graph of 100 function nodes the best results are found within the range of {5–18} clusters; afterwards, the performance of PMP gradually decreases. This is because the time required to complete the intermediate step increases steadily as the cluster size gets smaller (i.e. as the number of clusters gets larger).
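As requiredTime() is described only informally, the sketch below shows one way Equations (19) and (20) could be evaluated. It assumes a dependency-driven timing model in which a node sends its message towards a neighbour as soon as it has heard from all of its other neighbours; the helper pmp_time(), its n_intermediate_ops argument and the fixed 0.6 factor (the worst-case 40% domain-pruning reduction mentioned above) are our own simplifications, not the paper's implementation.

```python
from collections import defaultdict
from functools import lru_cache

def required_time(edges, tp1, tp2, tcm):
    """Sketch of requiredTime() in Equation (19) for an acyclic factor graph.

    edges: iterable of (function_node, variable_node) pairs.
    Assumption: a node sends its message towards a neighbour as soon as it
    has heard from all of its other neighbours (leaf messages start at t=0).
    Returns the time at which every node has heard from all its neighbours.
    """
    nbrs, functions = defaultdict(set), set()
    for f, x in edges:
        nbrs[f].add(x)
        nbrs[x].add(f)
        functions.add(f)
    nbrs = dict(nbrs)

    @lru_cache(maxsize=None)
    def arrival(sender, receiver):
        # time at which the message sender -> receiver arrives at receiver
        ready = max((arrival(w, sender) for w in nbrs[sender] if w != receiver),
                    default=0.0)
        compute = tp1 if sender in functions else tp2
        return ready + compute + tcm

    return max(arrival(u, v) for u in nbrs for v in nbrs[u])

def pmp_time(largest_cluster_edges, tp1, tp2, tcm, n_intermediate_ops,
             pruning_factor=0.6):
    # Equation (20): two rounds over the largest cluster plus the intermediate
    # step; each intermediate operation costs roughly one function-to-variable
    # message, reduced here by the 40% worst-case pruning (hence 0.6).
    t_round = required_time(largest_cluster_edges, tp1, tp2, tcm)
    return 2 * t_round + n_intermediate_ops * pruning_factor * tp1

# e.g. an E1-style configuration: Tp2 a hundred times cheaper than Tp1,
# Tcm of the same order as Tp1 (illustrative values only)
tp1, tp2, tcm = 10.0, 0.1, 10.0
cluster = [("F0", "x0"), ("F0", "x1"), ("F1", "x1"), ("F1", "x2")]
print(required_time(cluster, tp1, tp2, tcm))                   # Eq. (19) on one graph
print(pmp_time(cluster, tp1, tp2, tcm, n_intermediate_ops=2))  # Eq. (20); pays off only
                                                               # when c_largest << FG
```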
On the other hand, the time it takes to complete the two rounds of message passing increases when the cluster size becomes larger. As a consequence, the results show that the performance of PMP drops steadily as the number of clusters increases beyond the point at which it peaks. Generally, we observe a similar trend in every scenario. Therefore, a proper balance is necessary to obtain the best possible performance from PMP (see Section 5). Notably, for the larger factor graphs, the comparative performance gain of PMP is more substantial in terms of completion time as a consequence of parallelism. As observed, PMP running over a factor graph with 100–300 function nodes achieves around a 53–59% performance gain over its SMP counterpart (Fig. 7(a)). Moreover, PMP takes 61–63% less time than SMP when larger factor graphs (600–900 functions) are considered. Finally, Fig. 7(b) shows that this performance gain reaches around 61–65% for factor graphs with 3000–10 000 function nodes. Here, this performance gain of PMP is achieved when the number of clusters is chosen from the range {25–44}.

4.2. Experiment E2: Tp1≫Tp2 AND Tp1≫Tcm

In experimental setting E2, we generated the results based on the same comparative measures and representations as setting E1 (Fig. 8). However, E2 characterizes the scenario where the average computation cost (time) of a function-to-variable message (Tp1) is extremely expensive compared with a variable-to-function message (Tp2), and the average time to transmit a message between nodes (Tcm) is considerably less expensive than Tp1. To be exact, we consider Tp2 to be 10 000 times less expensive than a randomly chosen Tp1 for this particular setting. Here, Tp1 is considered 200 times more time consuming than Tcm. Max-Sum and BMS are two exemplary GDL-based algorithms where E2 is commonly seen. More specifically, this setting reflects applications that contain variables with large domain sizes. For example, assume the domain size is 15 for all five variables involved in a function. In this case, to generate each function-to-variable message, the corresponding agent needs to perform $15^5$, or 759 375, operations (see the short sketch following this paragraph). Since Tp1 is extremely expensive in this experimental setting, the performance of PMP largely depends on the performance of the domain pruning technique. As in the previous experiment, Fig. 8(a) shows the results for factor graphs with 100–900 function nodes, and the results obtained on larger factor graphs (3000–10 000 function nodes) are shown in Fig. 8(b). This time, the best performance of PMP for those two cases is observed when the number of clusters is picked from the ranges {15–41} and {45–55}, respectively. Afterwards, the performance of PMP drops gradually for the same reason as in E1. Notably, the performance gain reaches around 37–42% for the factor graphs with 100–900 function nodes, and 41–43% for 3000–10 000 function nodes.
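To make the size of this maximization concrete, here is a minimal sketch of the brute-force function-to-variable computation that makes Tp1 expensive. It is illustrative only: the incoming variable-to-function messages that Max-Sum would normally add before maximizing are omitted, and the toy utility function merely stands in for a local utility table.

```python
import itertools

def function_to_variable_message(utility, k, d, target=0):
    """Brute-force message from a factor over k variables (domain size d) to
    the variable at position `target`: for every value of that variable,
    maximise the utility over the remaining k-1 variables."""
    msg = {}
    for v in range(d):
        best = float("-inf")
        for rest in itertools.product(range(d), repeat=k - 1):
            joint = rest[:target] + (v,) + rest[target:]
            best = max(best, utility(joint))
        msg[v] = best
    return msg

# toy utility standing in for a randomly generated local utility table
utility = lambda joint: -sum((i + 1) * x for i, x in enumerate(joint))

d, k = 15, 5
msg = function_to_variable_message(utility, k, d)
print(d ** k, len(msg))   # 759375 utility evaluations, one maximum per domain value
```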
Figure 8. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E2 (Tp1≫Tp2 AND Tp1≫Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.3. Experiment E3: Tp1≫Tp2 AND Tp1>Tcm

Experimental setting E3 possesses similar properties and scenarios to E2, apart from the fact that here Tp1 is moderately more expensive than Tcm instead of extremely more expensive. As in the previous experiment, we consider Tp2 to be 10 000 times less expensive than a randomly chosen Tp1. However, Tcm is taken to be only 10 times less expensive than Tp1. The results show that, even without the domain pruning technique, PMP reduces the cost attributable to Tcm and Tp2 significantly in E3. This is because Tcm is not too inexpensive, and the operations of the intermediate step do not incur any communication cost. Moreover, given that Tp1 is also very expensive, PMP produces better performance than observed in E2 by utilizing the domain pruning technique. Altogether, Figs. 9(a) and (b) show that PMP consumes 45–49% less time than SMP for this setting when the number of clusters is chosen from the range {17–47}. Max-Sum and BMS are the exemplary algorithms where settings similar to E3 are commonly seen.

Figure 9. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E3 (Tp1≫Tp2 AND Tp1>Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.4. Experiment E4: Tp1≫Tp2 AND Tp1≈Tcm

Figure 10 shows the comparative results of PMP over SMP for experimental setting E4. E4 characterizes the scenarios where Tp1 is extremely more expensive than Tp2, and approximately equal to Tcm. To be exact, we consider Tp2 to be approximately 5000 times less expensive than the randomly chosen values of Tp1 and Tcm. Here, both Tp1 and Tcm are substantial, and hence PMP achieves more notable performance gains over SMP than in the previous experiments. According to Figs. 10(a) and (b), PMP takes 59–73% less time than its SMP counterpart. The preferred range of the number of clusters for setting E4 is {15–55}.

Figure 10. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E4 (Tp1≫Tp2 AND Tp1≈Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.5. Experiment E5: Tp1≈Tp2 AND Tp1≈Tcm

Experiment E5 characterizes the scenarios where Tp1 and Tcm are approximately equal to the inexpensive Tp2.
Such a scenario normally occurs when message passing (SMP or PMP) is applied to the following algorithms: FMS, BnB Max-Sum [33], Generalized Fast Belief Propagation [26] or Max-Sum/BMS with small domain sizes and inexpensive communication cost in terms of time. This is a trivial setting in which none of the three parameters is particularly expensive. Specifically, as Tp1 is inexpensive, the domain pruning technique has less impact on reducing the completion time of PMP. However, the effect of parallelism from the clustering process, coupled with the avoidance of redundant variable-to-function messages during the intermediate step, allows PMP to take 55–67% less time than its SMP counterpart (Figs. 11(a) and (b)). The preferred range of the number of clusters is the same as for setting E4.

Figure 11. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E5 (Tp1≈Tp2 AND Tp1≈Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.6. Experiment E6: Tp1≈Tp2 AND Tp1≪Tcm

Figure 12 illustrates the comparative results of PMP over SMP for experimental setting E6, which possesses similar properties, scenarios and applied algorithms to E5. However, in E6, the average message transmission cost Tcm is considerably more expensive than Tp1 and Tp2. To be exact, we consider Tp1 to be 15 times less expensive than a randomly chosen value of Tcm. As Tcm is markedly more expensive and Tp2 is approximately equal to Tp1, the performance gain of PMP increases to its highest level (70–91%). To be precise, the reduction in communication obtained by avoiding the variable-to-function messages during the intermediate step, which are extremely expensive in this setting, helps PMP achieve this performance. This result signifies that PMP performs best in settings where the communication cost is expensive. Note that the preferred range of the number of clusters for setting E6 is {15–61}.

Figure 12. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E6 (Tp1≈Tp2 AND Tp1≪Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.7. Experiment E7: Tp1≈Tp2 AND Tp1<Tcm

Experiment E7 possesses similar properties, scenarios and exemplary algorithms to setting E6, with the following exception: Tcm in E7 is moderately more expensive than Tp1 instead of considerably more expensive. To be precise, we consider Tp1 to be four times less expensive than the randomly chosen values of Tcm. Owing to the less substantial value of Tcm, PMP consumes 65–82% less time than its SMP counterpart (Figs. 13(a) and (b)), rather than reaching the maximum performance gain observed in E6.
The preferred range of the number of clusters for E7 is {17–50}.

Figure 13. Completion time: Standard Message Passing (Number of Clusters = 1); Parallel Message Passing (Number of Clusters > 1) for the experimental setting E7 (Tp1≈Tp2 AND Tp1<Tcm). (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

4.8. Total number of messages

The most important finding to emerge from these experiments is that PMP significantly reduces the completion time of GDL-based message passing algorithms in all the settings. However, PMP requires more messages to be exchanged than its SMP counterpart, due to its two rounds of message passing. To explore this trade-off, Fig. 14 illustrates the comparative results of PMP and SMP in terms of the total number of messages for factor graphs with 50–1200 function nodes and an average of five variables connected to each function node. The results are comparable for settings with higher arities. Specifically, we find that PMP needs 27–45% more messages than SMP for factor graphs with fewer than 500 function nodes, and 15–25% more messages for factor graphs with more than 500 nodes. Since PMP exchanges many of these messages at the same time owing to its parallel execution, this does not affect the performance gain in terms of completion time.

Figure 14. Total number of messages: SMP vs. PMP.

Now, based on these extensive empirical results, we can claim that, in PMP, even randomly splitting a factor graph into a number of clusters within the range of around 10–50 always produces a significant reduction in the completion time of GDL-based DCOP algorithms. However, this performance gain is neither guaranteed to be optimal, nor deterministic for a given DCOP setting. Therefore, we need an approach to predict how many clusters would produce the best performance from PMP for a given scenario. At this point, we only have a range from which to pick the number of clusters for a certain factor graph representation of a DCOP.

5. APPROXIMATING THE APPROPRIATE NUMBER OF CLUSTERS FOR A DCOP

In this section, we turn to the challenge of determining the appropriate number of clusters for a given scenario in PMP. The ability to predict a specific number in this regard would allow PMP to split the original factor graph representation of a DCOP into an appropriate number of clusters prior to executing the message passing. In other words, this information allows PMP to be applied more precisely to different multi-agent DCOPs. However, it is not possible to predict the optimal number of clusters, owing to the diverse nature of the application domains and the fact that a graphical representation of a DCOP can be altered at runtime. Therefore, we use an approximation. To be precise, we use a linear regression method, run off-line, to approximate a specific number of clusters for a DCOP before initiating the message passing of PMP.
In this context, logistic regression, Poisson regression and a number of classification models could be used to predict information from a given finite data set. However, they are more suited to estimating categorical information than to predicting the specific numerical values required by our model. Therefore, we choose the linear regression method for our setting. Moreover, this method is time efficient in terms of computational cost because, as input, it only requires an approximate number of function nodes of the corresponding factor graph representation of a DCOP in advance.

The remainder of this section is organized as follows. In Section 5.1, we explain the linear regression method and detail how it can be used along with the PMP protocol to predict the number of clusters for a specific problem instance. Then, Section 5.2 presents our empirical results of using this method on the different experimental settings (i.e. E1, E2, …, E7) defined and used in the previous section. Specifically, we show the difference in performance between PMP using the prediction method and its best possible results in terms of completion time. Notably, PMP's performance gain for each value within the preferred range of the number of clusters is shown in the graphs of the previous section. Here, we run a similar experiment to obtain the best possible performance gain for a certain problem instance, and then compare this with the gain obtained by using the predicted number of clusters. Finally, we end this section by evaluating the performance of PMP, as opposed to SMP, on two explicit implementations of GDL-based algorithms.

5.1. Determining the appropriate number of clusters

Regression analysis is one of the most widely used approaches for numeric prediction [35, 36]. A regression method can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable that is continuous valued. Many problems can be solved by linear regression, and even more can be handled by applying transformations to the variables so that a non-linear problem is converted to a linear one. Specifically, linear regression with a single predictor variable is known as straight-line linear regression, meaning it involves only a response variable Y and a single predictor variable X. Here, the response variable Y is modelled as a linear function of the predictor variable X (Equation (21)).

$Y = W_0 + W_1 X$   (21)

$W_1 = \frac{\sum_{i=1}^{|D|} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{|D|} (X_i - \bar{X})^2}$   (22)

$W_0 = \bar{Y} - W_1 \bar{X}$   (23)

In Equation (21), the variance of Y is assumed to be constant, and W0 and W1 are regression coefficients that can be thought of as weights. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. Let D be the training set consisting of values of the predictor variable X and their associated values of the response variable Y. This training set contains |D| data points of the form $(X_1, Y_1), (X_2, Y_2), \ldots, (X_{|D|}, Y_{|D|})$. Equations (22) and (23) are used to generate the regression coefficients W1 and W0, respectively. Now, linear regression analysis can be used to predict the number of clusters for a certain application, given that continuously updated training data from the experimental results of PMP exists. To this end, Table 1 contains sample training data taken from the results shown in Section 4.
Here, we formulate this training data D so that straight-line linear regression can be applied, where D consists of values of the predictor variable X (number of function nodes) and their associated values of the response variable Y (number of clusters). In more detail, the training set contains |D| (number of nodes, number of clusters) pairs of the form $(X_1, Y_1), (X_2, Y_2), \ldots, (X_{|D|}, Y_{|D|})$. Initially, Equations (22) and (23) are used to generate the regression coefficients W1 and W0, respectively, which are then used to predict the appropriate number of clusters (response variable Y) for a factor graph with a certain number of function nodes (predictor variable X) using Equation (21). For instance, based on the training data of Table 1, we can predict that for factor graphs with 4500 and 9200 function nodes PMP should split the graphs into 43 and 51 clusters, respectively (Table 2). As we only need to deal with a single predictor variable, we use the terms linear regression and straight-line linear regression interchangeably. In the remainder of this section, we evaluate the performance of this extension through extensive empirical evidence.

Table 1. Sample training data from Figs. 7–13.

Number of nodes (X)   Number of clusters (Y)   Experimental setting
3000                  25                       E1
5000                  33                       E1
8000                  38                       E1
10 000                40                       E1
3000                  47                       E2
5000                  50                       E2
8000                  52                       E2
10 000                55                       E2
3000                  32                       E3
5000                  36                       E3
8000                  44                       E3
10 000                47                       E3
3000                  40                       E4
5000                  50                       E4
8000                  52                       E4
10 000                55                       E4
3000                  38                       E5
5000                  46                       E5
8000                  50                       E5
10 000                52                       E5
3000                  50                       E6
5000                  55                       E6
8000                  58                       E6
10 000                61                       E6
3000                  40                       E7
5000                  46                       E7
8000                  49                       E7
10 000                50                       E7

Table 2. Predicted number of clusters by applying the straight-line linear regression (Equations (21)–(23)) to the training data of Table 1.

Number of nodes (X)   Predicted number of clusters (Y)
3050                  41
4500                  43
5075                  44
6800                  47
7500                  48
8020                  49
8050                  49
9200                  51
9975                  52
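As a concrete illustration, the sketch below fits Equations (22) and (23) to the Table 1 pairs and then applies Equation (21). One assumption is ours: the pairs from all seven settings are pooled into a single training set, since the paper does not state how they are combined. Under that assumption, the rounded predictions for 4500 and 9200 function nodes come out at 43 and 51 clusters, in line with Table 2.

```python
# Straight-line linear regression (Equations 21-23) fitted to the Table 1
# pairs (number of function nodes X, best-performing number of clusters Y).
def fit_line(points):
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in points) /
          sum((x - x_bar) ** 2 for x, _ in points))            # Equation (22)
    w0 = y_bar - w1 * x_bar                                    # Equation (23)
    return w0, w1

training = [
    (3000, 25), (5000, 33), (8000, 38), (10000, 40),   # E1
    (3000, 47), (5000, 50), (8000, 52), (10000, 55),   # E2
    (3000, 32), (5000, 36), (8000, 44), (10000, 47),   # E3
    (3000, 40), (5000, 50), (8000, 52), (10000, 55),   # E4
    (3000, 38), (5000, 46), (8000, 50), (10000, 52),   # E5
    (3000, 50), (5000, 55), (8000, 58), (10000, 61),   # E6
    (3000, 40), (5000, 46), (8000, 49), (10000, 50),   # E7
]
w0, w1 = fit_line(training)

for nodes in (4500, 9200):
    # Equation (21): predicted number of clusters, rounded to an integer
    print(nodes, round(w0 + w1 * nodes))   # 43 and 51 under this pooling assumption
```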
5.2. Empirical evaluation

In this section, we evaluate, in terms of completion time, the performance of PMP when the number of clusters is predicted using the linear regression method, and compare this with the highest possible performance gain from PMP, that is, the best-case outcome of PMP for a certain number of clusters. In so doing, we use the same experimental settings (E1, E2, …, E7) as in Section 4. Specifically, Table 3 illustrates the comparative performance gain of PMP using the straight-line linear regression and the highest possible gain for five factor graphs with 3050, 5075, 6800, 8050 and 9975 function nodes, based on the experimental settings E1 and E6. We repeat the experiments of Section 4 for each of these factor graphs to obtain the highest possible performance gain from PMP. That is, we record the performance of PMP for every number of clusters from 2 to 150. From these results, we obtain both the highest possible gain and the performance based on the predicted number of clusters for PMP. It can be seen that for the factor graph with 3050 function nodes the highest possible gain of PMP reaches 60.69%, meaning PMP takes 60.69% less time than its SMP counterpart to complete the message passing required to solve the DCOP represented by the factor graph. Now, if PMP is applied with the predicted number of clusters (i.e. 41) obtained from the straight-line linear regression (Table 2), the gain reaches 60.21%. This indicates that PMP attains 98.7% of its possible performance gain by applying the straight-line linear regression. Similarly, PMP attains 99.64% of the possible performance gain for a factor graph with 9975 function nodes in experimental setting E6 when applied with the number of clusters obtained from the linear regression method. Notably, this trend is common for the rest of the factor graphs: in each case, more than 98.42% of the best possible result is attained by applying the straight-line linear regression, according to the results shown in Table 3, and the results are comparable for all the other experimental settings. Significantly, it can be ascertained from our experiments that a minimum of around 98.5% of the best possible result of PMP can be achieved if the number of clusters is predicted by the straight-line linear regression method. Notably, a common phenomenon observed in the empirical evaluation of Section 4 is that the performance of PMP falls very slowly after reaching its peak at a certain number of clusters, whether that number is then increased or decreased. This is why approximating the number of clusters produces such good results.
Table 3. Performance gain of PMP using the linear regression method compared to the highest possible gain from PMP.

Number of function nodes   E1: best possible gain from PMP (%)   E1: gain using linear regression (%)   E6: best possible gain from PMP (%)   E6: gain using linear regression (%)
3050                       60.69                                 60.21                                  88.82                                 88.12
5075                       63.08                                 62.50                                  89.45                                 89.10
6800                       63.95                                 59.62                                  89.80                                 89.35
8050                       64.82                                 59.62                                  90.17                                 89.80
9975                       64.83                                 64.18                                  90.25                                 89.93

6. SUPPLEMENTARY EMPIRICAL RESULTS

In the final experiment, we analyse the performance of PMP (compared with SMP) on two GDL-based algorithms: Max-Sum and Fast Max-Sum. We do this to observe the performance of PMP on an actual runtime metric, which complements the controlled and systematic experiments of Section 4. In this experiment, we use the number of clusters predicted by the linear regression method. Here, the factor graphs are generated in the same way as for the experiments of Section 4. Additionally, we make use of the FRODO framework [37] to generate the local utility tables (i.e. cost functions) for the function nodes of the factor graphs. On the one hand, we use two ranges of the variables' domain size, 5–8 and 12–15, to generate the utility tables for Max-Sum. In doing so, we are able to observe the comparative results for different ratios of the parameters Tp1 and Tp2; to be precise, these two ranges reflect the scenarios Tp1>Tp2 and Tp1≫Tp2, respectively. On the other hand, we restrict the domain size to exactly 2 for all the variables in the case of Fast Max-Sum, so that it reflects the characteristic of that algorithm (i.e. Tp1≈Tp2). Notably, it is not possible to emulate a realistic application, such as disaster response or climate monitoring, in a simulated environment that provides the actual value of Tcm [38]. Consequently, the value of Tcm is very small in this experiment.

It can be seen from the solid-gray line of Fig. 15(a) that PMP takes around 55–60% less time than SMP to complete the message passing process for Fast Max-Sum on factor graphs with 100–900 function nodes. Meanwhile, PMP reduces SMP's completion time by 35–42% for Max-Sum when the variables' domain size is picked from the range 5–8 (dashed-black line). This is because, in the former case, all three parameters (i.e. Tp1, Tp2 and Tcm) are small and comparable.
Therefore, the parallel execution of message passing, along with the avoidance of variable-to-function messages in the intermediate step, allows PMP to attain this performance in this case. In contrast, its performance in the latter case mainly depends on the impact of domain reduction in the intermediate step, given that the values of Tp2 and Tcm are negligible compared with Tp1. The same holds true for Max-Sum with a larger domain size, where we observe a 67–72% reduction in completion time by PMP, as opposed to its SMP counterpart (dashed-gray line). Moreover, the observed outcome in this case indicates that the impact of domain reduction in the intermediate step improves with an increase in domain size. Figure 15(b) illustrates a similar trend in the performance of PMP when we take larger factor graphs of 3000 to around 10 000 function nodes into consideration. Here, we observe even better performance in each of the cases, due to the impact of parallelism in larger settings.

Figure 15. Empirical performance of PMP vs. SMP running on two GDL-based algorithms. Error bars are calculated using the standard error of the mean. (a) Number of function nodes (factors): 100–900 and (b) number of function nodes (factors): 3000–10 000.

7. RELATED WORK

DCOP algorithms often find a solution in a completely decentralized way. However, centralizing part of the problem can reduce the effort required to find globally optimal solutions. Although PMP is based on the GDL framework, which provides DCOP solutions in a decentralized manner, a representative agent from each cluster (i.e. the cluster head) takes responsibility for carrying out a number of synchronous operations during its intermediate step.
In other words, PMP uses only the cluster heads to complete the operations of the intermediate step, instead of using all the cooperating agents in the system. This makes PMP a partially centralized approach. In the multi-agent systems literature, a number of approaches utilize the computational power of comparatively more powerful agents to find and solve the hard portions of a DCOP. In particular, Optimal Asynchronous Partial Overlay (OptAPO) is an exact algorithm that discovers the complex parts of a problem through trial and error, and centralizes these sub-problems in mediating agent(s) [12]. The message complexity of OptAPO is significantly smaller than that of the benchmark exact algorithm ADOPT. However, in order to guarantee that an optimal solution has been found, one or more agents may end up centralizing the entire problem, depending on the difficulty of the problem and the tightness of the interdependence between the variables. As a consequence, it is impossible to predict where and what portion of the problem will eventually be centralized, or how much computation the mediators will have to perform. Furthermore, it is possible that several mediators needlessly duplicate effort by solving overlapping problems. To address these issues, [11] introduces a partial centralization technique (PC-DPOP) based on another benchmark exact algorithm, DPOP. Unlike OptAPO, it also offers an exact prior prediction of its communication, computation and memory requirements. However, although PC-DPOP provides better control than OptAPO over which parts of the problem are centralized, this control cannot be guaranteed to be optimal. Our approach originates from the non-exact GDL framework, which is suitable for larger settings. Moreover, our approach is mostly decentralized, and the specific part of the algorithm (i.e. the intermediate step) which needs to be carried out by the cluster heads is known in advance. Therefore, no effort is required to find these parts of a DCOP. As a consequence, there is no ambiguity in deciding which part of a problem should be handled by which agent, nor is there any possibility of duplicating effort in solving overlapping problems during this step of PMP.

As far as splitting the graphical representation is concerned, [18] proposes the use of a divide-and-coordinate approach in order to provide quality guarantees for non-exact DCOP algorithms. Nevertheless, that approach is neither targeted at reducing the completion time of DCOP algorithms, nor specifically based on the GDL framework. Additionally, [28, 39] utilize greedy methods to maximize the 'residual' at each iteration of the classical belief propagation method, so that the overall convergence property (i.e. solution quality) can be improved. By contrast, in this paper we do not aim to improve the solution quality or the convergence property. Rather, our objective is to minimize the completion time of GDL-based DCOP algorithms while maintaining the same solution quality. Over the past few years, a number of efforts have been made to improve the scalability of GDL-based message passing algorithms. They are mainly built upon the Max-Sum or BMS algorithm, and focus on reducing the cost of the maximization operator of those algorithms. However, most of them limit the general applicability of such algorithms.
For instance, FMS, BFMS and BnB-FMS can only be applied to a specific problem formulation of the task allocation domain, and [40] proposes an explorative version of Max-Sum to solve a modified DCOP formulation specifically designed for mobile sensor agents [41]. Other approaches rely on a pre-processing step, thus forgoing the opportunity to obtain local utilities at runtime [26, 33, 42]. Despite these criticisms, these extensions perform well in certain DCOP settings. Moreover, given that all of the extensions of Max-Sum and BMS follow the SMP protocol (or its asynchronous version), PMP can easily be applied to those algorithms in place of SMP.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a generic framework that significantly reduces the completion time of GDL-based message passing algorithms while maintaining the same solution quality. To be precise, our approach is applicable to all GDL-based algorithms which use a factor graph as the graphical representation of a DCOP. In particular, we provide a significant reduction in completion time for such algorithms, ranging from 37% to 91% depending on the scenario. To achieve this performance, we introduced a cluster-based method to parallelize the message passing procedure. Additionally, a domain reduction algorithm is proposed to further minimize the cost of the expensive maximization operation. Afterwards, we addressed the challenge of determining the appropriate number of clusters for a given scenario. In so doing, we propose the use of a linear regression prediction method for approximating the appropriate number of clusters for a DCOP. Remarkably, through the empirical results, we observe that more than 98% of the best possible outcome can be achieved if PMP is applied with the number of clusters predicted by the straight-line linear regression. This means that, if we know the size of a factor graph representation of a DCOP prior to performing the message passing, we can utilize the straight-line linear regression method to find out how many clusters should be created from that factor graph. Thus, we make PMP a deterministic approach. Given this, by using the PMP approach, we can now indeed use GDL-based algorithms to efficiently solve larger DCOPs. Notably, as with DCOP algorithms based on the Standard Message Passing protocol, PMP-based algorithms always find the optimal solution for acyclic factor graphs and bounded approximate solutions for cyclic factor graphs.

The sacrifice in solution quality for cyclic graphical representations of DCOPs still limits the applicability of GDL-based approaches. Moreover, when a factor graph contains many cycles, the complexity of obtaining a transformed acyclic factor graph from it increases significantly [22], and at the same time it is challenging to maintain acceptable solution quality. In future work, we intend to investigate whether the clustering process of PMP can be utilized to obtain better solution quality for cyclic factor graphs, so as to further extend the applicability of GDL-based approaches. We also intend to investigate the influence of a good cluster-head selection strategy on PMP. Moreover, partially centralized approaches often trade privacy for higher scalability; in the future, we also intend to analyse PMP in the context of DCOP privacy. Furthermore, as discussed in Section 2, the function and variable nodes of a factor graph are distributed among a number of agents.
To date, there is no approach that determines how many agents should participate in this process in a particular situation. We intend to develop a model that addresses this issue for a given scenario, so that we can avoid involving unnecessary agents in the message passing procedure, which in itself would reduce communication costs. As a result, such algorithms will be able to cope with even larger multi-agent settings.

References

1 Modi, P.J., Shen, W., Tambe, M. and Yokoo, M. (2005) Adopt: asynchronous distributed constraint optimization with quality guarantees. Artif. Intell., 161, 149–180.
2 Ramchurn, S.D., Farinelli, A., Macarthur, K.S. and Jennings, N.R. (2010) Decentralized coordination in RoboCup rescue. Comput. J., 53, 1447–1461.
3 Zivan, R., Yedidsion, H., Okamoto, S., Glinton, R. and Sycara, K. (2014) Distributed constraint optimization for teams of mobile sensing agents. Auton. Agent Multi Agent Syst., 29, 495–536.
4 Farinelli, A., Rogers, A. and Jennings, N.R. (2014) Agent-based decentralised coordination for sensor networks using the max-sum algorithm. Auton. Agent Multi Agent Syst., 28, 337–380.
5 Junges, R. and Bazzan, A.L. (2008) Evaluating the Performance of DCOP Algorithms in a Real World, Dynamic Problem. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 599–606. IFAAMAS.
6 Maheswaran, R.T., Tambe, M., Bowring, E., Pearce, J.P. and Varakantham, P. (2004) Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-event Scheduling. Proc. 3rd Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, New York, USA, July 9–23, pp. 310–317. ACM.
7 Cerquides, J.B., Farinelli, A., Meseguer, P. and Ramchurn, S.D. (2013) A tutorial on optimization for multi-agent systems. Comput. J., 57, 799–824.
8 Dechter, R. (2003) Constraint Processing (1st edn). Morgan Kaufmann, Massachusetts, USA.
9 Yeoh, W., Felner, A. and Koenig, S. (2008) BnB-ADOPT: An Asynchronous Branch-and-Bound DCOP Algorithm. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 591–598. IFAAMAS.
10 Petcu, A. and Faltings, B. (2005) A Scalable Method for Multiagent Constraint Optimization. Proc. 19th Int. Joint Conf. Artificial Intelligence, Edinburgh, Scotland, July 30–August 5, pp. 266–271. AAAI Press.
11 Petcu, A., Faltings, B. and Mailler, R. (2007) PC-DPOP: A New Partial Centralization Algorithm for Distributed Optimization. Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, January 6–12, pp. 167–172. AAAI Press.
12 Mailler, R. and Lesser, V. (2004) Solving Distributed Constraint Optimization Problems using Cooperative Mediation. Proc. 3rd Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, New York, USA, July 9–23, pp. 438–445. ACM.
13 Fitzpatrick, S. and Meertens, L. (2003) Distributed Coordination through Anarchic Optimization. In Lesser, V., Ortiz, C.L. and Tambe, M. (eds.) Distributed Sensor Networks: A Multiagent Perspective. Springer, Boston, USA.
14 Maheswaran, R.T., Pearce, J.P. and Tambe, M. (2004) Distributed Algorithms for DCOP: A Graphical-Game-based Approach. Proc. ISCA 17th Int. Conf. Parallel and Distributed Computing Systems (ISCA PDCS), San Francisco, USA, September 15–17, pp. 432–439. ACTA Press.
15 Kiekintveld, C., Yin, Z., Kumar, A. and Tambe, M. (2010) Asynchronous Algorithms for Approximate Distributed Constraint Optimization with Quality Bounds. Proc. 9th Int. Conf. Autonomous Agents and Multiagent Systems, Toronto, Canada, May 10–14, pp. 133–140. IFAAMAS.
16 Pearce, J.P. and Tambe, M. (2007) Quality Guarantees on k-Optimal Solutions for Distributed Constraint Optimization Problems. Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, January 6–12, pp. 1446–1451. AAAI Press.
17 Bowring, E., Pearce, J.P., Portway, C., Jain, M. and Tambe, M. (2008) On k-Optimal Distributed Constraint Optimization Algorithms: New Bounds and Algorithms. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 607–614. IFAAMAS.
18 Vinyals, M., Pujol, M., Rodriguez-Aguilar, J. and Cerquides, J. (2010) Divide-and-Coordinate: DCOPs by Agreement. Proc. 9th Int. Conf. Autonomous Agents and Multi-Agent Systems, Toronto, Canada, May 10–14, pp. 149–156. IFAAMAS.
19 Farinelli, A., Vinyals, M., Rogers, A. and Jennings, N.R. (2013) Distributed Constraint Handling and Optimization. In Weiss, G. (ed.) Multiagent Systems. MIT Press, Cambridge, Massachusetts, USA.
20 Leite, A.R., Enembreck, F. and Barthès, J.A. (2014) Distributed constraint optimization problems: review and perspectives. Expert Syst. Appl., 41, 5139–5157.
21 Zivan, R. and Peled, H. (2012) Max/Min-Sum Distributed Constraint Optimization through Value Propagation on an Alternating DAG. Proc. 11th Int. Conf. Autonomous Agents and Multi-Agent Systems, Valencia, Spain, June 4–8, pp. 265–272. IFAAMAS.
22 Rogers, A., Farinelli, A., Stranders, R. and Jennings, N. (2011) Bounded approximate decentralised coordination via the max-sum algorithm. Artif. Intell., 175, 730–759.
23 Aji, S.M. and McEliece, R. (2000) The generalized distributive law. IEEE Trans. Inf. Theory, 46, 325–343.
24 Farinelli, A., Rogers, A., Petcu, A. and Jennings, N.R. (2008) Decentralised Coordination of Low-Power Embedded Devices using the Max-Sum Algorithm. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 12–16, pp. 639–646. IFAAMAS.
25 Lesser, V. and Corkill, D. (2014) Challenges for Multi-agent Coordination Theory Based on Empirical Observations. Proc. 13th Int. Conf. Autonomous Agents and Multi-Agent Systems, Paris, France, May 5–9, pp. 1157–1160. IFAAMAS.
26 Kim, Y. and Lesser, V. (2013) Improved Max-Sum Algorithm for DCOP with n-Ary Constraints. Proc. 12th Int. Conf. Autonomous Agents and Multi-Agent Systems, Minnesota, USA, May 6–10, pp. 191–198. IFAAMAS.
27 Kschischang, F.R., Frey, B.J. and Loeliger, H. (2001) Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory, 47, 498–519.
28 Elidan, G., McGraw, I. and Koller, D. (2006) Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing. Proc. 22nd Conf. Uncertainty in AI, Massachusetts, USA, July 13–16, pp. 200–208. AUAI.
29 Peri, O. and Meisels, A. (2013) Synchronizing for Performance - DCOP Algorithms. Proc. 5th Int. Conf. Agents and Artificial Intelligence, Barcelona, Spain, February 15–18, pp. 5–14. Springer.
30 Macarthur, K. (2011) Multi-agent Coordination for Dynamic Decentralised Task Allocation. PhD thesis, University of Southampton, Southampton, UK.
31 Fioretto, F., Pontelli, E. and Yeoh, W. (2016) Distributed constraint optimization problems and applications: a survey. CoRR, abs/1602.06347.
32 Macarthur, K.S., Stranders, R., Ramchurn, S.D. and Jennings, N.R. (2011) A Distributed Anytime Algorithm for Dynamic Task Allocation in Multi-Agent Systems: Fast-Max-Sum. Proc. 25th AAAI Conf. Artificial Intelligence, San Francisco, USA, May 12–16, pp. 701–706. AAAI Press.
33 Stranders, R., Farinelli, A., Rogers, A. and Jennings, N.R. (2009) Decentralised Coordination of Mobile Sensors using the Max-Sum Algorithm. Proc. 21st Int. Joint Conf. Artificial Intelligence, California, USA, July 13–17, pp. 299–304. AAAI Press.
34 Vinyals, M., Rodriguez-Aguilar, J.A. and Cerquides, J. (2011) A survey on sensor networks from a multiagent perspective. Comput. J., 54, 455–470.
35 Kutner, M.H., Nachtsheim, C. and Neter, J. (2004) Applied Linear Regression Models (4th edn). McGraw-Hill, Irwin.
36 Han, J., Pei, J. and Kamber, M. (2011) Data Mining: Concepts and Techniques (3rd edn). Elsevier, Amsterdam, Netherlands.
37 Léauté, T., Ottens, B. and Szymanek, R. (2009) FRODO 2.0: An Open-Source Framework for Distributed Constraint Optimization. Proc. IJCAI'09 Distributed Constraint Reasoning Workshop (DCR'09), Pasadena, California, USA, July 13, pp. 160–164. https://frodo-ai.tech.
38 Sultanik, E.A., Lass, R.N. and Regli, W.C. (2008) DCOPolis: A Framework for Simulating and Deploying Distributed Constraint Reasoning Algorithms. Proc. 7th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems: Demo Papers, Estoril, Portugal, May 12–16, pp. 1667–1668. IFAAMAS.
39 Knoll, C., Rath, M., Tschiatschek, S. and Pernkopf, F. (2015) Message Scheduling Methods for Belief Propagation. Proc. Joint Eur. Conf. Machine Learning and Knowledge Discovery in Databases, Porto, Portugal, September 7–11, pp. 295–310. Springer.
40 Yedidsion, H., Zivan, R. and Farinelli, A. (2014) Explorative Max-Sum for Teams of Mobile Sensing Agents. Proc. 13th Int. Conf. Autonomous Agents and Multi-Agent Systems, Paris, France, May 5–9, pp. 549–556. IFAAMAS.
41 Zivan, R., Glinton, R. and Sycara, K. (2009) Distributed Constraint Optimization for Large Teams of Mobile Sensing Agents. Proc. IEEE/WIC/ACM Int. Joint Conf. Web Intelligence and Intelligent Agent Technology, Milan, Italy, September 15–18, pp. 347–354. IEEE.
42 Zivan, R., Parash, T. and Naveh, Y. (2015) Applying Max-Sum to Asymmetric Distributed Constraint Optimization. Proc. 24th Int. Joint Conf. Artificial Intelligence, Buenos Aires, Argentina, July 25–August 1, pp. 432–438. AAAI Press.

Footnotes

1 Here, we consider both the computation and communication cost of an algorithm in terms of time.
2 BMS has been proposed as a generic approach that can be applied to all DCOP settings, while BFMS can only be applied to a specific formulation of the task allocation domain (see [30] for more detail of the formulation).
3 It can either be SMP or its asynchronous alternative; without loss of generality, we use SMP from now on (see Section 1).

Handling editor: Franco Zambonelli

© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
