A Survey on Fault Tolerant Multi Agent System

I.J. Information Technology and Computer Science, 2016, 9, 39-48
Published Online September 2016 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijitcs.2016.09.06
Copyright © 2016 MECS

Yasir Arfat
Department of Computer Science, King Abdul Aziz University, Jeddah, Saudi Arabia.
E-mail: yas [email protected] m

Fathy Elbouraey Eassa
Department of Computer Science, King Abdul Aziz University, Jeddah, Saudi Arabia.
E-mail: [email protected]

Abstract: A multi-agent system (MAS) is formed by a number of agents connected together to achieve the desired goals specified by the design. Usually in a multi-agent system, agents work on behalf of a user to accomplish given goals. In MAS, co-ordination, co-operation, negotiation and communication are important aspects of achieving fault tolerance. A multi-agent system is likely to fail in a distributed environment and, as a result, the resources for the MAS may not be available due to the failure of an agent, machine crashes, process failure, software failure, communication failure and/or hardware failure. Therefore, many researchers have proposed fault tolerance approaches to overcome failure in MAS. We have surveyed these approaches in this paper, whereby our contribution is threefold. Firstly, we have provided a taxonomy of faults and techniques in MAS. Secondly, we have provided a qualitative comparison of existing fault tolerance approaches. Thirdly, we have provided an evaluation of existing fault tolerance techniques. Results show that most of the existing schemes are not very efficient, due to various reasons like high computation costs, costly replication and large communication overheads.

Index Terms: Multi Agent System, Fault Tolerance, Agents, Adaptive Replication, Redundancy.

I. INTRODUCTION

A multi-agent system (MAS) is composed of multiple interacting intelligent agents within a given environment. These agents co-operate to solve difficulties that are beyond the capability or knowledge of each single problem solver. There are several key characteristics of agents, such as adaptation, scalability, re-usability, local view, autonomy, responsiveness and distribution. In order to achieve the necessary goals, agents are required to be able to communicate with many other agents in the environment (Byrski et al. [1]). There are various applications of MAS like aircraft maintenance, environment monitoring, military demining, surveillance, internet agents, health care, spacecraft control and industrial monitoring [2][3][4].

In this paper, we focus on the fault tolerance of MAS. There are various fault tolerance needs that a MAS should satisfy in order to mitigate failure:

• Agents need to collaborate with each other to avoid failure.
• Information sent over the MAS should be transparent during transmission.
• Availability of other agents in the MAS should be ensured when an agent fails.
• The agent's system should have the ability to take decisions based on knowledge.
• Agents need to communicate in a secure manner and data should be protected in case of failure.
• Agents should have autonomy in case of failure. They should be able to provide services without affecting other agents.
• An agent should have scalability and complexity so that it can deal with any size of agent without affecting performance.
• An agent should be robust enough to confront any failure, i.e. process failure, crashing failure etc., whereby it can provide services without any interruption.
• Agents should have the ability to adapt to any condition, in any environment, in case of failure.

In MAS, there are several factors that decrease performance and reliability. One of these is a failure of the system. If there is any fault in the system, it will stop working and cause a delay in achieving the required goals. In order to increase the reliability of MAS, the system should be fault tolerant. If there is any fault in the system, it should have the ability to mask the failure in order to continue providing the necessary services without any delay (Abbas et al. [5]). If a system is fault tolerant, this will also increase the performance of the system.

In this research paper, we classify the fault tolerant multi-agent system (FTMAS) into different categories on the basis of recovery techniques, and we present a taxonomy of both faults and techniques. We have also provided a qualitative comparison of recent fault recovery approaches for MAS. We discovered that researchers are applying replication and non-replication based fault recovery approaches for FTMAS. We also examined existing techniques on the basis of their attributes, such as characteristics, failures, types of agents, environment and replication protocol.

The rest of the paper is organized as follows. Section II contains a brief background of FTMAS. Section III briefly summarizes the existing literature. Section IV presents a taxonomy of FTMAS. In Section V, we provide a comparison of FTMAS techniques. Section VI summarizes the advantages and disadvantages of existing techniques in FTMAS. In Section VII, we conduct an evaluation. In Section VIII, we discuss future challenges and issues. Finally, Section IX concludes the paper.

II. BACKGROUND

A multi agent system comprises various agents, entities etc. A single agent is capable of carrying out independent actions to achieve the delegated goals. This agent system may work in different environments according to the tasks set to it and the responsibilities assigned to it by the system or by the program inside the agent system. Figure 1 shows the basic operation and work performed by the agent system.

Fig.1. A multi agent system

A specific agent will carry out specific goals in the multi agent system, for example: Environment: Patient & hospital, Goal: Healthy patients, Actions: Tests and treatments, Percepts: Patient symptoms, Agent Type: Medical diagnosis. An agent has a range of basic characteristics, i.e. reactiveness, reliability, scalability, autonomy, robustness, intelligence, persistency, goal-orientation, adaptability and sociability.

In multi agent systems (MAS), several agents work together to achieve task-oriented goals on behalf of the user or human (Maciel et al. [6]). Successful interaction is required among agents in MAS to negotiate, coordinate and cooperate with each agent in the environment (Gerrard et al. [7]). The best examples of MAS are Internet agents and agents used in spacecraft control (Li et al. [8]). Nowadays, researchers and developers alike are using agents in the distributed environment, such as environment agents, which need co-ordination, co-operation and negotiation. These are the basic issues that MAS has in each environment (Davoodi et al. [9]). The failure rate increases when there is less co-ordination, co-operation and communication among the agents, and this leads to the failure of the system. Hence, these types of failures are subject to the host, machine and exception set (Wang et al. [10]).

Several fault tolerance techniques have been proposed to mask the faults in MAS. Each technique differs in its ability to mask failure in MAS.

III. LITERATURE REVIEW

This section presents an overview of the state of the art of fault tolerance techniques in the multi agent system (FTMAS). This overview comprises a discussion of the assumptions, objectives, methodologies and key approaches present in existing works. Based on this section, we will present a taxonomy and comparison in the following sections.

A. Towards FTMAS Architecture

Kumar et al. [11] describe that failure could happen at any time in the MAS of any distributed system. Many agents were not available due to process failure, exceptions and breakdown of communication. Existing fault tolerance approaches ranged from database recovery, TS monitoring, resource managers and fault tolerant distributed systems up to application servers. There were many issues in these techniques, such as using replication schemes as a critical system for monitoring: while replication increases the reliability of the system, it duplicates the data and services. Moreover, many systems saved the application state, but this also created many problems during recovery. To overcome the limitations of these traditional fault tolerance techniques, they proposed the Adaptive Agent Architecture (AAA) for the multi-agent system (MAS). AAA overcomes a problem like a broker failure without incurring undue overheads. There may be more than one such broker in a large multi-agent system. In the case of sudden unavailability of a broker in AAA, they used a team based approach for automatic recovery of the MAS. Furthermore, for the recovery, they assumed three different recovery schemes, namely logical characterization, recovery scheme and recovery scenario. In these assumptions, they described different steps, theorems and characterizations of performance. Their results show that autonomous agents can make a multi-agent system more robust.

B. Towards Adaptive FTDMAS

Marin et al. [12] have also proposed an adaptive architecture for the multi-agent system (MAS). It deals with existing problems in MAS using new methodologies. MAS, as a distributed system, may by its very nature incur failure at any time. Moreover, because it is a distributed system, the computations of dynamic applications often change during execution. They therefore tried to make the architecture more flexible, to overcome the flaws of the conventional system. In the proposed architecture, a software element can be either replicated or unreplicated on the spot. The advantage of this approach is that replication tactics can be changed in a matter of a few seconds. The main objective of the selected architecture is to make fault tolerance more efficient for MAS, using selective replication techniques. An outcome of this approach is an architecture that is suitable for dynamic fault tolerance for applications. They used the selective replication scheme because many problems existed in approaches to dynamic applications. Moreover, they also introduced a framework, namely DARX (dynamic adaptive replication extension), which uses both active and passive replication and is specially designed for distributed applications. It has many advantages, i.e. dynamically adding or removing replicas, atomic and ordered multicast for each replication group etc. To manage the failure of the system, there is a replication manager associated with each group, which performs the following functions: 1) Maintenance of information within the group 2) Suspension and resumption activity 3) Diffusion of a message and 4) Switching the replication strategy. The benefits of performing these functions are that the replication manager can recover from the failure quickly and that, when one group fails, the other groups have all the information needed in order to activate a new replica. A simulation display ensures that minimum energy is utilized between nodes to carry out the task, as a single copy of data will be sent, and it also improves the probability of delivery.

C. Towards Automatic FTMAS

Almeida et al. [13] presented an automated fault tolerance (FT) MAS scenario. They described that there are many possibilities whereby an exception or failure can occur at any time in the system. These failures occur when recovery and fault tolerance approaches are defined at the design level. Indeed, it is very difficult to decide at the design level when and where to apply the FT approach (i.e. replication). Conventional approaches break down when it comes to dynamic systems (i.e. MAS). Such applications could be ambient intelligence systems, e-commerce, crisis management systems or the air traffic control system. According to the situation and the nature of interdependencies in these applications, an agent can change its role during the computation stage. Therefore, to overcome all these difficulties and to make the FT management automatic and dynamic, they considered a self-adaptation FT approach.

In MAS, multiple errors may occur, but they are only considered as crash type failures. To mask these types of failures, replication is considered an ideal approach. There are various types of replication approaches, from static to non-static and explicit replication. Moreover, for replication they presented dynamic and automatic control of replication. Hence, they chose the DARX framework, which has dynamic distributed replication features. Using this system, they estimated the criticality of the agents in the system from different types of information, i.e. messages, plans, roles etc.

D. Decentralized Architecture for FTMAS

Khan et al. [14] presented a fault tolerant decentralized architecture for the multi-agent system. Most applications lack fault tolerance. There is an expectation that usage of MAS in different distributed applications will increase. However, there are many faults existing within the agent platform, causing a multitude of problems. To overcome all these problems they introduced a decentralized architecture as an alternative to the centralized architecture of the agent platform (AP). Figure 2 shows the working of the decentralized architecture, namely the Virtual Agent Cluster (VAC). When a single agent platform is deployed it includes all machines. A similarity exists between the virtual agent cluster and cluster computing, where the front processor distributes the load among the machines.

The Agent Communication Language (ACL) also acts as a front processor. It is used as an interface to communicate with another agent in the system. The communication between machines is bi-directional through IP addresses, whereby in each machine there is an Agent Management System (AMS). It is organized in such a manner that failure of one machine does not affect the others. The characteristics of this architecture include fault tolerance and recovery, autonomy, an application layering architecture (inter VAC and intra VAC) and load balancing. These characteristics make this decentralized architecture more flexible in scale and more fault tolerant. Moreover, fault tolerance is the most substantial advantage that can be achieved through this decentralized architecture.

E. Plan-Based Replication for FTMAS

Almeida et al. [15] presented plan-based replication for the fault tolerant multi-agent system. In their proposed scheme, they used this method for stipulating the dependability of MAS through replication. This method is different from the others cited above: here they focus on predictive and adaptive replication, whereby the critical agents are replicated to overcome failures. While some applications use static replication, here they use dynamic replication. The latter has advantages over static replication, i.e. re-allocation of tasks, changing the role of an agent, flexible organization etc. Moreover, it is very important to replicate an agent through dynamic and automatic means. Here, they are more focused on building reliable MAS. A plan based fault tolerance method is promising for prevention because it predicts upcoming behavioral patterns of an agent. To make MAS more reliable, the original predictive approach calculates the criticality of the agent dynamically. This criticality is then used for replication, so as to increase dependability on the basis of the resources that are available. They also validated their approach on the DARX framework and DIMA. In this strategy, an agent is implemented as a DIMA agent and DARX is used to obtain replication capabilities.

F. Adaptive and Automated FTMAS

Singh et al. [16] have proposed a framework for critical agents in the multi-agent system (MAS), based on the cardinality of an agent. Sometimes replication can become very costly due to the complexity of the system; moreover, dynamic replication is also a need of all agents in fault tolerant MAS. To overcome these issues they proposed this particular framework. They mixed two techniques, namely active and passive replication, whereby critical agents are actively replicated and receive more attention relative to other agents. The benefits of this approach are reduced complexity and cost of the system, optimal utilization and, more importantly, optimal fault tolerance. The proposed framework is hybrid, having the automated and adaptive characteristics of fault tolerance. This framework has three different components: 1) Replica Store 2) Fault Management Agent (FMA) and 3) Event Monitoring Agent (EMA).

In the replica store, the central fault management unit (CFMU) divides it into two phases, an active and a passive replica store. A passive replica is used to update the agent periodically. For critical agents, active replicas are used so that a working replica is always available. All fault management control is done by the FMA. It also retains information about the replication, whether it is active or passive. The last component is the EMA, which is responsible for keeping track of the information related to crashes and for putting in the substitute replica for the crashed agent. Results show that actively replicating fifty percent of the agents can remove the complexity of the system. Moreover, with the proposed system, the scalability of fault tolerant multi-agent systems can be improved.

G. Hybrid Based Approach for FTMAS

Koppensteiner et al. [17] have proposed a hybrid fault tolerant multi-agent system using the heartbeat mechanism. They used this mechanism to detect failure in MAS. They found three different types of failures, namely: 1) System disturbance 2) Physical Component Failure and 3) Software Entity Failure. To recover from physical component failure, i.e. a failure in tangible hardware or a failure in a block based application controlling function, they introduced the heartbeat mechanism. Using the heartbeat between the LLC (Low-Level Control layer) and HLC (High-Level Control layer), they minimized messages to maintain the system's stability. This approach also implements the heartbeat method separately from the distribution of messages on the system. If there is a fault inside the system, messages are communicated only if necessary. In a situation of complete failure in the system, both LLC and HLC will be used to detect which agent has failed. Utilizing the heartbeat method, they will try to fix it.

H. Choice of Sampling Rates in FTMAS

Bora et al. [18] proposed fault tolerance in a multi-agent system based on the sampling period. To increase fault tolerance in distributed and dynamic systems, adaptive replication techniques are very useful. But there is one disadvantage of this approach: adaptive replication increases the cost. To overcome this drawback, a sampling period was introduced to minimize the cost. This technique monitors critical agents and chooses the appropriate replication for each agent based on its criticality. They applied this technique to an abstract architecture for adaptive replication. This architecture consists of a replication manager, which is responsible for providing active and passive replication among different replicas. It also monitors and handles faults inside the replicas. The main modules contained in this architecture are observation and feedback control, and the replication manager utilizes both. The observation module collects information about the system and passes this information to feedback control. All this information is processed by feedback control, which calculates the relative critical value of each agent and then decides which agent is most critical. Then it applies the adaptive replication policy based on the criticality of the system. This architecture covers the crash type of failure in multi-agent systems. In this research paper, they also assume that the sampling period will maximize accuracy, reduce the cost of replication and increase the response time of the system.

I. A Decision-making Based Approach for Fault Handling in Multi-Agent System

Mirian et al. [19] introduced a new decision-based technique for fault handling in the multi-agent system. They described the multi-agent system as a distributed system where a fault can occur at any time. In the paper, they focused on faulty agents and their recovery in the multi-agent system. In the presented technique, a faulty agent sends a request to other agents, or its team agents come to know that this agent is faulty and needs help; several help requests may then exist. Which help request is appropriate and which will be effective are decided at the decision-making phase. At this stage, they also use the best fit, first come first serve and shortest job first algorithms to make the decision for the help request. For this methodology there is no central agent; all agents are decentralized. Each agent has knowledge about the environment and the existing agents in the environment. All agents also have the ability to perform the tasks of other agents. If an agent fails in the system, another agent can help based on the decision-making phase.

J. Distributed Adaptive Fault-Tolerant Consensus Control of Multi-Agent System with Actuator Faults

Khalili et al. [20] presented a distributed FT consensus control of MAS with actuator faults. This FTMAS is based on three different assumptions. In this distributed system an FT control component was developed to perform a two-step process between the agents. The first step diagnoses the fault in the MAS, while the second provides an opportunity to recover in an adaptive manner. These assumptions are constructed using mathematical equations and, in particular, vectors. Using the assumptions, the system's stability can be checked with the closed-loop mechanism. The main objective of this system is to develop an algorithm that diagnoses and recovers faults. A unique feature of this algorithm is that it takes an information-neighboring algorithm and applies its actions.

IV. TAXONOMY OF FAULTS IN MAS

In this section, we present a taxonomy of faults and their related techniques. First of all, we divided the faults into two different categories, namely fail silent and fail uncontrolled. These are the faults we found in the different papers that we surveyed. Fail silent faults are those which belong to the crash type of failure. Fail uncontrolled faults, on the other hand, are those failures where any type of fault or failure can occur. The faults are then further subdivided into different types, as given in Figure 2.

Fig.2. Taxonomy of faults in MAS

We also derived a taxonomy of the techniques that researchers are applying for fault tolerance in the multi-agent system. We have classified these approaches into three different categories: replication based, non-replication based and hybrid approaches. We then further subdivided these techniques; for example, the replication based approach has active replication, passive replication and adaptive replication. Moreover, we further subdivided the non-replication based approaches into two different types, these being architecture oriented and mathematical/algorithmic, as given in Figure 3. These are the existing techniques that are used for fault tolerance; if there is any fault in the system, using these techniques we can avoid failure of the whole system. These approaches have their own advantages and disadvantages, which vary according to the environment where the methods are being applied.

Fig.3. Taxonomy of techniques against the faults in MAS

V. QUALITATIVE COMPARISON OF FTMAS

In this section, we provide a qualitative comparison of the existing fault tolerance approaches of MAS, as given in Table 1. For this purpose we have used the following parameters: 1) Agent type 2) Fault tolerance technique 3) Objectives 4) Language 5) Type of failure 6) Replication protocol 7) Characteristics and 8) Environment. According to this table, Kumar et al. [11] have adopted the object replication based fault tolerance approach for MAS, which has characteristics like autonomy and local view; the main objective of this approach is to achieve faster fault recovery, as they address broker process failure. Moreover, Marin et al. [12] and Almeida et al. [15] provide the dynamic replication approach for the fault tolerant multi-agent system, with the objective that the agent should execute the goal. Furthermore, this technique covers machine and host failures, network failure and distributed agent failures. They also use different types of agents, namely selective agents and critical agents. Moreover, both use the Knowledge Query and Manipulation Language (KQML) for agent communication.
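The technique categories of Section IV and the eight comparison parameters listed above can be sketched as a small data model. This is only an illustrative sketch and not part of the surveyed work: the names `TECHNIQUE_TAXONOMY`, `ApproachRecord`, `MARIN` and `covers_failure` are hypothetical, and the example values are transcribed from this survey's description of Marin et al. [12].

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Technique categories as named in Section IV (hypothetical encoding).
TECHNIQUE_TAXONOMY: Dict[str, Tuple[str, ...]] = {
    "replication based": (
        "active replication",
        "passive replication",
        "adaptive replication",
    ),
    "non-replication based": (
        "architecture oriented",
        "mathematical/algorithmic",
    ),
    "hybrid": (),
}


@dataclass(frozen=True)
class ApproachRecord:
    """One surveyed approach, described by the eight comparison parameters."""
    paper: str
    agent_type: str
    technique: str
    objectives: Tuple[str, ...]
    language: str
    failure_types: Tuple[str, ...]
    replication_protocol: str
    characteristics: Tuple[str, ...]
    environment: str


# Example row: values as reported in this survey for Marin et al. [12].
MARIN = ApproachRecord(
    paper="Marin et al. [12]",
    agent_type="Selective agent",
    technique="Bypass dynamic replication",
    objectives=(
        "Efficient FT for MAS through selective agent replication",
        "Appropriate architecture for dynamic FT",
    ),
    language="KQML",
    failure_types=(
        "Host and network failures",
        "Failure of an agent in distributed MA applications",
    ),
    replication_protocol="Dynamic replication",
    characteristics=("Autonomy", "Run time replication change"),
    environment="Continuous environment",
)


def covers_failure(record: ApproachRecord, keyword: str) -> bool:
    """Return True if any listed failure type mentions the keyword."""
    needle = keyword.lower()
    return any(needle in failure.lower() for failure in record.failure_types)


if __name__ == "__main__":
    print(MARIN.paper, "covers network failures:", covers_failure(MARIN, "network"))
```

Keeping each approach as one record makes it straightforward to filter the surveyed approaches by any of the eight parameters, for example by the type of failure they cover.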
Table 1. Qualitative Comparison

Kumar et al. [11]
  Technique: Adaptive Agent Architecture
  Characteristics: Autonomy, Local views
  Type of Failure: i) Machine crashes. ii) End of broker process. iii) Network breakdown
  Replication Protocols: Object group replication and virtual synchrony
  Objectives: i) To achieve warm backups. ii) Object group
  Agent Type: Complex Agents
  Environment: Virtual Environment

Marin et al. [12]
  Technique: Bypass dynamic Replication
  Characteristics: Autonomy, Run time replication change
  Type of Failure: i) Host and Network failures. ii) Failure of an agent in distributed MA applications
  Replication Protocols: Dynamic Replication
  Objectives: i) Efficient FT for MAS through Selective Agent Replication. ii) Appropriate architecture for dynamic FT
  Agent Type: Selective Agent
  Environment: Continuous Environment

Almeida et al. [13]
  Technique: Self adaptation of fault tolerance
  Characteristics: Autonomous, Dynamic, automatic
  Type of Failure: i) Crash type of failure caused by internal factors (hardware issues and OS crashes) or external factors (malicious attacks, environment tragedy and power failure)
  Replication Protocols: Adaptive Replication
  Objectives: i) To make the management of fault tolerance autonomous. ii) To make this fault tolerance management dynamic and automatic
  Agent Type: Critical Agent
  Environment: Discrete Environment

Khan et al. [14]
  Technique: Virtual Agent Cluster
  Characteristics: Faster Recovery and Fault Tolerant, Autonomous, Architecture for application layering (intraVAC and interVAC), Balancing the Load
  Type of Failure: i) Centralized AMS lacks fault tolerance. ii) Centralized system becomes a bottleneck under heavy load. iii) Utilization of information service and provision of QoS (timeliness and reliability)
  Replication Protocols: Active replication
  Objectives: i) To embrace the peer-to-peer computing paradigm. ii) To eliminate the limitations present in the existing Agent Platform (AP)
  Agent Type: Active Agent
  Environment: Virtual Environment

Almeida et al. [15]
  Technique: Replication of critical agents
  Characteristics: Predictive, Automatic and Adaptive
  Type of Failure: i) Failure of an agent or a machine. ii) Host failure and process failure based on an adaptive failure indicators hierarchy
  Replication Protocols: Active and Passive Replication (dynamically applied)
  Objectives: i) Provide some action, based on a calculated value of the system, that will be executed by the agent in the near future in case of failure of an agent. ii) Each agent should execute each plan (action) in order to achieve the goal
  Agent Type: Critical Agent
  Environment: Continuous Environment

Singh et al. [16]
  Technique: Automatic and adaptive fault recovery, Central fault Management
  Characteristics: Local views
  Type of Failure: i) Crash or failure of an agent. ii) Critical agent failure
  Replication Protocols: Adaptive Replication
  Objectives: i) Using both active and passive replication to make the MAS more scalable, reliable and fault tolerant. ii) To reduce cost and complexity of the system
  Agent Type: Critical Agent
  Environment: Transient Environment

Koppensteiner et al. [17]
  Technique: Heartbeat mechanism (failure detection), Supervisor Agent Approach (system failure absorption, fault recovery)
  Characteristics: Autonomous, Local views
  Type of Failure: i) Physical components failure (breakdown of resource, temporary failure). ii) Software entity failure. iii) Partial agent failures
  Objectives: i) To increase the stability of the whole system and to shorten the reaction time. ii) To enhance the fault tolerance of a complex system
  Agent Type: Complex Agent
  Environment: Discrete Environment

Khalili et al. [20]
  Technique: Distributed adaptive fault-tolerant
  Characteristics: Autonomous, Adaptive
  Type of Failure: i) Crash failure. ii) Failure of agents
  Replication Protocols: Adaptive Replication
  Objectives: To reduce the cost of replication in MAS and the computation overhead
  Agent Type: Critical Agent
  Environment: Discrete Environment

Additionally, Almeida et al. [13] and Khan et al. [14] presented a replication based and a non-replication based approach, respectively. To overcome the drawback of the centralized MAS system, Khan et al. [14] presented a decentralized architecture, since the centralized architecture was thought to develop bottlenecks under a heavy load. They used the Agent Communication Language (ACL) for communication between agents by using active agents. The main objectives of this technique are i) To embrace a peer-to-peer computing paradigm and ii) To eliminate the limitations present in the existing Agent Platform (AP).

On the other hand, Almeida et al. [13] used KQML for the agents. They also used a discrete environment, where the agents can perform only limited actions. The goal of this approach is to make the management of fault tolerance autonomous and also to make this fault tolerance management dynamic and automatic.

Singh et al. [16] have provided automatic and adaptive fault recovery for the multi-agent system. Characteristics of this approach are a local view and autonomy. The main objectives of this approach are i) using both active and passive replication to make MAS more scalable, reliable and fault tolerant and ii) reducing the cost and complexity of the system. They also use the agent in a transient environment. The technique that they provided covers the following types of failures: i) Crash or failure of an agent ii) Critical agent failure.

Koppensteiner et al. [17] and Bora et al. [18] have presented the heartbeat mechanism and the choice of the sampling rate for fault recovery, respectively. These approaches cover physical component failure, partial agent failure, crash failure and failure of agents. This comparison gives us a clear idea of the different approaches for fault tolerant multi-agent systems and of how to deal with these failures using an appropriate approach, which is efficient, has fewer overheads, is easy to use and is less costly.

Table 2. Pros and Cons of Existing Techniques

Kumar et al. [11]
  Technique: Adaptive Agent Architecture
  Pros: i) Effects of recovery on response time. ii) Effects of transition on response time. iii) Fewer overheads of using teamwork.
  Cons: i) More focused on broker failure tolerance. ii) Less focused on individual agents. iii) Requires extra computing for the management of the brokerage layer.

Marin et al. [12]
  Technique: Bypass dynamic Replication
  Pros: i) Fast way to handle faults and recovery. ii) Improves reliability and fault tolerance. iii) Improves accessibility.
  Cons: i) Replication is very costly. ii) More computations are required.

Almeida et al. [13]
  Technique: Self adaptation of fault tolerance (dynamic adaptation of replication strategies)
  Pros: i) It provides dynamic replication; both active replication and passive replication can be used. It provides better recovery.
  Cons: i) When both active replication and passive replication are used in MAS, such an environment proves very costly.

Khan et al. [14]
  Technique: Virtual Agent Cluster
  Pros: i) Decentralized MAS is less faulty as compared to centralized. ii) More reliable. iii) Faster as compared to centralized.
  Cons: i) It falls short of addressing the heterogeneity issue. ii) Cost implication of recovery in a multi organizational context. iii) More overheads.

Almeida et al. [15]
  Technique: Replication of critical agents
  Pros: i) It provides replication based on the criticality of agents: whichever agent is more critical, replication is applied to it.
  Cons: i) It is very hard to find out which agent is more critical in a multi agent system for fault tolerance.

Singh et al. [16]
  Technique: Automatic and adaptive fault recovery, central fault Management
  Pros: i) This approach provides automatic recovery for faults when they occur, and adaptive fault recovery.
  Cons: i) This approach has high overheads. ii) Reliability of this technique is lower as compared to the other techniques listed in this table.

Koppensteiner et al. [17]
  Technique: Heartbeat mechanism (failure detection) and Supervisor agent approach (system failure absorption, fault recovery)
  Pros: i) It provides faster fault recovery. ii) Messages are sent at a specified period of time, through which it can easily find out which agent has failed, thus providing quick recovery.
  Cons: i) Computation cost is high as it involves a lot of work communicating with each agent. ii) Reliability is very low. iii) Very slow technique that causes more overheads due to sending regular messages after a specified period of time.

Bora et al. [18]
  Technique: Adaptive Replication based on sampling rates
  Pros: Using the sampling rate, replication cost will decrease. Response time of switching from active to passive and vice versa will decrease. Fault tolerance and reliability can be achieved easily using the sampling rate. Adaptive replication increases the response time of the system.
  Cons: i) Adaptive Replication is very costly. ii) Overhead will be very high.

VI. PROS AND CONS OF EXISTING TECHNIQUES

In this section we describe the different approaches and their pros and cons using the parameters given above [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. We identify the advantages and disadvantages of the fault tolerance approaches that we surveyed in the literature review. In Table 2 we can see that some techniques provide better fault tolerance recovery for any failure of the multi agent system. This shows us how effective the techniques are for fault recovery.

main objective was that an agent should execute each
plan (of action) in order to achieve the overall goal. When the res earcher propos ed a technique for fault Singh et al. [16] chos e an automatic and adaptive fault tolerance they ignored other as pects of fault tolerance, as recovery and central fault management technique. They it can caus e high overheads and perform s o me e xpens ive tried to min imize s ystem cras hes , agent failures and computations . They als o decreas ed the reliability of the critica l agent fa ilure . Their ma in objectives are to reduce s ys tem and reduced the performance of MAS. Moreover, cost and comple xity of the s ys tem. Meanwhile, table 2 s hows overheads of fault recovery, reliability, Koppens teiner et al. [17] have imp le mented the heartbeat improve ment in performance and co mputational cos t of mechanis m for failure detection, s upervis or agent thes e approaches . approach for s ys tem fa ilu re abs orption and fault recovery. They applied this technique to overco me phys ical components failure and breakdown of the whole res ource including te mporary fa ilure . The ma in goal is to increas e VII. DISCUSSION OF EVALUAT IONS AND COMP ARISON the stability of the s ys tem and s horten the reaction time - In this s ection, we evaluated different s chemes and overall - to enhance the fault tolerance of a comple x access ed them for fau lt tolerance recovery. Beg inning s ys tem. In Khalili, Mohs en et al. [20] they have propos ed with Ku mar et a l. [11] who propos ed adaptive agent a s amp ling rate in FTMAS which tries to overcome cras h architecture to mas k the failure in the mult i-agent s ystem. failures or fa ilures of an agent in the MAS environment, This has s everal characteris tics namely , autonomy, local whereby they applied adaptive rep licat io n for redundancy view and mobility. Us ing this approach they have to mas k the fa ilu re in FTMAS. 
This technique has covered different types of failure na me ly, machine performed re latively better than other techniques but it cras hes , end of broke r proces s and network bread down. als o has large overheads of FT for recovery. Moreover, they applied the object group rep licat ion for Moreover, we have s een that some approaches perform redundancy in MAS. The main objective o f this approach better in fau lt recovery as co mpared to other techniques . is to achieve wa rm bac kup, object group and virtual Some techniques have higher overheads and perform s ynchrony. Marin, Oliv ier et a l. [12] have propos ed the costly computations for fault tolerance. There s hould be dynamic replication technique. Characteris tics of this an appropriate technique that enhances performance on approach are Autonomy and it has run time rep licat ion fault recovery, rather than making it s uffer. changes . Us ing this technique they mas k the failure of host and network, thus effectively the failure of any agent in the dis tributed environment. They applied dynamic VIII. FUT URE CHALLENGES AND ISSUES replicat ion protocol for redundancy. The main objective of this approach provides efficient FT for MAS through There has been advancement in mu lti -agent s ystem s elective agent replication. Almeida et a l. [13] have technology and its us age in our daily life is increas ing. applied s elf-adaptation of fault tolerance approach having Even though a lot of work has been done for fault the characteris tics of dynamic and automat ic recovery of tolerance in a mult i-agent s ys tem (FTMAS) but the iss ue fault. They tried to mas k fa ilures s uch as : cras hes caus ed regarding failure recovery of MAS has still not been by internal hardware is s ues and operating s ystem cras hes overcome yet. As MAS gets further dis tributed, fa ilure can or e xterna l malic ious attacks , environmental tragedy and occur at any time. power failure. They have applied adaptive replication. 
There are various challenges in FTMAS The main object ive of this technique is to make implementation. It is s till a complex tas k. autonomous the management of fault tolerance and als o to ma ke this fau lt tolerance manage ment dynamic and  Fro m the literature s urvey, we found that mos t of automatic. Khan, Abbas et al. [14] have virtual agent the exis ting fault tolerance approaches are not clus ter (VA C) with the following characteris tics : Fas ter providing bas ic fault recovery features in MAS recovery and fault tolerant, autonomous , architecture for like reliability, s calability, adaptability and application layering into intra VA C and interVA C and robus tnes s. balancing the load. They tried to mas k fa ilures s uch as  A challenging is s ue in des igning fault tolerance cras hes caus ed by internal hardware is s ues and OS architecture for Multi Agent s ys tem (MAS) is its cras hes or e xternal ma licious attacks , environmental dis tributed nature, prone to failure at any time. tragedy and power failure us ing the active replication.  Another ma jor proble m is that there is no s tandard The ma in objective of th is s cheme is to embrace a evaluation for the fra mewo rk of FTMAS that is peer-to-peer co mputing paradig m and eliminate the needed for comparis on purposes . Currently each limitat ions pres ent in e xis ting agent platform (AP). For res earcher us es their own criteria for evaluation. communicat ion of agents with other agents in MAS, they  MAS has a lack of reliability in progra mming tools have us ed Agent Commun ication Language (ACL). and s pecialized debugging tools . Skills a re als o Almeida, Aknine et al. [15] have provided replication of needed to s hift fro m an analys is and des ign phas e to critica l agents having features namely, p redictive, coding, as well as is s ues in unders tanding the automatic and adaptive. They tried to mas k agent or environment and methodology. mach ine failure. 
Hos t failu re and process failure are bas ed on an adaptive failure indicators hierarchy. The ir Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 9, 39-48 A Survey on Fault Tolerant M ulti Agent Sy stem 47 Ubiquitous App licat ions in Smart Environments." In Agent Technology for Intelligent M obile Services and IX. CONCLUSION Smart Societies, p p . 106-116. Sp ringer Ber lin Heidelb er g, Currently, mu lti-agent s ys tem is being us ed in different 2015. applications in a dis tributed environ ment. In MAS, as [7] Gerrard, Clair e E., John M cCall, Christop her M acleod, and Geor ge M . Cogh ill. "App lications and design of there are many agents s o there are several challenges that coop erative multi-agent ARN-based systems." Soft can occur. For e xa mp le, co -ord ination, co-operation, Comp uting (2015): 1-14. negotiation and communicat ion in a dis tributed [8] Li, Ni, Xian g Li, Yuzhong Shen, Zhumin g Bi, and environment. When one agent does not co -operate due to a M inghui Sun. "Risk assessment model based on fault then other components of MAS als o do not provide multi-agent sy stems for comp lex p roduct design." their s ervices . Then fa ilures like machine cras hes , process Information Sy stems Front iers 17, no. 2 (2015): 363-385. failure, s oftware failure , co mmunication fa ilure and [9] Davoodi, M ohammad Reza, Khashay ar Khorasani, Heid ar hardware fa ilure occur. There fore, in this res earch paper, Ali Talebi, and Hamid Reza M omeni. "Distributed fault detection and isolation filter design for a n etwork of we have s urveyed the many techniques for fault tole rance heterogen eous multiagent systems." Control Sy stems in a mult i-agent s ys tem s o that failures can be overcome. Technology , IEEE Transactions on 22, no. 3 (2014) : In this res earch paper, we have pres ented exis ting 1061-1069. techniques , which are very effective for fau lt t olerance, by [10] Wan g, Yannan, Yunin g Son g, and Frank Lewis. 
"Robust providing related work and then clas s ifying thes e Adaptive Fault -tolerant Control of M ulti-agent Sy stems approaches into different categories . We als o categorized with Uncertain Non-identical Dy namics and Undetectable failures that occur in the mu lti agent s ys tem. Fu rthermore, Actuation Failures." (2015). we have a ls o provided a qualitative co mparis on of e xis ting [11] Kumar, San jeev, and Philip R. Cohen. "Towards fault tolerance approaches . In this comparis on, we locate fault-tolerant multi-agent system architecture." Proceedin gs of the fourth international confer ence on diffe rent para meters s o that we can identify by co mparing Autonomous agents. ACM , 2000. which technique is better for mas king a fault in MAS. We [12] M arin, Olivier, Pierre Sens, Jean-Pierre Briot, and Zahia have provided the pros and cons of exis ting fault tolerance Guessoum. "Towards adap tive fault tolerance for techniques . It s hows that mos t of the e xis ting s chemes are distributed multi-agent sy stems." In Proceedings of not effic ient due to various reas ons like high co mputation ERSADS, p p . 195-201. 2001. cost, cos tly replication and large co mmunicat ion [13] Almeida, Alessandro, Jean-Pierre Br iot, Samir Aknin e, overheads . We have found out that when res earchers Zahia Guessoum, and Oliv ier M arin. "Towards autonomic propos ed a technique for fault to lerance, they ignored its fault-tolerant multi-agent sy stems." In The 2nd Latin overheads which when applied to MAS, proved very American Autonomic Co mp uting Sy mp osium (LAACS’2007), Petrop olis, RJ, Brésil. 2007. costly. It provides fault tole rance but on the other hand, it [14] Khan, Zaheer Abbas, Salman Sh ahid, H. Farooq Ahmad, als o degrades the performance of the s ys tem and reliability. Arshad Ali, and Hiroki Suguri. "Decentralized There s hould be an appropriate technique, which provides architecture for fau lt tolerant multi agent sy stem." 
In fault tolerance with fe wer overheads and henc e less Autonomous Decentralized Sy stems, 2005. ISADS 2005. expens ive for computation. Proceedings, p p . 167-174. IEEE, 2005. [15] Alessandro de Lun a Almeida , Samir Aknine , Jean-Pierr e REFERENCES Briot , Jacques M alenfant, Plan-based rep lication for fault-tolerant multi-agent sy stems, Proceedin gs of the [1] By rski, Aleksander, Rafał Dreżewski, Leszek Siwik, and 20th international conference on Parallel and d istributed M arek Kisiel-Dorohinicki. "Evolutionary multi-agent p rocessing, p .347-347, Ap ril 25-29, 2006, Rhodes Island, systems." The Knowledge En gineerin g Review 30, no. 02 Greece. (2015): 171-186. [16] Sin gh, Aarti, Dimp le Juneja, and A. K. Sharma. "Adap tive [2] Eddy , Foo, H. B. Gooi, and S. X. Chen. "Multi-Agent and automated fault -tolerance for mu lti-agent sy stems." In Sy stem for Distributed M anagement of Comp uter Science and Automation En gineer in g (C SAE), M icrogrids." Power Sy stems, IEEE Transactions on 30, 2011 IEEE International Conf erence on, vo l. 1, p p . 53-57. no. 1 (2015): 24-34. IEEE, 2011. [3] Sajja, Priti Sr inivas. "Automatic Gen eration of Agents [17] Kopp ensteiner, Gottfried, M unir M erdan, Wilfried using Reusab le Soft Comp uting Code Libraries to develop Lep uschitz, and Ingo Hegny . "Hy brid based app roach for M ulti Agent Sy stem for Healthcare."International Journal fault tolrance in a multi-agent sy stem." In Advanced of Information Technology and Comp uter Science Intelligent M echatronics, 2009. AIM 2009. IEEE/ASM E (IJITCS) 7, no. 5 (2015): 48. International Conference on, p p . 679-684. IEEE, 2009. [4] Yadav, Sandeep Singh, and M andeep Singh Yad av. [18] Bora, Sebn em, and O guz Dikenelli. "On the choice of "Develop ment of Sy stem for Automated & Secure samp ling r ates in a fau lt -tolerant multi-agent sy stem." 
In Generation of Content (ASCGS)."International Journal of 2012 International Sy mp osium on Innovations in Information Technolo gy and Comp uter Science (IJITCS) Intelligent Sy stems and Ap p lications. 2012. 7, no. 11 (2015): 81. [19] M irian, M ary am S., M ajid Nili Ahmadabadi, and [5] Abbas, Hosny Ahmed, Samir Ibrah im Shah een, and Zainalabed m Navab i. "A decision-mak in g based ap p roach M ohammed Hussein Amin. "Or ganization of multi-agent for fault-handlin g in multi-agent sy stems." InNeural systems: an overview." arXiv p rep rint arXiv:1506.09032 Information Processin g, 2002. ICONIP'02. Proceed in gs of (2015). the 9th International Conference on, vol. 4, p p . 1905-1909. [6] M aciel, Cristiano, Patricia Cristiane d e Souza, José IEEE, 2002. Viterbo, Fabiana Freitas M endes, and Amal El Fallah [20] Khalili, M ohsen, Xiaodon g Zhan g, Yon gcan Cao, and Seghrouchn i. "A M ulti-agent Architecture to Supp ort Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 9, 39-48 48 A Survey on Fault Tolerant M ulti Agent Sy stem Jonathan A. M use. "Distributed Adap tive Fault -Tolerant Consensus Control of M ulti-Agent Sy stems with Actuator Faults". Authors’ Profiles Yasir Arfat is currently student of M .S in comp uter science at Kin g Abdulaziz University , Jeddah, Saudi Arabia. He has comp leted his bachelor degr ee in Software Engineer in g with Distinction (Gold M edal) from the University of Azad Jammu & Kashmir, Pakistan in 2011. His research interests include network security , software security , software agent systems and big data, high p erforman ce comp uting. Fathy E. Eassa receiv ed his B. Sc degree in electronics and electrical commu nication engineer in g from Cairo University , Egy pt in 1978 and the M .Sc. 
degr ee in co mp uters and Sy stems engineer in g from Al Azhar University , Cairo, Egyp t in 1984, and Ph.D degr ee in comp uters and systems engineer in g from Al-Azhar University , Cairo, Egy pt with joint sup ervision with University of Colorado, U.S.A, in 1989. He is a full p rofessor with comp uter Scien ce d ep t, Faculty of Comp uting and Information technology , King Abdul Aziz University , Saudi Arabia. His research interests include agent based software engineerin g, cloud co mp uting, software engineer in g, big data, distributed systems, exascale sy stem testing. How to cite this paper: Yasir Arfat, Fathy Elbouraey Eassa, "A Survey on Fault Tolerant M ulti Agent Sy stem", International Journal of Information Technolo gy and Comp uter Science (IJITCS), Vol.8, No.9, pp .39-48, 2016. DOI: 10.5815/ijitcs.2016.09.06 Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 9, 39-48 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png International Journal of Information Technology and Computer Science Unpaywall

A Survey on Fault Tolerant Multi Agent System

International Journal of Information Technology and Computer Science, Sep 8, 2016



Publisher
Unpaywall
ISSN
2074-9007
DOI
10.5815/ijitcs.2016.09.06

Abstract

I.J. Information Technology and Computer Science, 2016, 9, 39-48 Publis hed Online September 2016 in MECS (http://www.mecs -pres s .org/) DOI: 10.5815/ijitcs .2016. 09.06 Yas ir Arfat Department of Computer Science King Abdul Aziz Univers ity Jeddah, Saudi Arabia. E-mail: yas [email protected] m Fathy Elbourae y Eas s a Department of Computer Science King Abdul Aziz Univers ity Jeddah, Saudi Arabia. E-mail: [email protected] Abstract—A mult i-agent s ys tem (MAS) is formed by a be contained in, in order to mitigate failure. number of agents connected together to achieve the des ired goals s pecified by the des ign. Us ually in a mu lti  Agents need to collaborate with each other to agent s ys tem, agents work on behalf of a us er to avoid failure accomplis h given goals . In MAS co -ordination,  Information s ent over MAS s hould be trans parent co-operation, negotiation and communication are during trans mis s ion important as pects to achieve fault tolerance in MAS. The  Availability of other agents in MAS s hould be mu lti-agent sys tem is like ly to fail in a dis tributed ens ured when an agent fails environment and as an outcome o f s uch, the res ources for  The agent’s system s hould have the ability to take MAS may not be available due to the failure of an agent, decis ions bas ed on knowledge mach ine cras hes , process failure, s oftware fa ilure,  Agents need to communicate in a s ecure manner communicat ion fa ilure and/or hardware fa ilure. Therefore, and data s hould be protected in cas e of failure many res earchers have propos ed fault tolerance  Agents s hould have autonomy in cas e of fa ilure. approaches to overcome the failure in MAS. So we have They s hould be able to provide s ervices without s urveyed these approaches in this pape r, whereby our affecting other agents contribution is threefold. Firs tly, we have provided  An agent s hould have s calability and comple xity s o taxono my of faults and techniques in MAS. 
Secondly, we that it can deal with any s ize of agent without have provided a qualitative comparis on of exis ting fault affecting performance tolerance approaches . Thirdly, we have provided an  An agent s hould be robus t enough to confront any evaluation of e xis ting fau lt tole rance techniques . Res ults failure i.e. proces s failure, cras hing failure etc. s how that mos t of the exis ting s chemes are not very whereby it can provide s ervices without any efficient, due to various reasons like high computation interruption. costs , cos tly replication and large co mmunicat ion  Agents s hould have the ability to adapt to any overheads . condition, in any environment, in cas e of failure. Index Terms—Mu lti Agent Sys tem, Fau lt Tolerance, In MAS, there are s everal factors that decreas e Agents , Adaptive Replication, Redundancy . performance and reliability. One of thes e is a failure of the s ys tem. If there is any fault in the sys tem, it will s top working and caus e a delay in achiev ing the required goals . I. INT RODUCT ION In order to increas e the re liab ility of MAS, t he s ys tem A Multi-agent s ys tem (MAS) is compos ed of multip le s hould be fault tolerant. If there is any fault in the s ys tem interacting intelligent agents , with in a g iven environment. it s hould have the ability to mas k the fa ilure in order to Thes e agents co-operate to s olve difficult ies that are continue providing the necess ary services without any beyond the capability or knowledge of each s ingle delay ref. Abbas et al.[5] . If a s ys tem is fau lt tolerant then problem s olver. There are s everal key characteris tics of it will als o increas e the performance of the s ys tem. agents , such as adaptation, s calability, re -us ability, local In this res earch paper, we c las s ify the fault tolerant view, autonomy, res pons iveness and distribution. 
In order mu lti-agent s ys tem (FTMAS) into different categories , on to achieve the necess ary goals , agents are required to be the bas is of recovery techniques and by pres enting the able to communicate with many other agents in the taxono my of both faults and techniques . Als o, we have environment ref. Byrs ki et al.[1] There a re various provided a qualitative co mparis on of the recent fault applications of MAS like aircraft maintenance, recovery of MAS. We dis covered that res earchers are environment monitoring, military demin ing, s urveillance, applying replicat ion and non replication bas ed fault internet agent, health care, s pacecraft control and recovery approaches for FTMAS. We als o exa mined indus trial monitoring [2][3][4]. e xis ting techniques on the bas is of their attributes s uch as , In this paper, we focus ed on the fault tolerance of MAS. characteris tics , failures , types of agents , environment and There are various fault tolerance needs that MAS should replication protocol. Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 9, 39-48 40 A Survey on Fault Tolerant M ulti Agent Sy stem The res t of the paper is organized as follows . Section II III. LIT ERAT URE REVIEW contains a brief background of FTMAS, Section III This s ection pres ents an overview of the briefly des cribes a s umma ry of e xis ting literature. Sect ion s tate-of-the-art of fault tolerance techniques in the multi IV pres ents a taxonomy of FTMAS. In s ection V, we agent s ys tem (FTMAS). Th is overview co mpris es of have provided a comparis on of FTMAS with other dis cus s ion about the ass umptions , objectives , techniques . Section VI s ummarizes the advantages and methodologies and key approaches pres ent in exis ting dis advantages of exis ting techniques in FTMAS. In works . Bas ed on this s ection, we will pres ent a ta xonomy s ection VII, we conducted an evaluation. 
In s ection VIII, and comparis on in the following s ectio ns . we have dis cuss ed future challenges and iss ues . Finally, s ection IX concludes the paper. A. Towards FTMAS Architecture Ku mar et al. [11] des cribe that there we re many poss ibilit ies that failure could happen at any time in MAS II. BACKGROUND of any dis tributed s ys tem. Many agents were not Multi agent s ys tem compris es of various agents , available due to process failure, e xceptions and entities etc. A s ingle agent is capable of carrying out breakdown of co mmunication. There we re many faults independent actions to achieve the delegated goals . This that exis ted ranging fro m databas e recovery, TS agent s ystem may work in different environ ments monitoring, res ource manager and fau lt tole rance according to the tas ks s et to it, res pons ibilit ies as s igned to dis tributive s ys tems up to application s erver. There were it by the s ys tem or progra m ins ide the agent s ystem. many is s ues in thes e techniques , s uch as us ing replicat ion Figure1 s hows the bas ic operation and work performed s chemes as a critica l s ys tem for monitoring. Ho wever, by the agent s ys tem when it increas es the reliability of the s ystem it duplicates the data and s ervices . Moreover, many s ys tems s aved the application s tate but it als o created many proble ms during recovery. To overco me this tradit ional fault tole rance technique they propos ed Adaptive Agent Architecture (AAA) for the mu lti-agent system (MAS). Whereby, AAA overcomes a proble m like a broker fa ilure without incurring undue overheads . There may be more than one of many s uch brokers in the large mult i-agent s ys tem. In Fig.1. A mult i agent syst em the cas e of sudden unavailability of a broker in AAA, they us ed the team bas ed approach for automatic A s pecific agent will ca rry out s pecific goals in the recovery of MAS. 
Furthermo re for the recovery, they mu lti agent s ys tem, for e xa mple : Environment: Patient & ass umed three different recovery s chemes , na me ly log ical hospital, Goal: Hea lthy patients , Actions: Tes ts and characterizat ion, recovery s cheme and recovery s cenario. treatments , Percepts: Patient s ympto ms , Agent Type: In thes e ass umptions , they des cribed different s teps , Medical diagnos is . An agent has a range of characteris tics theorems and characterizations of performance. The ir i.e. reactiveness , reliability, s calability, autonomy, res ults show that autonomous agents can ma ke a robustness , intelligence, pers is tency, goal-orientation, multi-agent s ys tem more robus t. adaptability and s ociability. Thes e are the bas ic B. Towards Adaptive FTDMAS characteris tics of an agent that it contains . In multi agent systems (MAS), s everal agents are Marin et a l. [12] have a ls o propos ed an adapt ive working together to achieve tas k-oriented goals on behalf architecture for the mu lti-agent s ys tem (MAS). It deals of the us er or human ref. Macie l et al.[6] Succes s ful with e xis ting proble ms in MAS us ing new methodologies . interaction is required a mong agents in MAS to negotiate, MAS as a dis tributed s ys tem may by its very nature coordinate and cooperate with each agent in the accrue failure at any time in the s ystem. Moreover, due to environment ref. Ge rra rd et al.[7] The bes t exa mp les of it be ing a d is tributed s ys tem, computations of dynamic MAS are Internet agents and as us ed in Spacecraft control app lic at ions we re often chang ed, du ring e xe cut ion. ref. Li et a l.[8] Nowadays , res earchers and developers Neve rthe les s , they tried to ma ke it mo re fle xib le to alike are us ing the agent in the dis tributed environment, overcome the flaws of the conventional s ys tem. 
such as those used as environment agents, which need co-ordination, co-operation and negotiation. These are the basic issues that MAS faces in each environment (Davoodi et al. [9]). The failure rate increases when there is less co-ordination, co-operation and communication among the agents, and this leads to the failure of the system. Hence, these types of failures are subject to the host, machine and exception set (Wang et al. [10]).

There are several fault tolerance techniques that have been proposed to mask faults in MAS. Each technique differs in its ability to mask failure in MAS.

On the proposed architecture, we can either replicate or unreplicate the software element on the spot. The advantage of this approach is that we can change replication tactics in a matter of a few seconds. The main objective of the selected architecture is to make fault tolerance more efficient for MAS, using selective replication techniques. An outcome of this approach is an architecture that is suitable for dynamic fault tolerance in applications. They used the selective replication scheme because many problems existed with earlier approaches for dynamic applications. Moreover, they also introduced a framework, namely DARX (dynamic adaptive replication extension), which uses both active and passive replication and is specially designed for distributed applications. It has many advantages, e.g. replicas can be added or removed dynamically, and each replication group has atomic and ordered multicast.

To manage the failure of the system, there is a replication manager associated with each group, which performs the following functions: 1) maintenance of information within the group, 2) suspension and resumption activity, 3) diffusion of a message and 4) switching the replication strategy. The benefits of performing these functions are that the replication manager can recover from a failure quickly and that, when one group fails, the other groups have all the information needed in order to activate a new replica. A simulation display ensures that minimum energy is utilized between nodes to carry out the task, as a single copy of data will be sent, and it also improves the probability of delivery.

C. Towards Automatic FTMAS

Almeida et al. [13] presented an automated fault tolerance (FT) MAS scenario. They described that there are many possibilities whereby an exception or failure can occur at any time in the system. These failures occur when recovery and fault tolerance approaches are defined at the design level. Indeed, it is very difficult to decide at the design level when and where to apply the FT approach (i.e. replication). Conventional approaches break down when it comes to dynamic systems (i.e. MAS). These applications could be ambient intelligence systems, systems related to e-commerce, crisis management systems or the air traffic control system. According to the situation and the nature of interdependencies in these applications, an agent can change its role during the computation stage. Therefore, to overcome all difficulties and to make the FT management automatic and dynamic, they considered a self-adaptation FT approach.

In MAS, multiple errors may occur, but only crash type failures are considered. Thus, to mask these types of failures, replication is considered an ideal approach. There are various types of replication approaches, from static to non-static and explicit replication. Moreover, for replication they presented dynamic and automatic control of replication. Hence, they chose the DARX framework, which has dynamic distributed replication features. Using this system, they estimated criticality within the system from different types of information, e.g. messages, plans and roles.

D. Decentralized Architecture for FTMAS

Khan et al. [14] presented a fault tolerant decentralized architecture for the multi-agent system. Most applications have a lack of fault tolerance. There is an expectation that usage of MAS in different distributed applications will increase. However, there are many faults existing within the agent platform, causing a multitude of problems. To overcome all these problems they introduced a decentralized architecture as an alternative to the centralized architecture of the agent platform (AP).

Figure 2 shows the working of the decentralized architecture, namely the Virtual Agent Cluster (VAC). When a single agent platform is deployed it includes all machines. A similarity exists between the virtual agent cluster and cluster computing, where the front processor distributes the load among the machines. The Agent Communication Language (ACL) also acts as a front processor. It is used as an interface to communicate with another agent in the system. The communication between machines is bi-directional through IP addresses, whereby in each machine there is an Agent Management System (AMS). It is organized in such a manner that failure of one machine does not affect the others. There are several characteristics of this architecture, which include fault tolerance and recovery, autonomy, application layering architecture (inter VAC and intra VAC) and load balancing. These characteristics make this decentralized architecture more flexible in scale and more fault tolerant. Moreover, fault tolerance is the most substantial advantage that can be achieved through this decentralized architecture.

E. Plan-Based Replication for FTMAS

Almeida et al. [15] presented plan-based replication for the fault tolerant multi-agent system. In their proposed scheme, they used this method for stipulating the dependability of MAS through replication. This method is different from the others cited above: here they focus on predictive and adaptive replication, whereby the critical agents are replicated to overcome failures. Whereas some applications use static replication, here they use dynamic replication. The latter has advantages over static replication, e.g. re-allocation of tasks, changing the role of an agent and flexible organization. Moreover, it is very important to replicate an agent through dynamic and automatic means. Here, they are more focused on building reliable MAS. Hence, a plan based fault tolerance method is promising for prevention because it predicts upcoming behavioral patterns of an agent. To make MAS more reliable, the original predictive approach calculates the criticality of the agent dynamically. Then this criticality of the agent is used for replication, in a manner that increases dependability on the basis of the resources that are available. They also validated their approach on the DARX framework and DIMA. In this strategy, an agent is implemented as a DIMA agent and DARX is used to obtain replication capabilities.

F. Adaptive and Automated FTMAS

Singh et al. [16] have proposed this framework for a critical agent in the multi-agent system (MAS), based on the cardinality of an agent. Sometimes replication can become very costly due to the complexity of the system; moreover, dynamic replication is also a need of all agents in fault tolerant MAS. Hence, to overcome these issues they proposed this particular framework. They mixed two techniques, namely active and passive replication. Thereby, critical agents are actively replicated and receive more attention relative to other agents.
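The hybrid idea described above can be illustrated with a minimal sketch. It is not the framework of Singh et al. [16]: it assumes that each agent carries a numeric criticality score and that a fixed threshold separates critical from non-critical agents. The score range, the threshold value and every name in the code are hypothetical, since the survey does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    ACTIVE = "active"    # replica runs alongside the agent and is ready at once
    PASSIVE = "passive"  # replica only receives periodic state updates


@dataclass
class Agent:
    name: str
    criticality: float  # hypothetical score in [0, 1]; the survey gives no formula


def choose_strategy(agent: Agent, threshold: float = 0.5) -> Strategy:
    """Assumed rule: agents at or above the threshold count as critical."""
    if agent.criticality >= threshold:
        return Strategy.ACTIVE
    return Strategy.PASSIVE


def plan_replication(agents, threshold: float = 0.5):
    """Map every agent name to the replication strategy chosen for it."""
    return {a.name: choose_strategy(a, threshold) for a in agents}


# Illustrative agents; names and scores are made up for the example.
agents = [Agent("broker", 0.9), Agent("logger", 0.2), Agent("planner", 0.5)]
plan = plan_replication(agents)
```

Keeping the threshold as a parameter reflects the trade-off the text mentions: raising it leaves fewer agents with the costlier active replication, while lowering it protects more agents at a higher cost.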
The benefit of this approach is to reduce the complexity and cost of the system, and to achieve optimal utilization and, more importantly, optimal fault tolerance. The proposed framework is hybrid, having the automated and adaptive characteristics of fault tolerance. This framework has three different components: 1) Replica Store, 2) Fault Management Agent (FMA) and 3) Event Monitoring Agent (EMA).

In the replica store, the central fault management unit (CFMU) divides it into two parts, an active and a passive replica store. A passive replica is used to update the agent periodically. For critical agents, active replicas are used so that a working replica is always available. All fault management control is done by the FMA. It also retains information about the replication, whether it is active or passive replication. The last component is the EMA, which is responsible for keeping track of the information related to crashes and for putting in the substitute replica for the crashed agent. Results show that actively replicating fifty percent of the agents can remove the complexity of the system. Moreover, with the proposed system, the scalability of fault tolerant multi-agent systems can be improved.

G. Hybrid Based Approach for FTMAS

Koppensteiner et al. [17] have proposed a hybrid fault tolerant multi-agent system using the heartbeat mechanism. They used this mechanism to detect failure in MAS. They found three different types of failures here, namely: 1) System disturbance, 2) Physical Component Failure and 3) Software Entity Failure. To recover from physical component failure, i.e. a failure in tangible hardware or a failure in a block based application controlling function, they introduced the heartbeat mechanism. Using the heartbeat between the LLC (Low-Level Control layer) and the HLC (High-Level Control layer), they minimized messages to maintain the system's stability. This approach also implements the heartbeat method separately from the distribution of messages on the system. If there is a fault inside the system, they only communicate messages if necessary. In a situation of complete failure in the system, both LLC and HLC will be used to detect which agent has failed. Utilizing the heartbeat method, they will try to fix it.

H. Choice of Sampling Rates in FTMAS

Bora et al. [18] proposed fault tolerance in a multi-agent system based on the sampling period. To increase the fault tolerance in distributed and dynamic systems, adaptive replication techniques were very useful. But there is one disadvantage of this approach: it increases the cost due to adaptive replication. To overcome this drawback, a sampling period was introduced to minimize the cost. This technique monitors critical agents and chooses the appropriate replication for each agent based on its criticality. They applied this technique on an abstract architecture for adaptive replication. This architecture consists of a replication manager, which is responsible for providing active and passive replication among different replicas. It also monitors and handles faults inside the replicas. The main modules contained in this architecture are observation and feedback control. The replication manager utilizes these features. The observation module collects information about the system and passes this information to feedback control. All this information is processed by feedback control, which then decides which agent is most critical, having calculated the agents' relative critical values. Then it applies the adaptive replication policy based on the criticality of the system. This architecture covers the crash type of failure in multi-agent systems. In this research paper, they also assume that the sampling period will maximize accuracy, reduce the cost of replication and improve the response time of the system.

I. A Decision-making based approach for Fault Handling in Multi-Agent System

Mirian, Maryam S. et al. [19] introduced a new decision-based technique for fault handling in the multi-agent system. They described the multi-agent system as being like a distributed system, where a fault can occur at any time. In the paper, they focused on the faulty agent and its recovery in the multi-agent system. In the presented technique, a faulty agent requests help from other agents, or its team agents come to know that this agent is faulty and needs help, so several help requests may exist. Which help request is appropriate and which will be effective are decided at the decision-making phase. At this stage, they also use the best fit, first come first serve and shortest job first algorithms to make the decision for the help request. For this methodology there is no central agent; all agents are decentralized. Each agent has knowledge about the environment and the existing agents in the environment. They all also have the ability to perform the tasks of other agents. If an agent fails in the system, another agent can help based on the decision-making phase.

J. Distributed Adaptive Fault-Tolerant Consensus Control of Multi-Agent System with Actuator Faults

Khalili et al. [20] presented a distributed FT consensus control of MAS with actuator faults. This FTMAS is based on three different assumptions. In this distributed system an FT control component was developed to perform a two-step process between the agents. The first step diagnoses the fault in the MAS, while the second provides an opportunity to recover in an adaptive manner. These assumptions are constructed using mathematical equations and, in particular, vectors. Using the assumptions, the system's stability can be checked with the closed-loop mechanism. The main objective of this system is to develop an algorithm that diagnoses and recovers from faults. A unique feature of this algorithm is that it takes an information-neighboring algorithm and applies its actions.

IV. TAXONOMY OF FAULTS IN MAS

In this section, we have presented a taxonomy of faults and their related techniques. First of all, we divided the faults into two different categories, namely fail silent and fail uncontrolled. These are the faults we found in the different papers that we surveyed. Fail silent faults are those which belong to the crash type of failure. On the other hand, fail uncontrolled faults are those failures where any type of fault or failure can occur. The faults are then further subdivided into different types as given in Figure 2.

Fig.2. Taxonomy of faults in MAS

We also worked out a taxonomy of the techniques that researchers are applying for fault tolerance in the multi-agent system. We have classified these approaches into three different categories. These are replication based, non-replication based and hybrid approaches. Then we further subdivided these techniques; for example, the replication based approach has active replication, passive replication and adaptive replication. Moreover, we also subdivided the non-replication based approaches into two different types, these being architecture oriented and mathematical/algorithmic, which are given in Figure 3. These are the existing techniques that are used for fault tolerance; if there is any fault in the system, using these techniques we can avoid failure of the whole system. These approaches have their own advantages and disadvantages, which vary according to the environment where these methods are being applied.

Fig.3. Taxonomy of techniques against the faults in MAS

V. QUALITATIVE COMPARISON OF FTMAS

In this section, we have provided a qualitative comparison of the existing fault tolerance approaches of MAS as given in Table 1. For this purpose we have used the following parameters: 1) Agent type, 2) Fault tolerance technique, 3) Objectives, 4) Language, 5) Type of failure, 6) Replication protocol, 7) Characteristics and 8) Environment. According to this table, Kumar et al. [11] have adopted the object replication based fault tolerance approach for MAS, which has characteristics like autonomy and local view, whereby the main objective of this approach is to achieve a faster fault recovery from broker process failure. Moreover, Marin et al. [12] and Almeida et al. [15] provide the dynamic replication approach for the fault tolerant multi-agent system, having the objective that the agent should execute the goal. Furthermore, this technique covers machine and host failures, network failure and distributed agent failures. They are also using different types of agents, namely selective agent and critical agent. Moreover, both are using the Knowledge Query and Manipulation Language (KQML) for communication among agents.

Table 1. Qualitative Comparison

| Research Paper | Technique | Characteristics | Type of Failure | Replication Protocols | Objectives | Agent Type | Environment |
|---|---|---|---|---|---|---|---|
| Kumar et al. [11] | Adaptive Agent Architecture | Autonomy, local views | i) Machine crashes. ii) End of broker process. iii) Network breakdown | Object group replication | i) To achieve warm backups. ii) Object group and virtual synchrony | Complex agents | Virtual environment |
| Marin et al. [12] | Bypass dynamic replication | Autonomy, run time replication change | i) Host and network failures. ii) Failure of an agent in distributed MA applications | Dynamic replication | i) Efficient FT for MAS through selective agent replication. ii) Appropriate architecture for dynamic FT | Selective agent | Continuous environment |
| Almeida et al. [13] | Self adaptation of fault tolerance | Autonomous, dynamic, automatic | i) Crash type of failure caused by internal factors (hardware issue and OS crashes) or external factors (malicious attacks, environment tragedy and power failure) | Adaptive replication | i) Make the management of fault tolerance autonomous. ii) To make this fault tolerance management dynamic and automatic | Critical agent | Discrete environment |
| Khan et al. [14] | Virtual Agent Cluster | Faster recovery and fault tolerant, autonomous, architecture for application layering (intraVAC and interVAC), balancing the load | i) Centralized AMS lack of fault tolerance. ii) Centralized system becomes a bottleneck under heavy load. iii) Utilization of information service and provision of QoS (timeliness and reliability) | Active replication | i) To embrace the peer-to-peer computing paradigm. ii) Eliminate the limitations present in the existing Agent Platform (AP) | Active agent | Virtual environment |
| Almeida et al. [15] | Replication of critical agents | Predictive, automatic and adaptive | i) An agent failure or a machine failure. ii) Host failure and process failure based on an adaptive failure indicators hierarchy | Active and passive replication (dynamically applied) | i) Provide some action, based on a calculated value of the system, that will be executed by the agent in the near future in case of failure of an agent. ii) Each agent should execute each plan (action) in order to achieve the goal | Critical agent | Continuous environment |
| Singh et al. [16] | Automatic and adaptive fault recovery, central fault management | Local views | i) Crash or failure of an agent. ii) Critical agent failure | Adaptive replication | i) Using both active and passive replication, make the MAS more scalable, reliable and fault tolerant. ii) To reduce cost and complexity of the system | Critical agent | Transient environment |
| Koppensteiner et al. [17] | Heartbeat mechanism (failure detection), supervisor agent approach (system failure absorption, fault recovery) | Autonomous, local views | i) Physical components failure (breakdown of whole resource, temporary failure). ii) Software entity failure. iii) Partial agent failures | | i) To increase the stability of the system and to shorten the reaction time. ii) To enhance the fault tolerance of a complex system | Complex agent | Discrete environment |
| Khalili et al. [20] | Distributed adaptive fault-tolerant | Autonomous, adaptive | i) Crash failure. ii) Failure of agents | Adaptive replication | To reduce the cost of replication in MAS and the computation overhead | Critical agent | Discrete environment |

Additionally, Almeida et al. [13] and Khan et al. [14] presented a replication based and a non-replication based approach respectively. To overcome the drawback of the centralized MAS system, Khan et al. [14] presented a decentralized architecture, since the centralized architecture was thought to develop bottlenecks under a heavy load. They have used the Agent Communication Language (ACL) for communication between agents by using active agents. The main objectives of this technique are i) to embrace a peer-to-peer computing paradigm and ii) to eliminate the limitations present in the existing Agent Platform (AP).
On the other hand, Almeida et al. [13] have provided KQML for the agent. They also used a discrete environment where the agents can perform only limited actions. The goal of this approach is to make the management of fault tolerance autonomous and also to make this fault tolerance management dynamic and automatic.

Singh, Aarti et al. [16] have provided automatic and adaptive fault recovery for the multi-agent system. Characteristics of this approach are a local view and autonomy. The main objectives of this approach are i) to use both active and passive replication to make MAS more scalable, reliable and fault tolerant and ii) to reduce the cost and complexity of the system. They are also using the agent in a transient environment. The technique that they provided covers the following types of failures: i) crash or failure of an agent and ii) critical agent failure.

Koppensteiner et al. [17] and Bora et al. [18] have presented the heartbeat mechanism and the choice of the sampling rate for fault recovery respectively. These approaches cover physical components failure, partial agent failure, crash failure and failure of agents. This comparison gives us a clear idea of the different approaches for fault tolerant multi-agent systems and of how to deal with these failures using an appropriate approach, which is efficient, has less overhead, and is easy to use and less costly.

Table 2. Pros and Cons of Existing Techniques

| Research Paper | Technique | Pros | Cons |
|---|---|---|---|
| Kumar et al. [11] | Adaptive Agent Architecture | i) Effects of recovery on response time. ii) Effects of transition on response time. iii) Less overhead from using teamwork. | i) More focused on broker failure tolerance. ii) Less focused on individual agents. iii) Requires extra computing for the management of the brokerage layer. |
| Marin et al. [12] | Bypass dynamic replication | i) Fast way to handle faults and recovery. ii) Improves reliability and fault tolerance. iii) Improves accessibility. | i) Replication is very costly. ii) More computations are required. |
| Almeida et al. [13] | Self adaptation of fault tolerance (dynamic adaptation of replication strategies) | i) It provides dynamic replication; we can use both active replication and passive replication. It provides better recovery. | i) When we use both active replication and passive replication in MAS, such an environment proves very costly. |
| Khan et al. [14] | Virtual Agent Cluster | i) Decentralized MAS is less faulty as compared to centralized. ii) More reliable. iii) Faster as compared to centralized. | i) It falls short of addressing the heterogeneity issue. ii) Cost implication of recovery in a multi organizational context. iii) More overheads. |
| Almeida et al. [15] | Replication of critical agents | i) It provides replication based on the criticality of the agent, applying replication to the agents that are more critical. | i) It is very hard to find out which agent is more critical in a multi-agent system for fault tolerance. |
| Singh et al. [16] | Automatic and adaptive fault recovery, central fault management | i) Using this approach, it provides automatic recovery for faults when they occur and adaptive fault recovery. | i) This approach has high overheads. ii) The reliability of this technique is lower as compared to the other techniques listed in this table. |
| Koppensteiner et al. [17] | Heartbeat mechanism (failure detection) and supervisor agent approach (system failure absorption, fault recovery) | i) It provides faster fault recovery. ii) Messages are sent at a specified period of time, through which it can easily find out which agent has failed, thus providing quick recovery. | i) Computation cost is high as it involves a lot of work communicating with each agent. ii) Reliability is very low. iii) Very slow technique that causes more overheads due to sending regular messages after a specified period of time. |
| Bora et al. [18] | Adaptive replication based on sampling rates | Using the sampling rate, replication cost will decrease. Response time of switching from active to passive and vice versa will decrease. Fault tolerance and reliability can be achieved easily using the sampling rate. Adaptive replication improves the response time of the system. | i) Adaptive replication is very costly. ii) Overhead will be very high. |

VI. PROS AND CONS OF EXISTING TECHNIQUES

Here in this section we have described the different approaches and their pros and cons using different parameters as given above [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. In this section we find out the advantages and disadvantages of the fault tolerance approaches that we surveyed in the literature review. In Table 2 we can see that there are some techniques that provide better fault tolerance recovery for any failure of the multi-agent system. This shows us how effective the techniques are for fault recovery. When researchers proposed a technique for fault tolerance, they ignored other aspects of fault tolerance, as the technique can cause high overheads and perform some expensive computations. These techniques also decreased the reliability of the system and reduced the performance of MAS. Moreover, Table 2 shows the overheads of fault recovery, reliability, improvement in performance and computational cost of these approaches.

VII. DISCUSSION OF EVALUATIONS AND COMPARISON

In this section, we evaluated different schemes and assessed them for fault tolerance recovery. We begin with Kumar et al. [11], who proposed an adaptive agent architecture to mask failure in the multi-agent system. This has several characteristics, namely autonomy, local view and mobility. Using this approach they have covered different types of failure, namely machine crashes, end of broker process and network breakdown. Moreover, they applied object group replication for redundancy in MAS. The main objective of this approach is to achieve warm backup, object group and virtual synchrony.

Marin, Olivier et al. [12] have proposed the dynamic replication technique. The characteristics of this approach are autonomy and run time replication changes. Using this technique they mask the failure of host and network, and thus effectively the failure of any agent in the distributed environment. They applied a dynamic replication protocol for redundancy. The main objective of this approach is to provide efficient FT for MAS through selective agent replication.

Almeida et al. [13] have applied a self-adaptation of fault tolerance approach having the characteristics of dynamic and automatic recovery from faults. They tried to mask failures such as crashes caused by internal hardware issues and operating system crashes, or by external malicious attacks, environmental tragedy and power failure. They have applied adaptive replication. The main objective of this technique is to make the management of fault tolerance autonomous and also to make this fault tolerance management dynamic and automatic.

Khan, Abbas et al. [14] have proposed the virtual agent cluster (VAC) with the following characteristics: faster recovery and fault tolerance, autonomy, architecture for application layering into intra VAC and inter VAC, and load balancing. They tried to mask failures such as crashes caused by internal hardware issues and OS crashes, or by external malicious attacks, environmental tragedy and power failure, using active replication. The main objective of this scheme is to embrace a peer-to-peer computing paradigm and eliminate the limitations present in the existing agent platform (AP). For communication of agents with other agents in MAS, they have used the Agent Communication Language (ACL).

Almeida, Aknine et al. [15] have provided replication of critical agents with the features of being predictive, automatic and adaptive. They tried to mask agent or machine failure. Host failure and process failure are based on an adaptive failure indicators hierarchy. Their main objective was that an agent should execute each plan (of action) in order to achieve the overall goal.

Singh et al. [16] chose an automatic and adaptive fault recovery and central fault management technique. They tried to minimize system crashes, agent failures and critical agent failure. Their main objectives are to reduce the cost and complexity of the system.

Meanwhile, Koppensteiner et al. [17] have implemented the heartbeat mechanism for failure detection and a supervisor agent approach for system failure absorption and fault recovery. They applied this technique to overcome physical components failure and breakdown of the whole resource, including temporary failure. The main goal is to increase the stability of the system, to shorten the reaction time and, overall, to enhance the fault tolerance of a complex system.

Khalili, Mohsen et al. [20] have proposed a sampling rate in FTMAS, which tries to overcome crash failures or failures of an agent in the MAS environment, whereby they applied adaptive replication for redundancy to mask the failure in FTMAS. This technique has performed relatively better than other techniques but it also has large FT overheads for recovery.

Moreover, we have seen that some approaches perform better in fault recovery as compared to other techniques. Some techniques have higher overheads and perform costly computations for fault tolerance. There should be an appropriate technique that enhances performance on fault recovery, rather than making it suffer.

VIII. FUTURE CHALLENGES AND ISSUES

There has been advancement in multi-agent system technology and its usage in our daily life is increasing. Even though a lot of work has been done on fault tolerance in multi-agent systems (FTMAS), the issue regarding failure recovery of MAS has still not been overcome. As MAS gets further distributed, failure can occur at any time.

There are various challenges in FTMAS implementation. It is still a complex task.

• From the literature survey, we found that most of the existing fault tolerance approaches are not providing basic fault recovery features in MAS like reliability, scalability, adaptability and robustness.
• A challenging issue in designing a fault tolerance architecture for a multi-agent system (MAS) is its distributed nature, which is prone to failure at any time.
• Another major problem is that there is no standard evaluation framework for FTMAS that is needed for comparison purposes. Currently each researcher uses their own criteria for evaluation.
• MAS has a lack of reliability in programming tools and specialized debugging tools. Skills are also needed to shift from an analysis and design phase to coding, and there are issues in understanding the environment and methodology.

IX. CONCLUSION

Currently, the multi-agent system is being used in different applications in a distributed environment. In MAS, as there are many agents, there are several challenges that can occur, for example co-ordination, co-operation, negotiation and communication in a distributed environment. When one agent does not co-operate due to a fault, other components of MAS also do not provide their services. Then failures like machine crashes, process failure, software failure, communication failure and hardware failure occur. Therefore, in this research paper, we have surveyed the many techniques for fault tolerance in a multi-agent system so that failures can be overcome. We have presented existing techniques, which are very effective for fault tolerance, by providing related work and then classifying these approaches into different categories. We also categorized the failures that occur in the multi-agent system. Furthermore, we have also provided a qualitative comparison of existing fault tolerance approaches. In this comparison, we set out different parameters so that we can identify by comparison which technique is better for masking a fault in MAS. We have provided the pros and cons of existing fault tolerance techniques. This shows that most of the existing schemes are not efficient due to various reasons like high computation cost, costly replication and large communication overheads. We have found out that when researchers proposed a technique for fault tolerance, they ignored its overheads, which, when applied to MAS, proved very costly. Such a technique provides fault tolerance but, on the other hand, it also degrades the performance and reliability of the system. There should be an appropriate technique which provides fault tolerance with fewer overheads and is hence less expensive in computation.

REFERENCES

[1] Byrski, Aleksander, Rafał Dreżewski, Leszek Siwik, and Marek Kisiel-Dorohinicki. "Evolutionary multi-agent systems." The Knowledge Engineering Review 30, no. 02 (2015): 171-186.
[2] Eddy, Foo, H. B. Gooi, and S. X. Chen. "Multi-Agent System for Distributed Management of Microgrids." Power Systems, IEEE Transactions on 30, no. 1 (2015): 24-34.
[3] Sajja, Priti Srinivas. "Automatic Generation of Agents using Reusable Soft Computing Code Libraries to develop Multi Agent System for Healthcare." International Journal of Information Technology and Computer Science (IJITCS) 7, no. 5 (2015): 48.
[4] Yadav, Sandeep Singh, and Mandeep Singh Yadav. "Development of System for Automated & Secure Generation of Content (ASCGS)." International Journal of Information Technology and Computer Science (IJITCS) 7, no. 11 (2015): 81.
[5] Abbas, Hosny Ahmed, Samir Ibrahim Shaheen, and Mohammed Hussein Amin. "Organization of multi-agent systems: an overview." arXiv preprint arXiv:1506.09032 (2015).
[6] Maciel, Cristiano, Patricia Cristiane de Souza, José Viterbo, Fabiana Freitas Mendes, and Amal El Fallah Seghrouchni. "A Multi-agent Architecture to Support Ubiquitous Applications in Smart Environments." In Agent Technology for Intelligent Mobile Services and Smart Societies, pp. 106-116. Springer Berlin Heidelberg, 2015.
[7] Gerrard, Claire E., John McCall, Christopher Macleod, and George M. Coghill. "Applications and design of cooperative multi-agent ARN-based systems." Soft Computing (2015): 1-14.
[8] Li, Ni, Xiang Li, Yuzhong Shen, Zhuming Bi, and Minghui Sun. "Risk assessment model based on multi-agent systems for complex product design." Information Systems Frontiers 17, no. 2 (2015): 363-385.
[9] Davoodi, Mohammad Reza, Khashayar Khorasani, Heidar Ali Talebi, and Hamid Reza Momeni. "Distributed fault detection and isolation filter design for a network of heterogeneous multiagent systems." Control Systems Technology, IEEE Transactions on 22, no. 3 (2014): 1061-1069.
[10] Wang, Yannan, Yuning Song, and Frank Lewis. "Robust Adaptive Fault-tolerant Control of Multi-agent Systems with Uncertain Non-identical Dynamics and Undetectable Actuation Failures." (2015).
[11] Kumar, Sanjeev, and Philip R. Cohen. "Towards fault-tolerant multi-agent system architecture." Proceedings of the fourth international conference on Autonomous agents. ACM, 2000.
[12] Marin, Olivier, Pierre Sens, Jean-Pierre Briot, and Zahia Guessoum. "Towards adaptive fault tolerance for distributed multi-agent systems." In Proceedings of ERSADS, pp. 195-201. 2001.
[13] Almeida, Alessandro, Jean-Pierre Briot, Samir Aknine, Zahia Guessoum, and Olivier Marin. "Towards autonomic fault-tolerant multi-agent systems." In The 2nd Latin American Autonomic Computing Symposium (LAACS'2007), Petropolis, RJ, Brésil. 2007.
[14] Khan, Zaheer Abbas, Salman Shahid, H. Farooq Ahmad, Arshad Ali, and Hiroki Suguri. "Decentralized architecture for fault tolerant multi agent system." In Autonomous Decentralized Systems, 2005. ISADS 2005. Proceedings, pp. 167-174. IEEE, 2005.
[15] Alessandro de Luna Almeida, Samir Aknine, Jean-Pierre Briot, Jacques Malenfant, Plan-based replication for fault-tolerant multi-agent systems, Proceedings of the 20th international conference on Parallel and distributed processing, p.347-347, April 25-29, 2006, Rhodes Island, Greece.
[16] Singh, Aarti, Dimple Juneja, and A. K. Sharma. "Adaptive and automated fault-tolerance for multi-agent systems." In Computer Science and Automation Engineering (CSAE), 2011 IEEE International Conference on, vol. 1, pp. 53-57. IEEE, 2011.
[17] Koppensteiner, Gottfried, Munir Merdan, Wilfried Lepuschitz, and Ingo Hegny. "Hybrid based approach for fault tolerance in a multi-agent system." In Advanced Intelligent Mechatronics, 2009. AIM 2009. IEEE/ASME International Conference on, pp. 679-684. IEEE, 2009.
[18] Bora, Sebnem, and Oguz Dikenelli. "On the choice of sampling rates in a fault-tolerant multi-agent system." In 2012 International Symposium on Innovations in Intelligent Systems and Applications. 2012.
[19] Mirian, Maryam S., Majid Nili Ahmadabadi, and Zainalabedin Navabi. "A decision-making based approach for fault-handling in multi-agent systems." In Neural Information Processing, 2002. ICONIP'02. Proceedings of the 9th International Conference on, vol. 4, pp. 1905-1909. IEEE, 2002.
[20] Khalili, Mohsen, Xiaodong Zhang, Yongcan Cao, and Jonathan A. Muse. "Distributed Adaptive Fault-Tolerant Consensus Control of Multi-Agent Systems with Actuator Faults".

Authors' Profiles

Yasir Arfat is currently a student of M.S. in computer science at King Abdulaziz University, Jeddah, Saudi Arabia. He completed his bachelor degree in Software Engineering with Distinction (Gold Medal) at the University of Azad Jammu & Kashmir, Pakistan in 2011. His research interests include network security, software security, software agent systems, big data and high performance computing.

Fathy E. Eassa received his B.Sc. degree in electronics and electrical communication engineering from Cairo University, Egypt in 1978, the M.Sc. degree in computers and systems engineering from Al-Azhar University, Cairo, Egypt in 1984, and the Ph.D. degree in computers and systems engineering from Al-Azhar University, Cairo, Egypt with joint supervision with the University of Colorado, U.S.A., in 1989. He is a full professor with the Computer Science department, Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia. His research interests include agent based software engineering, cloud computing, software engineering, big data, distributed systems and exascale system testing.

How to cite this paper: Yasir Arfat, Fathy Elbouraey Eassa, "A Survey on Fault Tolerant Multi Agent System", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.9, pp.39-48, 2016.
DOI: 10.5815/ijitcs.2016.09.06 Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 9, 39-48

Journal

International Journal of Information Technology and Computer Science

Published: Sep 8, 2016
