SEMI-HIERARCHICAL APPROACH FOR RELIABILITY, AVAILABILITY, AND SERVICEABILITY OF CELLULAR SYSTEMS

Ramendra K. Sahoo, Myung Bae*, Jose Moreira
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
*Unix Development Laboratory, IBM Corporation, Poughkeepsie, NY 12601

INTRODUCTION

Cellular architectures offer a path to building very large parallel systems, with thousands of processors, that offer superior price/performance compared to more conventional parallel systems such as the IBM RS/6000 SP and PC clusters. However, these large-scale cellular machines introduce significant system management challenges. In particular, the ability to track and analyze every possible fault condition, whether transient (soft) or permanent (hard) [1], in large cellular machines is a major issue from the systems software, hardware, and architecture points of view. To manage systems of this scale, we propose to design and develop a new semi-hierarchical three-tier approach for system management and control. The essence of this approach is that each compute node in a cellular system is managed by means of a proxy that runs on a service node. The compute nodes are treated as controllable external entities (e.g., devices) attached to this service node. This approach has the following benefits. First, the management is non-intrusive to the applications running on the