MOLAR: adaptive runtime support for high-end computing operating and runtime systems

Engelmann, Christian; Scott, Stephen L.; Bernholdt, David E.; Gottumukkala, Narasimha R.; Leangsuksun, Chokchai

ACM SIGOPS Operating Systems Review, Volume 40 (2) – Apr 1, 2006

10 pages


Publisher: Association for Computing Machinery
Copyright: Copyright © 2006 by ACM Inc.
ISSN: 0163-5980
DOI: 10.1145/1131322.1131337
