MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in the FAST-OS (Forum to Address Scalable Technology for Runtime and Operating Systems) and HECRTF (High-End Computing Revitalization Task Force) activities in two ways. First, it explores the use of advanced monitoring and adaptation to improve application performance and the predictability of system interruptions. Second, it advances computer reliability, availability, and serviceability (RAS) management systems so that they work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.
ACM SIGOPS Operating Systems Review – Association for Computing Machinery
Published: Apr 1, 2006