Fault Tolerance in MPI Programs

Title: Fault Tolerance in MPI Programs
Publication Type: Journal Article
Year of Publication: 2004
Authors: Gropp, WD, Lusk, EL
Journal: International Journal of High Performance Computing Applications
Volume: 18
Issue: 3
Pagination: 363-372
Date Published: 07/2004
Abstract

This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

URL: http://hpc.sagepub.com/content/18/3/363
PDF: http://www.mcs.anl.gov/papers/P1154.pdf