Argonne National Laboratory

Simplifying the Recovery Model of User-Level Failure Mitigation

Title: Simplifying the Recovery Model of User-Level Failure Mitigation
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Bland, W, Raffenetti, K, Balaji, P
Conference Name: EXAMPI14 - at SC'14
Conference Location: New Orleans, LA
Other Numbers: ANL/MCS-P5206-1014
Abstract: As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more widespread, many potential users are concerned about its complexity and the need to rewrite existing codes. In this paper, we present a usage model that is similar to the usage already common in existing codes but that does not require the user to restart the application (thereby incurring the costs of re-entering the batch queue, startup costs, etc.). We use a new implementation of ULFM in MPICH, a popular open source MPI implementation, and demonstrate the ULFM usage using the Monte Carlo Communication Kernel, a proxy-app developed by the Center for Exascale Simulation of Advanced Reactors. Results show that the approach used incurs a level of intrusiveness into the code similar to that of existing checkpoint/restart models, but with less overhead.