Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System

Title: Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System
Publication Type: Journal Article
Year of Publication: 2008
Authors: Gupta, R, Naik, H, Beckman, PH
Journal: International Journal of High Performance Computing Applications
Date Published: 12/2008
Other Numbers: ANL/MCS-P1756-0510
Abstract

Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study four applications (the Grid-Based Projector-Augmented Wave application, the Car-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application) and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.

URL: http://hpc.sagepub.com/content/early/2010/06/01/1094342010369118.abstract
PDF: http://www.mcs.anl.gov/papers/P1756.pdf