Argonne National Laboratory

Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing

Title: Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Balaprakash, P, Gomez, LABautist, Bouguerra, MS, Wild, SM, Cappello, F, Hovland, PD
Conference Name: Proceedings of the 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS14)
Other Numbers: ANL/MCS-P5138-0514
Abstract

In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, as the number of system components and their interactions increases, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme-scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that the energy consumed by FTI is low and that the tradeoff between run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.

PDF: http://www.mcs.anl.gov/papers/P5138-0514.pdf