Argonne National Laboratory

Monitoring Strategies for Scalable Dynamic Checkpointing

Title: Monitoring Strategies for Scalable Dynamic Checkpointing
Publication Type: Report
Year of Publication: 2016
Authors: Perarnau, S, Bautista-Gomez, L
Report Number: ANL/MCS-P6072-1016

Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Detecting those periods is important in order to adjust the system to the new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so that it can react to those regime changes. Our evaluation at scale examines several aspects of this dynamic checkpointing scheme that are critical to understanding its applicability on production systems and to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience