Argonne National Laboratory

Fault Prediction Under the Mircroscope: A Closer Look into HPC Systems

TitleFault Prediction Under the Mircroscope: A Closer Look into HPC Systems
Publication TypeConference Paper
Year of Publication2012
AuthorsGainaru, A, Cappello, F, Snir, M, Kramer, W
Conference NameSC'12
Other NumbersANL/MCS-P3023-0812

A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently current research is focusing on providing fault tolerance strategies that aim to minimize fault's effects on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems.

In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is focused on a detailed analysis of the prediction method, by applying it to two large-scale systems and by investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the prediction's precision and recall impact on current checkpointing strategies and highlight future improvements and directions for research in this field.