Argonne National Laboratory

Navigating the Blue Waters: Online Failure Prediction in the Petascale Era

Title: Navigating the Blue Waters: Online Failure Prediction in the Petascale Era
Publication Type: Report
Year of Publication: 2014
Authors: Gainaru, A; Bouguerra, MS; Cappello, F; Snir, M; Kramer, W
Other Numbers: ANL/MCS-P5219-1014

At the scale of today's large systems, fault tolerance is no longer an option but a necessity. As supercomputers grow, so does the probability that a single component fails within a given time frame. With a system MTBF of less than one day, and predictions that future systems will run only a couple of hours between failures, current fault tolerance strategies face serious limitations. Checkpointing is an area of significant research; however, even when it is implemented satisfactorily, considerable computation time is still lost to frequent application rollbacks. With the growing operating cost of extreme-scale supercomputers like Blue Waters, predicting failures to prevent the loss of computation hours becomes cumbersome and presents a couple of challenges not encountered on smaller systems.

In this paper, we present a novel methodology for truly online failure prediction on the Blue Waters system. We analyze its results and show that some failure types can be predicted with over 60% recall and over 85% precision. We discuss in detail the failures that represent the main bottlenecks and propose possible solutions by investigating different prediction methods. We show to what extent online failure prediction is possible at petascale and what the challenges are in achieving an effective fault prediction mechanism for Blue Waters.