A new look at exascale resilience

July 30, 2014

The emerging era of exascale computing has brought many challenges to light. One of the most troublesome is resilience – the ability to tolerate and gracefully recover from faults. Exascale systems will contain millions of cores running up to a billion threads and hence will be subject to frequent instabilities and errors. The problem is well recognized, and the scientific community has been actively seeking solutions.

Researchers from Argonne National Laboratory and the University of Illinois at Urbana-Champaign recently reviewed the technical progress that has been made in resilience over the past five years. Their paper, published in the inaugural issue of Supercomputing Frontiers and Innovations, also summarizes the research problems still considered critical by the high-performance computing community.

The authors begin with a discussion of the changes in both hardware and software that will make exascale systems more error-prone. The paper then turns to lessons learned from experience with petascale systems such as Blue Waters. For example, current systems lack good error isolation mechanisms, so the failure of any one component in a parallel job causes the entire job to fail. Moreover, each subsystem has its own mechanism for error detection, notification, recovery, and logging; the authors believe that an integrated approach to fault tolerance would be more effective.

On the positive side, the authors note that advances in system software and algorithms have substantially improved the ability to detect and recover from faults. These include novel fault tolerance protocols and better failure prediction techniques that combine data mining with signal analysis and outlier detection. Also noteworthy are advances in checkpoint/restart, which have transformed it from a dead end five years ago into the favored technique for handling so-called fail-stop errors.

Nevertheless, major areas for improvement remain. The authors discuss, for example, the handling of silent errors, the design of programming abstractions for resilience, and the development of a portable, scalable test suite.

Based on research results from almost 100 studies and workshop reports, this paper provides an exceptionally thorough look at what has been learned about exascale resilience and what is still considered critical by the high-performance computing community.

For further information, see

Franck Cappello, Al Geist, William Gropp, Laxmikant (Sanjay) Kale, Bill Kramer, and Marc Snir, Toward exascale resilience: 2014 update, Supercomputing Frontiers and Innovations 1(1) 5-28, 2014. The full paper is available on the web at http://superfri.org/superfri/article/view/14/7.

For an excellent summary, see the article in HPCWire's July 21 issue.