Argonne National Laboratory

Addressing Failures in Exascale Computing*

Title: Addressing Failures in Exascale Computing*
Publication Type: Journal Article
Year of Publication: 2013
Authors: Snir, M, Wisniewski, RW, Abraham, JA, Adve, SV, Bagchi, S, Balaji, P, Belak, J, Bose, P, Cappello, F, Carlson, B, Chien, AA, Coteus, P, Debardeleben, NA, Diniz, P, Engelmann, C, Erez, M, Fazzari, S, Geist, A, Gupta, R, Johnson, F, Krishnamoorthy, S, Leyffer, S, Liberty, D, Mitra, S, Munson, TS, Schreiber, R, Stearley, J, Hensbergen, EV
Journal: International Journal of High Performance Computing
Other Numbers: ANL/MCS-P5022-0913
Abstract: We present here a report produced by a workshop on “Addressing Failures in Exascale Computing” held in Park City, Utah, August 4–11, 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system; discuss existing knowledge on resilience across the various hardware and software layers of an exascale system; and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia; and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.