Argonne National Laboratory

Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience

Title: Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience
Publication Type: Journal Article
Year of Publication: 2015
Authors: Chien, AA, Balaji, P, Beckman, PH, Dun, N, Fang, A, Fujita, H, Iskra, K, Rubenstein, Z, Zheng, Z, Schreiber, R, Hammond, J, Dinan, J, Laguna, I, Richards, D, Dubey, A, van Straalen, B, Hoemmen, M, Heroux, M, Teranishi, K, Siegel, AR
Journal: Procedia Computer Science
Volume: 51
Pagination: 29-38
Other Numbers: ANL/MCS-P5271-0115
Abstract: Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads <2% are achieved. We conclude that GVR’s interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.
DOI: 10.1016/j.procs.2015.05.187
PDF: http://www.mcs.anl.gov/papers/P5271-0115.pdf