|Title||Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience |
|Publication Type||Journal Article |
|Year of Publication||2015 |
|Authors||Chien, AA, Balaji, P, Beckman, PH, Dun, N, Fang, A, Fujita, H, Iskra, K, Rubenstein, Z, Zheng, Z, Schreiber, R, Hammond, J, Dinan, J, Laguna, I, Richards, D, Dubey, A, van Straalen, B, Hoemmen, M, Heroux, M, Teranishi, K, Siegel, AR |
|Journal||Procedia Computer Science |
|Other Numbers||ANL/MCS-P5271-0115 |
|Abstract||Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% of lines of code), localized, and machine-independent, and they require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads are generally below 2%. We conclude that GVR’s interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems. |