GVR: Global View Resilience

GVR (Global View Resilience) is a new programming approach that builds on a global-view data model (global naming of data, consistency, and distributed layout) to add reliability to globally visible distributed arrays. Global naming of distributed data yields programmability benefits, including simpler expression of algorithms and decoupling of computation from data structure across increasingly complex (irregular, variable, degraded) hardware.

In the GVR programming model, applications can indicate reliability priorities (which parts of their data are more important to protect), which lets them manage reliability overheads. Because the distributed array abstraction is portable, GVR enables application programmers to manage reliability and its overhead in a flexible, portable fashion, drawing on their deep insight into the science and the application code. MCS will research algorithms and a runtime that map and adapt the reliability deployment of the application and system based on these application-specified priorities.
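As a rough illustration of how priorities could steer overhead, the sketch below divides a fixed per-step protection budget among arrays in proportion to an application-assigned priority, then derives how often each array would be snapshotted. It is not GVR's API or its mapping algorithm: `ArraySpec`, `plan_snapshot_intervals`, the integer priority scale, and the proportional-share policy are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArraySpec:
    """Hypothetical description of one distributed array (not a GVR type)."""
    name: str
    priority: int      # assumed scale: larger means more important to protect
    size_bytes: int    # bytes copied by one full snapshot


def plan_snapshot_intervals(arrays: List[ArraySpec],
                            budget_bytes_per_step: int) -> Dict[str, Optional[int]]:
    """Return the number of steps between snapshots for each array.

    Assumed policy: each array receives a share of the budget proportional
    to its priority and is snapshotted as often as that share allows.
    A value of None means the array is never snapshotted.
    """
    if budget_bytes_per_step <= 0:
        raise ValueError("budget_bytes_per_step must be positive")
    total_priority = sum(a.priority for a in arrays if a.priority > 0)
    plan: Dict[str, Optional[int]] = {}
    for a in arrays:
        if a.priority <= 0:
            plan[a.name] = None
            continue
        # interval = ceil(size / share), where share = budget * priority / total
        numerator = a.size_bytes * total_priority
        denominator = budget_bytes_per_step * a.priority
        plan[a.name] = max(1, -(-numerator // denominator))
    return plan
```

Under these assumptions, with a budget of 400 bytes per step, a 600-byte array at priority 3 and a 600-byte array at priority 1 would be snapshotted every 2 and every 6 steps respectively.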

GVR will provide a flexible, efficient, cross-layer error management architecture called “open reliability” that allows applications to describe error detection (checking) and recovery routines and inject them into the GVR stack for efficient implementation. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific-domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected. Because efficiency is critical for exascale reliability, MCS will also explore aggressive hardware co-design, exploiting new opportunities that arise from the many ongoing disruptions in hardware technology and architecture.
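The idea of application-supplied checking and recovery routines can be sketched in a few lines. The example below is a single-process toy, not the GVR stack or its interface: `ProtectedArray`, its methods, and the two callbacks are hypothetical names chosen for illustration. The application provides a domain-specific check (here, that densities are non-negative) and a recovery routine that falls back to the most recent saved version that passes the check.

```python
from typing import Callable, List, Sequence


class ProtectedArray:
    """Toy versioned array with application-injected check and recovery.

    Hypothetical class for illustration only; it is not part of GVR.
    """

    def __init__(self,
                 data: Sequence[float],
                 check: Callable[[List[float]], bool],
                 recover: Callable[[List[float], List[List[float]]], List[float]]):
        self._data = list(data)
        self._versions: List[List[float]] = []
        self._check = check
        self._recover = recover

    def snapshot(self) -> None:
        """Save a copy of the current contents as a new version."""
        self._versions.append(list(self._data))

    def put(self, index: int, value: float) -> None:
        self._data[index] = value

    def get(self, index: int) -> float:
        return self._data[index]

    def validate(self) -> bool:
        """Run the application's check; on failure, apply its recovery routine.

        Returns True if the data passed the check, False if recovery ran.
        """
        if self._check(self._data):
            return True
        self._data = list(self._recover(self._data, self._versions))
        return False


def densities_are_physical(values: List[float]) -> bool:
    # Application knowledge: a density can never be negative.
    return all(v >= 0.0 for v in values)


def restore_last_good(current: List[float],
                      versions: List[List[float]]) -> List[float]:
    # Walk back through saved versions until one passes the check.
    for old in reversed(versions):
        if densities_are_physical(old):
            return old
    raise RuntimeError("no valid version to restore")
```

In this sketch, the check encodes scientific-domain semantics that a generic runtime could not know, which is the kind of cross-layer cooperation the paragraph above describes.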