Invited Talk

Speaker: Ganesh Gopalakrishnan

Title: System Resilience: Amplify Failures, Detect, or Both?


System resilience research is in a depressing state. Those who create error detectors often cannot offer high detection rates, avoid false positives, or offer generality, yet they reliably impose a ~30% slowdown, which ensures very reluctant (if any) uptake of the proposed solutions. This talk describes improvements on some of these fronts.

One improvement we have made is to restrict attention to important applications such as stencils. There we guarantee near-100% detection through rigorous floating-point error analysis based on affine arithmetic, and we reduce overheads by covering multiple steps of the stencil application per detector deployment.
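The idea of covering several stencil steps with one detector can be illustrated with a minimal sketch. It is not the FPDetect implementation: the 1D periodic 3-point stencil, the function names, and above all the tolerance are assumptions made for illustration. In particular, the tolerance below is a hand-picked placeholder, whereas the approach described above derives its bound from affine-arithmetic error analysis. The sketch runs k stencil steps, then checks one output point against a direct prediction from the state k steps earlier.

```python
import sys

def step(u, w):
    """One step of a periodic 3-point stencil; w = (left, center, right)."""
    n = len(u)
    return [w[0] * u[i - 1] + w[1] * u[i] + w[2] * u[(i + 1) % n]
            for i in range(n)]

def composed_weights(w, k):
    """Weights of the single (2k+1)-point stencil equal to k applications of w."""
    c = [1.0]
    for _ in range(k):
        nxt = [0.0] * (len(c) + 2)
        for a, ca in enumerate(c):
            for b, wb in enumerate(w):
                nxt[a + b] += ca * wb  # discrete convolution of c with w
        c = nxt
    return c

def point_is_consistent(u0, uk, i, w, k, tol):
    """Compare uk[i] with a prediction computed directly from u0."""
    c = composed_weights(w, k)
    n = len(u0)
    predicted = sum(cj * u0[(i + j - k) % n] for j, cj in enumerate(c))
    return abs(predicted - uk[i]) <= tol

w = (0.25, 0.5, 0.25)
k = 4                                   # steps covered by one detector check
u0 = [1.0 + i / 32 for i in range(32)]
uk = u0
for _ in range(k):
    uk = step(uk, w)

# Placeholder tolerance (assumption): NOT an affine-arithmetic bound.
tol = 64 * k * sys.float_info.epsilon * max(abs(x) for x in u0)

clean_ok = point_is_consistent(u0, uk, 5, w, k, tol)

faulty = list(uk)
faulty[5] += 1e-6                       # simulated soft error in one output
fault_ok = point_is_consistent(u0, faulty, 5, w, k, tol)
```

In this sketch one check validates the result of all k steps at the chosen point, so the cost of the detector is amortized over k stencil applications rather than paid at every step.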

Another improvement we have made is to protect only the address calculation steps involved in indexing arrays and structs. This approach rewrites the program's LLVM representation into a new LLVM representation that exaggerates (amplifies) the first failure into a cascading failure that manifests more readily. This approach offers low overheads and can exploit instructions found in ARM, offering 100% detection of address generation unit (AGU) faults.
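Why amplification helps can be seen in a toy model. The sketch below is an assumption made for illustration and is not claimed to be FailAmp's actual LLVM transformation: it simply contrasts index calculations that are computed independently with index calculations that are chained, so that each one is derived from the previous one. A single simulated bit flip corrupts one access in the first scheme and every later access in the second.

```python
def independent_indices(base, stride, count, fault_at=None, flip_bit=0):
    """Each index is computed from scratch; a fault affects one access only."""
    out = []
    for j in range(count):
        idx = base + j * stride
        if j == fault_at:
            idx ^= 1 << flip_bit        # simulated single-bit fault
        out.append(idx)
    return out

def chained_indices(base, stride, count, fault_at=None, flip_bit=0):
    """Each index is derived from the previous one; a fault carries forward."""
    out = []
    idx = base
    for j in range(count):
        if j > 0:
            idx += stride
        if j == fault_at:
            idx ^= 1 << flip_bit        # simulated single-bit fault
        out.append(idx)
    return out

def corrupted_accesses(observed, golden):
    """Number of accesses that touch a different index than the fault-free run."""
    return sum(1 for a, b in zip(observed, golden) if a != b)

golden = independent_indices(0, 1, 8)
plain = independent_indices(0, 1, 8, fault_at=2)
chained = chained_indices(0, 1, 8, fault_at=2)

plain_bad = corrupted_accesses(plain, golden)
chained_bad = corrupted_accesses(chained, golden)
runs_off_end = any(i >= 8 for i in chained)   # out of bounds for 8 elements
```

In the chained variant the corrupted index eventually leaves the valid range of an 8-element array, which is the kind of loud, easily observed failure that amplification aims for, whereas the independent variant silently reads one wrong in-bounds element.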

In this talk, we present FPDetect and FailAMP – tools that take these approaches, respectively.