Argonne National Laboratory

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications

Title: Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications
Publication Type: Report
Year of Publication: 2014
Authors: Berrocal, E, Bautista-Gomez, L, Di, S, Lan, Z, Cappello, F
Other Numbers: ANL/MCS-P5224-1014
Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Hence, supercomputer designers are pushing the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of HPC applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is mostly algorithm-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We evaluate our strategy on well-known HPC applications and kernels such as HACC and Nek5000. Our results show that some detectors can detect up to 95% of corruptions and other lightweight detectors can cover for the majority of corruptions while incurring less than 5% overhead.

PDF: http://www.mcs.anl.gov/papers/P5224-1014.pdf