Argonne National Laboratory

Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers

Title: Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Guhur, P, Zhang, H, Peterka, T, Constantinescu, EM, Cappello, F
Conference Name: Euro-Par 2016
Date Published: 01/2016
Conference Location: Grenoble, France
Other Numbers: ANL/MCS-P5582-0316
Abstract: Silent data corruptions (SDCs) are errors that corrupt the system or falsify results while remaining unnoticed by firmware or operating systems. In numerical integration solvers, SDCs that impact the accuracy of the solver are considered significant. Detecting SDCs in high-performance computing is necessary because results need to be trustworthy and because the growing number and complexity of components in emerging large-scale architectures make SDCs more likely to occur. Until recently, SDC detection methods consisted of replicating the processes of the execution or using checksums (for example, algorithm-based fault tolerance). Recently, new detection methods have been proposed that rely on mathematical properties of numerical kernels or perform data analysis of the results modified by the application. None of those methods, however, provides a lightweight solution guaranteeing that all significant SDCs are detected. We propose a new method called Hot Rod as a solution to this problem. It checks and potentially corrects the data produced by numerical integration solvers. Our theoretical model shows that all significant SDCs can be detected. We present two detectors and conduct experiments on streamline integration from the WRF meteorology application. Compared with the algorithmic detection methods, the accuracy of our first detector is increased by 52% with a similar false detection rate. The second detector has a false detection rate one order of magnitude lower than these detection methods while reaching a detection accuracy improved by 23%. The computational overhead is lower than 5% in both cases. The model has been developed for an explicit Runge-Kutta method, although it can be generalized to other solvers.
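The abstract does not specify how the Hot Rod detectors work, so the following is only a generic illustration of the underlying idea: checking each step of an explicit Runge-Kutta solver and recomputing it when the result looks corrupted. It is a minimal sketch, assuming an embedded Heun-Euler pair, a fixed threshold on the local error estimate, and a transient fault that disappears on recomputation. The names heun_euler_step, integrate, fault, fault_at and tol are hypothetical and do not come from the paper.

```python
import math


def heun_euler_step(f, t, y, h, fault=0.0):
    """One step of the embedded Heun-Euler pair.

    Returns the second-order solution and a local error estimate
    (difference between the second- and first-order solutions).
    `fault` is an assumed additive perturbation of the second stage,
    used here only to mimic a silent data corruption.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + fault
    y_high = y + 0.5 * h * (k1 + k2)  # Heun, order 2
    y_low = y + h * k1                # explicit Euler, order 1
    return y_high, abs(y_high - y_low)


def integrate(f, y0, t0, h, n_steps, tol, fault_at=None, fault=0.0):
    """Fixed-step integration with a simple threshold check.

    A step whose error estimate exceeds `tol` is flagged and recomputed.
    The recomputation is fault-free because the fault is assumed transient.
    Returns the final value and the indices of the flagged steps.
    """
    t, y = t0, y0
    flagged = []
    for i in range(n_steps):
        injected = fault if i == fault_at else 0.0
        y_new, err = heun_euler_step(f, t, y, h, injected)
        if err > tol:
            flagged.append(i)
            y_new, err = heun_euler_step(f, t, y, h)
        y = y_new
        t += h
    return y, flagged


if __name__ == "__main__":
    # Test problem y' = -y, y(0) = 1, integrated to t = 1.
    y_end, flagged = integrate(lambda t, y: -y, 1.0, 0.0, 0.01, 100,
                               tol=1e-3, fault_at=5, fault=10.0)
    print("flagged steps:", flagged)
    print("error vs exp(-1):", abs(y_end - math.exp(-1.0)))
```

A fixed threshold like this one is the weak point of the sketch: a corruption small enough to stay under tol goes unnoticed, and a tight tol flags clean steps. The abstract reports detection accuracy and false detection rate for exactly this trade-off.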