Argonne National Laboratory

Using Error Estimations for Detecting Silent Data Corruption in Numerical Integration Solvers

Title: Using Error Estimations for Detecting Silent Data Corruption in Numerical Integration Solvers
Publication Type: Report
Year of Publication: 2016
Authors: Guhur, P, Cappello, F, Peterka, T, Abou-Kandil, H
Report Number: ANL/MCS-TM-364
Abstract

Data corruption may arise from a wide variety of sources, from aging hardware to ionizing radiation, and the risk of corruption increases with the scale of the computation. A corruption may cause a failure, when execution crashes, or it may be silent, when it remains undetected. I studied solutions to silent data corruptions for numerical integration solvers, which are particularly sensitive to corruptions. Numerical integration solvers are step-by-step methods that approximate the solution of a differential equation. Not only do corruptions propagate through every subsequent step of the integration, but the solution may even diverge. In numerical integration solvers, the approximation error can be estimated at low cost. I used these error estimates to detect silent data corruptions in two high-performance applications in fault tolerance. On the one hand, I demonstrated a new lightweight detector for solvers with a fixed integration step size, and I showed mathematically that this method detects all corruptions affecting the accuracy of a simulation. On the other hand, solvers with a variable integration step size can naturally reject silent data corruptions during the selection of the next step size. I showed that this mechanism alone can miss too many corruptions, and I developed a mechanism to improve it.
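To illustrate the general idea of using a cheap error estimate as a corruption detector in a fixed-step solver, here is a minimal Python sketch. It is not the detector from the report: it assumes a scalar ODE, a Heun step with an embedded Euler solution as the error estimate, a hand-picked threshold `tol`, and a hypothetical `fault` value added to one stage to model a transient corruption. All function and parameter names are illustrative.

```python
def heun_euler_step(f, t, y, h, fault=0.0):
    """One Heun (2nd-order) step with an embedded Euler (1st-order) solution.

    `fault` is a hypothetical perturbation added to the second stage to
    model a transient silent corruption. Returns (y_next, error_estimate).
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + fault
    y_low = y + h * k1                  # first-order (Euler) solution
    y_high = y + 0.5 * h * (k1 + k2)    # second-order (Heun) solution
    return y_high, abs(y_high - y_low)  # difference serves as the error estimate


def integrate(f, t0, y0, h, n_steps, tol, faulty_step=None, fault=0.0):
    """Fixed-step integration that flags steps whose error estimate exceeds `tol`.

    A flagged step is recomputed once, assuming the fault was transient.
    Returns (y_final, list of flagged step indices).
    """
    t, y = t0, y0
    detected = []
    for i in range(n_steps):
        injected = fault if i == faulty_step else 0.0
        y_new, err = heun_euler_step(f, t, y, h, injected)
        if err > tol:
            detected.append(i)
            y_new, err = heun_euler_step(f, t, y, h)  # recompute without the fault
        t, y = t + h, y_new
    return y, detected


if __name__ == "__main__":
    decay = lambda t, y: -y  # test problem y' = -y, y(0) = 1
    print(integrate(decay, 0.0, 1.0, 0.01, 100, tol=1e-3))
    print(integrate(decay, 0.0, 1.0, 0.01, 100, tol=1e-3, faulty_step=50, fault=10.0))
```

In this sketch a perturbation small enough to keep the estimate below `tol` goes unflagged, so the choice of threshold decides which corruptions count as affecting accuracy. How the report sets that threshold is not stated in the abstract.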
 

PDF: http://www.mcs.anl.gov/papers/ANL_MCS-TM-364.pdf