Detecting Silent Data Corruption through Data Dynamic Monitoring for Scientific Applications

Title: Detecting Silent Data Corruption through Data Dynamic Monitoring for Scientific Applications
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Gomez, L. A. Bautista; Cappello, F.
Conference Name: 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Conference Location: Orlando, FL
Other Numbers: ANL/MCS-P5053-1213
Abstract

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and show that it can detect up to 50% of injected errors while incurring only negligible overhead.
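
As a rough illustration of what "learning the normal dynamics of a dataset" could look like, the sketch below tracks the largest change seen between consecutive time steps and flags any value whose change far exceeds it. This is not the detector described in the paper. The class name `DeltaMonitor`, the `tolerance` factor, and the `warmup` count are all assumptions made for this example.

```python
# Illustrative sketch only: NOT the detector from the paper.
# DeltaMonitor, tolerance, and warmup are hypothetical names and parameters.
class DeltaMonitor:
    """Learn the largest step-to-step change seen so far and flag outliers."""

    def __init__(self, tolerance=2.0, warmup=3):
        self.tolerance = tolerance  # assumed slack factor over the learned maximum change
        self.warmup = warmup        # assumed number of updates used only for learning
        self.previous = None        # dataset values from the previous time step
        self.max_delta = 0.0        # largest "normal" change observed so far
        self.updates = 0            # number of step-to-step comparisons made

    def check(self, values):
        """Return indices whose change since the last step looks anomalous."""
        suspects = []
        if self.previous is not None:
            deltas = [abs(v - p) for v, p in zip(values, self.previous)]
            if self.updates >= self.warmup:
                limit = self.tolerance * self.max_delta
                suspects = [i for i, d in enumerate(deltas) if d > limit]
            # Learn only from changes that were not flagged as anomalous.
            normal = [d for i, d in enumerate(deltas) if i not in suspects]
            if normal:
                self.max_delta = max(self.max_delta, max(normal))
            self.updates += 1
        self.previous = list(values)
        return suspects
```

An application would call `check` on a dataset once per time step; an empty list means no change exceeded the learned bound. Because the sketch stores the flagged value as the new baseline, a corrupted point may be flagged again on the following step when it returns to its normal range.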

PDF: http://www.mcs.anl.gov/papers/P5053-1213.pdf