Argonne National Laboratory

VOCL-FT: Introducing Techniques for Efficient Soft Error Coprocessor Recovery

Publication Type: Conference Paper
Year of Publication: 2015
Authors: Pena, AJ; Bland, W; Balaji, P
Conference Name: SC'15
Date Published: 11/2015
Conference Location: Austin, Texas
Other Numbers: ANL/MCS-P5275-0115
Abstract: Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks for an efficient fault-tolerant system for accelerators. Although we leverage our techniques to protect from detected but uncorrected ECC errors in the device memory in OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the temporary storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.