Argonne National Laboratory

Fault Tolerance Techniques for Scalable Computing

Title: Fault Tolerance Techniques for Scalable Computing
Publication Type: Book Chapter
Year of Publication: 2012
Authors: Balaji, P, Buntinas, D, Kimpe, D
Other Numbers: ANL/MCS-P2037-0212

The largest systems in the world today already scale to hundreds of thousands of cores. With plans under way for exascale systems to emerge within the next decade, we will soon have systems comprising more than a million processing elements. As researchers work toward architecting these enormous systems, it is becoming increasingly clear that, at such scales, resilience to hardware faults will be a prominent issue. This chapter discusses techniques used for fault tolerance on such systems, including checkpoint-restart techniques (system-level and application-level; complete, partial, and hybrid checkpoints), application-based fault-tolerance techniques, and hardware features for resilience.