Fault Tolerance Techniques for Scalable Computing
|Title|Fault Tolerance Techniques for Scalable Computing|
|Publication Type|Book Chapter|
|Year of Publication|2012|
|Authors|Balaji, P.; Buntinas, D.; Kimpe, D.|
The largest systems in the world today already scale to hundreds of thousands of cores. With exascale systems planned to emerge within the next decade, we will soon have systems comprising more than a million processing elements. As researchers work toward architecting these enormous systems, it is becoming increasingly clear that resilience to hardware faults will be a prominent issue at such scales. This chapter discusses techniques used for fault tolerance on such systems, including checkpoint-restart techniques (system-level and application-level; complete, partial, and hybrid checkpoints), application-based fault-tolerance techniques, and hardware features for resilience.