Argonne National Laboratory Mathematics and Computer Science Division

Publications

P. Balaji, D. Buntinas, D. Kimpe, "Fault Tolerance Techniques for Scalable Computing," Preprint ANL/MCS-P2037-0212, February 2012.

The largest systems in the world today already scale to hundreds of thousands of cores. With exascale systems planned to emerge within the next decade, we will soon have systems comprising more than a million processing elements. As researchers work toward architecting these enormous systems, it is becoming increasingly clear that, at such scales, resilience to hardware faults will be a prominent issue that must be addressed. This chapter discusses techniques used for fault tolerance on such systems, including checkpoint-restart techniques (system-level and application-level; complete, partial, and hybrid checkpoints), application-based fault-tolerance techniques, and hardware features for resilience.

