Seminars & Events
Mathematics and Computer Science Division
"Fast Checkpoint for Extreme Scale Supercomputers"
DATE: November 8, 2012
TIME: 10:30 AM - 11:30 AM
SPEAKER: Leonardo Bautista Gomez, Postdoc Interviewee
LOCATION: Building 240, Seminar Room 4301, Argonne National Laboratory
HOST: Franck Cappello
Description:
In high-performance computing, scientific applications need to make progress despite frequent failures. Long-running executions are therefore periodically checkpointed to stable storage. Today, the overhead imposed by checkpointing to a parallel file system is about 25% of execution time. In future exascale supercomputers, checkpointing this way will become prohibitively time consuming. We developed a fault tolerance interface that exploits the features of large-scale hybrid systems to implement low-overhead, high-frequency multi-level checkpointing. It combines a Topology-Aware Reed-Solomon encoding algorithm with modern local storage devices, advanced clustering techniques, and Fault Tolerance Dedicated Threads. Finally, we present an exascale study based on our performance model and show that our approach can guarantee low overhead in future extreme-scale systems.
