Argonne National Laboratory Mathematics and Computer Science Division

Seminars & Events

"Toward Recovery Capabilities in Message Passing Environments"

DATE: December 17, 2012
TIME: 10:30 AM - 11:30 AM
SPEAKER: Wesley Bland, Postdoc Interviewee
LOCATION: Building 240, Conference Room 1406 and 1407, Argonne National Laboratory
HOST: Pavan Balaji

Description:
There is a known direct relationship between the size of computing resources and their failure rate. As the scale of these platforms becomes increasingly extreme, the requirements for application fault tolerance follow the same trend. Automatic approaches have a small investment cost, but their scalability is questionable at the magnitude of future machines. More promising techniques that improve the resilience of applications' intrinsic algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This talk discusses two approaches to failure mitigation: Checkpoint-on-Failure (CoF), which works in the context of the current MPI Standard (Version 3.0), and User Level Failure Mitigation (ULFM), which uses an extension to the MPI standard. Experiments demonstrate the capabilities of these two techniques and highlight that a fault-aware MPI implementation can have little to no impact on performance for a range of applications, while producing satisfactory recovery times when failures occur.



