Argonne National Laboratory Mathematics and Computer Science Division

Seminars & Events


Mathematics and Computer Science Division Seminar
"Fault Awareness ENabled Computing Environment"

DATE: May 27, 2008
TIME: 10:30 am
SPEAKER: Zhiling Lan, Illinois Institute of Technology
LOCATION: Building 221, Conference Room A216, Argonne National Laboratory
HOST: Rinku Gupta / Pete Beckman

Description:
As the scale of high-performance computing continues to grow, fault management is becoming a critical challenge. Recent studies have pointed out that the mean time between failures (MTBF) of teraflop and petaflop machines is only on the order of 10 to 100 hours. This situation is likely to deteriorate in the near future, threatening the productivity that HPC systems promise. In this talk, I will describe an ongoing research project at Illinois Institute of Technology that aims at building FENCE, a Fault-aware ENabled Computing Environment for HPC. FENCE is a comprehensive fault management system in the sense that it provides both offline and runtime support, integrates both proactive and reactive mechanisms, and combines application-level and system-level fault management. I will give an overview of FENCE by describing its major components, and will focus on its runtime support, namely failure prediction and diagnosis, the adaptive control manager, and integrated runtime support.



