Argonne National Laboratory Mathematics and Computer Science Division

Research Projects

CIFTS: Coordinated Infrastructure for Fault Tolerant Systems

PIs:
Peter Beckman

MCS People Involved:
Pavan Balaji, Darius Buntinas, Narayan Desai, Rinku Gupta, Rob Ross, Rajeev Thakur

Additional People:
Indiana University: Andrew Lumsdaine; Lawrence Berkeley National Laboratory: Paul Hargrove; Oak Ridge National Laboratory: Al Geist, David Bernholdt, Pratul Agarwal, Hoony Park; Ohio State University: D. K. Panda; University of Tennessee, Knoxville: Jack Dongarra

Other Institutes:
ALCF; Indiana University; Lawrence Berkeley National Laboratory; Oak Ridge National Laboratory; Ohio State University; University of Tennessee, Knoxville

[project website]

 

Abstract:
Current system software components for large-scale machines remain largely independent in their fault awareness and notification strategies. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable fault-tolerant systems to adapt holistically to faults occurring in the operating environment.

Our approach is threefold: design a reference implementation of a fault awareness and notification backplane that provides common, uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, runtime systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.
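To make the backplane idea concrete, the following is a minimal in-process sketch of a publish/subscribe fault-event channel. It is an illustration only, not the CIFTS interface: the names FaultEvent, FaultBackplane, subscribe, and publish, along with the event fields, are hypothetical and chosen for this example. A real backplane would also need to carry events between processes and nodes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical event record; the fields are assumptions for illustration."""
    source: str        # component that observed the fault
    name: str          # event name, e.g. "node_failure"
    severity: str      # e.g. "info", "warning", "fatal"
    payload: str = ""  # free-form details


Handler = Callable[[FaultEvent], None]


class FaultBackplane:
    """Toy backplane: components publish fault events, others subscribe by name."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register a callback for every event published under event_name."""
        self._handlers[event_name].append(handler)

    def publish(self, event: FaultEvent) -> int:
        """Deliver the event to its subscribers; return how many were notified."""
        handlers = list(self._handlers.get(event.name, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


if __name__ == "__main__":
    backplane = FaultBackplane()
    # A checkpointing library (hypothetical) reacts to node failures
    # reported by any other component.
    backplane.subscribe("node_failure", lambda e: print("checkpoint on", e.payload))
    backplane.publish(FaultEvent("job_scheduler", "node_failure", "fatal", "node=n042"))
```

The point of the sketch is that the publisher and the subscriber never reference each other directly; both depend only on the shared event interface, which is the kind of decoupling a common interface specification is meant to provide.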


U.S. Department of Energy The University of Chicago Office of Science - Department of Energy