Pavan Balaji, Darius Buntinas, Narayan Desai, Rinku Gupta, Rob Ross, Rajeev Thakur
Indiana University: Andrew Lumsdaine; Lawrence Berkeley National Laboratory: Paul Hargrove; Oak Ridge National Laboratory: Al Geist, David Bernholdt, Pratul Agarwal, Hoony Park; Ohio State University: D. K. Panda; University of Tennessee, Knoxville: Jack Dongarra
ALCF; Indiana University; Lawrence Berkeley National Laboratory; Oak Ridge National Laboratory; Ohio State University; University of Tennessee, Knoxville
Current system software components for large-scale machines remain largely independent in their fault awareness and notification strategies. With the CIFTS initiative, we aim to provide a coordinated infrastructure that enables fault-tolerant systems to adapt holistically to faults occurring in the operating environment.
Our approach has three parts: design a reference implementation of a fault awareness and notification backplane that provides common, uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, run-time systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.