Coordinated and Improved Fault Tolerance for High Performance Computing Systems
In the next few years SciDAC applications will utilize petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice.
Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also from the OS, middleware, libraries, and application levels. Petascale applications that are evolving to utilize these platforms face many new challenges. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable fault-tolerant systems to adapt, in a holistic manner, to faults occurring in the operating environment.
Our approach will be to design a reference implementation of a fault awareness and notification backplane to provide uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, runtime systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.
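To make the backplane idea concrete, the following is a minimal, single-process sketch of a publish/subscribe event bus in Python. It is only an illustration of the general pattern described above, not the FTB API: the names `ToyBackplane`, `FaultEvent`, `subscribe`, and `publish`, as well as the event names and fields, are hypothetical and chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical fault event; the fields are assumptions for this sketch."""
    source: str        # component that detected the fault, e.g. "mpi-library"
    name: str          # event name, e.g. "NODE_FAILURE"
    severity: str      # e.g. "info", "warning", "fatal"
    payload: str = ""  # free-form details


Handler = Callable[[FaultEvent], None]


class ToyBackplane:
    """In-memory stand-in for a fault notification backplane (not FTB itself)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        # Register a callback to run whenever an event with this name is published.
        self._handlers[event_name].append(handler)

    def publish(self, event: FaultEvent) -> int:
        # Deliver the event to every subscriber of its name; return how many were notified.
        handlers = list(self._handlers.get(event.name, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


backplane = ToyBackplane()
received: List[FaultEvent] = []

# Two independent components (say, a checkpoint library and a scheduler)
# subscribe to the same event name.
backplane.subscribe("NODE_FAILURE", received.append)
backplane.subscribe("NODE_FAILURE", lambda event: received.append(event))

# A third component publishes a fault; both subscribers are notified.
delivered = backplane.publish(
    FaultEvent("mpi-library", "NODE_FAILURE", "fatal", "node 17 unreachable")
)

# An event with no subscribers is simply dropped in this sketch.
ignored = backplane.publish(FaultEvent("filesystem", "DISK_DEGRADED", "warning"))
```

The point of the pattern is that the publishing component does not need to know which libraries or applications are listening; each component only needs to agree on the event interface. A real backplane would additionally have to handle distribution across nodes, which this sketch leaves out.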
CIFTS at SC'09
Come visit us at SC! The CIFTS team will be in Portland, Oregon for Supercomputing 2009. We have several activities planned for this year, including demos, BOFs, and posters. Detailed information can be found here. Click here to download the CIFTS Supercomputing'09 flier.
FTB-ENABLED MVAPICH2 1.4 NOW AVAILABLE
MVAPICH2 1.4 from Ohio State University is now FTB-enabled! More information can be found at the OSU FTB website. A list of FTB events publishable by MVAPICH2 1.4 can be found here.
FTB 0.6.1 RELEASE
The FTB 0.6.1 software is now available for download. This release supports Linux (Ubuntu 9.04) clusters, IBM BG/P, and Cray XT4 systems. It is based on the FTB API (version 0.5).
LBL's FTB-ENABLED BERKELEY LAB CHECKPOINT/RESTART FOR LINUX (BLCR)
Lawrence Berkeley National Laboratory's popular checkpoint/restart software BLCR (version 0.8.0) is now FTB-enabled. This software is based on the FTB 0.6 release. Users can download it from the BLCR website. Please refer to README.FTB (packaged with the software) for more information on how to install it and on the fault events it supports with the Fault Tolerance Backplane (FTB).
