Coordinated and Improved Fault Tolerance for High Performance Computing Systems

In the next few years SciDAC applications will utilize exascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice. Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also from the OS, middleware, libraries, and application levels. Exascale applications that are evolving to utilize these platforms face many new challenges.

The CIFTS initiative focuses on providing end-to-end fault tolerance for applications on high-end computing systems. To achieve this, we aim to:

  • Research, design, and improve fault tolerance techniques in software that is widely used in the high-end computing community today, and
  • Investigate research challenges and build a fault coordination environment that allows all system software to exchange fault information and thus adapt, in a holistic manner, to faults occurring in the operating environment.

Our approach is to (1) create an interface specification that allows libraries, run-time systems, and applications to exchange fault information and conduct fault management in a coordinated manner; (2) design and build a reference implementation of this fault awareness and notification interface specification that provides common, uniform event handling and notification mechanisms for fault-aware libraries and middleware; and (3) improve fault tolerance capabilities in key libraries and applications, extending them to validate both the interface specification and its reference implementation.
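The coordinated exchange of fault information described above can be pictured as a publish/subscribe pattern. The following is a minimal in-process sketch in Python, not the FTB API: the names FaultEvent, FaultBus, subscribe, and publish, as well as the event fields and the "NODE_FAILURE" event name, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical fault record; field names are assumptions, not FTB's."""
    source: str        # component reporting the fault, e.g. "mpi"
    name: str          # event name, e.g. "NODE_FAILURE"
    severity: str      # e.g. "info", "warning", or "fatal"
    payload: str = ""  # free-form details


class FaultBus:
    """Toy in-process stand-in for a fault notification backplane."""

    def __init__(self) -> None:
        # Maps a severity string to the handlers registered for it.
        self._subscribers: DefaultDict[
            str, List[Callable[[FaultEvent], None]]
        ] = defaultdict(list)

    def subscribe(self, severity: str,
                  handler: Callable[[FaultEvent], None]) -> None:
        """Register a handler for all events of the given severity."""
        self._subscribers[severity].append(handler)

    def publish(self, event: FaultEvent) -> int:
        """Deliver the event to matching handlers; return how many ran."""
        handlers = list(self._subscribers.get(event.severity, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


# Usage: a checkpoint library listens for fatal faults reported by MPI.
bus = FaultBus()
seen: List[FaultEvent] = []
bus.subscribe("fatal", seen.append)
delivered = bus.publish(
    FaultEvent("mpi", "NODE_FAILURE", "fatal", "rank 3 unreachable")
)
ignored = bus.publish(FaultEvent("io", "SLOW_DISK", "warning"))
```

In this sketch any component can both publish and subscribe, which is the property that lets independent layers (OS, middleware, libraries, applications) react to each other's faults. A real backplane would additionally need to work across processes and nodes, which this example does not attempt.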



Open MPI 1.5.3 has been released with FTB support! More information and detailed instructions on the release, as well as a list of supported FTB fault events, can be found on the Open Systems Laboratory website.

FTB-ENABLED MPICH2 1.3.1 Released

MPICH2 1.3.1 from Argonne National Laboratory is FTB-enabled. More information can be found on the MPICH2 Wiki page. With this release, MPICH2 is fully compliant with the FTB MPI standardized events 1.0 and publishes all events described in that standard.


MVAPICH2 1.6 from Ohio State University now provides FTB-enabled support for both Checkpoint-Restart and Process Migration. More information on these features can be found on the OSU FTB website. The MVAPICH2 1.6 codebase and the associated user guide, which explains how to use these features, can be downloaded from the MVAPICH website.


The FTB software version 0.6.2 is now available for download. This release supports Linux clusters, IBM BG/P and Cray XT4 systems. This release is based on the FTB API version 0.5 specification.


Lawrence Berkeley National Laboratory's popular Checkpoint/Restart software BLCR (version 0.8.0) is now FTB-enabled. It is based on the FTB 0.6 software release. Users can download the software from the BLCR website. Please refer to README.FTB (packaged with the software) for more information on how to install it and on the fault events it supports with the Fault Tolerance Backplane (FTB).