N. Desai, E. Lusk, D. Buettner, A. Cherry, and T. Voran, "Simulating Failures on Large-Scale Systems," 37th International Conference on Parallel Processing - Workshops, IEEE, 2008, pp. 103-108.
Developing fault management mechanisms is a difficult task because of the unpredictable nature of failures. In this paper, we present a fault simulation framework for Blue Gene/P systems implemented as a part of the Cobalt resource manager. The primary goal of this framework is to support system software development. We also present a hardware diagnostic system that we have implemented using this framework.