Fault-Aware Utility-Based Job Scheduling on BlueGene/P Systems

Publication Type: Conference Paper
Year of Publication: 2009
Authors: Buettner, D; Desai, NL; Lan, Z; Tang, W
Conference Name: 2009 IEEE Conference on Cluster Computing (Cluster 2009)
Date Published: 09/2009
Conference Location: New Orleans, LA

Job scheduling on large-scale systems is increasingly complicated, with numerous factors influencing scheduling policy. Addressing these factors results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework that balances different scheduling requirements and priorities. It enables system owners to customize scheduling policies for different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the growing concern of system failures. We demonstrate the effectiveness of both facilities through event-driven simulations using real job traces collected from the production Blue Gene/P system at Argonne National Laboratory (ANL).
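
The idea of a utility-based framework can be illustrated with a minimal sketch. This is not the paper's implementation: the `Job` fields, the three utility functions, and the `POLICIES` table are all hypothetical names chosen for illustration. The sketch only shows how a scheduler might rank queued jobs by a pluggable score, so that changing the policy means selecting a different function rather than editing the scheduling loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not from the paper."""
    job_id: str
    queued_time: float  # seconds the job has waited in the queue
    walltime: float     # requested runtime in seconds
    nodes: int          # requested node count


# A utility function maps a job to a score; higher scores are scheduled first.
UtilityFn = Callable[[Job], float]


def fcfs_utility(job: Job) -> float:
    # The longest-waiting job gets the highest score (first come, first served).
    return job.queued_time


def short_job_first_utility(job: Job) -> float:
    # Jobs with a shorter requested walltime get a higher score.
    return 1.0 / job.walltime


def balanced_utility(job: Job) -> float:
    # Illustrative blend: waiting time relative to requested walltime.
    return job.queued_time / job.walltime


# Policies are looked up by name, so they can be switched through configuration.
POLICIES: Dict[str, UtilityFn] = {
    "fcfs": fcfs_utility,
    "short_first": short_job_first_utility,
    "balanced": balanced_utility,
}


def order_queue(jobs: List[Job], policy: str) -> List[Job]:
    """Return the jobs sorted by descending utility under the named policy."""
    return sorted(jobs, key=POLICIES[policy], reverse=True)


queue = [
    Job("a", queued_time=3600, walltime=7200, nodes=512),
    Job("b", queued_time=600, walltime=300, nodes=64),
    Job("c", queued_time=1800, walltime=1800, nodes=1024),
]

for name in POLICIES:
    print(name, [job.job_id for job in order_queue(queue, name)])
```

In this sketch the scheduling loop (`order_queue`) never changes; only the entry selected from `POLICIES` does. A fault-aware allocation step could be layered on in the same spirit, for example by scoring candidate partitions, but the abstract gives no details of that strategy, so it is not sketched here.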