Publications
Buettner, D., Desai, N., Lan, Z., Tang, W., "Fault-aware utility-based job scheduling on BlueGene/P systems," 2009 IEEE Conference on Cluster Computing (Cluster 2009), New Orleans, LA, 2009.
Job scheduling on large-scale systems is increasingly a complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance different scheduling requirements/priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces collected from the production Blue Gene/P system at ANL.
