Argonne National Laboratory

Improving Resource Availability by Relaxing Network Allocation Constraints on the Blue Gene/P

Title: Improving Resource Availability by Relaxing Network Allocation Constraints on the Blue Gene/P
Publication Type: Conference Paper
Year of Publication: 2009
Authors: Desai, NL; Buntinas, D; Buettner, D; Balaji, P; Chan, A
Conference Name: 2009 International Conference on Parallel Processing
Date Published: 09/2009
Publisher: IEEE Computer Society
Conference Location: Vienna, Austria
Other Numbers: ANL/MCS-P1587-0209

High-end computing (HEC) systems have passed the petaflop barrier and continue to move toward the next frontier of exascale computing. As companies and research institutes work toward architecting these enormous systems, it is becoming increasingly clear that they will share a significant amount of hardware between processing units, including caches, memory management engines, and network infrastructure. While these systems are optimized to achieve the best performance when all of the available hardware is used in a dedicated manner, in practice the shared nature of this hardware makes scheduling applications on it difficult and wasteful. For example, while the IBM Blue Gene/P system has been designed to use a torus network for efficient communication, some of the torus links (especially those connecting different racks) are shared between multiple racks. Thus, a job running on one rack might preclude another job from running on a second rack, even though the second rack's compute resources are completely idle. In this paper, we assess the relative performance degradation that real applications experience when such shared network hardware is left completely unused in some cases. Our measurements on Intrepid, one of the largest Blue Gene/P installations in the world, demonstrate less than 5% degradation for several leadership applications commonly run on the system. Further, we demonstrate that the additional scheduling flexibility offered by not sharing such hardware can improve the overall job turnaround time by nearly 40% in some cases.