J. J. Evans, C. S. Hood, W. D. Gropp, "Exploring the Relationship Between Parallel Application Run-Time Variability and Network Performance in Clusters," Preprint ANL/MCS-P1071-0703, July 2003. [pdf]
Highly variable parallel application execution time is a persistent issue in cluster computing environments, and can be particularly acute in systems composed of Networks of Workstations (NOWs). We examine this issue in terms of consistency, focusing in particular on network performance. As a precursor to applying techniques from fault management to attain consistency, this paper presents our preliminary analysis of run-time variability from logs and experiments, exposing important issues related to systemic inconsistency in NOW clusters. Characterizing application sensitivity makes it possible to set network performance goals, thereby defining operational requirements. Network performance depends on the virtual topology imposed by the scheduler's allocation of nodes and on the communication patterns of the set of running applications. It is therefore important to treat both the network and the cluster's centralized node mapper (scheduler) as critical subsystems.