Argonne National Laboratory

Toward Understanding Soft Faults in High Performance Cluster Networks

Title: Toward Understanding Soft Faults in High Performance Cluster Networks
Publication Type: Report
Year of Publication: 2003
Authors: Evans, JJ, Baik, S, Hood, CS, Gropp, WD
Date Published: 01/2003
Other Numbers: ANL/MCS-P1017-0103
Abstract

Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that hurt performance but do not result in failures often go unnoticed. In this paper, we classify such degradations as soft faults. In addition, we identify consistent performance as an important requirement in cluster networks. Using this service requirement, we describe a comprehensive strategy for cluster fault management.

PDF: http://www.mcs.anl.gov/papers/P1017.pdf