A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems

Title: A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems
Publication Type: Report
Year of Publication: 2014
Authors: Snyder, S, Carns, PH, Jenkins, J, Harms, K, Ross, RB, Mubarak, M, Carothers, CD
Other Numbers: ANL/MCS-P5180-0814

Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, trade-offs must be made to account for the competing goals of fault detection efficiency and accuracy.
In this work, we evaluate epidemic fault detection and group membership protocols for use in large-scale HPC storage systems. This class of algorithms has the potential to increase scalability and decrease fault response time for future production systems. We focus our analysis on the Scalable Weakly-consistent Infection-style Process Group Membership (SWIM) protocol.
We begin by exploring how the semantics of this protocol differ from those of typical HPC group membership protocols, and we discuss how storage systems might need to adapt as a result. We then identify how to tune the SWIM protocol for deployment in an HPC environment. We also develop a high-resolution parallel discrete event simulation of the protocol to confirm existing analytical models and explore protocol behavior in greater detail. Our preliminary results indicate that the SWIM protocol is a promising alternative for group membership in HPC storage systems, offering rapid convergence, tolerance to transient network failures, and minimal network load.
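To illustrate what checking a simulation against an analytical model can look like, the sketch below compares a toy Monte Carlo run with a closed-form expectation for randomized probing. It is not the parallel discrete event simulator described in this report, and it models only one ingredient of SWIM-style failure detection. The assumptions are ours: every live member pings one uniformly random other member per protocol period, there is no message loss, and the failure counts as detected in the first period in which any live member targets the failed one. Indirect probes, suspicion timeouts, and dissemination are ignored, and the function names are hypothetical.

```python
import math
import random


def expected_detection_periods(n: int) -> float:
    """Closed-form expectation under the toy model (an assumption, see lead-in).

    Each of the n - 1 live members targets the failed member with
    probability 1 / (n - 1) in a period, independently of the others.
    """
    p_detect = 1.0 - (1.0 - 1.0 / (n - 1)) ** (n - 1)
    return 1.0 / p_detect  # mean of a geometric distribution


def simulate_detection_periods(n: int, trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the mean number of periods until detection."""
    total_periods = 0
    for _ in range(trials):
        periods = 0
        detected = False
        while not detected:
            periods += 1
            # Member 0 has failed; members 1..n-1 are alive and each send one ping.
            for pinger in range(1, n):
                # Pick uniformly among the n - 1 members other than the pinger.
                target = rng.randrange(n - 1)
                if target >= pinger:
                    target += 1
                if target == 0:
                    detected = True
                    break
        total_periods += periods
    return total_periods / trials


if __name__ == "__main__":
    rng = random.Random(2014)
    n = 64
    simulated = simulate_detection_periods(n, trials=2000, rng=rng)
    analytical = expected_detection_periods(n)
    limit = 1.0 / (1.0 - math.exp(-1.0))  # large-group limit of the toy model
    print(f"simulated={simulated:.3f} analytical={analytical:.3f} limit={limit:.3f}")
```

Under these assumptions the expected detection time approaches 1 / (1 - 1/e) protocol periods as the group grows, so it does not depend on group size. A full protocol simulation would additionally need to capture message loss, indirect probing, and the spread of membership updates.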