A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems
|Title||A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems|
|Year of Publication||2014|
|Authors||Snyder, S, Carns, PH, Jenkins, J, Harms, K, Ross, RB, Mubarak, M, Carothers, CD|
Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, trade-offs must be made to account for the competing goals of fault detection efficiency and accuracy.