Argonne National Laboratory Mathematics and Computer Science Division
Argonne Home > MCS Division >

Publications

D. Buntinas, "Scalable Distributed Consensus to Support MPI Fault Tolerance," Lecture Notes in Computer Science, vol. 6960, Springer, 2012, pp. 325-328. Also Preprint ANL/MCS-P2027-0212, February 2012. [pdf]

As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus. This is used, for example, to collectively decide on a set of failed processes. This paper describes a scalable, distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum's fault-tolerance working group. The algorithm was implemented and evaluated on a 4,096-core Blue Gene/P. The implementation was able to perform a full-scale distributed consensus in 305 us and scaled logarithmically.


The Office of Advanced Scientific Computing Research | UChicago Argonne LLC | Privacy & Security Notice | ContactUs