Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System

Title: Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
Publication Type: Conference Paper
Year of Publication: 2010
Authors: Fu, J; Min, MS; Latham, R; Carothers, CD
Conference Name: CLUSTER '11
Date Published: 10/2010
Other Numbers: ANL/MCS-P1804-1010
Abstract: As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques for proactively dealing with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation, spread across thousands of processors, are saved to stable storage. Unfortunately, this approach results in a heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach called reduced-blocking I/O (rbIO) and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the one-POSIX-file-per-processor approach. Our study shows that rbIO and coIO result in a 100x improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using GPFS. Our study also demonstrates a 25x production performance improvement for NekCEM. We show how to optimize parameter settings for these parallel I/O approaches and verify the results through I/O profiling. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.
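
To give a rough sense of why funnelling checkpoint data through a subset of processes can relieve the file system, the sketch below compares the number of files created per checkpoint under a one-file-per-processor scheme with a grouped scheme in which one process per group does the writing. This is not the paper's implementation: the function names, the group size of 64, and the assumption that each writer produces exactly one file are hypothetical choices made only for illustration. Only the processor count of 65,536 comes from the abstract.

```python
def writer_of(rank: int, group_size: int) -> int:
    """Return the rank that writes on behalf of `rank`.

    Assumption: ranks are split into contiguous groups and the lowest
    rank in each group acts as that group's writer.
    """
    return (rank // group_size) * group_size


def files_per_checkpoint(nprocs: int, group_size: int) -> int:
    """Number of files per checkpoint, assuming one file per writer."""
    return -(-nprocs // group_size)  # ceiling division


nprocs = 65_536   # processor count quoted in the abstract
group_size = 64   # hypothetical group size, for illustration only

# group_size == 1 models the one-POSIX-file-per-processor approach.
one_file_per_proc = files_per_checkpoint(nprocs, 1)
aggregated = files_per_checkpoint(nprocs, group_size)

print(f"one file per processor: {one_file_per_proc} files")
print(f"one writer per {group_size} ranks: {aggregated} files")
```

Under these assumptions the grouped scheme creates 64 times fewer files per checkpoint, which reduces metadata pressure on the parallel file system. The actual gains reported in the paper depend on tuned parameters that this sketch does not model.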