Argonne National Laboratory

On the Efficacy of GPU-Integrated MPI for Scientific Applications

Title: On the Efficacy of GPU-Integrated MPI for Scientific Applications
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Aji, AM, Panwar, LS, Ji, F, Chabbi, M, Murthy, K, Balaji, P, Bisset, KR, Dinan, J, Feng, W, Mellor-Crummey, J, Ma, X, Thakur, R
Conference Name: The 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC)
Conference Location: New York, NY
Other Numbers: ANL/MCS-P5047-1213

Scientific computing applications are quickly adapting to leverage the massive parallelism of GPUs in large-scale clusters. However, the current hybrid programming models require application developers to explicitly manage the disjointed host and GPU memories, thus reducing both efficiency and productivity. Consequently, GPU-integrated MPI solutions, such as MPI-ACC and MVAPICH2-GPU, have been developed that provide unified programming interfaces and optimized implementations for end-to-end data communication among CPUs and GPUs. To date, however, there lacks an in-depth performance characterization of the new optimization spaces or the productivity impact of such GPU-integrated communication systems for scientific applications.

In this paper, we study the efficacy of GPU-integrated MPI on scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We use MPI-ACC as an example implementation and demonstrate how the programmer can seamlessly choose between either the CPU or the GPU as the logical communication end point, depending on the application’s computational requirements. MPI-ACC also encourages programmers to explore novel application-specific optimizations, such as internode CPU-GPU communication with concurrent CPU-GPU computations, which can improve the overall cluster utilization. Furthermore, MPI-ACC internally implements scalable memory management techniques, thereby decoupling the low-level memory optimizations from the applications and making them scalable and portable across several architectures. Experimental results from a state-of-the-art cluster with hundreds of GPUs show that the MPI-ACC–driven new application-specific optimizations can improve the performance of an epidemiology simulation by up to 61.6% and the performance of a seismology modeling application by up to 44%, when compared with traditional hybrid MPI+GPU implementations. We conclude that GPU-integrated MPI significantly enhances programmer productivity and has the potential to improve the performance and portability of scientific applications, thus making a significant step toward GPUs being “first-class citizens” of hybrid CPU-GPU clusters.