MPI+Threads Applications at Scale: A Case Study with Parallel Breadth-First Search

Publication Type: Report
Year of Publication: 2014
Authors: Amer, A., Lu, H., Balaji, P., Matsuoka, S.
Other Numbers: ANL/MCS-P5166-0714

With the increasing prominence of manycore architectures and the decreasing memory available per core on large supercomputers, a number of applications are investigating hybrid MPI+threads programming as a way to utilize all computational units while sharing memory. A process-only model that uses one MPI process per system core can effectively utilize the available processing units, but it fails to fully exploit the memory hierarchy. The hybrid MPI+threads model, on the other hand, handles intranode parallelism more effectively, but it can suffer from the locking and memory-consistency overheads associated with data sharing. Moreover, hybrid MPI+threads models can alleviate some of the overheads of internode data communication by moving data between address spaces at a coarse granularity, while threads within the same address space still access that data at a fine granularity. These intricacies are often invisible at small scales but become highly prominent on large-scale systems, causing performance bottlenecks and scalability limitations.
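To make the shared-versus-separate address space distinction concrete, here is a minimal Python sketch (an illustration, not code from the report): threads read one shared graph structure in place with no data movement, whereas an MPI-only model would give each rank its own copy and require explicit messages; the lock models the data-sharing overhead the text mentions. The graph and function names are invented for this example.

```python
import threading

# A toy adjacency list shared by all threads in one address space.
# In an MPI-only model, each rank would hold its own partition or copy.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}

degrees = {}
lock = threading.Lock()

def count_degree(vertex):
    # Fine-grained read of shared data: no copy, no message passing.
    neighbors = graph[vertex]
    with lock:  # locking overhead is the price of sharing, as the text notes
        degrees[vertex] = len(neighbors)

threads = [threading.Thread(target=count_degree, args=(v,)) for v in graph]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(degrees)  # out-degree of each vertex, computed cooperatively
```

In the process-only analogue, each worker would receive its vertices through coarse-grained communication before doing any work; the hybrid model keeps that coarse movement between nodes only.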
In this paper, we use a distributed implementation of the breadth-first search (BFS) algorithm to understand the performance characteristics of the MPI-only and MPI+threads communication models at scale. Starting from an MPI-only implementation of BFS, we propose a highly efficient MPI+threads implementation in which threads communicate independently with remote processes while cooperating on local computation. At large scale, despite the better scalability of our hybrid method over the process-only model, the overhead of polling for communication progress and an inefficient global synchronization constitute the major bottlenecks. We then demonstrate several techniques that reduce these overheads, improving the performance of BFS by 27-fold when scaling to 524,288 (512K) cores of a Blue Gene/Q system.
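For reference, the following hedged Python sketch shows a single-process, level-synchronous BFS, the serial analogue of the algorithm studied here; in the distributed variant described above, the graph is partitioned across MPI ranks, discovered remote vertices are sent to their owning ranks, and each round ends with the global synchronization the paper identifies as a bottleneck. The graph and function names are assumptions made for this example.

```python
def bfs_levels(graph, source):
    """Level-synchronous BFS: expand the entire frontier each round.

    In the distributed setting, each round ends with a global
    synchronization across ranks -- one of the overheads the
    paper works to reduce.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                if v not in level:           # first visit claims the vertex
                    level[v] = depth + 1
                    next_frontier.append(v)  # kept locally here; sent to the
                                             # owning rank in the MPI version
        frontier = next_frontier
        depth += 1
    return level

# Small undirected example graph
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(g, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

In the hybrid implementation, multiple threads within a rank would process the frontier cooperatively while sending remote vertices independently, which is where polling for communication progress enters the picture.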