J. Jenkins, P. Balaji, J. Dinan, N. F. Samatova, R. Thakur, "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments," Preprint ANL/MCS-P2028-0212, February 2012. [pdf]
Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific and engineering computations. A particular challenge is the efficient transfer of noncontiguous data to and from GPU memory. MPI supports such transfers through the use of datatypes; however, an efficient means of utilizing datatypes for noncontiguous data in GPU memory is not currently known.
To address this gap, we present the design and implementation of an efficient MPI datatype-processing system, which is capable of processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-tractable format that exposes parallelism. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of datatypes that do not have a direct CUDA equivalent. These improvements are demonstrated to translate into significant improvements in end-to-end, GPU-to-GPU communication time.