"Speeding Up Nek5000 with Autotuning and Specialization"
J. Shin, M. W. Hall, J. Chame, C. Chen, P. F. Fischer, and P. D. Hovland
Proceedings of the 24th ACM International Conference on Supercomputing, Tsukuba, Ibaraki, Japan, ACM, pp. 253-262. Also Preprint ANL/MCS-P1705-1209.
Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation in order to select the best-performing solution for a particular architecture. Specialization optimizes code by customizing it to a particular class of input data sets. In this paper, we demonstrate how compiler-based autotuning that incorporates specialization for the expected data set sizes of key computations can be used to speed up Nek5000, a spectral-element code. Nek5000 makes heavy use of what are effectively Basic Linear Algebra Subprograms (BLAS) calls, but for very small matrices. Through autotuning and specialization, we achieve significant performance gains over hand-tuned libraries (e.g., Goto, ATLAS, and ACML BLAS). Additional performance gains are obtained by using higher-level compiler optimizations that aggregate multiple BLAS calls. We demonstrate performance gains of more than 2.2X on an Opteron over the original manually tuned implementation, and speedups of up to 1.26X on the entire application running on 256 nodes of the Cray XT5 Jaguar system at Oak Ridge.
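The two ideas in the abstract can be illustrated with a toy sketch. This is not the paper's compiler-based framework and does not reproduce its results; it is a minimal Python illustration, assuming a small square matrix multiply as the kernel. All names here (`matmul_ijk`, `matmul_ikj`, `make_specialized`, `autotune`) are hypothetical. Specialization is shown by generating a fully unrolled variant for one known size, and autotuning by timing every variant and keeping the fastest.

```python
import timeit


def matmul_ijk(a, b, n):
    """Generic n x n multiply on lists of lists, loops in i-j-k order."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_ikj(a, b, n):
    """Same computation with the j and k loops interchanged."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ci = c[i]
        for k in range(n):
            aik = a[i][k]
            bk = b[k]
            for j in range(n):
                ci[j] += aik * bk[j]
    return c


def make_specialized(n):
    """Specialization: generate a variant fully unrolled for one fixed size n."""
    lines = ["def matmul_unrolled(a, b, n):", "    return ["]
    for i in range(n):
        row = ", ".join(
            " + ".join(f"a[{i}][{k}]*b[{k}][{j}]" for k in range(n))
            for j in range(n)
        )
        lines.append(f"        [{row}],")
    lines.append("    ]")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source
    return namespace["matmul_unrolled"]


def autotune(variants, a, b, n, repeats=200):
    """Autotuning: time every variant on the given inputs, return the fastest."""
    timings = {}
    for name, fn in variants.items():
        timings[name] = min(
            timeit.repeat(lambda: fn(a, b, n), number=repeats, repeat=3)
        )
    best = min(timings, key=timings.get)
    return best, timings


n = 4  # assumed "expected data set size"; chosen only for illustration
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
variants = {
    "ijk": matmul_ijk,
    "ikj": matmul_ikj,
    "unrolled": make_specialized(n),
}
best, timings = autotune(variants, a, b, n)
print("fastest variant for n =", n, "is", best)
```

Which variant wins depends on the machine and the interpreter, which is the reason for measuring rather than predicting. The unrolled variant is valid only for the size it was generated for, so a real system would keep the generic version as a fallback for other sizes.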