G. Belter, E. Jessup, I. Karlin, T. Nelson, B. Norris, and J. Siek, "Exploring the Optimization Space for Build to Order Matrix Algebra," Preprint ANL/MCS-P1890-0511, May 2011. [pdf]
The Build to Order (BTO) system compiles a sequence of matrix and vector operations into a high-performance C program for a given architecture. We focus on optimizing programs where memory traffic is the bottleneck. Loop fusion and data parallelism play an important role in this context, but applying them at every opportunity does not necessarily lead to the best performance. We present an empirical and exhaustive characterization of the optimization space for these two optimizations reporting its size and how many points in the space are close to the fastest option. We show how optimizations of different parts of the program affect one another and how the best choices depend on the computer system. We also evaluate the suitability of several algorithms for searching the space. We leverage these findings to ensure that the BTO compiler produces kernels that outperform vendor-tuned BLAS on a variety of modern computer architectures.