Publications
D. Ozog, J. R. Hammond, J. Dinan, P. Balaji, S. Shende, A. Malony, "Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions," Preprint ANL/MCS-P3056-1112, November 2012. [pdf]
Good load-balancing methods are required in order to obtain scalability from the NWChem coupled-cluster module, which allows the detailed study of chemical problems by iteratively solving the Schrödinger equation with an accurate ansatz. In this application, a relatively large amount of task information can be obtained at minimal cost, which suggests that a static mapping of task groups to processors can be a simpler and more efficient alternative to centralized dynamic load balancing. The distributed tensor contractions are block sparse, and an a priori inspection can quickly distinguish non-null tasks and assign them cost estimates based on characteristics such as their dimensions. Architecture-specific and empirically driven performance models of the dominant SORT and DGEMM routines serve as cost estimators for a once-per-simulation static partitioning process. In this paper we demonstrate this inspector/executor technique, which improves the NWChem coupled-cluster module's execution time by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimates of kernel execution times are available.
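The inspector/executor idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation or NWChem code. The `Task` type, the rate constants in `estimate_cost`, and the greedy longest-task-first partitioner are all assumptions chosen for illustration; the paper's performance models are architecture-specific and empirically fitted, and its partitioning method may differ.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical tile-level contraction task: an (m x k) by (k x n) product."""
    m: int
    n: int
    k: int
    nonzero: bool  # False if the block is null and can be skipped


def estimate_cost(task, dgemm_rate=1.0e9, sort_rate=1.0e8):
    """Illustrative cost model in seconds (the rates are made-up placeholders).

    DGEMM is modeled as 2*m*n*k flops; SORT as one move per tile element.
    """
    dgemm_flops = 2.0 * task.m * task.n * task.k
    sort_elems = task.m * task.k + task.k * task.n + task.m * task.n
    return dgemm_flops / dgemm_rate + sort_elems / sort_rate


def inspect(tasks):
    """Inspector: keep only non-null tasks and attach a cost estimate to each."""
    return [(estimate_cost(t), i) for i, t in enumerate(tasks) if t.nonzero]


def partition(costed, nprocs):
    """Static partitioning: give the next-largest task to the least-loaded process.

    Returns (assignment, loads): task indices per process and the estimated
    total cost per process.
    """
    heap = [(0.0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(nprocs)}
    for cost, idx in sorted(costed, reverse=True):
        load, p = heapq.heappop(heap)  # currently least-loaded process
        assignment[p].append(idx)
        heapq.heappush(heap, (load + cost, p))
    loads = {p: load for load, p in heap}
    return assignment, loads
```

In this sketch, the executor phase would have each process iterate over `assignment[rank]` and perform only those contractions, with no shared task counter. Because the mapping is computed once, its cost is amortized over all iterations of the solver.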
