Overcoming the Power Wall
February 24, 2017
Power consumption is widely believed to be a major limitation as we advance toward exascale computing. A promising solution proposed in the past few years involved “inexact design” – trading application accuracy for large energy savings. As formulated, however, that approach required representations of floating-point numbers that were not readily available on commercial off-the-shelf (COTS) processors.
To address this limitation, researchers at Argonne National Laboratory, Rice University, and the École Polytechnique Fédérale de Lausanne developed a model capable of measuring the energy gains with all three IEEE-compliant precisions on COTS processors – double, single, and half.
“Trading precision for energy savings seems counterintuitive,” said Kazutomo Yoshii, a principal software development specialist at Argonne’s Mathematics and Computer Science Division. “But we felt that the potential benefits of the approach were worth exploring, particularly if we could extend the methodology to the world of COTS.”
For their study, the researchers constructed a microbenchmark consisting of a typical instruction sequence that computes a vectorized fused multiply-add. Two algorithms were devised: one to handle the single- and double-precision variants and another to handle half precision. Tests with a data array of 10^8 elements showed that going from double precision to half precision reduced energy consumption by a factor of 3.98.
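The kernel at the heart of such a microbenchmark is simply a multiply-add over three input arrays. The sketch below is illustrative only: the paper's benchmark executed native vectorized FMA instructions over 10^8 elements, whereas in Python the multiply and add are separate, unfused operations.

```python
def fma_kernel(a, b, c):
    """Compute d[i] = a[i] * b[i] + c[i] over three arrays.

    This is the arithmetic pattern a fused multiply-add kernel performs;
    on COTS hardware a compiled, vectorized version of this loop is
    emitted as FMA instructions operating on several elements at once.
    """
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]
```

Running the same kernel at double, single, or half precision changes only the storage format of the operands, not the structure of the loop, which is what makes it a clean basis for comparing energy per precision.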
The next step was to understand the source of the energy gains. For this part of the study, the researchers devised an energy-modeling tool that counts the various instruction categories. Comparing the total instruction counts showed that single precision executes half the instructions that double precision does. But this "factor of 2" effect does not exist when going from single to half precision, since the number of memory references as well as the number of fused multiply-add instructions is the same. Yet using half precision instead of single precision reduces the energy by a factor of 1.49.
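A toy accounting model reproduces the qualitative pattern described above. The assumptions here are illustrative, not taken from the paper: 256-bit vector registers, four memory references per result vector (three loads, one store), and half-precision data converted to and from single precision around each FMA.

```python
def count_instructions(n, precision):
    """Estimated dynamic instruction counts for d[i] = a[i]*b[i] + c[i].

    Toy model (assumptions, not the paper's measured counts): 256-bit
    vectors hold 4 doubles or 8 singles; 'half' data is stored in 16 bits
    but computed in single precision, adding explicit convert instructions.
    """
    compute_lanes = {"double": 4, "single": 8, "half": 8}[precision]
    vecs = n // compute_lanes
    mem = 4 * vecs                # 3 vector loads + 1 vector store
    fma = vecs                    # one FMA per result vector
    conv = 4 * vecs if precision == "half" else 0
    return {"mem": mem, "fma": fma, "conv": conv,
            "total": mem + fma + conv}
```

Under these assumptions, single precision executes exactly half the instructions of double precision, while half precision matches single precision in memory references and FMAs but executes *more* instructions in total because of the conversions, which is precisely the puzzle the researchers set out to explain.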
“This puzzled us – especially because the half-precision variant must execute conversion instructions that single precision does not, thus executing more instructions in total,” Yoshii said.
What the researchers determined, however, was that smaller precision means more values can fit in the cache. As a result, cache misses for half precision are reduced by a little more than a factor of 2 relative to single precision – outweighing the cost of the conversion instructions and still yielding a net energy savings.
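The cache-footprint argument can be checked with back-of-the-envelope arithmetic. Python's `struct` module exposes the IEEE element sizes directly (the half-precision format code `'e'` is standard in Python 3.6+); the 256 KiB cache size below is purely illustrative.

```python
import struct

# Bytes per element for the three IEEE-compliant formats.
sizes = {"double": struct.calcsize("d"),   # 8 bytes
         "single": struct.calcsize("f"),   # 4 bytes
         "half":   struct.calcsize("e")}   # 2 bytes

cache_bytes = 256 * 1024  # hypothetical 256 KiB cache, for illustration

for name, size in sizes.items():
    print(f"{name:6s}: {cache_bytes // size:7d} values fit in cache")
```

Twice as many half-precision values fit in the same cache as single-precision values, so a working set that spills out of cache at single precision can become largely cache-resident at half precision – which is what drives the drop in misses.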
Understanding precision and quality tradeoffs is critical to continued scaling of computing systems. The researchers hypothesize that algorithms could be designed so that one part could use reduced precision to save energy and that a different part could reinvest this saved energy to improve the overall quality of the application.
A paper based on this study, “Overcoming the Power Wall by Exploiting Inexactness and Emerging COTS Architectural Features,” by M. Fagan, J. Schlachter, K. Yoshii, S. Leyffer, K. Palem, M. Snir, S. Wild, and C. Enz, appeared in the proceedings of the 29th IEEE International System-on-Chip Conference.