As with the original study, we are interested in time spent in model physics as a function of location in the model domain (grid point) and as a function of location in the physics code (subroutine or routines). The objective of instrumenting and running the PCCM2 code was to produce a set of timing data that varied over four dimensions. Each datum in this set was the time of an interval between a timer-start and a timer-stop, in microseconds. For a given datum, two dimensions specified its coordinates in the model grid; one dimension specified the point in time (in time steps); and the last dimension specified the section of the code being timed. From this data, it was possible to make inferences about spatial load imbalances (over the first two dimensions) and temporal imbalances (over the third). In addition, the contributing routine or set of routines could be identified (over the last dimension).
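The four-dimensional structure of each timing datum can be sketched as a small record type. This is an illustrative sketch only; the field names are assumptions, not taken from the PCCM2 source.

```python
# Hypothetical layout of one datum in the four-dimensional timing data set.
# Field names are illustrative, not from the PCCM2 code.
from dataclasses import dataclass

@dataclass
class TimingDatum:
    lat: int        # grid coordinate, north/south
    lon: int        # grid coordinate, east/west
    step: int       # model time step
    section: str    # instrumented section of the physics code
    usec: float     # elapsed time between timer-start and timer-stop

# Spatial imbalance: compare usec across (lat, lon) at a fixed step and section.
# Temporal imbalance: compare usec across step at a fixed (lat, lon) and section.
```

Slicing the data set along different dimensions then yields the spatial, temporal, and per-routine views described above.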
Timers were added to the code at appropriate locations to obtain load data during execution. The sections to instrument were identified with the help of David Williamson and Jim Hack, who have been central to the development of CCM at NCAR and who are members of the working group that produced PCCM2 under the U.S. Department of Energy CHAMMP initiative. The first column of Table 1 shows the sections of the physics subtree that were separately instrumented. The physics routines fall into the following major categories:
To generate timing data in the two horizontal dimensions of the model grid, the model was decomposed as finely as possible over processors, so that the timing from each processor would serve as a point in the data set. Ideally, and to match the resolution of the original study, one would have a single timing per cell per time step; that is, each processor would compute and generate timings for a single grid point. At T42 resolution (64 latitudes by 128 longitudes), such a decomposition would require 4096 processors and thus is not feasible. However, the loop over latitude sits very high in the CCM call tree, outside the call to physics. Each processor was therefore assigned a number of latitudes, and each latitude was timed separately. In this way, the number of processors needed in the north/south dimension was reduced to two without affecting timer resolution. The timing runs were conducted on 128 processors of the Intel Touchstone DELTA computer, decomposing the grid over 2 processors in latitude and 64 processors in longitude, giving an effective timer resolution of two grid points per timing per time step for each instrumented section of the physics code.
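The decomposition arithmetic above can be checked directly. A minimal sketch, using only the numbers stated in the text (variable names are illustrative):

```python
# Decomposition arithmetic for the timing runs described above.
NLAT, NLON = 64, 128               # T42 grid: latitudes x longitudes
PLAT, PLON = 2, 64                 # processor grid used on the DELTA
nproc = PLAT * PLON                # total processors: 128
lat_per_proc = NLAT // PLAT        # 32 latitudes per processor, each timed separately
lon_per_proc = NLON // PLON        # 2 longitude points covered by each timing
partitions = nproc * lat_per_proc  # separately timed partitions per time step: 4096
```

Timing each of a processor's 32 latitudes separately is what recovers fine-grained resolution in latitude despite using only 2 processors in that dimension.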
The collected data was stored in each processor's memory until all the calculations for a time step were completed, and only then written to disk. This procedure prevented the overhead of writing the data from contaminating the load measurements. The instrumented code was run for one simulated day (72 time steps of 20 minutes each). The data from the first 36 time steps was discarded to avoid initialization effects on performance. The initial data corresponded to that of September 1, 1987.
The data for a representative time step in which all routines are active is given in Table 1. The table shows the maximum and minimum times reported by a 2-grid-cell partition in the simulation for each instrumented section of physics. The mean is the average time over all 4096 partitions. The standard deviation, σ, provides one measure of the imbalance between partitions. From the standpoint of how the imbalance affects parallel efficiency, a better measure of imbalance is the maximum divided by the mean. The mean (not the minimum) is the shortest time in which a module could execute if the load were perfectly balanced; the maximum is the time it actually takes with the unbalanced load configuration. The next section analyzes the contribution of the physics modules to load imbalance using this measure.
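The two imbalance measures can be computed directly from a set of per-partition timings. A minimal sketch (the data values are illustrative):

```python
# Imbalance measures discussed above: the standard deviation sigma, and
# max/mean, which reflects parallel inefficiency (mean is the per-partition
# time under perfect balance; max is what the slowest partition takes).
import statistics

def imbalance(timings):
    """Return (sigma, max/mean) for a list of per-partition times."""
    mean = statistics.fmean(timings)
    sigma = statistics.pstdev(timings)
    return sigma, max(timings) / mean

# A perfectly balanced load gives max/mean == 1.0 and sigma == 0.0.
sigma, ratio = imbalance([10.0, 10.0, 10.0, 10.0])
```

Under perfect balance, max equals mean and the ratio is exactly 1; any excess over 1 is the fraction of time processors would spend waiting on the slowest partition.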