On a sequential computer, the solution is obtained by traversing the entire domain. Parallel computing divides a task into smaller subtasks and assigns them to individual processors, which carry out the subtasks and communicate with each other when required. One method for dividing work between processors is domain (or data) decomposition. The domain can be decomposed by latitude or longitude alone (one-dimensional decomposition) or by both latitude and longitude (two-dimensional decomposition). The method of parallelizing the dynamics of an atmospheric (spectral) model is discussed in . A similar methodology has been employed for the parallel implementation of CCM2. The grid-point domain is patch-decomposed over processors in both the latitudinal and longitudinal dimensions, with the added constraint that each processor holds both northern latitudes and the corresponding southern latitudes (Figure 1). Latitudes that are symmetric about the equator are paired on each processor by the spectral transform algorithm. The decomposition of spectral space is not dealt with in this paper, since physics is computed only in grid space. PCCM2 is not decomposed in the vertical dimension.
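The two-dimensional patch decomposition with paired latitudes can be illustrated with a minimal sketch. This is not the PCCM2 code: the function name `decompose`, the zero-based index convention (latitude `j` mirrored by `nlat - 1 - j`), the grid size, and the requirement that the grid divide evenly over the processor mesh are all assumptions made for illustration.

```python
def decompose(nlat, nlon, p_lat, p_lon):
    """Map each processor (pi, pj) to the latitudes and longitudes it owns.

    Hypothetical helper: assumes nlat is even, that nlat // 2 is divisible
    by p_lat, and that nlon is divisible by p_lon.
    """
    half = nlat // 2
    if nlat % 2 or half % p_lat or nlon % p_lon:
        raise ValueError("grid does not divide evenly over the processor mesh")

    pairs_per_proc = half // p_lat
    lons_per_proc = nlon // p_lon
    layout = {}
    for pi in range(p_lat):
        first = range(pi * pairs_per_proc, (pi + 1) * pairs_per_proc)
        # Pair each latitude j with its mirror image about the equator.
        lats = sorted(list(first) + [nlat - 1 - j for j in first])
        for pj in range(p_lon):
            start = pj * lons_per_proc
            lons = list(range(start, start + lons_per_proc))
            layout[(pi, pj)] = (list(lats), lons)
    return layout


if __name__ == "__main__":
    # Illustrative grid and mesh sizes only.
    layout = decompose(nlat=64, nlon=128, p_lat=4, p_lon=8)
    lats, lons = layout[(0, 0)]
    print(len(layout), "processors,", len(lats), "latitudes x", len(lons), "longitudes each")
```

Under this layout every processor owns the same number of grid points, so any remaining imbalance comes from the cost per grid point rather than from the point count.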
When decomposing the model domain over processors, it is important that computational load be distributed as evenly as possible. Unevenly distributed load reduces parallel efficiency because processors with lighter load wait for more heavily loaded processors to finish. Therefore, it is necessary to analyze the variation of load during computation to better understand and correct load imbalance.
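The cost of uneven load can be made concrete with a small sketch. It uses one common measure, the ratio of mean to maximum per-processor time, which is an assumption here and not necessarily the metric used in this paper. The function names and the timing values in the example are hypothetical.

```python
def parallel_efficiency(times):
    """Return mean(times) / max(times) for per-processor compute times.

    A value of 1.0 means perfectly balanced load; lower values mean that
    lightly loaded processors spend time waiting for the slowest one.
    """
    if not times:
        raise ValueError("need at least one processor time")
    t_max = max(times)
    if t_max <= 0:
        raise ValueError("times must be positive")
    return (sum(times) / len(times)) / t_max


def idle_times(times):
    """Time each processor waits for the most heavily loaded processor."""
    t_max = max(times)
    return [t_max - t for t in times]


if __name__ == "__main__":
    # Hypothetical timings (seconds) for four processors in one time step.
    times = [1.0, 1.0, 1.0, 3.0]
    print("efficiency:", parallel_efficiency(times))
    print("idle time per processor:", idle_times(times))
```

In this example three of the four processors wait two seconds each for the fourth, so half of the available processor time is lost.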
A primary source of computational load imbalance in a global climate model is physics, because the computational load in physics varies with the state of the model variables. Studies conducted with sequential versions of CCM1 running on CRAY computers showed that the load in physics computations can vary for the following reasons:
The initial study of CCM was performed before the development of the parallel code, so it had to be conducted using the sequential model, with timers placed to capture the time spent at each grid point. Subsequent development of the parallel model, PCCM2, allowed a more direct approach, which is described in the next section: the mesh was decomposed over many processors, and the time spent on each processor was measured directly.