A good performance model, like a good scientific theory, is able to explain available observations and predict future circumstances, while abstracting unimportant details. Amdahl's law, empirical observations, and asymptotic analysis do not satisfy the first of these requirements. On the other hand, conventional computer system modeling techniques, which typically involve detailed simulations of individual hardware components, introduce too many details to be of practical use to parallel programmers. In the rest of this chapter, we introduce performance modeling techniques that provide an intermediate level of detail. These techniques are certainly not appropriate for all purposes: they are specialized for the multicomputer architecture and do not take into account, for example, cache behavior. However, they have been proven useful in a wide range of parallel algorithm design problems. The chapter notes provide references to other approaches.
The performance models considered here specify a metric such as execution time T as a function of problem size N , number of processors P , number of tasks U , and other algorithm and hardware characteristics:
We define the execution time of a parallel program as the time that elapses from when the first processor starts executing on the problem to when the last processor completes execution. This definition is not entirely adequate for a timeshared parallel computer but suffices for our purposes. During execution, each processor is computing, communicating, or idling, as illustrated in Figure 3.2. , , and are the time spent computing, communicating, and idling, respectively, on the i th processor. Hence, total execution time T can be defined in two ways: as the sum of computation, communication, and idle times on an arbitrary processor j ,
or as the sum of these times over all processors divided by the number of processors P ,
Figure 3.2: Activity plot during execution of a parallel program on eight processors. Each processor spends its time computing, communicating, or idling. T is the total execution time.
The latter definition is often more useful, since it is typically easier to determine the total computation and communication performed by a parallel algorithm rather than the time spent computing and communicating on individual processors.
Thus, the goal is to develop mathematical expressions that specify execution time as functions of N , P , etc. These models should be as simple as possible, while providing acceptable accuracy. We use the following techniques to reduce model complexity.
We first examine the three components of total execution time: computation time, communication time, and idle time.
The computation time of an algorithm ( is the time spent performing computation rather than communicating or idling. If we have a sequential program that performs the same computation as the parallel algorithm, we can determine by timing that program. Otherwise, we may have to implement key kernels.
Computation time will normally depend on some measure of problem size, whether that size is represented by a single parameter N or by a set of parameters, , , ..., . If the parallel algorithm replicates computation, then computation time will also depend on the number of tasks or processors. In a heterogeneous parallel computer (such as a workstation network), computation time can vary according to the processor on which computation is performed.
Computation time will also depend on characteristics of processors and their memory systems. For example, scaling problem size or number of processors can change cache performance or the effectiveness of processor pipelining. As a consequence, one cannot automatically assume that total computation time will stay constant as the number of processors changes.
The communication time of an algorithm () is the time that its tasks spend sending and receiving messages. Two distinct types of communication can be distinguished: interprocessor communication and intraprocessor communication. In interprocessor communication, two communicating tasks are located on different processors. This will always be the case if an algorithm creates one task per processor. In intraprocessor communication, two communicating tasks are located on the same processor. For simplicity, we assume that interprocessor and intraprocessor communication costs are comparable. Perhaps surprisingly, this assumption is not unreasonable in many multicomputers, unless intraprocessor communication is highly optimized. This is because the cost of the memory-to-memory copies and context switches performed in a typical implementation of intraprocessor communication is often comparable to the cost of an interprocessor communication. In other environments, such as Ethernet-connected workstations, intraprocessor communication is much faster.
In the idealized multicomputer architecture, the cost of sending a message between two tasks located on different processors can be represented by two parameters: the message startup time , which is the time required to initiate the communication, and the transfer time per (typically four-byte) word , which is determined by the physical bandwidth of the communication channel linking the source and destination processors. As illustrated in Figure 3.3, the time required to send a message of size L words is then
This idealized model of communication performance is adequate for many purposes but does break down in some situations. More detailed models are described in Section 3.7.
Figure 3.3: Simple communication cost model: . In this plot of time versus message length, the slope of the line corresponds to the cost per word transferred and the y-intercept to the message startup cost.
Table 3.1 lists approximate values for and for some parallel computers. Because these values tend to change rapidly as hardware and systems software evolve, they should be verified before being used in performance models. Notice the considerable variation in both and values. Clearly, different computers have very different communication performance characteristics.
The values in Table 3.1 were obtained either from the literature or by fitting Equation 3.1 to execution times measured for a small test program that sends messages back and forth between two processors. Figure 3.4 presents some representative experimental data obtained with this program. These times are for a single round trip and hence are twice those given by Equation 3.1. The impact of startup and per-word costs on communication time is clearly visible. Notice the irregularities in both Ethernet and Fiber Distributed Data Interconnect (FDDI) times for small messages, and the periodic jumps in Paragon times. These are due to details of the communication protocols and buffer management strategies used in communication libraries. Nevertheless, we see that Equation 3.1 is a reasonably accurate representation of message costs, particularly for larger messages.
Table 3.1: Approximate machine parameters for some parallel computers, in microseconds (sec). Some of these data provided by T. Dunigan.
Figure 3.4: Round-trip time for a single message between two processors as a function of message length on Ethernet-connected workstations, FDDI-connected workstations, Intel Paragon, and IBM SP1. Data provided by W. Gropp.
Notice that both the and terms are required in Equation 3.1. Asymptotically (for large L ) only the term is important; yet as is generally much larger than , the term can dominate in applications that send mostly small messages.
The values in Table 3.1 represent ``best achievable'' performance and in general may be used as a lower bound on communication costs when estimating performance. Applications with less regular or structured communication patterns may perform less well. In addition, the values in Table 3.1 do not incorporate other costs such as buffer management associated with message passing. However, these additional costs are typically proportional to the number and size of messages communicated. Hence, it is generally possible, by fitting Equation 3.1 to empirical data, to obtain system- and algorithm-dependent values for and for which Equation 3.1 is valid for a large range of problem and machine sizes. This procedure is applied in several examples later in this chapter.
Both computation and communication times are specified explicitly in a parallel algorithm; hence, it is generally straightforward to determine their contribution to execution time. Idle time () can be more difficult to determine, however, since it often depends on the order in which operations are performed.
A processor may be idle due to lack of computation or lack of data. In the first case, idle time may be avoided by using load-balancing techniques such as those introduced in Section 2.5.1. In the second case, the processor is idle while the computation and communication required to generate remote data are performed. This idle time can sometimes be avoided by structuring a program so that processors perform other computation or communication while waiting for remote data. This technique is referred to as overlapping computation and communication, since local computation is performed concurrently with remote communication and computation (Figure 3.5). Such overlapping can be achieved in two ways. A simple approach is to create multiple tasks on each processor. When one task blocks waiting for remote data, execution may be able to switch to another task for which data are already available. This approach has the advantage of simplicity but is efficient only if the cost of scheduling a new task is less than the idle time cost that is avoided. Alternatively, a single task can be structured so that requests for remote data are interleaved explicitly with other computation.
Figure 3.5: Overlapping computation with communication. Solid lines represent computation and dashed lines represent communication operations. In both (a) and (b), processor P1 generates a request to processor P2 at time t+2 and receives a reply at time t+8 . In both cases, the cost of actually sending the message is assumed to be 1 time unit. In (a), P1 has no other useful work to do while waiting for the reply and hence is idle for five time units after sending the message. In (b), P1 switches to another task as soon the request is generated. As this task requires five time units to complete, P1 is never idle.
Throughout this chapter, we use a parallel finite difference algorithm similar to the atmosphere model considered in Section 2.6 to illustrate how performance models are developed and used. For simplicity, we assume a grid of size points, where Z is the number of points in the vertical dimension. Initially, we assume that this grid is decomposed in one horizontal dimension and partitioned among P tasks, with each task responsible for a subgrid of size . Each task performs the same computation on each grid point and at each time step. Because the parallel algorithm does not replicate computation, we can model computation time in a single time step as
where is the average computation time at a single grid point.
As in Section 2.6, we consider a nine-point stencil, meaning that each task must exchange 2 N Z data points with two neighboring tasks, for a total of two messages and 4 N Z data. (We assume that each processor is allocated at least a 2 N subgrid; if not, communications will be required with more than two neighbors. Hence, the performance model that we develop does not apply on more than N/2 processors.) The total communication cost, summed over P processors, is
If P divides N and the amount of computation per grid point is a constant, idle time can be expected to be negligible in this example. In these circumstances, we can combine Equations 3.2 and 3.3 to obtain the following performance model:
Execution time is not always the most convenient metric by which to evaluate parallel algorithm performance. As execution time tends to vary with problem size, execution times must be normalized when comparing algorithm performance at different problem sizes. Efficiency---the fraction of time that processors spend doing useful work---is a related metric that can sometimes provide a more convenient measure of parallel algorithm quality. It characterizes the effectiveness with which an algorithm uses the computational resources of a parallel computer in a way that is independent of problem size. We define relative efficiency as
where is the execution time on one processor and is the time on P processors. The related quantity relative speedup,
is the factor by which execution time is reduced on P processors:
The quantities defined by Equations 3.5 and 3.6 are called relative efficiency and speedup because they are defined with respect to the parallel algorithm executing on a single processor. They are useful when exploring the scalability of an algorithm but do not constitute an absolute figure of merit. For example, assume that we have a parallel algorithm that takes 10,000 seconds on 1 processor and 20 seconds on 1000 processors. Another algorithm takes 1000 seconds on 1 processor and 5 seconds on 1000 processors. Clearly, the second algorithm is superior for P in the range 1 to 1000. Yet it achieves a relative speedup of only 200, compared with 500 for the first algorithm.
When comparing two algorithms, it can be useful to have an algorithm-independent metric other than execution time. Hence, we define absolute efficiency and speedup, using as the baseline the uniprocessor time for the best-known algorithm. In many cases, this ``best'' algorithm will be the best-known uniprocessor (sequential) algorithm. From this point forward, we shall frequently use the terms efficiency and speedup without qualifying them as relative or absolute. However, the context will always make clear which is meant.
In the finite difference algorithm, , and so from Equation 3.4 we have the following model for efficiency in the absence of load imbalances and when P divides N :
Because the uniprocessor algorithm is identical to the parallel algorithm when P=1 , this equation represents absolute efficiency.
© Copyright 1995 by Ian Foster