Next: 3.8 Input/Output Up: 3 A Quantitative Basis for Design Previous: 3.6 Evaluating Implementations

3.7 A Refined Communication Cost Model

Next we examine how the idealized communication cost model used in preceding sections can be extended to account for characteristics of realistic interconnection networks. We review a range of network architectures and develop a more detailed model of communication performance that takes into account the impact of competition for bandwidth on communication costs. This more detailed model is still idealized but can be significantly more accurate in some circumstances.

3.7.1 Competition for Bandwidth

In the idealized multicomputer architecture introduced in Chapter 1, the time required to send a message from one processor to another is independent of both processor location and the number of other processors that may be communicating at the same time. These assumptions are reflected in the communication cost model, Equation 3.1:

While accurate for many algorithms and on many architectures, this model can break down if a computer's interconnection network has properties different from the ideal, particularly if an application generates many messages. In these cases, it is necessary to develop a more detailed model of the interconnection network.

Most interconnection networks use fewer than wires to connect N processors. Hence, they must include routing nodes, or switches , to route messages from a source processor to a destination. A switching node may block or reroute messages when several messages require access to the same wire at the same time. The number of wires that must be traversed in order to get from one processor to another is termed the distance between those two processors. (The distance is equal to the number of switches plus one.) The maximum distance from any processor to any other processor is termed the diameter of the network. The distance between two processors and the length of the wires connecting them are not normally significant factors in determining performance, although networks with long wires may be more expensive to build. (Wire length can be important in networks extending over tens to thousands of kilometers, where the speed of light---about sec per kilometer in optical fiber---places a lower limit on communication latency.)

A factor that can have a significant impact on communication performance and which we study here in some depth is competition for bandwidth. Two processors may need to send data over the same wire at the same time. Typically, only one message can be transmitted simultaneously, so the other message will be delayed. However, for many practical purposes it suffices to think of the two processors as sharing the available bandwidth. Hence, we scale the data volume term of Equation 3.1 by S , the number of processors needing to send concurrently over the same wire:

The scaling factor reflects the idea that the effective bandwidth available to each processor is 1/S of the true bandwidth.

Equation 3.10 does not account for additional contention costs that may be incurred if messages collide and must be retransmitted. (Network researchers have developed sophisticated simulation techniques to account for such effects.) However, experience shows that Equation 3.10 is sufficiently accurate for many practical purposes.

The impact of competition for bandwidth is most severe in algorithms that execute synchronously, that is, algorithms in which all processors are sending and receiving messages at approximately the same time and in which processors cannot proceed with other computation while awaiting messages. The finite difference problem and many other SPMD algorithms have this property. In algorithms such as the search and Fock matrix construction algorithms described in Chapter 2, processors execute asynchronously and are less likely to compete for bandwidth.

3.7.2 Interconnection Networks

The value S in Equation 3.10 can depend on properties of both the parallel algorithm and the underlying interconnection network. In the following discussion, we use two examples to illustrate how the communication patterns of a particular algorithm can be analyzed to determine an approximate value for S on different networks. We first consider properties of interconnection networks.

Crossbar Switching Network.

A crossbar switch avoids competition for bandwidth by using switches to connect N inputs to N outputs (Figure 3.11). In this case, S=1 . Although highly nonscalable, crossbar switches are a popular mechanism for connecting small numbers of workstations, typically 20 or fewer. For example, the DEC GIGAswitch can connect up to 22 workstations. While larger crossbars can be constructed (for example, the Fujitsu VPP 500 uses a 224 224 crossbar to connect 224 processors), they are very expensive.

Figure 3.11: A 44 nonblocking crossbar, used here to connect 4 processors. On the right, two switching elements are expanded; the top one is set to pass messages through and the lower one to switch messages. Notice that each processor is depicted twice, and that any pair of processors can communicate without preventing other processor pairs from communicating.

Bus-based Networks.

In a bus-based network, processors share a single communication resource (the bus). A bus is a highly nonscalable architecture, because only one processor can communicate on the bus at a time. The competition factor S is equal to the number of processors trying to communicate simultaneously.

Figure 3.12: A bus-based interconnection network, here used to implement a shared-memory parallel computer. Each processor (P) is connected to the bus, which in turn is connected to the global memory. A cache associated with each processor stores recently accessed memory values, in an effort to reduce bus traffic.

Buses are commonly used in shared-memory parallel computers to communicate read and write requests to a shared global memory. In principle, the use of a global memory in a shared-memory computer simplifies parallel programming by making locality a nonissue. However, as discussed in Section 1.2.2, most shared-memory parallel computers introduce caches in an attempt to reduce bus traffic; hence, locality continues to be important.

Figure 3.13: An Ethernet LAN. Multiple computers are connected to a single Ethernet cable, which acts as a communication bus, carrying a single signal at a time.

Ethernet.

The Ethernet network often used in LANs to connect workstations or personal computers at a departmental level is another example of a bus-based interconnect. As noted in Table 3.1, standard Ethernet can provide network bandwidths of up to about 1 Mbytes per second. All computers connected via an Ethernet share a single communication channel (Figure 3.13). A computer that needs to send must wait until the channel is idle, then send its message; if it detects a collision, it waits a while and then retransmits. Since a computer requires exclusive access to the entire channel when sending a message, any algorithm that requires more than one processor to communicate concurrently will suffer reduced effective bandwidth. Hence, the term S in Equation 3.10 is, as in other bus-based networks, equal to the number of simultaneous senders. The impact of Ethernet bandwidth limitations on performance is illustrated in the examples that follow.

Figure 3.14: A two-dimensional torus interconnection network. This is a 2-D mesh with end-around connections so that each processor is connected to four neighbors.

Mesh Networks.

A mesh network can be thought of as a crossbar switch (Figure 3.11) in which processors are associated with switching elements rather than being placed on the edge of the mesh. In a mesh network of dimension D , each nonboundary processor is connected to 2D immediate neighbors. Connections typically consist of two wires, one in each direction. Two- and three-dimensional meshes are commonly used in parallel computing. They have the advantage over some more sophisticated networks that they can be constructed in three-dimensional space without long wires. In a 2-D mesh, a message is communicated from processor (i,j) to processor (k,l) in steps. One-, two- and three-dimensional cubic meshes of P processors have diameters of P-1 , , and and contain 2(P-1) , , and wires, respectively. As illustrated in Figure 3.14, these diameters can be halved by extending a mesh with toroidal connections so that boundary processors are also connected with neighbors. However, the torus has two disadvantages. First, longer wires are needed for the end-around connections in the 3-D case. (The need for longer wires can be avoided in a 2-D torus by folding the mesh.) Second, a subset of a torus is not a torus, so the benefits of the toroidal connections are lost if a torus-connected computer is partitioned among several users.

Figure 3.15: Competition for bandwidth in a 1-D mesh. In (a), processors P0 and P1 communicate and P2 and P3 communicate. Because the two communications use different wires, both can proceed concurrently. In (b), processors P0 and P2 communicate and P1 and P3 communicate. The two communications must both traverse the wire connecting P1 and P2; hence, the two communications cannot proceed concurrently, and S=2 . In (c), processors P0 and P2 communicate and P3 and P1 communicate. Because each connection is bidirectional, the two communications can proceed concurrently.

Competition for bandwidth in a mesh network occurs when two or more processors attempt to send over the same wire at the same time (Figure 3.15). The analysis used to determine S for a particular algorithm is illustrated in the examples that follow.

Figure 3.16: Hypercubes of dimension zero through four. The processors in the cubes of dimension 1, 2, and 3 are labeled with integers, here represented as binary numbers. Notice that two processors are neighbors in dimension d if and only if their binary labels differ only in the d th place. Notice also that in a hypercube of dimension d , a message can be routed between any pair of processors in at most d hops.

Hypercube Network.

The hypercube network was introduced in Section 2.4.1. As in the mesh, processors in a hypercube network are associated with switching elements. A d -dimensional hypercube connects each of processors to d other processors. A hypercube can be defined recursively as follows (Figure 3.16). A zero-dimensional hypercube is a single processor and a one-dimensional hypercube connects two zero-dimensional hypercubes. Generally, a hypercube of dimension d+1 is constructed by connecting corresponding processors in two hypercubes of dimension d . As with the mesh, the competition-for-bandwidth factor S is algorithm dependent, although the greater number of wires in the hypercube means that competition for bandwidth tends to occur less often.

The many interesting properties of hypercubes are beyond the scope of this book (but see Chapter 11). However, we note that when labeled as shown in Figure 3.16, two processors are connected if and only if the binary representation of their labels differs in a single position. We exploit this property when specifying algorithms that use the hypercube communication structure. Another important feature of a hypercube is that it contains a mesh: it may be considered a mesh with additional, long-distance connections. The additional connectivity reduces the diameter to d and increases the number of available wires, particularly for nonlocal communication. A disadvantage of the hypercube interconnect from an engineering point of view is that it is more complex than the mesh. In particular, it requires more and longer wires, since a hypercube with dimension greater than three cannot be laid out in three-dimensional space so that wires connect only physically adjacent processors.

Figure 3.17: Example multistage interconnection networks. Shaded circles represent processors and unshaded circles represent crossbar switches. The network on the left has k=2 and n=3 ; on the right, k=4 and n=2 . The network can be constructed from unidirectional switches and links, in which case it is folded so that the processors on the left and right are the same. Alternatively, it can be constructed from bidirectional switches and links, in which case processors on the left and right are distinct.

Multistage Interconnection Networks.

In a multistage interconnection network (MIN), as in a crossbar, switching elements are distinct from processors. However, fewer than switches are used to connect P processors. Instead, messages pass through a series of switch stages. Figure 3.17 illustrates two MINs, which are representatives of a general class of networks characterized by parameters n and k . These networks are sometimes referred to as radix k , dimension n butterflies, or k -ary n -flies. Either n stages of unidirectional crossbar switches connect processors, or n stages of bidirectional crossbar switches connect processors. In the latter case, each link comprises two channels that carry data in opposite directions, and each crossbar switch can route data arriving on any of 2k inputs to any of 2k outputs. Notice that each stage of these networks connects P inputs with P outputs, although not every input is directly connected to every output in each stage.

Figure 3.18: Communications in a bidirectional MIN. The communication indicated at (a) involves processors connected to the same crossbar; it takes just two hops and passes through a single switch. The communication at (b) takes three hops and passes through two switches.

In a unidirectional MIN, all messages must traverse the same number of wires, and so the cost of sending a message is independent of processor location. In effect, all processors are equidistant. In a bidirectional MIN, the number of wires traversed depends to some extent on processor location, although to a lesser extent than in a mesh or hypercube (Figure 3.18).

The fact that messages destined for different destinations may need to pass over the same wire means that MINs are not immune to competition for bandwidth. Nevertheless, a MIN connecting P processors typically provides P wires at each stage, so in principle we should be able to organize communications so that little competition occurs.

Example . Competition for Bandwidth in Finite Difference:

In the first of two examples, we consider the impact of competition for bandwidth in an algorithm with a high degree of locality: the one-dimensional finite difference algorithm examined in preceding sections. Recall from Equation 3.3 that according to the idealized model of Equation 3.1, the per-processor communication costs for this algorithm are

Figure 3.19: Speedup of finite difference code with N=512 and Z=5 as measured on Ethernet-connected IBM RS6000 workstations and as predicted both by a simple performance model that does not take into account competition for bandwidth and by a more sophisticated model that does. Both models assume that sec and sec.

Competition for bandwidth is not an issue on a mesh or hypercube because the one-dimensional ring-based communication structure of the finite difference problem can be embedded in these networks using only nearest-neighbor connections. On a bus-based network, only one of the P processors can communicate at one time; if we assume that in the communication phase of the algorithm, half the processors need to send at once (the other half are receiving), then S=P/2 and the communication volume term must be scaled by a factor of P/2 , giving

Figure 3.19 illustrates both the large impact that bandwidth limitations can have on the performance of even a simple algorithm such as finite difference and the improved accuracy of the refined performance model. The figure shows performance measured on Ethernet-connected workstations and as predicted by Equations 3.3 and 3.11. We see that the more sophisticated model is reasonably accurate.

Example . Competition for Bandwidth in Butterfly:

As a second example, we consider an algorithm in which P tasks use the butterfly (or hypercube) communication structure illustrated in Figure 2.14 to perform exchanges of N/P data. The summation algorithm described in Section 2.4.1 has this form. Other algorithms with similar characteristics are described in Chapter 11.

Per-processor communication costs associated with this algorithm are, in the absence of competition for bandwidth,

The algorithm can, of course, execute without competition for bandwidth on a crossbar switch. Somewhat less obviously, it can also execute without competition for bandwidth on a P -processor hypercube: Computation and communication can be organized so that each of the processors with which a processor must communicate is a neighbor on one of the hypercube links. On a bus-based network, only one processor can communicate at a time; hence, as in the finite difference algorithm considered in Example 3.7, we assume S=P/2 and from Equation 3.10 we have

Figure 3.20: Execution of the butterfly summation algorithm on an eight-processor, one-dimensional mesh. Shading is used to highlight a single task and its communication partners, which are one, two, and four hops distant.

On a mesh, the limited number of wires becomes an issue. For example, on a one-dimensional mesh of P processors, each processor generates messages that must traverse 1, 2, ..., hops in the p steps of the algorithm (Figure 3.20). These messages travel a total of hops. This represents the number of wires to which each processor requires exclusive access during execution of the summation. As a one-dimensional bidirectional mesh provides only 2(P-1) wires, we see that the parallel algorithm cannot possibly proceed in less than P/2 steps rather than steps as supposed previously. In fact, it can proceed in P/2 steps only if we can define a communication schedule that keeps all wires busy all the time. Hence, the following model represents a lower bound on communication costs:

Figure 3.21: Performance of parallel FFT in a spectral transform code on a one-dimensional mesh in Intel DELTA and on Ethernet-connected RS/6000 processors. The simple models do not take into account competition for bandwidth; the refined models do, and give a better fit to observed performance.

Figure 3.21 compares observed speedups with those predicted by the simple and bandwidth-limited performance models on a one-dimensional mesh and on an Ethernet. These results are from an atmosphere modeling code that uses a parallel fast Fourier transform (FFT) to parallelize a numerical method called the spectral transform. The details of the numerical method are not important here; what is relevant is that at each step, the code must perform two butterfly communication operations (specifically, FFT) on a large array. Details of the two experiments are given in Table 3.6. (The term used on the DELTA is significantly smaller than in the finite difference code of Example 3.6; this reflects the fact that the communication code in the FFT implementation on the DELTA had been carefully optimized.)

Table 3.6: Parameters for butterfly performance study ( N in words, times in sec).

Next: 3.8 Input/Output Up: 3 A Quantitative Basis for Design Previous: 3.6 Evaluating Implementations