In the remainder of this chapter, we apply the modular design
techniques discussed in preceding sections in three case studies. We
start with an example from image processing, which we
use to study design tradeoffs that can arise when constructing
parallel programs from several components. We consider the problem of
applying a series of * convolution
* operations to a sequence
of images. Images, represented as arrays of size
* N*
* N*
, are input in pairs on streams * A*
and
* B*
; convolution generates a new array of the same size that is output
on stream * C*
(Figure 4.4). A single convolution
operation involves the transformation of two input arrays using
independent two-dimensional fast Fourier transforms (2-D FFTs), a
pointwise multiplication of the two transformed arrays, and the
transformation of the resulting array using an inverse 2-D FFT,
thereby generating an output array. A 2-D FFT performs 1-D FFTs first
on each row and then on each column of an array. A 1-D Fourier
transform, , of a sequence of * N*
values,
, is given by

where . The FFT exploits symmetry to perform this computation in steps, each involving operations.

**Figure 4.4:** * Dataflow diagram for an image-processing pipeline.
Two streams of images, A
and B
, are passed through FFT
modules and then into an inverse FFT module, which first multiplies
them and then applies an inverse FFT.*

We first consider the three components from which the convolution algorithm is constructed: forward 2-D FFT, multiplication, and inverse 2-D FFT. The pointwise multiplication is the simplest: Its communication requirements are zero as long as the arrays on which it operates have the same data distribution.

**Figure 4.5:** * Two parallel algorithms for computing a series of
2-D FFTs. In each case, the activity on each of four
processors (P0--3) is shown over time, with arrows denoting communication and I
and O denoting input and output operations. The algorithm illustrated
in the upper part of the figure is a sequential composition of program
components that perform 1-D FFTs, first on rows and then on columns of
each input 2-D array; all-to-all communication is required to
transpose the array after performing row FFTs and before performing
column FFTs. In the second algorithm, data flow from a first set of
processors performing row FFTs (P0, P1) to a second set performing
column FFTs (P2, P3). Communication is required to move data from P0
and P1 to P2 and P3.*

A variety of parallel algorithms are possible for the forward and inverse 2-D FFTs. A fine-grained algorithm can exploit concurrency within the 1-D FFTs performed on individual rows and columns of the input array, at the cost of considerable communication. A more coarse-grained algorithm performs independent 1-D FFTs in parallel, thereby avoiding a need for communication within the 1-D FFTs, but requiring communication when moving from a row-based to a column-based decomposition. We consider two algorithms based on the latter strategy. The first processes the input image stream sequentially, performing row FFTs and then column FFTs on each image in turn. The second algorithm pipelines the image stream, performing column FFTs for one image in parallel with the row FFTs for the next (Figure 4.5). These two algorithms are in effect sequential and parallel compositions, respectively, of code fragments that perform 1-D FFTs on the rows and columns of a two-dimensional array.

The first parallel algorithm is termed the * transpose
*
algorithm, since it performs a series of one-dimensional transforms on
* P*
processors while the array is partitioned in one dimension,
then transposes the array and performs transforms in the second
dimension using the same processors. The transpose operation requires
that each processor send one message of size to each of the
* P-1*
other processors. Hence, total communication costs summed
over * P*
processors are

The second algorithm is termed the * pipeline
* algorithm, since it partitions processors into
two sets of size * P/2*
which perform FFTs on rows and columns,
respectively. Each processor in the first set must communicate with
each processor in the other set, for a total of messages. The
entire array is communicated. Hence, total communication costs are

Notice that communication costs are not necessarily distributed
equally among processors in the second algorithm, since the sending and
receiving processors form distinct groups. Nevertheless,
Equations 4.1 and 4.2 give a rough idea of the
relative performance of the two algorithms. The second algorithm
sends significantly fewer messages and hence should be more efficient
in situations in which message startup costs are dominant, for example,
when * N*
and/or are small or when
* P*
or are large. On the other hand, the first algorithm
probably incurs lower data transfer costs and hence may be superior in
other situations.

Having designed two alternative algorithms for the 2-D FFT, we now
consider the parallel convolution algorithm proper. Its four
components---two parallel 2-D FFTs, one matrix multiplication, and one
inverse 2-D FFT (Figure 4.4)---can be combined using
either sequential or parallel composition. If sequential composition
is used, the parallel convolution algorithm can be represented as
follows, with the ` fft` and ` fft` calls invoking the
transpose 2-D parallel FFT.

for each image

A = fft(A)

B = fft(B)

C = A.B

C = fft(C)

endfor

If the input to this algorithm is decomposed appropriately (in one dimension, by rows), then because each FFT involves a transpose, total communication requirements are three times the cost of a single transpose:

Notice that because the forward FFT operates first on rows and then on columns, the inverse FFT must operate first on columns and then on rows, so as to avoid the need for an additional transpose operation between the forward and inverse FFTs.

If parallel composition is used, the three FFTs execute concurrently, each on one third of the available processors. (Because the multiplication involves rather than operations, we regard it as insignificant and compose it sequentially with the inverse FFT.) Communication costing is required to move data from the processors handling the forward FFTs to the processors handling the inverse FFT.

The 2-D FFTs within the parallel composition can be implemented by
using either the transpose or pipeline algorithms, yielding two
algorithm variants. Costs are as specified by Equations 4.1
and 4.2, except that each algorithm is executed three times,
with * P/3*
rather than * P*
processors involved in each
execution. Combining these costs with the cost of the data movement
between components, we obtain the following models.

**Table 4.1:** Approximate total message counts and data volumes for
three parallel convolution algorithms, summed over * P*
processors
and assuming that * P*
is reasonably large.

**Figure 4.6:** * Performance of the sequential/transpose (Sequential)
and parallel/transpose (Parallel) convolution algorithms on an IBM
SP computer for different problem sizes and numbers of processors.
The latter algorithm is more efficient for smaller problems and larger
numbers of processors.*

The results of this brief analysis are summarized in Table 4.1. We see that the second and third algorithms perform fewer message startups but send more data. Hence, they can be expected to be more efficient on smaller problems and larger numbers of processors. This result can be confirmed experimentally, as illustrated in Figure 4.6. A flexible parallel program might incorporate all three algorithms and select the appropriate alternative at runtime. Of course, a programming tool that supports only sequential composition will allow only the first algorithm to be used.

The convolution problem illustrates the design tradeoffs that can arise when constructing even relatively simple parallel programs. These tradeoffs can arise at multiple levels. First, we must identify candidate parallel algorithms for the component modules: in this case, the 2-D FFT. Then, we must decide how to compose these building blocks so as to construct a complete parallel algorithm. Aspects of the complete design in turn influence the techniques used within components, requiring for example that we operate on columns before rows in the inverse FFT. The performance analysis must take into account all these design decisions.