In this chapter, we have focused on the software engineering question of developing a parallel algorithm design from a given problem specification. We have assumed some familiarity with program design---from, for example, previous study in sequential programming. A classic article by Parnas and Clements [222] describes the benefits (and difficulties) of a rational design process. The chapter notes in Chapters 3 and 4 provide additional references to material on the related topics of performance analysis and modularity.

Relatively few authors have addressed the particular problems that arise when designing algorithms for realistic scalable parallel computers. Numerous books discuss parallel algorithm design in the context of the idealized PRAM model; see, for example, Akl [8], Gibbons and Reitter [119], and JáJá [157]. However, these techniques are for the most part not directly applicable to multicomputers. The books by Fox et al. [111,113] and Kumar et al. [181] provide relevant material. Carriero and Gelernter [48] give an introduction to parallel program design in the Linda language. They distinguish between agenda, result, and specialist parallelism, which can be thought of as domain decomposition and two different forms of functional parallelism, respectively. See also the references in Chapter 12.

The mapping problem has received considerable attention in the
computer science and scientific computing literature. Bokhari shows
that it is * NP
*-complete [39]. The recursive
coordinate bisection algorithm is due to Berger and
Bokhari [33], and the unbalanced recursive bisection
algorithm is due to Jones and Plassmann [161]. A related
algorithm is the recursive spectral bisection algorithm of Pothen,
Simon, and Liou [230], which uses connectivity information to
identify partitions of unstructured meshes with low communication
requirements, at the cost of an eigenvalue computation. This
algorithm has proven very effective, although more expensive than the
other recursive algorithms. Simon [258] and
Williams [293] compare different algorithms for partitioning
unstructured meshes, including coordinate bisection, graph bisection,
and spectral bisection. Barnard and Simon [28] describe a less
expensive multilevel version of spectral bisection. The mesh in
Figure 2.9 is generated using the finite octree technique of
Shephard and Georges [256]. Fox et al. [111] and Nicol
and Saltz [213] describe the use of cyclic
decompositions. Lin and Keller [190] describe a local
algorithm. The local algorithm described in the text is termed a *
receiver-initiated
* strategy. In an alternative *
sender-initiated
* strategy, workers with excess work send it to
other workers. Tantawi and Towsley [279] compare
sender-initiated and receiver-initiated strategies and show that the
former gives better performance if workers are often idle and that
the latter performs better when load is heavy. Other relevant papers
include those by Berman and Snyder [34];
Chowdhury [61]; Cybenko [68]; Hac [131];
Heath, Rosenberg, and Smith [141]; Kumar, Grama, and
Rao [180]; Lo [191]; and Sadayappan and
Ercal [250]. Dijkstra, Feijen, and Gasteren [81],
Rokusawa et al. [246], and Kumar et al. [179] describe
distributed termination-detection algorithms.

Real atmosphere models are of course more complex than the system considered in Section 2.6. Washington and Parkinson [292] provide a good introduction to the numerical methods and algorithms used in climate modeling on sequential computers. A workshop held at the European Center for Medium-range Weather Forecasting surveyed issues involved in executing weather and climate models on parallel computers [155]. The parallel version of the Community Climate Model is described by Drake et al. [86]. Michalakes [206] analyzes load imbalances in climate models and Foster and Toonen [109] describe load-balancing algorithms.

Banerjee [26] describes parallel algorithms for VLSI design. Wimer et al. [294] and Arvindam et al. [17] describe branch-and-bound search algorithms and domain-specific optimizations that can improve performance on floorplanning problems. Reinefeld and Schnecke [243] describe the algorithm variant in which workers redundantly expand several tree levels before selecting nodes for local expansion. Kumar et al. [179,181,239] provide a wealth of material on the design, implementation, and analysis of parallel search algorithms. Quinn [234] also examines branch-and-bound search and describes and analyzes the performance of four different load balancing strategies. For a general introduction to search algorithms, see Nilsson [214], Pearl [225], and Kumar and Kanal [164].

Hehre et al. [142] provide an introduction to * ab
initio
* quantum
chemistry. Feyereisen and Kendall [96] describe a replicated
data algorithm for the Fock matrix construction problem. Colvin et
al. [62] describe an algorithm based on domain decomposition
techniques. An algorithm that uses distributed data structures and a
centralized task scheduler is described by Harrison et
al. [108,135].

Here is a Web Tour providing access to additional information on parallel program design and software engineering.