Leighton [187] discusses basic properties of hypercubes and describes many parallel algorithms that use a hypercube communication structure. The recursive halving vector reduction algorithm considered in Section 11.2 is described by Fox et al. [111] and the hybrid algorithm by van de Geijn [289]. Vector broadcast algorithms are described by Bertsekas and Tsitsiklis [35] and Johnsson and Ho [160]. The parallel mergesort algorithm, often called bitonic mergesort, is due to Batcher [30]; Akl [8] provides a good description. Fox et al. [111] describe a multicomputer implementation of this algorithm and of the variant discussed in Exercise 4. Other algorithms that are conveniently formulated in terms of a hypercube communication structure include the fast Fourier transform [20,21,110,192,277], parallel prefix [238], and various computer vision [238] and linear algebra computations [159]. See also the book by Kumar et al. [179] and papers by Bertsekas et al. [36], McBryan and van der Velde [197], and Saad and Schultz [248,249].

The book by Kumar et al. [179] describes parallel versions of several sorting algorithms, including quicksort [153], one of the most commonly used sorting algorithms on sequential computers. Their parallel quicksort algorithm partitions processors according to the number of elements lesser than or greater than the pivot at each step. Fox et al. [111] describe an alternative parallel quicksort algorithm that uses regular processor partitions. They address load imbalance by performing some preliminary processing to identify pivots that approximately bisect the input sequence. Knuth [174] and Aho et al. [7] are good references for sequential sorting algorithms. In addition to standard algorithms such as mergesort and quicksort, Knuth describes various specialized algorithms designed to exploit certain properties of the data to be sorted. Kumar et al. [179] explain how many of these can be adapted for parallel computers. For example, if data are highly redundant (that is, they have many identical items), we can count items on each node, then do a global sum to obtain total counts. If the input data distribution is known, a parallel bucketsort can be used. Each processor knows the location of each bucket and sends its data to the appropriate location.

Here is a Web Tour providing access to additional information on modular programming, parallel program design, and templates.