Sorting is a common and important problem in computing. Given a sequence of N data elements, we are required to generate an ordered sequence that contains the same elements. Here, we present a parallel version of the well-known mergesort algorithm. The algorithm assumes that the sequence to be sorted is distributed and so generates a distributed sorted sequence. For simplicity, we assume that N is an integer multiple of P , that the N data are distributed evenly among P tasks, and that is an integer power of two. Relaxing these assumptions does not change the essential character of the algorithm but would complicate the presentation.
Figure 11.4: Mergesort, used here to sort the sequence [6,2,9,5]. The two partition phases each split the input sequence; the two merge phases each combine two sorted subsequences generated in a previous phase.
The sequential mergesort algorithm is as follows; its execution is illustrated in Figure 11.4.
The merge operation employed in step (4) combines two sorted subsequences to produce a single sorted sequence. It repeatedly compares the heads of the two subsequences and outputs the lesser value until no elements remain. Mergesort requires time to sort N elements, which is the best that can be achieved (modulo constant factors) unless data are known to have special properties such as a known distribution or degeneracy.
We first describe two algorithms required in the implementation of parallel mergesort: compare-exchange and parallel merge.
A compare-exchange operation merges two sorted sequences of length M , contained in tasks A and B . Upon completion of the operation, both tasks have M data, and all elements in task A are less than or equal to all elements in task B . As illustrated in Figure 11.5, each task sends its data to the other task. Task A identifies the M lowest elements and discards the remainder; this process requires at least M/2 and at most M comparisons. Similarly, task B identifies the M highest elements.
Figure 11.5: The compare-exchange algorithm, with M=4 . (a) Tasks A and B exchange their sorted subsequences. (b) They perform a merge operation to identify the lowest and highest M elements, respectively. (c) Other elements are discarded, leaving a single sorted sequence partitioned over the two tasks.
Notice that a task may not need all M of its neighbor's data in order to identify the M lowest (or highest) values. On average, only M/2 values are required. Hence, it may be more efficient in some situations to require the consumer to request data explicitly. This approach results in more messages that contain a total of less than M data, and can at most halve the amount of data transferred.
Figure 11.6: The parallel merge operation, performed in hypercubes of dimension one, two, and three. In a hypercube of dimension d , each task performs d compare-exchange operations. Arrows point from the ``high'' to the ``low'' task in each exchange.
A parallel merge algorithm performs a merge operation on two sorted sequences of length , each distributed over tasks, to produce a single sorted sequence of length distributed over tasks. As illustrated in Figure 11.6, this is achieved by using the hypercube communication template. Each of the tasks engages in d+1 compare-exchange steps, one with each neighbor. In effect, each node executes Algorithm 11.1, applying the following operator at each step.
if ( myid AND > 0 ) then
state = compare_exchange_high(state,message)
state = compare_exchange_low(state,message)
In this code fragment, AND is a bitwise logical and operator, used to determine whether the task is ``high'' or ``low'' in a particular exchange; myid and i are as in Algorithm 11.1.
We next describe the parallel mergesort algorithm proper. Each task in the computation executes the following logic.
procedure parallel_mergesort(myid, d, data, newdata) begin data = sequential_mergesort(data) for dim = 1 to d data = parallel_merge(myid, dim, data) endfor newdata = data end
First, each task sorts its local sequence using sequential mergesort. Second, and again using the hypercube communication structure, each of the tasks executes the parallel merge algorithm d times, for subcubes of dimension 1.. d . The i th parallel merge takes two sequences, each distributed over tasks, and generates a sorted sequence distributed over tasks. After d such merges, we have a single sorted list distributed over tasks.
Parallel mergesort uses the hypercube communication template at multiple levels. We review these uses and develop a performance model. We assume N data distributed over tasks (that is, ), with N an integer multiple of P . Hence, the total number of compare-exchanges is
Because each compare-exchange requires one message containing N/P data, the per-processor communication cost is
The computation costs comprise the initial intraprocessor sort and the comparisons performed during the interprocessor communication phase. The former involves a total of comparisons, while the latter requires at most comparisons, thereby giving computation costs summed over P processors of
Because the algorithm is perfectly balanced, we can assume that idle time is negligible. Thus, we obtain the following model for parallel execution time:
© Copyright 1995 by Ian Foster