We first provide a general introduction to data parallelism and data-parallel languages, focusing on concurrency, locality, and algorithm design.
Depending on the programming language used, the data ensembles operated on in a data-parallel program may be regular (e.g., an array) or irregular (e.g., a tree or sparse matrix). In F90 and HPF, the data structures operated on are arrays. In contrast, the data-parallel language pC++ allows programs to operate not only on arrays but also on trees, sets, and other more complex data structures.
Concurrency may be implicit or may be expressed by using explicit parallel constructs. For example, the F90 array assignment statement is an explicitly parallel construct; we write
A = B*C    ! A, B, C are arrays

to specify that each element of array A is to be assigned the product of the corresponding elements of arrays B and C. This statement also implies conformality; that is, the three arrays have the same size and shape. In contrast, the following do-loop is implicitly parallel: a compiler may be able to detect that the various do-loop iterations are independent (meaning that one iteration does not write a variable read or written by another) and hence can be performed in parallel, but this detection requires some analysis.
do i = 1,m
   do j = 1,n
      A(i,j) = B(i,j)*C(i,j)
   enddo
enddo
A data-parallel program is a sequence of explicitly and implicitly parallel statements. On a distributed-memory parallel computer, compilation typically translates these statements into an SPMD program (Section 1.3.2), in which each processor executes the same code on a subset of the data structures. In many cases, the compiler can construct this program by first partitioning data structures into disjoint subdomains, one per processor, and then applying the owner computes rule to determine which operations should be performed on each processor. This rule states that the computation required to produce a datum is performed on the processor on which that datum is located. However, compilers can also use different techniques.
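The owner computes rule can be illustrated with a small simulation. The following Python sketch is not compiler output and its names (`n_procs`, `owner`, the block partition) are invented for this example; it simply shows each simulated processor performing only the assignments for the elements it owns.

```python
# Illustrative sketch of the owner computes rule for A = B*C
# under a block partition. All names here are invented for the example.

n = 16
n_procs = 4
block = n // n_procs          # elements per processor

B = [float(i) for i in range(n)]
C = [2.0] * n
A = [0.0] * n

def owner(i):
    """Processor that owns element i under a block distribution."""
    return i // block

# Each simulated processor p computes exactly the elements it owns;
# B(i) and C(i) are local to the same processor, so no communication
# is needed for this assignment.
for p in range(n_procs):
    for i in range(n):
        if owner(i) == p:
            A[i] = B[i] * C[i]
```

Because corresponding elements of A, B, and C share an owner, every operation is local; a different alignment of the three arrays would turn each assignment into a communication.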
Compilation also introduces communication operations when computation mapped to one processor requires data mapped to another processor. As an example, consider the following program.
real y, s, X(100)              ! y, s scalars; X an array
X = X*y                        ! Multiply each X(i) by y
do i = 2,99
   X(i) = (X(i-1) + X(i+1))/2  ! Communication required
enddo
s = SUM(X)                     ! Communication required
The communication requirements of this program depend on how the variables X, y, and s are distributed over processors. If X is distributed while y and s are replicated, then the first assignment can proceed without communication, with each X(i) being computed by the processor that owns X(i). The second assignment (in the do-loop) requires communication: the processor computing X(i) requires the values of X(i-1) and X(i+1), which may be located on different processors. The summation also requires communication.
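The communication implied by the do-loop can be made concrete with a small simulation. The sketch below (invented names; a simplified stand-in for what a compiler would generate) computes, for a block distribution of X over four processors, which off-processor "halo" elements each processor must fetch before it can update its portion of X.

```python
# Illustrative sketch: which remote elements does each processor need
# to compute X(i) = (X(i-1) + X(i+1))/2 under a block distribution?
# All names are invented for this example.

n = 100
n_procs = 4
block = n // n_procs          # 25 elements per processor

def owner(i):
    return i // block

# Fortran elements X(2)..X(99) correspond to 0-based indices 1..98.
needed = {p: set() for p in range(n_procs)}
for i in range(1, n - 1):
    p = owner(i)
    for neighbor in (i - 1, i + 1):
        if owner(neighbor) != p:
            needed[p].add(neighbor)
```

Each interior processor needs exactly one element from the neighbor on either side, so a compiler can satisfy the loop's communication requirements with a single boundary exchange per neighbor.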
Data placement is an essential part of a data-parallel algorithm, since the mapping of data to processors determines the locality of data references and hence, to a large extent, the performance of a parallel program. For example, the simple array assignment A = B*C either can proceed without any communication or can require communication for every assignment, depending on whether corresponding elements of the arrays A, B, and C are located on the same or different processors.
Identifying the best distribution of the various data structures operated on by a data-parallel program is a global optimization problem and not generally tractable. Hence, data-parallel languages often provide the programmer with the ability to define how data structures are to be distributed. In HPF, the DISTRIBUTE directive fulfills this function. The statements
!HPF$ PROCESSORS pr(16)
      real X(1024)
!HPF$ DISTRIBUTE X(BLOCK) ONTO pr

indicate that the array X is to be distributed in a blocked fashion over 16 processors: processor 0 gets the first 1024/16 elements, processor 1 the second 1024/16 elements, and so on.
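The mapping implied by a BLOCK distribution is easy to state arithmetically. The following Python sketch (the name `block_owner` is invented for illustration) computes the processor that owns X(i) for the declaration above.

```python
# Sketch of the element-to-processor mapping implied by
# DISTRIBUTE X(BLOCK) ONTO pr, for X(1024) and 16 processors.
# The function name is invented for this example.

n, n_procs = 1024, 16
block = n // n_procs          # 64 elements per processor

def block_owner(i):
    """Processor owning X(i), with Fortran-style 1-based index i."""
    return (i - 1) // block

# Processor 0 owns X(1)..X(64); processor 15 owns X(961)..X(1024).
```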
The data-parallel programming model is both higher level and more restrictive than the task/channel model introduced in Part I. It is higher level in that the programmer is not required to specify communication structures explicitly: these are derived by a compiler from the domain decomposition specified by the programmer. It is more restrictive because not all algorithms can be specified in data-parallel terms. For these reasons, data parallelism, although important, is not a universal parallel programming paradigm.
Despite these differences between the task/channel and data-parallel programming models, the program design techniques developed in Part I are still applicable. Recall that we have decomposed the design process into partitioning, communication, agglomeration, and mapping phases. Data-parallel languages address the first phase directly, by providing implicit and explicit constructs that can be used to specify a very fine grained partition of a computation, in which one task is created for each data element. As noted in Section 2.2, a key concern at this stage is to define a partition that identifies sufficient concurrency.
The PCAM design strategy of Chapter 2 requires that once we have developed a fine-grained partition, we determine communication requirements, agglomerate fine-grained tasks into larger tasks, and map tasks to processors. The beauty of the data-parallel approach is that the latter issues can be addressed at a particularly high level, by means of directives, rather than in terms of explicit communication and mapping operations. Directives indicate how arrays are to be aligned and distributed over processors and hence specify agglomeration and mapping. Communication operations are not specified explicitly by the programmer but are instead inferred by the compiler from the program.
The translation from fine-grained source program and directives to executable (typically SPMD) program is an automatic process that is performed by the data-parallel compiler. Nevertheless, the programmer must understand the essential characteristics of this translation in order to write efficient code and identify inefficient constructs. For example, an inappropriate choice of directives may lead to load imbalances or unnecessary communication. Alternatively, a data-parallel compiler may fail to recognize opportunities for concurrent execution. Generally, a data-parallel language compiler can be expected to generate reasonably efficient code when a program's communication structure is regular and local. Programs involving irregular and global communication patterns are less likely to be compiled efficiently. These and other performance issues are addressed in Section 7.7.
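The load-imbalance hazard mentioned above can be quantified with a small experiment. Suppose iteration i performs i units of work, as in a triangular loop nest; the Python sketch below (invented names, not HPF semantics) compares a BLOCK-style and a CYCLIC-style assignment of iterations to processors.

```python
# Illustrative sketch: effect of distribution choice on load balance
# when iteration i costs i units of work. All names are invented.

n, n_procs = 1024, 16
work = list(range(1, n + 1))           # work[i] = cost of iteration i+1

def imbalance(assignment):
    """Ratio of the most loaded processor to the average load."""
    loads = [0] * n_procs
    for i, p in enumerate(assignment):
        loads[p] += work[i]
    return max(loads) / (sum(loads) / n_procs)

block  = [i // (n // n_procs) for i in range(n)]   # BLOCK-style
cyclic = [i % n_procs for i in range(n)]           # CYCLIC-style

# BLOCK gives the last processor nearly twice the average work,
# while CYCLIC spreads the triangular workload almost evenly.
```

Here a BLOCK distribution leaves the last processor with almost twice the average load, while a CYCLIC distribution is balanced to within about 2 percent, illustrating why the choice of directive can matter as much as the algorithm itself.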
Finally, we note that the modular design techniques introduced in Chapter 4 apply to data-parallel programs. The issues involved are straightforward. Because data-parallel programs have sequential semantics, we need be concerned only with sequential composition of program components (Section 4.2.2). The primary concern in a parallel environment is the choice of data distributions in components that are to be composed. This issue is discussed in Section 7.5.
In the remainder of this chapter, we first briefly introduce F90 and then describe HPF. Much of this material also applies to other data-parallel languages. The chapter notes provide pointers to relevant documentation.
F90 is a data-parallel programming language in its own right. Its array assignment statement and array intrinsic functions can be used to specify certain classes of data-parallel computations. Our main interest in F90, however, is that it forms the basis for HPF, which augments F90 with data distribution directives, a FORALL statement, the INDEPENDENT directive, and new intrinsics. Array assignment, the FORALL statement, and the INDEPENDENT directive are used to identify concurrency in an algorithm, while data distribution directives specify how data should be placed on physical processors and hence provide control over locality.
Although HPF defines only a small set of extensions to F90, it is nevertheless a complex language. The extensions have numerous variants and can interact with F90 constructs and with each other in a wide variety of ways. In the interests of succinctness and clarity, our presentation does not aim for completeness but rather seeks to present the essential ideas required to understand data-parallel programming and the HPF constructs required to write simple programs. These constructs are taken from the official HPF subset, which should in principle be supported by all HPF compilers.
© Copyright 1995 by Ian Foster