Although MPI is a complex and multifaceted system, we can solve a wide range of problems using just six of its functions! We introduce MPI by describing these six functions, which initiate and terminate a computation, identify processes, and send and receive messages:
MPI_INIT : Initiate an MPI computation.
MPI_FINALIZE : Terminate a computation.
MPI_COMM_SIZE : Determine number of processes.
MPI_COMM_RANK : Determine my process identifier.
MPI_SEND : Send a message.
MPI_RECV : Receive a message.
Function parameters are detailed in Figure 8.1. In this and subsequent figures, the labels IN, OUT, and INOUT indicate whether the function uses but does not modify the parameter ( IN), does not use but may update the parameter ( OUT), or both uses and updates the parameter ( INOUT).
Figure 8.1: Basic MPI. These six functions suffice to write a wide range of parallel programs. The arguments are characterized as having mode IN or OUT and as having type integer, choice, handle, or status. These terms are explained in the text.
All but the first two calls take a communicator handle as an argument. A communicator identifies the process group and context with respect to which the operation is to be performed. As explained later in this chapter, communicators provide a mechanism for identifying process subsets during development of modular programs and for ensuring that messages intended for different purposes are not confused. For now, it suffices to provide the default value MPI_COMM_WORLD, which identifies all processes involved in a computation. Other arguments have type integer, datatype handle, or status. These datatypes are explained in the following.
The functions MPI_INIT and MPI_FINALIZE are used to initiate and shut down an MPI computation, respectively. MPI_INIT must be called before any other MPI function and must be called exactly once per process. No further MPI functions can be called after MPI_FINALIZE.
The functions MPI_COMM_SIZE and MPI_COMM_RANK determine the number of processes in the current computation and the integer identifier assigned to the current process, respectively. (The processes in a process group are identified with unique, contiguous integers numbered from 0.) For example, consider the following program. This is not written in any particular language: we shall see in the next section how to call MPI routines from Fortran and C.
MPI_INIT() Initiate computation
MPI_COMM_SIZE(MPI_COMM_WORLD, count) Find # of processes
MPI_COMM_RANK(MPI_COMM_WORLD, myid) Find my id
print("I am", myid, "of", count) Print message
MPI_FINALIZE() Shut down
The MPI standard does not specify how a parallel computation is started. However, a typical mechanism could be a command line argument indicating the number of processes that are to be created: for example, myprog -n 4, where myprog is the name of the executable. Additional arguments might be used to specify processor names in a networked environment or executable names in an MPMD computation.
If the above program is executed by four processes, we will obtain something like the following output. The order in which the output appears is not defined; however, we assume here that the output from individual print statements is not interleaved.
I am 1 of 4 I am 3 of 4 I am 0 of 4 I am 2 of 4
Finally, we consider the functions MPI_SEND and MPI_RECV, which are used to send and receive messages, respectively. A call to MPI_SEND has the general form MPI_SEND(buf, count, datatype, dest, tag, comm)
and specifies that a message containing count elements of the specified datatype starting at address buf is to be sent to the process with identifier dest. As will be explained in greater detail subsequently, this message is associated with an envelope comprising the specified tag, the source process's identifier, and the specified communicator ( comm).
A call to MPI_RECV has the general form
MPI_RECV(buf, count, datatype, source, tag, comm, status)
and attempts to receive a message that has an envelope corresponding to the specified tag, source, and comm, blocking until such a message is available. When the message arrives, elements of the specified datatype are placed into the buffer at address buf. This buffer is guaranteed to be large enough to contain at least count elements. The status variable can be used subsequently to inquire about the size, tag, and source of the received message (Section 8.4).
Program 8.1 illustrates the use of the six basic calls. This is an implementation of the bridge construction algorithm developed in Example 1.1. The program is designed to be executed by two processes. The first process calls a procedure foundry and the second calls bridge, effectively creating two different tasks. The first process makes a series of MPI_SEND calls to communicate 100 integer messages to the second process, terminating the sequence by sending a negative number. The second process receives these messages using MPI_RECV.
Much of the discussion in this chapter will be language independent; that is, the functions described can be used in C, Fortran, or any other language for which an MPI library has been defined. Only when we present example programs will a particular language be used. In that case, programs will be presented using the syntax of either the Fortran or C language binding. Different language bindings have slightly different syntaxes that reflect a language's peculiarities. Sources of syntactic difference include the function names themselves, the mechanism used for return codes, the representation of the handles used to access specialized MPI data structures such as communicators, and the implementation of the status datatype returned by MPI_RECV. The use of handles hides the internal representation of MPI data structures.
In the C language binding, function names are as in the MPI definition but with only the MPI prefix and the first letter of the function name in upper case. Status values are returned as integer return codes. The return code for successful completion is MPI_SUCCESS; a set of error codes is also defined. Compile-time constants are all in upper case and are defined in the file mpi.h, which must be included in any program that makes MPI calls. Handles are represented by special defined types, defined in mpi.h. These will be introduced as needed in the following discussion. Function parameters with type IN are passed by value, while parameters with type OUT and INOUT are passed by reference (that is, as pointers). A status variable has type MPI_Status and is a structure with fields status.MPI_SOURCE and status.MPI_TAG containing source and tag information. Finally, an MPI datatype is defined for each C datatype: MPI_CHAR, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, etc.
In the Fortran language binding, function names are in upper case. Function return codes are represented by an additional integer argument. The return code for successful completion is MPI_SUCCESS; a set of error codes is also defined. Compile-time constants are all in upper case and are defined in the file mpif.h, which must be included in any program that makes MPI calls. All handles have type INTEGER. A status variable is an array of integers of size MPI_STATUS_SIZE, with the constants MPI_SOURCE and MPI_TAG indexing the source and tag fields, respectively. Finally, an MPI datatype is defined for each Fortran datatype: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX, MPI_LOGICAL, MPI_CHARACTER, etc.
The pairwise interactions algorithm of Section 1.4.2 illustrate the two language bindings. Recall that in this algorithm, T tasks ( T an odd number) are connected in a ring. Each task is responsible for computing interactions involving N data. Data are circulated around the ring in T-1 phases, with interactions computed at each phase. Programs 8.2 and 8.3 are C and Fortran versions of an MPI implementation, respectively.
The number of processes created is specified when the program is invoked. Each process is responsible for 100 objects, and each object is represented by three floating-point values, so the various work arrays have size 300. As each process executes the same program, the first few lines are used to determine the total number of processes involved in the computation ( np), the process's identifier ( myid), and the identify of the process's neighbors in the ring ( lnbr, rnbr). The computation then proceeds as described in Section 1.4.2 but with messages sent to numbered processes rather than on channels.
Before proceeding to more sophisticated aspects of MPI, we consider the important topic of determinism. Message-passing programming models are by default nondeterministic: the arrival order of messages sent from two processes, A and B, to a third process, C, is not defined. (However, MPI does guarantee that two messages sent from one process, A, to another process, B, will arrive in the order sent.) It is the programmer's responsibility to ensure that a computation is deterministic when (as is usually the case) this is required.
In the task/channel programming model, determinism is guaranteed by defining separate channels for different communications and by ensuring that each channel has a single writer and a single reader. Hence, a process C can distinguish messages received from A or B as they arrive on separate channels. MPI does not support channels directly, but it does provide similar mechanisms. In particular, it allows a receive operation to specify a source, tag, and/or context. (Recall that these data constitute a message's envelope.) We consider the first two of these mechanisms in this section.
The source specifier in the MPI_RECV function allows the programmer to specify that a message is to be received either from a single named process (specified by its integer process identifier) or from any process (specified by the special value MPI_ANY_SOURCE). The latter option allows a process to receive data from any source; this is sometimes useful. However, the former is preferable because it eliminates errors due to messages arriving in time-dependent order.
Message tags provide a further mechanism for distinguishing between different messages. A sending process must associate an integer tag with a message. This is achieved via the tag field in the MPI_SEND call. (This tag has always been set to 0 in the examples presented so far.) A receiving process can then specify that it wishes to receive messages either with a specified tag or with any tag ( MPI_ANY_TAG). Again, the former option is preferable because it reduces the possibility of error.
To illustrate the importance of source specifiers and tags, we examine a program that fails to use them and that, consequently, suffers from nondeterminism. Program 8.4 is part of an MPI implementation of the symmetric pairwise interaction algorithm of Section 1.4.2. Recall that in this algorithm, messages are communicated only half way around the ring (in T/2-1 steps, if the number of tasks T is odd), with interactions accumulated both in processes and in messages. As in Example 8.1, we assume 100 objects, so the arrays to be communicated in this phase have size 100.3.2=600. In a final step, each message (with size 100.3=300) is returned to its originating process. Hence, each process sends and receives N/2-1 data messages and one result message.
Program 8.4 specifies neither sources nor tags in its MPI_RECV calls. Consequently, a result message arriving before the final data message may be received as if it were a data message, thereby resulting in an incorrect computation. Determinism can be achieved by specifying either a source processor or a tag in the receive calls. It is good practice to use both mechanisms. In effect, each ``channel'' in the original design is then represented by a unique (source, destination, tag) triple.
© Copyright 1995 by Ian Foster