The Channel Interface



At the lowest level, what is really needed is just a way to transfer data, possibly in small amounts, from one process's address space to another's. Although many implementations are possible, the specification can be done with a small number of definitions. The channel interface, described in more detail in [28], consists of only five required functions. Three routines send and receive envelope (or control) information: MPID_SendControl, MPID_RecvAnyControl, and MPID_ControlMsgAvail; two routines send and receive data: MPID_SendChannel and MPID_RecvFromChannel. (One can use MPID_SendControlBlock instead of or along with MPID_SendControl; it can be more efficient to use the blocking version for implementing blocking calls.) Other routines, which might be available in specially optimized implementations, are used only when the macros that signal their availability are defined. These include various forms of blocking and nonblocking operations for both envelopes and data.

These operations are based on a simple capability to send data from one process to another process. No more functionality is required than what is provided by Unix in the select, read, and write operations. The ADI code uses these simple operations to provide the operations, such as MPID_Post_recv, that are used by the MPI implementation.
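To make the "select, read, and write" claim concrete, the following is a minimal sketch of how a control-message availability test and a byte transfer might be layered on those three calls. It assumes a POSIX system and an already-connected file descriptor (a pipe or socket). The names toy_control_msg_avail, toy_send_bytes, and toy_recv_bytes are hypothetical and are not the MPICH routines; they only illustrate the kind of capability that MPID_ControlMsgAvail, MPID_SendChannel, and MPID_RecvFromChannel require.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has data ready to read, 0 if not, -1 on error.
   A zero timeout turns select() into a non-blocking poll. */
int toy_control_msg_avail(int fd)
{
    fd_set readable;
    struct timeval no_wait = {0, 0};

    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    if (select(fd + 1, &readable, NULL, NULL, &no_wait) < 0)
        return -1;
    return FD_ISSET(fd, &readable) ? 1 : 0;
}

/* Hypothetical helper: write exactly len bytes, retrying on short writes. */
int toy_send_bytes(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: read exactly len bytes, retrying on short reads. */
int toy_recv_bytes(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would typically poll with toy_control_msg_avail on the read end of a connection and call toy_recv_bytes only once it reports that data is ready, so that a nonblocking operation never stalls in read.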

The issue of buffering is a difficult one. We could have defined an interface that assumed no buffering, requiring the ADI that calls this interface to perform the necessary buffer management and flow control. The rationale for not making this choice is that many of the systems used for implementing the interface defined here do maintain their own internal buffers and flow controls, and implementing another layer of buffer management would impose an unnecessary performance penalty.

The channel interface implements three different data exchange mechanisms.

Eager
In the eager protocol, data is sent to the destination immediately. If the destination is not expecting the data (e.g., no MPI_Recv has yet been issued for it), the receiver must allocate some space to store the data locally.

This choice often offers the highest performance, particularly when the underlying implementation provides suitable buffering and handshakes. However, it can cause problems when large amounts of data are sent before their matching receives are posted, causing memory to be exhausted on the receiving processors.

This is the default choice in MPICH.
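The receiver-side bookkeeping that the eager protocol implies can be sketched as follows. This is an illustration under stated assumptions, not the MPICH code: matching is by tag only, at most one receive can be pending at a time, and every name (toy_msg, toy_receiver, toy_eager_arrive, toy_post_recv) is hypothetical. The buffered_bytes counter shows where the memory-exhaustion risk comes from: each message that arrives before its receive costs a malloc and a copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct toy_msg {
    int tag;
    size_t len;
    struct toy_msg *next;
    char data[];               /* payload copied at arrival time */
} toy_msg;

typedef struct {
    toy_msg *head, *tail;      /* FIFO of unexpected messages */
    size_t buffered_bytes;     /* memory held for unmatched messages */
    int has_posted;            /* sketch: at most one pending receive */
    int posted_tag;
    void *posted_buf;
    size_t posted_cap;
} toy_receiver;

/* An eager message arrives. Returns 1 if it went straight into a posted
   receive, 0 if it had to be buffered, -1 on overflow or out of memory. */
int toy_eager_arrive(toy_receiver *rx, int tag, const void *data, size_t len)
{
    if (rx->has_posted && rx->posted_tag == tag) {
        if (len > rx->posted_cap)
            return -1;
        memcpy(rx->posted_buf, data, len);
        rx->has_posted = 0;
        return 1;
    }
    toy_msg *m = malloc(sizeof *m + len);
    if (m == NULL)
        return -1;             /* the exhaustion case described above */
    m->tag = tag;
    m->len = len;
    m->next = NULL;
    memcpy(m->data, data, len);
    if (rx->tail)
        rx->tail->next = m;
    else
        rx->head = m;
    rx->tail = m;
    rx->buffered_bytes += len;
    return 0;
}

/* Post a receive. Returns 1 if satisfied from the unexpected queue (at the
   cost of a second copy), 0 if left pending, -1 on error. */
int toy_post_recv(toy_receiver *rx, int tag, void *buf, size_t cap)
{
    toy_msg **link = &rx->head, *prev = NULL;
    for (toy_msg *m = rx->head; m; prev = m, link = &m->next, m = m->next) {
        if (m->tag != tag)
            continue;
        if (m->len > cap)
            return -1;
        memcpy(buf, m->data, m->len);
        *link = m->next;       /* unlink the matched message */
        if (rx->tail == m)
            rx->tail = prev;
        rx->buffered_bytes -= m->len;
        free(m);
        return 1;
    }
    if (rx->has_posted)
        return -1;
    rx->has_posted = 1;
    rx->posted_tag = tag;
    rx->posted_buf = buf;
    rx->posted_cap = cap;
    return 0;
}
```

In this sketch, a message that finds its receive already posted is copied once, while an unexpected message is copied twice and occupies receiver memory until the matching receive is posted.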

Rendezvous
In the rendezvous protocol, data is sent to the destination only when requested (the control information describing the message is always sent). When a receive is posted that matches the message, the destination sends the source a request for the data and, along with that request, provides a way for the sender to return the data.

This choice is the most robust but, depending on the underlying system software, may be less efficient than the eager protocol. Some legacy programs may fail when run using a rendezvous protocol if an algorithm is unsafely expressed in terms of MPI_Send. Such a program can be safely expressed in terms of MPI_Bsend, but at a possible cost in efficiency. That is, the user may desire the semantics of an eager protocol (messages are buffered on the receiver) with the performance of the rendezvous protocol (no copying), but since buffer space is exhaustible and MPI_Bsend may have to copy, the user may not always be satisfied.

MPICH can be configured to use this protocol by specifying -use_rndv during configuration.
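The rendezvous handshake can be sketched in a single process as follows. This is a simplified model, not the MPICH implementation: function calls stand in for the control messages, matching is by tag only, and all names (toy_envelope, rndv_sender, rndv_receiver, rndv_send_start, rndv_send_data, rndv_post_recv) are hypothetical. The point to notice is that the receiver stores only envelopes, and the payload moves once, directly into the user's buffer, after the receive is posted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PENDING 8

typedef struct { int tag; size_t len; int send_id; } toy_envelope;

typedef struct {
    const void *data[TOY_MAX_PENDING];  /* user buffers, not copied */
    size_t len[TOY_MAX_PENDING];
    int in_use[TOY_MAX_PENDING];        /* 1 until the data is requested */
} rndv_sender;

typedef struct {
    toy_envelope env[TOY_MAX_PENDING];  /* envelopes only, no payload */
    int count;
} rndv_receiver;

/* Step 1: the sender ships only the envelope. Returns the send id, or -1
   if either side has no free slot. */
int rndv_send_start(rndv_sender *s, rndv_receiver *r,
                    int tag, const void *data, size_t len)
{
    if (r->count == TOY_MAX_PENDING)
        return -1;
    for (int id = 0; id < TOY_MAX_PENDING; id++) {
        if (s->in_use[id])
            continue;
        s->in_use[id] = 1;
        s->data[id] = data;
        s->len[id] = len;
        r->env[r->count++] = (toy_envelope){ tag, len, id };
        return id;
    }
    return -1;
}

/* Step 3: on request, the sender moves the data straight to dest. */
void rndv_send_data(rndv_sender *s, int id, void *dest)
{
    memcpy(dest, s->data[id], s->len[id]);
    s->in_use[id] = 0;                  /* send is now complete */
}

/* Step 2: a receive is posted. If a matching envelope is present, request
   the data. Returns 1 on delivery, 0 if nothing matches yet, -1 on error. */
int rndv_post_recv(rndv_receiver *r, rndv_sender *s,
                   int tag, void *buf, size_t cap)
{
    for (int i = 0; i < r->count; i++) {
        if (r->env[i].tag != tag)
            continue;
        if (r->env[i].len > cap)
            return -1;
        rndv_send_data(s, r->env[i].send_id, buf);
        /* Remove the envelope while preserving arrival order. */
        memmove(&r->env[i], &r->env[i + 1],
                (size_t)(r->count - i - 1) * sizeof r->env[0]);
        r->count--;
        return 1;
    }
    return 0;
}
```

Because the sender's slot stays in use until the data is requested, a send in this model cannot complete before its matching receive is posted, which is why a program that relies on MPI_Send being buffered can stall under this protocol.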

Get
In the get protocol, data is read directly by the receiver. This choice requires a method to directly transfer data from one process's memory to another. A typical implementation might use memcpy.

This choice offers the highest performance but requires special hardware support such as shared memory or remote memory operations. In many ways, it functions like the rendezvous protocol, but uses a different set of routines to transfer the data.

To implement this protocol, special routines must be provided to prepare the address for remote access and to perform the transfer. The implementation of this protocol allows data to be transferred in several pieces, for example, allowing arbitrarily sized messages to be transferred using a limited amount of shared memory. The routine MPID_SetupGetAddress is called by the sender to determine the address to send to the destination. In shared-memory systems, this may simply be the address of the data (if all memory is visible to all processes) or the address in shared memory where all (or some) of the data has been copied. In systems with special hardware for moving data between processors, it may be the appropriate handle or object.
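The piecewise transfer described above can be sketched as follows, assuming a fixed-size staging window that stands in for a limited shared-memory region. Everything here is hypothetical: toy_setup_get_address is only loosely modeled on the role of MPID_SetupGetAddress (whose actual signature is not given here), and a real implementation would run sender and receiver in separate processes and synchronize before the window is reused.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_WINDOW_BYTES 16

typedef struct {
    char window[TOY_WINDOW_BYTES];  /* stands in for shared memory */
    const char *src;                /* sender's message */
    size_t total;                   /* bytes in the message */
    size_t offset;                  /* bytes already staged */
} toy_get_sender;

/* Sender side: stage the next piece in the window. Returns the address the
   receiver should read from and sets *piece_len, or returns NULL when the
   whole message has been staged. */
const void *toy_setup_get_address(toy_get_sender *s, size_t *piece_len)
{
    size_t left = s->total - s->offset;
    size_t n = left < TOY_WINDOW_BYTES ? left : TOY_WINDOW_BYTES;
    if (n == 0) {
        *piece_len = 0;
        return NULL;
    }
    memcpy(s->window, s->src + s->offset, n);
    s->offset += n;
    *piece_len = n;
    return s->window;
}

/* Receiver side: read pieces directly from the window until the message is
   complete. Returns the number of pieces, or -1 if dest is too small. */
int toy_get_message(toy_get_sender *s, void *dest, size_t cap)
{
    if (s->total > cap)
        return -1;
    char *out = dest;
    int pieces = 0;
    size_t n;
    const void *addr;
    while ((addr = toy_setup_get_address(s, &n)) != NULL) {
        memcpy(out, addr, n);       /* the "get": a direct memory read */
        out += n;
        pieces++;
    }
    return pieces;
}
```

With a 16-byte window, a 40-byte message moves in three pieces (16, 16, and 8 bytes), so the message size is bounded by the receiver's buffer rather than by the size of the shared region.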


MPICH includes multiple implementations of the channel interface (see Figure 8).


Figure 8: Lower layers of MPICH

Chameleon
Perhaps the most significant implementation is the Chameleon version, which was particularly important during the initial phase of MPICH implementation. By implementing the channel interface in terms of Chameleon [31] macros, we provide portability to a number of systems at one stroke, with no additional overhead, since Chameleon macros are resolved at compile time. Chameleon macros exist for most vendor message-passing systems, and also for p4, which in turn is portable to very many systems. A newer implementation of the channel interface is a direct TCP/IP interface, not involving p4.

Shared memory
A completely different implementation of the channel interface has been done (portably) for a shared-memory abstraction, in terms of a shared-memory malloc and locks. There are, in turn, multiple (macro) implementations of the shared-memory implementation of the channel interface. This is represented as the p2 box in Figure 8.

Specialized
Some vendors (SGI and HP-Convex, at present) have implemented the channel interface directly, without going through the shared-memory portability layer. This approach takes advantage of particular memory models and operating system features that the shared-memory implementation of the channel interface does not assume are present.

SCI
A specialized implementation of the channel interface has been developed for an implementation of the Scalable Coherent Interface [40] from Dolphin Interconnect Solutions, which provides portability to a number of systems that use it [39].

Contrary to some descriptions of MPICH that have appeared elsewhere, MPICH has never relied on the p4 version of the channel interface for portability to massively parallel processors. From the beginning, the MPP (IBM SP, Intel Paragon, TMC CM-5) versions used the macros provided by Chameleon. We rely on the p4 implementation only for the workstation networks, and a p4-independent version for TCP/IP will be available soon.


