One of the goals of MPI was to define the semantics of the message passing operations so that no unnecessary data motion was required. The MPICH implementation has shown this goal to be achievable. On two different shared-memory systems, MPICH achieves a single copy directly from user-buffer to user-buffer. In both cases, the operating system had to be modified slightly to allow a process to directly access the address space of another process. On distributed memory systems, two vendors were able to achieve the same result by providing vendor-specific implementations of the ADI.
In actual use, users have noticed performance irregularities; these indicate areas where implementations need more work. For example, the implementation of MPI_Bsend in MPICH always copies data into the user-provided buffer; for small messages, such copying is not always necessary (it may be possible to deliver the message without blocking). This can have a significant effect on latency-sensitive calculations. Different methods for handling short, intermediate, and long messages are also needed and are under development.
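The MPI_Bsend observation can be illustrated with a minimal sketch, which is not MPICH code. It assumes a hypothetical transport hook (`sketch_try_send_fn`) that reports whether a message can be handed off immediately without blocking, and a hypothetical size cutoff (`short_limit`) for "small" messages. All `sketch_` names are invented for illustration. Only when the direct hand-off is not possible does the sketch copy the data into the user-provided buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping for the buffer the user attached for buffered sends. */
typedef struct {
    char  *base;  /* start of the user-provided buffer */
    size_t size;  /* total capacity in bytes */
    size_t used;  /* bytes currently occupied by pending messages */
} sketch_bsend_pool;

typedef enum {
    SKETCH_SENT_DIRECT,  /* delivered without touching the buffer */
    SKETCH_BUFFERED,     /* copied into the user-provided buffer */
    SKETCH_NO_SPACE      /* buffer too small for this message */
} sketch_bsend_result;

/* Hypothetical transport hook: returns true if the message was handed
 * off immediately, without blocking the caller. */
typedef bool (*sketch_try_send_fn)(const void *data, size_t len);

/* Two stand-in transports, used only to exercise both paths. */
bool sketch_transport_ready(const void *data, size_t len) {
    (void)data; (void)len;
    return true;   /* pretend the message always goes out at once */
}

bool sketch_transport_busy(const void *data, size_t len) {
    (void)data; (void)len;
    return false;  /* pretend the transport can never accept it now */
}

/* Buffered send that skips the copy when a short message can be
 * delivered immediately, and otherwise falls back to copying. */
sketch_bsend_result sketch_bsend(sketch_bsend_pool *pool,
                                 const void *data, size_t len,
                                 size_t short_limit,
                                 sketch_try_send_fn try_send) {
    assert(pool != NULL && data != NULL && try_send != NULL);

    /* Fast path: short message and the transport takes it right away. */
    if (len <= short_limit && try_send(data, len)) {
        return SKETCH_SENT_DIRECT;
    }

    /* Slow path: copy into the user-provided buffer if it fits. */
    if (len > pool->size - pool->used) {
        return SKETCH_NO_SPACE;
    }
    memcpy(pool->base + pool->used, data, len);
    pool->used += len;
    return SKETCH_BUFFERED;
}
```

With a transport that accepts the message at once, a short send leaves `pool.used` unchanged; the copy is paid only when the fast path fails. A real implementation would also have to release buffer space once buffered messages are delivered, which this sketch omits.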
Other performance difficulties stem from seemingly innocuous requirements that affect the lowest levels of the implementation. For example, the following is legal in MPI:
    MPI_Isend( ..., &request );
    MPI_Request_free( &request );
The user need not (must not, actually) use a wait or test on the request. This functionality can be complex to implement when well-separated software layers are used in the MPI implementation. In particular, it requires either that the completion of the operation started by the MPI_Isend change data maintained by the MPI implementation, or that the MPI implementation periodically check whether some request has completed. Neither approach necessarily matches the services that implement the actual data transport, and either can be a source of unanticipated latency.
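The second of these options, periodic checking, can be sketched as follows. This is a simplified model under stated assumptions, not MPICH's implementation: the request table, the `sketch_` function names, and the explicit poll routine are all hypothetical. A request freed by the user before its transfer finishes is only marked as freed; a later progress poll reclaims it once the (simulated) transport reports completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request record kept by the implementation. */
typedef struct {
    bool in_use;         /* slot is allocated */
    bool transfer_done;  /* set when the transport has finished */
    bool user_freed;     /* user released the handle before completion */
} sketch_request;

#define SKETCH_MAX_REQUESTS 8
static sketch_request sketch_table[SKETCH_MAX_REQUESTS];

/* Start a nonblocking operation: returns a slot index, or -1 if full. */
int sketch_isend(void) {
    for (int i = 0; i < SKETCH_MAX_REQUESTS; i++) {
        if (!sketch_table[i].in_use) {
            sketch_table[i] = (sketch_request){ .in_use = true };
            return i;
        }
    }
    return -1;
}

/* Analogue of MPI_Request_free: the user gives up the handle, but the
 * slot must stay alive until the transfer has actually completed. */
void sketch_request_free(int *handle) {
    assert(handle != NULL && *handle >= 0 && *handle < SKETCH_MAX_REQUESTS);
    sketch_request *r = &sketch_table[*handle];
    if (r->transfer_done) {
        r->in_use = false;     /* already complete: release immediately */
    } else {
        r->user_freed = true;  /* defer release to the progress poll */
    }
    *handle = -1;              /* handle is no longer usable by the user */
}

/* Called by the (simulated) transport layer when the data has left. */
void sketch_transport_complete(int slot) {
    assert(slot >= 0 && slot < SKETCH_MAX_REQUESTS);
    sketch_table[slot].transfer_done = true;
}

/* Periodic check: reclaim requests that are both freed and complete.
 * Returns the number of slots reclaimed by this call. */
int sketch_progress_poll(void) {
    int reclaimed = 0;
    for (int i = 0; i < SKETCH_MAX_REQUESTS; i++) {
        sketch_request *r = &sketch_table[i];
        if (r->in_use && r->user_freed && r->transfer_done) {
            r->in_use = false;
            reclaimed++;
        }
    }
    return reclaimed;
}

/* Number of slots still held by the implementation. */
int sketch_live_requests(void) {
    int live = 0;
    for (int i = 0; i < SKETCH_MAX_REQUESTS; i++) {
        if (sketch_table[i].in_use) {
            live++;
        }
    }
    return live;
}
```

The first option would instead have `sketch_transport_complete` release the slot itself, which requires the transport layer to reach into data owned by the MPI layer. In the polling variant shown here, the cost is that every poll scans for freed requests even when none exist, which is one way the latency mentioned above can arise.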
Despite these problems, the MPICH implementation achieves its goals of high performance and portability. In particular, the use of a carefully layered design, where the layers can be implemented as macros (or removed entirely, as one vendor has done), was key to the success of MPICH.