Performance of MPICH



The MPI specification was designed to allow high performance in the sense that semantic restrictions on optimization were avoided wherever user convenience would not be severely impacted. Furthermore, a number of features were added to enable users to take advantage of optimizations that some systems offered, without affecting portability to other systems that did not have such optimizations available. In MPICH we have tried to take advantage of those features in the Standard that allow for extra optimization, but we have not done so in every possible case.

Performance on one's own application is, of course, what counts most. Nonetheless, useful predictions of application performance can be made, based on the results of specially constructed benchmark programs. In this section, we first describe some of the difficulties that arise in benchmarking message-passing systems, then discuss the programs we have developed to address these difficulties and finally present results from running the benchmarks on a representative sample of the environments supported by MPICH.

The MPICH implementation includes two MPI programs, mpptest and goptest, that provide reliable tests of the performance of an MPI implementation. The program mpptest provides testing of both point-to-point and collective operations on a specified number of processors; the program goptest can be used to study the scalability of collective routines as a function of number of processors.





Performance Measurement Problems and Pitfalls



One common problem with simple performance measurement programs is that the results differ each time the program is run, even on the same system. A number of factors are responsible, ranging from the assumption that clock calls have no cost and infinite resolution to the effects of other jobs running on the same machine. A good performance test will give the same answer (to within the clock's precision) each time. The mpptest and goptest programs distributed with MPICH compute the average time over a number of iterations of an operation (thus handling the cost and granularity of the clock), then run the same test several times and take the minimum of those times (thus reducing the effects of other jobs). The programs can also provide information about the mean and worst-case performance.
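The averaging-then-minimum strategy described above can be sketched in a few lines. This is only an illustration of the idea in Python, not the mpptest source: the function name time_operation, its parameters, and the injectable clock argument are assumptions made for this sketch.

```python
import time


def time_operation(op, iterations=1000, repetitions=10, clock=time.perf_counter):
    """Time op() and return (best, mean, worst) per-call times in seconds.

    Hypothetical helper for illustration only; it is not part of MPICH.
    """
    if iterations < 1 or repetitions < 1:
        raise ValueError("iterations and repetitions must be at least 1")

    averages = []
    for _ in range(repetitions):
        # Read the clock only twice per repetition, so its cost and
        # granularity are spread over many calls of the operation.
        start = clock()
        for _ in range(iterations):
            op()
        elapsed = clock() - start
        averages.append(elapsed / iterations)

    # The minimum is the repetition least disturbed by other jobs;
    # the mean and maximum describe the range of observed behavior.
    return min(averages), sum(averages) / len(averages), max(averages)


if __name__ == "__main__":
    best, mean, worst = time_operation(lambda: sum(range(100)),
                                       iterations=200, repetitions=5)
    print(f"best {best:.3e} s  mean {mean:.3e} s  worst {worst:.3e} s")
```

Reporting the minimum treats interference from other jobs as purely additive noise, which is why the mean and worst case are returned alongside it rather than discarded.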

More subtle are issues of which test to run. The simplest "ping-pong" test, which sends the same data (using the same data buffer) between two processes, allows the data to reside entirely in the memory cache. In many real applications, however, neither buffer will already be in cache, and this situation can affect the performance of the operation. Similarly, data transfers that are not properly aligned on word boundaries can be more expensive than those that are. MPI also has noncontiguous datatypes; an implementation may be significantly slower with these datatypes than with contiguous data. Another parameter is the number of processors used, even if only two are communicating. Certain implementations will include a latency cost proportional to the number of processors. An implementation tuned for few processors gives the best performance on the two-processor ping-pong test at the cost of (possibly) lower performance on real applications. Mpptest and goptest include tests to measure these effects.
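One way a benchmark can keep its message buffer out of cache is to rotate through a pool of buffers larger than the cache, so that a given region has probably been evicted by the time it is reused. The sketch below only computes the offsets into such a pool. It is an illustration under stated assumptions, not a description of how mpptest does it: the name rotating_offsets, the choice of a pool spanning twice the cache size, and the LRU-like eviction behavior are all assumed.

```python
def rotating_offsets(message_bytes, cache_bytes, iterations):
    """Yield one buffer offset per iteration, cycling through a large pool.

    Hypothetical helper for illustration only. The pool is sized to span
    at least twice the cache (an assumed safety factor), so a slot is
    unlikely to still be cached when it comes around again.
    """
    if message_bytes < 0 or cache_bytes < 0 or iterations < 0:
        raise ValueError("arguments must be non-negative")

    # Each slot holds one message; use at least one byte per slot so
    # that zero-length messages still advance through the pool.
    stride = max(message_bytes, 1)

    # Ceiling division: number of slots needed to cover 2 * cache_bytes.
    slots = max(2, (2 * cache_bytes + stride - 1) // stride)

    for i in range(iterations):
        yield (i % slots) * stride


if __name__ == "__main__":
    # 1 KB messages with an assumed 4 KB cache: 8 distinct slots.
    print(list(rotating_offsets(1024, 4096, 10)))
```

A plain ping-pong test corresponds to the degenerate case in which every iteration uses offset 0.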





Benchmarks for Point-to-Point Operations



In this section we present some of the simplest benchmarks for performance of MPICH on various platforms. The performance test programs mpptest and goptest can produce a wealth of information; the script basetest, provided with the MPICH implementation, can be used to get a more complete picture of the behavior of a particular system. Here, we present only the most basic data: short- and long-message performance.

For the short-message graphs, the only options used with mpptest are -auto and -size 0 1000 40. The option -auto tells mpptest to choose the sizes of the messages so as to reveal the exact message size where there is any sudden change in behavior (for example, at an internal packet-size boundary). The -size option selects messages with sizes from 0 to 1000 bytes in increments of 40 bytes. The short-message graphs give a good picture of the latency of message passing.

For the long-message graphs, a few more options are used, some of which simply make the test runs more efficient. The message size range is set with -size 1000 77000 4000, which selects messages of sizes from 1000 to 77000 bytes (about 1K to 80K), sampled every 4000 bytes.
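To make the two -size settings concrete, the following sketch enumerates the message sizes each one describes. It assumes the three arguments are first size, last size, and increment, with the last size included; the helper name message_sizes is hypothetical, and the extra sizes that -auto may insert around sudden changes in behavior are not modeled.

```python
def message_sizes(first, last, step):
    """List the message sizes in bytes from first to last (inclusive).

    Hypothetical helper for illustration; inclusive treatment of the
    last size is an assumption of this sketch.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(first, last + 1, step))


if __name__ == "__main__":
    short = message_sizes(0, 1000, 40)        # as in -size 0 1000 40
    long_ = message_sizes(1000, 77000, 4000)  # as in -size 1000 77000 4000
    print(len(short), short[:3], short[-1])   # 26 [0, 40, 80] 1000
    print(len(long_), long_[:3], long_[-1])   # 20 [1000, 5000, 9000] 77000
```

Under that reading, the short-message runs sample 26 sizes and the long-message runs sample 20.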

These tests provide a picture of the best achievable bandwidth performance. More realistic tests can be performed by using -cachesize (to force the use of different data areas), -overlap (for communication and computation overlap), -async (for nonblocking communications), and -vector (for noncontiguous communication). Using -givedy gives information on the range of performance, displaying both the mean and worst-case performance.





Performance of MPICH Compared with Native Vendor Systems




Figure 1: MPICH vs. NX on the Paragon

One question that can be asked about MPI is how its performance compares with that of proprietary vendor systems. Fortunately, the mpptest program was designed to work with many message-passing systems and can be built to call a vendor's system directly. In Figure 1, we compare MPI and Intel's NX message passing. The MPICH implementation for the Intel Paragon, while implemented with a special ADI, still relies on message-passing services provided by NX. Despite this fact, the MPI performance is quite good and can probably be improved with the second-generation ADI, planned for a later release of MPICH. We use this as a representative example to demonstrate that the apparently elaborate structure shown in Figures 7 and 8 does not impose serious performance overheads beyond those of the underlying, vendor-specific message-passing layer.





Paragon Measurements



The Intel Paragon has a classic distributed-memory architecture with a (cut-through routed) 2-D mesh topology. Latency and bandwidth performance are shown in Figure 2. The measurements shown in Figure 2 were taken while other users were on the system. This explains why the right side of Figure 2 is "rougher" than the curve in Figure 1, although the peak bandwidth shown is similar.


Figure 2: Short and long messages on the Paragon





IBM SP2 Measurements



The IBM SP2 at Argonne National Laboratory has Power-1 nodes (the same as in the IBM SP1) and the SP2 high-performance switch. Measurements on an IBM SP2 with Power-2 nodes (thin or wide) will be different. The latencies shown in Figure 3 reflect the slower speed of the Power-1 nodes. Note the obvious packet boundaries in the short-message plot.


Figure 3: Short and long messages on the IBM SP2





SGI Power Challenge Measurements



The SGI Power Challenge is a symmetric multiprocessor. The latency and bandwidth shown in Figure 4 were measured with the ch_shmem device, a generic shared-memory device supplied with the MPICH implementation.


Figure 4: Short and long messages on the SGI Power Challenge





Cray T3D Measurements



The Cray T3D supports a shared-memory interface (the shmem library). In MPICH, this library is used to implement MPI message-passing semantics. The latency and bandwidth performance are shown in Figure 5.


Figure 5: Short and long messages on the Cray T3D





Workstation Network Measurements



Workstation networks connected by simple Ethernet are common. The performance of MPICH for two Sun SPARCstations on a shared Ethernet is shown in Figure 6.


Figure 6: Short and long messages on a workstation network


