How not to measure communications performance

Here are some common errors in measuring communication performance, organized as problems with the test design, with the choice of functions tested, and with the interpretation of the data.

Problems with Test Design

Forget to establish initial communication link: Some systems create connections dynamically, so the first message exchanged may also pay the cost of setting up the connection.
Ignore contention with unrelated applications or jobs: A background filesystem backup may consume much of the available communication bandwidth.
Ignore correctness: Systems that fail for long messages may have an unfair advantage for short messages.
Time events that are small relative to the resolution of the clock: Many timers are not cycle counters; timing a single event may lead to wildly inaccurate times if the resolution of the clock is close to the time the operation takes. A related error is to try to correct for the clock overhead by subtracting an estimate of the time to call the clock, computed as the average time of a clock call; this will reduce the apparent time and artificially inflate performance.
Measure with just two processors: Some systems may poll every possible source of messages, so the cost of polling grows with the number of sources; this can lead to a significant degradation in performance for real configurations.
Measure with a single communication pattern: No system with a large number of processors provides a perfect interconnect. The pattern you want may incur contention. One major system suffers slowdowns when simple butterfly patterns are used.
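
The clock-resolution problem above can be illustrated with a minimal sketch in Python. It assumes the standard-library timer time.perf_counter; the helper names, the stand-in short_operation, and the threshold of 1000 clock ticks (with a 1 ms floor) are illustrative choices, not values from this page. The idea is to time a whole batch of operations that is long relative to the clock resolution and divide by the count, rather than timing a single event.

```python
import time

def clock_resolution(samples=1000):
    """Smallest observed step between successive clock readings.

    This includes the cost of calling the clock, so it is an upper
    bound on the true tick size.
    """
    smallest = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = time.perf_counter()
        smallest = min(smallest, t1 - t0)
    return smallest

def time_per_call(operation, min_elapsed):
    """Time one batch of calls, doubling the batch size until the batch
    lasts at least min_elapsed seconds; return (mean time, batch size)."""
    repeats = 1
    while True:
        start = time.perf_counter()
        for _ in range(repeats):
            operation()
        elapsed = time.perf_counter() - start
        if elapsed >= min_elapsed:
            return elapsed / repeats, repeats
        repeats *= 2

def short_operation():
    # Hypothetical stand-in for a short communication call.
    sum(range(100))

resolution = clock_resolution()
# Assumed threshold: the batch must span at least 1000 ticks and at least 1 ms.
min_elapsed = max(1000 * resolution, 1e-3)
per_call, repeats = time_per_call(short_operation, min_elapsed)
print(f"observed clock step: {resolution:.3e} s")
print(f"{repeats} calls per batch, {per_call:.3e} s per call")
```

Because the clock is read only twice per batch, its overhead is spread over many operations and no subtraction of an estimated clock cost is needed.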

Functions Tested

Ignore non-blocking calls: High-performance kernels often use non-blocking operations, both for the possibility of overlapping communication and, more importantly, for the advantage of allowing the system to schedule communications when many processes are communicating.
Ignore overlap of computation and communication: High-performance kernels often strive to do this, both for the advantages in data transfer and for latency hiding.
Ignore cache effects: Does the data end up in the cache of the receiver? What if the data doesn't start in the cache of the sender? Does the transfer of data perturb (e.g., invalidate) the cache?
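
Why a test that ignores overlap can mislead is easier to see with a small simulation. The sketch below is not a real communication benchmark: it assumes a transfer can be modeled by a sleeping thread and computation by a busy loop, and the durations and function names are hypothetical. It only shows that when transfer and computation proceed together, the elapsed time approaches the longer of the two rather than their sum.

```python
import threading
import time

TRANSFER_SECONDS = 0.1  # assumed duration of the simulated transfer
COMPUTE_SECONDS = 0.1   # assumed duration of the simulated computation

def simulated_transfer():
    # Hypothetical stand-in for a message in flight; sleeping leaves the CPU free.
    time.sleep(TRANSFER_SECONDS)

def simulated_compute():
    # Busy loop standing in for useful computation.
    deadline = time.perf_counter() + COMPUTE_SECONDS
    while time.perf_counter() < deadline:
        pass

def blocking_version():
    """Transfer first, then compute: the two costs add up."""
    start = time.perf_counter()
    simulated_transfer()
    simulated_compute()
    return time.perf_counter() - start

def overlapped_version():
    """Start the transfer, compute while it proceeds, then wait for it."""
    start = time.perf_counter()
    transfer = threading.Thread(target=simulated_transfer)
    transfer.start()     # analogous to posting a non-blocking operation
    simulated_compute()
    transfer.join()      # analogous to waiting for completion
    return time.perf_counter() - start

blocking_time = blocking_version()
overlapped_time = overlapped_version()
print(f"blocking:   {blocking_time:.3f} s")
print(f"overlapped: {overlapped_time:.3f} s")
```

A benchmark that times only the blocking form would report roughly the sum of the two costs, which is not what an application that overlaps them would see. Whether a real system delivers such overlap is exactly what a test needs to measure.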

Comparing Apples and Oranges

Make an apples-to-oranges comparison by comparing message-passing to a shared-memory copy: Message-passing accomplishes two effects: the transfer of data and a handshake (synchronization) to indicate that the data are available. Some comparisons of message-passing with remote memory operations ignore the synchronization step.
Compare CPU time to elapsed time: CPU time may not include any time that was spent waiting for data to arrive. Knowing the CPU load caused by a message-passing system is useful information, but only the elapsed time should be used to measure the time it takes to deliver a message.
Confuse total bandwidth with point-to-point bandwidth: Don't compare dedicated, switched networks with shared network fabrics.
Use a communication pattern different from the application: Ensuring that a receive is issued before the matching send can make a significant difference in the performance. Multiple messages between different processes can also affect performance. Measuring ping-pong messages when the application sends head-to-head (as many scientific applications do) can also be misleading.
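
The difference between CPU time and elapsed time noted above can be seen in a minimal Python sketch. It assumes the standard-library timers time.process_time (CPU time of the current process) and time.perf_counter (elapsed time); wait_for_data is a hypothetical stand-in for a blocking receive that sleeps while waiting, and the 0.2 s wait is an arbitrary choice.

```python
import time

def wait_for_data(seconds):
    # Hypothetical stand-in for a blocking receive: the process sleeps
    # while it waits, so it uses almost no CPU.
    time.sleep(seconds)

cpu_start = time.process_time()    # CPU time used by this process
wall_start = time.perf_counter()   # elapsed (wall-clock) time

wait_for_data(0.2)

cpu_used = time.process_time() - cpu_start
elapsed = time.perf_counter() - wall_start

print(f"CPU time:     {cpu_used:.4f} s")
print(f"elapsed time: {elapsed:.4f} s")
```

Here the CPU time is a small fraction of the elapsed time, so reporting it as the delivery time would greatly overstate performance. An implementation that polls while waiting would consume CPU time during the wait, which is why the two quantities answer different questions.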

Please send comments to William Gropp.