How not to measure communications performance
Here are some common errors in measuring communication performance, organized
as problems with the test design, with the choice of functions tested and with
the interpretation of the data.
Problems with Test Design
- Forget to establish initial communication link
- Some systems dynamically create connections
- Ignore contention with unrelated applications or jobs
- A background filesystem backup may consume much of the available communication bandwidth
- Ignore correctness
- Systems that fail for long messages may have an unfair advantage for
short messages.
- Time events that are small relative to the resolution of the clock
- Many timers are not cycle counters; timing a single event may lead to
wildly inaccurate times if the resolution of the clock is close to the time
the operation takes. A related error is to try to correct for the clock
overhead by subtracting an estimate of the cost of a clock call, computed as
the average time it takes to call the clock; this reduces the apparent time
and artificially inflates performance.
- Measure with just two processors
- Some systems may poll every possible source of messages, so the cost grows
with the number of processes; this can lead to a significant degradation in
performance for real configurations.
- Measure with a single communication pattern
- No system with a large number of processors provides a perfect
interconnect. The pattern you want may incur contention. One major system
suffers slowdowns when simple butterfly patterns are used.
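The clock-resolution point above can be made concrete with a minimal sketch. It assumes Python's time.perf_counter() as the clock; short_operation is a hypothetical stand-in for a short communication call, and the 0.01-second floor and factor of 1000 are arbitrary choices for illustration. Rather than timing one event or subtracting an estimated clock overhead, the sketch repeats the operation until the whole loop is long compared with the clock resolution, then divides by the repeat count.

```python
import time

def clock_resolution(samples=1000):
    """Estimate the smallest observable nonzero tick of time.perf_counter()."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = time.perf_counter()
        smallest = min(smallest, t1 - t0)
    return smallest

def time_per_call(operation, min_elapsed, repeats=1):
    """Time `operation` in a loop, doubling the repeat count until the
    whole loop takes at least `min_elapsed` seconds."""
    while True:
        start = time.perf_counter()
        for _ in range(repeats):
            operation()
        elapsed = time.perf_counter() - start
        if elapsed >= min_elapsed:
            return elapsed / repeats, repeats
        repeats *= 2

def short_operation():
    # Hypothetical stand-in for a short communication call.
    bytes(64)

resolution = clock_resolution()
# Require the timed loop to be long relative to the clock resolution.
min_elapsed = max(1000 * resolution, 0.01)
per_call, repeats = time_per_call(short_operation, min_elapsed)
print(f"clock resolution: about {resolution:.2e} s")
print(f"time per call: about {per_call:.2e} s over {repeats} repeats")
```

The per-call figure still includes loop overhead, so it is an upper bound on the operation's cost rather than an exact value.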
Functions Tested
- Ignore non-blocking calls
- High-performance kernels often use non-blocking operations, both for the
possibility of overlapping communication and, more importantly, for the
freedom they give the system to schedule communications when many processes
are communicating.
- Ignore overlap of computation and communication
- High-performance kernels often strive for this overlap, both for the
advantages in data transfer and for latency hiding.
- Ignore cache effects
- Does the data end up in the cache of the receiver? What if data doesn't
start in the cache of the sender? Does the transfer of data perturb (e.g.,
invalidate) the cache?
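To illustrate why a test that ignores overlap can mislead, here is a small simulation, not a real communication benchmark. It assumes the transfer can proceed without the CPU, modeled by a thread that sleeps; simulated_transfer, compute, run_serial, and run_overlapped are hypothetical names, and the 0.2 s and 0.1 s durations are arbitrary. The serial version waits for the transfer and then computes, while the overlapped version starts the transfer, computes, and only then waits.

```python
import threading
import time

def simulated_transfer(seconds):
    # Assumption: the transfer needs no CPU, so a sleep stands in for it.
    time.sleep(seconds)

def compute(seconds):
    # Busy loop that occupies the CPU for the given wall-clock time.
    end = time.perf_counter() + seconds
    count = 0
    while time.perf_counter() < end:
        count += 1
    return count

def run_serial(transfer_s, compute_s):
    """Blocking style: finish the transfer, then compute."""
    start = time.perf_counter()
    simulated_transfer(transfer_s)
    compute(compute_s)
    return time.perf_counter() - start

def run_overlapped(transfer_s, compute_s):
    """Non-blocking style: start the transfer, compute, then wait."""
    start = time.perf_counter()
    transfer = threading.Thread(target=simulated_transfer, args=(transfer_s,))
    transfer.start()    # start the transfer in the background
    compute(compute_s)  # compute while the transfer is in flight
    transfer.join()     # wait for the transfer to complete
    return time.perf_counter() - start

serial = run_serial(0.2, 0.1)
overlapped = run_overlapped(0.2, 0.1)
print(f"serial: {serial:.3f} s, overlapped: {overlapped:.3f} s")
```

Under these assumptions the serial time is roughly the sum of the two durations and the overlapped time is roughly the larger of the two. A test that measures only blocking transfers cannot show that difference.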
Comparing Apples and Oranges
- Make an apples to oranges comparison by comparing message-passing to
a shared-memory copy
- Message-passing accomplishes two effects: the transfer of data and
a handshake (synchronization) to indicate that the data are available. Some
comparisons of message-passing with remote memory operations ignore the
synchronization step.
- Compare CPU time to elapsed time
- CPU time may not include any time that was spent waiting for data to
arrive. Knowing the CPU load caused by a message passing system is useful
information, but only the elapsed time may be used to measure the time it takes
to deliver a message.
- Confuse total bandwidth with point-to-point bandwidth
- Don't compare dedicated, switched networks with shared network fabrics
- Use a communication pattern different from the application
- Ensuring that a receive is issued before the matching send can make a
significant difference in the performance. Multiple messages between
different processes can also affect performance. Measuring ping-pong messages
when the application sends head-to-head (as many scientific applications do)
can also be misleading.
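The difference between CPU time and elapsed time noted above can be seen in a minimal sketch. It assumes Python's time.process_time() for CPU time and time.perf_counter() for elapsed time; wait_for_message is a hypothetical stand-in for a blocking receive, modeled here as a sleep. A real receive may busy-poll and consume CPU while it waits, so the near-zero CPU figure is specific to this model.

```python
import time

def wait_for_message(delay_seconds):
    # Hypothetical stand-in for a blocking receive: the process is idle
    # while it waits for data to arrive.
    time.sleep(delay_seconds)

cpu_start = time.process_time()   # CPU time used by this process
wall_start = time.perf_counter()  # elapsed (wall-clock) time

wait_for_message(0.2)

cpu_used = time.process_time() - cpu_start
elapsed = time.perf_counter() - wall_start

print(f"CPU time:     {cpu_used:.4f} s")
print(f"elapsed time: {elapsed:.4f} s")
```

Reporting the CPU figure as the delivery time would make the message look almost free, while the elapsed figure reflects how long the receiver actually waited.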
Please send comments to William Gropp.