Tracing the behavior of parallel applications on extreme-scale systems

January 3, 2014

Event-tracing tools have proved vital for understanding how parallel applications behave. But new challenges make the use of event tracing on extreme-scale machines problematic. Tracing tools generate large amounts of data, which can overload the parallel file system and skew the application being studied. To remedy this problem, researchers from Argonne National Laboratory, the Technische Universität Dresden in Germany, and Oak Ridge National Laboratory have devised a new technique that enables event tracing on exascale systems.

Scientists using existing performance analysis tools find that these tools frequently do not scale to large systems. Even when they do scale, they often store the recorded data inefficiently and thus overwhelm large-scale storage systems. Other tools avoid these problems but may introduce artificial synchronization into the application.

After reviewing a number of tracing tools, the research team selected a popular tool called VampirTrace as the most promising. “We observed that the I/O patterns generated by this tool could be optimized at an intermediate layer known as the I/O forwarding layer,” said Dries Kimpe, an assistant computer scientist in Argonne’s Mathematics and Computer Science Division. I/O forwarding allows a large number of write operations to be collected from multiple sources and reorganized before being written to the storage system. “By reducing the number of files generated, we significantly reduce the burden on the file system,” Kimpe said.
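To illustrate the idea of write aggregation at an I/O forwarding layer, the following is a minimal, hypothetical sketch (the class and method names are illustrative, not the project's actual code): small write requests from many compute processes are collected, sorted by file offset, and contiguous requests are merged so that far fewer, larger writes reach the storage system.

```python
class ForwardingAggregator:
    """Hypothetical sketch of I/O forwarding: collect small writes from
    many compute processes, then merge contiguous requests into fewer,
    larger writes against a single shared file."""

    def __init__(self):
        self.pending = []  # list of (offset, data) requests from all ranks

    def submit(self, offset, data):
        """Called by the forwarding layer when a compute process writes."""
        self.pending.append((offset, data))

    def flush(self):
        """Sort pending requests by offset, coalesce adjacent ones, and
        return the merged write list to be issued to the file system."""
        self.pending.sort(key=lambda req: req[0])
        merged = []
        for offset, data in self.pending:
            if merged and merged[-1][0] + len(merged[-1][1]) == offset:
                # This request starts exactly where the previous one
                # ends, so extend the previous write instead.
                prev_offset, prev_data = merged[-1]
                merged[-1] = (prev_offset, prev_data + data)
            else:
                merged.append((offset, data))
        self.pending = []
        return merged
```

In this sketch, three 1 KB writes at consecutive offsets from different processes collapse into a single 3 KB write, which is the effect Kimpe describes: fewer, larger operations reaching the file system.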

In addition to improving the write efficiency, the researchers devised several optimizations to address scalability bottlenecks in the VampirTrace toolset. For example, they devised a technique that enables higher throughput, and they introduced an option that reduces the overhead time involved in checking buffers.
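The paper's actual optimizations are not detailed here, but the general idea of reducing buffer-check overhead can be sketched as an amortized capacity check: rather than testing on every recorded event whether the trace buffer needs flushing, the check runs only every N events, with enough headroom reserved that the skipped checks cannot overflow the buffer. Everything in this sketch (names, parameters, structure) is a hypothetical illustration, not VampirTrace's implementation.

```python
class TraceBuffer:
    """Hypothetical sketch of amortized buffer checking: test the
    fill level only every `check_interval` events, keeping enough
    headroom that the skipped checks cannot overflow the buffer."""

    def __init__(self, capacity, max_event_size, check_interval=64):
        self.capacity = capacity            # total buffer size in bytes
        self.max_event_size = max_event_size
        self.check_interval = check_interval
        self.used = 0
        self.events_since_check = 0
        self.flushes = 0

    def record(self, event_size):
        assert event_size <= self.max_event_size
        self.used += event_size
        self.events_since_check += 1
        if self.events_since_check >= self.check_interval:
            self.events_since_check = 0
            # Flush early enough that check_interval more events of
            # maximum size still fit before the next check.
            headroom = self.check_interval * self.max_event_size
            if self.used >= self.capacity - headroom:
                self.flush()

    def flush(self):
        """Hand the buffer contents to the I/O layer (elided) and reset."""
        self.flushes += 1
        self.used = 0
```

The trade-off is classic: the per-event cost drops from one comparison per event to roughly one per `check_interval` events, at the price of flushing slightly before the buffer is completely full.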

To demonstrate the effectiveness of the approach at large scale, the team tested the enhanced VampirTrace tools on Jaguar, a Cray XT5 leadership-class computer. Using a numerical simulation code that scales to the full machine, the researchers successfully traced performance data on more than 200,000 processes, increasing the maximum traced application size by a factor of 5.

The features developed in this project can also be used for other massively parallel applications, data-intensive stream-processing tools, and high-level I/O libraries that allow chunked data storage. Moreover, the features are sufficiently generic that they can be implemented in other production-quality I/O forwarding layers, such as those used in IBM’s Blue Gene systems. Thus, they offer the promise of substantially helping the high-performance computing community understand applications running on large-scale systems.

The research is presented in a paper published in April 2013 in Cluster Computing, titled “Optimizing I/O forwarding techniques for extreme-scale event tracing.”