P. M. Dickens and R. Thakur, "On Implementing High-Performance Collective I/O," Preprint ANL/MCS-P852-0900, September 2000.
In many parallel applications, I/O requirements often present a significant obstacle to achieving good performance. An important area of current research is the development of techniques by which these costs can be reduced. One such approach is collective I/O, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, thereby making better use of the I/O subsystem. While many studies in the literature have shown excellent results using collective I/O techniques, there has been little discussion of implementation techniques for collective I/O operations or of the impact of various implementation strategies on performance. A closely related issue, which has not yet received sufficient attention, is whether the high cost of I/O can be further reduced by executing the collective I/O operation in the background, thus overlapping its execution with other computation occurring in the foreground. In this paper, we investigate both of these important issues. First, we explore the issues that arise in the implementation of a collective I/O library and show the impact on performance of various implementation strategies. To make these results as general as possible, we investigate the performance of collective I/O implementations on four different parallel architectures: the IBM SP2, the Intel Paragon, the HP Exemplar, and the SGI Origin2000. We show that a naive implementation of collective I/O does not result in significant performance gains on any of the architectures, but that an optimized collective I/O implementation provides excellent performance across all of the platforms under study. Furthermore, we demonstrate that a single implementation strategy provides the best performance on all computational platforms. Next, we explore the issues that arise in the implementation of thread-based collective I/O operations.
We show that the most obvious implementation technique, spawning a thread to execute the entire collective I/O operation in the background, frequently provides the worst performance, often performing much worse than executing the collective I/O routine entirely in the foreground. To improve the performance of thread-based collective I/O, we develop an alternative approach in which part of the collective I/O operation is performed in the background and part in the foreground. We demonstrate that this new technique can provide significant performance gains, offering up to 50% improvement over implementations that do not attempt to overlap collective I/O with computation.