R. Thakur and W. Gropp, "Improving the Performance of MPI Collective Communication on Switched Networks," Preprint ANL/MCS-P1007-1102, November 2002.
In this paper, we present new algorithms for improving the performance of collective communication operations in MPI. Our target architecture is a cluster of machines connected by a switched network such as Myrinet or switched Ethernet. We have developed new algorithms for all the MPI collective communication operations, namely, scatter/gather/reduce, allgather/allreduce, broadcast, reduce-scatter, all-to-all, and scan. We compare the performance of our new algorithms with the algorithms currently used in the latest version of MPICH on up to 256 nodes of a Myrinet-connected cluster. For operations such as scatter/gather/reduce, allgather/allreduce, and reduce-scatter, we observe an improvement of up to a factor of 10 for short message sizes. For operations such as broadcast and reduce-scatter with long message sizes, the new algorithms are truly scalable: the time taken remains fairly constant as we increase the number of processes participating in the operation.