W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD," Preprint ANL/MCS-P850-1000, October 2000. [pdf]
To achieve high performance, a parallel algorithm needs to effectively
utilize the memory subsystem and minimize the communication volume
and the number of network transactions. These issues gain further importance on modern architectures, where the peak CPU performance is
increasing much more rapidly than the memory or network performance.
In this paper, we present some performance-enhancing techniques that
were employed on an unstructured mesh implicit solver. Our experimental results show that this solver adapts reasonably well to the high memory
and network latencies.