Scalable Spectral Element Methods

LEFT: Early-time pressure distribution for a simulation of coolant flow in a 217-pin wire-wrapped subassembly, computed on 32768 processors of the Argonne Leadership Computing Facility's BG/P using Nek5000. The Reynolds number, based on hydraulic diameter, is Re ~ 10500. The mesh consists of 2.95 million spectral elements of order N=7 (~988 million gridpoints). RIGHT: Strong scaling results for this problem on BG/P. The vertical axis is the CPU time (s) for the first 50 timesteps; the horizontal axis is the number of processors. 80% parallel efficiency is realized at P=131072, which corresponds to only ~7500 points per processor.

The simulation pictured above is a watershed computation: it is our first to exceed one million spectral elements (2.95 million used) and our first to reach roughly one billion gridpoints (0.988 billion used). The domain is a model of a 217-pin subassembly with wire-wrapped pins. The pins partition the hexagonal canister into 438 (communicating) subchannels, each of length Lz/Dh ~ 75, where Dh is the hydraulic diameter of the subchannel. In aggregate, this simulation is therefore comparable to LES of flow in a plane channel of length Lz ~ 90000h, where h is the channel half-height, except that the geometry here is complex. Wires spiraling around each pin serve not only to separate the pins but also to induce inter-channel mixing, thus mitigating local hot spots. The domain is bounded on six sides by canister walls and is periodic in the axial direction, with a length corresponding to a single period of the wire wrap.

Computational time on the IBM BG/P at the ALCF was provided through the DOE Office of Science INCITE program. The Nek5000 development and simulation effort is supported by the DOE's Applied Mathematics Research and AFCI Advanced Simulation and Modeling programs.

Realizing this degree of scalability required the following developments:

A scalable spectral element multigrid solver for the pressure. (A generic two-grid sketch of the cycle structure appears after this list.)
A 4th-generation coarse-grid solver (based on algebraic multigrid).
Elimination of virtually all arrays that scale with the global element count (only two remain), which reduced the memory footprint from 1 GB/proc to 90 MB/proc for P=65536. This reduction was enabled by James Lottes's development of a custom, efficient, general-purpose all-to-all utility based on the Crystal Router algorithm described in the text of Fox et al. (1988); a sketch of the routing scheme appears after this list.
Development of communication algorithms that can rapidly discover and instantiate the required communication topology (again, using the CR algorithm).
Development of a scalable partitioner. (Standard partitioners were unable to meet our requirements.) Our partitioner employs recursive spectral bisection on the element graph, with element-to-element connectivity weighted by the number of vertices shared between adjacent elements. The resulting graph has a "27-point" stencil rather than a 7-point one, which eliminates a large fraction of the disconnected subsets that are problematic for mesh partitioning algorithms. A sketch of the bisection step appears after this list.
Parallel I/O -- We open and write a limited number of files in parallel (typically 16-64). These files are written to distinct subdirectories so that the number of files in the working directory is bounded, while the number of files in each subdirectory scales only with the desired number of outputs, independent of P or Q (the number of "writing" processors). The processor space is divided into Q subsets, and each processor q' in a given subset sends its data to the lowest-numbered processor in that subset, which is designated the I/O processor. A sketch of this grouping appears after this list.
Parallel visualization. We use the VisIt software (developed at LLNL) on ALCF's visualization platform, Eureka, which has direct access to the files generated on the BG/P. VisIt has readily accommodated the billion-cell meshes produced by these runs.
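To illustrate the cycle structure that a multigrid pressure solver relies on, here is a minimal two-grid sketch in Python for a 1D Poisson model problem, with weighted-Jacobi smoothing and a direct solve standing in for the coarse-grid solver. It is a generic illustration only, not Nek5000's spectral element multigrid; all function names and parameters are hypothetical.

```python
import numpy as np

def poisson_1d(n):
    """3-point Laplacian on n interior points of (0,1), zero Dirichlet BCs."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def smooth(A, b, x, iters=3, omega=2.0 / 3.0):
    """Weighted-Jacobi smoothing sweeps."""
    d = np.diag(A)
    for _ in range(iters):
        x = x + omega * (b - A @ x) / d
    return x

def restrict(r):
    """Full-weighting restriction: 2m+1 fine points -> m coarse points."""
    return 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])

def prolong(e):
    """Linear interpolation: m coarse points -> 2m+1 fine points."""
    ef = np.zeros(2 * len(e) + 1)
    ef[1::2] = e
    epad = np.concatenate(([0.0], e, [0.0]))
    ef[0::2] = 0.5 * (epad[:-1] + epad[1:])
    return ef

def two_grid_cycle(Af, Ac, b, x):
    x = smooth(Af, b, x)                 # pre-smoothing on the fine grid
    rc = restrict(b - Af @ x)            # restrict the residual
    ec = np.linalg.solve(Ac, rc)         # coarse-grid correction (direct solve here)
    x = x + prolong(ec)                  # prolong and correct
    return smooth(Af, b, x)              # post-smoothing

m = 31                                   # coarse interior points
Af, Ac = poisson_1d(2 * m + 1), poisson_1d(m)
b, x = np.ones(2 * m + 1), np.zeros(2 * m + 1)
for k in range(10):
    x = two_grid_cycle(Af, Ac, b, x)
    print(k, np.linalg.norm(b - Af @ x)) # residual norm drops each cycle
```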
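The Crystal Router of Fox et al. (1988) delivers arbitrary sparse all-to-all traffic in log2(P) pairwise-exchange stages on P = 2^d ranks: at stage k, each rank hands to its partner (rank XOR 2^k) every message whose destination differs from its own rank in bit k. Below is a minimal single-process Python simulation of that routing, assuming P is a power of two; it is not the all-to-all utility used in Nek5000, and the message layout is illustrative only.

```python
import random

def crystal_router(P, inboxes):
    """Route messages (dest, payload) to their destinations in log2(P) stages.

    inboxes[r] holds the messages currently residing on rank r; P must be a
    power of two.  At stage k each rank exchanges with partner = r ^ 2**k all
    messages whose destination differs from r in bit k.
    """
    d = P.bit_length() - 1
    assert P == 1 << d, "P must be a power of two"
    for k in range(d):
        outgoing = [[] for _ in range(P)]
        for r in range(P):
            keep = []
            for dest, payload in inboxes[r]:
                if (dest ^ r) >> k & 1:               # bit k differs: forward it
                    outgoing[r ^ (1 << k)].append((dest, payload))
                else:
                    keep.append((dest, payload))
            inboxes[r] = keep
        for r in range(P):                            # "receive" phase of the stage
            inboxes[r].extend(outgoing[r])
    return inboxes

# Demo: 16 simulated ranks, each sending messages to a few random destinations.
P = 16
inboxes = [[(random.randrange(P), f"from {r}") for _ in range(3)] for r in range(P)]
inboxes = crystal_router(P, inboxes)
assert all(dest == r for r, box in enumerate(inboxes) for dest, _ in box)
print("all messages delivered in", P.bit_length() - 1, "stages")
```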
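The bisection step above can be sketched as: form the element graph weighted by shared-vertex counts, compute the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the weighted graph Laplacian), and split the elements at its median, recursing on each half. The Python below is a small dense-linear-algebra illustration of that idea, not the production partitioner; the element representation (a tuple of global vertex ids per element) and the fixed recursion depth are assumptions made for the demo.

```python
import numpy as np
from itertools import combinations

def element_graph(elements):
    """Weighted adjacency: W[i, j] = number of vertices shared by elements i and j."""
    n = len(elements)
    W = np.zeros((n, n))
    sets = [set(e) for e in elements]
    for i, j in combinations(range(n), 2):
        W[i, j] = W[j, i] = len(sets[i] & sets[j])
    return W

def fiedler_vector(W):
    """Eigenvector for the second-smallest eigenvalue of the graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1]

def recursive_spectral_bisection(elements, ids, parts, part_id=0, depth=3):
    """Assign each element id in `ids` a partition number via recursive bisection."""
    if depth == 0 or len(ids) <= 1:
        for e in ids:
            parts[e] = part_id
        return
    W = element_graph([elements[e] for e in ids])
    order = np.argsort(fiedler_vector(W))      # sort elements by Fiedler value
    half = len(ids) // 2                       # split at the median for balance
    left = [ids[i] for i in order[:half]]
    right = [ids[i] for i in order[half:]]
    recursive_spectral_bisection(elements, left, parts, 2 * part_id, depth - 1)
    recursive_spectral_bisection(elements, right, parts, 2 * part_id + 1, depth - 1)

# Demo: a 4x4 patch of quadrilateral elements on a 5x5 vertex grid.
nx = 4
elements = [(j * 5 + i, j * 5 + i + 1, (j + 1) * 5 + i, (j + 1) * 5 + i + 1)
            for j in range(nx) for i in range(nx)]
parts = {}
recursive_spectral_bisection(elements, list(range(len(elements))), parts, depth=2)
print(parts)   # 16 elements split into 4 balanced parts
```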
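One plausible rendering of the processor-to-writer mapping described above, sketched in Python: the P ranks are split into Q contiguous subsets, each subset funnels its data to its lowest-numbered rank, and each writer targets one file per output in its own subdirectory. The directory layout and file naming here are illustrative assumptions, not the actual Nek5000 output format.

```python
def io_layout(rank, P, Q, output_index, prefix="out"):
    """Map a rank to its I/O group, its writer rank, and the writer's file path.

    P ranks are split into Q contiguous groups; the lowest-numbered rank in
    each group is the designated I/O (writer) processor.
    """
    group_size = P // Q                      # assume Q divides P for simplicity
    group = rank // group_size
    writer = group * group_size              # lowest-numbered rank in the group
    # One subdirectory per group, so the working directory holds Q entries,
    # while each subdirectory grows only with the number of outputs written.
    path = f"{prefix}/group{group:04d}/{prefix}{output_index:05d}.dat"
    return group, writer, path

P, Q = 131072, 64
for rank in (0, 2047, 2048, 131071):
    group, writer, path = io_layout(rank, P, Q, output_index=3)
    role = "writes" if rank == writer else f"sends to rank {writer}, which writes"
    print(f"rank {rank:6d}: group {group:3d}, {role} {path}")
```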
The pictures below show some close-ups of the mesh and a plot of the velocity magnitude near one of the bounding walls.