The performance data shown are for a single-domain (no nest) problem at 100 km resolution, run non-hydrostatic with full physics (radiation, Grell cumulus, explicit moisture, mixed-phase ice physics, Blackadar PBL). The grid is 61 by 61 points with 23 vertical levels.
The extremely low latency and high bandwidth of the T3D interprocessor communication hardware and software no doubt play a role in its greater parallel efficiency relative to the SP2. Of greater impact, however, is the T3D's dramatically lower per-processor performance relative to the SP2: when computation is slow, communication accounts for a smaller fraction of total run time. Thus, although the T3D looks attractive in terms of parallel efficiency and speedup, the SP2 is 3.5 times faster. This is shown in the next section.
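The effect of slow nodes on parallel efficiency can be sketched with a toy cost model. This is an illustrative assumption, not the method used to produce the measurements reported here: run time on P nodes is taken to be the compute work divided evenly across nodes plus a fixed communication overhead. The work, node count, and overhead values are hypothetical; only the two per-node rates are taken from this report.

```python
def run_time(work_mflop, rate_mflops, nprocs, overhead_s):
    """Toy model: compute divides evenly over nodes; overhead is fixed."""
    return work_mflop / (rate_mflops * nprocs) + overhead_s


def parallel_efficiency(work_mflop, rate_mflops, nprocs, overhead_s):
    """Efficiency = T(1) / (P * T(P)), with no overhead charged on one node."""
    t1 = work_mflop / rate_mflops
    tp = run_time(work_mflop, rate_mflops, nprocs, overhead_s)
    return t1 / (nprocs * tp)


# Per-node rates quoted in this report (Mflop/sec).
SP2_RATE = 30.6
T3D_RATE = 7.3

# Hypothetical values, chosen only for illustration.
WORK_MFLOP = 10_000.0
NPROCS = 16
OVERHEAD_S = 1.0  # assumed identical on both machines

for name, rate in (("SP2", SP2_RATE), ("T3D", T3D_RATE)):
    t = run_time(WORK_MFLOP, rate, NPROCS, OVERHEAD_S)
    e = parallel_efficiency(WORK_MFLOP, rate, NPROCS, OVERHEAD_S)
    print(f"{name}: time = {t:.1f} s, efficiency = {e:.3f}")
```

Under these assumptions the slower node shows the higher efficiency and the longer run time, even though both machines are charged the same overhead. Giving the T3D a smaller overhead, as its faster network would suggest, only widens the efficiency gap.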
When timings are adjusted to discount inefficiency from parallelism, per-node performance of the SP2 is 30.6 Mflop/sec; each T3D node delivers 7.3 Mflop/sec. I suspect that the unusually small primary cache on each T3D node is a factor. Note that the model runs with 32-bit precision on the SP2, while the T3D code executes at 64-bit precision. However, 64-bit precision is not strictly necessary for this code and is used on the T3D only because that is all that is available. Also, 64-bit per-processor computational rates are generally slightly higher, not lower, than 32-bit rates on RS/6000 platforms, since the processor computes internally at the higher precision.
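The arithmetic behind these per-node figures can be sketched as follows. The adjustment formula is one plausible reading of "discounting inefficiency from parallelism" (aggregate rate divided by node count and by parallel efficiency) and is an assumption, as are the node count and efficiency used in the round-trip check; only the 30.6 and 7.3 Mflop/sec values come from the text.

```python
def adjusted_per_node_rate(aggregate_mflops, nprocs, efficiency):
    """Per-node rate with parallel inefficiency factored out (assumed formula)."""
    return aggregate_mflops / (nprocs * efficiency)


# Per-node rates quoted in the text (Mflop/sec).
SP2_PER_NODE = 30.6
T3D_PER_NODE = 7.3

ratio = SP2_PER_NODE / T3D_PER_NODE
print(f"per-node ratio SP2/T3D = {ratio:.2f}")

# Round-trip check with hypothetical node count and efficiency:
# build the aggregate rate such a run would report, then recover the per-node rate.
nprocs, eff = 16, 0.9
aggregate = SP2_PER_NODE * nprocs * eff
recovered = adjusted_per_node_rate(aggregate, nprocs, eff)
print(f"recovered per-node rate = {recovered:.1f} Mflop/sec")
```

On these figures an SP2 node delivers roughly 4.2 times the adjusted rate of a T3D node.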