MPMM and New MM90 Performance Data, December 1995

The following MPMM and MM90 performance data were collected in December 1995 on the NAS SP2. All runs used MPL (IBM's native message-passing library). A discussion of the results follows the figure below.

Results

We compare the performance of MM90, the new Fortran 90 version of the parallel MM5, against that of the latest version of the Fortran 77 MPMM model on two problems: a single domain and a doubly nested set of domains.

Single Domain

The single-domain (no nest) runs were conducted at 100 km resolution, non-hydrostatic, with full physics (radiation, Grell cumulus, explicit moisture, mixed-phase ice physics, Blackadar PBL). The grid is 61 by 61 points with 23 levels. Each time step entails 716 million floating point operations (measured by running the source model on a Cray with HPM, the hardware performance monitor). Mflop/sec ratings are obtained by dividing this number by the time, in seconds, for an average step. I/O time is not included.
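As a minimal sketch of the rating calculation described above, the following Python snippet divides the per-step operation count by a step time. The function name and the 0.5-second step time are illustrative assumptions, not values taken from the runs.

```python
# Floating point operations per single-domain time step (from the text above).
FLOPS_PER_STEP = 716e6


def mflops_rate(flops_per_step, seconds_per_step):
    """Sustained rate in Mflop/sec for one average time step (I/O excluded)."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return flops_per_step / seconds_per_step / 1e6


# Hypothetical average step time of 0.5 seconds (not a measured value).
example_rate = mflops_rate(FLOPS_PER_STEP, 0.5)
print(f"{example_rate:.0f} Mflop/sec")
```

Because the operation count is fixed per step, the rating is simply inversely proportional to the average step time.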

The MPMM code scales from 145 Mflop/sec on 4 processors to 1951 Mflop/sec on 64 processors (84 percent efficiency relative to the 4-processor run). A 36-hour simulation requires approximately 35 minutes on 4 processors and approximately 2.5 minutes on 64 processors, excluding I/O and initialization, which add several more minutes.
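The wall-clock estimates above can be approximated from the Mflop/sec ratings. The sketch below assumes a model time step of 300 seconds; this page does not state the time step, and the value is chosen only because it roughly reproduces the quoted times. All names are illustrative.

```python
FLOPS_PER_STEP = 716e6    # single-domain operation count, from the text
SIM_HOURS = 36            # simulated period, from the text
ASSUMED_MODEL_DT = 300.0  # ASSUMPTION: seconds of model time per step


def wall_minutes(mflops, flops_per_step=FLOPS_PER_STEP,
                 sim_hours=SIM_HOURS, model_dt=ASSUMED_MODEL_DT):
    """Estimated compute time in minutes, excluding I/O and initialization."""
    n_steps = sim_hours * 3600 / model_dt
    seconds_per_step = flops_per_step / (mflops * 1e6)
    return n_steps * seconds_per_step / 60


# MPMM single-domain ratings from the text: (processors, Mflop/sec).
for procs, rate in [(4, 145), (64, 1951)]:
    print(f"{procs:2d} processors: about {wall_minutes(rate):.1f} minutes")
```

Under that assumed time step the estimates come out near 36 minutes and 2.6 minutes, in line with the figures quoted above.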

The MM90 code scales from 121 Mflop/sec on 4 processors to 1713 Mflop/sec on 64 processors (88 percent efficiency relative to the 4-processor run). A 36-hour simulation requires approximately 45 minutes on 4 processors and approximately 3 minutes on 64 processors, again excluding I/O and initialization, which add roughly the same overhead as for MPMM.
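The efficiency percentages above are consistent with measuring speedup against the 4-processor run rather than against a single processor. The following sketch makes that calculation explicit; the function name is illustrative, and the 4-processor baseline is an interpretation that matches the quoted figures.

```python
def parallel_efficiency(base_rate, base_procs, rate, procs):
    """Achieved speedup divided by ideal speedup, relative to a baseline run."""
    speedup = rate / base_rate
    ideal_speedup = procs / base_procs
    return speedup / ideal_speedup


# Single-domain ratings from the text (Mflop/sec on 4 and 64 processors).
mpmm_eff = parallel_efficiency(145, 4, 1951, 64)
mm90_eff = parallel_efficiency(121, 4, 1713, 64)

print(f"MPMM: {mpmm_eff:.0%}")  # about 84 percent
print(f"MM90: {mm90_eff:.0%}")  # about 88 percent
```

Note that MM90 shows the higher efficiency here even though its absolute rate is lower, because efficiency is measured against each code's own 4-processor rate.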

Doubly Nested Problem

The three-domain (doubly nested) runs were conducted at 100km/33km/10km resolutions with the same physics; all grids are 61 by 61 points with 23 levels. Each coarse-domain time step entails 8.054 billion floating point operations. Mflop/sec ratings are obtained by dividing this number by the time, in seconds, for an average coarse-domain step. I/O time is not included.

The MPMM code scales from 126 Mflop/sec on 4 processors to 1568 Mflop/sec on 64 processors (78 percent efficiency relative to the 4-processor run). A 36-hour simulation requires approximately 6 hours on 4 processors and approximately 36 minutes on 64 processors.

The MM90 code scales from 139 Mflop/sec on 4 processors to 1048 Mflop/sec on 64 processors (47 percent efficiency relative to the 4-processor run). A 36-hour simulation requires approximately 7 hours on 4 processors and approximately 1 hour on 64 processors. Most of the scaling loss, however, occurs between 36 and 64 processors. From 4 to 36 processors (828 Mflop/sec; 1.2 hours per 36-hour simulation), the efficiency is better: about 66 percent.
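To see where the MM90 scaling loss falls, the efficiency can be computed separately for each processor range. This is a sketch using the ratings quoted above; the helper name is illustrative, and the 36-to-64 figure is derived here from those ratings rather than reported by the runs.

```python
def parallel_efficiency(base_rate, base_procs, rate, procs):
    """Achieved speedup divided by ideal speedup, relative to a baseline run."""
    return (rate / base_rate) / (procs / base_procs)


# MM90 nested-problem ratings from the text: processors -> Mflop/sec.
mm90_nested = {4: 139, 36: 828, 64: 1048}

eff_4_to_36 = parallel_efficiency(mm90_nested[4], 4, mm90_nested[36], 36)
eff_4_to_64 = parallel_efficiency(mm90_nested[4], 4, mm90_nested[64], 64)
# Incremental efficiency of the last step alone, using 36 processors as baseline.
eff_36_to_64 = parallel_efficiency(mm90_nested[36], 36, mm90_nested[64], 64)

print(f"4 -> 36:  {eff_4_to_36:.0%}")   # about 66 percent
print(f"4 -> 64:  {eff_4_to_64:.0%}")   # about 47 percent
print(f"36 -> 64: {eff_36_to_64:.0%}")  # about 71 percent
```

The overall 4-to-64 efficiency is the product of the two stage efficiencies, so the losses in each range compound.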

Inferences

There is an overall performance penalty associated with the MM90 code, attributable in part to the newness of the implementation. I expect that a tuning effort comparable to the one that went into the MPMM code will recover much of it. Unlike MPMM, however, MM90 scales poorly beyond 36 processors on the nested problem, which is a key concern.
