MPMM and New MM90 Performance Data, December 1995
The following MPMM performance data was collected in December 1995
on the NAS SP2. All runs use MPL (IBM native message passing).
A discussion of the results follows the figure below.
Results
We compare the new MM90 (Fortran 90) version of the parallel MM5
against the latest version of the MPMM (Fortran 77) model on two
problems: a single domain and a set of doubly nested domains.
Single Domain
The single domain (no nest) runs were conducted at 100 km resolution
with non-hydrostatic dynamics and full physics (radiation, Grell cumulus,
explicit moisture, mixed-phase ice physics, Blackadar PBL). The grid size
is 61 by 61 by 23 levels.
Each time step entails 716 million floating point operations
(determined by running the source model on a Cray with HPM, the
hardware performance monitor). Mflop ratings are determined by
dividing this number by the time, in seconds, for an average step.
I/O time is not included.
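The rating calculation described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 5.0 second step time is a made-up input, not a measurement from these runs.

```python
# Floating point operations per time step, as reported for the single domain case.
FLOP_PER_STEP = 716e6

def mflops(step_time_s, flop_per_step=FLOP_PER_STEP):
    """Mflop/sec rating: operations in one step divided by the average step time."""
    return flop_per_step / step_time_s / 1e6

def step_time(mflop_rate, flop_per_step=FLOP_PER_STEP):
    """Inverse relation: average step time (seconds) implied by a Mflop/sec rating."""
    return flop_per_step / (mflop_rate * 1e6)

# Hypothetical average step time of 5.0 seconds (illustrative only).
rate = mflops(5.0)
print(f"{rate:.1f} Mflop/sec")
print(f"{step_time(rate):.2f} s/step")
```

Because the operation count is fixed per step, a rating and an average step time carry the same information. Either can be recovered from the other.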
The MPMM code scales from 145 Mflop/sec on 4 processors to 1951
Mflop/sec on 64 processors (84 percent efficiency). A 36-hour
simulation will require approximately 35 minutes on 4 processors and
approximately 2.5 minutes on 64 processors, excluding the cost of I/O
and initialization, which add several more minutes.
The MM90 code scales from 121 Mflop/sec on 4 processors to 1713
Mflop/sec on 64 processors (88 percent efficiency). A 36-hour
simulation will require approximately 45 minutes on 4 processors and
approximately 3 minutes on 64 processors, also excluding the cost of
I/O and initialization, which add approximately the same penalty.
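The efficiency figures quoted for the single domain runs can be reproduced with a short calculation. The sketch below assumes efficiency means the achieved speedup (ratio of Mflop rates) divided by the ideal speedup (ratio of processor counts). The text does not state the definition, but this one matches the reported percentages. The function name is hypothetical.

```python
def parallel_efficiency(rate_base, procs_base, rate_scaled, procs_scaled):
    """Achieved speedup divided by the ideal speedup from adding processors.

    Assumed definition; the source reports percentages without a formula.
    """
    speedup = rate_scaled / rate_base      # how much faster the run became
    ideal = procs_scaled / procs_base      # how much faster it could have become
    return speedup / ideal

# Reported single domain ratings in Mflop/sec: (4 processors, 64 processors).
single_domain = {"MPMM": (145, 1951), "MM90": (121, 1713)}

for code, (rate_4, rate_64) in single_domain.items():
    eff = parallel_efficiency(rate_4, 4, rate_64, 64)
    print(f"{code}: {100 * eff:.1f} percent")
```

Note that MM90 shows the higher efficiency here even though its absolute rates are lower, because efficiency is measured relative to each code's own 4 processor rate.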
Doubly Nested Problem
The three domain (doubly nested) runs were conducted at
100 km/33 km/10 km resolutions with the same physics as the single
domain runs. All grids are sized 61 by 61 by 23.
Each coarse-domain time step entails 8.054 billion floating point
operations. Mflop ratings are determined by dividing this number by
the time, in seconds, for an average coarse-domain step. I/O time is
not included.
The MPMM code scales from 126 Mflop/sec on 4 processors to 1568
Mflop/sec on 64 processors (78 percent efficiency). A 36-hour
simulation will require approximately 6 hours on 4 processors and
approximately 36 minutes on 64 processors.
The MM90 code scales from 139 Mflop/sec on 4 processors to 1048
Mflop/sec on 64 processors (47 percent efficiency). A 36-hour
simulation will require approximately 7 hours on 4 processors and
approximately 1 hour on 64 processors. Most of the scaling loss,
however, occurs between 36 and 64 processors. From 4 to 36 processors
(828 Mflop/sec at 36 processors; 1.2 hours per 36-hour simulation),
the efficiency is better: about 66 percent.
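To see where the MM90 scaling loss on the nested problem occurs, the three reported ratings can be compared segment by segment. This sketch assumes efficiency is the achieved speedup divided by the ideal speedup, which is consistent with the percentages quoted above. The helper name and the 36 to 64 processor segment calculation are illustrative additions, not part of the original measurements.

```python
def parallel_efficiency(rate_base, procs_base, rate_scaled, procs_scaled):
    """Achieved speedup over ideal speedup (assumed definition of efficiency)."""
    return (rate_scaled / rate_base) / (procs_scaled / procs_base)

# Reported MM90 ratings on the doubly nested problem: processors -> Mflop/sec.
mm90_nested = {4: 139, 36: 828, 64: 1048}

# Efficiency relative to the 4 processor run, as quoted in the text.
for procs in (36, 64):
    eff = parallel_efficiency(mm90_nested[4], 4, mm90_nested[procs], procs)
    print(f"4 -> {procs} processors: {100 * eff:.0f} percent")

# Efficiency of the last segment alone, using 36 processors as the baseline.
segment = parallel_efficiency(mm90_nested[36], 36, mm90_nested[64], 64)
print(f"36 -> 64 processors: {100 * segment:.0f} percent")
```

Using the 36 processor run as the baseline isolates the final segment: the rate grows from 828 to 1048 Mflop/sec while the processor count grows from 36 to 64.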
Inferences
There is an overall performance penalty associated with the
MM90 code, attributable in part to the newness of the
implementation. I expect that a tuning effort like the one that
went into the MPMM code will recover much of this penalty.
Unlike MPMM, MM90 scales poorly beyond 36 processors on the nested
problem, which is a key concern.