FOAM Performance and load balancing

The performance measurements above were done with the default configuration of FOAM, which uses the R15 (40 latitudes x 48 longitudes, 18 levels) PCCM3-UW atmosphere model and the 128x128x24 OM3 ocean model. FOAM was run in "co-located" mode, which places the ocean model on the same processors as the atmosphere, so the total processor count is always a power of two. Timings are taken by writing out the system clock on node 0 every simulated day and taking the difference. They are then converted from seconds per simulated day to simulated years per day. Timings do not include I/O, which typically adds only 5% for normal output frequencies (monthly history, annual restart).
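The unit conversion described above can be sketched as follows. This is a minimal illustration, not FOAM's own timing code: the function name is hypothetical, and the 365-day model year is an assumption that should be adjusted if the model calendar differs.

```python
SECONDS_PER_WALL_DAY = 86400.0  # wall-clock seconds in one real day


def simulated_years_per_day(seconds_per_sim_day, days_per_year=365.0):
    """Convert wall-clock seconds per simulated day into
    simulated years per wall-clock day.

    days_per_year=365 is an assumption about the model calendar.
    """
    if seconds_per_sim_day <= 0:
        raise ValueError("seconds_per_sim_day must be positive")
    # How many simulated days fit into one wall-clock day.
    sim_days_per_wall_day = SECONDS_PER_WALL_DAY / seconds_per_sim_day
    # Express that number of simulated days in simulated years.
    return sim_days_per_wall_day / days_per_year
```

Under these assumptions, a hypothetical measurement of 10 seconds per simulated day corresponds to 86400 / (10 x 365), or roughly 23.7 simulated years per day.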

More information on the platforms:

IBM Power5:     bluevista at NCAR. 1.9GHz Power5.
Linux Cluster:  Lightning at NCAR. 2.2GHz Opteron; Myrinet. PGF77 v. 5.2 with -fastsse.
Linux Cluster:  Teragrid cluster at Argonne National Lab. 1.3GHz Itanium2; Myrinet. Intel Fortran compilers.
IBM Power4:     bluesky at NCAR. 1.3GHz Power4.
Origin 3800:    chinook at NCAR. 500MHz MIPS R14K. Not run in dedicated mode.
IBM Power3:     blackforest at NCAR. 375MHz Power3.
Linux Cluster:  "Jazz" at Argonne National Lab. 2.4GHz Xeon; Myrinet. Intel Fortran compilers.
Linux Cluster:  "Joxaren" at the University of Chicago. 1.5GHz AMD Athlon; Myrinet. Portland Group Fortran.

Load Balance Plots

In contrast to the co-located version mentioned above, it is possible for the two basic components of FOAM (the ocean and atmosphere models) to each run on a different subset of the total processor space. The atmosphere model requires much more real time than the ocean model to simulate a given length of time. This difference can be attributed to the atmosphere model's more complex code (with radiation and convection parameterizations) and its shorter time step (half an hour for the atmosphere vs. six hours for the ocean).
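The effect of the two time steps on the number of steps each model takes can be worked out directly. The sketch below uses only the step lengths quoted above; the names are illustrative and not taken from the FOAM source.

```python
from fractions import Fraction

ATM_STEP_HOURS = Fraction(1, 2)  # atmosphere time step: half an hour
OCN_STEP_HOURS = Fraction(6)     # ocean time step: six hours


def steps_per_day(step_hours):
    """Number of model time steps needed to cover 24 simulated hours."""
    return Fraction(24) / step_hours


atm_steps = steps_per_day(ATM_STEP_HOURS)
ocn_steps = steps_per_day(OCN_STEP_HOURS)

# Steps per simulated day for each model, and their ratio.
print(atm_steps, ocn_steps, atm_steps / ocn_steps)  # 48 4 12
```

The atmosphere therefore takes 12 steps for every ocean step, before the higher cost of each atmosphere step is even considered.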

This difference in cost is reflected in how the processors are divided between the two models in this mode. One common arrangement uses 16 processors for the atmosphere model and only one for the ocean. Even though the atmosphere has most of the processors, the ocean model still spends time waiting for the atmosphere to catch up.
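A processor split of this kind can be pictured as a mapping from global rank to component. The sketch below is only an illustration: the function name is hypothetical, and placing the atmosphere ranks first is an assumption, since the text does not say which ranks FOAM gives to which model. In an MPI program, a label like this would typically serve as the color argument to MPI_Comm_split so that each component gets its own communicator.

```python
def assign_components(n_atm, n_ocn):
    """Return a list mapping each global rank to a component name.

    Assumption: atmosphere ranks come first (0 .. n_atm - 1) and the
    ocean ranks follow. FOAM's actual rank ordering may differ.
    """
    if n_atm < 1 or n_ocn < 1:
        raise ValueError("each component needs at least one processor")
    return ["atmosphere"] * n_atm + ["ocean"] * n_ocn


# The 16 + 1 arrangement described above.
layout = assign_components(16, 1)
for rank, component in enumerate(layout):
    print(rank, component)
```

Calling assign_components(32, 2) gives the corresponding 34-processor layout with two ocean ranks.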

The plots below give an idea of the relative workload on the processors. The different colors represent different tasks: 

  • Atmosphere: Green 
  • Ocean: Blue 
  • Coupler: Red 
  • Waiting: Violet

    Each row represents a different processor (only a few of the atmosphere processors are shown, but all of the ocean processors are shown). The numbers on the y-axis are the ranks in MPI_COMM_WORLD.

    The plots show how the tasks are decomposed onto separate processors, and also how the ocean has to wait after each of its steps (the violet area after each blue bar) for the atmosphere to send more data. Notice that every other atmosphere segment (in green) is longer. This is because the radiation calculations are done every other step. The longer coupler segment (in red) results from the added time needed to communicate with the ocean model.

    For 17 Processors: (Each tick along the bottom represents 2.5 seconds)

    For 34 Processors (32 atmosphere and 2 ocean): (Each tick along the bottom represents 1 second)

    These plots were created using UPSHOT.