Initial Results:  Experiments on Watson BlueGene/L


Note:  These results are not yet for public distribution.  Until we can confirm and analyze all the data, these results are preliminary.

October 26-27, 2005
Investigators:  Pete Beckman, Susan Coghlan, Kamil Iskra, Kazutomo Yoshii

Goals

The interaction between operating and run-time system components on large scale MPPs remains largely a mystery.  While anecdotal evidence suggests that TLB misses, interrupts, and asynchronous events can dramatically impact performance, the research community lacks a clear understanding of such behavior at scale.  Are there levels of OS interaction that are acceptable?  How significant is the performance difference between barrier operations and global collective operations, such as collectives, in the presence of OS interference?  Are there thresholds that can be tolerated for some applications? Which?  These and other questions remain largely unstudied as we search to build ever-larger peta-scale MPPs.  However, the answer to these questions is critical to the designs and computational modules for future architectures. Understanding how operating systems interact with applications, and how interrupts, kernel scheduling, and I/O processing affect performance on large scale systems is key to petascale systems research.

Experiments: Benchmarking Operating Systems on BGW

The purpose of the investigation was to analyze and understand the performance of the BG/L operating system and collective operations at large scale,

We ran several tests:

Selfish:  Measures kernel interaction with an application.  By assuming that the computation wants all the available cycles (is selfish), the benchmark records all interactions or "detours" that distract the CPU from the application.  Detours can be caused by timer interrupts and other preemptive behavior from the Operating System.

Detour_barrier, Detour_allreduce, Detour_alltoall, and Detour_myallreduce: These benchmarks run as applications on the BG/L compute kernel, and simulate heavy OS interaction by artificially introducing detours via timer interrupts.  We varied both the length and frequency of the timer interrupts while benchmarking their effect on collective operations, which have the highest potential for disruption from asynchronous behavior. 

Das_Boot:  We varied the ramdisk of the I/O node kernel and then measured boot times to understand how OS size affects booting times

ZeptoOS:  Our Linux I/O node kernel and ramdisk, ZeptoOS, was also tested at scale.

Results

The most interesting results are for examining the MPI collectives in the presences of synchronous OS events.

Background:  The Nature of Detours

Before providing an overview of the BGW results, a quick introduction into the nature of "detours" is necessary.  For the purposes of discussing OS interaction, we use the term "detour" to mean any asynchronous event that temporarily distracts a selfish, CPU-loving application from its workload.  Interrupting a simulation to update the time of day clock, run a heart-beat process, or handle a remote login are all examples of detours, as we define them.

In contrast, a TLB miss is not a detour, since it is simply memory/page bookkeeping triggered by the user application, and is not strictly asynchronous behavior.  In other words, handling a TLB miss is really no different than handling cache misses, library calls, or floating point traps.  All such activities can be managed by either eliminating them (e.g. making pages so large that TLB misses cannot occur) or programming specifically to manage or avoid them.  While their operation may be (poorly) implemented for a particular operating system, it is core to the notion of a multi-processing, asynchronous operating system.  Asynchronous events, detours, on the other hand, are generally not manageable from user-space processes and nearly always the domain of the privileged operating system.

Furthermore, for a uniprocessor job running over a large period of time, asynchronous operating system detours pose no particular problem.  If the operating system steals 0.2% of the CPU in detours from the application, over a long period of time, the application will simply run slower by that negligible amount.  On the other hand, in parallel applications, the detours can be additive. 

Imagine for a moment a ring of p processors.  Each one simply waits for a message and then sends it off to its neighbor.  Let's further suppose that the likelihood that any individual processor suffers an OS detour immediately prior to sending the ring message to its neighbor is 0.2%.  If an OS detour is 10 microseconds and there are 100K processors, the average delay for a single trip around the ring would be 2 milliseconds.  However, if an OS detour was 100 milliseconds, the time slice often given to other jobs competing for the CPU, the delay would be 20 seconds!  Naturally, the statistical distribution of detours, their length, and their frequency can dramatically affect collective operations.  Our goal was to measure this effect on the BG/L MPI collectives most likely to be susceptible to such dramatic delays.

Barrier

The MPI_Barrier() operation on BlueGene is extremely fast, since it is directly assisted by hardware.  We measured barrier times ranging from about 1 microsecond to 1.3 microseconds without asynchronous detours (simulated OS interaction).  The performance for adding randomized detours was quite predictable, the cost of one detour plus the basic cost of a barrier.  The detours were not additive, and did not significantly increase as the machine size increased to 16K nodes.  For example, for small numbers of nodes (512) the raw barrier time went from about 1 microsecond to about 18.5 microseconds for tests with detours of 16 microseconds.  In similar behavior, on 16K nodes, the time went from 1.3 microseconds with no asynchronous detours to about 20 microseconds for a detour of 16 microseconds.  So, while processing an interrupt or being delayed by a special OS task does indeed slow down barriers, the effect is simple, not compounded by the size of the machine.  Therefore, there are no scalability issues with asynchronous OS behavior and barriers on extreme-scale machines with fast-barriers available via hardware.  Naturally, as described above, if the barrier had been poorly implemented with a linear algorithm, such as a ring, the effect would likely have been dramatic.

All_Reduce

BG/L support some types of reductions within the communication fabric.  For simple reductions, the speed ranged from 8.3 microseconds on 512 nodes to 11.6 microseconds on 16K CPUs.  Introducing detours for basic All_Reduce functionality behaved similarly to barriers, the reduction was simple, not dominated by the scale of the machine.  For detours of 16 microseconds, the reductions went from 43 microseconds for 512 nodes to 49 microseconds for 16K nodes.  Detours of 200 microseconds, an eternity for interrupt handling (a fast interrupt can be handled in 1.8 microseconds), only slowed the All_Reduce functions from 412 microseconds to 417 microseconds for between 512 and 16K nodes.  With delays nearly constant, no dramatic scaling issues are presented for All_Reduce functions where the hardware can assist.

myAll_Reduce

In this experiment, we used a user-defined reductions, forcing the reduction out of the communication fabric and into user code for the operation.  However, like All_Reduce, the results were impressive.  Times ranged from 38 microseconds to 73 microseconds for between 512 and 16K nodes.  This represents roughly a factor of 5 to 8 times slower than the hardware-assisted reductions.   Nevertheless,  the addition of asynchronous detours did not dramatically affect performance.  For detours of 16 microseconds, the user-defined reductions went from  71 microseconds to 121 microseconds.  While still almost a factor of 2 slower, the effect was not dramatic, nor was it significantly compounded by scale.

Conclusions

While we generated lots of data, and are still working to plot 3D graphs, the preliminary numbers were fairly dramatic.  We believe that while it is certainly possible to suffer problems if competing jobs steal time slices from selfish applications, thereby delaying collectives by 10s to possibly 100s of milliseconds, standard OS operations such as interrupt handling and timer updates, that generally take several microseconds, do not pose significant issues for extreme-scale machines.  Even with 16 microsecond detours (about 8 times longer than what is required for a fast interrupt handler) user-defined collective operations only ran 48 microseconds longer for 16K nodes.

However, since we also have data that varies with frequency, and other variables, we reserve final conclusions until all the data has been analyzed.