Initial Results: Experiments on Watson BlueGene/L
Note: These results are not yet
for public distribution. Until we can confirm and analyze all the
data, these results are preliminary.
October 26-27, 2005
Investigators: Pete Beckman, Susan Coghlan, Kamil Iskra, Kazutomo
Yoshii
Goals
The interaction between operating and run-time system components on
large scale MPPs remains largely a mystery. While anecdotal
evidence
suggests that TLB misses, interrupts, and asynchronous events can
dramatically impact performance, the research community lacks a clear
understanding of such behavior at scale. Are there levels of OS
interaction that are acceptable? How significant is the
performance
difference between barrier operations and global collective operations,
such as collectives, in the presence of OS interference? Are
there
thresholds that can be tolerated for some applications? Which?
These
and other questions remain largely unstudied as we search to build
ever-larger peta-scale MPPs. However, the answer to these
questions is
critical to the designs and computational modules for future
architectures. Understanding how operating systems interact with
applications, and how interrupts, kernel scheduling, and I/O processing
affect performance on large scale systems is key to petascale systems
research.
Experiments: Benchmarking Operating Systems on BGW
The purpose of the investigation was to analyze and understand the
performance of the BG/L operating system and collective operations at
large scale,
We ran several tests:
Selfish: Measures kernel
interaction with an application. By assuming that the computation
wants all the available cycles (is selfish), the benchmark records all
interactions or "detours" that distract the CPU from the
application. Detours can be caused by timer interrupts and other
preemptive behavior from the Operating System.
Detour_barrier, Detour_allreduce,
Detour_alltoall, and Detour_myallreduce: These benchmarks run as
applications on the BG/L compute kernel, and simulate heavy OS
interaction by artificially introducing detours via timer
interrupts. We varied both the length and frequency of the timer
interrupts while benchmarking their effect on collective operations,
which have the highest potential for disruption from asynchronous
behavior.
Das_Boot: We varied the
ramdisk of the I/O node kernel and then measured boot times to
understand how OS size affects booting times
ZeptoOS: Our Linux I/O
node kernel and ramdisk, ZeptoOS, was also tested at scale.
Results
The most interesting results are for examining the MPI collectives in
the presences of synchronous OS events.
Background: The Nature of Detours
Before providing an overview of the BGW results, a quick introduction
into the nature of "detours" is necessary. For the purposes of
discussing OS interaction, we use the term "detour" to mean any
asynchronous event that temporarily distracts a selfish, CPU-loving
application from its workload. Interrupting a simulation to
update the time of day clock, run a heart-beat process, or handle a
remote login are all examples of detours, as we define them.
In contrast, a TLB miss is not
a detour, since it is simply memory/page bookkeeping triggered by the
user
application, and is not strictly asynchronous behavior. In other
words, handling a TLB miss is really no different than handling cache
misses, library calls, or floating point traps. All such
activities can be managed by either eliminating them (e.g. making pages
so large that TLB misses cannot occur) or programming specifically to
manage or avoid them. While their operation may be (poorly)
implemented for a particular operating system, it is core to the notion
of a multi-processing, asynchronous operating system.
Asynchronous events, detours, on the other hand, are generally not
manageable from user-space processes and nearly always the domain of
the
privileged operating system.
Furthermore, for a uniprocessor job running over a large period of
time, asynchronous operating system detours pose no particular
problem. If the operating system steals 0.2% of the CPU in
detours from the application, over a long period of time, the
application will simply run slower by that negligible amount. On
the other hand, in parallel applications, the detours can be
additive.
Imagine for a moment a ring of p
processors. Each one simply waits for a message and then sends it
off to its neighbor. Let's further suppose that the likelihood
that any individual processor suffers an OS detour immediately prior to
sending the ring message to its neighbor is 0.2%. If an OS detour
is 10 microseconds and there are 100K processors, the average delay
for a single trip around the ring would be 2 milliseconds.
However, if an OS detour was 100
milliseconds, the time slice often given to other jobs competing for
the CPU, the delay would be 20 seconds! Naturally, the
statistical distribution of detours, their length, and their frequency
can dramatically affect collective operations. Our goal was to
measure this effect on the BG/L MPI collectives most likely to be
susceptible to such dramatic delays.
Barrier
The MPI_Barrier() operation on BlueGene is extremely fast, since it is
directly assisted by hardware. We measured barrier times ranging
from about 1 microsecond to 1.3 microseconds without asynchronous
detours (simulated OS interaction). The performance for adding
randomized detours was quite
predictable, the cost of one detour plus the basic cost of a
barrier. The detours were not additive, and did not significantly
increase as the machine size increased to 16K nodes. For example,
for small numbers of nodes (512) the raw barrier time went from about 1
microsecond to about 18.5 microseconds for tests with detours of 16
microseconds. In similar behavior, on 16K nodes, the time went
from 1.3 microseconds with no asynchronous detours to about 20
microseconds for a detour of 16 microseconds. So, while
processing
an interrupt or being delayed by a special OS task does indeed slow
down barriers, the effect is simple, not compounded by the size of the
machine. Therefore, there are no scalability issues with
asynchronous OS behavior and barriers on extreme-scale machines with
fast-barriers available via hardware. Naturally, as described
above, if the barrier had been poorly implemented with a linear
algorithm,
such as a ring, the effect would likely have been dramatic.
All_Reduce
BG/L support some types of reductions within the communication
fabric. For simple reductions, the speed ranged from 8.3
microseconds on 512 nodes to 11.6 microseconds on 16K CPUs.
Introducing detours for basic All_Reduce functionality behaved
similarly to barriers, the reduction was simple, not dominated by the
scale of the machine. For detours of 16 microseconds, the
reductions went from 43 microseconds for 512 nodes to 49 microseconds
for 16K nodes. Detours of 200 microseconds, an eternity for
interrupt handling (a fast interrupt can be handled in 1.8
microseconds), only slowed the All_Reduce functions from 412
microseconds to 417 microseconds for between 512 and 16K nodes.
With delays nearly constant, no dramatic scaling issues are presented
for All_Reduce functions where the hardware can assist.
myAll_Reduce
In this experiment, we used a user-defined reductions, forcing the
reduction out of the communication fabric and into user code for the
operation. However, like All_Reduce, the results were
impressive. Times ranged from 38 microseconds to 73 microseconds
for between 512 and 16K nodes. This represents roughly a factor
of 5 to 8 times slower than the hardware-assisted
reductions. Nevertheless, the addition of
asynchronous detours did not dramatically affect performance. For
detours of 16 microseconds, the user-defined reductions went from
71 microseconds to 121 microseconds. While still almost a factor
of 2 slower, the effect was not dramatic, nor was it significantly
compounded by scale.
Conclusions
While we generated lots of data, and are still working to plot 3D
graphs, the preliminary numbers were fairly dramatic. We believe
that while it is certainly possible to suffer problems if competing
jobs steal time slices from selfish applications, thereby delaying
collectives by 10s to possibly 100s of milliseconds, standard OS
operations such as interrupt handling and timer updates, that generally
take several microseconds, do not pose significant issues for
extreme-scale machines. Even with 16 microsecond detours (about 8
times longer than what is required for a fast interrupt handler)
user-defined collective operations only ran 48 microseconds longer for
16K nodes.
However, since we also have data that varies with frequency, and other
variables, we reserve final conclusions until all the data has been
analyzed.