Deploying a High-Performance Filesystem on BGL

Hardware

In these early stages we had some difficulty getting PVFS2 running on all 16 storage nodes. At the moment 12 of them are in service, providing in aggregate a 1.1 TB PVFS2 volume.

Hardware benchmarks

Each storage node has a RAID array for PVFS2 storage. To get some idea of the performance of the disk subsystem, here is a run of bonnie++ on one of the storage nodes (fs2).

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fs2             10G 38476  95 49354  14 23415   5 35808  72 63971   5 557.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3113  99 +++++ +++ +++++ +++  3138  99 +++++ +++  9620 100
fs2,10G,38476,95,49354,14,23415,5,35808,72,63971,5,557.3,0,16,3113,99,+++++,+++,+++++,+++,3138,99,+++++,+++,9620,100
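For reference, the "Sequential Output, Block" figure above corresponds to timing large sequential writes to a single file. The following is only a minimal sketch of that kind of measurement, not bonnie++ itself; the file name, block size, and total size are arbitrary choices.

/* seqwrite.c -- minimal sequential-write throughput sketch (not bonnie++).
 * Writes SIZE_MB megabytes in 1 MB blocks and reports MB/s.
 * The output file name and sizes are arbitrary choices for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BLOCK   (1024 * 1024)   /* 1 MB per write() */
#define SIZE_MB 1024            /* total data written: 1 GB */

int main(void)
{
    char *buf = malloc(BLOCK);
    struct timeval t0, t1;
    double secs;
    int i, fd;

    memset(buf, 0xAA, BLOCK);
    fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < SIZE_MB; i++) {
        if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
    }
    fsync(fd);                  /* flush to disk, not just the page cache */
    gettimeofday(&t1, NULL);
    close(fd);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d MB in %.2f s = %.1f MB/s\n", SIZE_MB, secs, SIZE_MB / secs);
    free(buf);
    return 0;
}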

Software

The storage, login, and IO nodes are all running a CVS snapshot of PVFS2 from Feb. 16th. Our BlueLight version is DRV521_2004-050113. The login and storage nodes run SLES 9.

Items of note

Benchmarks

mpi-io-test

mpi-io-test is a simple MPI-IO contiguous access benchmark. Each process writes a large chunk of data to a non-overlapping, non-interleaved region of a file and then reads it back. It reports the aggregate IO performance of all processes involved in the job. We would expect this benchmark to give an upper bound on IO performance.
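In simplified form, the access pattern looks like the sketch below. This is not the actual mpi-io-test source; the file name and the 32 MB per-process buffer size are just examples.

/* Sketch of the mpi-io-test access pattern (not the actual benchmark source):
 * each rank writes one contiguous, non-overlapping block at offset
 * rank*BUFSIZE, then reads it back.  File name and buffer size are examples. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define BUFSIZE (32 * 1024 * 1024)   /* 32 MB per process */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_File fh;
    MPI_Offset offset;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BUFSIZE);
    memset(buf, rank & 0xff, BUFSIZE);
    offset = (MPI_Offset)rank * BUFSIZE;   /* non-overlapping, non-interleaved */

    MPI_File_open(MPI_COMM_WORLD, "pvfs2:/pvfs/testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_File_write_at(fh, offset, buf, BUFSIZE, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("write time: %f s\n", t1 - t0);

    /* read the same region back */
    MPI_File_read_at(fh, offset, buf, BUFSIZE, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}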

We ran mpi-io-test across the entire rack, varying the number of compute nodes as well as the amount of data each process wrote to the PVFS2 file. Read performance topped out at 600 MBytes/sec with 1024 nodes, each transferring 32 MB chunks. Write performance was quite erratic, with peak write bandwidth topping out at around 150 MBytes/sec.

[PVFS2 read performance on 1024 nodes] [PVFS2 write performance on 1024 nodes]

coll_perf

In coll_perf, the program writes a three-dimensional array to a file. All processes perform collective IO. Data pending.
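For reference, a collective write in this style can be expressed with a subarray file view and MPI_File_write_all. The sketch below is not the coll_perf source; the process grid, the 128^3 local block size, and the file name are assumptions for illustration.

/* Sketch of a coll_perf-style collective write (not the actual test source):
 * a global 3D array is block-distributed over a 3D process grid, each rank
 * describes its piece with a subarray datatype, and all ranks write it
 * collectively.  Array dimensions and file name are arbitrary examples. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
    int gsizes[3], lsizes[3], starts[3], nelems = 1;
    MPI_Comm cart;
    MPI_Datatype filetype;
    MPI_File fh;
    int *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* build a 3D process grid and find this rank's coordinates in it */
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 3, coords);

    for (i = 0; i < 3; i++) {
        lsizes[i] = 128;                    /* local block: 128^3 ints (example) */
        gsizes[i] = lsizes[i] * dims[i];    /* global array size */
        starts[i] = coords[i] * lsizes[i];  /* this rank's offset in the array */
        nelems *= lsizes[i];
    }
    local = malloc(nelems * sizeof(int));
    memset(local, 0, nelems * sizeof(int));

    /* file view: this rank sees only its subarray of the global 3D array */
    MPI_Type_create_subarray(3, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(cart, "pvfs2:/pvfs/coll_test",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* every process participates in the collective write */
    MPI_File_write_all(fh, local, nelems, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}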

mpi-tile-io

Not done yet


Last update:

Tue Mar 29 16:38:05 CST 2005
