Deploying a High-Performance Filesystem on BGW

Our goals

BGW represented our first chance to deploy PVFS2 on a system with thousands of clients. We set out to measure how well PVFS2 would scale and perform on such a large system. Although we had planned to run quite a few workloads, time constraints forced us to focus on two: mpi-io-test and mpi-md-test.

The mpi-io-test benchmark measures large, contiguous I/O. It represents a highly favorable workload for most parallel file systems. Each client writes a fixed amount of data, so the more clients running the benchmark, the larger the generated file. As a result, this benchmark frequently uncovers saturation points in either the storage system or the networking infrastructure. We hoped to find the peak I/O rate the 33 PVFS2 servers could deliver.

The mpi-md-test benchmark is an aggressive test of several collective MPI-IO metadata operations. Because the operations are collective, an MPI-IO implementation can apply optimizations that greatly improve scalability. These optimizations are not yet in wide use, however, so we expected to see poor scaling on BGW. The benchmark thus gives us an initial point of comparison for future improvements.
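To illustrate why a collective operation leaves room for optimization, the sketch below counts file system requests for a single collective call under two strategies. It is an assumption for illustration only: the text does not say which optimizations are meant, and the "one process contacts the servers, then broadcasts the result" scheme, the function names, and the process count of 8192 are all hypothetical.

```python
def requests_independent(nprocs):
    """Every process contacts the file system itself: one request per process."""
    return nprocs


def requests_aggregated(nprocs):
    """Hypothetical optimization: one process issues the request on behalf of all."""
    return 1 if nprocs > 0 else 0


def broadcast_rounds(nprocs):
    """Rounds needed to spread the result with a binomial-tree broadcast.

    Equals ceil(log2(nprocs)), computed with integers to avoid rounding error.
    """
    return (nprocs - 1).bit_length() if nprocs > 0 else 0


if __name__ == "__main__":
    nprocs = 8192  # assumed process count, for illustration only
    print("independent requests:", requests_independent(nprocs))
    print("aggregated requests: ", requests_aggregated(nprocs))
    print("broadcast rounds:    ", broadcast_rounds(nprocs))
```

Under these assumptions the load on the servers no longer grows with the number of clients; the cost moves to a logarithmic number of communication rounds among the clients.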

Hardware

We were able to commandeer the BGW storage nodes for these experiments: 33 P4 storage nodes with 2.5 GB of RAM and 650 GB of RAID-0 storage.

To get some idea of the performance of the disk subsystem, here is a run from bonnie++.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pvfs1            5G 37190  98 70944  35 31162  11 39441  83 98181  12 455.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2570  98 +++++ +++ +++++ +++  2703  98 +++++ +++  6587 100
pvfs1,5G,37190,98,70944,35,31162,11,39441,83,98181,12,455.6,0,16,2570,98,+++++,+++,+++++,+++,2703,98,+++++,+++,6587,100
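The last line of the bonnie++ output repeats the table as comma-separated values, which makes it easy to pull out individual figures. The following is a minimal sketch of doing so; the field names are our own labels derived from the column headers above (they are not names defined by bonnie++), and the conversion assumes 1 MB = 1024 K.

```python
BONNIE_CSV = (
    "pvfs1,5G,37190,98,70944,35,31162,11,39441,83,98181,12,455.6,0,"
    "16,2570,98,+++++,+++,+++++,+++,2703,98,+++++,+++,6587,100"
)

# Hypothetical labels, in the column order of the table above.
FIELDS = [
    "machine", "size",
    "seq_out_char_kps", "seq_out_char_cpu",
    "seq_out_block_kps", "seq_out_block_cpu",
    "seq_out_rewrite_kps", "seq_out_rewrite_cpu",
    "seq_in_char_kps", "seq_in_char_cpu",
    "seq_in_block_kps", "seq_in_block_cpu",
    "random_seeks_ps", "random_seeks_cpu",
    "files",
    "seq_create_ps", "seq_create_cpu",
    "seq_read_ps", "seq_read_cpu",
    "seq_delete_ps", "seq_delete_cpu",
    "rand_create_ps", "rand_create_cpu",
    "rand_read_ps", "rand_read_cpu",
    "rand_delete_ps", "rand_delete_cpu",
]


def parse_bonnie(line):
    """Split one CSV summary line into a dict of strings keyed by FIELDS."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def kps_to_mbps(kps):
    """Convert a K/sec figure to MB/sec, assuming 1 MB = 1024 K."""
    return float(kps) / 1024.0


row = parse_bonnie(BONNIE_CSV)
block_write_mb_s = kps_to_mbps(row["seq_out_block_kps"])
block_read_mb_s = kps_to_mbps(row["seq_in_block_kps"])

if __name__ == "__main__":
    print(f"block write: {block_write_mb_s:.1f} MB/s")
    print(f"block read:  {block_read_mb_s:.1f} MB/s")
```

Values are kept as strings because some columns hold "+++++" rather than a number; only the fields that are actually converted need to be numeric.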

Software

The storage, login, and IO nodes are all running PVFS2 1.3.0, and the BlueGene driver is "Driver 253".

Items of note

Benchmarks

mpi-io-test

mpi-io-test is a simple MPI-IO contiguous access benchmark. Each process writes a large chunk of data to a non-overlapping, non-interleaved region of a file and then reads it back. The benchmark reports the aggregate IO performance of all processes involved in the job. We would expect it to give an upper bound on IO performance. In these runs, each client (process) reads and writes a contiguous 8 MB buffer.
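The access pattern described above can be sketched as follows. This is not the benchmark's source code: it assumes regions are laid out in rank order, that each process transfers a single buffer, and that 8 MB means 8 * 1024 * 1024 bytes. The function names are hypothetical.

```python
MIB = 1024 * 1024
BUFFER_BYTES = 8 * MIB  # assumed size of the 8 MB per-process buffer


def region(rank, buffer_bytes=BUFFER_BYTES):
    """Half-open byte range [start, end) of the file owned by one process."""
    start = rank * buffer_bytes
    return start, start + buffer_bytes


def file_size(nprocs, buffer_bytes=BUFFER_BYTES):
    """Total file size: it grows linearly with the number of processes."""
    return nprocs * buffer_bytes


def aggregate_rate(nprocs, elapsed_seconds, buffer_bytes=BUFFER_BYTES):
    """Aggregate bytes per second, given the time taken by the whole job."""
    return file_size(nprocs, buffer_bytes) / elapsed_seconds


if __name__ == "__main__":
    for rank in range(3):
        print(rank, region(rank))
    print("file size for 4 processes:", file_size(4), "bytes")
```

Because each region starts exactly where the previous one ends, no two processes touch the same bytes, and the regions are not interleaved.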

When we ran mpi-io-test across 2 rows (8k nodes), we saw good scaling. Our peak read performance was 2.8 GBytes/sec and our peak write performance was 400 MBytes/sec. We expect writes to perform worse than reads: write operations reflect the performance of the disk subsystem, while read operations benefit from caching effects on the servers.
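To put the aggregate figures on the same scale as the single-node bonnie++ run, it can help to divide them by the number of servers. The arithmetic below is a rough sketch: it assumes the load is spread evenly over all 33 servers, that 1 GByte = 1000 MBytes, and that the bonnie++ K/sec values convert at 1 MB = 1024 K.

```python
SERVERS = 33
PEAK_READ_MB_S = 2.8 * 1000   # 2.8 GBytes/sec, assuming 1 GByte = 1000 MBytes
PEAK_WRITE_MB_S = 400.0       # 400 MBytes/sec

# Block figures from the bonnie++ run, converted assuming 1 MB = 1024 K.
BONNIE_BLOCK_READ_MB_S = 98181 / 1024
BONNIE_BLOCK_WRITE_MB_S = 70944 / 1024


def per_server(aggregate_mb_s, servers=SERVERS):
    """Average share of the aggregate rate handled by one server."""
    return aggregate_mb_s / servers


read_share = per_server(PEAK_READ_MB_S)
write_share = per_server(PEAK_WRITE_MB_S)

if __name__ == "__main__":
    print(f"read:  {read_share:.1f} MB/s per server, "
          f"bonnie++ block read {BONNIE_BLOCK_READ_MB_S:.1f} MB/s")
    print(f"write: {write_share:.1f} MB/s per server, "
          f"bonnie++ block write {BONNIE_BLOCK_WRITE_MB_S:.1f} MB/s")
```

Under these assumptions the peak rates work out to roughly 85 MB/s per server for reads and roughly 12 MB/s per server for writes.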

[PVFS2 read performance on 2 rows of BGW] [PVFS2 write performance on 2 rows of BGW]

We had a chance to run mpi-io-test across 4 rows (16k nodes), but performance in that case was much worse -- note the y-axis. We were unable to conclusively determine a cause, though we think it might be related to network topology and placement of jobs.

[PVFS2 read performance on 4 rows of BGW] [PVFS2 write performance on 4 rows of BGW]

mpi-md-test

The mpi-md-test benchmark performs several collective MPI-IO metadata operations (create a file, open an existing file, resize an existing file). The lengthy time to service each of these operations reflects two facts: PVFS2 does not at this time provide any client-side caching, and the MPI-IO implementation on BlueGene could implement these routines more efficiently. While several seconds is a long time for a single operation of this kind, we don't expect most BlueGene applications to be terribly affected by this behavior.

[PVFS2 mpi-md-test performance; 2 rows of BGW]


Last update:

Wed Nov  2 17:23:53 CST 2005
