BGW represented our first chance to deploy PVFS2 on a system with thousands of clients. We set out to measure how well PVFS2 would scale and perform on such a large system. While we planned to run quite a few workloads, time constraints forced us to focus on two: mpi-io-test and mpi-md-test.
The mpi-io-test benchmark measures large, contiguous I/O. It represents a highly favorable workload for most parallel file systems. Each client writes a fixed amount of data, so the more clients running the benchmark, the larger the generated file. As a result, this benchmark frequently uncovers saturation points in either the storage system or the networking infrastructure. We were hoping to find the peak I/O rate the 33 PVFS2 servers could deliver.
The mpi-md-test benchmark is an aggressive test of several collective MPI-IO metadata operations. Because the operations are collective, some implementations can carry out optimizations which can greatly improve scalability. These optimizations are not yet in wide use, however, so we would expect to see poor scaling on BGW. The benchmark thus gives us an initial point of comparison for future improvements.
We were able to commandeer the BGW storage nodes for these experiments: 33 P4 storage nodes with 2.5 GB RAM and 650 GB of RAID-0 storage.
To get some idea of the performance of the disk subsystem, here is a run from bonnie++.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pvfs1            5G 37190  98 70944  35 31162  11 39441  83 98181  12 455.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2570  98 +++++ +++ +++++ +++  2703  98 +++++ +++  6587 100
pvfs1,5G,37190,98,70944,35,31162,11,39441,83,98181,12,455.6,0,16,2570,98,+++++,+++,+++++,+++,2703,98,+++++,+++,6587,100
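The comma-separated summary line is easier to work with than the column layout. The following is a minimal sketch of how it could be parsed. The field names are our own labels taken from the column headers, not names defined by bonnie++, and we assume that a run of "+" characters marks a test that finished too quickly for bonnie++ to report a meaningful figure.

```python
# Hypothetical labels for the 27 fields of the bonnie++ 1.03 CSV line,
# in the order they appear in the output above.
FIELDS = [
    "machine", "size",
    "out_char_kps", "out_char_cpu", "out_block_kps", "out_block_cpu",
    "rewrite_kps", "rewrite_cpu",
    "in_char_kps", "in_char_cpu", "in_block_kps", "in_block_cpu",
    "seeks_ps", "seeks_cpu",
    "files",
    "seq_create_ps", "seq_create_cpu", "seq_read_ps", "seq_read_cpu",
    "seq_delete_ps", "seq_delete_cpu",
    "rand_create_ps", "rand_create_cpu", "rand_read_ps", "rand_read_cpu",
    "rand_delete_ps", "rand_delete_cpu",
]


def parse_bonnie_csv(line):
    """Turn one bonnie++ CSV summary line into a dict keyed by FIELDS."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            record[name] = raw            # keep labels such as "5G" as text
        elif set(raw) == {"+"}:
            record[name] = None           # assumed: too fast to measure
        else:
            record[name] = float(raw)
    return record


LINE = ("pvfs1,5G,37190,98,70944,35,31162,11,39441,83,98181,12,455.6,0,"
        "16,2570,98,+++++,+++,+++++,+++,2703,98,+++++,+++,6587,100")

result = parse_bonnie_csv(LINE)
# Block throughput in MB/s, assuming bonnie++'s "K" means 1024 bytes.
print(f"block write: {result['out_block_kps'] / 1024:.1f} MB/s")
print(f"block read:  {result['in_block_kps'] / 1024:.1f} MB/s")
```

Under those assumptions, the local disk on this node delivers roughly 69 MB/s for block writes and roughly 96 MB/s for block reads.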
The storage, login, and IO nodes are all running PVFS2 1.3.0, and the BlueGene driver is "Driver 253".
mpi-io-test is a simple MPI-IO contiguous access benchmark. Each process writes a large chunk of data to a non-overlapping, non-interleaved region of a file and then reads it back. It reports the aggregate IO performance of all processes involved in the job. We would expect this benchmark to give an upper bound on IO performance. In these benchmarks each client (process) is reading and writing a contiguous 8 MB buffer.
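The access pattern described above can be sketched with a little arithmetic. This is only an illustration, not the benchmark's source: the function names are hypothetical, and we assume a single 8 MB write per process, with each process's region placed directly after the previous rank's.

```python
MIB = 1024 * 1024
BUFFER_BYTES = 8 * MIB   # the 8 MB contiguous buffer used in these runs


def region(rank, buffer_bytes=BUFFER_BYTES):
    """Byte range [start, end) that a given rank owns in the shared file."""
    start = rank * buffer_bytes
    return start, start + buffer_bytes


def file_size(nprocs, buffer_bytes=BUFFER_BYTES):
    """Total file size: one buffer per process, so it grows with the job."""
    return nprocs * buffer_bytes


def aggregate_mib_per_sec(nprocs, elapsed_sec, buffer_bytes=BUFFER_BYTES):
    """Aggregate rate: all bytes moved by the job over the elapsed time."""
    return file_size(nprocs, buffer_bytes) / MIB / elapsed_sec


# Regions of neighboring ranks touch but never overlap.
print(region(0), region(1))
# Assuming "8k" means 8192 processes, one per node.
print(file_size(8192) // (1024 * MIB), "GiB")
```

Because every process contributes one fixed-size buffer, doubling the number of clients doubles the file, which is why larger runs push harder on the servers and the network.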
When we ran mpi-io-test across 2 rows (8k nodes), we saw good scaling. Our peak read performance was 2.8 GBytes/sec and our peak write performance was 400 MBytes/sec. We expect writes to perform worse than reads: write operations reflect the performance of the disk subsystem, while read operations benefit from caching effects on the servers.
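To put the peak rates in per-server terms, they can be divided across the 33 servers. This is a rough sketch that assumes the load is spread evenly over the servers and that 1 GByte is 1000 MBytes; neither assumption was verified in these runs.

```python
SERVERS = 33   # number of PVFS2 servers in these experiments


def per_server_mb(aggregate_mb_per_sec, servers=SERVERS):
    """Average rate per server, assuming an even spread of the load."""
    return aggregate_mb_per_sec / servers


read_per_server = per_server_mb(2.8 * 1000)   # 2.8 GBytes/sec peak read
write_per_server = per_server_mb(400)         # 400 MBytes/sec peak write

print(f"read:  {read_per_server:.1f} MB/s per server")
print(f"write: {write_per_server:.1f} MB/s per server")
```

This works out to roughly 85 MB/s per server for reads and roughly 12 MB/s per server for writes.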
We had a chance to run mpi-io-test across 4 rows (16k nodes), but performance in that case was much worse -- note the y-axis. We were unable to conclusively determine a cause, though we think it might be related to network topology and placement of jobs.
The mpi-md-test benchmark performs several collective MPI-IO metadata operations (create a file, open an existing file, resize an existing file). The lengthy time to service each of these operations reflects two facts: PVFS2 does not at this time provide any client-side caching, and the MPI-IO implementation on BlueGene could implement these routines more efficiently. While several seconds is a long time for a single IO operation, we don't expect most BlueGene applications to be terribly affected by this behavior.
Last update:
Wed Nov 2 17:23:53 CST 2005