|Title||Enabling Parallel Simulation of Large-Scale HPC Network Systems |
|Publication Type||Journal Article |
|Year of Publication||2016 |
|Authors||Mubarak, M, Carothers, CD, Ross, RB, Carns, PH |
|Journal||IEEE Transactions on Parallel and Distributed Systems (TPDS) |
|Date Published||02/2016 |
|Other Numbers||ANL/MCS-P5580-0316 |
|Abstract||With the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems—in particular, networks. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at a flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.