IISWC 2014 tutorial

August 24th, 2014

Darshan – I/O Workload Characterization in MPI Applications

Half-day tutorial to be presented at IISWC 2014

Abstract

I/O performance is an increasingly important factor in the productivity of large-scale HPC systems. The workload diversity of such systems presents a challenge for I/O performance engineering, however. Applications vary in terms of data volume, I/O strategy, and access method, making it difficult to consistently evaluate and enhance their I/O performance.  In addition, I/O performance is highly sensitive to subtle changes in applications and system libraries.

In this tutorial we present an overview of common HPC I/O challenges and the range of tools that are available to help diagnose them.  The concepts will be illustrated using case studies as well as hands-on exercises. We focus particularly on Darshan, a scalable I/O characterization tool that provides an end-to-end system for understanding and interpreting the I/O behavior of high performance computing applications.  Darshan’s lightweight, transparent design enables it to be used in production with negligible impact on behavior. Attendees will learn how to use Darshan to find I/O hotspots in MPI programs as well as basic techniques to optimize their codes’ I/O behavior.

Overview

I/O performance is an increasingly important factor in the productivity of large-scale HPC systems. The workload diversity of such systems presents a challenge for I/O performance engineering, however. Applications vary in terms of data volume, I/O strategy, and access method, making it difficult to consistently evaluate and enhance their I/O performance.  In addition, I/O performance is highly sensitive to subtle changes in applications and system libraries.

In this tutorial we present an overview of common HPC I/O challenges and the range of tools that are available to help diagnose them.  The concepts will be illustrated using case studies as well as hands-on exercises. We focus particularly on Darshan, a scalable I/O characterization tool that provides an end-to-end system for understanding and interpreting the I/O behavior of high performance computing applications.  Darshan’s lightweight, transparent design enables it to be used in production with negligible impact on behavior. Darshan has been deployed on supercomputers at multiple computing facilities.

Target Audience and Prerequisites

This tutorial contains 70% introductory-level and 30% intermediate-level material. Attendees will benefit most if they have an entry-level understanding of MPI programming and parallel I/O libraries. For the hands-on sessions to be successful, attendees are expected to have basic familiarity with the Linux environment and with submitting batch jobs.

Goals: What the Audience is Expected to Learn

Attendees are expected to learn the basics of parallel I/O and I/O performance characterization, how to compile and run Darshan-enabled MPI applications, and how to interpret Darshan results. They are also expected to learn about common I/O performance problems in large MPI jobs. The ultimate goal of this tutorial is to raise awareness of I/O performance and to help attendees write I/O-efficient MPI programs.
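
To make the first of these goals concrete, the sketch below shows the kind of MPI-IO program attendees will instrument. It is a minimal, hypothetical example (the file name, block size, and build details are illustrative assumptions, not tutorial material); on systems where Darshan is integrated into the MPI compiler wrappers, a program like this requires no source changes to produce a Darshan log when the job completes.

    /* mpiio_hello.c - a minimal MPI-IO example of the kind of application
     * whose I/O behavior Darshan characterizes. File name and block size
     * are illustrative only. */
    #include <mpi.h>
    #include <string.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MiB per rank, illustrative */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        static char buf[BLOCK_SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, (char)('A' + rank % 26), BLOCK_SIZE);

        /* Each rank writes one contiguous 1 MiB block to a shared file
         * using a collective write; Darshan records the count, size, and
         * timing of these MPI-IO operations on a per-file basis. */
        MPI_File_open(MPI_COMM_WORLD, "testfile.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK_SIZE,
                              buf, BLOCK_SIZE, MPI_CHAR, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }

A typical workflow, subject to site configuration, is to build with the MPI compiler wrapper (e.g. mpicc), run the executable as a normal batch job, and then summarize the resulting log with the darshan-parser or darshan-job-summary.pl utilities that ship with Darshan.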

Tutorial Outline (3 Hours In Total)

  • Basics of Parallel I/O (20′)
  • Basics of I/O Performance Characterization (20′)
  • Introduction to Darshan (30′)
  • Typical I/O Bloopers (20′)
  • Break / Account Setup (20′)
  • Hands-on Exercises (70′)

Details of Hands-on Sessions

Participants will have the opportunity to experiment with Darshan in a hands-on session. The hands-on exercise content has been well tested in a production environment, and we will customize it to reflect the interests and expertise of the audience.

Required Equipment

The attendees are required to have a laptop with a working web browser and SSH client. For the purpose of this tutorial, all exercises can be performed via an SSH terminal. We will require robust wireless (or wired) internet access provided by IISWC.

Backend Resources

Thanks to NERSC, this tutorial will use the Edison supercomputer (http://www.nersc.gov/users/computational-systems/edison/) for hands-on exercises. The presenters will hand out login credentials during the tutorial. Attendees will run interactive batch jobs for the exercises. A number of compute nodes will be reserved for the duration of the tutorial to minimize queue wait times. NERSC has provided backend resources for numerous tutorials at a variety of conferences, and we expect the backend to be reliable and to scale to hundreds of users.

Proposed Exercises

  • Basic instructions (logging in, job submission, etc.)
  • Compiling and linking MPI programs with Darshan
  • Running MPI programs with Darshan
  • Interpreting and visualizing Darshan logs
  • Running samples of “bad” I/O applications and examining hotspots (a representative pattern is sketched below)
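
To give a flavor of the final exercise, the sketch below shows a deliberately poor access pattern of the sort the “bad” sample applications exhibit. The actual exercise codes will be distributed during the tutorial; the file name, record size, and operation count here are illustrative assumptions only.

    /* small_writes.c - a deliberately poor I/O pattern: every rank issues
     * many tiny, independent, rank-interleaved writes to a shared file.
     * Names and sizes are illustrative assumptions. */
    #include <mpi.h>
    #include <string.h>

    #define RECORD_SIZE 128        /* tiny record, illustrative */
    #define NUM_RECORDS 4096       /* many operations per rank */

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        MPI_File fh;
        char rec[RECORD_SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        memset(rec, (char)('a' + rank % 26), RECORD_SIZE);

        MPI_File_open(MPI_COMM_WORLD, "interleaved.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank-interleaved offsets: consecutive records from different ranks
         * land next to each other, so no rank writes a contiguous region and
         * every operation is far smaller than the file system block size. */
        for (i = 0; i < NUM_RECORDS; i++) {
            MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * RECORD_SIZE;
            MPI_File_write_at(fh, off, rec, RECORD_SIZE, MPI_CHAR,
                              MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Because every rank issues thousands of independent 128-byte writes at interleaved offsets, the Darshan log for a run like this is dominated by small access sizes, making the hotspot easy to spot in the access-size counters of the job summary.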

Presenter: Yushu Yao

Affiliation: Lawrence Berkeley National Laboratory

Email: yyao@lbl.gov

Bio: Yushu Yao is a high performance computing consultant at NERSC. Yushu received his PhD in experimental particle physics in 2008. In graduate school he built the luminosity monitor for the ATLAS experiment at the Large Hadron Collider at CERN in Switzerland, developing the software, hardware, and simulation for the detector. After graduation he joined Lawrence Berkeley National Laboratory to build and support the data analytics framework for the ATLAS collaboration, enabling thousands of scientists to efficiently mine the petabytes of data generated by the detector. The ATLAS experiment’s discovery of the Higgs boson led to the Nobel Prize in Physics in 2013.

Yushu’s primary focus at NERSC is to develop and deploy technologies as production services that enable data-intensive science; these services include Hadoop, Spark, and SciDB. Yushu also deployed Darshan on the Hopper and Edison systems at NERSC to automatically collect workload statistics.

Presenter: Phil Carns

Affiliation: Argonne National Laboratory

Email: carns@mcs.anl.gov

Bio: Philip Carns is a software development specialist in the Mathematics and Computer Science Division of Argonne National Laboratory. He received his Ph.D. in computer engineering from Clemson University in 2005, spent three years developing storage technology at Acxiom Corporation, one of the world’s largest business intelligence companies, and then joined Argonne where he has worked in a research and development role since 2008. Philip’s research interests include measurement and observation, simulation, and implementation of large scale storage systems. He has also acted as a primary developer on a variety of high performance storage projects, including the Darshan I/O characterization tool, the PVFS parallel file system, the BMI network abstraction layer, the CODES storage simulation environment, and the Triton distributed object storage system.
