Argonne National Laboratory
Deep Learning Software Stacks and Requirements for MPI
Abstract: In this talk I will discuss the emerging requirements for message passing from deep learning software environments. There are about a dozen deep learning frameworks in wide use (e.g., TensorFlow, CNTK, MXNet, Caffe); however, only a small fraction natively use MPI to support distributed-memory parallelism. Deep learning offers opportunities for concurrency at multiple levels of granularity. At the highest level, most serious applications need to search the hyperparameter space to find good model structures, learning parameters, and training settings. These search spaces can be quite large and are generally explored by hundreds or thousands of nearly independent jobs coordinated by a high-level supervisor at the job-launch level. MPI is typically not used in this dynamic, job-oriented workflow. The next level of concurrency is “data parallelism”, where many instances of the model are trained in parallel, each on a subset of the training data, with every member of the ensemble updated in parallel from gradient information gathered at a central “parameter” server. Data parallelism can take advantage of optimized reductions and reduced-precision broadcasts that, in the limit, use only 1 bit per gradient. Data parallelism requires large training batch sizes, which limits performance opportunities when training datasets are limited in size. Finally, fine-grained parallel implementations of deep learning models arise when a single neural network model is partitioned across multiple processor domains; this is called “model parallelism”. The partitioning scheme can be block oriented, layer oriented, or some combination of the two. Since the amount of communication required for model parallelism can be extremely high, it is typically done by multi-threading in a single coherency domain; however, it can also be implemented via distributed memory, provided there is enough bandwidth. In this case nearest-neighbor bandwidth and latency tend to dominate the scalability.
Most of the deep learning frameworks have been developed by groups outside of the HPC community, so it is relatively common for the support for data parallelism to use an IP RPC-oriented protocol (e.g., gRPC) rather than an HPC performance-oriented protocol such as MPI. Some implementations have also been developed on top of Spark. There are some exceptions. The Microsoft Cognitive Toolkit (aka CNTK) uses MPI as its communication layer, as do some experimental implementations of Caffe, Theano, LBANN, and others. I’ll review the high-level issues with MPI as we understand them, as well as some emerging opportunities, and offer some modest proposals to the MPI community.
Bio: Rick Stevens has been a professor at the University of Chicago since 1999, and an Associate Laboratory Director at Argonne National Laboratory since 2004. He is internationally known for work in high-performance computing, collaboration and visualization technology, and for building computational tools and web infrastructures to support large-scale genome and metagenome analysis for basic science and infectious disease research.
University of Delaware
Building the Next Generation of MapReduce Programming Models over MPI to Fill the Gaps between Data Analytics and Supercomputers
Abstract: MapReduce (MR) has become the dominant programming model for analyzing large volumes of data. Most implementations of the MR model were initially designed for Cloud and commodity clusters. When moved to supercomputers, these implementations often cannot cope with the absence of features that are common in Cloud but missing in HPC (e.g., on-node persistent storage), and they fail to leverage features that are common in HPC but missing in Cloud (e.g., fast interconnects).
In this talk I will discuss how the model can be redesigned to incorporate optimization techniques that tackle some of the supercomputing challenges listed above. I will present Mimir, an MR implementation over MPI that bridges the gap between data analytics and supercomputing by enabling the in-memory analysis of significantly larger datasets on HPC platforms without sacrificing the ease of use that MR provides. I will show how fundamental operations in DNA research and genome analytics, such as K-mer counting, can be executed on unprecedentedly large genomics datasets (up to 24 TB) and how they can achieve large performance gains on supercomputers when executed on top of Mimir.
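For readers unfamiliar with K-mer counting as a MapReduce workload, the following is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Mimir's API: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the `num_partitions` parameter are hypothetical, and the in-memory partitions only stand in for what would be MPI ranks in a distributed implementation.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map phase: emit (k-mer, 1) for every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs, num_partitions):
    """Shuffle phase: group values by key and route each key to one partition.

    Assumption: each partition plays the role of one process; a real
    MR-over-MPI implementation would exchange these pairs between ranks.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def reduce_counts(partition):
    """Reduce phase: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in partition.items()}


def count_kmers(reads, k, num_partitions=4):
    """Run map, shuffle, and reduce over all reads and merge the results."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    totals = {}
    for partition in shuffle(pairs, num_partitions):
        # Keys are disjoint across partitions, so a plain update is safe.
        totals.update(reduce_counts(partition))
    return totals


reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, k=3)
```

Because every occurrence of a given K-mer is routed to the same partition, each reducer can finish its keys independently; the volume of intermediate pairs produced by the map phase is what makes memory use a concern at scale.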
Sandia National Laboratories
What Will Impact the Future Success of MPI?
Abstract: Over the last 25 years, MPI has become the dominant programming system for high-performance computing (HPC). MPI has enabled complex science and engineering applications on a variety of platforms, from laptops to extreme-scale systems. An important key to the success of MPI has been the continued evolution of the specification, which has expanded to include new functionality in response to changes in HPC architectures and applications. Perhaps more importantly, MPI implementations have been able to deliver scalability and performance for a broad range of hardware. However, applications and systems are becoming increasingly complex, more so than at any other time in the history of MPI. This talk will examine some current and future challenges that may impact the ability of MPI to continue to dominate the parallel computing landscape over the next 25 years.
Bio: Ron Brightwell currently leads the Scalable System Software Department at Sandia National Laboratories. After joining Sandia in 1995, he was a key contributor to the high-performance interconnect software and lightweight operating system for the world’s first terascale system, the Intel ASCI Red machine. He was also part of the team responsible for the high-performance interconnect and lightweight operating system for the Cray Red Storm machine, which was the prototype for Cray’s successful XT product line. The impact of his interconnect research is visible in technologies available today from Atos (Bull), Intel, and Mellanox. He has also contributed to the development of the MPI-2 and MPI-3 specifications. He has authored more than 100 peer-reviewed journal, conference, and workshop publications. He is an Associate Editor for the IEEE Transactions on Parallel and Distributed Systems, has served on the technical program and organizing committees for numerous high-performance and parallel computing conferences, and is a Senior Member of the IEEE and the ACM.