P2S2 2020 Live Session
Workshop Tentative Program (All times are EDT/GMT-4)
Argobots: A Lightweight Threading Framework for Massive Fine-Grained Parallelism
Over the past decade, exponential performance improvements in processors have been achieved by increasing the number of cores. To achieve high scalability on modern processors, applications must be decomposed and parallelized at a finer granularity so that sufficient computation is fed to each core. However, as the number of cores increases, parallelization overheads in the runtime system become significant and hinder scalability. Such overheads are often incurred by the heavyweight nature of threads provided by operating systems, yet the major widely used runtime systems still use OS-level threads as their parallel units. To address this issue, we are developing a low-level lightweight threading library called Argobots, which is hundreds of times faster than conventional threads. Argobots exposes fine-grained control over threading, scheduling, and synchronization, which promotes better interoperability with other programming models such as MPI and OpenMP and trims down threading overheads. Our evaluation demonstrates that Argobots enhances performance by exploiting inherent parallelism in several applications.
Bio:
Shintaro Iwasaki is a Postdoctoral Appointee at Argonne National Laboratory working with the Programming Models and Runtime Systems group led by Dr. Pavan Balaji. His research interests include parallel languages, compilers, runtime systems, and scheduling techniques for high-performance computing. He received his MS and Ph.D. degrees from the University of Tokyo. He is the recipient of the Best Paper Award at PACT 2019 and the Best Poster Award at HPDC 2016. For more information, please see https://www.mcs.anl.gov/~iwasaki/
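As a rough illustration of the fine-grained, user-level threading the abstract describes, the sketch below creates a handful of Argobots user-level threads (ULTs) on the primary execution stream's pool. This is a minimal sketch, not the speaker's code; it assumes Argobots is installed and the program is linked with -labt, and the thread count and message are illustrative.

```c
#include <abt.h>
#include <stdio.h>

/* Body of each user-level thread (ULT). */
static void hello(void *arg)
{
    int rank = (int)(size_t)arg;
    printf("hello from ULT %d\n", rank);
}

int main(int argc, char **argv)
{
    ABT_init(argc, argv);

    /* Get the pool associated with the primary execution stream;
     * ULTs pushed here are scheduled in user space, without OS threads. */
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Create four lightweight ULTs; creation and scheduling are
     * orders of magnitude cheaper than spawning OS-level threads. */
    ABT_thread threads[4];
    for (int i = 0; i < 4; i++)
        ABT_thread_create(pool, hello, (void *)(size_t)i,
                          ABT_THREAD_ATTR_NULL, &threads[i]);

    /* Join and free each ULT. */
    for (int i = 0; i < 4; i++)
        ABT_thread_free(&threads[i]);

    ABT_finalize();
    return 0;
}
```

Because ULTs are scheduled cooperatively in user space, creating and joining them avoids the kernel involvement that makes OS threads heavyweight, which is the overhead reduction the abstract refers to.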
Preparing Software Stack for the Next Generation Systems - An opportunity or a nightmare?
Architectures evolve as we speak. Does software evolve as rapidly as the hardware? We know the answer is "not quite". This talk will cover one or two case studies of scientific applications targeting today's platforms and aiming at tomorrow's. The talk will also cover a project that tests the software stack, especially on-node programming models, to get them ready for upcoming systems. When a hardware platform changes drastically, the toolset, including profilers, changes as well, and the scientific application needs to be combed over again with a new set of tools; this process can be time-consuming and error-prone. The discussion will include some preliminary results, struggles that led to potential solutions and opportunities, and lessons learned from code-migration nightmares.
Bio:
Sunita Chandrasekaran is an Assistant Professor with the Department of Computer and Information Sciences at the University of Delaware, USA. She received her Ph.D. in 2012 on Tools and Algorithms for High-Level Algorithm Mapping to FPGAs from the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Her research spans High Performance Computing, parallel programming, benchmarking and data science. Applications of interest include scientific domains such as plasma physics, biophysics, solar physics and bioinformatics. She is a recipient of the 2016 IEEE-CS TCHPC Award for Excellence for Early Career Researchers in High Performance Computing. Sunita has been involved in the program, steering and organization committees of several conferences and workshops including SC, ISC, IPDPS, IEEE Cluster, CCGrid, WACCPD, AsHES and P3MA.
"Rumor Has It: Optimizing the Belief Propagation Algorithm for Parallel Processing.", Michael Trotter, Timothy Wood and H. Howie Huang. [slides_full] [slides_pitch] [video]
"Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation.", Alexander Matz, Johannes Doerfert and Holger Fröning. [slides_full] [slides_pitch] [video]
"BSRNG: A High Throughput Parallel BitSliced Approach for Random Number Generators.", Saleh Khalaj Monfared, Omid Hajihassani, Mohammad Sina Kiarostami, Soroush Meghdadi Zanjani, Dara Rahmati and Saeid Gorgin. [slides_full] [slides_pitch] [video]
NVSHMEM: A Partitioned Global Address Space Library for NVIDIA GPU Clusters
GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to thousands of GPUs, efficiently moving data between the GPUs becomes imperative for maximum performance. Addressing the apparent Amdahl's fraction of synchronizing with the CPU for communication is critical for strong scaling of applications on GPU clusters. GPUs are designed to maximize throughput and have enough state and parallelism to hide long latencies to global memory. It is important to take advantage of these inherent capabilities of the GPU and the CUDA programming model when tackling communication between GPUs. NVSHMEM provides a Partitioned Global Address Space (PGAS) that spans memory across GPUs and provides an API for fine-grained GPU-GPU data movement and synchronization from within a CUDA kernel. NVSHMEM also provides a CPU-side API for GPU-GPU data movement that gives applications a migration path to NVSHMEM. CPU-side communication can be issued in stream order, similar to CUDA operations. We demonstrate the benefits of using NVSHMEM through several benchmarks and applications.
Bio:
Akhil Langer is a Senior Software Engineer at NVIDIA Corporation. He completed his Ph.D. in Computer Science at the University of Illinois at Urbana-Champaign in 2015. His research interests lie in high-performance computing runtimes and applications, parallel programming, scalability, and stochastic optimization. At present, he is the lead developer of NVSHMEM, a parallel programming library for NVIDIA GPU clusters.
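To make the abstract's notion of device-initiated communication concrete, the sketch below has each processing element (PE) write its rank into a symmetric buffer on its neighbor from inside a CUDA kernel, forming a ring. This is a minimal, hedged sketch rather than the speaker's code: it assumes NVSHMEM is installed, the program is compiled with nvcc and linked against NVSHMEM, and it is launched with one PE per GPU (e.g., via nvshmrun); the buffer name and ring pattern are illustrative.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cstdio>

// Device-initiated put: each PE writes its rank into the
// neighbor's copy of the symmetric buffer, without CPU involvement.
__global__ void ring_put(int *dst, int mype, int npes)
{
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);
}

int main()
{
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: the buffer exists at the same
    // offset in the PGAS on every PE.
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    ring_put<<<1, 1>>>(dst, mype, npes);

    // Stream-ordered barrier: completes the puts before the host reads.
    nvshmemx_barrier_all_on_stream(0);
    cudaDeviceSynchronize();

    int received;
    cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, received);

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```

The kernel-side nvshmem_int_p call is the fine-grained GPU-GPU data movement the abstract highlights, while nvshmemx_barrier_all_on_stream shows the stream-ordered CPU-side API that eases incremental adoption.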
Copyright (C): Pavan Balaji, Argonne National Laboratory