Designing a new operating system for exascale architectures

August 7, 2013

Argonne National Laboratory has been awarded a grant from the Department of Energy Office of Science to lead Argo, a multi-institutional research project to design and develop a platform-neutral prototype of an exascale operating system and runtime software.

Researchers in Argonne’s Mathematics and Computer Science Division will collaborate with scientists from Pacific Northwest National Laboratory and Lawrence Livermore National Laboratory, as well as with several universities nationwide, on the three-year, $9.75 million project.

The world’s fastest computers have proved essential for scientific discoveries ranging from the design of new materials to the understanding of the cosmos. The emerging development of exascale computers offers the potential to address even more complex problems, but such computers present significant challenges. The enormous increase in parallelism will require extremely efficient synchronization mechanisms. Moreover, handling the demands posed by 3D stacked memory, heterogeneous CPUs, and embedded network controllers will require dynamic programming environments. Today’s operating system and runtime software cannot be incrementally extended to exploit the complexity and scale of these emerging exascale platforms. A radical new approach is required.

In Greek mythology, Argo (which means “swift”) was the ship used by Jason in his quest for the Golden Fleece. “We chose the project name Argo because it exemplifies several characteristics of our approach,” said Pete Beckman, director of the Exascale Technology and Computing Institute and chief architect of the Argo project. “Our system must be swift and seaworthy. It requires an able crew, with expertise in areas such as global optimization, power management, code integration, lightweight threads, and interconnection fabrics. And as Argonauts we must be ready to face risk in our quest to develop a modular architecture that supports extreme-scale scientific computation.”

At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload, support for massive concurrency, a hierarchical framework for power and fault management, and a “beacon” mechanism that allows resource managers and optimizers to communicate and control the platform. These innovations will result in an open-source prototype system that runs on several architectures. It is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.

The design is based on a hierarchical approach. A global view enables Argo to control resources such as power or interconnect bandwidth across the entire system, respond to system faults, or tune application performance. A local view is essential for scalability, enabling compute nodes to manage and optimize massive intranode thread and task parallelism and adapt to new memory technologies.

“Bringing together these multiple views and the corresponding software components through a whole-system approach distinguishes our strategy from existing designs,” said Beckman. “We believe it is essential for addressing the key exascale challenges of power, parallelism, memory hierarchy, and resilience.”

To achieve such a whole-system perspective, the Argo team members introduced the idea of “enclaves,” sets of resources dedicated to a particular service and capable of introspection and autonomic response. Enclaves will be able to change the system configuration of nodes, adjust the allocation of power to different nodes, or migrate data or computations from one node to another. The enclaves will be used to demonstrate the support of different levels of fault tolerance – a key concern of exascale systems – with some enclaves handling node failures by means of global restart and other enclaves supporting finer-level recovery.
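To make the enclave idea concrete, the following is a minimal Python sketch of how an enclave might inspect its own nodes and redistribute a power budget among them. It is an illustration only, not Argo code: the names `Node`, `Enclave`, `introspect`, and `rebalance_power`, the load metric, and the proportional-share policy are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A compute node as seen by its enclave (hypothetical model)."""
    name: str
    load: float                # observed utilization in [0, 1], an assumed metric
    power_watts: float = 0.0   # power currently allocated to this node


@dataclass
class Enclave:
    """A set of nodes dedicated to one service, with its own power budget."""
    name: str
    power_budget_watts: float
    nodes: List[Node] = field(default_factory=list)

    def introspect(self) -> float:
        """Return the total observed load across the enclave's nodes."""
        return sum(n.load for n in self.nodes)

    def rebalance_power(self, floor_watts: float = 50.0) -> None:
        """Give every node a fixed floor, then split the rest in proportion to load."""
        if not self.nodes:
            return
        reserved = floor_watts * len(self.nodes)
        if reserved > self.power_budget_watts:
            raise ValueError("budget too small for the per-node floor")
        spare = self.power_budget_watts - reserved
        total_load = self.introspect()
        for n in self.nodes:
            # With no measurable load, fall back to an even split of the spare power.
            share = n.load / total_load if total_load > 0 else 1 / len(self.nodes)
            n.power_watts = floor_watts + spare * share


if __name__ == "__main__":
    enclave = Enclave("simulation", 300.0, [Node("a", 0.75), Node("b", 0.25)])
    enclave.rebalance_power()
    for node in enclave.nodes:
        print(node.name, node.power_watts)
```

In this sketch the autonomic response is a single policy applied within one enclave. A global layer of the kind the project describes would sit above it and adjust each enclave's `power_budget_watts` before the enclaves rebalance locally.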

Testing is essential to move successfully from innovative concept to full-scale deployment. The researchers will test the new mechanisms both as individual components and as an integrated system.

“We will use available simulators and early-access platforms with many cores, to help us explore designs,” said Beckman. “Particularly challenging is the need to be able to shift research directions quickly in response to changes in technology and architecture roadmaps.”

The modern-day Argonauts will use U.S. Department of Energy science applications to evaluate the correctness, scalability, resilience, and completeness of Argo on both homogeneous and heterogeneous computer architectures.

“The Argo project is ambitious, but we are confident that we can succeed in our quest. We fully intend to bring to the scientific community our ‘Golden Fleece’ – version 1.0 of the Argo exascale operating system and runtime system – at the end of three years,” said Beckman.