A Lightweight Low-level Threading/Tasking Framework
Concurrency in today’s most powerful HPC systems is exploding. As noted in the International Exascale Software Project (IESP) roadmap, future exascale systems are likely to be composed of hundreds of millions of arithmetic units, and future applications may contain billions of threads. Computational scientists are adapting their programming models and computational methods for emerging applications. To tackle massive parallelism, new programming models will decompose the computational problem into lightweight tasks or threads that can increase concurrency, hide latency, and improve resilience.
Argobots, which was developed as a part of the Argo project, is a lightweight runtime system that supports integrated computation and data movement with massive concurrency. It will directly leverage the lowest-level constructs in the hardware and OS: lightweight notification mechanisms, data movement engines, memory mapping, and data placement strategies. It consists of an execution model and a memory model.
Lightweight Threads and Tasklets
We envision two abstractions in the Argo runtime system. The first abstraction supports lightweight concurrent execution, such as lightweight threads or tasks, that can dynamically and efficiently adapt to the competing requirements of applications and hardware resources. Threads and tasks must be efficiently scheduled based on power, resilience, memory locality, and capabilities. The second abstraction supports low-latency, lightweight, message-driven activation.
Two levels of parallelism are supported:
1. Work Units, which are associated with function calls. Each work unit executes to completion.
2. Execution Streams, which are bound to hardware resources and guarantee progress.
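The relationship between the two levels can be illustrated with a minimal sketch. It is a conceptual model only, written with Python's standard `threading` and `queue` modules, and does not use the Argobots API. The names `ExecutionStream`, `run_work_units`, and `square` are hypothetical, and the assumption that streams share a single pool of work units is made purely for illustration.

```python
import queue
import threading


class ExecutionStream(threading.Thread):
    """Toy execution stream: one OS thread that drains a pool of work units."""

    def __init__(self, pool):
        super().__init__()
        self.pool = pool

    def run(self):
        while True:
            unit = self.pool.get()
            if unit is None:  # sentinel: no more work for this stream
                break
            func, arg = unit
            func(arg)  # the work unit returns before the next one is fetched


def run_work_units(units, num_streams=2):
    """Start the streams, feed them work units, then shut them down."""
    pool = queue.Queue()
    streams = [ExecutionStream(pool) for _ in range(num_streams)]
    for stream in streams:
        stream.start()
    for unit in units:
        pool.put(unit)
    for _ in streams:
        pool.put(None)  # one sentinel per stream, queued after all work
    for stream in streams:
        stream.join()


results = []
results_lock = threading.Lock()


def square(x):
    """Hypothetical work unit: a plain function call with one argument."""
    with results_lock:
        results.append(x * x)


run_work_units([(square, i) for i in range(8)], num_streams=2)
print(sorted(results))
```

Each work unit here is just a function plus its argument, and each stream keeps making progress as long as the pool holds work, which mirrors the division of roles described above.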
Localized scheduling strategies such as those used in current runtime systems, while efficient for short executions, are unaware of global strategies and priorities. Adaptability and dynamic system behavior must be handled by scheduling strategies that can change over time or be customized for a particular algorithm or data structure. “Plugging” in a specialized scheduling strategy lets the OS/R handle the mechanism and utilize lightweight notification features while leaving the policy to higher levels of the system-software stack.
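The split between mechanism and policy can be sketched as follows. This is an illustrative model rather than the Argobots scheduler interface: `run_stream` plays the role of the mechanism, while `fifo_policy` and `priority_policy` are hypothetical pluggable policies. The assumption that each work unit carries a numeric priority (lower means more urgent) is ours.

```python
def fifo_policy(pool):
    """Policy: always pick the oldest work unit (index 0)."""
    return 0


def priority_policy(pool):
    """Policy: pick the unit with the lowest priority number."""
    return min(range(len(pool)), key=lambda i: pool[i][0])


def run_stream(units, policy):
    """Mechanism: ask the policy which unit runs next, then run it to completion."""
    pool = list(units)  # copy so the caller's list is left untouched
    order = []
    while pool:
        priority, name, func = pool.pop(policy(pool))
        func()
        order.append(name)
    return order


# Each unit is (priority, name, function); the functions do nothing here.
units = [
    (2, "io", lambda: None),
    (0, "urgent", lambda: None),
    (1, "compute", lambda: None),
]

fifo_order = run_stream(units, fifo_policy)
prio_order = run_stream(units, priority_policy)
print(fifo_order)
print(prio_order)
```

The loop in `run_stream` never changes when the policy is swapped, which is the property that lets a higher layer supply a strategy tailored to a particular algorithm or data structure.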
Current communication libraries are not well integrated with the scheduler or the memory manager. All three must work together in order to support optimized data movement across nodes and provide the features needed by the higher layers that build the specific programming environment. Argobots provides an integrated memory model that supports eventual consistency.
Consistency Domains (CD)
Current hardware couples consistency and coherence, but this can be expensive on NUMA machines or with non-DRAM-like memories. Argobots decouples consistency and coherence by explicitly dividing the memory space into different domains. There are three levels of consistency in Argobots: 1. Consistency Domains; 2. Non-coherent Load/Store Domains (NCLSD); 3. Outside an NCLSD.
A Consistency Domain is a region of memory in which data becomes eventually consistent: if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value. Immediate consistency can still be enforced with memory barriers. In an NCLSD, data are accessed using load/store, but no hardware consistency is provided. Outside an NCLSD, explicit Put/Get/Messaging models are used to move data.
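The difference between eventual consistency and a barrier-enforced view can be shown with a toy, single-threaded model. It is a sketch under strong simplifying assumptions and not the Argobots memory model itself: the class `ConsistencyDomain` and its methods `store`, `load`, `propagate_one`, and `barrier` are hypothetical names, and propagation is modeled as an explicit step rather than something the hardware or runtime does in the background.

```python
class ConsistencyDomain:
    """Toy model: writes are queued and become visible to readers later."""

    def __init__(self):
        self._visible = {}  # values that loads can currently observe
        self._pending = []  # writes issued but not yet propagated

    def store(self, key, value):
        """Issue a write; it is not visible to loads yet."""
        self._pending.append((key, value))

    def load(self, key):
        """Read the currently visible value (None if nothing has arrived)."""
        return self._visible.get(key)

    def propagate_one(self):
        """Apply the oldest pending write, standing in for background propagation."""
        if self._pending:
            key, value = self._pending.pop(0)
            self._visible[key] = value

    def barrier(self):
        """Make every pending write visible before returning."""
        while self._pending:
            self.propagate_one()


cd = ConsistencyDomain()
cd.store("x", 1)
cd.store("x", 2)

stale = cd.load("x")    # nothing has propagated yet
cd.propagate_one()
partial = cd.load("x")  # only the first write has arrived
cd.barrier()
final = cd.load("x")    # all writes are visible after the barrier

print(stale, partial, final)
```

Without the barrier, a reader may see an old value for a while. Once updates stop and propagation finishes, every load returns the last written value, which is the eventual-consistency guarantee described above.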
Argobots has been expanding its ecosystem both inside and outside the Argo project. Various programming models are integrating Argobots into their runtime so that their applications can take advantage of Argobots without modifying the code.
- Argo internal connections
  - MPI interoperation with Argobots
  - OpenMP over Argobots (see BOLT)
  - Charm++ over Argobots
  - CilkBots over Argobots
  - TASCEL over Argobots
  - PaRSEC over Argobots
- External connections
  - OpenMP/XcalableMP (RIKEN)
  - OmpSs (BSC)
  - Mercury (Argonne)