|Title||Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000 |
|Publication Type||Journal Article |
|Year of Publication||2015 |
|Authors||Schanen, M, Marin, O, Zhang, H, Anitescu, M |
|Journal||Procedia Computer Science |
|Date Published||06/2016 |
|Other Numbers||ANL/MCS-P5422-1015 |
|Abstract||Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discretizations targeted at large-scale numerical simulations. The checkpointing scheme combines bandwidth-limited disk checkpointing and binomial memory checkpointing. Based on assumptions about the target petascale systems, which we later demonstrate to be realistic on the IBM Blue Gene/Q system Mira, we create a model of the expected performance of our checkpointing approach and validate it using the highly scalable Navier-Stokes spectral-element solver Nek5000 on small to moderate subsystems of the Mira supercomputer. In turn, this allows us to predict optimal algorithmic choices when using all of Mira. We also demonstrate that two-level checkpointing is significantly superior to single-level checkpointing when adjoining a large number of time integration steps. To our knowledge, this is the first time two-level checkpointing had been designed, implemented, tuned, and demonstrated on fluid dynamics codes at large scale of 50k+ cores.