Optimization of Multi-level Checkpoint Model with Uncertain Execution Scales

Title: Optimization of Multi-level Checkpoint Model with Uncertain Execution Scales
Publication Type: Report
Year of Publication: 2014
Authors: Di, S., Gomez, L. A. Bautista, Cappello, F.
Other Numbers: ANL/MCS-P5160-0714

Future extreme-scale systems may suffer different types of failures at different scales, from transient uncorrectable memory errors in individual processes to massive system outages. In this paper, we propose a multi-level checkpoint model that takes into account uncertain execution scales (different numbers of processes/cores). The contribution is three-fold. (1) We provide an in-depth analysis of why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously. (2) We devise a novel method that quickly obtains an optimized solution; this is the first successful attempt at the multi-level checkpoint model with uncertain scales. (3) We perform both large-scale real experiments and extreme-scale numerical simulations to validate the effectiveness of our design. Experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3-88% in wall-clock time.