Argonne National Laboratory

Extending the Binomial Checkpointing Technique for Resilience

Title: Extending the Binomial Checkpointing Technique for Resilience
Publication Type: Report
Year of Publication: 2017
Authors: Walther, A, Narayanan, SHK

In terms of computing time, adjoint methods offer an attractive alternative for computing gradient information, as required, for example, for optimization. This very favorable temporal complexity, however, comes with a memory requirement that is in essence proportional to the operation count of the underlying function, for example, when algorithmic differentiation is used to provide the adjoints. For this reason, checkpointing approaches in many variants have become popular. This paper analyzes an extension of the so-called binomial approach that also covers possible failures of the computing system. Such a precaution is of special interest for massively parallel simulations and adjoint calculations in which the mean time between failures of the large-scale computing system is smaller than the time needed to complete the calculation of the adjoint information. We describe the extensions of standard checkpointing approaches required for such resilience, provide a corresponding implementation, and discuss numerical results.
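
To make the memory and recomputation trade-off behind the binomial approach concrete, the following is a minimal sketch in Python. It relies on the classical result for binomial checkpointing (not on anything specific to this report): with s checkpoints and at most r repeated forward evaluations of any time step, up to C(s + r, s) time steps can be reversed. The function names `max_steps` and `min_repetitions` are hypothetical, chosen for illustration only, and the sketch does not model the failure handling that the report adds.

```python
from math import comb


def max_steps(checkpoints: int, repetitions: int) -> int:
    """Classical binomial bound: largest number of time steps that can be
    reversed with `checkpoints` stored states when no step is evaluated
    forward more than `repetitions` times."""
    return comb(checkpoints + repetitions, checkpoints)


def min_repetitions(steps: int, checkpoints: int) -> int:
    """Smallest repetition count r such that max_steps(checkpoints, r)
    covers `steps` time steps (hypothetical helper, simple linear search)."""
    if steps < 1 or checkpoints < 1:
        raise ValueError("steps and checkpoints must be positive")
    r = 0
    while max_steps(checkpoints, r) < steps:
        r += 1
    return r


if __name__ == "__main__":
    # Illustrative values only: how the repetition count changes as more
    # checkpoints become available for a fixed number of time steps.
    for s in (5, 10, 20):
        print(f"checkpoints={s}, repetitions needed={min_repetitions(1000, s)}")
```

Because the bound grows binomially in both arguments, a modest number of checkpoints already keeps the recomputation factor small for long time integrations, which is why this family of schedules is attractive when storing every intermediate state is infeasible.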