Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints
Speaker: Michel Schanen (ANL)
Date: Thursday, 3 December 2015, 10:30-12:00
Session: Resilience I
Talk type: Short talk (15 min)
Abstract: Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. Essential to their performance is an efficient checkpointing scheme for recovering the primal values during the adjoint run, which involves a trade-off between memory requirements and recomputation. We have implemented an asynchronous two-level adjoint checkpointing scheme for multistep numerical time discretizations, targeted at large-scale numerical simulations. The scheme combines bandwidth-limited disk checkpointing with binomial memory checkpointing. Based on assumptions about the target petascale system, 50k+ cores of the IBM Blue Gene/Q system Mira, we validate our checkpointing approach and our performance model using the spectral element solver Nek5000. To our knowledge, this is the first time two-level checkpointing has been designed, implemented, tuned, and demonstrated on fluid dynamics codes at a scale of 50k+ cores. Based on this experience, we want to discuss the challenges of gathering adjoint information on future exascale systems.
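
Illustration: the two-level idea in the abstract can be sketched on a toy problem. The Python snippet below is not the implementation presented in the talk; it is a minimal sketch under stated assumptions. The scalar model `step`, the function names, and the `stride` parameter are hypothetical. A dictionary stands in for disk storage, writes are synchronous rather than asynchronous, and the inner (memory) level simply stores every state of a segment instead of following a binomial schedule.

```python
def step(x, dt=0.01):
    """One explicit Euler step of dx/dt = -x**2 (hypothetical toy primal model)."""
    return x + dt * (-x * x)


def step_derivative(x, dt=0.01):
    """Derivative of step() with respect to x, used by the adjoint sweep."""
    return 1.0 - 2.0 * dt * x


def adjoint_two_level(x0, n_steps, stride):
    """Return (x_N, d x_N / d x_0) using two storage levels.

    Outer level: a dict standing in for disk, written every `stride` steps.
    Inner level: all states of one segment held in memory (a simplification;
    a binomial schedule would store fewer states and recompute more).
    """
    disk = {}
    x = x0
    # Forward sweep: write an outer checkpoint at the start of each segment.
    for k in range(n_steps):
        if k % stride == 0:
            disk[k] = x
        x = step(x)
    x_final = x

    xbar = 1.0  # seed: d x_N / d x_N
    # Reverse sweep: process segments from last to first.
    for start in sorted(disk, reverse=True):
        end = min(start + stride, n_steps)
        # Restore from "disk" and recompute the primal states x_start .. x_{end-1}.
        states = [disk[start]]
        for _ in range(start, end - 1):
            states.append(step(states[-1]))
        # Adjoint steps through the segment in reverse order.
        for xk in reversed(states):
            xbar *= step_derivative(xk)
    return x_final, xbar


def adjoint_store_all(x0, n_steps):
    """Reference adjoint that keeps every primal state in memory."""
    states = []
    x = x0
    for _ in range(n_steps):
        states.append(x)
        x = step(x)
    xbar = 1.0
    for xk in reversed(states):
        xbar *= step_derivative(xk)
    return x, xbar
```

Calling `adjoint_two_level(1.0, 10, 4)` and `adjoint_store_all(1.0, 10)` should give the same final state and gradient; the two-level variant holds at most `stride` states in memory at once, at the cost of one extra forward pass over each segment.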