Fault-tolerant and adaptive time integrators
Many PinT methods share features that make them natural candidates for algorithm-based fault tolerance (ABFT): they hold copies of the (approximate) solution at different times on different processors, and they are iterative and/or hierarchical by nature. Since time stepping is typically the outermost loop in the numerical solution of a time-dependent partial differential equation, protecting it with ABFT covers a large part of the code. ABFT approaches, for instance those based on adaptivity in time, are promising because they can improve resilience and computational efficiency at the same time.
Fault-tolerant PFASST
We introduce and analyze different strategies that allow the parallel-in-time integration method PFASST to recover from hard faults and the resulting data loss. Since PFASST stores solutions at multiple time steps on different processors, information from adjacent steps can be used to recover after a processor has failed. PFASST's multi-level hierarchy allows the coarse level to be used for correcting the reconstructed solution, which can help to minimize overhead.
Ref: Robert Speck, Daniel Ruprecht, Toward fault-tolerant parallel-in-time integration with PFASST, Parallel Computing, Vol. 62, pp. 20-37, 2017.
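The recovery idea can be illustrated with a toy sketch. This is not PFASST and not one of the strategies from the reference: it assumes the scalar test problem u' = lam * u, lets a list entry stand in for the end-of-step solution held by each process, and uses a single implicit Euler step as a stand-in for a cheap coarse level. The names `coarse_step`, `recover`, and `end_values` are hypothetical.

```python
import math

def coarse_step(u, lam, dt):
    """One implicit Euler step for u' = lam * u, standing in for a cheap coarse level."""
    return u / (1.0 - lam * dt)

def recover(end_values, lost, u0, lam, dt):
    """Rebuild the lost end-of-step value from the left neighbour's data."""
    left = u0 if lost == 0 else end_values[lost - 1]
    if left is None:
        raise ValueError("left neighbour was lost as well")
    return coarse_step(left, lam, dt)

lam, dt, u0, n_steps = -1.0, 0.1, 1.0, 8
# Each entry mimics the end-of-step solution held by one process.
end_values = [u0 * math.exp(lam * dt * (n + 1)) for n in range(n_steps)]

lost = 4
reference = end_values[lost]
end_values[lost] = None                      # simulated hard fault: data is gone
end_values[lost] = recover(end_values, lost, u0, lam, dt)
recovery_error = abs(end_values[lost] - reference)
```

In this sketch the rebuilt value is only a low-order approximation of the lost one. In an iterative method, one would expect later iterations to reduce the remaining error, at the cost of some extra work.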
Adaptive SDC
The time integration method spectral deferred correction (SDC) offers multiple forms of adaptivity, in which the accuracy is measured and extra work is performed to increase accuracy if the target requirement is not met. This is primarily a strategy for adapting the temporal resolution to the requirements of the problem at runtime. However, if the accuracy target is not reached because a temporary fault has altered the solution, adaptive SDC will also expend more work in an attempt to correct the fault. Experiments with manually inserted bit flips in the solution show that adaptivity can increase computational efficiency and resilience to soft faults at the same time for a range of problems.
Ref: Thomas Baumann, Sebastian Götschel, Thibaut Lunet, Daniel Ruprecht, Robert Speck, Resilience Against Soft Faults through Adaptivity in Spectral Deferred Correction, arXiv:2412.00529 [cs.DC], submitted.
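The following toy sketch illustrates how an accuracy check can catch a soft fault. It is not SDC and does not reproduce the experiments from the reference. It swaps in an embedded Heun/Euler pair with a basic step-size controller for the test problem u' = lam * u, and flips one bit of the IEEE 754 double representation of the solution. The names `flip_bit` and `integrate`, and all parameter values, are assumptions made for illustration.

```python
import math
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of the 64-bit double x."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def integrate(lam, u0, t_end, dt, tol, fault_step=None, fault_bit=62):
    """Adaptive Heun/Euler pair for u' = lam * u; returns (u, rejected_steps)."""
    t, u, attempt, rejected = 0.0, u0, 0, 0
    while t_end - t > 1e-12:
        dt = min(dt, t_end - t)
        k1 = lam * u
        low = u + dt * k1                        # explicit Euler (order 1)
        high = u + 0.5 * dt * (k1 + lam * low)   # Heun (order 2)
        if attempt == fault_step:
            high = flip_bit(high, fault_bit)     # transient soft fault
        attempt += 1
        err = abs(high - low)                    # embedded error estimate
        if math.isfinite(err) and err <= tol:
            t, u = t + dt, high                  # accept the step
        else:
            rejected += 1                        # reject and recompute
        # Step-size controller with a bounded change per attempt.
        if not math.isfinite(err):
            scale = 0.5
        elif err == 0.0:
            scale = 2.0
        else:
            scale = 0.9 * math.sqrt(tol / err)
        dt *= min(2.0, max(0.5, scale))
    return u, rejected

u_clean, rej_clean = integrate(-1.0, 1.0, 1.0, 0.01, 1e-4)
u_fault, rej_fault = integrate(-1.0, 1.0, 1.0, 0.01, 1e-4, fault_step=5)
```

In this sketch the corrupted step produces a very large error estimate, so it is rejected and recomputed like any other step that misses the tolerance. A flip in a low mantissa bit would stay below the tolerance and pass undetected, but it would also perturb the solution by less than the accepted error.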