IAS Seminar "Fault-tolerant multigrid methods"
Speaker:
Prof. Dr. Dominik Göddeke, Institute of Applied Analysis and Numerical Simulation (IANS), University of Stuttgart
Abstract:
Future computer architectures are likely to exhibit a substantially reduced mean time between failures, due to their sheer size and to hardware design decisions induced by power limitations. At the same time, checkpoint-restart techniques for improving resilience are not expected to scale, at least for runs with huge data volumes. Algorithm-based fault tolerance (ABFT) is a family of techniques that alleviates these problems, in particular when combined with the "local failure local recovery" paradigm.
In this talk, we will demonstrate that multigrid solvers are inherently self-stabilising and robust to local failures. We will then present different ABFT techniques, for both node-loss scenarios and silent data corruption, that exploit algorithmic and numerical properties of multigrid methods to detect and correct failures. The resulting methods significantly improve robustness with almost negligible overhead and can easily be added on top of existing implementations.
Everyone interested is cordially invited to attend this talk.
Contact: Robert Speck, JSC