IAS Seminar "Fault-tolerant multigrid methods"

Start
11th May 2017 12:00 PM
End
11th May 2017 01:00 PM
Location
Jülich Supercomputing Centre, Hörsaal, building 16.3, room 222

Speaker:

Prof. Dr. Dominik Göddeke, Institute of Applied Analysis and Numerical Simulation (IANS), University of Stuttgart

Abstract:

Future computer architectures are likely to exhibit a substantially reduced mean time between failures, due to their sheer size and to hardware design decisions induced by power limitations. At the same time, checkpoint-restart techniques for improving resilience are not expected to scale, at least for runs with huge data volumes. Algorithm-based fault tolerance (ABFT) is a family of techniques to alleviate these problems, in particular when combined with the “local failure local recovery” paradigm.

In this talk, we will demonstrate that multigrid solvers are inherently self-stabilising and robust to local failures. We will then present different ABFT techniques, for both node-loss scenarios and silent data corruption, that exploit algorithmic and numerical properties of multigrid methods to detect and correct failures. The resulting methods significantly improve robustness with almost negligible overhead and can easily be added on top of existing implementations.

Anyone interested is cordially invited to participate in this seminar.

Contact: Robert Speck, JSC

Last Modified: 30.04.2022