IAS Seminar "Fault-tolerant multigrid methods"

11 May 2017, 14:00 to 15:00
Jülich Supercomputing Centre, Hörsaal, building 16.3, room 222
Prof. Dr. Dominik Göddeke, Institute of Applied Analysis and Numerical Simulation (IANS), University of Stuttgart

Future computer architectures are likely to exhibit a substantially reduced mean time between failures, due to their sheer size and hardware design decisions induced by power limitations. At the same time, checkpoint-restart techniques to improve resilience are not expected to scale, at least for runs with huge data volumes. Algorithm-based fault tolerance (ABFT) is a family of techniques to alleviate these problems, in particular when combined with the "local failure, local recovery" paradigm.

In this talk, we will demonstrate that multigrid solvers are inherently self-stabilising and robust to local failures. We will then present different ABFT techniques, for both node-loss scenarios and silent data corruption, that exploit algorithmic and numerical properties of multigrid methods to detect and correct failures. The resulting methods significantly improve robustness with almost negligible overhead and can easily be added on top of existing implementations.

Announcement as PDF file:
Fault-tolerant multigrid methods (PDF, 29 kB)

Anyone interested is cordially invited to participate in this seminar.
Contact: Robert Speck, JSC