
Training course "From zero to hero, Part II: Understanding and fixing intra-node performance bottlenecks"

(Course no. 1092019 in the 2019 training programme of Forschungszentrum Jülich)

5 November 2019, 09:00
6 November 2019, 16:30
Jülich Supercomputing Centre, Ausbildungsraum 1 (training room 1), building 16.3, room 213a
Scientists and developers who want to understand performance-critical hardware features of modern CPUs, such as SIMD, ILP, caches and out-of-order execution, and to use these features in their applications in a performance-portable way. (Advanced course)

Generic algorithms like FFTs or basic linear algebra can be accelerated by using 3rd-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations. But what happens if your problem does not fall into this category and 3rd-party libraries are not available?

In Part I of this course we provided insights into today's CPU microarchitecture. As example applications we used a plain vector reduction and a simple Coulomb solver. We started from basic implementations and advanced to optimized versions that use hardware features such as vectorization, unrolling and cache tiling to increase on-core performance. Part II sheds some light on achieving portable intra-node performance.
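To illustrate the kind of on-core optimization mentioned for Part I, here is a minimal sketch of a vector reduction with and without manual unrolling. It is not taken from the course material; the function names `sum_simple` and `sum_unrolled4` and the unroll factor of four are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Baseline: a single accumulator, so each addition depends on the previous one.
double sum_simple(const double* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Unrolled by four with independent accumulators. The four dependency chains
// can overlap in an out-of-order core (instruction-level parallelism).
double sum_unrolled4(const double* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += x[i];  // remainder loop for n not divisible by 4
    return s;
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input the result may differ from the baseline in the last bits.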

Continuing with the example applications from Part I, we use threading with C++11 std::thread to exploit multi-core parallelism and SMT (Simultaneous Multi-Threading). In this context, we discuss the fork-join model, tasking approaches and typical synchronization mechanisms.
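As a rough sketch of the fork-join model with C++11 `std::thread`, the vector reduction could be parallelized as shown below. The helper name `parallel_sum` and the static chunking scheme are assumptions made for this illustration, not the course's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` worker threads.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);  // one result slot per thread
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Static partitioning: each thread gets one contiguous chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    // Fork: start one thread per chunk.
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each thread writes only its own slot, so no lock is needed.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }

    // Join: wait for all workers; this is the only synchronization point.
    for (std::thread& w : workers) w.join();

    // Combine the per-thread results serially.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

A call such as `parallel_sum(v, std::thread::hardware_concurrency())` would use one thread per hardware thread reported by the system; whether using all SMT threads pays off depends on the kernel and has to be measured.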

To understand the parallel performance of memory-bound algorithms we take a closer look at the memory hierarchy and the parallel memory bandwidth. We consider data locality in the context of shared caches and NUMA (Non-Uniform Memory Access).
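One common way to address data locality on NUMA systems is "first touch" initialization. The sketch below assumes a Linux system with the default memory policy, where a page is typically placed on the NUMA node of the thread that first writes to it. All names (`chunk_begin`, `for_each_chunk`, `first_touch_sum`) are hypothetical, and the sketch does not pin threads to cores, which a real code would need to do for the placement to be reliable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper: start index of the partition owned by thread t.
std::size_t chunk_begin(std::size_t n, unsigned num_threads, unsigned t) {
    return n * t / num_threads;
}

// Runs work(t, begin, end) on one thread per partition and waits for all.
template <typename Work>
void for_each_chunk(std::size_t n, unsigned num_threads, Work work) {
    assert(num_threads > 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back(work, t, chunk_begin(n, num_threads, t),
                             chunk_begin(n, num_threads, t + 1));
    }
    for (std::thread& w : workers) w.join();
}

double first_touch_sum(std::size_t n, unsigned num_threads) {
    // Allocation only: new double[n] does not write to the memory.
    std::unique_ptr<double[]> storage(new double[n]);
    double* data = storage.get();

    // First touch: each thread initializes the partition it will use later.
    for_each_chunk(n, num_threads,
                   [data](unsigned, std::size_t b, std::size_t e) {
                       for (std::size_t i = b; i < e; ++i) data[i] = 1.0;
                   });

    // Compute phase with the same partitioning as the initialization.
    std::vector<double> partial(num_threads, 0.0);
    double* out = partial.data();
    for_each_chunk(n, num_threads,
                   [data, out](unsigned t, std::size_t b, std::size_t e) {
                       double s = 0.0;
                       for (std::size_t i = b; i < e; ++i) s += data[i];
                       out[t] = s;
                   });

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

The point of the sketch is only that initialization and computation share the same partitioning. Initializing the whole array from a single thread would instead tend to place all pages on one NUMA node.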

In this course we present several abstraction concepts to hide the hardware-specific optimizations. This improves readability and maintainability. We also discuss the overhead costs of the introduced abstractions and show compile-time SIMD configurations as well as corresponding performance results on different platforms.
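The following sketch shows one possible shape of such an abstraction: the SIMD width is a compile-time parameter, and the algorithm is written once against a small pack type. The names `Pack`, `simd_width` and `packed_sum` are invented for this example. The width selection relies on the predefined macros `__AVX512F__`, `__AVX__` and `__SSE__` as set by GCC and Clang, and the code depends on the compiler to vectorize the lane loops rather than using intrinsics.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time SIMD configuration: number of 32-bit float lanes per register.
#if defined(__AVX512F__)
constexpr std::size_t simd_width = 16;
#elif defined(__AVX__)
constexpr std::size_t simd_width = 8;
#elif defined(__SSE__)
constexpr std::size_t simd_width = 4;
#else
constexpr std::size_t simd_width = 1;  // scalar fallback
#endif

// Hypothetical pack type: W lanes of T with fixed-length lane loops.
template <typename T, std::size_t W>
struct Pack {
    static_assert(W > 0, "a pack needs at least one lane");
    T lane[W];
};

template <typename T, std::size_t W>
Pack<T, W> operator+(const Pack<T, W>& a, const Pack<T, W>& b) {
    Pack<T, W> r;
    for (std::size_t k = 0; k < W; ++k) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

template <typename T, std::size_t W>
T horizontal_sum(const Pack<T, W>& p) {
    T s = T();
    for (std::size_t k = 0; k < W; ++k) s += p.lane[k];
    return s;
}

// The reduction is written once; only the template argument is hardware-specific.
template <std::size_t W>
float packed_sum(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    Pack<float, W> acc{};  // all lanes zero-initialized
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        Pack<float, W> v;
        for (std::size_t k = 0; k < W; ++k) v.lane[k] = x[i + k];
        acc = acc + v;
    }
    float s = horizontal_sum(acc);
    for (; i < n; ++i) s += x[i];  // scalar remainder
    return s;
}
```

A caller would write `packed_sum<simd_width>(x, n)` and recompile with the appropriate architecture flags for each machine. Whether the abstraction is free of overhead depends on the compiler and has to be checked in the generated code or by measurement.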

Covered topics:

  • Memory Hierarchy: From register to RAM
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA, JURECA Booster and JUWELS
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Separation of concerns: Decoupling hardware details from suitable algorithms
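To make the data-structure topic above more concrete, the sketch below contrasts an array of structures (AoS) with a structure of arrays (SoA) for a set of charged particles, loosely modeled on the Coulomb example. The type and function names are assumptions made for illustration; AoSoA, which groups fixed-size SoA blocks into an array, is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: all fields of one particle are adjacent in memory.
struct ParticleAoS {
    double x, y, z, q;
};

// SoA: each field is stored contiguously across all particles.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};

// Converts an AoS container into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.q.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.q.push_back(p.q);
    }
    return soa;
}

// AoS traversal: reads q with a stride of four doubles, so most of each
// loaded cache line (x, y, z) is unused in this loop.
double total_charge(const std::vector<ParticleAoS>& aos) {
    double sum = 0.0;
    for (const ParticleAoS& p : aos) sum += p.q;
    return sum;
}

// SoA traversal: unit-stride access to q, which suits SIMD loads.
double total_charge(const ParticlesSoA& soa) {
    assert(soa.x.size() == soa.q.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < soa.q.size(); ++i) sum += soa.q[i];
    return sum;
}
```

Which layout is preferable depends on the access pattern: a loop that touches all fields of one particle at a time can be well served by AoS, while a loop over a single field, as here, tends to favor SoA.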

This course is for you if you have asked yourself one of the following questions:

  • Why is my parallel performance so bad?
  • Why should I not be afraid of threads?
  • When should I use SMT (hyperthreading)?
  • What is NUMA and why does it hurt me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is it that complicated? I thought science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step.

Prerequisites:

  • Participation in the Part I course or in-depth knowledge of the covered topics
  • Linux (ssh) and command-line tools (grep, less)
  • Knowledge of Fortran, C or C++ and of a threading framework (std::thread, pthreads, ...)
  • Experience with your own code exhibiting performance or scaling bottlenecks
  • Git: the examples are provided in a git repository
  • Editors: vim or emacs for working on remote machines
The course is held in English.
Duration: 2 days
Date: 5-6 November 2019, 09:00-16:30
Venue: Jülich Supercomputing Centre, Ausbildungsraum 1 (training room 1), building 16.3, room 213a
Number of participants: minimum 5, maximum 15
Instructors: Andreas Beckmann, Dr. Ivo Kabadshow, JSC
Contact: Andreas Beckmann
Phone: +49 2461 61-8713

Please send your registration to Andreas Beckmann by 25 October 2019.

If you are not an employee of Forschungszentrum Jülich, please include the following information with your registration:
First name, last name, date of birth, nationality, full residential address, e-mail address