POSTPONED to the second half of 2020 -- Training course "From zero to hero, Part I: Understanding and fixing on-core performance bottlenecks"

This course is postponed to the second half of 2020. The new date is yet to be determined.
(Course no. 852020 in the 2020 training programme of Forschungszentrum Jülich)

Start
28.04.2020, 09:00
End
29.04.2020, 16:30
Venue
Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

Contents:

Modern HPC hardware has many advanced, not easily accessible features that contribute significantly to the overall intra-node performance. However, many compute-bound HPC applications have grown historically by simply using more cores and were never designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications use only a small fraction of the available peak performance. Hence, scientists have the additional responsibility to design generic data layouts and data access patterns that give the compiler a fighting chance to generate code utilizing most of the available hardware features. Such data layouts and access patterns are vital for exploiting the performance gains of vectorization/SIMDization.
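To illustrate what a vectorization-friendly data layout can look like, here is a minimal sketch that contrasts an array of structures (AoS) with a structure of arrays (SoA). All type and function names (ParticleAoS, ParticlesSoA, to_soa, shift_x_aos, shift_x_soa) are hypothetical and not taken from the course material. Whether a given compiler actually vectorizes either loop depends on the compiler, its flags and the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures (AoS): the four values of one particle sit next to
// each other, so consecutive x values are 4 doubles apart in memory.
struct ParticleAoS {
    double x, y, z, q;
};

// Structure of Arrays (SoA): each component has its own contiguous array,
// so consecutive x values are adjacent in memory (unit stride).
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};

// Convert an AoS container into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.q.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.q.push_back(p.q);
    }
    return soa;
}

// Shift all x coordinates by dx, AoS layout: strided access to p[i].x.
void shift_x_aos(std::vector<ParticleAoS>& p, double dx) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].x += dx;
    }
}

// Shift all x coordinates by dx, SoA layout: unit-stride access to p.x[i].
void shift_x_soa(ParticlesSoA& p, double dx) {
    // All component arrays are expected to have the same length.
    assert(p.x.size() == p.q.size());
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += dx;
    }
}
```

Both functions compute the same result. The difference lies only in the memory access pattern: the SoA loop reads and writes contiguous doubles, which is typically the easier case for a compiler to map onto SIMD loads and stores.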

Generic algorithms like FFTs or basic linear algebra can be accelerated by using 3rd-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations. But what happens if your problem does not fall into this category and 3rd-party libraries are not available? The training course sheds some light on achieving on-core performance.

We provide insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As example applications, we use a plain vector reduction and a simple Coulomb solver. We start from basic implementations and advance to optimized versions that use techniques such as vectorization, unrolling and cache tiling to increase performance. The course also contains training on the use of open-source tools to measure and understand the achieved performance results.
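As a rough idea of what the step from a basic to an unrolled vector reduction might look like, consider the following sketch. It is not the course's code; the function names reduce_basic and reduce_unrolled4 and the unroll factor of four are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Baseline: a single accumulator. Every addition has to wait for the
// previous one, so the loop is limited by the latency of the add.
double reduce_basic(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Unrolled by four with independent accumulators. The four additions per
// iteration do not depend on each other, which exposes instruction-level
// parallelism to an out-of-order core.
double reduce_unrolled4(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Note that the unrolled version adds the elements in a different order. Since floating-point addition is not associative, the two functions can differ in the last bits for general input, which is also one reason why compilers do not apply this transformation on their own under strict floating-point semantics.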

Covered topics:

  • Inside a CPU: A scientist's view on modern CPU microarchitecture
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA, JURECA Booster and JUWELS
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Data Reuse: Register file and cache tiling
  • Compiler: When and how to use compiler optimization flags
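
To give an impression of the cache tiling topic listed above, the following sketch computes the pairwise Coulomb potential once in a plain double loop and once with the source particles processed in tiles. It is an illustration under stated assumptions, not the solver used in the course: the Charges type, both function names and the tile parameter are hypothetical, physical prefactors are omitted, and all particle positions are assumed to be distinct.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical SoA container for point charges.
struct Charges {
    std::vector<double> x, y, z, q;
};

// Plain version: for every target i, stream over all sources j.
// If the sources do not fit in cache, they are reloaded for every i.
std::vector<double> potential_plain(const Charges& c) {
    const std::size_t n = c.q.size();
    assert(c.x.size() == n && c.y.size() == n && c.z.size() == n);
    std::vector<double> phi(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;  // skip self-interaction
            const double dx = c.x[i] - c.x[j];
            const double dy = c.y[i] - c.y[j];
            const double dz = c.z[i] - c.z[j];
            phi[i] += c.q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return phi;
}

// Tiled version: sources are processed in blocks of `tile` particles, so
// one block can stay in cache while all targets reuse it.
std::vector<double> potential_tiled(const Charges& c, std::size_t tile) {
    const std::size_t n = c.q.size();
    assert(tile > 0);
    assert(c.x.size() == n && c.y.size() == n && c.z.size() == n);
    std::vector<double> phi(n, 0.0);
    for (std::size_t j0 = 0; j0 < n; j0 += tile) {
        const std::size_t j1 = std::min(j0 + tile, n);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = j0; j < j1; ++j) {
                if (i == j) continue;  // skip self-interaction
                const double dx = c.x[i] - c.x[j];
                const double dy = c.y[i] - c.y[j];
                const double dz = c.z[i] - c.z[j];
                phi[i] += c.q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
        }
    }
    return phi;
}
```

A suitable tile size depends on the cache sizes of the target CPU and is best determined by measurement. The branch on i == j is kept for readability; a tuned version would likely remove it from the inner loop.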

This course is for you if you ever asked yourself one of the following questions:

  • What is the performance of my code and how fast could it actually be?
  • Why is my performance so bad?
  • Does my code use SIMD?
  • Why does my code not use SIMD and why does the compiler not help me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is it so complicated? I thought science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

In Part II of the course, you will learn how to utilize these features in a performance-portable way on multiple cores of a node. Furthermore, we will show how to use abstraction layers to separate the hardware-specific optimizations from the algorithm.

Prerequisites:

Linux (ssh), command line tools (grep, less), knowledge of Fortran, C, C++;
Experience with your own code exhibiting performance/scaling bottlenecks;
optional:
Git: examples will be provided in a git repository
Editors: vim or emacs for easier/faster handling of performance data

Target audience:

Scientists and software developers who want to understand the performance-critical hardware aspects of modern CPUs (advanced course)

Language:

The course is held in English.

Duration:

2 days

Date:

postponed to the second half of 2020

Venue:

Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

Number of participants:

minimum 5, maximum 15

Instructors:

Andreas Beckmann, Dr. Ivo Kabadshow, JSC

Contact:

Andreas Beckmann
Phone: +49 2461 61-8713
Email: a.beckmann@fz-juelich.de

Registration:

Only possible once the date has been set.