Training course "From zero to hero, Part I: Understanding and fixing on-core performance bottlenecks"

14th May 2019 07:00 AM
15th May 2019 02:30 PM
Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

(Course no. 1082019 in the training programme 2019 of Forschungszentrum Jülich)

Target audience:

Scientists/Developers who want to understand performance-critical hardware features of modern CPUs (like SIMD, ILP, caches, out-of-order execution) and utilize these features in their code. (Advanced course)




Prerequisites:

  • Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C, or C++
  • Experience with your own code exhibiting performance/scaling bottlenecks
  • Git: examples are provided in a Git repository
  • Editors: vim or emacs to work on remote machines


Language:

The course is given in English.


Duration:

2 days


Date:

14-15 May 2019, 09:00-16:30


Venue:

Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

Number of participants:

minimum 5, maximum 15


Instructors:

Andreas Beckmann, Dr. Ivo Kabadshow, JSC


Contact:

Andreas Beckmann

Phone: +49 2461 61-8713



Registration:

Please register with Andreas Beckmann by 30 April 2019.

If you are not a member of the staff of Forschungszentrum Jülich, we need the following data for registration:

Given name, family name, date of birth, nationality, complete home address, email address

Modern HPC hardware has many advanced and not easily accessible features that contribute significantly to overall intra-node performance. However, many compute-bound HPC applications have grown historically to simply use more cores and were never designed to utilize these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications use only a small fraction of the available peak performance. Hence, scientists have the additional responsibility to design generic data layouts and data access patterns that give the compiler a fighting chance to generate code utilizing most of the available hardware features. Such data layouts and access patterns are vital to extracting performance from vectorization/SIMDization.
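To illustrate why data layout matters (an illustrative sketch with our own type names, not material from the course), compare an array-of-structures layout with a structure-of-arrays layout for a simple per-component update:

```cpp
#include <cstddef>
#include <vector>

// Array of Structures (AoS): x, y, z of one particle are adjacent in
// memory, so a loop touching only x strides through memory and gives
// the compiler little opportunity for unit-stride SIMD loads.
struct ParticleAoS { float x, y, z; };

void scale_x_aos(std::vector<ParticleAoS>& p, float s) {
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i].x *= s;  // stride of sizeof(ParticleAoS) bytes
}

// Structure of Arrays (SoA): all x values are contiguous, giving the
// compiler unit-stride accesses it can turn into SIMD loads/stores.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

void scale_x_soa(ParticlesSoA& p, float s) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] *= s;  // unit stride: straightforward to vectorize
}
```

Both functions compute the same result; the difference only shows up in the generated code and the memory traffic, which is exactly the kind of effect the course's tooling sessions make visible.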

Generic algorithms like FFTs or basic linear algebra can be accelerated by third-party libraries and tools that are specially tuned and optimized for a multitude of different hardware configurations. But what happens if your problem does not fall into this category and third-party libraries are not available? This training course sheds some light on achieving on-core performance.

We provide insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As example applications we use a plain vector reduction and a simple Coulomb solver. We start from basic implementations and advance to optimized versions using hardware features such as vectorization, unrolling and cache tiling to increase performance. The course also contains training on the use of open-source tools to measure and understand the achieved performance results.
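To give a flavour of the kind of step-by-step optimization described above (a sketch of the general technique, not the course's actual code): a plain sum reduction carries a loop-carried dependency on a single accumulator, while unrolling with several independent accumulators exposes instruction-level parallelism to the out-of-order core:

```cpp
#include <cstddef>
#include <vector>

// Baseline: every addition depends on the previous one, so the loop
// runs at the latency of one floating-point add per element.
float sum_naive(const std::vector<float>& v) {
    float s = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Unrolled by four with independent accumulators: the four dependency
// chains can execute concurrently on an out-of-order core, and the
// pattern maps naturally onto SIMD lanes.
float sum_unrolled(const std::vector<float>& v) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < v.size(); ++i)  // remainder elements
        s += v[i];
    return s;
}
```

Note that the two versions may differ in the last bits for floating-point input, since the unrolled variant reassociates the additions; measuring and reasoning about such trade-offs is part of the hands-on work.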

Covered topics:

  • Inside a CPU: A scientist's view on modern CPU microarchitecture
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA, JURECA Booster and JUWELS
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Data Reuse: Register file and cache tiling
  • Compiler: When and how to use compiler optimization flags
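
To give a concrete flavour of the "Data Reuse" topic above (an illustrative sketch with our own function names, not material from the course), a matrix transpose can be tiled so that each block's working set fits in cache:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose of an n x n row-major matrix: the writes to dst
// stride through memory column by column, so cache lines are evicted
// long before all their elements have been used.
void transpose_naive(const std::vector<double>& src,
                     std::vector<double>& dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Tiled transpose: process B x B blocks so that the source and
// destination lines of one block stay resident in cache. The tile
// size B is a tunable, cache-dependent parameter.
void transpose_tiled(const std::vector<double>& src,
                     std::vector<double>& dst, std::size_t n,
                     std::size_t B = 64) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

Both routines produce identical results; the tiled version simply reorders the memory accesses, which is the essence of cache tiling.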

This course is for you if you have ever asked yourself one of the following questions:

  • What is the performance of my code and how fast could it actually be?
  • Why is my performance so bad?
  • Does my code use SIMD?
  • Why does my code not use SIMD and why does the compiler not help me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is it so complicated? I thought science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

In Part II of the course you will learn how to utilize these features in a performance portable way on multiple cores of a node. Furthermore, we will show how to use abstraction layers to separate the hardware-specific optimizations from the algorithm.

Last Modified: 20.05.2022