Training course "From zero to hero, Part II: Understanding and fixing intra-node performance bottlenecks"

(Course no. 862020 in the training programme 2020 of Forschungszentrum Jülich)

03 Nov 2020 09:00
04 Nov 2020 16:30
Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a


Target audience:
Scientists and developers who want to understand performance-critical hardware features of modern CPUs, such as SIMD, ILP, caches or out-of-order execution, and use these features in their applications in a performance-portable way. (Advanced course)

Generic algorithms like FFTs or basic linear algebra can be accelerated by using 3rd-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations. But what happens if your problem does not fall into this category and 3rd-party libraries are not available?

In Part I of this course we provided insights into today's CPU microarchitecture. As example applications we used a plain vector reduction and a simple Coulomb solver. We started from basic implementations and advanced to optimized versions using hardware features such as vectorization, unrolling and cache tiling to increase on-core performance. Part II sheds some light on achieving portable intra-node performance.

Continuing with the example applications from Part I, we use threading with C++11 std::thread to exploit multi-core parallelism and SMT (Simultaneous Multi-Threading). In this context, we discuss the fork-join model, tasking approaches and typical synchronization mechanisms.
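As a rough illustration of the fork-join model applied to the vector reduction mentioned above, the following is a minimal sketch and not the course's own code. The function name `parallel_sum` and the chunking scheme are assumptions made for this example. Each thread sums its own contiguous chunk and writes one partial result; the main thread joins all workers and combines the partial sums.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not taken from the course material): sums `data`
// with `num_threads` worker threads using a fork-join pattern.
double parallel_sum(const std::vector<double>& data, unsigned num_threads)
{
    assert(num_threads > 0);

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    const std::size_t n = data.size();
    // Round up so that every element is covered by some chunk.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    // Fork: each worker reduces one contiguous chunk.
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min<std::size_t>(n, t * chunk);
        const std::size_t end = std::min<std::size_t>(n, begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // The chunk is accumulated locally and stored once, so the
            // shared `partial` array is written only once per thread.
            partial[t] = std::accumulate(data.data() + begin,
                                         data.data() + end, 0.0);
        });
    }

    // Join: wait for all workers before reading their results.
    for (std::thread& w : workers) {
        w.join();
    }

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `parallel_sum(values, std::thread::hardware_concurrency())` would use one thread per hardware thread reported by the system; note that `hardware_concurrency()` may return 0, which this sketch does not handle. With GCC or Clang on Linux the code typically needs the `-pthread` flag.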

To understand the parallel performance of memory-bound algorithms we take a closer look at the memory hierarchy and the parallel memory bandwidth. We consider data locality in the context of shared caches and NUMA (Non-Uniform Memory Access).

In this course we present several abstraction concepts to hide the hardware-specific optimizations. This improves readability and maintainability. We also discuss the overhead costs of the introduced abstractions and show compile-time SIMD configurations as well as corresponding performance results on different platforms.
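The announcement does not show the abstractions used in the course, so the following is only a sketch of the general idea of a compile-time SIMD configuration. The names `reduce`, `reduce_default` and `simd_width` are assumptions made for this example. The reduction is written once against a `Width` template parameter, and a single platform-dependent constant selects the width at compile time. No intrinsics are used; whether the inner lane loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Generic reduction over `Width` independent accumulator lanes.
// The algorithm itself contains no hardware-specific code.
template <std::size_t Width>
double reduce(const double* data, std::size_t n)
{
    static_assert(Width > 0, "Width must be positive");
    assert(data != nullptr || n == 0);

    double lanes[Width] = {};
    std::size_t i = 0;

    // Main loop: process full blocks of Width elements.
    for (; i + Width <= n; i += Width) {
        for (std::size_t l = 0; l < Width; ++l) {
            lanes[l] += data[i + l];
        }
    }

    // Combine the lanes, then handle the remaining elements.
    double sum = 0.0;
    for (std::size_t l = 0; l < Width; ++l) {
        sum += lanes[l];
    }
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Hypothetical platform configuration: number of doubles per vector
// register, chosen from predefined GCC/Clang feature macros.
#if defined(__AVX512F__)
constexpr std::size_t simd_width = 8;  // 512-bit registers
#elif defined(__AVX__)
constexpr std::size_t simd_width = 4;  // 256-bit registers
#else
constexpr std::size_t simd_width = 2;  // assume 128-bit registers
#endif

// Callers use this entry point and never mention the width.
inline double reduce_default(const double* data, std::size_t n)
{
    return reduce<simd_width>(data, n);
}
```

Because the width is a template parameter, different configurations such as `reduce<4>` and `reduce<8>` can be instantiated side by side and compared on the same machine.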

Covered topics:

  • Memory Hierarchy: From register to RAM
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA, JURECA Booster and JUWELS
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Separation of concerns: Decoupling hardware details from suitable algorithms
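To make the data-structure topic above more concrete, here is a minimal sketch of the AoS (array of structures) and SoA (structure of arrays) layouts for charged particles, loosely modelled on the Coulomb solver example. All type and function names (`ParticleAoS`, `ParticlesSoA`, `to_soa`, `total_charge`) are assumptions for illustration and do not come from the course material.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: all fields of one particle are adjacent in memory.
struct ParticleAoS {
    double x, y, z;  // position
    double q;        // charge
};

// SoA: each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), q(n) {}
};

// Convert an AoS container into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos)
{
    ParticlesSoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.x[i] = aos[i].x;
        soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z;
        soa.q[i] = aos[i].q;
    }
    assert(soa.q.size() == aos.size());
    return soa;
}

// AoS version: reads one double out of every four-double record,
// i.e. a strided access pattern.
double total_charge(const std::vector<ParticleAoS>& particles)
{
    double sum = 0.0;
    for (const ParticleAoS& p : particles) {
        sum += p.q;
    }
    return sum;
}

// SoA version: reads a single contiguous array (unit stride).
double total_charge(const ParticlesSoA& particles)
{
    double sum = 0.0;
    for (double q : particles.q) {
        sum += q;
    }
    return sum;
}
```

Both versions compute the same result; they differ only in memory access pattern. Which layout is faster depends on how many fields a kernel touches per particle and on the target hardware. AoSoA is commonly described as a hybrid that stores small fixed-size SoA blocks inside an array.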

This course is for you if you have asked yourself one of the following questions:

  • Why is my parallel performance so bad?
  • Why should I not be afraid of threads?
  • When should I use SMT (hyperthreading)?
  • What is NUMA and why does it hurt me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is it that complicated, I thought science was the hard part?

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step.

Prerequisites:
Participation in the Part I course or deep knowledge of the covered topics;
Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C or C++ and a threading framework (std::thread, pthreads, ...);
Experience with own code exhibiting performance/scaling bottlenecks;
Git: examples are provided in a git repository;
Editors: vim or emacs to work on remote machines
Language:
The course is given in English.
Duration:
2 days
Date:
3-4 November 2020, 09:00-16:30
Venue:
Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a
Number of participants:
minimum 5, maximum 15
Instructors:
Andreas Beckmann, Dr. Ivo Kabadshow, JSC
Contact:
Andreas Beckmann
Phone: +49 2461 61-8713
Registration:
Please register with Andreas Beckmann until 20 October 2020.
If you do not belong to the staff of Forschungszentrum Jülich, we need these data for registration:
Given name, name, birthday, nationality, complete home address, email address

Additional Information

JSC Events - Measures Regarding the Coronavirus Pandemic

Due to the preventive measures at Forschungszentrum Jülich regarding the spread of the coronavirus, all JSC courses and events in March and April were cancelled. They will be rescheduled later.

For the time being and with the rapidly changing situation in mind, JSC cannot foresee whether courses and events in May and later can take place as scheduled as face-to-face events. Seminar talks might be streamed as video conferences, and courses might be postponed or partly given as webinars. We still take registrations for the upcoming courses. All participants who have registered for courses so far will be notified by e-mail after the regular registration deadline whether and how the courses will be held.

Please check the webpage regularly to see whether upcoming events will take place at all or be streamed as a video conference.