Training course "From zero to hero: Understanding and fixing intra-node performance bottlenecks"

Start
11 April 2018, 07:00
End
12 April 2018, 14:30
Location
Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

(Course no. 1242018 in the training programme 2018 of Forschungszentrum Jülich)

This course is fully booked. Further registrants will be put on the waiting list.

Target audience:

Scientists/Developers who want to understand performance-critical hardware features of modern CPUs (like SIMD, ILP, caches, out-of-order execution) and utilize these features in their code. (Advanced course)

Contents:


Prerequisites:

Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C, or C++


Optional:

  • Git: examples will be provided in a git repository
  • Editors: vim or emacs for easier/faster handling of performance data

Language:

The course is given in English.

Duration:

2 days

Date:

11-12 April 2018, 09:00-16:30

Venue:

Jülich Supercomputing Centre, Ausbildungsraum 1, building 16.3, room 213a

Number of participants:

minimum 5, maximum 14

Instructors:

Andreas Beckmann, Dr. Ivo Kabadshow, JSC

Contact:

Andreas Beckmann


Phone: +49 2461 61-8713


E-mail: a.beckmann@fz-juelich.de

Registration:



Contact Andreas Beckmann.


If you do not belong to the staff of Forschungszentrum Jülich, we need the following data for registration:


Given name, family name, date of birth, nationality, complete home address, e-mail address

Modern HPC hardware offers many advanced, not easily accessible features that contribute significantly to overall intra-node performance. However, many compute-bound HPC applications have historically grown to simply use more cores and were never designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications use only a small fraction of the available peak performance. As scientists, we therefore have the added responsibility of designing generic data layouts and data access patterns that give the compiler a fighting chance to generate code utilizing most of the available hardware features. Such data layouts and access patterns are vital for extracting performance from vectorization/SIMDization. Generic algorithms like FFTs or basic linear algebra can be accelerated with third-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations.
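To illustrate the data-layout point, the following minimal sketch (illustrative only, not part of the course materials) contrasts an array-of-structs layout with a struct-of-arrays layout for the same operation:

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structs (AoS): x, y, z of one particle are adjacent in memory,
// so a loop over all x-coordinates strides through memory and resists SIMD.
struct ParticleAoS { double x, y, z; };

// Struct-of-Arrays (SoA): all x-coordinates are contiguous, which lets the
// compiler load several of them into one vector register at once.
struct ParticlesSoA {
    std::vector<double> x, y, z;
};

// The same operation on both layouts: shift every particle along x.
void shift_x_aos(std::vector<ParticleAoS>& p, double dx) {
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i].x += dx;               // stride of sizeof(ParticleAoS) bytes
}

void shift_x_soa(ParticlesSoA& p, double dx) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dx;               // unit stride: trivially vectorizable
}
```

Both functions compute the same result; only the memory layout differs, and with it the chance that the compiler emits SIMD code for the loop.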

But what happens if your problem does not fall into this category and third-party libraries are not available? This training course will shed some light on how the goal of utilizing on-core performance, and ultimately performance portability, can be achieved.

In the first part of the training course we want to give insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As a demonstrator we will use a simple Coulomb solver and improve the code step by step. We will start from a basic implementation and advance to an optimized version using hardware features like vectorization to increase performance.
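To give a rough idea of what such a basic starting point can look like (a hypothetical sketch, not the actual course code), a direct O(N²) Coulomb potential sum might read:

```cpp
// Direct O(N^2) Coulomb potential sum: the naive starting point such a
// demonstrator typically uses (hypothetical sketch, not the course code).
#include <cmath>
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<double> x, y, z, q;  // positions and charges, SoA layout
};

// Potential at particle i induced by all other particles j.
double potential_at(const Particles& p, std::size_t i) {
    double phi = 0.0;
    for (std::size_t j = 0; j < p.x.size(); ++j) {
        if (j == i) continue;        // skip self-interaction
        const double dx = p.x[i] - p.x[j];
        const double dy = p.y[i] - p.y[j];
        const double dz = p.z[i] - p.z[j];
        phi += p.q[j] / std::sqrt(dx*dx + dy*dy + dz*dz);
    }
    return phi;
}
```

A kernel like this is simple enough to reason about, yet leaves plenty of room for the layout, unrolling, and vectorization steps the course works through.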

The exercises will also include training on using open-source tools to measure and understand the achieved performance. Such optimizations, however, depend heavily on the targeted hardware and should not be part of the algorithmic layer of the code.

In the second part we will present a detailed description of possible abstraction layers that hide such hardware specifics and therefore preserve readability and maintainability. We will also discuss the overhead costs of the introduced abstraction and show compile-time SIMD configurations and corresponding performance results on different platforms.
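One way such a compile-time SIMD abstraction can be pictured (names and design here are illustrative assumptions, not the course's actual layer):

```cpp
#include <array>
#include <cstddef>

// Minimal sketch of a compile-time SIMD abstraction: the algorithm is
// written once against simd_pack<T, W>, and the width W is chosen per
// platform at compile time (e.g. 8 doubles for AVX-512, 4 for AVX2).
// Names are illustrative, not from the course materials.
template <typename T, std::size_t W>
struct simd_pack {
    std::array<T, W> v{};

    simd_pack& operator+=(const simd_pack& o) {
        for (std::size_t i = 0; i < W; ++i)  // compilers typically turn this
            v[i] += o.v[i];                  // fixed-width loop into SIMD adds
        return *this;
    }
};

// Platform selection: one typedef changes the width for the whole code base.
#if defined(__AVX512F__)
using pack_d = simd_pack<double, 8>;
#else
using pack_d = simd_pack<double, 4>;
#endif
```

The algorithmic code only ever sees `pack_d`, so retargeting to a machine with a different vector width is a recompile rather than a rewrite.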

Some covered topics:

  • Inside a CPU: A scientist's view on modern CPU microarchitecture
  • Data structures: When to use SoA, AoS, and AoSoA
  • Vectorization: SIMD on JURECA and JURECA Booster
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Data Reuse: Register file and cache-blocking
  • Compiler: When and how to use compiler optimization flags
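As a taste of the unrolling topic above, here is a minimal sketch (not from the course materials) of a reduction with independent accumulators, the classic way to expose instruction-level parallelism to an out-of-order core:

```cpp
#include <cstddef>
#include <vector>

// Unrolled reduction with four independent accumulators: each partial sum
// has its own dependency chain, so an out-of-order core can keep several
// additions in flight at once instead of waiting on one serial chain.
double sum_unrolled4(const std::vector<double>& a) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < a.size(); ++i) s0 += a[i];  // remainder loop
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of floating-point additions, so it is only bitwise-identical to the naive loop under relaxed floating-point assumptions; this trade-off is exactly the kind of detail the hands-on sessions examine.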

If you have ever asked yourself one of the following questions, this course is for you:

  • What is the performance of my code and how fast could it actually be?
  • Why is my performance so bad?
  • Does my code use SIMD?
  • Why does my code not use SIMD and why does the compiler not help me?
  • Is my data-structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is this so complicated? I thought the science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

Last Modified: 20.05.2022