Directions for the hands-on session

The following sections are meant as a rough guideline on what to look for and which tools to try during the workshop. They are by no means a comprehensive guide to porting or tuning your code, nor do you have to work through all sections. They also do not try to cover every case or topic; they are merely meant to be a starting point for you. Ask our advisors for more expert advice or additional help.

Porting

  • Have you familiarised yourself with JURECA's hardware and noted relevant differences to your previous architecture?
    Such differences may severely affect performance or result in non-optimal use of the available resources.
  • Do you encounter problems with a specific compiler or version of a compiler?
    We do provide several compilers and toolchains. Not all are equally verbose or strict.
  • Have you tried the different available compilers and compared the resulting performance and compiler warnings?
    It is also worthwhile checking the compiler reports for optimisation hints.
    For Fortran users: are compile-time checks enabled?
  • Are any dependencies (libraries) missing?
    Do you need to compile them all yourself, or do we already provide optimised versions, e.g. GPU-enabled, threaded, or vectorised builds?

Relevant talks and JSC on-line resources:

talk: JURECA hardware and best practices
Introduction to supercomputing resources
JURECA configuration
Available compilers and libraries
Software modules
Compile and execute
Processor affinity

Debugging

  • Do errors depend on specific problem sizes or the number of nodes used?
    Vary the problem size and the number of nodes, and also try different toolchains (compilers, MPI libraries).
  • Do the errors depend on I/O or MPI?
    Several tools help with investigating I/O or MPI correctness.
  • Do you encounter errors that require debuggers?
    Start with more conservative compiler flags; Fortran users should enable run-time checks. A minimal example of the kind of bug such checks catch follows this list.
  • Do you need parallel debugging or is serial debugging sufficient?
    The choice of debuggers differs for serial and parallel debugging.
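
To illustrate the kind of bug that such checks catch, here is a minimal, artificial C sketch (not taken from any real code) with an off-by-one write past the end of an array. Compiled normally it may appear to run fine; with an address sanitizer enabled (e.g. -fsanitize=address with GCC or Clang) or inside a memory debugger, the faulty line is reported directly. The Fortran analogue would be array bounds checking (e.g. gfortran's -fcheck=bounds).

    /* Minimal, artificial example of a memory error that run-time checks
     * or a memory debugger pinpoint immediately. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 100;
        double *field = (double *) malloc(n * sizeof(double));

        /* off-by-one: the loop writes one element past the end of the array */
        for (int i = 0; i <= n; i++)
            field[i] = 1.0 * i;

        printf("last element: %f\n", field[n - 1]);
        free(field);
        return 0;
    }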

Relevant talks and JSC on-line resources:

talk: Debuggers and performance tools
Parallel debugging

Performance analysis

Start by generating a baseline measurement. Identify a suitable test case that is small enough not to waste resources (and waiting time) but that is representative enough for your simulations.

  • Have you tried setting up automated benchmarks?
    The Jülich Benchmarking Environment helps to automate benchmarks, including the analysis of the results.
  • What are relevant metrics for your code?
    Think about FLOP rates, bandwidths, updates/iterations per time, I/O sizes, time steps per wall-clock time, etc.
  • How do those metrics compare to other systems, theoretical peak numbers for JURECA, or your compute time budget?
    Performance modelling may help to judge whether improvements can be expected and are worth pursuing.
  • Have you generated a first performance report with Allinea tools?
    These provide a general overview, highlight bottlenecks, and give rough guidance on what to improve.
  • Have you tried a performance analysis with Score-P/Scalasca or other tools?
    Flat profiles provide insight into MPI problems, general profiling reveals hot spots, tracing may reveal imbalances.
  • What proportion of your code runs serially? How much time is spent in parallel execution?
    Depending on your answers, Amdahl's law might limit scalability: with a serial fraction of only 5%, for example, the speedup can never exceed 20, no matter how many cores you use.
  • How much time is lost due to MPI overhead?
    A different communication scheme or more asynchronicity may reduce this overhead; see the non-blocking communication sketch after this list.
  • How much time is spent for MPI communication and synchronisation?
    Are there better communication schemes, or can you reduce imbalances and increase concurrency?
  • Have you investigated hardware counters?
    PAPI counters can be used stand-alone or from within most performance tools; a minimal stand-alone sketch is shown after the list.
  • What proportion of your code uses vectorised instructions?
    Modern processors rely heavily on vectorisation to reach their peak performance; without it, JURECA might lose up to 3/4 of its potential.
  • Can you identify scalability bottlenecks?
    These may define a sweet spot for your runs on JURECA, or they may need to be rectified.
  • Can you identify performance bottlenecks?
    Additional compiler switches or optimised libraries should be your first attempt to tackle hotspots.
  • What is the execution time overhead for OpenMP/pthreads?
    Reducing parallel regions or different parallel constructs may reduce overhead, also consider the type of locks or atomic operations you use.
  • Have you studied the effect of compilers and their options?
    Try generating compiler reports to look for performance-critical hints.
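
To make the point about asynchronicity concrete, the following is a minimal sketch of overlapping communication with computation using non-blocking MPI calls. The ring exchange and the placeholder work loop are purely illustrative, and whether real overlap occurs also depends on the MPI library and the network.

    /* Minimal halo-exchange sketch: each rank sends one value to its right
     * neighbour and receives one from its left neighbour, overlapping the
     * exchange with placeholder computation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        double send = (double) rank, recv = -1.0, work = 0.0;
        MPI_Request reqs[2];

        /* post the communication first ... */
        MPI_Irecv(&recv, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&send, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... overlap it with work that does not depend on the received data ... */
        for (int i = 0; i < 1000000; i++)
            work += 1e-6;

        /* ... and wait only where the received data is actually needed */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d received %f (work %f)\n", rank, recv, work);

        MPI_Finalize();
        return 0;
    }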

Finish by running the same setup as for your baseline measurement and quantify any differences.
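
Coming back to the hardware-counter item above: besides being read from performance tools, PAPI counters can be queried directly around a region of interest. Below is a minimal sketch; the preset event PAPI_DP_OPS is only an example, and which events are available depends on the processor (the papi_avail utility lists them).

    /* Minimal stand-alone PAPI sketch: count double-precision operations
     * around a simple floating-point loop. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long counts[1];

        /* initialise PAPI and set up one preset event */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_DP_OPS);

        double sum = 0.0;
        PAPI_start(eventset);

        /* region of interest */
        for (int i = 1; i <= 10000000; i++)
            sum += 1.0 / (double) i;

        PAPI_stop(eventset, counts);
        printf("sum = %f, double-precision operations: %lld\n", sum, counts[0]);
        return 0;
    }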

Relevant talks and JSC on-line resources:

talk: Debuggers and performance tools
talk: Performance tools use case
talk: Performance Tuning with Intel software tools: Advisor, VTune, ITAC
talk: The Jülich Benchmarking Environment
talk: Vectorization -- Why you shouldn't care

I/O

  • How many files are written to disk?
    Consider generating fewer files to help I/O and post-processing.
  • What are the file sizes?
    Note file sizes and I/O times to compare with the theoretical bandwidth.
  • Did you analyse I/O behaviour?
    Tools like Scalasca and Darshan provide metrics specifically for I/O.
  • What are the size and number of I/O operations?
    Darshan provides these metrics and allows for a basic evaluation.
  • Have you investigated parallel I/O?
    There are several approaches to parallel I/O: HDF5, netCDF, MPI-IO, SIONlib; a minimal MPI-IO sketch follows this list.
  • Do you use the right filesystem?
    /work offers higher performance than /home, which is not meant for production data.
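
As a concrete illustration of the parallel I/O item above, here is a minimal MPI-IO sketch in which all ranks write into a single shared file using a collective call; the file name and buffer size are placeholders. Higher-level libraries such as HDF5, netCDF, or SIONlib follow the same idea of writing shared files instead of one file per task.

    /* Minimal MPI-IO sketch: all ranks write their own block of doubles
     * into one shared file. Collective calls such as MPI_File_write_at_all
     * let the MPI library optimise access to the parallel filesystem. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1024 };              /* doubles per rank, illustration only */
        double buf[N];
        for (int i = 0; i < N; i++)
            buf[i] = rank + 0.001 * i;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* each rank writes at its own offset in the shared file */
        MPI_Offset offset = (MPI_Offset) rank * N * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }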

Relevant talks and JSC on-line resources:

talk: Parallel I/O
talk: Visualisation on JURECA
talk: Scientific Big Data Analytics by HPC
GPFS filesystem

Multi-threading

  • Is your code suitable for multi-threading?
    Different data structures suit shared-memory and distributed-memory parallelisation; refactoring may be required.
  • Have you looked for multi-threaded libraries?
    E.g. the Intel MKL library offers multi-threaded routines, reducing the changes you have to make to your own code.
  • Is multi-threading overhead significant?
    This may depend on the multi-threading model or the use of locks and atomic operations.
  • Have you tested your code for load-imbalances?
    Experiment with a different scheduling strategy or a programming approach like tasking; see the sketch after this list.
  • What fraction of your code uses multi-threading?
    You might profit from further asynchronous code paths if simple parallel loops are not an option.
  • Did you investigate code scalability with respect to the number of threads and MPI tasks?
    NUMA, memory requirements, MPI imbalances, or the domain decomposition influence the ideal ratio of threads to MPI tasks.
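
The following sketch illustrates the load-imbalance item above: an artificial loop whose iterations have very different costs. Comparing schedule(static) with schedule(dynamic) (or with a task-based version) shows how the scheduling choice affects the run time; the work function is made up purely to create imbalance.

    /* Loop with very uneven per-iteration cost: with the default static
     * schedule some threads finish early and wait, while schedule(dynamic)
     * hands out iterations on demand and evens out the load. */
    #include <math.h>
    #include <stdio.h>
    #include <omp.h>

    static double work(int i)
    {
        double s = 0.0;
        for (int j = 0; j < i * 100; j++)   /* cost grows with i */
            s += sin((double) j);
        return s;
    }

    int main(void)
    {
        const int n = 1000;
        double total = 0.0;
        double t0 = omp_get_wtime();

        /* compare schedule(static) vs. schedule(dynamic) (or a task-based
         * version) and measure the difference in run time */
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < n; i++)
            total += work(i);

        printf("total = %f, time = %f s\n", total, omp_get_wtime() - t0);
        return 0;
    }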

Relevant talks and JSC on-line resources:

Simultaneous multi-threading

GPU programming

  • Have you used nvprof and the NVIDIA Visual Profiler to analyze the performance of your GPU application?
    Both are the tools of choice for GPU-focused performance analysis; they provide extensive functionality for judging the bottlenecks of your application.
    Tip: Try using the guided analysis and the provided performance experiments in the Visual Profiler!
  • What is your GPU utilization?
    Is the GPU used where it makes sense or is it idle? Apart from checking the timeline in the Visual Profiler there are also dedicated experiments for this.
  • Are you avoiding unnecessary data transfers?
    Data transfers between host and device (i.e. CPU and GPU) are costly in most cases. Rethink whether each transfer is necessary and try to avoid them where possible.
  • Can you overlap unavoidable data transfer with GPU kernels?
    Streams can create independent pipelines in your CUDA program to overlap data transfers with computation; see the sketch after this list.
  • Are you exposing enough parallelism for the most time-consuming GPU kernels?
    GPU computing benefits from many concurrently running thread blocks. Do you launch enough?
  • Do you have coalesced memory access in the most time-consuming GPU kernels?
    Data that is contiguous in memory can be accessed more efficiently than scattered data. What is the memory access pattern in your case?
    Tip: Compile your CUDA kernels with the --lineinfo option and run the Global Memory Access Pattern experiment in the Visual Profiler. Also, are your data structures suited for coalesced access (SoA instead of AoS)?
  • How much time is spent on data transfer to/from the GPU vs. time spent doing work on the GPU?
    Does the overhead still warrant offloading the work to the GPU? If not, consider the transfer tuning options mentioned above.
  • Are your kernels compute intensive, i.e., do they perform many operations per byte?
    A compute-intensive kernel has less overhead and is much better suited for the GPU than for the CPU.
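
To illustrate the overlap of transfers and kernels mentioned above, here is a minimal CUDA sketch that splits a buffer into chunks and processes each chunk in its own stream; the kernel, chunk size, and number of streams are arbitrary choices for illustration. Pinned host memory (cudaMallocHost) is needed for the asynchronous copies to actually overlap.

    /* Overlap host-device transfers with kernel execution: the input is
     * split into chunks, and each chunk is copied in, processed, and
     * copied back in its own CUDA stream. */
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(double *data, int n, double factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int nStreams = 4;
        const int chunk    = 1 << 20;              /* elements per chunk */
        const int n        = nStreams * chunk;

        double *h, *d;
        cudaMallocHost(&h, n * sizeof(double));    /* pinned host buffer */
        cudaMalloc(&d, n * sizeof(double));
        for (int i = 0; i < n; i++)
            h[i] = 1.0;

        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; s++)
            cudaStreamCreate(&streams[s]);

        /* copy in, compute, and copy out each chunk in its own stream so
         * that transfers of one chunk overlap with kernels of another */
        for (int s = 0; s < nStreams; s++) {
            int off = s * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();
        printf("h[0] = %f\n", h[0]);

        for (int s = 0; s < nStreams; s++)
            cudaStreamDestroy(streams[s]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }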

Relevant talks and JSC on-line resources:

talk: GPU Programming with OpenACC: A LBM Case Study
GPU computing

Last Modified: 16.11.2022