
Institute for Advanced Simulation (IAS)

Compiling and Tuning Applications on JUQUEEN

Compilers

The IBM XL compiler suite for C, C++ and FORTRAN as well as the GNU Compiler Collection are available on JUQUEEN. The table below lists the available compiler commands. The GNU compilers for the compute nodes are located in the directory "/bgsys/drivers/ppcfloor/gnu-linux/bin". Wrapper scripts are offered in order to link applications with the appropriate MPI libraries.

C:
  IBM XL compiler:       bgxlc
  IBM XL MPI wrapper:    mpixlc
  GNU compiler:          powerpc64-bgq-linux-gcc
  GNU MPI wrapper:       mpigcc

C++:
  IBM XL compilers:      bgxlc++, bgxlC
  IBM XL MPI wrapper:    mpixlcxx
  GNU compiler:          powerpc64-bgq-linux-g++
  GNU MPI wrapper:       mpig++

FORTRAN:
  IBM XL compilers:      bgxlf77, bgxlf90, bgxlf95, bgxlf2003
  IBM XL MPI wrappers:   mpixlf77, mpixlf90, mpixlf95, mpixlf2003
  GNU compiler:          powerpc64-bgq-linux-gfortran
  GNU MPI wrapper:       mpigfortran

Note: The standard compiler commands (xlc, xlC, gcc, g++, xlf, xlf90, gfortran) generate executables for the login node (frontend) only!

C++: The LLVM/Clang experimental C++ compiler is also available for JUQUEEN, please see here: Clang.

GCC: Newer versions of the GNU Compiler Collection are available on JUQUEEN as an experimental installation, see GCC.

Compiling MPI programs on JUQUEEN

The Blue Gene/Q software provides scripts to compile and link MPI programs. These scripts simplify building MPI programs by setting the include paths for the compiler and linking in the libraries that implement MPICH2, the common Blue Gene/Q message layer interface (PAMI), and the low-level hardware interfaces (MUSPI) that are required by Blue Gene/Q MPI programs.

There are six versions of the libraries and the scripts which are located in the directory /bgsys/drivers/ppcfloor/comm/.

gcc
   A version of the libraries compiled with the GNU Compiler Collection (GCC) that uses fine-grained locking in MPICH. Error checking and assertions are enabled.

gcc.legacy
   A version compiled with the GNU Compiler Collection that uses a coarse-grained lock in MPICH. Error checking and assertions are enabled. These libraries can provide slightly better latency for single-threaded codes, i.e. codes that do not call MPI_Init_thread( ... MPI_THREAD_MULTIPLE ... ). Use one of the gcc libraries for initial application porting work.

xl
   A version with MPICH compiled with the XL compilers and PAMI compiled with the GNU compilers. It uses fine-grained MPICH locking; error checking and assertions are enabled. These libraries can provide a performance improvement over the gcc libraries.

xl.legacy
   Like xl, but with the coarse-grained MPICH lock. Error checking and assertions are enabled. It can provide a performance improvement over the gcc.legacy libraries for single-threaded applications.

xl.ndebug
   Like xl, but with error checking and assertions disabled. This can provide a substantial performance improvement once an application functions satisfactorily. Do not use this library version for initial porting and application development.

xl.legacy.ndebug
   Like xl.legacy, but with error checking and assertions disabled. This can provide a substantial performance improvement once an application functions satisfactorily, and a performance improvement over the xl.ndebug version for single-threaded applications. Do not use this library version for porting and application development.

The information about the libraries and their different versions is taken from the IBM Blue Gene/Q Application Development Redbook.

The legacy versions provide low latency and should be used when running many MPI tasks per node with short messages. The ndebug versions usually show the best performance. However, they should only be used for correctly running codes since they do not provide any error notifications.

OpenMP support

For OpenMP applications use the thread-safe version of the compilers by adding an "_r" to the name, e.g. bgxlC_r or mpixlf77_r, and add -qsmp=omp -qnosave.

Support for Shared libraries and dynamic executables

Blue Gene/Q offers the possibility to create shared libraries and dynamic executables. In general, however, shared libraries are not recommended on JUQUEEN because loading them can delay the startup of a dynamically linked application considerably, especially on large partitions. Therefore, please use shared libraries *only* if there is no other possibility.
See here for further information on how to generate shared libraries and create dynamic executables in C and Fortran.

Compiler Options (IBM XL compiler)

Compiler Options for Optimization

In the following we provide hints and recommendations on how to tune applications on JUQUEEN by choosing optimal compiler flags and customizing the runtime environment. In order to detect performance bottlenecks connected to algorithmic or communication problems, we recommend analyzing your application in detail with performance analysis tools like SCALASCA.

We recommend starting with -O2 -qarch=qp -qtune=qp and increasing the level of optimization stepwise according to the following table. Always check that the numerical results are still correct when using more aggressive optimization flags. For OpenMP codes, please add -qsmp=omp -qthreaded.

-O2 -qarch=qp -qtune=qp
   Basic optimization
-O3 -qstrict -qarch=qp -qtune=qp
   More aggressive optimizations are performed, no impact on accuracy
-O3 -qhot -qarch=qp -qtune=qp
   Aggressive optimization that may impact the accuracy (high-order transformations of loops)
-O4 -qarch=qp -qtune=qp
   Interprocedural optimization at compile time
-O5 -qarch=qp -qtune=qp
   Interprocedural optimization at link time, whole-program analysis

Inlining of functions:

To avoid the performance overhead of frequently called functions, the XL compilers offer the possibility to inline functions. With the option -qipa=inline, which is automatically set at optimization level -O4 or higher, the compiler will choose appropriate functions to inline. Furthermore, the user can explicitly specify functions for inlining in the following way:

-qipa=inline=func1,func2 or -Q+func1:func2

Both specifications are equivalent; the compiler will attempt to inline the functions func1 and func2.

Using MASS (C/C++):

The MASS library offers optimized versions of mathematical functions, for scalar as well as vector data types. It is linked by using

bgxlc progc.c -o progc -lmass        (scalar routines)

bgxlc progc.c -o progc -lmassv       (vector routines)

bgxlc progc.c -o progc -lmass_simd   (SIMD version)

We recommend benchmarking the use of '-lmass' against the use of '-lm'; in certain cases there can be a significant performance gain.

Note: If both ESSL and MASS are used, multiple definitions of the same function may be found. In this case the option

-Wl,--allow-multiple-definition

should be used in the link step, and the link order should be chosen according to which library is to provide the multiply defined functions.

ESSL routines (FORTRAN):

The ESSL (Engineering and Scientific Subroutine Library) provides optimized mathematical and scientific routines. When specifying the options

-qessl -lessl or -qessl -lesslsmp

the compiler attempts to replace some intrinsic FORTRAN 90 procedures by ESSL routines where it is safe to do so.

Note: If both ESSL and MASS are used, multiple definitions of the same function may be found. In this case the option

-Wl,--allow-multiple-definition

should be used in the link step, and the link order should be chosen according to which library is to provide the multiply defined functions.

Diagnostics and reporting:

In order to verify and/or understand what kind of optimization is actually performed by the compiler you can use the following flags:

-qreport

For each source file <name>, the compiler generates a file <name>.lst containing pseudo code and a description of the code optimizations that were performed.

-qxflag=diagnostic

This flag causes the compiler to print information about code optimization during compile time to stdout.

Prefetching Options for L1 Cache

Stream Prefetching

The BG/Q architecture provides a sophisticated prefetching functionality which can be adjusted and used in different manners.

Applies to C/C++ code: To find out whether it is worthwhile to dive into the details, we recommend first experimenting with a couple of generic settings for the general prefetching behaviour.

Access to the corresponding low-level configuration functions of the BG/Q built-in prefetcher is provided by including 'sprefetch.h':

#include <spi/include/l1p/sprefetch.h>

This gives access to a variety of functions, among which is

L1P_SetStreamPolicy( policy );

where policy can be one of the following:

a) L1P_stream_optimistic -> the first cache miss always triggers stream prefetching immediately

b) L1P_stream_confirmed -> the first cache miss has to be confirmed by another one to trigger stream prefetching

c) L1P_stream_confirmed_or_dcbt -> as b), but a stream can also be triggered explicitly in the source code

d) L1P_stream_disable -> no stream prefetching

The default setting is L1P_stream_confirmed_or_dcbt. It is worthwhile to run benchmarks using 'optimistic', 'confirmed' and 'disable'
to get a feeling for the impact of prefetching on the performance of your application.

If this check hints towards a substantial relevance of prefetching for the performance of your code we recommend reading the section on this subject
in the IBM redbook on BG/Q application development and to refer to the online manual on BG/Q compilers.

For further questions you can also contact the supercomputing support team via sc@fz-juelich.de.

Perfect Prefetching

As an alternative to stream prefetching the L1 prefetcher supports list-based (or perfect) prefetching for non-consecutive but repetitive memory access patterns. Such access patterns may arise with indirect addressing or index calculations. Section 4.8.2 of the IBM redbook on BG/Q application development describes all details including the L1P API. Some more details and examples are also included in the talk by T. Mauerer from the Juqueen Porting and Tuning Workshop.

In short: To use the perfect prefetcher, one first records an access pattern that is then reused in the following iterations. Multiple such patterns can be recorded and used. To ensure the recorded list is long enough, the length of the pattern can be configured; a suitable length can be determined by measuring the L1 misses of one iteration.

Since the API is for C/C++, we briefly sketch its usage from Fortran.

To begin with, pull the necessary function calls from the header files by creating wrapper calls. Create a file l1p.c with:

#ifndef _L1PTOOLS_INCLUDED_
#define _L1PTOOLS_INCLUDED_

#include <stdint.h>

// include the L1p API calls
#include <spi/include/l1p/sprefetch.h>
#include <spi/include/l1p/pprefetch.h>

// create Fortran-callable wrappers to catch functions
// that may be inlined in the headers
int L1P_PatternConfigure_F(uint64_t n)
{
 return L1P_PatternConfigure(n);
}

int L1P_PatternStart_F(int record)
{
 return L1P_PatternStart(record);
}

int L1P_PatternStop_F()
{
 return L1P_PatternStop();
}

#endif

and compile with

BGQ_INSTALL_DIR = /bgsys/drivers/V1R2M0/ppc64
SYS_INC = -I$(BGQ_INSTALL_DIR) \
          -I$(BGQ_INSTALL_DIR)/spi/include/kernel/cnk
$(CC) -O3 $(SYS_INC) -c l1p.c

and later link the resulting object code to your Fortran code. When doing so
include the following libraries

LIBS = $(BGQ_INSTALL_DIR)/spi/lib/libSPI_cnk.a \
       $(BGQ_INSTALL_DIR)/spi/lib/libSPI_l1p.a
$(LD) ... $(LIBS)

In your Fortran code (best use a separate module) you now add interfaces to the
wrapper functions:

...
use , intrinsic :: iso_c_binding
...
! interface with L1P API
interface
 function L1P_PatternConfigure(val) &
 bind(c,name="L1P_PatternConfigure_F")
 use, intrinsic :: iso_c_binding
 integer(c_int64_t), value :: val
 integer(c_int) :: L1P_PatternConfigure
 end function L1P_PatternConfigure
 function L1P_PatternStart(val) &
 bind(c,name="L1P_PatternStart_F")
 use, intrinsic :: iso_c_binding
 integer(c_int), value :: val
 integer(c_int) :: L1P_PatternStart
 end function L1P_PatternStart
 function L1P_PatternStop() &
 bind(c,name="L1P_PatternStop_F")
 use, intrinsic :: iso_c_binding
 integer(c_int) :: L1P_PatternStop
 end function L1P_PatternStop
end interface
...

To configure the length of the pattern, from within your code issue a call like

...
integer(c_int) :: ret_val
...
ret_val = L1P_PatternConfigure(pattern_size)

The configured list can now be used to record and replay a pattern within loops:

...
integer(c_int) :: pattern
...
! start 'time loop'
do it = 1,nt
 pattern = L1P_PatternStart(1)
 do ip = 1,np
 part_v(ip)%vx = part_v(ip)%vx + ...
 part_v(ip)%vy = part_v(ip)%vy + ...
 part_v(ip)%vz = part_v(ip)%vz + ...
 ! other code...
 enddo
 pattern = L1P_PatternStop()
enddo
...

The argument to L1P_PatternStart determines whether the miss list is recorded (1) or only replayed (0); passing 1 when a list has already been recorded replays it and updates it at the same time. Its return value should be checked for errors. Likewise, additional function calls should be implemented to check the status and success of the perfect prefetcher.

Environment variables

Important note: Setting these variables only takes effect if they are set within a job script AND exported explicitly when the runjob command is invoked.

BG_SHAREDMEMSIZE=<integer>

Sets the size of the shared memory space (in units of MB). Please see the user information for shared memory usage on JUQUEEN.

Default value: not set (But the default size of the shared memory space is 32MB)

Hints for MPI usage on JUQUEEN

under construction

Environment Variables

Important note: Setting these variables only takes effect if they are set within a job script AND exported explicitly when the runjob command is invoked.

BG_MAPCOMMONHEAP=<integer>

Controls the type of heap assignment for tasks running on the same node.

Setting the environment variable BG_MAPCOMMONHEAP=1 will more evenly and effectively allocate the heap between MPI tasks. One caveat is that one process may overwrite another process' heap due to relaxed memory protection. Another caveat is that dynamically linked codes may fail unexpectedly.

Default value is 0.

PAMID_COLLECTIVES=<integer>
Controls whether optimized collectives are used. The possible values are 0 (optimized collectives are not used; only MPICH point-to-point based collectives are used) or 1 (optimized collectives are used). The default is 1.

PAMID_EAGER=<bytes>
Sets the cutoff for the switch from the eager to the rendezvous protocol. The default size is 4097 bytes. The MPI rendezvous protocol is optimized for maximum bandwidth, but an initial handshake between the communication partners increases the latency. If your application uses many short messages you might want to decrease the cutoff (even down to 0). On the other hand, if your application maps well to the torus network and uses mainly large messages, increasing the limit might lead to better performance. The 'K' and 'M' multipliers can be used in the value, for example "16K" or "1M". This environment variable is identical to PAMID_RZV. See also PAMID_EAGER_LOCAL.

PAMID_EAGER_LOCAL=<bytes>
Sets the cutoff for the switch from the eager to the rendezvous protocol (see PAMID_EAGER for further information) when the destination rank is local (i.e. on the same node). The default size is 4097 bytes. The 'K' and 'M' multipliers can be used in the value, for example "16K" or "1M". This environment variable is identical to PAMID_RZV_LOCAL.

PAMID_NUMREQUESTS=<integer>
Controls how many asynchronous collectives are issued before a barrier is performed. The default is 1. Setting it higher may slightly improve performance but uses more memory for queueing unexpected data.

PAMID_RZV=<bytes>
See PAMID_EAGER.

PAMID_RZV_LOCAL=<bytes>
See PAMID_EAGER_LOCAL.

Experimental C++ compiler LLVM/Clang

The LLVM/Clang experimental C++ compiler available on JUQUEEN is a snapshot of the svn repository, patched for the BG/Q compute nodes (see http://trac.alcf.anl.gov/projects/llvm-bgq).

The Clang compiler fully supports the C++11 standard, uses its own standard library libc++ (sharing the GCC ABI), and supports QPX vector instructions and the IBM vector intrinsics syntax.

The compiler is available via the modules environment; the Clang modules are labeled with the svn revision number. When a clang module is loaded, the compiler wrappers (e.g. mpicxx, mpicc, mpic++, bgclang, bgclang++) are redefined accordingly.

module help clang

Experimental newer versions of the GNU Compiler Collection

Versions 4.6.4, 4.7.3, 4.8.1 (latest stable release) and 4.9 (latest development trunk) of the GNU Compiler Collection are available on JUQUEEN. These installations are considered experimental but have already been successfully used for several applications where the originally installed compilers suffered from internal errors or lacked support for newer language features.

Especially GCC versions 4.8 and later offer some very powerful link time optimization capabilities that - in combination with static linking - can yield some performance improvement compared to using older GCC versions.

The GCC can be used via module load gcc/VERSION, see module avail for a list of available versions. The compiler wrappers are mpigcc, mpig++, mpigfortran.

It is strongly recommended to consult the module documentation that is available via
module help gcc/VERSION
It contains many details, suggested compiler flags, and solutions for possible errors.