# Memory Optimisation

How to make optimal use of the available memory on the compute and front-end nodes

### Optimising the available memory per core

The compute nodes of JUROPA/HPC-FF have a NUMA (Non-Uniform Memory Access) design: the 24 GB of memory per node is divided into two memory nodes of 12 GB each, and each memory node is attached to one CPU socket containing 4 cores. Because latency and bandwidth are significantly worse when a memory access involves the other socket, processes are pinned to specific cores and bound to their local memory node. By default, the first 4 tasks are placed on the first socket and any further tasks on the second socket, which gives 3 GB of memory per core. If that is not enough for your needs, the ratio can be increased in several ways. Some examples are given in the table below:

| ppn | tpp | nobind | tasks | mem | Remarks |
|-----|-----|--------|-------|-----|---------|
| 8 | 8 | --- | 1 | 12 | Only one task is placed on a node, i.e. only the memory connected to the first socket is available. |
| 8 | 8 | export | 1 | 24 | If one task is placed on a node and the variable __PSI_NO_MEMBIND is set, the whole memory is available. |
| 8 | 1 | --- | 8 | 3 | 8 tasks are distributed across the node. |
| 8 | 4 | --- | 2 | 12 | Four cores are reserved for each task; the first task is placed on the first socket, the second task on the second socket. Both tasks can use their local memory exclusively. |
| 8 | 2 | --- | 4 | 6 | Two cores are reserved for each task, so 6 GB of memory are dedicated to each task. |

Abbreviations in the above table:

ppn
Processes per node; this value is specified in your job script or msub command through the ppn option
tpp
Threads per process; this value is specified through the environment variable PSI_TPP in your job script (see text below)
nobind
No memory binding; this value is specified through the environment variable __PSI_NO_MEMBIND in your job script (see text below)
tasks
Lists the effective number of tasks per node resulting from the settings of ppn and tpp
mem
Lists the available memory per task in GB resulting from the settings of ppn and tpp
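The relation between these settings can be sketched in a few lines of bash. This is only an illustration that reproduces the rows of the table above, assuming 24 GB per node and 12 GB per socket; the function names are hypothetical and not part of any JUROPA tooling. The table only covers the unbound case for a single task per node, so for more than one task the sketch simply divides the node memory evenly.

```shell
#!/bin/bash
# Sketch only: reproduces the tasks and mem values of the table above.
# Assumptions: 24 GB per node, 12 GB per socket; function names are hypothetical.

# Effective number of tasks per node: ppn divided by tpp.
tasks_per_node() {
    local ppn=$1 tpp=$2
    echo $(( ppn / tpp ))
}

# Memory per task in GB.
# Third argument: 1 if __PSI_NO_MEMBIND is set, 0 (default) otherwise.
mem_per_task_gb() {
    local ppn=$1 tpp=$2 nobind=${3:-0}
    local tasks=$(( ppn / tpp ))
    if [ "$tasks" -eq 1 ] && [ "$nobind" -eq 0 ]; then
        # A single bound task only sees the memory of the first socket.
        echo 12
    else
        # Otherwise the 24 GB of the node are shared evenly by the tasks.
        echo $(( 24 / tasks ))
    fi
}

echo "ppn=8 tpp=2: $(tasks_per_node 8 2) tasks, $(mem_per_task_gb 8 2) GB each"
```

Calling `mem_per_task_gb 8 8 1` corresponds to the second row of the table (one task, memory binding switched off).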

The setting for ppn should be clear from the examples given in section Quick Introduction. PSI_TPP and __PSI_NO_MEMBIND have to be exported within a batch script. Instead of exporting PSI_TPP it is also possible to set its value implicitly with the following MOAB command within your batch script:

#MSUB -v tpt=1

Then, of course, the value of tpt has to be set to your needs. Please be aware that, on the one hand, the access to the memory of a process from one socket to the other is assured through

export __PSI_NO_MEMBIND=<some_value>

(provided that bash is used), but, on the other hand, the performance of this memory access might be reduced considerably, so it is a good idea to weigh up this option carefully.
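As an illustration, a batch script for the case "one task per node with access to the whole 24 GB" might look roughly as follows. This is a sketch under assumptions: the node count, the application name `./my_app` and the `mpiexec` arguments are placeholders, and the value `1` for `__PSI_NO_MEMBIND` is just one possible choice for `<some_value>`.

```shell
#!/bin/bash
# Sketch of a job script; node count and application name are placeholders.
#MSUB -l nodes=1:ppn=8

# Reserve all 8 cores of the node for a single task.
# Alternatively, "#MSUB -v tpt=8" could be used instead of this export.
export PSI_TPP=8

# Switch off memory binding so that the task can also use the memory
# attached to the second socket (at the cost of slower remote accesses).
export __PSI_NO_MEMBIND=1

# Hypothetical application: one MPI task on the node.
mpiexec -np 1 ./my_app
```

Without the `__PSI_NO_MEMBIND` line, the same script would leave the task with only the 12 GB attached to the first socket.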

### Reducing the memory consumption of MPI connections

In some cases it might be necessary to reduce the amount of memory that has to be dedicated to MPI connections. This will apply in particular in cases where the job runs on many cores (>2000) and all-to-all communication is required. (As long as all-to-all communication is not needed, the use of on-demand connections will lead to a higher benefit; for this case see PSP_ONDEMAND below.)

In ParaStation MPI, roughly 0.55 MB of memory is needed for each MPI connection. If a program runs on 4000 cores and the communication pattern includes all-to-all communication, every process holds 3999 connections, so a total of 4000 × 3999 × 0.55 MB = 8797.8 GB of memory is needed just for the MPI connections. That is about 2.2 GB per core, out of a total of 3 GB available per core.
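The estimate can be repeated for other job sizes with a small helper. The following bash sketch is not part of ParaStation MPI; the function name is hypothetical, and it assumes full all-to-all connectivity and the same convention as the text (1 GB = 1000 MB).

```shell
#!/bin/bash
# Sketch only: estimate the memory used for MPI connections in an
# all-to-all pattern. The function name is hypothetical.
# Usage: mpi_conn_mem CORES [MB_PER_CONNECTION]   (default: 0.55 MB)
mpi_conn_mem() {
    local cores=$1 mb_per_conn=${2:-0.55}
    LC_ALL=C awk -v n="$cores" -v mb="$mb_per_conn" 'BEGIN {
        per_proc_mb = (n - 1) * mb          # each process talks to n-1 others
        total_gb    = n * per_proc_mb / 1000
        printf "%.2f MB per process, %.1f GB in total\n", per_proc_mb, total_gb
    }'
}

mpi_conn_mem 4000
# prints: 2199.45 MB per process, 8797.8 GB in total
```

Comparing the per-process figure with the 3 GB available per core shows how much memory remains for the application itself.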

Per connection, ParaStation MPI uses 16 send buffers and 16 receive buffers by default. Each buffer has a size of 16 KB. While the size of the buffers cannot be changed, the number of buffers can be modified via the environment variables PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE. If you want to reduce the number of these buffers you have to set the corresponding variables in your batch script:

export PSP_OPENIB_SENDQ_SIZE=3   # 3 buffers for send
export PSP_OPENIB_RECVQ_SIZE=3   # 3 buffers for receive

(provided that bash is used). The two sizes can be set independently of each other.

The following table gives some figures for this purpose:

| Q_SIZE | MB/connection |
|--------|---------------|
| 3 | 0.141 |
| 4 | 0.172 |
| 8 | 0.305 |

But please be aware that a reduced Q_SIZE might degrade the MPI's throughput and messaging rate. Furthermore, each Q_SIZE must be at least 3 in order to prevent deadlocks.
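To get a feeling for the savings, the figures from the table can be combined with the number of connections per process. The sketch below uses only the values given above (plus the default of 16 buffers with roughly 0.55 MB per connection) and assumes that the send and receive queues are set to the same size; the function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: per-process memory for MPI connections at a given Q_SIZE,
# assuming all-to-all communication and equal send/receive queue sizes.

# Memory per connection in MB, taken from the table above.
mb_per_connection() {
    case "$1" in
        3)  echo 0.141 ;;
        4)  echo 0.172 ;;
        8)  echo 0.305 ;;
        16) echo 0.55 ;;    # default setting, rough figure from the text
        *)  echo "no figure available for Q_SIZE=$1" >&2; return 1 ;;
    esac
}

# Usage: per_process_mb CORES Q_SIZE   (result rounded to whole MB)
per_process_mb() {
    local cores=$1 q=$2 mb
    mb=$(mb_per_connection "$q") || return 1
    LC_ALL=C awk -v n="$cores" -v mb="$mb" 'BEGIN { printf "%.0f\n", (n - 1) * mb }'
}

for q in 3 4 8 16; do
    echo "Q_SIZE=$q: $(per_process_mb 4000 "$q") MB per process"
done
```

For 4000 cores this gives roughly 564 MB per process with a Q_SIZE of 3, compared with about 2199 MB for the default setting.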

### Using dynamic memory allocation for MPI connections

As described in the previous section, each MPI connection needs a certain amount of memory by default. The more processes are started, the more memory is used for the MPI connections. As a rule of thumb, each MPI connection needs roughly 0.55 MB of memory. You have to take this into account in addition to the memory needed by your application.

On JUROPA/HPC-FF, all MPI connections of your application are established at the beginning of the run. This might lead to a memory shortage for the reasons described above. If the memory allocated by your application plus the memory allocated for the MPI connections exceeds the memory capacity of the compute node, the job will fail.

The environment variable PSP_ONDEMAND influences the memory allocation for MPI connections. On JUROPA/HPC-FF, the default is PSP_ONDEMAND=0, i.e. all memory needed for the connections is allocated at the beginning of the run. If you set PSP_ONDEMAND=1 within your batch script, the memory is allocated dynamically, as connections are actually used. Depending on the communication pattern of your application, this setting might circumvent the problem. However, if your application does use all-to-all communication, it is still likely to fail, since all MPI connections have to be established when the all-to-all communication takes place.

### Pinning OpenMP threads to processors

The Intel compiler's OpenMP runtime library makes it possible to influence the binding of OpenMP threads to physical processing units. The behaviour is controlled by the environment variable KMP_AFFINITY. Further information can be found on the Intel website.

On JUROPA/HPC-FF, the default pinning is controlled by the PSI daemon: the first four threads (0-3) are bound to the four cores of the first socket, and all other threads (4-7) are bound to the cores of the second socket (see the NUMA description in the section on optimising the available memory per core). To change this, you have to switch off the default explicitly by setting the environment variable __PSI_NO_PINPROC. In addition, the memory binding has to be switched off by setting __PSI_NO_MEMBIND. The following example illustrates the general procedure:

export KMP_AFFINITY=verbose,scatter
export __PSI_NO_PINPROC=1
export __PSI_NO_MEMBIND=1

Environment variables belonging to the PSI daemon need not be exported explicitly in the mpiexec command.

With the settings from the example above, the threads are bound as shown in the following figure:

[Figure: binding of the OpenMP threads to the cores of the node; each thread is represented by a circle.]

KMP_AFFINITY=scatter distributes the threads as evenly as possible across the node.

### Analyzing the memory consumption of applications

JSC offers a tool called jumel (JUROPA Memory Logger) to analyze the memory consumption of applications on JUROPA. Please check the jumel website for further information about the tool and its usage.

### Requesting large amounts of memory on the front-end nodes

Pre- and post-processing of input and output data is typically done on one of the JUROPA/HPC-FF front-end nodes.

While all login nodes and most of the GPFS nodes have 24 GB of main memory, two GPFS nodes (juropagpfs04, juropagpfs05) are equipped with 192 GB each. These nodes should preferably be used if interactive pre- or post-processing requires more than 24 GB of memory. Please note that the GPFS nodes are interactive front-end nodes; they cannot be used by batch jobs.

To access the GPFS nodes, log in with ssh:

ssh <userid>@juropagpfs04.fz-juelich.de

or

ssh <userid>@juropagpfs05.fz-juelich.de

The following table summarizes the available resources:

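| Resource | Login Nodes | GPFS Nodes 01 .. 03 | GPFS Nodes 04, 05 |
|----------|-------------|---------------------|-------------------|
| Main Memory | 24 GB | 24 GB | 192 GB |
| CPU Time Limit | 30 min | 6 h | 6 h |
| Lustre File System | yes | yes | yes |
| GPFS File System | no | yes | yes |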
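The choice of front-end node for a given memory requirement can be written down as a simple rule. The following bash sketch only encodes the main-memory figures from the table above; the function name is hypothetical, and the figures are the total memory of a node, not the amount guaranteed to a single user on a shared front-end.

```shell
#!/bin/bash
# Sketch only: suggest a front-end node class for a memory requirement in GB.
# The thresholds are the main-memory sizes listed in the table above.
suggest_frontend() {
    local need_gb=$1
    if [ "$need_gb" -le 24 ]; then
        echo "login nodes or GPFS nodes 01 .. 03 (24 GB)"
    elif [ "$need_gb" -le 192 ]; then
        echo "juropagpfs04 or juropagpfs05 (192 GB)"
    else
        echo "exceeds the main memory of all front-end nodes"
        return 1
    fi
}

suggest_frontend 64
# prints: juropagpfs04 or juropagpfs05 (192 GB)
```

Since the front-end nodes are shared, it may also be worth checking the currently free memory after logging in, for example with `free -g`.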

For more details on resource limits and access to the GPFS nodes see: