SMT

Simultaneous Multi-Threading

The Intel Nehalem processor offers Simultaneous Multi-Threading (SMT), which Intel formerly also called Hyper-Threading (HT). In SMT mode each processor core can execute two threads or tasks (called processes in the following for simplicity) simultaneously, so that up to 16 processes can run on each JUROPA/HPC-FF compute node. The two "slots" of each core for running processes are called Hardware Threads (HWT) throughout this document.

Each JUROPA/HPC-FF compute node consists of two quad-core CPUs, located on sockets 0 and 1, respectively. The cores are numbered 0 to 7 and the HWT are numbered 0 to 15 in a round-robin fashion. Figure 1 depicts a node schematically and illustrates the naming convention.


Figure 1: Illustration of a JUROPA compute node including Hardware Threads (HWT).
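
To verify this numbering yourself you can inspect the CPU topology on a compute node with standard Linux tools. The following is only a minimal sketch (not specific to JUROPA) that can be run interactively or inside a job:

# List each logical CPU (HWT) together with its socket ("physical id") and
# physical core ("core id"). On a JUROPA node this should report 16 logical
# CPUs distributed over 8 cores on 2 sockets.
grep -E 'processor|physical id|core id' /proc/cpuinfo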

Using SMT on JUROPA

In order to use SMT on JUROPA you need to set ppn=16 in your MOAB job script. The following examples show how to use SMT for pure MPI and MPI/OpenMP hybrid applications.

Example 1: Pure MPI code

#!/bin/bash -x
#MSUB -N SMT_MPI_64x1_job
#MSUB -l nodes=4:ppn=16
### start of jobscript ###

mpiexec -np 64 application.exe

This script will start application.exe on 4 nodes using 16 MPI tasks per node, where two MPI tasks will be executed on each core.

Example 2: Hybrid MPI/OpenMP code

#!/bin/bash -x
#MSUB -N SMT_hybrid_8x8_job
#MSUB -l nodes=4:ppn=16
#MSUB -v tpt=8
### start of jobscript ###

export OMP_NUM_THREADS=8
mpiexec -np 8 --exports=OMP_NUM_THREADS application.exe

This script will start application.exe on 4 nodes using 2 MPI tasks per node and 8 OpenMP threads per task.

How to profit from SMT

Processes running on the same core have to share the resources available to that core. Applications therefore profit most from SMT if the two processes sharing a core are complementary in their resource usage, for example if one process performs computations while the other accesses memory; in this case the two processes do not compete for resources. If, on the other hand, both processes need to access memory at the same time, they have to share the caches and the memory bandwidth and therefore hamper each other. We recommend testing whether your code profits from SMT or not.

In order to check whether your application profits from SMT you should compare the timings of two runs on the same number of physical cores (i.e. the nodes specification should be the same for both jobs): one run without SMT (t_wo) and a second one with SMT (t_w). If t_wo/t_w > 1 your application profits from SMT; if t_wo/t_w < 1 it does not.
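
For example, if the run without SMT takes t_wo = 120 s and the run with SMT takes t_w = 100 s, then t_wo/t_w = 1.2 and SMT pays off. The two comparison jobs could be set up as sketched below; this assumes that ppn=8 requests one process slot per physical core (i.e. no SMT) and that application.exe solves the same overall problem in both runs. Run without SMT (gives t_wo):

#!/bin/bash -x
#MSUB -N SMT_check_noSMT_32x1_job
#MSUB -l nodes=4:ppn=8
### start of jobscript ###

# 8 MPI tasks per node, one per physical core
time mpiexec -np 32 application.exe

Run with SMT on the same 4 nodes (gives t_w):

#!/bin/bash -x
#MSUB -N SMT_check_SMT_64x1_job
#MSUB -l nodes=4:ppn=16
### start of jobscript ###

# 16 MPI tasks per node, two per physical core
time mpiexec -np 64 application.exe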

Experience shows that codes which profit from SMT reach a maximum speed-up of up to t_wo/t_w = 1.5. However, applications may show a smaller benefit or might even slow down when using SMT. In such cases SMT should not be applied.

Default mapping of processes

By default the processes of a pure MPI application are mapped to the cores of each JUROPA node in a round-robin fashion: the first 8 processes are allocated on HWT 0 to 7 (process 0 on HWT 0, ..., process 7 on HWT 7) and the next 8 processes are allocated on HWT 8 to 15.

Example 3: Pure MPI code

#!/bin/bash -x
#MSUB -l nodes=1:ppn=16
#MSUB -N SMT_MPI_16x1_job
#MSUB -v tpt=1
### start of jobscript ###

mpiexec -np 16 application.exe

This will map the MPI tasks in a round-robin fashion to the cores/HWT as shown in Figure 2. Here, t n means the n-th MPI task of each node.


Figure 2: Distribution of tasks (pure MPI).
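
If you want to check on which HWT the MPI tasks of your job actually end up, one possibility is to start the application through a small wrapper script that prints the CPU affinity inherited from the process manager. The following is only a sketch based on the standard Linux taskset command; the file name show_affinity.sh and the relative path to application.exe are arbitrary:

#!/bin/bash
# show_affinity.sh: print the host name and the CPU affinity list of this
# MPI process as set up by the process manager, then start the application.
echo "$(hostname): $(taskset -cp $$)"
exec ./application.exe

Make the wrapper executable (chmod +x show_affinity.sh) and replace application.exe in the mpiexec line of Example 3 by ./show_affinity.sh; each task then reports the HWT it is allowed to run on.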


For hybrid (MPI/OpenMP) applications the tasks and threads are distributed over the cores in such a way that the threads belonging to one task share physical cores only among themselves. Three scenarios for the distribution of tasks and threads per node are handled by default:

  • 2 MPI tasks with 8 OpenMP threads per task per node (2×8)
  • 4 MPI tasks with 4 OpenMP threads per task per node (4×4)
  • 8 MPI tasks with 2 OpenMP threads per task per node (8×2)

The corresponding mappings are shown in Figures 3 to 5 below. Threads belonging to the same MPI task are depicted in the same color. Here, t n.m denotes the m-th OpenMP thread of the n-th MPI task of each node.

Example 4a: Hybrid MPI/OpenMP code (2x8)

#!/bin/bash -x
#MSUB -N SMT_hybrid_2x8_job
#MSUB -l nodes=1:ppn=16
#MSUB -v tpt=8
### start of jobscript ###

export OMP_NUM_THREADS=8
mpiexec -np 2 --exports=OMP_NUM_THREADS application.exe

This way 2 MPI tasks will be allocated on one node and each task will spawn 8 OpenMP threads. The 8 threads of task 0 (t 0.0 to t 0.7) will be allocated on HWT 0 to 3 and 8 to 11, and the threads of the remaining task 1 (t 1.0 to t 1.7) will be distributed over HWT 4 to 7 and 12 to 15, as shown in Figure 3.


Figure 3: Distribution of 2 MPI tasks and 8 OpenMP threads per task.


Example 4b: Hybrid MPI/OpenMP code (4x4)

#!/bin/bash -x
#MSUB -N SMT_hybrid_4x4_job
#MSUB -l nodes=1:ppn=16
#MSUB -v tpt=4
### start of jobscript ###

export OMP_NUM_THREADS=4
mpiexec -np 4 --exports=OMP_NUM_THREADS application.exe

Here 4 MPI tasks will be allocated on one node and each task will spawn 4 OpenMP threads. The distribution of threads is shown in Figure 4.


Figure 4: Distribution of 4 MPI tasks and 4 OpenMP threads per task.


Example 4c: Hybrid MPI/OpenMP code (8x2)

#!/bin/bash -x
#MSUB -N SMT_hybrid_8x2_job
#MSUB -l nodes=1:ppn=16
#MSUB -v tpt=2
### start of jobscript ###

export OMP_NUM_THREADS=2
mpiexec -np 8 --exports=OMP_NUM_THREADS application.exe

8 MPI tasks will be allocated on one node and each task will spawn 2 OpenMP threads. The distribution of threads is shown in Figure 5.


Figure 5: Distribution of 8 MPI tasks and 2 OpenMP threads per task.


Important note: The mapping described here does not ensure that threads belonging to one task are pinned to a particular HWT. It only ensures that the m threads belonging to task n are executed on the HWT assigned to task n. For instance, in example 4c above (Figure 5) it is only guaranteed that threads t 0.0 and t 0.1 are executed on the first core, but thread t 0.0 is not necessarily executed on HWT 0. Correspondingly, thread t 0.2 in example 4b will be executed on one of the HWT 0, 1, 8 or 9, but not necessarily on HWT 8.
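
If you need to see where the individual OpenMP threads of a hybrid job are actually placed and your application is built with the Intel compilers, the Intel OpenMP runtime can print its placement information at startup via the KMP_AFFINITY environment variable. The following sketch modifies the relevant lines of example 4c; it assumes that the runtime honours KMP_AFFINITY and that --exports accepts a comma-separated list of variables:

export OMP_NUM_THREADS=2
# ask the Intel OpenMP runtime to report its thread placement at startup
export KMP_AFFINITY=verbose
mpiexec -np 8 --exports=OMP_NUM_THREADS,KMP_AFFINITY application.exe

Whether an explicit binding via KMP_AFFINITY cooperates with the mapping performed by the process manager should be tested before it is used in production runs.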

Customizing the mapping of processes

You can customize the mapping of processes to the HWT by using the environment variable __PSI_CPUMAP (please note the two underscores "_" at the beginning of the variable name). This variable should contain a comma-separated list of HWT numbers or ranges to which the threads are to be mapped. The threads will be distributed in the order of the tasks.

Example 5: Hybrid MPI/OpenMP code (2x8, customized mapping)

#!/bin/bash -x
#MSUB -N SMT_hybrid_2x8_job
#MSUB -l nodes=1:ppn=16
#MSUB -v tpt=8
### start of jobscript ###

export OMP_NUM_THREADS=8
export __PSI_CPUMAP="0-3,12-15,4-11"
mpiexec -np 2 --exports=OMP_NUM_THREADS application.exe

This will allocate the threads of the first MPI task on the HWT 0-3 and 12-15 and the threads of the second MPI task on the HWT 4-11 (see Figure 6).


Figure 6: Distribution of 2 MPI tasks and 8 OpenMP threads per task with customized mapping.
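
__PSI_CPUMAP can also be used for pure MPI applications (tpt=1), where each task consists of a single thread. The following sketch assumes that the list is applied per node in the order of the tasks, as described above; it would fill both HWT of each core before moving on to the next core instead of using the default round-robin placement of Figure 2:

#!/bin/bash -x
#MSUB -N SMT_MPI_16x1_custom_job
#MSUB -l nodes=1:ppn=16
#MSUB -v tpt=1
### start of jobscript ###

# task 0 on HWT 0, task 1 on HWT 8 (same core), task 2 on HWT 1, ...
export __PSI_CPUMAP="0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15"
mpiexec -np 16 application.exe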

