The Intel Nehalem processor supports Simultaneous Multi-Threading (SMT), which Intel formerly called Hyper-Threading (HT). In SMT mode each processor core can execute two threads or tasks (called processes in the following for simplicity) simultaneously, so that up to 16 processes can run on each JUROPA/HPC-FF compute node. The two "slots" of each core for running processes are called Hardware Threads (HWT) throughout this document.
Each JUROPA/HPC-FF compute node consists of two quad-core CPUs, located on sockets 0 and 1, respectively. The cores are numbered 0 to 7 and the HWT are numbered 0 to 15 in a round-robin fashion, i.e. HWT h resides on core h mod 8. Figure 1 depicts a node schematically and illustrates the naming convention.
Using SMT on JUROPA
In order to use SMT on JUROPA you need to set ppn=16 in your MOAB job script. The following examples show how to use SMT for pure MPI and MPI/OpenMP hybrid applications.
Example 1: Pure MPI code
mpiexec -np 64 application.exe
Submitted on 4 nodes with ppn=16, this script starts application.exe with 16 MPI tasks per node, i.e. two MPI tasks on each core.
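The mpiexec line above would typically be embedded in a MOAB job script along the following lines (a minimal sketch: the walltime is a placeholder and site defaults may differ):

```shell
#!/bin/bash
# Hypothetical MOAB job script for Example 1 (values are placeholders).
#MSUB -l nodes=4:ppn=16        # ppn=16 requests all 16 HWT per node (SMT)
#MSUB -l walltime=00:30:00     # placeholder walltime

# 64 MPI tasks = 4 nodes x 16 tasks per node, two tasks per core
mpiexec -np 64 application.exe
```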
Example 2: Hybrid MPI/OpenMP code
This script will start application.exe on 4 nodes using 2 MPI tasks per node and 8 OpenMP threads per task.
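The script itself is not reproduced above; a hedged sketch of how such a job could look (using OMP_NUM_THREADS to set the thread count is an assumption — a ParaStation MPI installation may require a different way of exporting it to the tasks):

```shell
#!/bin/bash
# Hypothetical MOAB job script for Example 2 (values are placeholders).
#MSUB -l nodes=4:ppn=16        # 16 HWT per node (SMT enabled)
#MSUB -l walltime=00:30:00     # placeholder walltime

export OMP_NUM_THREADS=8       # 8 OpenMP threads per MPI task (assumption:
                               # the variable reaches the remote tasks)
# 8 MPI tasks = 4 nodes x 2 tasks per node; 2 x 8 = 16 HWT used per node
mpiexec -np 8 application.exe
```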
How to profit from SMT
Processes running on the same core must share the resources available to that core. Applications therefore profit most from SMT if the two processes running on a core are complementary in their resource usage: for example, while one process performs computation, the other accesses memory, so the two do not compete for resources. If, on the other hand, both processes need to access memory at the same time, they have to share the caches and the memory bandwidth and hamper each other. We recommend testing whether your code profits from SMT or not.
In order to check whether your application profits from SMT, compare the timings of two runs on the same number of physical cores (i.e. the node specification should be identical for both jobs): one run without SMT, with timing t_wo, and one with SMT, with timing t_w. If t_wo/t_w > 1 your application profits from SMT; for t_wo/t_w < 1 it does not.
In our experience, codes that profit from SMT reach a speed-up of at most t_wo/t_w = 1.5. Other applications show a smaller benefit or may even slow down when using SMT; in such cases SMT should not be used.
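As a sketch, the ratio can be computed from two measured wall-clock times like this (the timings below are made-up illustration values, not measurements):

```shell
#!/bin/sh
# t_wo: runtime without SMT, t_w: runtime with SMT, both measured on the
# same number of physical cores. The values are made-up illustrations.
t_wo=120.0
t_w=90.0

# speed-up ratio t_wo / t_w
speedup=$(awk -v a="$t_wo" -v b="$t_w" 'BEGIN { printf "%.2f", a / b }')
echo "SMT speed-up: $speedup"

if awk -v s="$speedup" 'BEGIN { exit !(s > 1.0) }'; then
    echo "application profits from SMT"
else
    echo "application does not profit from SMT"
fi
```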
Default mapping of processes
By default the processes for pure MPI applications are mapped to the cores of each JUROPA node in a round-robin fashion. This means that the first 8 processes are allocated on the HWT 0 to 7 with process 0 on HWT 0 and process 7 on HWT 7 and the next 8 processes are allocated on HWT 8 to HWT 15.
Example 3: Pure MPI code
mpiexec -np 16 application.exe
This will map the MPI tasks in a round-robin fashion to the cores/HWT as shown in Figure 2. Here, t n denotes the n-th MPI task of each node.

Figure 2: Distribution of tasks (pure MPI).
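The numbering convention behind this default mapping can be sketched as follows: HWT h resides on core h mod 8 (see Figure 1), so MPI task n lands on core n mod 8.

```shell
#!/bin/sh
# Under the round-robin numbering, HWT h resides on core h mod 8, so with
# the default mapping MPI task n (0..15) runs on core n mod 8.
for task in 0 7 8 15; do
    core=$((task % 8))
    echo "task $task -> HWT $task -> core $core"
done
```

Tasks 0 and 8 thus share core 0, tasks 7 and 15 share core 7, and so on.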
For hybrid applications (MPI/OpenMP) the tasks and threads will be distributed over the cores in such a way that all threads belonging to one task will share the physical cores among themselves. Three scenarios for the distribution of tasks and threads for each node are handled by default:
- 2 MPI tasks with 8 OpenMP threads per task per node (2×8)
- 4 MPI tasks with 4 OpenMP threads per task per node (4×4)
- 8 MPI tasks with 2 OpenMP threads per task per node (8×2)
The corresponding mapping is shown in Figure 3 to Figure 5 below. Threads belonging to the same MPI task are depicted with the same color. Here, t n.m names the m-th OpenMP thread of the n-th MPI task of each node.
Example 4a: Hybrid MPI/OpenMP code (2x8)
This way 2 MPI tasks will be allocated on one node and each task will spawn 8 OpenMP threads. First the 8 threads of task 0 (t 0.0 to t 0.7) will be allocated on HWT 0 to 3 and 8 to 11, and then the threads of the remaining task 1 (t 1.0 to t 1.7) will be distributed on HWT 4 to 7 and 12 to 15, as shown in Figure 3.

Figure 3: Distribution of 2 MPI tasks and 8 OpenMP threads per task.
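A hedged job-script sketch for this 2x8 case (node count and walltime are placeholders; setting the thread count via OMP_NUM_THREADS is an assumption):

```shell
#!/bin/bash
# Hypothetical MOAB job script for Example 4a (values are placeholders).
#MSUB -l nodes=4:ppn=16        # 16 HWT per node (SMT enabled)
#MSUB -l walltime=00:30:00     # placeholder walltime

export OMP_NUM_THREADS=8       # 8 OpenMP threads per MPI task (assumption)
mpiexec -np 8 application.exe  # 4 nodes x 2 tasks per node
```

The analogous sketch for the 4x4 case (Example 4b) uses OMP_NUM_THREADS=4 with -np 16, and the 8x2 case (Example 4c) uses OMP_NUM_THREADS=2 with -np 32.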
Example 4b: Hybrid MPI/OpenMP code (4x4)
Here 4 MPI tasks will be allocated on one node and each task will spawn 4 OpenMP threads. The distribution of threads is shown in Figure 4.

Figure 4: Distribution of 4 MPI tasks and 4 OpenMP threads per task.
Example 4c: Hybrid MPI/OpenMP code (8x2)
8 MPI tasks will be allocated on one node and each task will spawn 2 OpenMP threads. The distribution of threads is shown in Figure 5.

Figure 5: Distribution of 8 MPI tasks and 2 OpenMP threads per task.
Important note: The mapping described here does not ensure that the threads belonging to one task are pinned to particular HWT. It only ensures that the m threads belonging to task n are executed on the HWT assigned to task n. For instance, in Example 4c above (Figure 5) it is only guaranteed that threads t 0.0 and t 0.1 are executed on the first core; thread t 0.0 is not necessarily executed on HWT 0. Correspondingly, thread t 0.2 in Example 4b will be executed on one of the HWT 0, 1, 8 or 9, but not necessarily on HWT 8.
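One way to observe this in practice (a sketch, assuming a Linux compute node) is to print the set of logical CPUs a process is allowed to run on; under the mapping above this shows the whole HWT set assigned to the task, not a single HWT per thread:

```shell
#!/bin/sh
# Print the list of HWT (logical CPUs) the current process may run on.
# On a compute node this reflects the HWT set assigned to the task.
mask=$(grep Cpus_allowed_list /proc/self/status)
echo "$mask"
```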
Customizing the mapping of processes
You can customize the mapping of processes to the HWT with the environment variable __PSI_CPUMAP (please note the two underscores "_" at the beginning of the variable name). This variable should contain a comma-separated list of the HWT to which threads are to be mapped. The threads are distributed over this list in the order of the tasks.
Example 5: Hybrid MPI/OpenMP code (2x8, customized mapping)
This will allocate the threads of the first MPI task on the HWT 0-3 and 12-15 and the threads of the second MPI task on the HWT 4-11 (see Figure 6).

Figure 6: Distribution of 2 MPI tasks and 8 OpenMP threads per task with customized mapping.
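A hedged sketch of the corresponding job script (node count, walltime and the OMP_NUM_THREADS mechanism are assumptions): the first 8 list entries are consumed by the threads of task 0, the next 8 by task 1.

```shell
#!/bin/bash
# Hypothetical MOAB job script for Example 5 (values are placeholders).
#MSUB -l nodes=4:ppn=16        # 16 HWT per node (SMT enabled)
#MSUB -l walltime=00:30:00     # placeholder walltime

# Task 0 -> HWT 0-3 and 12-15, task 1 -> HWT 4-11
export __PSI_CPUMAP="0,1,2,3,12,13,14,15,4,5,6,7,8,9,10,11"
export OMP_NUM_THREADS=8       # 8 OpenMP threads per MPI task (assumption)
mpiexec -np 8 application.exe  # 4 nodes x 2 tasks per node
```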