Optimal OpenMP Threading
Different systems and compilers provide different ways to obtain the best performance from threaded applications. This document focuses on OpenMP threads and core binding (how those threads are "bound" to cores) on modern Intel processors, although parts of the information could apply to other threading methodologies and other hardware. Unless otherwise specified, all examples below use a single node with no MPI (i.e., a single process for all threads). All examples use OpenMP, so $OMP_NUM_THREADS specifies the number of threads per process.
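If you are unsure of a node's layout before choosing binding options, a quick check is to inspect it with lscpu. The following is only a sketch; it assumes the standard Linux lscpu utility is available on the node you care about, and on a Cray system it should be launched through aprun so that it reports a compute node rather than the login node.
# Report socket, core, and NUMA node counts (the numbers the binding
# examples below are built around).
lscpu | grep -E 'Socket|Core|NUMA'
# On a Cray system, run it on a compute node instead:
# aprun -n 1 lscpu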
1. Cray
Executing applications on Cray compute nodes requires aprun, which provides its own core-binding functionality through the -d and -cc flags: -d n sets the number of threads per process (n), and -cc method controls how the threads are bound to the cores.
However, when compiling applications with the Intel compiler, the compiler builds its own core binding procedures into the application. Thus, the Cray aprun and Intel-compiler core bindings can conflict without the proper settings.
1.1. Cray: Without the Intel Compiler (Cray compiler, GCC, etc.)
For applications built with most compilers, the best approach is to omit -cc entirely. The following example (Bourne/BASH shell) runs 32 threads on a single 32-core node:
export OMP_NUM_THREADS=32
aprun -n 1 -d $OMP_NUM_THREADS ./myapp
Note that we reuse $OMP_NUM_THREADS after -d for convenience.
The above approach also applies to hybrid MPI/OpenMP applications: simply change the -n parameter to match the number of nodes (one process per node), e.g., for 32 threads per node on 64 nodes (2,048 threads total):
export OMP_NUM_THREADS=32
aprun -n 64 -d $OMP_NUM_THREADS ./myapp
However, you might find that running multiple MPI processes per node is advantageous. For current Intel processors on typical HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each in its own socket) and 2 groups of cores sharing cache within each processor. Cray documentation refers to each processor as a "NUMA node".
To execute the same hybrid MPI/OpenMP application as above with 2 processes per node, on a 32-core node (16 cores per processor), across 64 nodes: double the tasks (-n), halve the threads per task ($OMP_NUM_THREADS), and specify 2 tasks per node (-N) and 1 task per processor (-S), as follows:
export OMP_NUM_THREADS=16
aprun -n 128 -N 2 -S 1 -d $OMP_NUM_THREADS ./myapp
Alternatively, for 4 tasks per node:
export OMP_NUM_THREADS=8
aprun -n 256 -N 4 -S 2 -d $OMP_NUM_THREADS ./myapp
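The arithmetic behind these parameters can be scripted. The following sketch (Bourne/BASH shell) derives the same values used in the 2-tasks-per-node example above; the helper variables (NODES, CORES_PER_NODE, etc.) are purely illustrative and are not anything aprun itself requires.
# Per-node layout for this example: 64 nodes, 32 cores and 2 sockets per node,
# and 2 MPI tasks per node.
NODES=64
CORES_PER_NODE=32
SOCKETS_PER_NODE=2
TASKS_PER_NODE=2
export OMP_NUM_THREADS=$(( CORES_PER_NODE / TASKS_PER_NODE ))   # 16 threads per task
NTASKS=$(( NODES * TASKS_PER_NODE ))                            # -n 128
TASKS_PER_SOCKET=$(( TASKS_PER_NODE / SOCKETS_PER_NODE ))       # -S 1
aprun -n $NTASKS -N $TASKS_PER_NODE -S $TASKS_PER_SOCKET -d $OMP_NUM_THREADS ./myapp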
1.2. Cray: With the Intel Compiler
Intel-compiled applications require the proper use of aprun's -cc flag to entirely or partially disable Cray's core binding in favor of Intel's. One may also need to set Intel's $KMP_AFFINITY environment variable for best performance; for most applications, "scatter" is the best setting. Full documentation for Intel's affinity settings can be found at https://software.intel.com/en-us/node/522691.
The following example (Bourne/BASH shell) executes 32 threads on a single 32-core node, with -cc none to disable Cray's binding:
export KMP_AFFINITY="scatter"
export OMP_NUM_THREADS=32
aprun -n 1 -d $OMP_NUM_THREADS -cc none ./myapp
Note that we reuse $OMP_NUM_THREADS after -d for convenience.
The above approach also applies to hybrid MPI/OpenMP applications: simply change the -n parameter to match the number of nodes (one process per node), e.g., for 32 threads per node on 64 nodes (2,048 threads total):
export KMP_AFFINITY="scatter"
export OMP_NUM_THREADS=32
aprun -n 64 -d $OMP_NUM_THREADS -cc none ./myapp
However, you might find that running multiple MPI processes per node is advantageous. For current Intel processors on typical HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each in its own socket) and 2 groups of cores sharing cache within each processor. Cray documentation refers to each processor as a "NUMA node".
To execute the same hybrid MPI/OpenMP application as above with 2 processes per node, on a 32-core node (16 cores per processor), across 64 nodes: double the tasks (-n), halve the threads per task ($OMP_NUM_THREADS), and specify 2 tasks per node (-N), 1 task per processor (-S), and numa_node binding (-cc numa_node), as follows:
export OMP_NUM_THREADS=16
aprun -n 128 -N 2 -S 1 -cc numa_node -d $OMP_NUM_THREADS ./myapp
Alternatively, for 4 tasks per node, which also requires aprun's depth binding (-cc depth):
export OMP_NUM_THREADS=8
aprun -n 256 -N 4 -S 2 -cc depth -d $OMP_NUM_THREADS ./myapp
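To confirm that the Cray and Intel bindings are cooperating rather than conflicting, one option (shown here as a sketch of the single-node case above) is to add Intel's "verbose" modifier to $KMP_AFFINITY, which makes the Intel OpenMP runtime report the core each thread was bound to at startup:
# "verbose" is a KMP_AFFINITY modifier and can be combined with a type such as
# "scatter"; the binding report is printed (typically to stderr) when the
# program starts.
export KMP_AFFINITY="verbose,scatter"
export OMP_NUM_THREADS=32
aprun -n 1 -d $OMP_NUM_THREADS -cc none ./myapp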
2. SGI/HPE
SGI/HPE uses omplace for core binding, typically in conjunction with $OMP_NUM_THREADS. The following is a simple example on a 36-core system:
export OMP_NUM_THREADS=36
omplace ./myapp
Hybrid MPI/OpenMP applications use the MPI launcher (mpiexec_mpt for SGI/HPE MPT) in conjunction with omplace. Such an MPI application would add mpiexec_mpt -np followed by the number of nodes, e.g., for 36 threads per node on 64 nodes (2,304 threads total):
export OMP_NUM_THREADS=36
mpiexec_mpt -np 64 omplace ./myapp
omplace is also supposed to handle multiple tasks per node, but this has not been tested. You might find that running multiple MPI processes per node is advantageous. For current Intel processors on typical HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each in its own socket) and 2 groups of cores sharing cache on each processor. In theory (again, untested!), to run 2 MPI tasks per node, with half the threads each (18), on 64 nodes:
export OMP_NUM_THREADS=18
mpiexec_mpt -np 128 -perhost 2 omplace ./myapp
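Since the multiple-tasks-per-node behavior is untested, it is worth verifying what each process actually receives. The following sketch assumes an OpenMP 4.0 or newer runtime, which honors $OMP_DISPLAY_ENV and prints the effective OpenMP settings (including the thread count) for each process at startup:
# Each MPI rank prints its OpenMP settings (typically to stderr) when it
# starts, which makes it easy to confirm the intended 18 threads per task.
export OMP_DISPLAY_ENV=true
export OMP_NUM_THREADS=18
mpiexec_mpt -np 128 -perhost 2 omplace ./myapp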
3. Intel Turbo Boost: Don't expect perfect thread scaling!
Modern Intel processors include the "Turbo Boost" feature: when fewer than the maximum number of cores on a processor are in use, the processor can raise its clock frequency by some amount. The amount depends on how many cores are in use and on environmental conditions such as temperature, and the maximum boost occurs when only 1 core is used. Also, note that there are often at least 2 processors per node, so with 2 threads per node (one per processor, each at maximum boost) you can still see 100% speedup efficiency (2x speedup).
We have observed scaling efficiencies of around 75% on some of our systems; for example, a 24x speedup with 32 threads on a 32-core node corresponds to 24 / 32 = 75% efficiency.
These figures are subject to change with system updates. They may vary slightly over time due to environmental conditions, and can vary even more depending upon how the processor is used (e.g., whether vectorization is used).
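As a worked example of the efficiency arithmetic (the timing numbers below are illustrative only, chosen to match the 24x / 75% figure above), parallel efficiency is simply speedup divided by thread count:
# speedup = (1-thread time) / (N-thread time); efficiency = speedup / N
T1=960      # example wall-clock seconds with 1 thread
T32=40      # example wall-clock seconds with 32 threads
THREADS=32
echo "$T1 $T32 $THREADS" | awk '{s = $1 / $2; printf "speedup = %.1fx, efficiency = %.0f%%\n", s, 100 * s / $3}'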