Optimal OpenMP Threading

Different systems and compilers provide different ways to obtain the best performance with threaded applications. This guide mostly covers OpenMP threads and core binding (how those threads are "bound" to the cores) on modern Intel processors, though parts of it may apply to other threading methodologies on other hardware. Unless otherwise specified, all examples below use a single node with no MPI (i.e., a single process for all threads). All examples use OpenMP, so $OMP_NUM_THREADS specifies the number of threads per process.

1. Cray

Executing applications on Cray system compute nodes requires aprun. Aprun provides its own core binding functionality with the -d and -cc flags. Use -d n, where n is the number of threads per process. Use -cc method, where method is how the threads are bound to the cores.

However, when compiling applications with the Intel compiler, the compiler builds its own core binding procedures into the application. Thus, the Cray aprun and Intel-compiler core bindings can conflict without the proper settings.

1.1. Cray: Without the Intel Compiler (Cray compiler, GCC, etc.)

For applications built with compilers other than Intel's, the best approach is usually to omit -cc. The following example (Bourne/BASH shell) executes 32 threads on a single 32-core node:

export OMP_NUM_THREADS=32
aprun -n 1 -d $OMP_NUM_THREADS ./myapp

Note that we reuse $OMP_NUM_THREADS after -d for convenience.

The above approach also applies to hybrid MPI/OpenMP applications. Such an application simply changes the -n parameter to the number of MPI processes, here one per node. For example, for 32 threads on each of 64 nodes (2048 threads total):

export OMP_NUM_THREADS=32
aprun -n 64 -d $OMP_NUM_THREADS ./myapp

However, you might find that running multiple MPI processes per node is advantageous. For current Intel processors on current HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each on a separate processor socket) and 2 groups of cores sharing cache per processor. Cray documentation refers to each processor as a "NUMA node". To execute the same hybrid MPI/OpenMP application as above with 2 processes per node on 32-core nodes (16 cores per processor) across 64 nodes, double the tasks (-n), halve the threads per task ($OMP_NUM_THREADS), and specify 2 tasks per node (-N) and 1 task per processor (-S), as follows:

export OMP_NUM_THREADS=16
aprun -n 128 -N 2 -S 1 -d $OMP_NUM_THREADS ./myapp

Alternatively, for 4 tasks per node:

export OMP_NUM_THREADS=8
aprun -n 256 -N 4 -S 2 -d $OMP_NUM_THREADS ./myapp
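
The arithmetic behind the two aprun lines above can be sketched as a small shell function. This is only an illustration: the function name aprun_cmd and its argument order are hypothetical, it prints the commands rather than running them, and it assumes 2 processors (NUMA nodes) per node and a core count that divides evenly by the tasks per node.

```shell
# Hypothetical helper (not part of aprun): print the settings for a
# hybrid MPI/OpenMP run with several tasks per node.
# Usage: aprun_cmd NODES CORES_PER_NODE TASKS_PER_NODE APP
aprun_cmd() {
    local nodes=$1 cores=$2 per_node=$3 app=$4
    local sockets=2                             # assumption: 2 processors per node
    local total=$(( nodes * per_node ))         # -n: total MPI tasks
    local per_socket=$(( per_node / sockets ))  # -S: tasks per processor
    local threads=$(( cores / per_node ))       # -d: threads per task
    echo "export OMP_NUM_THREADS=$threads"
    echo "aprun -n $total -N $per_node -S $per_socket -d $threads $app"
}

# The two cases from the text: 64 nodes of 32 cores, 2 or 4 tasks per node.
aprun_cmd 64 32 2 ./myapp
aprun_cmd 64 32 4 ./myapp
```

The printed lines use the literal thread count after -d instead of $OMP_NUM_THREADS, but are otherwise the same commands as in the examples above.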

1.2. Cray: With the Intel Compiler

Intel-compiled applications require proper use of aprun's -cc flag to entirely or partially disable Cray's core binding in favor of Intel's. You may also need to set Intel's $KMP_AFFINITY environment variable for best performance; "scatter" is usually the best setting. Full documentation for Intel's affinity settings can be found at https://software.intel.com/en-us/node/522691.

The following example (Bourne/BASH shell) executes 32 threads on a single 32 core node, with -cc none to disable Cray's binding:

export KMP_AFFINITY="scatter"
export OMP_NUM_THREADS=32
aprun -n 1 -d $OMP_NUM_THREADS -cc none ./myapp

Note that we reuse $OMP_NUM_THREADS after -d for convenience.

The above approach also applies to hybrid MPI/OpenMP applications. Such an application simply changes the -n parameter to the number of MPI processes, here one per node. For example, for 32 threads on each of 64 nodes (2048 threads total):

export KMP_AFFINITY="scatter"
export OMP_NUM_THREADS=32
aprun -n 64 -d $OMP_NUM_THREADS -cc none ./myapp

However, you might find that running multiple MPI processes per node is advantageous. For current Intel processors on current HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each on a separate processor socket) and 2 groups of cores sharing cache per processor. Cray documentation refers to each processor as a "NUMA node". To execute the same hybrid MPI/OpenMP application as above with 2 processes per node on 32-core nodes (16 cores per processor) across 64 nodes, double the tasks (-n), halve the threads per task ($OMP_NUM_THREADS), and specify 2 tasks per node (-N), 1 task per processor (-S), and numa_node binding (-cc), as follows:

export OMP_NUM_THREADS=16
aprun -n 128 -N 2 -S 1 -cc numa_node -d $OMP_NUM_THREADS ./myapp

Alternatively, for 4 tasks per node, which also requires aprun's depth binding:

export OMP_NUM_THREADS=8
aprun -n 256 -N 4 -S 2 -cc depth -d $OMP_NUM_THREADS ./myapp
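
The choice of -cc method in the three Intel-compiler examples above can be summarized in a short shell sketch. The function name cc_method is hypothetical, and the mapping simply mirrors those examples on a node with 2 processors; other tasks-per-node values are not covered in this document, so the sketch rejects them.

```shell
# Hypothetical helper: pick aprun's -cc method for an Intel-compiled
# application from the number of MPI tasks per node, following the
# examples in this section (assumes 2 processors per node).
cc_method() {
    case "$1" in
        1) echo none ;;       # one task per node: leave binding to Intel
        2) echo numa_node ;;  # one task per processor
        4) echo depth ;;      # two tasks per processor
        *) echo "no example for $1 tasks per node" >&2; return 1 ;;
    esac
}

for n in 1 2 4; do
    echo "$n task(s) per node: -cc $(cc_method "$n")"
done
```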

2. SGI

SGI uses omplace for core binding, typically in conjunction with $OMP_NUM_THREADS. The following is a simple example on a 36-core system (e.g., Thunder):

export OMP_NUM_THREADS=36
omplace ./myapp

Hybrid MPI/OpenMP applications use the MPI launcher (mpiexec_mpt for SGI MPT) in conjunction with omplace. Such an application adds mpiexec_mpt -np followed by the number of MPI processes, here one per node. For example, for 36 threads on each of 64 nodes (2304 threads total):

export OMP_NUM_THREADS=36
mpiexec_mpt -np 64 omplace ./myapp

You might find that running multiple MPI processes per node is advantageous. For current Intel processors on current HPC nodes, 2 or 4 processes per node usually works best, because there are typically 2 processors per node (each on a separate processor socket) and 2 groups of cores sharing cache on each processor. omplace is supposed to handle multiple tasks per node as well, but this has not been tested. In theory (again, untested!), to run 2 MPI tasks per node, with half the threads each (18), on 64 nodes:

export OMP_NUM_THREADS=18
mpiexec_mpt -np 128 -perhost 2 omplace ./myapp
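
The same bookkeeping can be sketched for SGI as a shell function that prints the commands rather than running them. The function name sgi_cmd and its argument order are hypothetical, and the -perhost usage is taken directly from the example above, so it carries the same untested caveat.

```shell
# Hypothetical helper: print the settings for an SGI MPT run with omplace.
# Usage: sgi_cmd NODES CORES_PER_NODE TASKS_PER_NODE APP
sgi_cmd() {
    local nodes=$1 cores=$2 per_node=$3 app=$4
    echo "export OMP_NUM_THREADS=$(( cores / per_node ))"
    if [ "$per_node" -eq 1 ]; then
        # One task per node, as in the 64-node example above.
        echo "mpiexec_mpt -np $nodes omplace $app"
    else
        # Multiple tasks per node (untested, as noted in the text).
        echo "mpiexec_mpt -np $(( nodes * per_node )) -perhost $per_node omplace $app"
    fi
}

sgi_cmd 64 36 1 ./myapp
sgi_cmd 64 36 2 ./myapp
```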

3. Intel Turbo Boost: Don't expect perfect thread scaling!

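Modern Intel processors come with the "Turbo Boost" feature. When fewer than the maximum number of cores are in use on a processor, the processor can boost its clock frequency by some amount. The amount depends upon how many cores are in use and on environmental conditions such as temperature. The maximum boost occurs when only 1 core is in use, so a single-thread baseline runs at a higher clock frequency than a fully loaded processor, and speedup measured against that baseline falls short of the thread count. Also, note that there are often at least 2 processors per node, so you can still see 100% speedup efficiency (2x speedup) when using 2 threads per node, provided each thread runs on a different processor.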

The following are the best speedup efficiency values observed on several systems as of December 2015:

  • Conrad (Cray XC40), at 32 threads/node: 75% (24x speedup)
  • Thunder (SGI ICE X), at 36 threads/node: 99% (35.6x speedup; Turbo Boost must be disabled or not working)

These are subject to change with system updates. They may vary slightly over time due to environmental conditions and can vary even more depending upon how the processor is used (e.g., if vectorization is utilized).
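
To compare your own measurements against these figures, the following sketch computes speedup efficiency. It assumes efficiency is defined as speedup divided by thread count, expressed as a percentage, which is consistent with the values listed above. The function name efficiency is hypothetical.

```shell
# Hypothetical helper: speedup efficiency as a rounded percentage.
# Usage: efficiency SPEEDUP THREADS
# Assumes efficiency = 100 * speedup / threads.
efficiency() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 100 * s / t }'
}

efficiency 24 32     # Conrad: 24x speedup on 32 threads
efficiency 35.6 36   # Thunder: 35.6x speedup on 36 threads
```

Here the speedup itself would be the single-thread run time divided by the multi-thread run time for the same problem.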