Techniques for Improving Performance

1. Introduction

There are a number of simple things one can do to optimize C, C++, or Fortran code for enhanced performance. Many of these ideas are covered in optimization guides from various hardware vendors, compiler vendors, and HPC system vendors. The optimizations considered here are based on understanding how different high-level languages access data in memory, how certain arithmetic operations work, how function calls work, and how cache-based and vector-based architectures work.

2. Apply compiler optimizations

These optimizations are probably the easiest to perform. The approach is to compile the code with no optimization flags at all and then run the binary on a problem similar to the work you will be doing; this establishes a baseline time. After reading the man page, manual, or other documentation that explains the flags available for that compiler, recompile the code with those options and rerun it on the same data set. Be careful to check that the results are identical, or acceptably close, to the baseline results, for some compilers are known to rearrange the program's instructions for more efficient execution, which may have the side effect of producing incorrect results. There may be compiler flags that force strict ordering of the instructions, or you can select a lower level of optimization.

Things to try include default optimization (frequently -O, -O2, or -O3), vectorization, loop unrolling, data prefetching, instruction lookahead, loop fusion, function inlining, interprocedural analysis and optimization, linking in more highly tuned libraries (see the next section), and so on. The debugger flags (such as -g) in modern compilers typically have minimal effect on performance.

Some compilers also allow one to fine-tune the optimizations along with the mathematical precision and ordering in order to find the best balance between speed and accuracy. Moreover, some advanced optimizations are not provided by the default optimization levels. For example, the following is a common set of HPC optimization settings for the Intel compiler:

-O3                # maximum optimizations
-fp-model precise  # enforce precise math ordering and operations
-ipo               # interprocedural optimization (not provided by -O3)
-xHost             # best vectorization instructions for the current processor

See each system's User Guide for the compilers available on that system.

3. Use tuned libraries

Many HPC systems are provided with libraries tuned by the hardware and compiler vendors. In particular, there are accelerated libraries for common mathematical operations such as linear algebra and Fourier transforms (FFT) as well as communication (e.g., MPI) and I/O. Reformulating code to use these libraries instead of hand-written routines can often result in significant speedup. These are also sometimes easy optimizations to perform.
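
For example, a hand-written matrix-matrix multiply in Fortran,

do j = 1, n
  do l = 1, k
    do i = 1, m
      c(i,j) = c(i,j) + a(i,l)*b(l,j)
    end do
  end do
end do

can be replaced by a single call to the BLAS routine DGEMM (a sketch, assuming double precision arrays a, b, and c with leading dimensions m, k, and m; with beta set to 0.0d0, c need not be initialized beforehand):

call dgemm('N', 'N', m, n, k, 1.0d0, a, m, b, k, 0.0d0, c, m)

A tuned DGEMM blocks the computation for cache and uses vector instructions, and typically runs many times faster than the plain loop.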

See each system's User Guide for information on accelerated libraries that may be available and how to use them.

4. Profile the code

The two methods above are typically easy enough to identify and try without knowing much about which parts of the code take the most time. Most of the remaining methods are more complex and shouldn't be attempted before first determining whether they are indeed required, if for no other reason than to avoid wasting your time. Therefore, at this point, we recommend profiling your code.

Profiling is the process of analyzing the execution flow and characteristics of your program to identify the sections of code where it spends most of its time. Those sections are the candidates most likely to repay optimization effort.
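
For example, with the GNU toolchain, a basic flat profile can be gathered with gprof (a sketch; myprog.f90 is a hypothetical source file):

gfortran -O2 -pg -o myprog myprog.f90   # -pg adds profiling instrumentation
./myprog                                # run a representative case; writes gmon.out
gprof ./myprog gmon.out                 # report time spent per routine

The report shows the fraction of run time spent in each routine, identifying where further optimization will pay off.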

See each system's User Guide for information on profilers that may be available and how to use them.

5. Avoid floating-point divisions

In instances where there are many divisions by the same expression or value, compute the reciprocal first and then multiply, for division requires many more cycles than multiplication.

For example, in Fortran, rewrite:

do i = 2, n-1
  x(i) = (a(i-1) - 2.0*a(i) + a(i+1))/delta_t
end do

as

delta_t_inv = 1.0/delta_t
do i = 2, n-1
  x(i) = (a(i-1) - 2.0*a(i) + a(i+1))*delta_t_inv
end do

Modern compilers may perform this particular optimization automatically.

6. Avoid repeated multiplication by constants

In loops where multiplication by a constant is applied at every iteration, such as summations, move such operations outside the loop.

For example, in Fortran, if computing an integral using Simpson's Rule, replace:

do i = 2, 2*n-1, 2
  sum = sum + h*(f(i-1) + 4.0*f(i) + f(i+1))/3.0
end do

with

do i = 2, 2*n-1, 2
  sum = sum + f(i-1) + 4.0*f(i) + f(i+1)
end do
sum = sum*h/3.0

If the expressions are not too complex, a good compiler might define a temporary variable holding the value of h/3.0 and move the multiplication out of the loop without the programmer taking any action.

7. If possible, use floating-point multiply-add

Many architectures come equipped with hardware and instruction sets for performing floating-point multiply-add (FMA) operations. On such systems, it is to the user's advantage to write code that the compiler can map onto those instructions.

A case in point is Horner's rule for evaluating polynomials. For example, in Fortran, to evaluate a cubic, we change

y = a_0 + a_1*x + a_2*x**2 + a_3*x**3

to

y = a_0 + x*(a_1 + x*(a_2 + x*a_3))

in which each step is a multiplication followed by an addition, mapping directly onto multiply-add instructions.

As another example, certain Fast Fourier Transform (FFT) algorithms perform better than others. Specifically, Stockham and Transposed Stockham algorithms in Fortran tend to perform better on architectures equipped for floating-point multiply-add instructions than Cooley-Tukey. But note that FFTs are also provided by vendor-tuned libraries.

8. If making repeated function calls to generate the same values, create variables or parameters specifically for those values

In the old days, when memory came at a premium and there was not much of it, it was better to compute function values as needed. Nowadays, huge amounts of memory are available cheaply, so if the same values are needed repeatedly, it may be more efficient to compute them once and store them in arrays or parameters for use as needed. Note that this approach incurs the time required to retrieve those values from local data cache or main memory, which can offset the savings. You might need to try it both ways to determine what works best for your code.
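
As a minimal Fortran sketch, suppose the same n sine values are needed at every step of a long time loop (n and theta are hypothetical names):

double precision, allocatable :: sin_tab(:)
allocate(sin_tab(n))
do i = 1, n
  sin_tab(i) = sin(theta(i))   ! computed once, before the time loop
end do
! ... inside the time loop, read sin_tab(i) instead of calling sin(theta(i))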

9. In numerically intensive regions of code, stride 1 memory access is best

When data is loaded into the data cache, entire cache lines are fetched. For example, if the data are defined as an array, say A, of 8-byte numbers, and if memory is arranged in 128-byte lines, then 16 8-byte numbers arrive with every load. Because Fortran stores arrays in column-major order, if the innermost loop index varies the last subscript of A, the loop is not unrolled, and the algorithm uses only one element of A per iteration, then only 1/16 of each loaded line is used, reducing the CPU's performance by up to a factor of 16. A specific example is matrix-vector multiplication in Fortran, where x and b fit in the data cache:

do i = 1, m
  do j = 1, n
    x(i) = x(i) + a(i,j)*b(j)
  end do
end do

As written, this iterated loop strides across rows of A, computing the elements x(i) of the vector x as one would on paper. If A does not fit in cache, then for each load, only 1/16 of the possible performance is achieved. The performance can be improved by exchanging loop indices, due to stride 1 access of A in the innermost loop:

do j = 1, n
  do i = 1, m
    x(i) = x(i) + a(i,j)*b(j)
  end do
end do

Similar considerations rule against using arrays of user-defined data structures where only one or a few of the members are being used, if possible. The reason is that each element of the array is a complete instance of the type: the members that are not being used are still loaded, effectively increasing the stride beyond 1. As an example, in C, suppose we define a type holding three 8-byte integers and three 8-byte real numbers, giving the values of a 3-dimensional vector field at a point in 3-dimensional space. On the system described immediately above, each 128-byte memory line would hold two instances of the type plus a little more; assuming the compiler pads each instance, each memory line holds exactly two instances and no more. Then for each load, the values of only two points make it into the data cache, effectively cutting the CPU's performance by a factor of eight or more, even if the loop indices access the data with stride 1. One approach to improving performance is to convert an array of structures into a structure of arrays.
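
In Fortran terms, the conversion looks like the following sketch (the type and member names are hypothetical):

! Array of structures: unused members pollute each cache line
type point
  integer(8)       :: i, j, k      ! grid indices
  double precision :: vx, vy, vz   ! vector field components
end type point
type(point) :: p_aos(n)

! Structure of arrays: a loop over only vx now runs through contiguous memory
type point_field
  integer(8), allocatable       :: i(:), j(:), k(:)
  double precision, allocatable :: vx(:), vy(:), vz(:)
end type point_field
type(point_field) :: p_soa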

10. Segment loops so that the loop's data fits in cache - "use it 'til it hurts"

The idea is to perform as many operations as possible on the data in cache before writing it back to main memory. For example, suppose we are integrating Maxwell's full-vector field equations on a regular rectangular grid, so at each time-step, we update E_x, E_y, and E_z, then B_x, B_y, and B_z. E_x will depend on B_y and B_z, E_y will depend upon B_x and B_z, and E_z will depend upon B_x and B_y, and similarly for B_x, B_y, and B_z. We also desire to track the magnitude of the electric field, so we need to record the maximal value of |E|**2 across our computational domain.

Instead of updating the entire field E_x, then E_y, then E_z, followed by B_x, B_y, and B_z, and then computing |E|**2, we might fuse the update loops for E and fuse the update loops for B, and block the data so that the "local" blocks for E_x, E_y, E_z, B_x, B_y, and B_z all fit into the data cache simultaneously. Assuming IEEE double precision real values for each of the six fields and 2-MB cache size, we would then load the six fields in blocks of about 43K elements each, or dimensions of 35 cells on each side. The algorithm then proceeds by updating the three fields E_x, E_y, and E_z block by block, computing |E|**2 simultaneously, and then updating the fields B_x, B_y, and B_z block by block. Or we might be careful where neighboring blocks border each other and instead update E_x, E_y, E_z, then compute the maximal value of |E|**2, then update B_x, B_y, and B_z, proceeding with updates of all six fields, block by block.
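
A generic sketch of the blocking itself, in Fortran (nb is a hypothetical block size chosen so that six nb**3 blocks fit in the data cache at once):

do kb = 1, nz, nb
  do jb = 1, ny, nb
    do ib = 1, nx, nb
      do k = kb, min(kb+nb-1, nz)
        do j = jb, min(jb+nb-1, ny)
          do i = ib, min(ib+nb-1, nx)
            ! fused updates of E_x, E_y, E_z at (i,j,k), tracking max |E|**2
          end do
        end do
      end do
    end do
  end do
end do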

11. If possible, avoid indirect addressing

Current compiler technology cannot do much with memory access patterns in which the addresses are not known at compile time. These so-called random memory access patterns may defeat the compiler's ability to insert prefetch instructions and to use stride 1 load instructions effectively.
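
A typical example is a gather through an index array (in Fortran; idx is a hypothetical index list):

do i = 1, n
  y(i) = y(i) + a(idx(i))   ! address of a(idx(i)) is unknown until run time
end do

Each iteration may touch a different cache line, so little of each loaded line is reused. If the index array stays fixed over many passes, it may pay to gather the values into a contiguous temporary array once and loop over that instead.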

12. If possible, minimize the number of call statements or function evaluations within loops

This means it is better, where possible, to avoid calling functions or subroutines, or executing if-statements, in computationally intensive loops. Instead of passing single values as the function arguments and calling the function many times, we pass arrays and call the function once. The function may need to be rewritten to handle arrays, but ideally this is not a large cost. Many of the intrinsic functions, such as sin, cos, and exp, in Fortran 90 and 95 have been extended to accept array arguments and return arrays conforming to the input arguments.

As an example, suppose X and Y are arrays of double precision numbers, and y(i,j) = f(x(i,j)). One way to compute Y is with the following iterated loop:

do j = 1, n
  do i = 1, m
    y(i,j) = f(x(i,j))
  end do
end do

which features m*n function calls. It would probably be better to modify f to accept array arguments and return a conformable array result, and then feed it X, giving one function call, something like:

y = f(x, m, n)

If the evaluation for f is simple, we might write the evaluation for f explicitly within the loop structure. For general functions f, we might try inlining f's instructions using appropriate compiler flags.
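
Alternatively, in Fortran 95 the function can be declared elemental, after which it accepts scalar or array arguments without changes at the call site (a minimal sketch; the body of f is hypothetical, and an explicit interface, e.g., via a module, is required):

elemental function f(x) result(y)
  double precision, intent(in) :: x
  double precision             :: y
  y = exp(-x*x)   ! hypothetical evaluation
end function f

With this declaration, the whole array is computed in the single statement y = f(x).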

13. Minimize the number of branch statements within a loop

In these situations, we have an if-statement or if-block within a loop structure. To avoid executing an if-statement at each iteration of a loop, we might define a mask: an array of Fortran type logical or integer whose values are set from the conditional expression at the corresponding loop indices. We then use this array to tell the compiler where to perform the various assignments formerly made within the if-structure inside the loop. Alternatively, if writing in Fortran 90 or 95, we might use the conditional expression from the if-block as the mask expression of a where-construct.

For example, suppose we need to evaluate exp(1.0/x(i,j)) in Fortran 90, where X is an array of type double precision; where x(i,j) is zero, we assign 1.0 to Y. Instead of the following loop structure,

do j = 1, n
  do i = 1, m
    if (x(i,j) /= 0.0) then
      y(i,j) = exp(1.0/x(i,j))
    else
      y(i,j) = 1.0
    end if
  end do
end do

we might try

where (x /= 0.0)
  y = exp(1.0/x)
elsewhere
  y = 1.0
end where

According to the Fortran 90 standard, the difference between these two constructs is that in the former, at most one branch of the if-statement is executed at each iteration, while in the latter, every assignment statement of the where-construct is executed, each masked by the conditional expression. For an array X, the expression 1.0/x is computed element-wise; it does not mean the inverse of the array X.

14. Parallelize or vectorize scalar code

Parallelization means augmenting and revising scalar code so that it runs on many processors at once; vectorization means modifying the code, and/or compiling it, to run on systems whose registers perform the same operation on many numbers, or elements, at once. That is, instead of manipulating scalars, they manipulate vectors.

One may parallelize code with message passing through subroutines or functions from the Message Passing Interface (MPI) or SHMEM (available on Cray XE and HPE SGI architectures), with shared-memory threading through OpenMP or pthreads, or by using Unified Parallel C (UPC) or Coarray Fortran (CAF) constructs. To use UPC or CAF, contact the HPC Help Desk to inquire about the availability of these tools on the system you wish to use.
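
As a minimal sketch, OpenMP distributes the iterations of a loop over threads with a single directive (compile with the compiler's OpenMP flag, e.g., -fopenmp for GNU or -qopenmp for Intel):

!$omp parallel do
do i = 1, n
  y(i) = a*x(i) + y(i)   ! each thread updates a disjoint chunk of y
end do
!$omp end parallel do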

Current Intel, AMD, and Arm processors support various vector instruction sets, such as AVX-512, AVX2, SSE, and SVE. To use vector capabilities on HPCMP systems, try compiler optimizations that make use of those instruction sets. Review /proc/cpuinfo on the compute nodes of the system you wish to use to determine which instruction sets that architecture supports.

15. Use vectors as much as possible

This includes avoiding scalar function calls as much as possible and using long loops with large amounts of work. Where function calls are unavoidable, vectorize the call if at all possible. That is, modify the function call so that vectors can be passed in as input arguments, and the result of the function call is a conformable vector or array as opposed to a scalar. It is also important to inline as many function calls as possible.

Current architectures in use at the DoD HPCMP have short vector units, and the compilers will make use of vectorizing instructions. As an example, the PGI compilers use the SSE and AVX instruction sets for this, so it is enough to use the "-fastsse" optimization flag. On the other hand, recent Intel systems support the AVX2 and AVX-512 instruction sets, which the Intel compilers target through options such as "-xCORE-AVX2" and "-xCORE-AVX512" ("-xHost", included in the "-fast" option, selects the best set for the build host). Read the man page for your preferred compiler to learn if and how vector instructions are used to compile your code.

16. If using explicit calls in message passing codes, such as MPI or SHMEM, make as few communication calls as possible, and send as much data as possible in each call

For example, at the beginning of execution, we might initialize MPI and then assign each processor its rank in the MPI communicator. Process 0 might then open up the input data files containing values that all of the processors need to begin solving the problem at hand. Instead of using MPI_BCAST to send each value one at a time, a temporary buffer is defined, and all of the values are stored in it, using type casts if necessary. The values in the buffer are then broadcast in a single call, after which all of the processors initialize their data using the received buffer.
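
A minimal Fortran sketch of the aggregated broadcast (dt, t_final, and nsteps are hypothetical input values; assumes "use mpi" and an initialized rank):

double precision :: buf(3)
if (rank == 0) then
  buf(1) = dt              ! pack values read from the input files
  buf(2) = t_final
  buf(3) = dble(nsteps)    ! cast the integer so one typed buffer suffices
end if
call MPI_BCAST(buf, 3, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

One broadcast of three values replaces three separate broadcasts, paying the message latency once instead of three times.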

In a similar way, when processors need to exchange data, it is much better to load up large buffers and call the exchange routines once, or as few times as possible, than to send many small messages. If possible, your algorithm might admit exchanging multiple steps of "ghost" data (that is, data residing on the boundaries of local arrays that is updated by another process on another processor but is required to update array values locally), enabling an exchange of messages, say, every other step or even less frequently. This communication technique may be referred to as passing a "halo".

17. In message passing codes, use nonblocking calls to hide latency

The idea here is to post either nonblocking send or receive operations well in advance of the matching call. For example, suppose we have some sort of algorithm that updates array values according to a multidimensional stencil. We might post nonblocking receive operations first; then update the local data and store the ghost values needed by neighboring processes; then call the matching send operations to send them. Or, if updating multiple arrays, we might update one; post nonblocking send operations; then update the next array; post the matching receive operations; then wait for the communications to complete by posting wait operations. Or perhaps post nonblocking sends for the second array's values, then update the next array, and so on. Care must be exercised, however, to ensure -- by use of wait or waitall operations -- that prior communications are completed before the next iteration's calls are initiated.
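
A minimal sketch of the receive-early pattern in Fortran MPI (the neighbor ranks left and right, buffer length n, and tag are hypothetical):

call MPI_IRECV(ghost_in,  n, MPI_DOUBLE_PRECISION, left,  tag, comm, req(1), ierr)
! ... update the local interior and fill ghost_out with boundary values ...
call MPI_ISEND(ghost_out, n, MPI_DOUBLE_PRECISION, right, tag, comm, req(2), ierr)
! ... more local work that touches neither ghost_in nor ghost_out ...
call MPI_WAITALL(2, req, MPI_STATUSES_IGNORE, ierr)   ! complete before reusing the buffers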