Requirements for Vectorizable Loops

ID 761661
Updated 3/4/2019

Vectorization of Loops

For the Intel® C++ and Intel® Fortran compilers for Intel® 64 or Intel® Xeon architecture with Intel® AVX features, “vectorization” of a loop means unrolling the loop so that it can take advantage of packed SIMD instructions available with Intel AVX to perform the same operation on multiple data elements in a single instruction. For example, where a non-vectorized "DAXPY" loop

for (i=0;i<MAX;i++) z[i]=a*x[i]+y[i]; 

might use scalar SIMD instructions such as addsd and mulsd, a vectorized loop would use the packed versions, addpd and mulpd. (In the penultimate character, s stands for “scalar” and p stands for “packed”; in the final character, s stands for single precision and d stands for double precision.) In the most recent Intel compilers, vectorization is one of many optimizations that are enabled by default.
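As a self-contained illustration, the DAXPY loop can be wrapped in a function like the one sketched below. The function name, the parameter list and the restrict qualifiers are assumptions made for this example and are not part of the original loop; restrict simply promises the compiler that the output array does not overlap the inputs.

```c
#include <stddef.h>

/* Hypothetical wrapper around the DAXPY loop: z = a*x + y over n doubles.
   The restrict qualifiers are an assumption added for this sketch; they
   tell the compiler that z does not overlap x or y. */
void daxpy(size_t n, double a, const double *restrict x,
           const double *restrict y, double *restrict z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```

Whether the compiler emits packed instructions for this loop depends on the compiler version, the options and the target; the optimization report is the way to check.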

Vectorization can be thought of as executing more than one consecutive iteration of the original loop at the same time. For processors supporting Streaming SIMD Extensions, this is usually 2 or 4 iterations, but potentially could be more, especially for integer arithmetic or for more advanced instruction sets. This leads to some restrictions on the types of loop that can be vectorized. Additional requirements for effective vectorization come from the properties of the SIMD instructions themselves.

Requirements for Loop Vectorization:

• The loop should consist primarily of straight-line code. There should be no jumps or branches such as switch statements, but masked assignments are allowed, including if-then-else constructs that can be interpreted as masked assignments.

• The loop should be countable, i.e. the number of iterations should be known before the loop starts to execute, though it need not be known at compile time. Consequently, there should be no data-dependent exit conditions, with the exception of very simple search loops.

• There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2 for correct results. This allows consecutive iterations of the original loop to be executed simultaneously in a single iteration of the unrolled, vectorized loop.

OK (vectorizable): a[i-1] is always computed before it is used:

for (i=1; i<MAX; i++) {
   a[i] = b[i] + c[i];
   d[i] = e[i] - a[i-1];
}


Not OK (unvectorizable): a[i-1] might be needed before it has been computed:

for (i=1; i<MAX; i++) {
   d[i] = e[i] - a[i-1];
   a[i] = b[i] + c[i];
}
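The difference between the two loops can be checked with a small emulation. The sketch below is not how the compiler works internally; it only mimics vector execution by completing each statement for a block of `width` consecutive iterations before starting the next statement. The helper names run_ok and run_not_ok and the `width` parameter are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: runs the "OK" loop in blocks of `width` iterations
   (width must be at least 1). Within a block, statement 1 finishes for
   every iteration before statement 2 starts, mimicking one vector
   instruction per statement. width == 1 is the ordinary scalar loop. */
void run_ok(size_t n, size_t width, double *a, const double *b,
            const double *c, double *d, const double *e)
{
    for (size_t i = 1; i < n; i += width) {
        size_t end = (i + width < n) ? i + width : n;
        for (size_t j = i; j < end; j++)
            a[j] = b[j] + c[j];      /* statement 1, all lanes */
        for (size_t j = i; j < end; j++)
            d[j] = e[j] - a[j - 1];  /* statement 2 sees the new a[j-1] */
    }
}

/* Same emulation for the "Not OK" loop. Inside a block, d[j] reads
   a[j-1] before the block has updated it, so for width > 1 the result
   can differ from the scalar loop. */
void run_not_ok(size_t n, size_t width, double *a, const double *b,
                const double *c, double *d, const double *e)
{
    for (size_t i = 1; i < n; i += width) {
        size_t end = (i + width < n) ? i + width : n;
        for (size_t j = i; j < end; j++)
            d[j] = e[j] - a[j - 1];  /* statement 1 may read a stale a[j-1] */
        for (size_t j = i; j < end; j++)
            a[j] = b[j] + c[j];      /* statement 2, all lanes */
    }
}
```

With a set to zero on entry, b[i] = i, c[i] = 1 and e[i] = 100, run_ok produces the same d for width 1 and width 4, whereas run_not_ok gives d[2] = 98 at width 1 but d[2] = 100 at width 4.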

• There should be no special operators and no function or subroutine calls, unless these are inlined, either manually or automatically by the compiler, or they are SIMD (vectorized) functions. Intrinsic math functions such as sin(), log(), fmax(), etc. are allowed, since the compiler runtime library contains SIMD (vectorized) versions of these functions. See the comments section for a more extensive list.

• If a loop is part of a loop nest, it should normally be the inner loop. Outer loops can be parallelized using OpenMP* or autoparallelization (-parallel), but they can only rarely be auto-vectorized, unless the compiler is able either to fully unroll the inner loop or to interchange the inner and outer loops. (High-level loop transformations such as these may require -O3. This option is available for both Intel® and non-Intel microprocessors, but it may result in more optimizations for Intel microprocessors than for non-Intel microprocessors.) The SIMD pragma or directive can be used to ask the compiler to vectorize an outer loop. See the article “Requirements for Vectorizing Loops with #pragma SIMD” for more information about which loops can be vectorized using #pragma simd, !DIR$ SIMD or their OpenMP 4.0 counterparts, #pragma omp simd and !$OMP SIMD.
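The first two requirements in the list above (straight-line code with masked assignments, and a countable trip count) can be contrasted in a short sketch. The function names and the restrict qualifiers are assumptions made for this example. The first loop runs a fixed number of iterations and its if-then-else only selects which value is stored; the second leaves the loop as soon as the data says so, which is the kind of data-dependent exit the second requirement warns about.

```c
#include <stddef.h>

/* Countable loop: n is known before the loop starts, and the if/else
   can be read as a masked assignment to out[i]. */
void clamp_negatives(size_t n, const float *restrict in,
                     float *restrict out)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0.0f)
            out[i] = 0.0f;
        else
            out[i] = in[i];
    }
}

/* Data-dependent exit: the number of iterations actually executed
   depends on the values in `in`, so the loop is not countable.
   Returns the number of elements copied. */
size_t copy_until_negative(size_t n, const float *in, float *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (in[i] < 0.0f)
            break;
        out[i] = in[i];
    }
    return i;
}
```

This sketch makes no claim about what a particular compiler version does with either function; the optimization report shows the actual outcome.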

Advice:

• Both reductions and vector assignments to arrays are allowed.

• Try to avoid mixing vectorizable data types in the same loop (except for integer arithmetic on array subscripts). Vectorization of type conversions may be inefficient.

• Try to access contiguous memory locations. That is, make the innermost loop run over the first array index in Fortran or the last array index in C. Whilst the compiler is often able to vectorize loops with indirect or non-unit stride memory addressing, the cost of gathering data from or scattering back to memory may be too great to make vectorization worthwhile. In Fortran, the "CONTIGUOUS" keyword may be used to assert that assumed shape arrays or pointer arrays are contiguous.

• The “ivdep” pragma or directive may be used to advise the compiler that there are no loop-carried dependencies that would make vectorization unsafe.

• The “vector always” pragma or directive may be used to override the compiler’s heuristics that estimate whether vectorization of a loop is likely to yield a performance benefit. This pragma does not override the compiler's dependency analysis.

• To see whether a loop was or was not vectorized, and why, look at the vectorization component of the optimization report. By default, it is written to a file with extension .optrpt. The report may be enabled by the command line switches /Qopt-report:2 /Qopt-report-phase:vec (Windows*) or -qopt-report=2 -qopt-report-phase=vec (Linux* or macOS*). Additional information may be obtained by increasing the report level from 2 up to 5. 

• For further vectorization advice and help in interpreting the optimization report, try running Intel® Advisor on your application. 

• Explicit Vector Programming can make the vectorization of loops more predictable, through the use of SIMD functions and SIMD pragmas and directives or their OpenMP 4.0 counterparts.
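Two of the points above, the ivdep pragma and reductions with the OpenMP 4.0 simd pragma, might look as follows in C. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the caller of scale_shifted is assumed to guarantee k >= 0, which the compiler cannot see for itself. Compilers that do not recognize a pragma, or that are run without the relevant OpenMP option, ignore it and compile the plain loop.

```c
#include <stddef.h>

/* The compiler cannot tell whether a[i + k] refers to an element
   written by an earlier iteration. The pragma passes on the caller's
   promise (assumed here: k >= 0) that it does not. */
void scale_shifted(int *a, int k, int c, int m)
{
#pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

/* Sum reduction written with the OpenMP simd pragma. A vectorized
   floating-point reduction may add the terms in a different order
   from the scalar loop, so rounding can differ slightly. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Calling scale_shifted with a negative k would break the promise made by the pragma; since ivdep asks the compiler to disregard assumed dependencies, the result in that case may be wrong without any warning.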

More Information

See the Intel Compiler documentation for Fortran and C++ for more about automatic vectorization.

 
