Intel® Fortran Compiler Classic and Intel® Fortran Compiler Developer Guide and Reference

ID 767251
Date 6/24/2024
Public

Vectorization Programming Guidelines

The goal of including the vectorizer component in the Intel® Fortran Compiler is to exploit single instruction, multiple data (SIMD) processing automatically. You can help by supplying the compiler with additional information, for example, by using auto-vectorizer hints or directives.

NOTE:

Vectorization is enabled at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel® microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as -arch or -x (Linux), or /arch or /Qx (Windows).

Guidelines to Vectorize Innermost Loops

Follow these guidelines to vectorize innermost loop bodies.

Use:

  • Straight-line code (a single basic block).

  • Vector data only (arrays and invariant expressions on the right-hand side of assignments). Array references can appear on the left-hand side of assignments.

  • Only assignment statements.

Avoid:

  • Function calls (other than math library calls).

  • Non-vectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions).

  • Mixing vectorizable types in the same loop (leads to lower resource utilization).

  • Data-dependent loop exit conditions (leads to loss of vectorization).
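
The guidelines above can be illustrated with a minimal sketch. The module name saxpy_sketch and the routine scale_add are hypothetical names chosen for this example. The loop body is a single basic block containing one assignment, it reads only arrays and a loop-invariant scalar, and its trip count is known on entry:

```fortran
module saxpy_sketch
  implicit none
contains
  ! Hypothetical helper: y = y + a*x.
  ! The loop body is one straight-line assignment over arrays
  ! plus a loop-invariant scalar, with no calls or early exits.
  subroutine scale_add(n, a, x, y)
    integer, intent(in)    :: n
    real,    intent(in)    :: a
    real,    intent(in)    :: x(n)
    real,    intent(inout) :: y(n)
    integer :: i
    do i = 1, n
      y(i) = y(i) + a * x(i)
    end do
  end subroutine scale_add
end module saxpy_sketch

program demo_scale_add
  use saxpy_sketch
  implicit none
  real :: x(8), y(8)
  x = 1.0
  y = 2.0
  call scale_add(8, 3.0, x, y)
  ! Every element should now be 2 + 3*1.
  if (any(abs(y - 5.0) > 1.0e-6)) error stop "unexpected result"
  print *, y
end program demo_scale_add
```

Whether a given compiler actually vectorizes this loop depends on the target and options, so check the optimization report rather than assuming it.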

To make your code vectorizable, you often need to edit your loops. Make only the changes that enable vectorization, and avoid these common changes:

  • Loop unrolling, which the compiler performs automatically.

  • Decomposing one loop with several statements in the body into several single-statement loops.

Restrictions

There are a number of restrictions that you should consider. Vectorization depends on two major factors: hardware and style of source code.

  • Hardware: The compiler is limited by restrictions imposed by the underlying hardware. For example, Intel® Streaming SIMD Extensions (Intel® SSE) has vector memory operations that are limited to stride-1 accesses, with a preference for 16-byte-aligned memory references. This means that even if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a particular target architecture.

  • Style of source code: The style in which you write source code can inhibit vectorization. For example, avoid using a pointer unless its association with a variable is established within the same procedure. Otherwise, the compiler may not be able to prove that two memory references refer to distinct locations.

Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, pointer arithmetic, and memory operations within the loop bodies.

By understanding these limitations and by knowing how to interpret diagnostic messages, you can modify your program to overcome the known limitations and enable effective vectorization.

Guidelines for Writing Vectorizable Code

Follow these guidelines to write vectorizable code:

  • Use simple DO loops. Avoid complex loop termination conditions: the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, the upper iteration limit can be a function of the outer loop indices.

  • Write straight-line code. Avoid branches such as DO, GOTO, and function calls other than math library calls. Complicated IF conditions in loops may also prevent your code from vectorizing.

  • Avoid dependencies between loop iterations or, at the least, avoid read-after-write dependencies.

  • Try to use array notation instead of pointers. Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.

  • Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate counter for use as an array address.

  • Access memory efficiently:

    • Favor inner loops with unit stride.

    • Minimize indirect addressing.

    • Align your data to 16-byte boundaries (for Intel® SSE instructions).

  • Choose a suitable data layout with care. Most multimedia extension instruction sets are rather sensitive to alignment.

    For example, the data movement instructions of Intel® SSE operate much more efficiently on data that is aligned at a 16-byte boundary in memory. Therefore, the success of a vectorizing compiler also depends on its ability to select an appropriate data layout which, in combination with code restructuring (like loop peeling), results in aligned memory accesses throughout the program.

  • Use aligned data structures: Data structure alignment is the adjustment of any data object in relation to other objects.

    CAUTION:
    Use this hint with care. Incorrect usage of aligned data movements results in an exception when using Intel® SSE.

  • Use structure of arrays (SoA) instead of array of structures (AoS): An array is the most common type of data structure that contains a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can be a hindrance for use of vector processing. To make vectorization of the resulting code more effective, you can also select appropriate data structures.
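
As a small illustration of the guideline about using the loop index directly in array subscripts, the sketch below writes the same copy twice. The program name index_sketch, the array names, and the offset of 4 are assumptions made for this example. The first loop addresses b through a hand-incremented counter. The second expresses the subscript as an explicit function of the loop index, which is typically easier for a compiler to analyze:

```fortran
program index_sketch
  implicit none
  integer, parameter :: n = 16, offset = 4   ! illustrative sizes
  real :: a(n), b(n + offset), c(n)
  integer :: i, k

  do i = 1, n + offset
    b(i) = real(i)
  end do

  ! Less transparent: a separate counter k is incremented by hand
  ! and used as the address into b.
  k = offset + 1
  do i = 1, n
    a(i) = b(k)
    k = k + 1
  end do

  ! Preferred: the subscript is written in terms of the loop index i.
  do i = 1, n
    c(i) = b(i + offset)
  end do

  ! Both loops perform the same unit-stride copy.
  if (any(a /= c)) error stop "the two loops should agree"
  print *, c(1), c(n)
end program index_sketch
```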

Dynamic Alignment Optimizations

Dynamic alignment optimizations can improve the performance of vectorized code, especially for long trip count loops. Disabling such optimizations can decrease performance, but it may improve bitwise reproducibility of results, factoring out data location from possible sources of discrepancy.

To enable or disable dynamic data alignment optimizations, specify the option /Qopt-dynamic-align[-] (Windows) or -q[no-]opt-dynamic-align (Linux).

Use Aligned Data Structures

Data structure alignment is the adjustment of any data object with relation to other objects. The Intel® Fortran Compiler may align individual variables to start at certain addresses to speed up memory access. Misaligned memory accesses can incur large performance losses on certain target processors that do not support them in hardware.

Alignment is a property of a memory address, expressed as the numeric address modulo a power of two. In addition to its address, a single datum also has a size. A datum is called naturally aligned if its address is aligned to its size; otherwise, it is called misaligned. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is a multiple of eight (8).
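
The modulo definition can be checked directly. The following is a minimal sketch, assuming the ISO_C_BINDING intrinsic module is used to obtain an address as an integer; the program name alignment_sketch and the helper is_aligned are hypothetical names for this example. The printed results depend on where the compiler places x, so no particular output is implied:

```fortran
program alignment_sketch
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none
  real(c_double), target :: x(4)
  integer(c_intptr_t) :: addr

  x = 0.0_c_double
  ! Reinterpret the C address of the first element as an integer.
  addr = transfer(c_loc(x), addr)

  print *, "naturally aligned (8 bytes):", is_aligned(addr, 8_c_intptr_t)
  print *, "aligned to 16 bytes:        ", is_aligned(addr, 16_c_intptr_t)

contains

  ! Hypothetical helper: true when address is a multiple of boundary.
  pure logical function is_aligned(address, boundary)
    integer(c_intptr_t), intent(in) :: address, boundary
    is_aligned = modulo(address, boundary) == 0
  end function is_aligned

end program alignment_sketch
```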

A data structure is a way of storing data in a computer so that it can be used efficiently. Often, a carefully chosen data structure allows a more efficient algorithm to be used. A well-designed data structure allows a variety of critical operations to be performed using as few resources (execution time and memory space) as possible. Example:

type mytype
  integer(kind=2):: Data1
  integer(kind=2):: Data2
  integer(kind=2):: Data3
end type mytype

In the example data structure above, if the type integer(kind=2) is stored in two bytes of memory, then each member of the data structure is aligned to a boundary of two bytes. Data1 would be at offset 0, Data2 at offset 2, and Data3 at offset 4. The size of this structure is six bytes. The type of each member of the structure usually has a required alignment, meaning that it is aligned on a pre-determined boundary unless you request otherwise. In cases where the compiler has made sub-optimal alignment decisions, you can use the declaration !DIR$ ATTRIBUTES ALIGN : n :: var to indicate that var must be allocated with alignment n. For example:

real (kind=8) :: A(N), B(N)

do I=1, N-1
   A(I+1) = B(I) * 3
end do

…

If the first element of both arrays is aligned at a 16-byte boundary, then either an unaligned load of elements from B or an unaligned store of elements into A must be used after vectorization.

The compiler will decide whether it is more cost-effective to generate a loop that aligns the vectorized stores to A or the vectorized load from B. If aligning the stores is deemed more important, the compiler will peel the first iterations of the loop to enable this. In order for the compiler to make this choice, you can inform the compiler of the alignment as follows:

!DIR$ ATTRIBUTES ALIGN : 16 :: A
!DIR$ ATTRIBUTES ALIGN : 16 :: B
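
For context, a complete program using these alignment directives might look like the following sketch. The program name and the array size n = 1024 are assumptions made for this example. Compilers that do not recognize !DIR$ lines treat them as ordinary comments, so the program still builds and runs, only without the alignment request:

```fortran
program align_directive_sketch
  implicit none
  integer, parameter :: n = 1024   ! assumed size, for illustration only
  real(kind=8) :: a(n), b(n)
  ! Request 16-byte alignment for the start of each array.
  !DIR$ ATTRIBUTES ALIGN : 16 :: a
  !DIR$ ATTRIBUTES ALIGN : 16 :: b
  integer :: i

  b = 1.0d0
  a(1) = 0.0d0

  ! Same access pattern as the loop above: the store to a is shifted
  ! by one element relative to the load from b.
  do i = 1, n - 1
    a(i+1) = b(i) * 3
  end do

  if (any(a(2:n) /= 3.0d0)) error stop "unexpected result"
  print *, a(1), a(2), a(n)
end program align_directive_sketch
```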

Runtime optimization provides a generally effective way to obtain aligned access patterns at the expense of a slight increase in code size and testing. If incoming access patterns are aligned at a 16-byte boundary, you can avoid this overhead with the hint !DIR$ ASSUME_ALIGNED X:16 in the function to convey this information to the compiler.

For example, suppose you can introduce an optimization in the case where a block of memory with address n2 is aligned on a 16-byte boundary. You could use !DIR$ ASSUME_ALIGNED n2:16.

CAUTION:
Incorrect use of aligned data movements results in an exception for Intel® SSE.
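
The sketch below shows one way the ASSUME_ALIGNED hint might be placed inside a procedure. The module assume_aligned_sketch, the routine triple, and the array v are hypothetical names. The caller backs up the promise with an ATTRIBUTES ALIGN request on the actual array, because the hint is only safe when the data really is aligned:

```fortran
module assume_aligned_sketch
  implicit none
contains
  ! Hypothetical routine: the caller promises that x starts on a
  ! 16-byte boundary, so no runtime alignment test is needed.
  subroutine triple(n, x)
    integer,      intent(in)    :: n
    real(kind=8), intent(inout) :: x(n)
    integer :: i
    !DIR$ ASSUME_ALIGNED x:16
    do i = 1, n
      x(i) = x(i) * 3
    end do
  end subroutine triple
end module assume_aligned_sketch

program demo_assume_aligned
  use assume_aligned_sketch
  implicit none
  integer, parameter :: n = 64   ! assumed size, for illustration only
  real(kind=8) :: v(n)
  ! Make the promise true for the actual argument.
  !DIR$ ATTRIBUTES ALIGN : 16 :: v

  v = 2.0d0
  call triple(n, v)
  if (any(v /= 6.0d0)) error stop "unexpected result"
  print *, v(1), v(n)
end program demo_assume_aligned
```

Passing an array section such as v(2:) to this routine would break the promise, which is the kind of misuse the caution above warns about.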

Use Structure of Arrays Versus Array of Structures

The most common and well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index. This data can be organized as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization works well for encapsulation, it works poorly for vector processing.

You can select appropriate data structures to make vectorization of the resulting code more effective. To illustrate this point, compare the traditional array of structures (AoS) arrangement for storing the r, g, b components of a set of three-dimensional points with the alternative structure of arrays (SoA) arrangement for storing this set.
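
In Fortran terms, the two arrangements could be declared as in the following sketch. The type names rgb_point and rgb_set, the variable names, and the size n = 8 are assumptions made for this example. Each loop scales only the r component, once per layout:

```fortran
program layout_sketch
  implicit none
  integer, parameter :: n = 8   ! assumed size, for illustration only

  ! AoS: each array element holds the r, g, b values of one point.
  type :: rgb_point
    real :: r, g, b
  end type rgb_point

  ! SoA: one structure holds a separate array per component.
  type :: rgb_set
    real :: r(n), g(n), b(n)
  end type rgb_set

  type(rgb_point) :: aos(n)
  type(rgb_set)   :: soa
  integer :: i

  do i = 1, n
    aos(i) = rgb_point(real(i), 0.0, 0.0)
  end do
  soa%r = [(real(i), i = 1, n)]
  soa%g = 0.0
  soa%b = 0.0

  ! AoS: consecutive aos(i)%r values are typically three reals apart
  ! in memory, so this is a non-unit-stride access.
  do i = 1, n
    aos(i)%r = 0.5 * aos(i)%r
  end do

  ! SoA: consecutive soa%r(i) values are adjacent, so this is unit stride.
  do i = 1, n
    soa%r(i) = 0.5 * soa%r(i)
  end do

  if (any(aos%r /= soa%r)) error stop "layouts should give the same values"
  print *, soa%r
end program layout_sketch
```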

With the AoS arrangement, a loop that visits all components of an RGB point before moving to the next point exhibits good locality of reference, because all elements in the fetched cache lines are used. The disadvantage of the AoS arrangement is that each individual memory reference in such a loop exhibits a non-unit stride, which, in general, adversely affects vector performance. Furthermore, a loop that visits only one component of all points exhibits less satisfactory locality of reference because many of the elements in the fetched cache lines remain unused.

With the SoA arrangement, the unit-stride memory references are more amenable to effective vectorization and still exhibit good locality of reference within each of the three data streams. Consequently, an application that uses the SoA arrangement may outperform an application based on the AoS arrangement when compiled with a vectorizing compiler. This performance difference may not be apparent during the early implementation phase.

Before you start vectorization, try out some simple rules:

  • Make your data structures vector-friendly.
  • Make sure that the inner loop indices correspond to the leftmost array index (column-major order).
  • Make sure that the outer loop indices correspond to the rightmost array index.
  • Use structure of arrays over array of structures.
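
The loop-ordering rules can be sketched as follows; the program name, array names, and sizes are assumptions made for this example. Because Fortran stores arrays in column-major order, running the leftmost index in the inner loop gives unit-stride access:

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: m = 64, n = 32   ! assumed sizes
  real :: a(m, n), b(m, n)
  integer :: i, j

  b = 1.0

  do j = 1, n          ! outer loop runs over the rightmost index
    do i = 1, m        ! inner loop runs over the leftmost index (unit stride)
      a(i, j) = 2.0 * b(i, j)
    end do
  end do

  if (any(a /= 2.0)) error stop "unexpected result"
  print *, a(1, 1), a(m, n)
end program loop_order_sketch
```

Swapping the two DO statements would compute the same values but would step through memory with a stride of m elements in the inner loop.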

For instance, when dealing with three-dimensional coordinates, use three separate arrays, one for each component (SoA), instead of one array of three-component structures (AoS). This also helps avoid dependencies between loops that would eventually prevent vectorization.

When you use the AoS arrangement, each iteration produces one result by computing XYZ, but it can at best use only 75% of the SSE unit because the fourth component is not used. Sometimes, the compiler may use only one component (25%).

When you use the SoA arrangement, each iteration produces four results by computing XXXX, YYYY and ZZZZ, using 100% of the SSE unit. A drawback for the SoA arrangement is that your code will likely be three times as long.

If your original data layout is in AoS format, you may want to consider a conversion to SoA before the critical loop. The following guidelines also apply:

  • Use the smallest data types that give the needed precision to maximize potential SIMD width. (If only 16 bits are needed, using an integer(kind=2) rather than an integer(kind=4) can make the difference between eight-way and four-way SIMD parallelism.)
  • Avoid mixing data types to minimize type conversions.
  • Avoid operations not supported in SIMD hardware.
  • Use all the instruction sets available for your processor. Use the appropriate command line option for your processor type, or select the appropriate IDE option (Windows only):
    • Project > Properties > Fortran > Code Generation > Intel Processor-Specific Optimization, if your application runs only on Intel® processors.
    • Project > Properties > Fortran > Code Generation > Enable Enhanced Instruction Set, if your application runs on compatible, non-Intel processors.
  • Vectorizing compilers usually have some built-in efficiency heuristics to decide whether vectorization is likely to improve performance. The Intel® Fortran Compiler disables vectorization of loops with many unaligned or non-unit stride data access patterns. If experimentation reveals that vectorization improves performance, you can override this behavior using the !DIR$ VECTOR ALWAYS hint before the loop. With this hint, the compiler vectorizes the loop regardless of the outcome of the efficiency analysis (provided that vectorization is safe).
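
As a sketch of the last guideline, the directive is placed immediately before the loop it applies to. The program name, the array names, and the stride of 4 are assumptions made for this example, and whether the hint helps here can only be determined by measurement:

```fortran
program vector_always_sketch
  implicit none
  integer, parameter :: n = 64, stride = 4   ! assumed sizes
  real :: a(n), b(n * stride)
  integer :: i

  do i = 1, n * stride
    b(i) = real(i)
  end do

  ! The read from b has a non-unit stride, so the efficiency heuristics
  ! might decline to vectorize this loop. The hint asks the compiler to
  ! vectorize it anyway, as long as doing so is safe.
  !DIR$ VECTOR ALWAYS
  do i = 1, n
    a(i) = b(stride * i) + 1.0
  end do

  if (a(1) /= 5.0 .or. a(n) /= real(n * stride) + 1.0) error stop "unexpected result"
  print *, a(1), a(n)
end program vector_always_sketch
```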