Developer Guide and Reference

ID 767251
Date 10/31/2024
Public

Use Automatic Vectorization

The information below will guide you in setting up the auto-vectorizer.

Vectorization Speedup

Where does the vectorization speedup come from? Consider the following sample code, where A, B, and C are integer arrays:


do I=1,MAX
    C(I)=A(I)+B(I)
end do

If vectorization is not enabled (that is, you compile using the O1, -no-vec (Linux), or /Qvec- (Windows) option), the compiler processes one array element per instruction and leaves the rest of each SIMD register unused, even though each register can hold three additional integers. If vectorization is enabled (compiled using O2 or higher options), the compiler may use the additional space in the registers to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (O2) or higher.
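
To make the idea concrete, the following sketch spells out by hand roughly what the vectorizer does to the loop above, assuming SIMD registers that hold four 32-bit integers. The program name, the array size n, and the chunk width vlen are illustrative assumptions. The compiler generates SIMD instructions directly and does not rewrite the source this way.

```fortran
program vector_add_sketch
   implicit none
   integer, parameter :: n = 10      ! hypothetical size, deliberately not a multiple of vlen
   integer, parameter :: vlen = 4    ! assumed number of 32-bit integers per SIMD register
   integer :: a(n), b(n), c_scalar(n), c_chunked(n)
   integer :: i, last

   a = [(i, i = 1, n)]
   b = [(10*i, i = 1, n)]

   ! Scalar form: one addition per iteration.
   do i = 1, n
      c_scalar(i) = a(i) + b(i)
   end do

   ! Chunked form: each array-section statement stands in for one SIMD
   ! addition that processes vlen elements at once.
   last = n - mod(n, vlen)
   do i = 1, last, vlen
      c_chunked(i:i+vlen-1) = a(i:i+vlen-1) + b(i:i+vlen-1)
   end do

   ! Remainder loop: leftover elements are handled one at a time.
   do i = last + 1, n
      c_chunked(i) = a(i) + b(i)
   end do

   if (any(c_scalar /= c_chunked)) error stop "chunked result differs from scalar result"
   print *, "both forms agree"
end program vector_add_sketch
```

With n = 10 and vlen = 4, the chunked form needs two four-wide additions plus two scalar additions, compared with ten scalar additions in the first loop.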

NOTE:

Vectorization is enabled at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gain on Intel® microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as -arch or -x (Linux), or /arch or /Qx (Windows).

Linux

To evaluate the performance enhancement, run the Vectorize VecMatMult sample:

  1. Download the driver.f90 and matvec.f90 sample files from the Vectorize VecMatMult src folder on GitHub.

  2. This application multiplies a vector by a matrix using the following loop:

    
    do i=1,size1
       c(i) = c(i) + a(i,j) * b(j)
    end do

  3. Compile and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.

    ifx -no-vec driver.f90 matvec.f90 -o NoVectMult
    ./NoVectMult

  4. Build and run the application, this time with auto-vectorization.

    ifx driver.f90 matvec.f90 -o VectMult 
    ./VectMult

Windows

To evaluate the performance enhancement, run the Vectorize VecMatMult sample:

  1. Select Start > Intel oneAPI <version> > Intel oneAPI Command Prompt for Intel 64 for Visual Studio <version>.

  2. Download the driver.f90 and matvec.f90 sample files from the Vectorize VecMatMult src folder on GitHub.

  3. This application multiplies a vector by a matrix using the following loop:

    
    do i=1,size1
       c(i) = c(i) + a(i,j) * b(j)
    end do

  4. Compile and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.

    ifx /Qvec- driver.f90 matvec.f90 /exe:NoVectMult
    NoVectMult

  5. Build and run the application, this time with auto-vectorization.

    ifx driver.f90 matvec.f90 /exe:VectMult
    VectMult

When you compare the timing of the two runs, you may see that the vectorized version runs faster. The non-vectorized version runs only slightly faster than a version compiled with the O1 option.
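
To run the same comparison on a loop of your own, a minimal timing sketch using the standard system_clock intrinsic might look like the following. The program name, the matrix dimensions, and the repetition count are assumptions chosen for illustration; none of them are taken from driver.f90 or matvec.f90.

```fortran
program time_matvec_sketch
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: size1 = 512, size2 = 512   ! hypothetical matrix dimensions
   integer, parameter :: nreps = 200                ! hypothetical repetition count
   real(real64), allocatable :: a(:,:), b(:), c(:)
   integer(int64) :: t_start, t_end, rate
   integer :: i, j, rep

   allocate(a(size1, size2), b(size2), c(size1))
   a = 1.0_real64
   b = 2.0_real64
   c = 0.0_real64

   call system_clock(t_start, rate)
   do rep = 1, nreps
      do j = 1, size2
         ! The inner loop runs over the first index, so a(i,j) is read with unit stride.
         do i = 1, size1
            c(i) = c(i) + a(i,j) * b(j)
         end do
      end do
   end do
   call system_clock(t_end)

   ! Each element accumulates size2*nreps products of 1.0*2.0, which is exact here.
   if (any(c /= 2.0_real64 * size2 * nreps)) error stop "unexpected result"

   print *, "elapsed seconds:", real(t_end - t_start, real64) / real(rate, real64)
   ! Printing a value derived from c keeps the timed loop from being optimized away.
   print *, "checksum:", sum(c)
end program time_matvec_sketch
```

Build it once with -no-vec (Linux) or /Qvec- (Windows) and once without, as in the steps above, and compare the reported times.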

Obstacles to Vectorization

The following issues do not always prevent vectorization, but frequently cause the compiler to decide that vectorization would not be worthwhile.

  • Non-contiguous memory access: Four consecutive 32-bit integers or single-precision floating-point values, or two consecutive doubles, may be loaded directly from memory in a single SSE instruction. But if the four integers are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, shown in the examples below. The compiler rarely vectorizes these loops unless the amount of computational work is large compared to the overhead from non-contiguous memory access.

    
    ! arrays accessed with non-unit stride 2 
    do I=1,SIZE,2
       B(I)=B(I)+(A(I)*X(I))
    end do
    
    ! inner loop accesses matrix A with non-unit stride (the extent of A's first dimension)
    do J=1,SIZE1
       do I=1,SIZE2
          B(I)=B(I)+(A(J,I)*X(J))
       end do
    end do
    
    ! indirect addressing of X using index array INDX
    do I=1,SIZE,2
       B(I)=B(I)+(A(I)*X(INDX(I)))
    end do

    The typical message from the vectorization report is: vectorization possible but seems inefficient, although indirect addressing may also result in the following report: existence of vector dependence.
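
    One common remedy for the second loop nest above is to interchange the loops so that the inner loop runs over the first array index, which is contiguous in Fortran's column-major layout. The following is a sketch only: the program name and the array sizes are assumptions, and the compiler may already perform this interchange itself at some optimization levels.

    ```fortran
    program interchange_sketch
       implicit none
       integer, parameter :: size1 = 6, size2 = 5   ! hypothetical sizes
       real :: a(size1, size2), x(size1), b_strided(size2), b_unit(size2)
       integer :: i, j

       ! Small integer values keep every sum exactly representable.
       do i = 1, size2
          do j = 1, size1
             a(j, i) = real(j + i)
          end do
       end do
       x = [(real(j), j = 1, size1)]
       b_strided = 0.0
       b_unit = 0.0

       ! Original order: consecutive inner iterations touch A(J,1), A(J,2), ...,
       ! which are size1 elements apart in memory.
       do j = 1, size1
          do i = 1, size2
             b_strided(i) = b_strided(i) + a(j, i) * x(j)
          end do
       end do

       ! Interchanged order: the inner loop walks down one column of A with
       ! unit stride and accumulates into the loop-invariant element b_unit(i).
       do i = 1, size2
          do j = 1, size1
             b_unit(i) = b_unit(i) + a(j, i) * x(j)
          end do
       end do

       if (any(b_strided /= b_unit)) error stop "interchanged loop changed the result"
       print *, b_unit
    end program interchange_sketch
    ```

    After the interchange, both A and X are read with unit stride in the inner loop, and the inner loop becomes a sum into a single element of B.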

  • Data dependencies: Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.

    • The simplest case is when data elements that are written (stored to) do not appear in any other iteration of the individual loop. In this case, all the iterations of the original loop are independent of each other, and can be executed in any order, without changing the result. The loop may be safely executed using any parallel method, including vectorization.

    • When a variable is written in one iteration and read in a subsequent iteration, there is a read-after-write dependency, also known as a flow dependency, for example:

      
      do J=2,5
         A(J)=A(J-1)+1
      end do

      The value of A(1) is propagated to all A(J). This cannot safely be vectorized: if the first two iterations are executed simultaneously by a SIMD instruction, the value of A(2) is used by the second iteration before it has been calculated by the first iteration.
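
      The effect can be reproduced without a vectorizing compiler. In the sketch below, a Fortran array assignment stands in for a SIMD instruction, because its right-hand side is evaluated completely before any element is stored. The program name and the initial values are assumptions chosen for illustration.

      ```fortran
      program flow_dependency_sketch
         implicit none
         integer :: a_serial(5), a_simd(5)
         integer :: j

         a_serial = [1, 0, 0, 0, 0]   ! hypothetical initial values
         a_simd = a_serial

         ! Serial loop: each iteration sees the value stored by the previous one.
         do j = 2, 5
            a_serial(j) = a_serial(j-1) + 1
         end do

         ! SIMD-like execution: all four loads of A(J-1) happen before any store.
         a_simd(2:5) = a_simd(1:4) + 1

         print *, "serial:   ", a_serial
         print *, "SIMD-like:", a_simd

         if (any(a_serial /= [1, 2, 3, 4, 5])) error stop "unexpected serial result"
         if (any(a_simd /= [1, 2, 1, 1, 1])) error stop "unexpected SIMD-like result"
      end program flow_dependency_sketch
      ```

      The serial loop produces 1 2 3 4 5, while the SIMD-like form produces 1 2 1 1 1, because iterations 3 to 5 read elements that have not yet been updated.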

    • When a variable is read in one iteration and written in a subsequent iteration, this is a write-after-read dependency, also known as an anti-dependency, for example:

      
      do J=2,5
         A(J-1)=A(J)+1
      end do
      ! this is equivalent to: 
      A(1)=A(2)+1
      A(2)=A(3)+1
      A(3)=A(4)+1
      A(4)=A(5)+1

      This write-after-read dependency is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. With vectorization, however, no iteration with a higher value of J can complete before an iteration with a lower value of J, and so vectorization is safe (it gives the same result as non-vectorized code).

      The following example may not be safe, since vectorization might cause some elements of A to be overwritten by the first SIMD instruction A(J-1)=A(J)+1 before being used for B in the second SIMD instruction B(J)=B(J)+A(J).

      
      do J=2,5
        A(J-1)=A(J)+1
        B(J)=B(J)+A(J)
      end do 
      ! this is equivalent to: 
      A(1)=A(2)+1
      B(2)=B(2)+A(2)
      A(2)=A(3)+1  
      B(3)=B(3)+A(3) 
      A(3)=A(4)+1
      B(4)=B(4)+A(4)
      A(4)=A(5)+1
      B(5)=B(5)+A(5)
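
      A sketch of both write-after-read cases follows, with array assignments standing in for SIMD instructions. The program name and the data are assumptions. The sketch also shows one ordering that preserves the serial result for the two-statement loop, namely updating B before A; this is offered as an illustration, not as a description of what the compiler does.

      ```fortran
      program anti_dependency_sketch
         implicit none
         integer, parameter :: a0(5) = [10, 20, 30, 40, 50]   ! hypothetical data
         integer, parameter :: b0(5) = [1, 2, 3, 4, 5]        ! hypothetical data
         integer :: a_serial(5), b_serial(5)
         integer :: a_naive(5), b_naive(5)
         integer :: a_safe(5), b_safe(5)
         integer :: j

         ! Serial reference: statement order within each iteration is preserved.
         a_serial = a0
         b_serial = b0
         do j = 2, 5
            a_serial(j-1) = a_serial(j) + 1
            b_serial(j) = b_serial(j) + a_serial(j)
         end do

         ! Naive SIMD-like order: A(2:4) is overwritten before B reads it.
         a_naive = a0
         b_naive = b0
         a_naive(1:4) = a_naive(2:5) + 1
         b_naive(2:5) = b_naive(2:5) + a_naive(2:5)

         ! Reordered: B reads the original A first, then A is updated.
         a_safe = a0
         b_safe = b0
         b_safe(2:5) = b_safe(2:5) + a_safe(2:5)
         a_safe(1:4) = a_safe(2:5) + 1

         ! The update of A on its own matches the serial loop.
         if (any(a_naive /= a_serial)) error stop "A differs"
         ! The naive order corrupts B; the reordered form matches the serial loop.
         if (all(b_naive == b_serial)) error stop "expected naive B to differ"
         if (any(b_safe /= b_serial)) error stop "reordered B differs"

         print *, "serial B:", b_serial
         print *, "naive B: ", b_naive
      end program anti_dependency_sketch
      ```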

    • Read-after-read situations are not really dependencies, and do not prevent vectorization or parallel execution. If a variable is never written, it does not matter how often it is read.

    • Write-after-write, or output dependencies, where the same variable is written to in more than one iteration, are generally unsafe for parallel execution, including vectorization.

    • One important exception that contains all of the above types of dependency is:

      MYSUM=0
      do J=1,MAX
         MYSUM = MYSUM + A(J)*B(J)
      end do

      Although MYSUM is both read and written in every iteration, the compiler recognizes such reduction idioms, and is able to vectorize them safely. The same idiom can appear with a loop-invariant array element in place of the scalar.
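
      Conceptually, a vectorized reduction keeps several partial sums, one per SIMD lane, and combines them after the loop. The sketch below mimics this by hand with four lanes. The program name, the lane count vlen, and the data are assumptions. Integer data is used so that the reordered sum equals the serial one exactly; with floating-point data the two may differ slightly because of rounding.

      ```fortran
      program reduction_sketch
         implicit none
         integer, parameter :: n = 10      ! hypothetical length, not a multiple of vlen
         integer, parameter :: vlen = 4    ! assumed number of SIMD lanes
         integer :: a(n), b(n)
         integer :: partial(vlen)
         integer :: mysum_serial, mysum_lanes, j, last

         a = [(j, j = 1, n)]
         b = [(2*j, j = 1, n)]

         ! Serial reduction, as written in the source loop.
         mysum_serial = 0
         do j = 1, n
            mysum_serial = mysum_serial + a(j) * b(j)
         end do

         ! Lane-wise reduction: each element of partial plays the role of one lane.
         partial = 0
         last = n - mod(n, vlen)
         do j = 1, last, vlen
            partial = partial + a(j:j+vlen-1) * b(j:j+vlen-1)
         end do

         ! Combine the lanes, then add the leftover elements one at a time.
         mysum_lanes = sum(partial)
         do j = last + 1, n
            mysum_lanes = mysum_lanes + a(j) * b(j)
         end do

         if (mysum_lanes /= mysum_serial) error stop "sums differ"
         print *, "MYSUM =", mysum_serial
      end program reduction_sketch
      ```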

      These types of dependencies between loop iterations are sometimes known as loop-carried dependencies.

      The above examples are of proven dependencies. The compiler cannot safely vectorize a loop if there is even a potential dependency. For example:

      
      real, pointer :: A(:),B(:),C(:)
      ...
      do I=1,SIZE
         C(I)=A(I)*B(I)
      end do
      ...

      In the above example, the compiler needs to determine whether, for some iteration I, C(I) might refer to the same memory location as A(I) or B(I) for a different iteration. Such memory locations are sometimes said to be aliased. For example, if A(I) pointed to the same memory location as C(I-1), there would be a read-after-write dependency. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide the compiler with hints. You can also avoid this problem by making the arrays ALLOCATABLE instead of POINTER, as the compiler knows these cannot be aliased.
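
      The following sketch constructs exactly the overlap described above, with A(I) sharing storage with C(I-1), and shows that the serial loop and a SIMD-like evaluation then give different answers. The program name, the target arrays, and the values are assumptions; B is an ordinary array here for brevity.

      ```fortran
      program aliasing_sketch
         implicit none
         integer, parameter :: n = 5                  ! hypothetical size
         real, target :: t_serial(n+1), t_simd(n+1)   ! hypothetical backing storage
         real, pointer :: a(:), c(:)
         real :: b(n)
         integer :: i

         b = 2.0

         ! Serial loop: C(I) occupies the same storage as A(I+1).
         t_serial = 1.0
         a => t_serial(1:n)
         c => t_serial(2:n+1)
         do i = 1, n
            c(i) = a(i) * b(i)
         end do

         ! SIMD-like execution: an array assignment evaluates all of A*B
         ! before any element of C is stored.
         t_simd = 1.0
         a => t_simd(1:n)
         c => t_simd(2:n+1)
         c = a * b

         print *, "serial:   ", t_serial
         print *, "SIMD-like:", t_simd

         if (any(t_serial /= [1.0, 2.0, 4.0, 8.0, 16.0, 32.0])) error stop "unexpected serial result"
         if (any(t_simd /= [1.0, 2.0, 2.0, 2.0, 2.0, 2.0])) error stop "unexpected SIMD-like result"
      end program aliasing_sketch
      ```

      Distinct ALLOCATABLE arrays cannot be made to overlap in this way, which is why declaring A, B, and C as ALLOCATABLE removes the potential dependency.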

See Also