No longer does Moore’s Law result in higher frequencies and improved scalar application performance; instead, higher transistor counts lead to increased parallelism, both through more cores and through wider SIMD registers. To get the full performance benefit from these improvements in processor technology, applications need to take advantage of both forms of parallelism. This article will focus on SIMD parallelism; multi-core parallelism is addressed elsewhere, with both OpenMP* and MPI being widely used in Fortran applications.
If suitable optimized library functions are available, such as the ones in the Intel® Math Kernel Library, these may be a simple and effective way for Fortran developers to take advantage of SIMD parallelism on Intel Architecture. Otherwise, the main way has been through auto-vectorizing compilers. The SIMD intrinsic functions that are available to C and C++ developers do not have a corresponding Fortran interface, and, in any case, require a great deal of programming effort and introduce an undesirable architecture dependence.
Auto-vectorization has its limitations. The compiler must be conservative, and not make any optimizations that could lead to different results from unoptimized code, even for unlikely-seeming values of input data. It must also estimate whether a code transformation is likely to yield faster or slower code. These decisions are typically based on incomplete information, since language standards have not provided a way to convey to the compiler all the information it needs to make more effective decisions, or to convey programmer intent. The requirements a loop must meet in order to be auto-vectorized are discussed separately.
The auto-vectorizer can be more effective if the compiler is given hints through compiler directives, but whether or not a loop is vectorized still depends on the compiler's internal analysis and the information available to it. Explicit Vector Programming is an attempt to remove that uncertainty: the programmer, using their knowledge of the application, can instruct the compiler to vectorize a loop. The method is analogous to threading with OpenMP, where a directive may require the compiler to thread a loop. Just as in OpenMP, the programmer is responsible for telling the compiler about reduction variables and private variables, to avoid race conditions.
Beginning with the OpenMP 4.0 standard, SIMD constructs are available as directives that support explicit vectorization. The Intel Fortran and C++ compilers have supported the OpenMP SIMD constructs since the release of the Intel Compilers Version 15.
To enable support for the SIMD features of OpenMP in the Intel compilers, applications should be compiled with either /Qopenmp or /Qopenmp-simd (Windows*) or with either -qopenmp or -qopenmp-simd (Linux* or macOS*). Beginning with the Intel Compilers Version 19, -qopenmp-simd (/Qopenmp-simd) is enabled as part of the default optimization level, -O2. If the application does not use any OpenMP threading APIs, but only SIMD constructs, /Qopenmp-simd (-qopenmp-simd) should be preferred. Support for the OpenMP SIMD features may be disabled by the options /Qsimd- (Windows) or -qno-openmp-simd (Linux or macOS).
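For example, an application that uses only SIMD constructs, in a hypothetical source file myapp.f90, might be compiled on Linux as:
> ifort -qopenmp-simd myapp.f90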
For clarity in the examples below, lower case variables are used for C and upper case for Fortran, although Fortran is, of course, case insensitive.
Array Notation
In Intel® C and C++, array assignments are written in the form
a[0:n] = a[m:n] + b[0:n] * c[n:n]
A comparable Fortran array assignment, with a lower bound of 1 instead of 0, would be:
A(1:n) = A(1+m:n+m) + B(1:n) * C(n+1:n+n)
However, there are two important differences.
- In C, the quantity following the colon specifies the number of data elements in the assignment. Naturally, this must be the same for each array or pointer in the assignment. In Fortran, the quantity following the colon specifies the upper bound of the array section being assigned. Thus x[j:k] in C describes the same array section as X(j:j+k-1) in Fortran.
- In C, array syntax asserts to the compiler that there is no backward dependence between data elements on the right hand side and data elements on the left hand side that would make vectorization unsafe. In the example above, no backward dependence would imply that m ≥ 0 and that there is no overlap between A(1:n) and B(1:n) or C(n+1:n+n) such that, in a scalar loop, an element of A would be written before the overlapping element of B or C was read. If the assertion is incorrect, it is considered a user error and the results are unpredictable. In Fortran, there is no such assertion. The compiler may need to store a temporary copy of the right hand side (RHS) for all elements before starting assignment to the LHS. This store to a temporary array may introduce a significant performance overhead, but it may permit vectorization of loops that could not otherwise be vectorized. Note that the semantics of Fortran array notation are different from those of a Fortran 77 style DO loop:

DO i=1,n
   A(i) = A(i+m) + B(i) * C(n+i)
ENDDO
It is possible to apply compiler directives such as IVDEP to an array assignment, just as for a DO loop. An IVDEP directive allows the compiler to assume there are no potential (data-dependent) dependencies that might make vectorization unsafe; it will not, however, change the semantics of making a temporary copy. The IVDEP directive will not override a proven dependency.
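For example, a minimal sketch applying IVDEP to the array assignment shown earlier:

!DIR$ IVDEP
A(1:n) = A(1+m:n+m) + B(1:n) * C(n+1:n+n)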
Use the compiler option -qopt-report-phase=vec (/Qopt-report-phase=vec) to get a report explaining which loops were vectorized, which were not and the reason why. Levels -qopt-report=3, 4 or 5 (/Qopt-report:3, 4 or 5) give additional detail. By default, the report is written to a file with extension .optrpt. For short reports, it may be convenient to redirect to standard error using the option -qopt-report-file=stderr (/Qopt-report-file:stderr).
Consecutive array assignments with similar array bounds may sometimes be fused into a single loop, to reduce loop overhead and/or memory traffic for intermediate results. When this happens, a message appears in the 'loop' phase of the optimization report, which may be included with -qopt-report-phase=loop,vec (/Qopt-report-phase:loop,vec).
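For example, the following two assignments (a sketch, assuming B, C and D are conformable arrays) are candidates for fusion into a single loop over the index, so that each element of A can be consumed as soon as it has been produced:

A(1:n) = B(1:n) + C(1:n)
D(1:n) = 2.0 * A(1:n)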
Loops Containing Procedure Calls
Loops containing procedure calls (subroutine or function calls) cannot, in general, be vectorized, unless either the function can be inlined or a vector version of the function is available. For a small number of calls to small procedures, inlining may be the preferred solution, since it requires little or no source code changes and also eliminates the overhead of a procedure call.
Inlining within the same source file is enabled by default at optimization levels of -O2 and higher; inlining from one source file to another is enabled by interprocedural optimization with -ipo on Linux* or macOS*, /Qipo on Windows*. By default, the compiler decides whether or not to inline a procedure call based on size and other heuristics.
To ensure inlining in order to enable vectorization, the !DIR$ FORCEINLINE directive may be used, either immediately before a statement containing one or more procedure calls, or before a DO loop to inline all procedure calls within the loop. See the compiler user guide for examples.
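A minimal sketch, where small_sub stands for a hypothetical procedure whose source is visible to the compiler, and a and b are hypothetical arrays:

!DIR$ FORCEINLINE
do i=1,n
   call small_sub( a(i), b(i) )
enddo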
The compiler option -qopt-report-phase=ipo (/Qopt-report-phase=ipo) may be used to generate a report that shows which functions were inlined.
SIMD-Enabled Procedures (Functions and Subroutines)
For some loops, such as those containing many procedure calls, nested procedure calls or calls to large procedures, inlining may not be practicable, due to complexity, code size or compilation time. Making such procedures SIMD-enabled still allows the containing loop to be vectorized and gives the programmer more control over how vectorization is carried out. For a SIMD-enabled procedure, the compiler creates both a scalar implementation and one or more vector implementations of the procedure. The SIMD attribute of the procedure must be declared in both the caller and the callee; the best way to do this is with an explicit interface. For example:
subroutine get_proj (rad, theta, proj)
!$omp declare simd(get_proj)
   real, intent(in)  :: rad, theta
   real, intent(out) :: proj
   proj = rad*cos(theta)
end subroutine get_proj
real, dimension(N) :: p,r,t
interface
   subroutine get_proj (rad, theta, proj)
   !$omp declare simd(get_proj)
      real, intent(in)  :: rad, theta
      real, intent(out) :: proj
   end subroutine get_proj
end interface
…
do i=1,N                             ! or instead of DO loop, make get_proj an elemental
   call get_proj( r(i), t(i), p(i) ) ! subroutine and use array notation to write
enddo                                ! call get_proj( r(:), t(:), p(:) )
From the Intel® Fortran Compiler version 15.0 onwards, it is sufficient to incorporate the SIMD-enabled procedure within a module. Any calling routine that USEs the module then has access to the interface and can see that a SIMD-enabled version of the procedure is available.
The OMP DECLARE SIMD or ATTRIBUTES VECTOR directive tells the compiler to create, in addition to a normal scalar version, a SIMD version of subroutine get_proj that takes vector input arguments rad and theta and returns a vector output argument proj. The length of the vector is chosen by the compiler and may depend on the targeted microarchitecture, but it can also be influenced by the programmer using an additional clause. The calling procedure contains an interface block, so that the compiler knows that a SIMD version of the subroutine is available. The compiler is then able to vectorize the loop containing the subroutine call. This is very similar to the way the compiler vectorizes a loop containing a direct call to a math function, making use of the SIMD-enabled functions in the Short Vector Math Library, libsvml. In the present example, such a simple procedure could easily be inlined; the value of SIMD-enabled procedures is in more complicated examples with calls to multiple procedures in different source files.
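For example, the OpenMP SIMDLEN clause may be used to request a particular vector length; a sketch for the subroutine above, asking for eight sets of arguments to be processed per call:

!$omp declare simd(get_proj) simdlen(8)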
By default, the compiler expects that any argument of a SIMD-enabled procedure may be a vector. If the programmer knows that an argument will always have the same value for every iteration of the calling loop, i.e., that it is effectively scalar, it should be declared using the UNIFORM clause. This allows the compiler to avoid generating vector code for the “uniform” argument, and instead broadcast the single scalar value to each vector lane. For example,
module mymod
contains
   subroutine get_proj (rad, theta, proj)
   !$omp declare simd(get_proj) uniform(rad)
      real, intent(in)  :: rad, theta
      real, intent(out) :: proj
      proj = rad*cos(theta)
   end subroutine get_proj
end module mymod
subroutine caller
   use mymod
   real, dimension(N) :: p,t
   real :: r
   …
   do i=1,N                          ! or instead of DO loop, make get_proj an elemental
      call get_proj( r, t(i), p(i) ) ! subroutine and use array notation to write
   enddo                             ! call get_proj( r, t(:), p(:) )
The source code of the subroutine itself is unchanged; only the subroutine attributes are different.
Calling SIMD-Enabled Procedures With Array Arguments
Instead of calling a SIMD-enabled function from within a loop with arguments that are individual array elements, it is possible to call the SIMD function directly with whole arrays or array sections as arguments. To allow this, the SIMD-enabled function must also be declared as ELEMENTAL. (ELEMENTAL is a Fortran standard keyword). Both aspects are required: a SIMD-enabled function is not automatically elemental, and an ELEMENTAL function is not automatically SIMD-enabled. So for example:
module mymod
contains
   elemental subroutine get_proj (rad, theta, proj)
   !$omp declare simd(get_proj) uniform(rad)
      real, intent(in)  :: rad, theta
      real, intent(out) :: proj
      proj = rad*cos(theta)
   end subroutine get_proj
end module mymod
…
subroutine caller
   use mymod
   real, dimension(N) :: p,t
   real :: r
   …
   call get_proj( r, t(:), p(:) )
Or to pass just half of the array:
call get_proj( r, t(1:N/2), p(1:N/2) )
Efficiency Considerations for SIMD-Enabled Procedures
In C, scalar arguments are by default passed by value; in SIMD-enabled procedures, arguments are passed as a short vector of values. The Fortran default calling convention is for scalar arguments to be passed by reference; the simple extension to SIMD-enabled procedures is to pass a short vector of addresses, instead of a single address. The compiler then “gathers” data from the vector of addresses to create a short vector of values for use in subsequent vector arithmetic. The overhead from this gather means that in order to see a performance benefit, a simple Fortran SIMD-enabled procedure needs to contain more work compared to an analogous SIMD-enabled function in C. This overhead can be mitigated for input arguments by choosing to pass them by value, for example:
subroutine get_proj (rad, theta, proj) bind(C)
!$omp declare simd(get_proj) uniform(rad)
   real, intent(in), value :: rad, theta
   real, intent(out) :: proj
   proj = rad*cos(theta)
end subroutine get_proj
The VALUE keyword alone is not sufficient; it must be combined with BIND(C). Instead of these two keywords, it is possible to use the directive
!DIR$ ATTRIBUTES VALUE :: argname
The keywords are preferred, since they are part of the Fortran language standard.
This method cannot be applied to output arguments, since the Fortran language standard requires INTENT(OUT) or INTENT(INOUT) arguments to be passed by reference. However, a SIMD-enabled subroutine containing a single vector output argument may be converted to a SIMD-enabled function with a vector result, which will be passed back to the calling procedure by value, avoiding the overhead of a gather. For example:
module mymod
contains
   real function proj (rad, theta) bind(C)
   !$omp declare simd(proj) uniform(rad)
      real, intent(in), value :: rad, theta
      proj = rad*cos(theta)
   end function proj
end module mymod
…
use mymod
real, dimension(N) :: p, t
real :: r
…
do i=1,N                  ! or instead of DO loop, make proj an elemental function
   p(i) = proj( r, t(i) ) ! and use array notation to write
enddo                     ! p(:) = proj( r, t(:) )
Any additional vector arguments with intent(out) or intent(inout) must be passed by reference in the usual way.
The LINEAR(REF()) Clause
The REF qualifier to the LINEAR clause was introduced in OpenMP 4.5. It is supported by the Intel Fortran Compiler Version 16 update 1 (16.0.1) and later. When applied to a vector argument passed by reference, it tells the compiler that the addresses in the vector argument are consecutive, so the compiler can generate a single vector load or store instead of a gather or scatter. This allows more efficient SIMD code to be generated with fewer source changes. Using this feature, our original example becomes:
module mymod
contains
   subroutine get_proj (rad, theta, proj)
   !$omp declare simd(get_proj) uniform(rad) linear(ref(theta,proj))
      real, intent(in)  :: rad, theta
      real, intent(out) :: proj
      proj = rad*cos(theta)
   end subroutine get_proj
end module mymod
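The calling loop itself is unchanged; because t(i) and p(i) walk through their arrays with unit stride, the addresses passed for theta and proj are consecutive and satisfy the LINEAR(REF()) assertion:

do i=1,N
   call get_proj( r, t(i), p(i) )
enddo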
The SIMD Directive
The OpenMP 4.0 directive, !$OMP SIMD, and its Intel compiler equivalent, !DIR$ SIMD, instruct the compiler to generate vectorized code for the following loop. Unlike other compiler directives, such as IVDEP or VECTOR ALWAYS, the SIMD directive is not a hint to the compiler; it is a command. The compiler does not analyze the loop for dependencies or estimate whether vectorization is likely to give a speedup; it forges ahead and vectorizes. It is the responsibility of the programmer to ensure there are no dependencies that could make vectorization unsafe, and to judge whether vectorization may improve performance. This behavior is analogous to the !$OMP DO directive, which instructs the compiler to generate threaded code for the following DO loop, leaving the programmer responsible for thread safety and for providing sufficient parallel work to make threading worthwhile.
The compiler may go to great lengths, including emulation of vector functions by scalar calls, to ensure vectorization. However, as for OpenMP threading, SIMD loops must still obey a few basic rules, such as the iteration count being known at entry to the loop.
Here is an example of how vectorization of a loop can be enforced using the SIMD directive:
module mymod
   real, pointer, dimension(:) :: a, b, c, d, e
end module mymod
subroutine add4
   use mymod
   implicit none
   integer :: i
!$omp simd
   do i=1,1000
      e(i) = a(i) + b(i) + c(i) + d(i)
   enddo
end subroutine add4
> ifort add4.f90 -c -qopt-report-phase=vec,loop -qopt-report-file=stderr
…
LOOP BEGIN at add4.f90(12,34)
<Multiversioned v1>
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between line 12 and line 12
remark #25438: unrolled without remainder by 4
LOOP END …
Unaided, the compiler does not auto-vectorize because it does not know whether a part of e might be aliased with one of the pointers a, b, c or d (i.e., point to an overlapping memory location). There are too many possibilities for the compiler to test them all at run-time. The IVDEP compiler directive could be used to assert that the potential dependencies are not realized, i.e., that e is independent of a, b, c and d, so it is safe to vectorize the loop. If we recompile with !DIR$ IVDEP, we get two messages: “LOOP WAS VECTORIZED” and “vectorization possible but seems inefficient”. The original message “<Multiversioned v1>” gives a clue. If we look at the 'loop' phase of the optimization report, requested with -qopt-report-phase=vec,loop, we see
remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
The compiler does not know whether the pointers point to contiguous chunks of memory, or to array sections with a non-unit stride. It generates separate code for each case, and judges vectorization to be worthwhile in the case of contiguous memory (unit stride), but not otherwise. We could invite the compiler to vectorize each case, irrespective of whether it thinks there will be a speed-up, by adding the compiler directive !DIR$ VECTOR ALWAYS, placed as shown below. With this, both loop versions are vectorized.
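Both directives immediately precede the loop in add4:

!DIR$ IVDEP
!DIR$ VECTOR ALWAYS
do i=1,1000
   e(i) = a(i) + b(i) + c(i) + d(i)
enddo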
A simpler way to ensure vectorization is to compile with the single directive !$OMP SIMD, enabled by /Qopenmp-simd (Windows) or -qopenmp-simd (Linux or macOS). This instructs (not invites) the compiler to vectorize the loop, ignoring any considerations of dependencies or performance. With this, the message
remark #15301: SIMD LOOP WAS VECTORIZED
is seen for both loop versions.
The SIMD directive does not tell the compiler whether the target of the pointer has unit stride, so there are still two loop versions. If you know that the target always has unit stride, add the CONTIGUOUS keyword to the pointer declaration to inform the compiler, which will then generate only the unit-stride version of the loop.
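For example, in the module above:

module mymod
   real, pointer, contiguous, dimension(:) :: a, b, c, d, e
end module mymod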
The SIMD directive may disable all dependency checking by the compiler. Using it when real dependencies are present may yield unexpected results or failures.
The SIMD directive includes the functionality of both IVDEP and VECTOR ALWAYS directives, as illustrated by the above example, but it is more powerful. In the following example, using IVDEP and VECTOR ALWAYS is not sufficient to get the loop to vectorize.
real function vec_sum(x,nx)
   implicit none
   integer, intent(in) :: nx
   real, intent(in), dimension(nx) :: x
   integer :: i
   interface
      real function func(x)
      !$OMP DECLARE SIMD (func)
         real, intent(in) :: x
      end function func
   end interface
   vec_sum = 0.
! !$OMP SIMD reduction(+:vec_sum)
   do i = 1,nx
      vec_sum = vec_sum + func(x(i))
   enddo
end function vec_sum
As coded, the auto-vectorizer is unable to recognize and vectorize the combination of a reduction loop with a vector function call, although it could auto-vectorize either construct separately:
> ifort -c -qopenmp-simd -qopt-report-phase=vec -qopt-report-file=stderr vec_sum.f90
vec_sum.f90(17): (col. 3) remark: loop was not vectorized: existence of vector dependence.
Even an IVDEP directive will not enable the compiler to auto-vectorize. However, uncomment the SIMD directive and the compiler behaves as directed:
> ifort -c -qopenmp-simd -qopt-report-phase=vec -qopt-report-file=stderr vec_sum.f90
…
LOOP BEGIN at vec_sum.f90(17,3)
remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
This example also illustrates the use of a REDUCTION clause on the SIMD directive. Just like in an OpenMP worksharing construct such as !$OMP DO, its use is mandatory to avoid conflicts (race conditions) between different iterations that write to the same reduction variable, vec_sum in the above example.
The PRIVATE clause also functions the same way as in OpenMP to avoid race conditions. For example, if the above loop is modified slightly to become an integral of func over x, a PRIVATE clause is needed:
!$omp simd reduction(+:vec_sum) private(xi)
do i = 1,nx
   xi = x0 + bin_width*(i)
   vec_sum = vec_sum + func(xi)
enddo
Additional clauses supported by the SIMD directive include LINEAR, to tell the compiler that the loop contains one or more secondary induction variables, and ALIGNED, to tell the compiler that one or more arrays have a specified alignment in memory.
ALIGNED(X:64) would tell the compiler that the start of the array X is always aligned to a 64 byte boundary in memory, and that the compiler can safely generate aligned load instructions without risking a fault, for example on Intel® Xeon Phi™ Coprocessors. It is the programmer’s responsibility to align the array. In many cases, this can be done with a command line switch such as -align array64byte. For more details about these and other clauses supported by the !$OMP SIMD and !DIR$ SIMD directives, see the Intel® Fortran Compiler Developer Guide and Reference under SIMD Directive (OpenMP* API) and SIMD Loop Directive.
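As a sketch combining the two clauses, with hypothetical arrays x and y, where j is a secondary induction variable that advances by one each iteration and the start of x has been aligned to a 64-byte boundary:

j = 0
!$omp simd linear(j:1) aligned(x:64)
do i = 1, n
   j = j + 1
   y(i) = 2.0*x(j)
enddo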
If the compiler is unable to vectorize a loop preceded by a !$OMP SIMD or a !DIR$ SIMD directive, for example, if it does not conform to the requirements referenced above, the compiler emits a warning diagnostic:
remark #13379: loop was not vectorized with “simd”.
For !DIR$ SIMD only, this may be converted to a fatal error if an ASSERT clause is added to the directive. This can be a useful way to alert the programmer if future changes unintentionally prevent the loop from vectorizing.
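For example, reusing the add4 loop from above:

!DIR$ SIMD ASSERT
do i=1,1000
   e(i) = a(i) + b(i) + c(i) + d(i)
enddo

If this loop cannot be vectorized, compilation now fails with an error instead of emitting remark #13379.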
If an application is compiled with the option -no-simd, any !DIR$ SIMD directives are ignored. This can be useful for comparison testing. Any !$OMP SIMD directives are ignored by default, unless the application is built with -qopenmp or -qopenmp-simd.