Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference

ID 767253
Date 3/22/2024
Public

SIMD-Enabled Functions

SIMD-enabled functions (formerly called elemental functions) are a general language construct to express a data parallel algorithm. A SIMD-enabled function is written as a regular C/C++ function, and the algorithm describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element or it can be called in a data parallel context to operate on many elements.

How SIMD-Enabled Functions Work

When you write a SIMD-enabled function, the compiler generates short vector variants of the function that you requested, which can perform your function's operation on multiple arguments in a single invocation. The short vector variant may be able to perform multiple operations as fast as the regular implementation performs a single one by using the vector instruction set architecture (ISA) in the CPU. When a call to a SIMD-enabled function occurs in a SIMD loop or another SIMD-enabled function, the compiler replaces the scalar call with the best fit from the available short-vector variants of the function.
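As a minimal sketch of this behavior, assuming the OpenMP form of the declaration and a compiler option that enables OpenMP SIMD, the function below is written in scalar form and then called from a SIMD loop. The names saxpy_elem and saxpy are hypothetical and chosen only for illustration. Whether a short vector variant is actually generated and used depends on the compiler and its options.

```cpp
#include <cstddef>
#include <vector>

// Scalar description of the per-element operation. With OpenMP SIMD enabled,
// the compiler may also generate short vector variants of this function.
// (saxpy_elem is a hypothetical name used for illustration.)
#pragma omp declare simd
float saxpy_elem(float a, float x, float y) {
    return a * x + y;
}

// Data parallel caller: the scalar call inside the SIMD loop is a candidate
// for replacement by a matching short vector variant.
// Assumes y holds at least as many elements as x.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> out(x.size());
    const float* px = x.data();
    const float* py = y.data();
    float* po = out.data();
    const std::size_t n = x.size();
#pragma omp simd
    for (std::size_t i = 0; i < n; i++) {
        po[i] = saxpy_elem(a, px[i], py[i]);
    }
    return out;
}
```

The same saxpy_elem can still be called on a single element as a regular function, for example saxpy_elem(2.0f, 1.0f, 10.0f).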

In addition, when a SIMD-enabled function is invoked from within a #pragma omp construct, the compiler may assign different copies of the function to different threads (or workers) and execute them concurrently. The result is that your data parallel operation executes on the CPU using both the parallelism available in the multiple cores and the parallelism available in the vector ISA. In other words, if the short vector function is called inside a parallel loop (a vectorized, auto-parallelized loop), you can achieve both vector-level and thread-level parallelism.
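One way to combine the two levels of parallelism is the OpenMP combined parallel for simd construct. The sketch below is illustrative only: the names square_plus_one and sum_square_plus_one are hypothetical, and threading takes effect only when full OpenMP support is enabled. With OpenMP disabled, the pragmas are ignored and the code runs as an ordinary serial loop.

```cpp
#include <vector>

// Hypothetical SIMD-enabled function: scalar code for one element.
#pragma omp declare simd
double square_plus_one(double x) {
    return x * x + 1.0;
}

// With OpenMP enabled, threads split the iteration space and each thread's
// chunk of iterations is a candidate for vectorization. The reduction clause
// gives every thread and vector lane a private partial sum.
double sum_square_plus_one(const std::vector<double>& v) {
    const double* p = v.data();
    const int n = static_cast<int>(v.size());
    double total = 0.0;
#pragma omp parallel for simd reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += square_plus_one(p[i]);
    }
    return total;
}
```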

Declare a SIMD-Enabled Function

Use the appropriate syntax below in your code so that the compiler generates the short vector function:

Linux

Use the __attribute__((vector (clauses))) declaration:

__attribute__((vector (clauses))) return_type simd_enabled_function_name(parameters)

Alternatively, you can use the following OpenMP pragma, which requires the [q or Q]openmp or [q or Q]openmp-simd compiler option:

#pragma omp declare simd clauses

Windows

The clauses in the vector declaration can be used to achieve better performance by overriding defaults. When specified at the definition of a SIMD-enabled function, these clauses declare one or more short vector variants of that function. Multiple vector declarations with different sets of clauses may be attached to one function to declare multiple short vector variants.

The clauses are defined as follows:

Clause Definition
processor(cpuid)

Tells the compiler to generate a vector variant using the instructions, the caller/callee interface, and the default vector length selection scheme suitable for the specified processor. Use of this clause is highly recommended, especially for processors with wider vector register support (for example, core_2nd_gen_avx and newer).

cpuid takes one of the following values:

  • core_4th_gen_avx_tsx
  • core_4th_gen_avx
  • core_3rd_gen_avx
  • core_2nd_gen_avx
  • core_aes_pclmulqdq
  • core_i7_sse4_2
  • atom
  • core_2_duo_sse4_1
  • core_2_duo_ssse3
  • pentium_4_sse3
  • pentium_m
  • pentium_4
  • haswell
  • broadwell
  • skylake
  • skylake_avx512

vectorlength(n) / simdlen(n) (for omp declare simd)

Where n is a vector length that is a power of 2, no greater than 32.

The simdlen clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution. When omitted, the compiler selects the vector length automatically depending on the routine return value, parameters, and/or the processor clause. When multiple vector variants are called from one vectorization context (for example, two different functions called from the same vector loop), explicit use of identical simdlen values is advised to achieve good performance.
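The advice about identical simdlen values can be sketched as follows, assuming the OpenMP form of the declaration. The functions scale and offset and the caller transform are hypothetical names, and the value 8 is an arbitrary choice for illustration; the best length depends on the target processor.

```cpp
// Two hypothetical SIMD-enabled functions that are called from the same
// vector loop. Both request the same vector length so that the caller does
// not have to mix variants of different widths.
#pragma omp declare simd simdlen(8)
float scale(float x) {
    return 2.0f * x;
}

#pragma omp declare simd simdlen(8)
float offset(float x) {
    return x + 1.0f;
}

// One vectorization context that uses both functions.
void transform(const float* in, float* out, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++) {
        out[i] = offset(scale(in[i]));
    }
}
```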

linear(list_item[, list_item...]), where list_item is one of:
  • param[:step]
  • val(param[:step])
  • ref(param[:step])
  • uval(param[:step])

The linear clause tells the compiler that for each consecutive invocation of the routine in a serial execution, the value of param is incremented by step, where param is a formal parameter of the specified function or the C++ keyword this. The linear clause can be used on parameters that are either scalar (non-arrays and of non-structured types), pointers, or C++ references. step is a compile-time integer constant expression, which defaults to 1 if omitted.

If more than one step is specified for a particular parameter, a compile-time error occurs.

Multiple linear clauses will be merged as a union.

The meaning of each variant of the clause is as follows:

  • linear(param[:step]): For parameters that are not C++ references: the clause tells the compiler that on each iteration of the loop from which the routine is called the value of the parameter will be incremented by step. The clause can also be used for C++ references for backward compatibility, but it is not recommended.
  • linear(val(param[:step])): For parameters that are C++ references: the clause tells the compiler that on each iteration of the loop from which the routine is called the referenced value of the parameter will be incremented by step.
  • linear(uval(param[:step])): For C++ references: means the same as linear(val()). The difference is that with linear(val()) a vector of references is passed to the vector variant of the routine, whereas with linear(uval()) only one reference is passed, so linear(uval()) is the better choice in terms of performance.
  • linear(ref(param[:step])): For C++ references: means that the reference itself is linear, that is, the referenced values (which form a vector for calculations) are located sequentially, as in an array, with the distance between elements equal to step.

uniform(param[, param...])

Where param is a formal parameter of the specified function or the C++ keyword this.

The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. It is often useful in generating more favorable vector memory references. Honoring a uniform clause may also allow broadcast operations to be hoisted out of the caller loop. Evaluate the performance implications carefully. Multiple uniform clauses are merged as a union.
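The linear and uniform clauses are often used together. The following sketch assumes the OpenMP form of the declaration; the names scaled_at and scale_all are hypothetical. The pointer and the scale factor are the same for every invocation, and the index advances by 1, which lets the compiler treat base[i] as a unit-stride access.

```cpp
// base and scale are loop invariant at the call site (uniform);
// i is incremented by 1 on each consecutive invocation (linear).
#pragma omp declare simd uniform(base, scale) linear(i : 1)
float scaled_at(const float* base, float scale, int i) {
    return scale * base[i];
}

// Caller whose arguments match the declaration above: in and scale do not
// change inside the loop, and the loop index grows by 1 per iteration.
void scale_all(const float* in, float* out, float scale, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++) {
        out[i] = scaled_at(in, scale, i);
    }
}
```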

mask / nomask

The mask and nomask clauses tell the compiler to generate only masked or unmasked (respectively) vector variants of the routine. When omitted, both masked and unmasked variants are generated. The masked variant is used when the routine is called conditionally.

inbranch / notinbranch

The inbranch and notinbranch clauses are used with #pragma omp declare simd. The inbranch clause works the same as the mask clause above and the notinbranch clause works the same as the nomask clause above.
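As an illustration, the sketch below declares only a masked variant because the function is always called under a condition. The names safe_recip and recip_nonzero are hypothetical, and the example assumes the OpenMP form of the declaration.

```cpp
// Called only under a condition in the loop below, so only the masked
// (inbranch) vector variant is requested.
#pragma omp declare simd inbranch
float safe_recip(float x) {
    return 1.0f / x;
}

// The call is conditional: lanes whose input is zero skip the function
// and store 0 instead.
void recip_nonzero(const float* in, float* out, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++) {
        if (in[i] != 0.0f) {
            out[i] = safe_recip(in[i]);
        } else {
            out[i] = 0.0f;
        }
    }
}
```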

Write the code inside your function using existing C/C++ syntax and relevant built-in functions.

Usage of Vector Function Specifications

You may define several vector variants for one routine, with each variant reflecting a possible usage of the routine. When it encounters a call, the compiler matches the vector variants against the kinds of the actual parameters and chooses the best match. Matching is done by priority. For example, if an actual parameter is loop invariant and the uniform clause was specified for the corresponding formal parameter, then the variant with the uniform clause has a higher priority. Linear specifications have the following order, from high priority to low: linear(uval()), linear(), linear(val()), linear(ref()). Consider the following example loops, which call the same routine.

// routine prototype
#pragma omp declare simd                           // universal but slowest definition matches the use in all three loops
#pragma omp declare simd linear(in1) linear(ref(in2)) uniform(mul) // matches the use in the first loop
#pragma omp declare simd linear(ref(in2))                            // matches the use in the second and the third loops
#pragma omp declare simd linear(ref(in2)) linear(mul)              // matches the use in the second loop
#pragma omp declare simd linear(val(in2:2))                          // matches the use in the third loop
extern int func(int* in1, int& in2, int mul);

int *a, *b, mul, *c;
int *ndx, nn;
...
// loop examples
   for (int i = 0; i < nn; i++) {
       c[i] = func(a + i, *(b + i), mul); // in the loop, the first parameter is changed linearly, 
                                          // the second reference is changed linearly too
                                          // the third parameter is not changed
   }

   for (int i = 0; i < nn; i++) {
       c[i] = func(&a[ndx[i]], b[i], i + 1); // the value of the first parameter is unpredictable,
                                             // the second reference is changed linearly
                                             // the third parameter is changed linearly
   }

   #pragma omp simd
   for (int i = 0; i < nn; i++) {
       int k = i * 2;  // during vectorization, private variables are transformed into arrays: k->k_vec[vector_length]
       c[i] = func(&a[ndx[i]], k, b[i]); // the value of the first parameter is unpredictable,
                                         // the second reference and value can be considered linear
                                         // the third parameter has unpredictable value
                                         // (of the two matching variants, the one declared with
                                         // linear(val(in2:2)) will be chosen)
   }

SIMD-Enabled Functions and C++

You should use SIMD-enabled functions in modern C++ with caution: C++ imposes strict requirements on compilation and execution environments that may not compose well with semantically rich language extensions such as SIMD-enabled functions. Three key aspects of C++ interact with the SIMD-enabled function concept: exception handling, dynamic polymorphism, and the C++ type system.

SIMD-Enabled Functions and Exception Handling

Exceptions are currently not supported in SIMD contexts: exceptions cannot be thrown or caught in SIMD loops and SIMD-enabled functions. Therefore, all SIMD-enabled functions are considered noexcept in C++11 terms. This affects not only the short vector variants of a function, but its original scalar routine as well. This is enforced when the function is compiled: it is checked for throw constructs and for calls to functions that throw exceptions. It is also enforced when a call to the SIMD-enabled function is compiled.
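One possible way to work within this restriction, shown here only as a sketch and not as a method prescribed by this guide, is to report errors from the SIMD-enabled function through its return value and to throw outside the SIMD loop. The names checked_sqrt and sum_sqrt and the sentinel value -1.0 are hypothetical.

```cpp
#include <cmath>
#include <stdexcept>

// SIMD-enabled function that never throws. A negative input is reported
// through the sentinel return value -1.0 (an assumption of this sketch).
#pragma omp declare simd
double checked_sqrt(double x) noexcept {
    return x < 0.0 ? -1.0 : std::sqrt(x);
}

// The SIMD loop only counts failures. The exception is thrown after the
// loop, outside of any SIMD context.
double sum_sqrt(const double* in, int n) {
    double total = 0.0;
    int failures = 0;
#pragma omp simd reduction(+:total, failures)
    for (int i = 0; i < n; i++) {
        const double r = checked_sqrt(in[i]);
        failures += (r < 0.0) ? 1 : 0;
        total += (r < 0.0) ? 0.0 : r;
    }
    if (failures != 0) {
        throw std::invalid_argument("sum_sqrt: negative input");
    }
    return total;
}
```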

SIMD-Enabled Functions and Dynamic Polymorphism

Vector specifications are not yet supported for virtual functions.

SIMD-Enabled Functions and the C++ Type System

Vector attributes are attributes in the C++11 sense and so are not part of a functional type of SIMD-enabled functions. Vector attributes are bound to the function itself, an instance of a functional type. This has the following implications:

  • Template instantiations that take SIMD-enabled functions as template parameters do not capture vector attributes, so it is impossible to preserve vector attributes in function wrapper templates such as std::bind, which add indirection. This indirection may sometimes be optimized away by the compiler, in which case the resulting direct call has all vector attributes associated with it.
  • There is no way to overload or specialize templates by vector attributes.
  • There is no way to write functional traits to capture vector attributes for the sake of template metaprogramming.

The example below shows several cases in which this behavior may be observed:

template <int f(int)>   // Function value template – captures exact function
                        // not a function type
int caller1(int x[100]) {
   int res = 0;
#pragma omp simd reduction(+:res)
   for (int i = 0; i < 100; i++) {
      res += f(x[i]);   // Exact function put here upon instantiation
   }
   return res;
}

template <typename F>  // Generic functional type template – captures 
                       // object type for functors or entire functional type 
                       // for functions. If vector attributes were part of 
                       // a functional type they might be captured and applied
                       // but currently they are not.
int caller2(F f, int x[100]) {
   int res = 0;
#pragma omp simd reduction(+:res)
   for (int i = 0; i < 100; i++) {
      res += f(x[i]);  // Will call matching function f indirectly
                       // Will call matching f.operator() directly
   }
   return res;
}

template <typename RET, typename ARG>  // Type-decomposing template
                                       // captures argument and return types.
                                       // Vector attributes would be lost 
                                       // even if they were part of a 
                                       // functional type.
int caller3(RET (*f)(ARG), int x[100]) {
   int res = 0;
#pragma omp simd reduction(+:res)
   for (int i = 0; i < 100; i++) {
      res += f(x[i]);  // Will call matching function f indirectly
   }
   return res;
}


#pragma omp declare simd 
int function(int x); // SIMD-enabled function
int nv_function(int x);                 // Regular scalar function

struct functor {                        // Functor class with
#pragma omp declare simd                      // SIMD-enabled operator()
   int operator()(int x);
};

int arr[100];

int main() {
   int res;
#pragma noinline
   res = caller1<function>(arr); // This will be instantiated for 
                                 // function() and call short vector variant
#pragma noinline
   res += caller1<nv_function>(arr); // This will be separately instantiated 
                                     // for nv_function()
#pragma noinline
   res += caller2(function, arr); // This will be instantiated for
                                  // int(*)(int) type and will call scalar
                                  // function() indirectly
#pragma noinline
   res += caller2(nv_function, arr); // This will call the same
                                     // instantiation as above on nv_function

#pragma noinline
   res += caller2(functor(), arr); // This will be instantiated for
                                   // functor type and will call short vector
                                   // variant of functor::operator()
#pragma noinline
   res += caller3(function, arr); // This will be instantiated for
                                  // <int, int> types and will call scalar
                                  // function() indirectly
#pragma noinline
   res += caller3(nv_function, arr); // This will call the same
                                     // instantiation as above on nv_function
   return res;
}

NOTE:
If calls to caller1, caller2, and caller3 are inlined, the compiler is able to replace indirect calls with direct calls in all cases. In this case, caller2(function, arr) and caller3(function, arr) both call short vector variants of function() as a result of the usual replacement of direct calls to function() by matching short vector variants in the SIMD loop.

Invoke a SIMD-Enabled Function with Parallel Context

Typically, the invocation of a SIMD-enabled function provides arrays wherever scalar arguments are specified as formal parameters.

NOTE:
Using array notation syntax, or calling the SIMD-enabled function from a regular for loop, invokes the short vector function in each iteration and uses vector parallelism, but the invocation happens in a serial loop and does not use multiple cores.

Limitations

The following language constructs are not allowed within SIMD-enabled functions:

  • setjmp/longjmp calls
  • Exception handling constructs
  • Any OpenMP construct except atomic and simd. For more details, refer to the OpenMP standard.