Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference

ID 767253
Date 6/24/2024
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Explicit SIMD SYCL Extension

oneAPI provides an Explicit SIMD SYCL extension (ESIMD) for lower-level Intel GPU programming.

oneAPI provides an Explicit SIMD SYCL extension (ESIMD) for lower-level Intel GPU programming.

ESIMD provides APIs like Intel's GPU Instruction Set Architecture (ISA), but it enables you to write explicitly vectorized device code. This explicit enabling gives you more control over the generated code and allows you to depend less on compiler optimizations.

The specification, API reference, small API demos, and working code examples are available on GitHub.

NOTE:
Some parts of this extension are under active development, and the APIs in the sycl::ext::intel::experimental::esimd package are subject to change.

ESIMD kernels and functions always require a subgroup size of one, meaning the compiler does not provide vectorization across work items in a subgroup. Instead, you must explicitly express the vectorization in your code. Below is an example that adds the elements of two arrays and writes the results to the third:

float *A = malloc_shared<float>(Size, q);
float *B = malloc_shared<float>(Size, q);
float *C = malloc_shared<float>(Size, q);

for (unsigned i = 0; i != Size; i++) {
  A[i] = B[i] = i;
}

q.parallel_for(Size / VL, [=](id<1> i)[[intel::sycl_explicit_simd]] {    
   auto offset = i * VL;
   // pointer arithmetic, so offset is in elements:
   simd<float, VL> va(A + offset);
   simd<float, VL> vb(B + offset);
   simd<float, VL> vc = va + vb;
   vc.copy_to(C + offset);
}).wait_and_throw();

In the example above, the lambda function passed to the parallel_for is marked with a special attribute: [[intel::sycl_explicit_simd]]. This attribute tells the compiler that the kernel is ESIMD-based and that ESIMD APIs can be used inside it. Here, the simd objects and copy_to functions are used. They are available only in the ESIMD extension.

Fully runnable code samples can be found on GitHub.

Compile and Run ESIMD Code

Code that uses the ESIMD extension can be compiled and run using the same commands as you would with standard SYCL:

To compile using the open source oneAPI DPC++ Compiler:

clang++ -fsycl vadd_usm.cpp

To compile using an Intel® oneAPI Toolkit:

icpx -fsycl vadd_usm.cpp

To run on an Intel-specific GPU device through the oneAPI Level Zero (Level Zero) backend:

ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./a.out

The resulting executable ($./a.out) can be run only on Intel GPU hardware, such as Intel® UHD Graphics 600 or later. The SYCL runtime automatically recognizes ESIMD kernels and dispatches their execution, so no additional setup is needed. Linux and Windows platforms, including OpenCL™ and Level Zero backends, are supported.

Regular SYCL and ESIMD kernels can co-exist in the same translation unit and application.

SYCL and ESIMD Interoperability

SYCL kernels can call ESIMD functions using the special invoke_simd API. Details are available in the invoke_simd API specification.Examples and test cases are also available.

#include <sycl/ext/intel/esimd.hpp>
#include <sycl/ext/oneapi/experimental/invoke_simd.hpp>
#include <sycl/sycl.hpp>

constexpr int N = 8;

namespace seoe = sycl::ext::oneapi::experimental::simd;
namespace esimd = sycl::ext::intel::simd;

// ESIMD function
[[intel::device_indirectly_callable]] SYCL_EXTERNAL seoe::simd<float, N> __regcall esimd_scale(seoe::simd<float, N> x, float n) SYCL_ESIMD_FUNCTION {
  return esimd::simd<float, N>(x) * n;
}
...
auto ndr = nd_range<1>{range<1>{global_size}, range<1>{N * num_sub_groups}};
q.parallel_for(ndr, sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(N)]] {
  sycl::sub_group sg = it.get_sub_group();
  float x = ...;
  float n = ...;

  // Invoke SIMD function:
  // `x` values from each work-item are grouped into a simd<float, N>.
  // `n` is passed as a uniform scalar.
  // The vector result simd<float, N> is split into N scalar elements,
  // then assigned to each `y` of each corresponding N work-items.
  float y = seoe::invoke_simd(sg, esimd_scale, x, seoe::uniform(n));
});

Currently, compiling programs with invoke_simd calls requires a few additional compilation options. Also, running such programs may require setting additional parameters for the GPU driver:

# compile: pass -fsycl-allow-func-ptr because by default the function pointers
# are not allowed in SYCL/ESIMD programs;
# also pass -fno-sycl-device-code-split-esimd to keep invoke_simd() caller
# and callee in the same module.
clang++ -fsycl -fno-sycl-device-code-split-esimd -Xclang -fsycl-allow-func-ptr -o invoke_simd

# run the program:
IGC_VCSaveStackCallLinkage=1 IGC_VCDirectCallsOnly=1 ./invoke_simd

Restrictions

This section contains lists of the main restrictions that apply when using the ESIMD extension.

NOTE:
The compiler does not enforce some extensions, which may lead to undefined program behavior.
  • Features not supported with ESIMD:
  • Unsupported standard SYCL APIs:
    • 2D and 3D accessors.
    • Constant accessors.
    • sycl::accessor::get_pointer() and sycl::accessor::operator[] are supported only with -fsycl-esimd-force-stateless-mem. Otherwise, all memory accesses through an accessor are done via explicit APIs, for example: sycl::ext::intel::esimd::block_store(acc, offset)
    • Accessors with offsets and/or access range specified.
    • sycl::sampler and sycl::stream classes.
  • Other restrictions:
    • Only Intel GPU devices are supported.
    • Interoperability between regular SYCL and ESIMD kernels is only supported one way. Regular SYCL kernels can call ESIMD functions but not vice-versa.