2024.1 Intel® oneAPI DPC++/C++ Compiler – SYCL Runtime Compilation with the kernel_compiler Extension

One of the new capabilities in the fully SYCL 2020 conformant Intel® oneAPI DPC++/C++ Compiler 2024.1 is the kernel_compiler, introduced as an experimental feature. It is another example of how Intel contributes at the leading edge of LLVM and SYCL standards evolution. This extension can take OpenCL™ C strings and compile them at runtime into kernels that can then be executed on a device.

It is available in addition to the more commonly used modes of Ahead-of-Time (AoT) compilation, SYCL runtime compilation, and directed runtime compilation of SYCL kernels for specific offload target hardware.

Normally, the kernel_compiler extension should be considered a last resort!  

However, there may well be some very interesting reasons for creating SYCL kernels from OpenCL™ C or SPIR-V code stubs using this new extension.

Before we dive into the details and why there are usually (but not always) better approaches, let us take a quick tour of the different late- and early-compile options available with SYCL.

Three Different Types of Compilation 

What SYCL brings to your application is the ability to offload computational work to kernels that are executed on another computation device that might be resident on the machine, like a GPU or an FPGA. Do you have thousands of numbers that need to be calculated? Send them to the GPU!

This opens a door to power and performance, but also a door to further questions:  

  • Do you know what device you’ll be targeting? Will that change in the future?  
  • Do you know the full set of parameter values for that kernel execution, or could the kernel be more performant if tailored to parameter values that only the running application will know?

SYCL has several options that can address those questions: 

  1. Ahead-of-Time (AoT) Compile – Your kernels are compiled to machine code at the same time as you compile your application. 
  2. SYCL Runtime Compilation – The kernel is compiled when your app is running and the kernel is actually used.   
  3. Directed Runtime Compilation – You program your app to compile a kernel at the time of your choosing.   

Let us look at each of these: 

1. Ahead of Time (AoT) Compile 

When you are compiling your application, you can also precompile the kernels right then. To do so, you just need to know what devices you want the kernels compiled for. Pass them to the compiler with the -fsycl-targets flag and you’re set.  Done!  The kernels will be compiled now and those binaries will be used when your app is run. 

The nice thing about AoT compilation is that it is familiar to a C++ programmer and easy to grok. Additionally, for some devices (like FPGAs and some GPUs) it is the only option. 

Another advantage is that the runtime doesn’t need to stop and compile your kernel before it can be executed; that’s been done already, so the kernel can simply be loaded, passed to the device and executed.  

There are a lot more options you can use to control AoT compile, but they are outside the scope of this blog post.  If you want to read more, refer to the -fsycl-targets entry in Intel’s GitHub LLVM User Manual or to this section for compiler and runtime design.  
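As an illustration, an AoT build might look like the following. The target names are examples only; consult the -fsycl-targets documentation for the exact list your compiler version supports.

```shell
# Precompile kernels for a specific device family at application build time.
# spir64_x86_64 targets 64-bit CPUs; other values (e.g. spir64_gen for
# Intel GPUs) are documented with the -fsycl-targets flag.
icpx -fsycl -fsycl-targets=spir64_x86_64 -o myapp myapp.cpp
```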

2. SYCL Runtime Compilation (via SPIR-V) 

This is the default mode for SYCL if no target devices are specified (or if an app with precompiled kernels is run on a machine whose target devices differ from those specified at compile time).

By default, SYCL compiles your kernel C++ code to an intermediate representation known as SPIR-V (Standard Portable Intermediate Representation). That SPIR-V kernel is stored inside your application and then passed to the driver of whatever target device is encountered when the kernel is first needed.  The device driver then takes the SPIR-V kernel and compiles it to machine code for that device.   

There are two primary advantages to the default runtime compilation: 

  • First, you don’t need to worry in advance about what specific target device your kernel will run on. So long as there is one, it’ll run.  
  • Second, if there is a new performance optimization put into a GPU driver, then your app will get the advantage of it when your kernel is run on that GPU with the shiny new driver – without needing to recompile your app.  

Note, however, that there may be a small penalty (compared to AoT) because when your app first sends the kernel to the device, it must be compiled from SPIR-V to machine code. But this typically happens once, before the kernel is launched by parallel_for, so it sits outside the critical performance path.

In practice, this compilation time is negligible, and the flexibility afforded by runtime compilation outweighs the cost. Furthermore, SYCL can cache compiled kernels between invocations of your app, further reducing that cost. For more on caching, see kernel programming cache and environment variables.

But if you don’t like the default SYCL behavior yet want the flexibility of runtime compilation, then read on! 

3. Directed Runtime Compilation (via kernel_bundles) 

SYCL has a class called kernel_bundle, a programmatic interface that gives you access and control over the kernels bundled within your application.  

Of interest here are the kernel_bundle methods .compile(), .link(), and .build(). These allow you, as the app programmer, to determine exactly when and how a kernel is built, with no need to wait until the kernel is first used.

More information about kernel_bundles can be found in the SYCL 2020 spec and in an example of controlling compilation.  
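As a minimal sketch of directed compilation using the standard SYCL 2020 API (assuming the application has at least one kernel embedded in the input state):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;

  // Fetch the kernels embedded in this application, still in their
  // intermediate "input" state.
  auto input =
      sycl::get_kernel_bundle<sycl::bundle_state::input>(q.get_context());

  // Build them now, at a moment of our choosing, rather than at first use.
  auto exe = sycl::build(input);

  // Later submissions can be told to use the pre-built bundle.
  q.submit([&](sycl::handler &cgh) {
    cgh.use_kernel_bundle(exe);
    cgh.single_task([] { /* ... kernel body ... */ });
  });
  q.wait();
}
```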

Specialization Constants 

Let’s pretend you are writing a kernel that operates on the many many pixels of some input image. Your kernel needs to substitute the pixels matching some key color with a replacement. You know the kernel could be faster if that key color and replacement color were constants, rather than parameter variables. But at the time of writing your app, there is no way of knowing what those color values might be. Maybe they depend on user input or some other calculation. 

This is where specialization constants come into play.  

That’s exactly what the name means: constants in your kernel that you specialize at runtime, before the kernel is compiled at runtime. With specialization constants, your app can set the key and replacement colors, which are then compiled as constants into the kernel binary by the device driver. For kernels that can leverage this, the performance advantages can be huge.

For more information see the SYCL 2020 Specification, with a usage example here.  
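A sketch of how that pixel-replacement kernel might use specialization constants (the names here are illustrative, not from a real app):

```cpp
#include <sycl/sycl.hpp>

// Two specialization constants with placeholder defaults. The real values
// are not known until runtime.
constexpr sycl::specialization_id<int> key_color{0};
constexpr sycl::specialization_id<int> replacement_color{0};

void recolor(sycl::queue &q, sycl::buffer<int, 1> &pixels, int key, int repl) {
  q.submit([&](sycl::handler &cgh) {
    auto acc = sycl::accessor{pixels, cgh, sycl::read_write};
    // Specialize the constants now that the values are known...
    cgh.set_specialization_constant<key_color>(key);
    cgh.set_specialization_constant<replacement_color>(repl);
    cgh.parallel_for(pixels.get_range(),
                     [=](sycl::id<1> i, sycl::kernel_handler kh) {
      // ...and read them inside the kernel, where the device compiler can
      // treat them as literal constants.
      if (acc[i] == kh.get_specialization_constant<key_color>())
        acc[i] = kh.get_specialization_constant<replacement_color>();
    });
  });
}
```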

The Last Resort – the kernel_compiler 

The options that we’ve touched on so far all interoperate and play together nicely. Between AoT compilation, the default SYCL compile-at-runtime behavior, caching, directed compilation, and specialization constants, there is a very rich and powerful set of options available to you.

It’s easy to have your application select a particular kernel at runtime or prepare it to be performant with specialization constants. But maybe that’s not enough. Maybe your app just needs to write a kernel from scratch.  

To most easily demonstrate this, here is some source code. I’ve tried to write it so it reads clearly from top to bottom.

/*
    // Compile with DPCPP
    icpx -fsycl -o skc.exe simple_kernel_compiler.cpp

    // Run
    ./skc.exe
    120 122 124 126
*/

#include <iostream>
#include <sycl/sycl.hpp>

// The kernel in OpenCL C.
// read value from input ptr, double it and add 100, store in output ptr
auto constexpr CLSource = R"===(
__kernel void my_kernel(__global int *in, __global int *out) {
  size_t i = get_global_id(0);
  out[i] = in[i]*2 + 100;
}
)===";


void test_build_and_run() {
  namespace syclex = sycl::ext::oneapi::experimental;
  using source_kb = sycl::kernel_bundle<sycl::bundle_state::ext_oneapi_source>;
  using exe_kb = sycl::kernel_bundle<sycl::bundle_state::executable>;

  // setup a device, context and queue.
  sycl::device d{sycl::default_selector_v};
  sycl::context ctx{d};
  sycl::queue q{ctx, d};

  // check to see if backend supports creation of kernel_bundles from OpenCL C.
  bool ok = syclex::is_source_kernel_bundle_supported(ctx.get_backend(), syclex::source_language::opencl);
  if (!ok) {
    std::cout << "Apparently this backend does not support OpenCL C source kernel bundle extension: "
              << ctx.get_backend() << std::endl;
    return;
  }

  source_kb kbSrc = syclex::create_kernel_bundle_from_source(ctx, syclex::source_language::opencl, CLSource);
  // compile and build it!
  exe_kb kbExe = syclex::build(kbSrc);

  // check for my_kernel
  bool hasMyKernel = kbExe.ext_oneapi_has_kernel("my_kernel");
  if(!hasMyKernel){
    std::cout << "my_kernel should exist, but doesn't" << std::endl;
    return;
  }

  // get the kernel
  sycl::kernel my_kernel = kbExe.ext_oneapi_get_kernel("my_kernel");

  // check the number of args
  auto my_num_args = my_kernel.get_info<sycl::info::kernel::num_args>();
  if(my_num_args != 2){
    std::cout << "my_kernel should take 2 args" << std::endl;
    return;
  } 

  // alright, let's try this kernel out!!
  constexpr int N = 4;
  int InputArray[N] = {10, 11, 12, 13}; // 32-bit int matches OpenCL C's int
  int OutputArray[N] = {};

  {
    sycl::buffer InputBuf(InputArray, sycl::range<1>(N));
    sycl::buffer OutputBuf(OutputArray, sycl::range<1>(N));

    q.submit([&](sycl::handler &cgh) {
        // set the args to be the accessors to the InputBuf and OutputBuf
        cgh.set_arg(0, InputBuf.get_access<sycl::access::mode::read>(cgh));
        cgh.set_arg(1, OutputBuf.get_access<sycl::access::mode::write>(cgh));
        // now call it.
        cgh.parallel_for(sycl::range<1>{N}, my_kernel);
    });
    q.wait();
  } // when the buffer goes out of scope, the data is copied back to the source array.

  for (int i = 0; i < N; i++)
    std::cout << OutputArray[i] << " ";
  std::cout << std::endl;
  
}

int main() {
  test_build_and_run();
  return 0;
}

Additional Considerations: 

  • The CLSource string in this example only has one kernel (my_kernel), but it could contain others. 
  • This example uses the simplest form of build(), but there are overloads that take devices, build flags, and even a save_log if you need to capture any compilation errors. 
  • At the time of this writing, kernels compiled via the kernel_compiler: 
    • are not cached and do not participate in the kernel caching mentioned earlier; 
    • do not support specialization constants; and 
    • have limited to no debugging support. 
  • There is also a double compilation cost: the kernel string is first compiled to SPIR-V, and that SPIR-V is then compiled to device binary. So take care to call build() well outside any performance-critical path. 
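For instance, reusing kbSrc and the type aliases from the listing above, a build with options and a saved log might look like this sketch (property names per the kernel_compiler extension specification):

```cpp
// Pass build flags through to the backend compiler and capture its output.
std::string log;
exe_kb kbExe2 = syclex::build(
    kbSrc,
    syclex::properties{syclex::build_options{"-cl-fast-relaxed-math"},
                       syclex::save_log{&log}});
// If the build fails or warns, 'log' holds the compiler's diagnostics.
```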

The kernel_compiler replaces the older (also experimental) online_compiler. The main advantage the kernel_compiler has over its predecessor is easier coding and linking. The online_compiler required the user to work directly with backend interop objects such as cl_program or ze_module_handle_t. But as you can see in the code above, none of that is necessary when using the new kernel_compiler extension.

You can read more about the kernel_compiler here, and about its OpenCL C support specifically. It also supports SPIR-V as a source language, which you can read more about here.
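A sketch of the SPIR-V path, reusing the aliases from the listing above (loadFile is a hypothetical helper that reads a file into a byte vector):

```cpp
// SPIR-V sources are passed as raw bytes rather than a text string.
std::vector<std::byte> spv = loadFile("my_kernel.spv"); // hypothetical helper
source_kb kbFromSpv = syclex::create_kernel_bundle_from_source(
    ctx, syclex::source_language::spirv, spv);
exe_kb kbExeSpv = syclex::build(kbFromSpv);
```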

When is It Beneficial to Use kernel_compiler? 

Glad you asked! Some SYCL users already have large pre-existing libraries of kernels in OpenCL C or SPIR-V.  For those, the kernel_compiler isn’t a tool of last resort but a very useful extension that lets them leverage those libraries.  

All the Cool Compilation Options! 

I hope you’ve enjoyed this tour of SYCL kernel compilation. Who knew it could be so fascinating? Offloading computation to GPUs, FPGAs or other devices is just so cool, and with SYCL there are so many ways these pieces can be arranged. Hopefully you have found something that will help you reimagine your app or its workflows.  Thanks for reading.  

Check out more kernel_compiler Extension Details  

Find all the information you need to get started: 

Download the Compiler 

If you haven’t yet, consider downloading the latest release of the Intel oneAPI DPC++/C++ Compiler, which includes the kernel_compiler experimental feature. Get it standalone for Linux* or Windows*, from popular package managers (Linux only), or as part of the 2024.1 Intel® oneAPI Base Toolkit.  

We are looking forward to your feedback.

Learn more about the 2024.1 Release