Smaller Release Packages: Discover Device Image Compression with Intel® oneAPI DPC++/C++ Compiler 2025.0.1

December 4, 2024

Device image compression compresses SYCL* kernels (device code) during compilation and seamlessly decompresses them on demand at application runtime. This reduces the size of fat binaries for both Just-in-Time (JIT) and Ahead-of-Time (AOT) compilation, where the device code is embedded as SPIR-V* or target-specific ISA, respectively.

  • Are you struggling with bloated release packages?
  • Do you dream of adding more AOT compilation targets without turning your binaries into behemoths?
  • Do you want to reduce the strain on your customers’ download bandwidth?

The newly added device image compression support in the Intel® oneAPI DPC++/C++ Compiler 2025.0.1 could be just what you've been looking for.

Before we dive into the details, let us explore how device code is generated and stored with the different types of SYCL compilation.

Compilation Types and Device Code Management with SYCL

With functional and performance portability as its key tenets, C++ with SYCL, and its Intel implementation, DPC++, lets you offload SYCL kernels to different devices, such as GPUs and FPGAs. Depending on whether you know which devices your customers have, as well as your application's performance and storage requirements, you can choose between JIT and AOT compilation.

With JIT, you are not required to know beforehand what devices your customers have. During compilation, the SYCL kernels are stored in the fat multiarchitecture binary in an intermediate format called Standard Portable Intermediate Representation (SPIR-V).

When the application is executed, the C++ with SYCL runtime determines which devices are available on the customer’s machine, and just before the kernels are needed, the SPIR-V kernels are passed to the driver of the target device. The device driver then converts the SPIR-V code into machine code. It should be noted that for JIT:

1. Device code is stored in SPIR-V format in the fat binary.
2. The overhead of compiling SPIR-V to device machine code is incurred at runtime.
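
To make this concrete, here is a minimal C++ with SYCL sketch (the file layout, sizes, and kernel body are illustrative, not taken from the case studies). Compiled with a plain icpx -fsycl invocation, its kernel is embedded as SPIR-V in the fat binary and JIT-compiled by the device driver at first use:

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;  // the runtime selects an available device at execution time

  std::vector<int> data(1024, 1);
  {
    sycl::buffer<int> buf(data);
    q.submit([&](sycl::handler &h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      // Stored as SPIR-V in the fat binary; the device driver JIT-compiles it
      // to machine code just before the first launch.
      h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
        acc[i] *= 2;
      });
    });
  }  // buffer destruction waits for the kernel and copies results back to data
}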

With AOT, you can leverage knowledge about your customers’ machine to improve the runtime performance of your application. During compilation, SYCL kernels are compiled down to the machine code of the target devices and then embedded into the fat binary. For instance, if the Intel® Data Center GPU Max Series and the integrated GPU in the Intel® Core™ Ultra Processor are among the commonly used GPUs by your customers, you can AOT compile your application using the following compiler flag:

-fsycl-targets=intel_gpu_pvc,intel_gpu_mtl_h

This will ensure that SYCL kernels are compiled down to machine code for both GPU targets ahead of time. Note that, for AOT:

1. Device code is stored as machine code of the target devices in the fat binary.
2. The overhead of compiling to the target devices’ machine code is incurred during the application’s compilation.
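
As a concrete illustration (the source and output file names are placeholders), an AOT build for those two GPU targets could look like this:

icpx -fsycl -fsycl-targets=intel_gpu_pvc,intel_gpu_mtl_h app.cpp -o app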

Unlike JIT, where each SYCL kernel is stored once in SPIR-V format, AOT compilation stores each SYCL kernel multiple times as device-specific code for the different target devices. Therefore, as you add support for more AOT targets, the size of the fat binary increases significantly. For instance, compiling intel/torch-xpu-ops (the SYCL implementation of common PyTorch* Torch Compile operators) with three AOT targets results in truly large fat binaries of 1.8 GB and 1.1 GB.

And that’s where device image compression comes to the rescue: the idea is to compress the device code during compilation and then embed the compressed device code into the fat binary. The device code will be decompressed at runtime as needed.

Compiling intel/torch-xpu-ops with image compression results in a staggering ~85% reduction in fat binary sizes!

Of course, there are drawbacks to consider. You’ll need to account for the decompression overhead at runtime. There is no easy, definitive answer as to whether the decompression overhead will be significant or where in the application’s execution flow it will have the biggest impact; it depends on your application. In the section “Device Image Compression in Real-World Applications,” we will explore this further with real-world case studies and get an idea of what to look out for.

But before we dive into case studies, let’s ensure we are on the same page with a quick overview of device images and the corresponding fundamental unit of (de)compression.

Device Images in SYCL

The SYCL 2020 specification defines device images as a representation of one or more SYCL kernels in an implementation-defined format. With the Intel oneAPI DPC++/C++ Compiler, the “implementation-defined format” could be either SPIR-V or machine code of the target device for JIT and AOT compilation, respectively.

I would like to draw your attention to another key aspect of device images: They can consist of one or more SYCL kernels!

During compilation, the compiler determines which SYCL kernels, and how many of them, to combine into each device image.

But why can we not have one kernel per device image or all kernels in just one device image?

Consider the following pseudocode with three SYCL kernels:

. . .
void foo(int val) {}
void bar() {}

void KernelFoo1() { foo(1); }
void KernelFoo2() { foo(2); }
void KernelBar3() { bar(); }
. . .

KernelFoo1() and KernelFoo2() call foo() with different arguments, while KernelBar3() calls bar(). If we place each kernel into a separate device image, the definition of foo() will be included in both the device image for KernelFoo1() and the device image for KernelFoo2(). This duplication will increase the size of the fat binary.

On the other hand, combining all kernels into a single device image might unnecessarily increase the JIT overhead. For instance, if the application needs to invoke only KernelBar3, it will still incur the JIT overhead for KernelFoo1 and KernelFoo2 because they are bundled together. Another issue with a single device image is that kernels can be specialized for specific target devices. For example, kernels designed for FPGA might use FPGA-specific extensions, leading to JIT compilation failures on other devices, even if those kernels are never called on them.

By default, the Intel oneAPI DPC++/C++ Compiler automatically splits SYCL kernels across device images based on how the kernels are used and whether they use any target-device-specific extensions.

Note:
You can override the default splitting of SYCL kernels across device images using the
-fsycl-device-code-split=<per_kernel|per_source|auto|off>
flag.

Read more about it in the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference.
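
For illustration (the file names are placeholders), forcing one device image per kernel could look like this; per_kernel avoids bundling unrelated kernels together at the cost of duplicating shared functions across images:

icpx -fsycl -fsycl-device-code-split=per_kernel app.cpp -o app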

To reduce the size of fat binaries, we compress device images during compilation and embed compressed images into binaries instead. Now let’s explore two real-world case studies to see how device image compression has been successfully integrated and the tangible benefits it has brought to these applications.

Device Image Compression in Real-World Applications

Case Study #1: Intel/torch-xpu-ops

Intel/torch-xpu-ops consists of SYCL implementations of PyTorch’s ATen operators, which are used when offloading PyTorch applications to Intel® GPUs. With AOT compilation for three targets, all the SYCL kernels are embedded into two fat binaries: libtorch_xpu_ops_sycl_unary_binary_kernels.so and libtorch_xpu_ops_sycl_kernels.so.

Fat binary | Size before compression | Size after compression | Size reduction
libtorch_xpu_ops_sycl_unary_binary_kernels.so | 1.1 GB | 149 MB | 86.3%
libtorch_xpu_ops_sycl_kernels.so | 1.8 GB | 256 MB | 85.7%

As evident from the table above, device image compression reduces the size of these binaries by more than 85%. Great! Now, let’s talk about the decompression overhead. This overhead occurs at runtime when the application uses a SYCL kernel for the first time. For example, consider the following PyTorch code:

. . .
import torch                                         # No overhead
b = torch.zeros(2, 3, device=torch.device('xpu'))    # 2.3 ms overhead
c = torch.zeros(2, 3, device=torch.device('xpu'))    # No overhead
. . .

The decompression overhead is incurred when the Torch operators are called for the first time. In this case, the decompression overhead is minuscule: executing the zeros Torch operator on a single-tile Intel Data Center GPU Max Series card took about 0.2 seconds, while the decompression overhead was just 1% of that (2.3 ms). Once the SYCL runtime decompresses a device image containing the kernel for the first time, it caches the decompressed image. This means that subsequent kernel invocations will not incur additional decompression overhead.

Case Study #2: Blender*

Starting with Blender 3.3, oneAPI can be used to offload complex path-tracing scenes, rendering of geometric figures, etc., to Intel GPUs. Blender’s oneAPI backend consists of SYCL kernels for common tasks like ray tracing, and with AOT compilation for five targets, all the SYCL kernels are embedded into a fat binary named cycles_kernel_oneapi_aot.dll.

Without compression, the size of cycles_kernel_oneapi_aot.dll was 197 MB. After device image compression, the size was reduced to just 38 MB, an impressive 80.7% size reduction. The decompression overhead, however, was ~0.7-0.9 seconds per benchmark in the Cycles benchmarking suite. In the worst case, the decompression overhead was about 40% of the total benchmark execution time, which is prohibitively high.

But why is the decompression overhead high for Blender?

One probable reason is that Blender has a dedicated ‘warm-up’ phase where SYCL kernels are speculatively loaded into memory to ensure a seamless user interface experience. During this phase, SYCL kernels are brought into memory even if they are not used by the benchmark, resulting in a high decompression overhead relative to the actual benchmark execution time.

This also implies that once the warm-up phase is over, there is no additional decompression overhead, which leads to a net performance benefit and improved end-user experience.

Key Takeaways from the Case Studies:

  1. Device image compression can significantly reduce the size of your binaries: In both case studies, the reduction in binary sizes was more than 80%.
  2. Whether the decompression overhead is significant depends on how your application manages SYCL kernel life cycles.
  3. The decompression overhead occurs only the first time a SYCL kernel is executed. The SYCL runtime caches the decompressed kernel and there is no decompression overhead in subsequent calls to the same SYCL kernel.

Pro-Tip:
You can control when the decompression happens during the execution of your application. Blender, for example, uses kernel_bundle to trigger the building of SYCL kernels during a dedicated warm-up phase. You can learn more about kernel_bundles in the SYCL 2020 spec and in an example of controlling compilation.  

Using Device Image Compression with the Latest Compiler

Without further ado, let us talk about how to use device image compression in your application to shrink your release packages. The answer is simple:

Just add the --offload-compress flag to the Intel oneAPI DPC++/C++ Compiler invocation!

Irrespective of whether you are doing JIT or AOT compilation, and whether you are offloading SYCL kernels to Intel GPUs or to AMD* and Nvidia* GPUs (using the Codeplay* plugins), all you need to do to use device image compression is add the --offload-compress flag to your application’s compilation command.

If you compile and link the device images of your application separately, you need to specify the --offload-compress option during the device code linking step, along with -fsycl-link.

For instance:

icpx --offload-compress -fsycl -fsycl-link device_image1.o device_image2.o -o linked_device_images.o


Pro-Tip:
The Intel oneAPI DPC++/C++ Compiler gives you fine-grained control over the (de)compression algorithm.

It uses the Zstandard* fast real-time compression algorithm underneath for (de)compression. Zstandard offers different compression levels to trade off compression ratio against compression and decompression speed. The compiler uses Zstandard compression level 10 by default, but you can override this with the --offload-compression-level=<level> flag during compiler invocation.
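
For example (the level and file names are illustrative; higher levels generally trade longer compression time for smaller binaries), an invocation overriding the default level could look like this:

icpx -fsycl --offload-compress --offload-compression-level=15 app.cpp -o app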

Intel’s LLVM* GitHub project’s compression test example provides a code sample showing how to use it.

Download the Compiler Now

You can download the Intel oneAPI DPC++/C++ Compiler and the Intel® C++ Essentials from Intel’s oneAPI Developer Tools product page.

This version is also included in the Intel® Toolkits, which bundle an advanced set of foundational tools, libraries, and analysis, debug, and code-migration tools.

You may also want to check out our contributions to the LLVM compiler project on GitHub.

Questions?

  • If you have questions about image compression or the case studies presented in this blog, please don’t hesitate to start a discussion on the Intel LLVM GitHub project and tag me (@uditagarwal97).
  • If you have a feature request or want to report a bug, please file a feature request or bug report on the issues board.

Additional Resources