oneAPI Debug Tools for SYCL* and OpenMP* Development

Intel® oneAPI Programming Guide

ID 771723

Date 3/31/2025

Version

Public

The following tools are available to help with debugging the SYCL* and OpenMP* offload process.

Tools to debug SYCL* and OpenMP* offload process
Tool	When to Use
Environment variables	Environment variables allow you to gather diagnostic information from the OpenMP and SYCL runtimes at program execution with no modifications to your program.
The onetrace tool from Profiling Tools Interfaces for GPU (PTI for GPU)	When using the Intel® oneAPI Level Zero and OpenCL™ backends for SYCL and OpenMP Offload, this tool can be used to debug backend errors and for performance profiling on both the host and device. Learn more: Onetrace tool GitHub; PTI for GPU GitHub; GPU Compute/Media Hotspots Analysis.
Intercept Layer for OpenCL™ Applications	When using the OpenCL™ backend for SYCL and OpenMP Offload, this library can be used to debug backend errors and for performance profiling on both the host and device (it offers broader functionality than onetrace).
Intel® Distribution for GDB*	Used for source-level debugging of the application, typically to inspect logical bugs, on the host and any devices you are using (CPU, GPU, FPGA emulation).
Intel® Inspector	This tool helps to locate and debug memory and threading problems, including those that can cause offloading to fail. NOTE: Intel Inspector is included in the Intel® HPC Toolkit.
In-application debugging	In addition to these tools and runtime-based approaches, the developer can locate problems using other approaches. For example: comparing kernel output to expected output; sending intermediate results back through variables created for debugging purposes; printing results from within kernels. NOTE: Both SYCL and OpenMP allow printing to stdout from within an offload region. Be sure to note which SIMD lane or thread is providing the output.
SYCL Exception Handler	Some DPC++ programming errors are returned as exceptions by the SYCL runtime during program execution. They can help you diagnose errors in your code that are flagged at runtime. For more details and examples, refer to Using SYCL Exceptions. For samples that demonstrate SYCL exceptions, refer to: Guided Matrix Multiplication Exception; Guided Matrix Multiplication Invalid Contexts; Guided Matrix Multiplication Race Condition.
Intel® Advisor	Use to ensure Fortran, C, C++, OpenCL™, and SYCL applications realize full performance potential on modern processors.
Intel® VTune™ Profiler	Use to gather performance data either on the native system or on a remote system.
OpenMP* directives	Offload and Optimize OpenMP* Applications with Intel Tools describes how to use OpenMP* directives to add parallelism to your application.

Debug Environment Variables

Both the OpenMP* and SYCL offload runtimes, as well as Level Zero, OpenCL, and the Shader Compiler, provide environment variables that help you understand the communication between the host and offload device. The variables also allow you to discover or control the runtime chosen for offload computations.

OpenMP* Offload Environment Variables

There are several environment variables that you can use to understand how OpenMP Offload works and control which backend it uses.

NOTE:

OpenMP is not supported for FPGA devices.

OpenMP* Offload Environment Variables
Environment Variable	Description
LIBOMPTARGET_DEBUG=<Num>	Enables debug output from the OpenMP Offload runtime. See details in Runtimes. The numbers in parentheses give the `<Num>` values at which each item is reported. It reports: the available runtimes detected and used (1,2); when the chosen runtime is started and stopped (1,2); details on the offload device used (1,2); support libraries loaded (1,2); size and address of all memory allocations and deallocations (1,2); information on every data copy to and from the device, or device mapping in the case of unified shared memory (1,2); when each kernel is launched and details on the launch (arguments, SIMD width, group information, etc.) (1,2); which Level Zero/OpenCL API functions are invoked (function name, arguments/parameters) (2). Values: `<Num>=0`: disabled. `<Num>=1`: displays basic debug information from the plugin actions such as device detection, kernel compilation, memory copy operations, kernel invocations, and other plugin-dependent actions. `<Num>=2`: additionally displays which GPU runtime API functions are invoked with which arguments/parameters. Default: 0
LIBOMPTARGET_INFO=<Num>	Controls whether basic offloading information is displayed by the offload runtime, and lets you request different types of runtime information from libomptarget. See details in Runtimes. The value in parentheses is the `<Num>` that enables each item: prints all data arguments upon entering an OpenMP device kernel (1); indicates when a mapped address already exists in the device mapping table (2); dumps the contents of the device pointer map if target offloading fails (4); indicates when an entry is changed in the device mapping table (8); indicates when data is copied to and from the device (32). Values: 0, 1, 2, 4, 8, 32. Default: 0
LIBOMPTARGET_PLUGIN_PROFILE=<Enable>[,<Unit>]	`<Enable> := 1 | T` `<Unit> := usec | unit_usec` Enables basic plugin profiling for offloaded OpenMP code and displays the result when the program finishes. It displays: total data transfer times (read and write); data allocation times; module build times (just-in-time compile); the execution time of each kernel. Values: `F` - disabled; `T` - enabled with timings in milliseconds; `T,usec` - enabled with timings in microseconds. Default: `F`. Example: `export LIBOMPTARGET_PLUGIN_PROFILE=T,usec`. Microsecond is the default unit if `<Unit>` is not specified.
LIBOMPTARGET_PLUGIN=<Name>	`<Name> := LEVEL0 | OPENCL | CUDA | X86_64 | NIOS2 | level0 | opencl | cuda | x86_64 | nios2` Designates the offload plugin (backend) to use for OpenMP offload execution. The offload runtime does not try to load other RTLs if this option is used. NOTE: The Level Zero backend is only supported for GPU devices. Values: `LEVEL0` or `LEVEL_ZERO` - uses the Level Zero backend; `OPENCL` - uses the OpenCL™ backend. Default: for GPU offload devices, `LEVEL0`; for CPU or FPGA offload devices, `OPENCL`.
LIBOMPTARGET_PROFILE=<FileName>	Allows libomptarget to generate time profile output similar to Clang’s `-ftime-trace` option. See details in Runtimes
LIBOMPTARGET_DEVICES=<DeviceKind>	`<DeviceKind> := DEVICE | SUBDEVICE | SUBSUBDEVICE | ALL | device | subdevice | subsubdevice | all` Controls how subdevices are exposed to users. `DEVICE/device`: Only top-level devices are reported as OpenMP devices, and the `subdevice` clause is supported. `SUBDEVICE/subdevice`: Only 1st-level subdevices are reported as OpenMP devices, and the `subdevice` clause is ignored. `SUBSUBDEVICE/subsubdevice`: Only 2nd-level subdevices are reported as OpenMP devices, and the `subdevice` clause is ignored. On Intel GPUs using the Level Zero backend, limiting the `subsubdevice` to a single compute slice within a stack also requires setting the additional GPU compute runtime environment variable `CFESingleSliceDispatchCCSMode=1`. `ALL/all`: All top-level devices and their subdevices are reported as OpenMP devices, and the `subdevice` clause is ignored. This is not supported on Intel GPUs and is being deprecated. Default: Equivalent to `<DeviceKind>=device`
LIBOMPTARGET_LEVEL0_MEMORY_POOL=<Option>	`<Option> := 0 | <PoolInfoList>` `<PoolInfoList> := <PoolInfo>[,<PoolInfoList>]` `<PoolInfo> := <MemType>[,<AllocMax>[,<Capacity>[,<PoolSize>]]]` `<MemType> := all | device | host | shared` `<AllocMax> :=` positive integer or empty, max allocation size in MB. `<Capacity> :=` positive integer or empty, number of allocations from a single block. `<PoolSize> :=` positive integer or empty, max pool size in MB. Controls how the reusable memory pool is configured. The pool is a list of memory blocks that can serve at least `<Capacity>` allocations of up to `<AllocMax>` size from a single block, with total size not exceeding `<PoolSize>`. Default: Equivalent to `<Option>=device,1,4,256,host,1,4,256,shared,8,4,256`
LIBOMPTARGET_LEVEL0_STAGING_BUFFER_SIZE=<Num>	Sets the staging buffer size to `<Num>` KB. Staging buffer is used in copy operations between host and device as a temporary storage for two-step copy operation. The buffer is only used for discrete devices. Default: 16
LIBOMPTARGET_LEVEL_ZERO_COMMAND_BATCH=<Value>	`<Value> := <Type>[,<Count>]` `<Type> := none | NONE | copy | COPY | compute | COMPUTE` `<Count> :=` maximum number of commands to batch. Enables command batching for a target region. `<Type>=none|NONE`: Disables command batching. `<Type>=copy|COPY`: Enables command batching for a target region for data transfer. `<Type>=compute|COMPUTE`: Enables command batching for a target region for data transfer and compute, disabling use of the copy engine. If `<Type>` is either `copy` or `compute` (enabled) and `<Count>` is not specified, batching is performed for all eligible commands for the target region. Default: `<Type>=none` (disabled)
LIBOMPTARGET_LEVEL_ZERO_USE_IMMEDIATE_COMMAND_LIST=<Value>	`<True> := 1 | T | t` `<False> := 0 | F | f` `<Bool> := <True> | <False>` `<Value> := <Bool> | compute | COMPUTE | copy | COPY | all | ALL` `compute`: Enables the immediate command list for kernel submission. `copy`: Enables the immediate command list for memory copy operations. `all`: Enables the immediate command list for kernel submission and memory copy operations. `<True>`: Equivalent to `compute`. `<False>`: The immediate command list is disabled. Default: `all`
OMP_TARGET_OFFLOAD=MANDATORY	This variable is defined by the OpenMP standard: https://www.openmp.org/spec-html/5.1/openmpse74.html#x340-5150006.17
ONEAPI_DEVICE_SELECTOR	This device selection environment variable can be used to limit the choice of devices available when the OpenMP application is run. It is useful for limiting devices to a certain type (like GPUs or accelerators) or to certain backends (like Level Zero or OpenCL). The ONEAPI_DEVICE_SELECTOR syntax is shared between SYCL and OpenMP and also allows devices to be chosen. See the oneAPI DPC++ Compiler documentation for a full description.
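
As a rough illustration of how the variables in this table combine, the following sketch prepares a diagnostic run of an OpenMP offload program. The binary name `./my_omp_app` and the log file name are hypothetical placeholders, not part of any Intel tool; the variable names and values are taken from the table above.

```shell
# Hypothetical placeholder: point APP at your own OpenMP offload binary.
APP="${APP:-./my_omp_app}"

# Require offload, so a failure to offload is reported rather than hidden.
export OMP_TARGET_OFFLOAD=MANDATORY
# Level 1: device detection, kernel compilation, memory copies, kernel launches.
export LIBOMPTARGET_DEBUG=1
# Kernel and data-transfer timings, reported in microseconds.
export LIBOMPTARGET_PLUGIN_PROFILE=T,usec

if [ -x "$APP" ]; then
    # Capture stdout and stderr together so the runtime diagnostics land in one file.
    "$APP" > omp_offload_debug.log 2>&1
    echo "diagnostics written to omp_offload_debug.log"
else
    echo "placeholder $APP not found; variables are set for the next run"
fi
```

Raising `LIBOMPTARGET_DEBUG` to 2 adds the backend API calls to the output, which tends to be much more verbose, so starting at 1 is usually the more readable choice.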

SYCL* and DPC++ Environment Variables

The oneAPI DPC++ Compiler supports all standard SYCL environment variables. The full list is available from GitHub. Of interest for debugging are the following SYCL environment variables, plus an additional Level Zero environment variable.

SYCL* and DPC++ Environment Variables
Environment Variable	Description
SYCL_DEVICE_FILTER	This complex environment variable allows you to limit the runtimes, compute device types, and compute device IDs used by the runtime to a subset of all available combinations. The compute device IDs correspond to those returned by the SYCL API, `clinfo`, or `sycl-ls` (with the numbering starting at 0) and have no relation to whether the device with that ID is of a certain type or supports a specific runtime. Using a programmatic special selector (like `gpu_selector`) to request a device filtered out by `SYCL_DEVICE_FILTER` will cause an exception to be thrown. Refer to the Environment Variables descriptions in GitHub for additional details: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md Example values include: `opencl:cpu` - use only the OpenCL™ runtime on all available CPU devices; `opencl:gpu` - use only the OpenCL runtime on all available GPU devices; `opencl:gpu:2` - use only the OpenCL runtime on only the third device, which also has to be a GPU; `level_zero:gpu:1` - use only the Level Zero runtime on only the second device, which also has to be a GPU; `opencl:cpu,level_zero` - use only the OpenCL runtime on the CPU device, or the Level Zero runtime on any supported compute device. Default: use all available runtimes and devices
ONEAPI_DEVICE_SELECTOR	This device selection environment variable can be used to limit the choice of devices available when the SYCL application is run. It is useful for limiting devices to a certain type (like GPUs or accelerators) or to certain backends (like Level Zero or OpenCL). This device selection mechanism replaces SYCL_DEVICE_FILTER. The ONEAPI_DEVICE_SELECTOR syntax is shared with OpenMP and also allows devices to be chosen. Refer to the oneAPI DPC++ Compiler documentation for a full description: https://intel.github.io/llvm/EnvironmentVariables.html
SYCL_UR_TRACE	This environment variable enables debug output from the runtime. Values: 1 - report SYCL plugins and devices discovered and used; 2 - report SYCL API calls made, including arguments and result values; -1 - provides all available tracing. Default: disabled
ZE_DEBUG	This environment variable enables debug output from the Level Zero backend when used with the runtime. It reports: Level Zero APIs called; Level Zero event information. Value: variable defined with any value - enabled. Default: disabled
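
The following sketch shows one way these variables might be combined for a SYCL run restricted to GPUs on the Level Zero backend. The binary name `./my_sycl_app` and the log file name are hypothetical placeholders; the `sycl-ls` check assumes the tool is on your `PATH` and is skipped otherwise.

```shell
# Hypothetical placeholder: point APP at your own SYCL binary.
APP="${APP:-./my_sycl_app}"

# Restrict the runtime to GPU devices on the Level Zero backend.
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
# Level 1: report the plugins and devices discovered and used.
export SYCL_UR_TRACE=1
# Defining ZE_DEBUG with any value enables Level Zero debug output.
export ZE_DEBUG=1

# If sycl-ls is installed, list the devices the runtime can see.
if command -v sycl-ls >/dev/null 2>&1; then
    sycl-ls
fi

if [ -x "$APP" ]; then
    "$APP" > sycl_offload_debug.log 2>&1
else
    echo "placeholder $APP not found; variables are set for the next run"
fi
```

Because a selector that filters out every device makes device selection fail at runtime, checking the device list first can save a confusing debugging session.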

Environment Variables that Produce Diagnostic Information for Support

The Level Zero backend provides a few environment variables that can be used to control behavior and aid in diagnosis.

An additional source of debug information comes from the Intel® Graphics Compiler, which is called by the Level Zero or OpenCL backends (used by both the OpenMP Offload and SYCL/DPC++ Runtimes) at runtime or during Ahead-of-Time (AOT) compilation. Intel® Graphics Compiler creates the appropriate executable code for the target offload device. The full list of these environment variables can be found at https://github.com/intel/intel-graphics-compiler/blob/master/documentation/configuration_flags.md. The two that are most often needed to debug performance issues are:

IGC_ShaderDumpEnable=1 (default=0) causes all LLVM, assembly, and ISA code generated by the Intel® Graphics Compiler to be written to /tmp/IntelIGC/<application_name>
IGC_DumpToCurrentDir=1 (default=0) writes all the files created by IGC_ShaderDumpEnable to your current directory instead of /tmp/IntelIGC/<application_name>. Since this is potentially a lot of files, it is recommended to create a temporary directory just for the purpose of holding these files.
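
A minimal sketch of the recommended workflow, creating a dedicated temporary directory and running the program from inside it so the dump files stay together. The binary name `./my_offload_app` and the `igc_dump` directory prefix are hypothetical placeholders.

```shell
# Hypothetical placeholder: point APP at your own offload binary.
APP="${APP:-./my_offload_app}"

# Dedicated directory so the (potentially many) dump files stay in one place.
DUMP_DIR="$(mktemp -d "${TMPDIR:-/tmp}/igc_dump.XXXXXX")"

export IGC_ShaderDumpEnable=1   # write LLVM, assembly, and ISA code to disk
export IGC_DumpToCurrentDir=1   # dump into the current directory instead of /tmp/IntelIGC

if [ -x "$APP" ]; then
    # Resolve an absolute path first, because the run happens inside DUMP_DIR.
    APP_ABS="$(cd "$(dirname "$APP")" && pwd)/$(basename "$APP")"
    (cd "$DUMP_DIR" && "$APP_ABS")
fi

echo "IGC dump directory: $DUMP_DIR"
```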

If you have a performance issue with your OpenMP offload or SYCL offload application that arises between different versions of Intel® oneAPI, when using different compiler options, when using the debugger, and so on, then you may be asked to enable IGC_ShaderDumpEnable and provide the resulting files. For more information on compatibility, see oneAPI Library Compatibility.

Offload Intercept Tools

In addition to debuggers and diagnostics built into the offload software itself, it can be quite useful to monitor offload API calls and the data sent through the offload pipeline. For Level Zero, run your application as an argument to the onetrace or ze_tracer tool, which intercepts and reports on various aspects of the Level Zero API calls made by your application. For OpenCL™, you can add a library to LD_LIBRARY_PATH that intercepts and reports on all OpenCL calls, and then use environment variables to control what diagnostic information is reported to a file. You can also use onetrace or cl_tracer to report on various aspects of the OpenCL API calls made by your application; as with Level Zero, your application is run as an argument to the tool.

Intercept Layer for OpenCL™ Applications

This library collects debugging and performance data when OpenCL is used as the backend to your SYCL or OpenMP offload program. It can help you detect buffer overwrites, memory leaks, and mismatched pointers, and can provide more detailed information about runtime error messages, allowing you to diagnose these issues when CPU, FPGA, or GPU devices are used for computation. Note that you will get nothing useful if you use ze_tracer on a program that uses the OpenCL backend, or the Intercept Layer for OpenCL™ Applications library and cl_tracer on a program that uses the Level Zero backend.

Additional resources:

Extensive information on building and using the Intercept Layer for OpenCL Applications is available from https://github.com/intel/opencl-intercept-layer.

NOTE:
For best results, run cmake with the following flags: -DENABLE_CLIPROF=TRUE -DENABLE_CLILOADER=TRUE
Information about a similar tool (CLIntercept) is available from https://github.com/gmeeker/clintercept and https://sourceforge.net/p/clintercept/wiki/Home/.
Information on the controls for the Intercept Layer for OpenCL™ Applications can be found at https://github.com/intel/opencl-intercept-layer/blob/master/docs/controls.md.
Information about optimizing for GPUs is available from the Intel oneAPI GPU Optimization Guide.
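
As a hedged sketch of how a run under the Intercept Layer might look on Linux: the example below assumes the `cliloader` utility (built when `-DENABLE_CLILOADER=TRUE` is set) is on your `PATH`, and that controls can be set through environment variables carrying a `CLI_` prefix, as I understand the controls documentation to describe. The control names `CallLogging` and `DevicePerformanceTiming` and the binary name `./my_offload_app` should be checked against that documentation and your own setup before use.

```shell
# Hypothetical placeholder: point APP at your own offload binary.
APP="${APP:-./my_offload_app}"

# The Intercept Layer only sees OpenCL traffic, so select the OpenCL backend.
export ONEAPI_DEVICE_SELECTOR=opencl:gpu

# Assumed control names (verify against docs/controls.md in the project).
export CLI_CallLogging=1              # log each OpenCL API call
export CLI_DevicePerformanceTiming=1  # collect per-kernel device timings

if command -v cliloader >/dev/null 2>&1 && [ -x "$APP" ]; then
    cliloader "$APP" > cl_intercept.log 2>&1
    CLI_STATUS="ran"
else
    CLI_STATUS="skipped"   # cliloader or the placeholder binary is missing
fi
echo "Intercept Layer run: $CLI_STATUS"
```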

Profiling Tools Interfaces for GPU (onetrace, cl_tracer, and ze_tracer)

Like the Intercept Layer for OpenCL™ Applications, these tools collect debugging and performance data from applications that use the OpenCL and Level Zero backends for offload via OpenMP* or SYCL. Note that Level Zero can only be used as the backend for computations that happen on the GPU (there is no Level Zero backend for the CPU or FPGA at this time). The onetrace tool is part of the Profiling Tools Interfaces for GPU (PTI for GPU) project, found at https://github.com/intel/pti-gpu. This project also contains the ze_tracer and cl_tracer tools, which trace activity only from the Level Zero or OpenCL offload backend, respectively. The ze_tracer and cl_tracer tools produce no output when the application uses the other backend, while onetrace provides output no matter which offload backend you use.

The onetrace tool is distributed as source. Instructions for how to build the tool are available from https://github.com/intel/pti-gpu/tree/master/tools/onetrace. The tool provides the following features:

Call logging: This mode allows you to trace all standard Level Zero (L0) and OpenCL™ API calls along with their arguments and return values annotated with time stamps. Among other things, this can give you supplemental information on any failures that occur when a host program tries to make use of an attached compute device.
Host and device timing: These provide the duration of all API calls, the duration of each kernel, and the total runtime of the application.
Device Timeline mode: Gives time stamps for each device activity. All the time stamps are in the same (CPU) time scale.
Chrome Call Logging mode: Dumps API calls to JSON format that can be opened in chrome://tracing browser tool.

These data can help debug offload failures or performance issues.
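
A possible invocation combining the call logging and timing modes listed above is sketched here. The long option names (`--call-logging`, `--host-timing`, `--device-timing`) are given as I understand them from the PTI for GPU project and should be confirmed against the usage text of the build you have; the binary name `./my_offload_app` is a hypothetical placeholder.

```shell
# Hypothetical placeholder: point APP at your own offload binary.
APP="${APP:-./my_offload_app}"

if command -v onetrace >/dev/null 2>&1 && [ -x "$APP" ]; then
    # The application is passed as an argument to the tool.
    # Option names are assumptions; confirm them with your onetrace build.
    onetrace --call-logging --host-timing --device-timing "$APP" > onetrace.log 2>&1
    ONETRACE_STATUS="ran"
else
    ONETRACE_STATUS="skipped"   # onetrace or the placeholder binary is missing
fi
echo "onetrace: $ONETRACE_STATUS"
```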

Intel® Distribution for GDB*

The Intel® Distribution for GDB* is an application debugger that allows you to inspect and modify the program state. With the debugger, both the host part of your application and kernels that are offloaded to a device can be debugged seamlessly in the same debug session. The debugger supports the CPU, GPU, and FPGA-emulation devices. Major features of the tool include:

Automatically attaching to the GPU device to listen to debug events
Automatically detecting JIT-compiled, or dynamically loaded, kernel code for debugging
Defining breakpoints (both inside and outside of a kernel) to halt the execution of the program
Listing the threads; switching the current thread context
Listing active SIMD lanes; switching the current SIMD lane context per thread
Evaluating and printing the values of expressions in multiple thread and SIMD lane contexts
Inspecting and changing register values
Disassembling the machine instructions
Displaying and navigating the function call-stack
Source- and instruction-level stepping
Non-stop and all-stop debug mode
Recording the execution using Intel Processor Trace (CPU only)

For more information and links to full documentation for Intel Distribution for GDB, see Get Started with Intel® Distribution for GDB on Linux* Host | Windows* Host.
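
For a quick non-interactive look at a program, a batch session along the following lines may be enough. This sketch assumes the debugger is installed as `gdb-oneapi` and uses only standard GDB options and commands; the binary name `./my_offload_app` is a hypothetical placeholder, and `main` can be swapped for a `file:line` location inside a kernel.

```shell
# Hypothetical placeholder: point APP at your own binary built with debug info.
APP="${APP:-./my_offload_app}"

if command -v gdb-oneapi >/dev/null 2>&1 && [ -x "$APP" ]; then
    # Stop at main, list the threads, print the call stack, then exit.
    gdb-oneapi -batch \
        -ex "break main" \
        -ex "run" \
        -ex "info threads" \
        -ex "backtrace" \
        --args "$APP"
    GDB_STATUS="ran"
else
    GDB_STATUS="skipped"   # debugger or the placeholder binary is missing
fi
echo "gdb-oneapi: $GDB_STATUS"
```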

Intel® Inspector for Offload

Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multithreaded applications. It can be used to verify correctness of the native part of the application as well as dynamically generated offload code.

Unlike the tools and techniques above, Intel Inspector cannot be used to catch errors in offload code that is communicating with a GPU or an FPGA. Instead, Intel Inspector requires the SYCL or OpenMP runtime to be configured to execute kernels on a CPU target. In general, this means defining the following environment variables prior to an analysis run.

To configure a SYCL application to run kernels on a CPU device
```
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
```

To configure an OpenMP application to run kernels on a CPU device
```
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_DEVICETYPE=cpu
```

To enable code analysis and tracing in JIT compilers or runtimes
```
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false
```

Use one of the following commands to start analysis from the command line. You can also start from the Intel Inspector graphical user interface.

Memory: inspxe-cl -c mi3 -- <app> [app_args]
Threading: inspxe-cl -c ti3 -- <app> [app_args]

View the analysis result using the following command: inspxe-cl -report=problems -report-all
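
Put together, the configuration and analysis steps above might be scripted roughly as follows. The binary name `./my_offload_app` is a hypothetical placeholder; everything else is taken from the commands and variables in this section, and the analysis is skipped when `inspxe-cl` is not on the `PATH`.

```shell
# Hypothetical placeholder: point APP at your own offload binary.
APP="${APP:-./my_offload_app}"

# Run SYCL and OpenMP kernels on the CPU so Intel Inspector can observe them.
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_DEVICETYPE=cpu

# Enable code analysis and tracing in the JIT compilers and runtimes.
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false

if command -v inspxe-cl >/dev/null 2>&1 && [ -x "$APP" ]; then
    inspxe-cl -c mi3 -- "$APP"               # memory analysis (use ti3 for threading)
    inspxe-cl -report=problems -report-all   # view the analysis result
else
    echo "inspxe-cl or $APP not found; environment is configured for a later run"
fi
```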

If your SYCL or OpenMP Offload program passes bad pointers to the OpenCL™ backend, or passes the wrong pointer to the backend from the wrong thread, Intel Inspector should flag the issue. This may make the problem easier to find than trying to locate it using the intercept layers or the debugger.

Additional details are available from the Intel Inspector User Guide for Linux* OS | Windows* OS.
