oneAPI Debug Tools for SYCL* and OpenMP* Development
The following tools are available to help with debugging the SYCL* and OpenMP* offload process.
| Tool | When to Use |
|---|---|
| Environment variables | Environment variables allow you to gather diagnostic information from the OpenMP and SYCL runtimes at program execution with no modifications to your program. |
| The onetrace tool from Profiling Tools Interfaces for GPU (PTI for GPU) | When using the Intel® oneAPI Level Zero or OpenCL™ backends for SYCL and OpenMP offload, this tool can be used to debug backend errors and for performance profiling on both the host and device. |
| Intercept Layer for OpenCL™ Applications | When using the OpenCL™ backend for SYCL and OpenMP offload, this library can be used to debug backend errors and for performance profiling on both the host and device (it has wider functionality than onetrace). |
| Intel® Distribution for GDB* | Used for source-level debugging of the application, typically to inspect logical bugs, on the host and any devices you are using (CPU, GPU, FPGA emulation). |
| Intel® Inspector | This tool helps you locate and debug memory and threading problems, including those that can cause offloading to fail. NOTE: Intel Inspector is included in the Intel® HPC Toolkit. |
| In-application debugging | In addition to these tools and runtime-based approaches, you can locate problems using other techniques. For example, both SYCL and OpenMP allow printing to stdout from within an offload region; be sure to note which SIMD lane or thread is providing the output. |
| SYCL Exception Handler | Some DPC++ programming errors are returned as exceptions by the SYCL runtime during program execution. They can help you diagnose errors in your code that are flagged at runtime. For more details and examples, refer to <link> Using SYCL Exceptions </link>. Samples that demonstrate SYCL exceptions: Guided Matrix Multiplication Exception, Guided Matrix Multiplication Invalid Contexts, Guided Matrix Multiplication Race Condition. |
| Intel® Advisor | Use to ensure Fortran, C, C++, OpenCL™, and SYCL applications realize their full performance potential on modern processors. |
| Intel® VTune™ Profiler | Use to gather performance data either on the native system or on a remote system. |
| OpenMP* directives | Offload and Optimize OpenMP* Applications with Intel Tools describes how to use OpenMP* directives to add parallelism to your application. |
Debug Environment Variables
Both the OpenMP* and SYCL offload runtimes, as well as Level Zero, OpenCL, and the Shader Compiler, provide environment variables that help you understand the communication between the host and offload device. The variables also allow you to discover or control the runtime chosen for offload computations.
OpenMP* Offload Environment Variables
There are several environment variables that you can use to understand how OpenMP Offload works and control which backend it uses.
| Environment Variable | Description |
|---|---|
| LIBOMPTARGET_DEBUG=<Num> | Controls whether debugging information from the OpenMP Offload runtime is displayed (see details in Runtimes). Values: <Num>=0: disabled. <Num>=1: displays basic debug information from plugin actions such as device detection, kernel compilation, memory copy operations, kernel invocations, and other plugin-dependent actions. <Num>=2: additionally displays which GPU runtime API functions are invoked with which arguments/parameters. Default: 0 |
| LIBOMPTARGET_INFO=<Num> | Controls whether basic offloading information is displayed by the offload runtime; allows the user to request different types of runtime information from libomptarget (see details in Runtimes). Values: (0, 1, 2, 4, 8, 32). Default: 0 |
| LIBOMPTARGET_PLUGIN_PROFILE=<Enable>[,<Unit>] | Enables the display of performance data for offloaded OpenMP code: basic plugin profiling is collected and the result is displayed when the program finishes. Microsecond is the default unit if <Unit> is not specified. Default: F. Example: export LIBOMPTARGET_PLUGIN_PROFILE=T,usec |
| LIBOMPTARGET_PLUGIN=<Name> | Designates the offload plugin (backend) to use for OpenMP offload execution; the offload runtime does not try to load other RTLs if this option is used. NOTE: The Level Zero backend is only supported for GPU devices. |
| LIBOMPTARGET_PROFILE=<FileName> | Allows libomptarget to generate time profile output similar to Clang's -ftime-trace option (see details in Runtimes). |
| LIBOMPTARGET_DEVICES=<DeviceKind> | Controls how subdevices are exposed to users. DEVICE/device: only top-level devices are reported as OpenMP devices, and the subdevice clause is supported. SUBDEVICE/subdevice: only 1st-level subdevices are reported as OpenMP devices, and the subdevice clause is ignored. SUBSUBDEVICE/subsubdevice: only 2nd-level subdevices are reported as OpenMP devices, and the subdevice clause is ignored; on Intel GPUs using the Level Zero backend, limiting the subsubdevice to a single compute slice within a stack also requires setting the additional GPU compute runtime environment variable CFESingleSliceDispatchCCSMode=1. ALL/all: all top-level devices and their subdevices are reported as OpenMP devices, and the subdevice clause is ignored; this is not supported on Intel GPUs and is being deprecated. Default: equivalent to <DeviceKind>=device |
| LIBOMPTARGET_LEVEL0_MEMORY_POOL=<Option> | Controls how the reusable memory pool is configured. The pool is a list of memory blocks that can serve at least <Capacity> allocations of up to <AllocMax> size from a single block, with total size not exceeding <PoolSize>. Default: equivalent to <Option>=device,1,4,256,host,1,4,256,shared,8,4,256 |
| LIBOMPTARGET_LEVEL0_STAGING_BUFFER_SIZE=<Num> | Sets the staging buffer size to <Num> KB. The staging buffer is used in copy operations between host and device as temporary storage in a two-step copy operation. The buffer is only used for discrete devices. Default: 16 |
| LIBOMPTARGET_LEVEL_ZERO_COMMAND_BATCH=<Value> | |
| LIBOMPTARGET_LEVEL_ZERO_USE_IMMEDIATE_COMMAND_LIST=<Value> | compute: enables immediate command lists for kernel submission. Default: "all" |
| OMP_TARGET_OFFLOAD=MANDATORY | Defined by the OpenMP Standard: https://www.openmp.org/spec-html/5.1/openmpse74.html#x340-5150006.17 |
| ONEAPI_DEVICE_SELECTOR | This device selection environment variable can be used to limit the choice of devices available when the OpenMP application is run. Useful for limiting devices to a certain type (like GPUs or accelerators) or backend (like Level Zero or OpenCL). The ONEAPI_DEVICE_SELECTOR syntax is shared with SYCL and also allows particular devices to be chosen. See the oneAPI DPC++ Compiler documentation for a full description. |
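As a quick illustration, several of the variables above can be combined on a single run. This is a sketch only: `./my_offload_app` is a placeholder for your own binary, and the debug output typically goes to stderr.

```shell
# Fail fast if no offload device is available, rather than silently
# falling back to host execution.
export OMP_TARGET_OFFLOAD=MANDATORY

# Basic debug output from the OpenMP Offload runtime plugin actions
# (device detection, kernel compilation, memory copies, kernel launches).
export LIBOMPTARGET_DEBUG=1

# Basic plugin profiling, reported in microseconds when the program exits.
export LIBOMPTARGET_PLUGIN_PROFILE=T,usec

# Placeholder application name; capture the runtime's debug stream to a file.
./my_offload_app 2> libomptarget_debug.log
```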
SYCL* and DPC++ Environment Variables
The oneAPI DPC++ Compiler supports all standard SYCL environment variables. The full list is available from GitHub. Of interest for debugging are the following SYCL environment variables, plus an additional Level Zero environment variable.
| Environment Variable | Description |
|---|---|
| ONEAPI_DEVICE_SELECTOR | This complex environment variable allows you to limit the runtimes, compute device types, and compute device IDs used by the SYCL runtime to a subset of all available combinations. The compute device IDs correspond to those returned by the SYCL API, clinfo, or sycl-ls (with numbering starting at 0) and have no relation to whether the device with that ID is of a certain type or supports a specific runtime. Using a programmatic special selector (like gpu_selector) to request a device filtered out by ONEAPI_DEVICE_SELECTOR will cause an exception to be thrown. Refer to the environment variable descriptions on GitHub for additional details: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md. Default: use all available runtimes and devices |
| SYCL_PI_TRACE | Enables debug output from the DPC++ runtime. Default: disabled |
| ZE_DEBUG | Enables debug output from the Level Zero backend when used with the DPC++ runtime. Value: defined with any value = enabled. Default: disabled |
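For example, the selector and trace variables can be combined to pin a SYCL application to one backend while logging runtime activity. A sketch, assuming the commonly documented value 1 for basic SYCL_PI_TRACE output; `./my_sycl_app` is a placeholder binary:

```shell
# List the devices visible to the SYCL runtime (numbering starts at 0).
sycl-ls

# Restrict the runtime to GPU devices exposed through Level Zero...
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
# ...or instead to CPU devices exposed through OpenCL:
# export ONEAPI_DEVICE_SELECTOR=opencl:cpu

# Enable basic debug output from the DPC++ runtime.
export SYCL_PI_TRACE=1

# Placeholder application name.
./my_sycl_app
```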
Environment Variables that Produce Diagnostic Information for Support
The Level Zero backend provides a few environment variables that can be used to control behavior and aid in diagnosis.
Level Zero Specification, core programming guide: https://spec.oneapi.com/level-zero/latest/core/PROG.html#environment-variables
Level Zero Specification, tool programming guide: https://spec.oneapi.com/level-zero/latest/tools/PROG.html#environment-variables
An additional source of debug information comes from the Intel® Graphics Compiler, which is called by the Level Zero or OpenCL backends (used by both the OpenMP Offload and SYCL/DPC++ Runtimes) at runtime or during Ahead-of-Time (AOT) compilation. Intel® Graphics Compiler creates the appropriate executable code for the target offload device. The full list of these environment variables can be found at https://github.com/intel/intel-graphics-compiler/blob/master/documentation/configuration_flags.md. The two that are most often needed to debug performance issues are:
IGC_ShaderDumpEnable=1 (default=0) causes all LLVM, assembly, and ISA code generated by the Intel® Graphics Compiler to be written to /tmp/IntelIGC/<application_name>
IGC_DumpToCurrentDir=1 (default=0) writes all the files created by IGC_ShaderDumpEnable to your current directory instead of /tmp/IntelIGC/<application_name>. Since this is potentially a lot of files, it is recommended to create a temporary directory just for the purpose of holding these files.
If you have a performance issue with your OpenMP offload or SYCL offload application that arises between different versions of Intel® oneAPI, when using different compiler options, when using the debugger, and so on, then you may be asked to enable IGC_ShaderDumpEnable and provide the resulting files. For more information on compatibility, see oneAPI Library Compatibility.
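A typical way to capture these dumps without cluttering your working tree is to run from a scratch directory, as recommended above. The directory and application paths below are placeholders:

```shell
# Create a scratch directory to hold the (potentially numerous) dump files.
mkdir -p "$HOME/igc_dumps" && cd "$HOME/igc_dumps"

# Dump all LLVM, assembly, and ISA code generated by the Intel® Graphics Compiler.
export IGC_ShaderDumpEnable=1
# Write the dump files into the current directory
# instead of /tmp/IntelIGC/<application_name>.
export IGC_DumpToCurrentDir=1

# Placeholder path to your offload application.
/path/to/my_offload_app
```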
Offload Intercept Tools
In addition to the debuggers and diagnostics built into the offload software itself, it can be quite useful to monitor offload API calls and the data sent through the offload pipeline. For Level Zero, if your application is run as an argument to the onetrace or ze_tracer tools, they will intercept and report on various aspects of the Level Zero calls made by your application. For OpenCL™, you can add a library to LD_LIBRARY_PATH that will intercept and report on all OpenCL calls, and then use environment variables to control what diagnostic information is reported to a file. You can also use onetrace or cl_tracer to report on various aspects of the OpenCL API calls made by your application; once again, your application is run as an argument to the onetrace or cl_tracer tool.
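The invocation pattern is the same for all three tools: the tool wraps your application's command line. A sketch, assuming the `-c` (call logging) and `-h` (host timing) options documented in the PTI for GPU project; the application name is a placeholder:

```shell
# onetrace works with either the Level Zero or the OpenCL backend.
onetrace -c -h ./my_offload_app

# cl_tracer traces only the OpenCL backend...
cl_tracer -c ./my_offload_app
# ...and ze_tracer traces only the Level Zero backend.
ze_tracer -c ./my_offload_app
```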
Intercept Layer for OpenCL™ Applications
This library collects debugging and performance data when OpenCL is used as the backend to your SYCL or OpenMP offload program. It can help you detect buffer overwrites, memory leaks, and mismatched pointers, and can provide more detailed information about runtime error messages, allowing you to diagnose these issues whether CPU, FPGA, or GPU devices are used for computation. Note that you will get nothing useful if you use ze_tracer on a program that uses the OpenCL backend, or the Intercept Layer for OpenCL™ Applications library and cl_tracer on a program that uses the Level Zero backend.
Additional resources:
Extensive information on building and using the Intercept Layer for OpenCL Applications is available from https://github.com/intel/opencl-intercept-layer.
NOTE: For best results, run cmake with the following flags: -DENABLE_CLIPROF=TRUE -DENABLE_CLILOADER=TRUE. Information about a similar tool (CLIntercept) is available from https://github.com/gmeeker/clintercept and https://sourceforge.net/p/clintercept/wiki/Home/.
Information on the controls for the Intercept Layer for OpenCL™ Applications can be found at https://github.com/intel/opencl-intercept-layer/blob/master/docs/controls.md.
Information about optimizing for GPUs is available from the Intel oneAPI GPU Optimization Guide.
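A minimal build-and-run sketch, assuming the cmake flags from the note above, the `cliloader` launcher built by -DENABLE_CLILOADER=TRUE, and the `CLI_`-prefixed controls described in controls.md; the application path is a placeholder:

```shell
# Build the intercept layer and its cliloader launcher.
git clone https://github.com/intel/opencl-intercept-layer
cd opencl-intercept-layer && mkdir build && cd build
cmake -DENABLE_CLIPROF=TRUE -DENABLE_CLILOADER=TRUE ..
make

# Log every OpenCL call and any OpenCL errors while the application runs.
# (Control names are taken from the controls.md document linked above.)
CLI_CallLogging=1 CLI_ErrorLogging=1 ./cliloader /path/to/my_opencl_backend_app
```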
Profiling Tools Interfaces for GPU (onetrace, cl_tracer, and ze_tracer)
Like the Intercept Layer for OpenCL™ Applications, these tools collect debugging and performance data from applications that use the OpenCL or Level Zero backends for offload via OpenMP* or SYCL. Note that Level Zero can only be used as the backend for computations that happen on the GPU (there is no Level Zero backend for the CPU or FPGA at this time). The onetrace tool is part of the Profiling Tools Interfaces for GPU (PTI for GPU) project, found at https://github.com/intel/pti-gpu. This project also contains the ze_tracer and cl_tracer tools, which trace activity from only the Level Zero or OpenCL offload backend, respectively. The ze_tracer and cl_tracer tools produce no output if used with an application that uses the other backend, while onetrace provides output no matter which offload backend you use.
The onetrace tool is distributed as source. Instructions for how to build the tool are available from https://github.com/intel/pti-gpu/tree/master/tools/onetrace. The tool provides the following features:
Call logging: This mode allows you to trace all standard Level Zero (L0) and OpenCL™ API calls along with their arguments and return values annotated with time stamps. Among other things, this can give you supplemental information on any failures that occur when a host program tries to make use of an attached compute device.
Host and device timing: These provide the duration of all API calls, the duration of each kernel, and application runtime for the entire application.
Device Timeline mode: Gives time stamps for each device activity. All the time stamps are in the same (CPU) time scale.
Chrome Call Logging mode: Dumps API calls to JSON format that can be opened in chrome://tracing browser tool.
These data can help debug offload failures or performance issues.
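In the onetrace build documented in the PTI for GPU repository, each of the modes above corresponds to a command-line option. A sketch (option names are taken from that repository; the application name is a placeholder):

```shell
onetrace --call-logging ./my_offload_app         # annotated API call trace
onetrace --host-timing ./my_offload_app          # per-API-call durations on the host
onetrace --device-timing ./my_offload_app        # per-kernel durations on the device
onetrace --device-timeline ./my_offload_app      # time stamps for each device activity
onetrace --chrome-call-logging ./my_offload_app  # JSON for the chrome://tracing viewer
```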
Intel® Distribution for GDB*
The Intel® Distribution for GDB* is an application debugger that allows you to inspect and modify the program state. With the debugger, both the host part of your application and kernels that are offloaded to a device can be debugged seamlessly in the same debug session. The debugger supports the CPU, GPU, and FPGA-emulation devices. Major features of the tool include:
Automatically attaching to the GPU device to listen to debug events
Automatically detecting JIT-compiled, or dynamically loaded, kernel code for debugging
Defining breakpoints (both inside and outside of a kernel) to halt the execution of the program
Listing the threads; switching the current thread context
Listing active SIMD lanes; switching the current SIMD lane context per thread
Evaluating and printing the values of expressions in multiple thread and SIMD lane contexts
Inspecting and changing register values
Disassembling the machine instructions
Displaying and navigating the function call-stack
Source- and instruction-level stepping
Non-stop and all-stop debug mode
Recording the execution using Intel Processor Trace (CPU only)
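A short non-interactive session sketch using gdb-oneapi is shown below. The breakpoint location and binary are placeholders; `thread <id>:<lane>` is the Intel® Distribution for GDB* syntax for switching the current SIMD lane context:

```shell
# Start the debugger on a placeholder application.
gdb-oneapi ./my_sycl_app
# Inside the debugger:
# (gdb) break kernel.cpp:42   # breakpoint inside the offloaded kernel
# (gdb) run
# (gdb) info threads          # list host and device threads
# (gdb) thread 2:3            # switch to SIMD lane 3 of thread 2
# (gdb) print my_variable     # evaluate in the current thread/lane context
# (gdb) continue
```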
For more information and links to full documentation for Intel Distribution for GDB, see Get Started with Intel® Distribution for GDB on Linux* Host | Windows* Host.
Intel® Inspector for Offload
Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multithreaded applications. It can be used to verify correctness of the native part of the application as well as dynamically generated offload code.
Unlike the tools and techniques above, Intel Inspector cannot be used to catch errors in offload code that is communicating with a GPU or an FPGA. Instead, Intel Inspector requires that the SYCL or OpenMP runtime be configured to execute kernels on a CPU target. In general, this means defining the following environment variables prior to an analysis run.
To configure a SYCL application to run kernels on a CPU device
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
To configure an OpenMP application to run kernels on a CPU device
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_DEVICETYPE=cpu
To enable code analysis and tracing in JIT compilers or runtimes
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false
Use one of the following commands to start analysis from the command line. You can also start from the Intel Inspector graphical user interface.
Memory: inspxe-cl -c mi3 -- <app> [app_args]
Threading: inspxe-cl -c ti3 -- <app> [app_args]
View the analysis result using the following command: inspxe-cl -report=problems -report-all
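Putting the configuration and analysis steps above together for a SYCL application (`./my_sycl_app` is a placeholder binary):

```shell
# Force kernels onto the OpenCL CPU device so Intel Inspector can instrument them.
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false

# Run the memory-error analysis, then view the detected problems.
inspxe-cl -c mi3 -- ./my_sycl_app
inspxe-cl -report=problems -report-all
```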
If your SYCL or OpenMP Offload program passes bad pointers to the OpenCL™ backend, or passes the wrong pointer to the backend from the wrong thread, Intel Inspector should flag the issue. This may make the problem easier to find than trying to locate it using the intercept layers or the debugger.
Additional details are available from the Intel Inspector User Guide for Linux* OS | Windows* OS.