Debug the Offload Process
Run with Different Runtimes or Compute Devices
When an offload program fails to run correctly or produces incorrect results, a relatively quick sanity check is to run the application on a different runtime (OpenCL™ vs. Level Zero) or compute device (CPU vs. GPU) using LIBOMPTARGET_PLUGIN and OMP_TARGET_OFFLOAD for OpenMP* applications, and ONEAPI_DEVICE_SELECTOR for SYCL* applications. Errors that reproduce across runtimes mostly eliminate the runtime as the problem. Errors that reproduce on all available devices mostly eliminate bad hardware as the problem.
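For example, the following environment settings select the runtime and device (a minimal sketch; the values shown are the common ones, but check which runtimes are installed on your system):
export LIBOMPTARGET_PLUGIN=LEVEL0            # OpenMP: Level Zero runtime (or OPENCL)
export OMP_TARGET_OFFLOAD=MANDATORY          # OpenMP: fail rather than silently fall back to host
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu # SYCL: or opencl:gpu, opencl:cpu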
Debug CPU Execution
Offload code has two options for CPU execution: either a “host” implementation, or the CPU version of OpenCL. A “host” implementation is a truly native implementation of the offloaded code, meaning it can be debugged like any of the non-offloaded code. The CPU version of OpenCL, while it goes through the OpenCL runtime and code generation process, eventually ends up as normal parallel code running under a TBB runtime. Again, this provides a familiar debugging environment with familiar assembly and parallelism mechanisms. Pointers have meaning through the entire stack, and data can be directly inspected. There are also no memory limits beyond the usual limits for any operating system process.
Finding and fixing errors in CPU offload execution may solve errors seen in GPU offload execution with less pain, and without requiring use of a system with an attached GPU or other accelerator.
For OpenMP applications, to get a “host” implementation, remove the “target” or “device” constructs, replacing them with normal host OpenMP code. If LIBOMPTARGET_PLUGIN=OPENCL and offload to the GPU is disabled, then the offloaded code runs on the CPU OpenCL runtime, with TBB providing parallelism.
For SYCL applications, with ONEAPI_DEVICE_SELECTOR=host the “host” device is actually single-threaded, which may help you determine if threading issues, such as data races and deadlocks, are the source of execution errors. Setting ONEAPI_DEVICE_SELECTOR=opencl:cpu uses the CPU OpenCL runtime, which also uses TBB for parallelism.
Debug GPU Execution Using Intel® Distribution for GDB* on Compatible GPUs
Intel® Distribution for GDB* is extensively documented in Get Started with Intel Distribution for GDB on Linux* Host | Windows* Host. Useful commands are briefly described in the Intel Distribution for GDB Reference Sheet. However, since debugging applications with GDB* on a GPU differs slightly from the process on a host (some commands are used differently and you might see some unfamiliar output), some of those differences are summarized here.
The Debugging with Intel Distribution for GDB on Linux OS Host Tutorial shows a sample debug session where we start a debug session of a SYCL program, define a breakpoint inside the kernel, run the program to offload to the GPU, print the value of a local variable, switch to SIMD lane 5 of the current thread, and print the variable again.
As in normal GDB*, use the help <CMD> command to read the information text for any command <CMD>. For example:
(gdb) help info threads
Display currently known threads.
Usage: info threads [OPTION]... [ID]...
If ID is given, it is a space-separated list of IDs of threads to display.
Otherwise, all threads are displayed.
Options:
-gid
Show global thread IDs.
Referencing Inferiors, Threads, and SIMD Lanes in GDB*
The threads of the application can be listed using the debugger. The printed information includes the thread IDs and the locations where the threads are currently stopped. For GPU threads, the debugger also prints the active SIMD lanes.
In the example referenced above, you may see some unfamiliar formatting used when threads are displayed via the GDB “info threads” command:
Id Target Id Frame
1.1 Thread <id omitted> <frame omitted>
1.2 Thread <id omitted> <frame omitted>
* 2.1:1 Thread 1073741824 <frame> at array-transform.cpp:61
2.1:[3 5 7] Thread 1073741824 <frame> at array-transform.cpp:61
2.2:[1 3 5 7] Thread 1073741888 <frame> at array-transform.cpp:61
2.3:[1 3 5 7] Thread 1073742080 <frame> at array-transform.cpp:61
Here, GDB is displaying the threads with the following format: <inferior_number>.<thread_number>:<SIMD Lane/s>
So, for example, the thread id “2.3:[1 3 5 7]” refers to SIMD lanes 1, 3, 5, and 7 of thread 3 running on inferior 2.
An “inferior” in the GDB terminology is the process that is being debugged. In the debug session of a program that offloads to the GPU, there will typically be two inferiors: one “native” inferior representing the host part of the program (inferior 1 above), and another “remote” inferior representing the GPU device (inferior 2 above). Intel Distribution for GDB automatically creates the GPU inferior - no extra steps are required.
When you print the value of an expression, the expression is evaluated in the context of the current thread’s current SIMD lane. To change the context, switch the thread and/or the SIMD lane using the “thread” command, such as “thread 3:4”, “thread :6”, or “thread 7”. The first command switches to thread 3 and SIMD lane 4. The second command switches to SIMD lane 6 within the current thread. The third command switches to thread 7. The default lane selected is either the previously selected lane, if it is still active, or the first active lane within the thread.
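A short illustrative session (the variable name, values, and abbreviated output are hypothetical):
(gdb) print element
$1 = 10
(gdb) thread 2.1:5
[Switching to thread 2.1:5 ...]
(gdb) print element
$2 = 14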
The “thread apply” command may be similarly broad or focused (which can make it easier to limit the output of, for example, a command that inspects a variable). For more details and examples about debugging with SIMD lanes, see the Debugging with Intel Distribution for GDB on Linux OS Host Tutorial.
More information about threads and inferiors in GDB is available at https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html and https://sourceware.org/gdb/current/onlinedocs/gdb/Inferiors-Connections-and-Programs.html#Inferiors-Connections-and-Programs.
Controlling the Scheduler
By default, when a thread hits a breakpoint, the debugger stops all the threads before displaying the breakpoint hit event to the user. This is the all-stop mode of GDB. In the non-stop mode, the stop event of a thread is displayed while the other threads run freely.
In all-stop mode, when a thread is resumed (for example, to resume normally with the continue command, or for stepping with the next command), all the other threads are also resumed. If you have some breakpoints set in threaded applications, this can quickly get confusing, as the next thread that hits the breakpoint may not be the thread you are following.
You can control this behavior using the set scheduler-locking command to prevent resuming other threads when the current thread is resumed. This is useful to avoid intervention of other threads while only the current thread executes instructions. Type help set scheduler-locking for the available options, and see https://sourceware.org/gdb/current/onlinedocs/gdb/Thread-Stops.html for more information. Note that SIMD lanes cannot be resumed individually; they are resumed together with their underlying thread.
In non-stop mode, by default, only the current thread is resumed. To resume all threads, pass the “-a” flag to the continue command.
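For example (standard GDB commands, shown as a sketch):
(gdb) set scheduler-locking step
(gdb) next          # all-stop mode: only the current thread is resumed while stepping
(gdb) continue -a   # non-stop mode: resume all threads, not just the current one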
Dumping Information on One or More Threads/Lanes (Thread Apply)
Commands for inspecting the program state are typically executed in the context of the current thread’s current SIMD lane. Sometimes you want to inspect a value in multiple contexts; for such cases, use the thread apply command. For instance, the following executes the print element command for SIMD lanes 3-5 of thread 2.5:
(gdb) thread apply 2.5:3-5 print element
Similarly, the following runs the same command in the context of SIMD lane 3, 5, and 6 of the current thread:
(gdb) thread apply :3 :5 :6 print element
Stepping GPU Code After a Breakpoint
To stop inside the kernel that is offloaded to the GPU, simply define a breakpoint at a source line inside the kernel. When a GPU thread hits that source line, the debugger stops the execution and shows the breakpoint hit. To single-step a thread over a source line, use the step or next commands. The step command steps into functions, while next steps over calls. Before stepping, we recommend issuing set scheduler-locking step to prevent interference from other threads.
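A typical sequence, using the source file from the earlier example (the line number is illustrative):
(gdb) break array-transform.cpp:61
(gdb) run
(gdb) set scheduler-locking step
(gdb) next    # step the stopped GPU thread over one source line
(gdb) step    # or step into a function call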
Building a SYCL Executable for Use with Intel® Distribution for GDB*
Much like when you want to debug a host application, you need to set some additional flags to create a binary that can be debugged on the GPU. See Get Started with Intel Distribution for GDB on Linux* Host for details.
For a smooth debug experience when using the just-in-time (JIT) compilation flow, enable debug information emission from the compiler via the -g flag, and disable optimizations via the -O0 flag for both the host code and the JIT-compiled kernels of the application. The kernel flags are taken at link time. For example:
Compile your program using: icpx -fsycl -g -O0 -c myprogram.cpp
Link your program using: icpx -fsycl -g -O0 myprogram.o
If you are using CMake to configure the build of your program, use the Debug type for the CMAKE_BUILD_TYPE, and append -O0 to the CMAKE_CXX_FLAGS_DEBUG variable. For example: set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
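A minimal sketch of the relevant CMake fragment:
set(CMAKE_BUILD_TYPE Debug)
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")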
Applications built for debugging may take a little longer to start up than when built with the usual “release” level of optimization, so your program may appear to run a little more slowly when started in the debugger. If this causes problems, developers of larger applications may want to use ahead-of-time (AOT) compilation to compile the offload code when the program is built, rather than JIT-compiling it at run time (note that AOT builds with -g -O0 may also take longer). For more information, see Compilation Flow Overview.
When doing ahead-of-time compilation for GPU, you must use a device type that fits your target device. Run the following command to see the available GPU device options on your current machine: ocloc compile --help
Additionally, the debug mode for the kernel must be enabled. The following example AOT compilation command targets the KBL device:
icpx -fsycl -g -O0 -fsycl-targets=spir64_gen \
    -Xs "-device kbl -internal_options -cl-kernel-debug-enable -options -cl-opt-disable" myprogram.cpp
Building an OpenMP* Executable for Use with Intel® Distribution for GDB*
Compile and link your program using the -g -O0 flags. For example:
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -c -g myprogram.cpp
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -g myprogram.o
Set the following environment variables to disable optimizations and enable debug info for the kernel:
export LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="-g -cl-opt-disable"
export LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-g -cl-opt-disable"
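Putting the steps together, a sketch of a full OpenMP debug build-and-run sequence (program names are placeholders):
icpx -fiopenmp -fopenmp-targets=spir64 -g -O0 myprogram.cpp -o myprogram
export LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="-g -cl-opt-disable"
export LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-g -cl-opt-disable"
gdb-oneapi ./myprogram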
Debugging GPU Execution
A common issue with offload programs is that they may fail to run at all, instead producing a generic OpenCL™ error with little additional information. The Intercept Layer for OpenCL Applications, along with onetrace, ze_tracer, and cl_tracer, can be used to get more information about these errors, often helping the developer identify the source of the problem.
Intercept Layer for OpenCL Applications
Using this library, in particular the BuildLogging=1, ErrorLogging=1, and USMChecking=1 options, you can often find the source of the error.
Create a clintercept.conf file in the home directory with the following content:
SimpleDumpProgramSource=1
CallLogging=1
LogToFile=1
//KernelNameHashTracking=1
BuildLogging=1
ErrorLogging=1
USMChecking=1
//ContextCallbackLogging=1
// Profiling knobs
KernelInfoLogging=1
DevicePerformanceTiming=1
DevicePerformanceTimeLWSTracking=1
DevicePerformanceTimeGWSTracking=1
Run the application with cliloader as follows:
<OCL_Intercept_Install_Dir>/bin/cliloader/cliloader -d ./<app_name> <app_args>
Review the following results in the ~/CLIntercept_Dump/<app_name> directory:
clintercept_report.txt: Profiling results
clintercept_log.txt: Log of OpenCL™ calls used to debug OpenCL issues
The following snippet is from an example log file generated by a program that returned the runtime error: CL_INVALID_ARG_VALUE (-50)
...
<<<< clSetKernelArgMemPointerINTEL -> CL_SUCCESS
>>>> clGetKernelInfo( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): param_name = CL_KERNEL_CONTEXT (1193)
<<<< clGetKernelInfo -> CL_SUCCESS
>>>> clSetKernelArgMemPointerINTEL( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): kernel = 0xa2d51a0, index = 3, value = 0x41995e0
mem pointer 0x41995e0 is an UNKNOWN pointer and no device support shared system pointers!
ERROR! clSetKernelArgMemPointerINTEL returned CL_INVALID_ARG_VALUE (-50)
<<<< clSetKernelArgMemPointerINTEL -> CL_INVALID_ARG_VALUE
In this example, the following values help with debugging the error:
ZTSZZ10outer_coreiP5mesh
index = 3, value = 0x41995e0
Using this data, you can identify which kernel had the problems, what argument was problematic, and why.
onetrace, ze_tracer, and cl_tracer
Similar to Intercept Layer for OpenCL Applications, the onetrace, ze_tracer and cl_tracer tools can help find the source of errors detected by the Level Zero and OpenCL™ runtimes.
To use the onetrace or ze_tracer tools to root-cause Level Zero issues (cl_tracer would be used the same way to root-cause OpenCL issues):
Use Call Logging mode to run the application. Redirecting the tool output to a file is optional, but recommended.
./onetrace -c ./<app_name> <app_args> [2> log.txt]
The command for ze_tracer is the same - just substitute “ze_tracer” for “onetrace”.
Review the call trace to figure out the error (log.txt). For example:
>>>> [102032049] zeKernelCreate: hModule = 0x55a68c762690 desc = 0x7fff865b5570 {29 0 0 GEMM} phKernel = 0x7fff865b5438 (hKernel = 0)
<<<< [102060428] zeKernelCreate [28379 ns] hKernel = 0x55a68c790280 -> ZE_RESULT_SUCCESS (0)
...
>>>> [102249951] zeKernelSetGroupSize: hKernel = 0x55a68c790280 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1
<<<< [102264632] zeKernelSetGroupSize [14681 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [102278558] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 0 argSize = 8 pArgValue = 0x7fff865b5440
<<<< [102294960] zeKernelSetArgumentValue [16402 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [102308273] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 1 argSize = 8 pArgValue = 0x7fff865b5458
<<<< [102321981] zeKernelSetArgumentValue [13708 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924)
>>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60
<<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c
<<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0)
...
The example log data shows:
The Level Zero API call that caused the problem (zeKernelSetArgumentValue)
The reason for the failure (ZE_RESULT_ERROR_INVALID_ARGUMENT)
The argument index (argIndex = 1)
The location of the invalid value (pArgValue = 0x7fff865b5458)
The kernel handle (hKernel = 0x55a68c790280), which provides the name of the kernel for which the issue was observed (GEMM)
You can get more information by omitting the redirection to file and dumping all the output (application output plus tool output) into one stream. Dumping to one stream may help determine where the error occurs relative to the application output (for example, you can see that the error happens between application initialization and the first phase of computations):
Level Zero Matrix Multiplication (matrix size: 1024 x 1024, repeats 4 times)
Target device: Intel® Graphics [0x3ea5]
...
>>>> [104131109] zeKernelCreate: hModule = 0x55af5f39ca10 desc = 0x7ffe289c7f80 {29 0 0 GEMM} phKernel = 0x7ffe289c7e48 (hKernel = 0)
<<<< [104158819] zeKernelCreate [27710 ns] hKernel = 0x55af5f3ca600 -> ZE_RESULT_SUCCESS (0)
...
>>>> [104345820] zeKernelSetGroupSize: hKernel = 0x55af5f3ca600 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1
<<<< [104360082] zeKernelSetGroupSize [14262 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104373679] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 0 argSize = 8 pArgValue = 0x7ffe289c7e50
<<<< [104389443] zeKernelSetArgumentValue [15764 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104402448] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 1 argSize = 8 pArgValue = 0x7ffe289c7e68
<<<< [104415871] zeKernelSetArgumentValue [13423 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924)
>>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60
<<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c
<<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0)
...
Matrix multiplication time: 0.0427564 sec
Results are INCORRECT with accuracy: 1
...
Matrix multiplication time: 0.0430995 sec
Results are INCORRECT with accuracy: 1
...
Total execution time: 0.381558 sec
Correctness
Offload code is often used for kernels that can efficiently process large amounts of information on the attached compute device, or that generate large amounts of information from some input parameters. When such kernels run without crashing, you may not learn that they are producing incorrect results until much later in program execution.
In these cases, it can be difficult to identify which kernel is producing incorrect results. One technique for finding that kernel is to run the program twice, once using a purely host-based implementation and once using an offload implementation, capturing the inputs and outputs of every kernel (often to individual files). Then compare the results and see which kernel call produces unexpected output, allowing for a certain epsilon: the offload hardware may have a different order of operations or native precision that causes results to differ from the host code in the last digit or two.
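A minimal sketch of such a comparison (the helper below is hypothetical; adjust the epsilon to your application’s precision needs):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compare host and offload results element by element within a relative epsilon.
bool results_match(const std::vector<float> &host,
                   const std::vector<float> &device,
                   float eps = 1e-5f) {
  for (std::size_t i = 0; i < host.size(); ++i) {
    float tol = eps * std::max(std::fabs(host[i]), std::fabs(device[i]));
    if (std::fabs(host[i] - device[i]) > tol) {
      std::printf("Mismatch at %zu: host=%g device=%g\n", i, host[i], device[i]);
      return false;
    }
  }
  return true;
}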
Once you know which kernel is producing incorrect results, and you are working with a compatible GPU, use Intel Distribution for GDB to determine the reason. See the Debugging with Intel Distribution for GDB on Linux OS Host Tutorial for basic information and links to more detailed documentation.
Both SYCL and OpenMP* also allow for the use of standard language print mechanisms (printf for SYCL and C++ OpenMP offload, print *, ... for Fortran OpenMP offload) within offloaded kernels, which you can use to verify correct operation while they run. Print the thread and SIMD lane the output is coming from and consider adding synchronization mechanisms to ensure printed information is in a consistent state when printed. Examples for how to do this in SYCL using the stream class can be found in the Intel oneAPI GPU Optimization Guide. You could use a similar approach to the one described for SYCL for OpenMP offload.
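A minimal sycl::stream sketch (assumes a queue q already exists; the buffer sizes are illustrative):
#include <sycl/sycl.hpp>

void debug_print(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    sycl::stream out(8192, 1024, cgh); // total buffer size, max statement size
    cgh.parallel_for(sycl::range<1>(4), [=](sycl::id<1> i) {
      out << "work-item " << i[0] << "\n";
    });
  }).wait();
}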
For more information about using OpenMP directives to add parallelism to your application, see Offload and Optimize OpenMP* Applications with Intel Tools (https://www.intel.com/content/www/us/en/developer/tools/oneapi/training/offload-optimize-openmp-applications.html). In SYCL device code, printf can also be wrapped in a macro such as the following, which places the format string in the OpenCL constant address space:
#ifdef __SYCL_DEVICE_ONLY__
#define CL_CONSTANT __attribute__((opencl_constant))
#else
#define CL_CONSTANT
#endif
#define PRINTF(format, ...) { \
  static const CL_CONSTANT char _format[] = format; \
  sycl::ext::oneapi::experimental::printf(_format, ##__VA_ARGS__); }
Usage example: PRINTF("My integer variable:%d\n", (int) x);
Failures
Just-in-time (JIT) compilation failures that occur at runtime due to incorrect use of the SYCL or OpenMP* offload languages will cause your program to exit with an error.
For SYCL, if ahead-of-time compilation of your code does not expose the problem, selecting the OpenCL backend, setting SimpleDumpProgramSource and BuildLogging, and using the Intercept Layer for OpenCL Applications may help identify the kernel with the syntax error.
Logic errors can also result in crashes or error messages during execution. Such issues can include:
Passing a buffer that belongs to the wrong context to a kernel
Passing the “this” pointer to a kernel rather than a class element
Passing a host buffer rather than a device buffer
Passing an uninitialized pointer, even if it is not used in the kernel
Using the Intel® Distribution for GDB* (or even the native GDB), if you watch carefully, you can record the addresses of all contexts created and verify that the address being passed to an offload kernel belongs to the correct context. Likewise, you can verify that the address of a variable passed matches that of the variable itself, and not its containing class.
It may be easier to track buffers and addresses using the Intercept Layer for OpenCL™ Applications or onetrace/cl_tracer, choosing the appropriate backend. When using the OpenCL backend, setting CallLogging, BuildLogging, ErrorLogging, and USMChecking and running your program should produce output that explains which error in your code caused the generic OpenCL error.
Using onetrace or ze_tracer’s Call Logging or Device Timeline should give additional enhanced error information to help you better understand the source of generic errors from the Level Zero backend. This can help locate many of the logic errors mentioned above.
If the code gives an error when offloading to a device using the Level Zero backend, try the OpenCL backend. If the program then works, report an error against the Level Zero backend. If the error reproduces on the OpenCL backend on the same device, try the OpenCL CPU backend. For OpenMP offload, select the CPU by setting LIBOMPTARGET_DEVICETYPE=CPU with LIBOMPTARGET_PLUGIN=OPENCL. For SYCL, set ONEAPI_DEVICE_SELECTOR=opencl:cpu. Debugging with everything on the CPU can be easier, and it removes complications caused by data copies and by translation of the program to a non-CPU device.
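The fallback sequence looks like this for SYCL (set one value at a time and re-run the program):
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu   # original failing configuration
export ONEAPI_DEVICE_SELECTOR=opencl:gpu       # same device, OpenCL backend
export ONEAPI_DEVICE_SELECTOR=opencl:cpu       # everything on the CPU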
As an example of a logic issue that can get you in trouble, consider what is captured by the lambda function used to implement the parallel_for in this SYCL code snippet.
class MyClass {
private:
  int *data;
  int factor;
  // ...
  void run() {
    // ...
    auto data2 = data;
    auto factor2 = factor;
    {
      dpct::get_default_queue_wait().submit([&](cl::sycl::handler &cgh) {
        auto dpct_global_range = grid * block;
        auto dpct_local_range = block;
        cgh.parallel_for<dpct_kernel_name<class kernel_855a44>>(
            cl::sycl::nd_range<1>(
                cl::sycl::range<1>(dpct_global_range.get(0)),
                cl::sycl::range<1>(dpct_local_range.get(0))),
            [=](cl::sycl::nd_item<1> item_ct1) {
              kernel(data, b, factor, LEN, item_ct1); // This blows up
            });
      });
    }
  } // run
}; // MyClass
In the above code snippet, the program crashes because [=] captures by value every variable used inside the lambda. It may not be obvious that “factor” is really “this->factor” and “data” is really “this->data”, so “this” is what is actually captured to support the uses of “data” and “factor” above. OpenCL or Level Zero then crashes with an illegal-arguments error in the “kernel(data, b, factor, LEN, item_ct1)” call.
The fix is to use the local variables data2 and factor2. “auto factor2 = factor” is effectively “int factor2 = this->factor”, so using factor2 inside the lambda with [=] captures an “int” rather than “this”. The inner call becomes “kernel(data2, b, factor2, LEN, item_ct1);”.
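With the local copies in place, the lambda body becomes:
[=](cl::sycl::nd_item<1> item_ct1) {
  kernel(data2, b, factor2, LEN, item_ct1); // captures an int and a pointer by value, not "this"
});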
Using the Intercept Layer for OpenCL™ Applications, onetrace, or ze_tracer, you would see that the kernel was called with two identical addresses, and the extended error information would tell you that you are trying to copy a non-trivial data structure to the offload device.
Note that if you are using unified shared memory (USM), and “MyClass” is allocated in USM, the above code will work. However, if only “data” is allocated in USM, then the program will crash for the above reason.
In this example, note that you can also re-declare the variables in a local scope with the same names, so that you do not need to change anything in the kernel call, as sketched below.
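A sketch of that approach (shadowing the members with same-named locals):
void run() {
  auto data = this->data;     // local copy shadows the member pointer
  auto factor = this->factor; // [=] now captures these locals by value
  // ... the submit()/parallel_for code and the kernel(data, b, factor, ...)
  // call remain unchanged, but no longer capture "this".
}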
Intel® Inspector can also help diagnose these sorts of failures. If you set the following environment variables and then run Memory Error Analysis on offload code using the CPU device, Intel Inspector will flag many of the above issues:
OpenMP*
export LIBOMPTARGET_DEVICETYPE=CPU
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_PLUGIN=OPENCL
SYCL
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
Or initialize your queue with a CPU selector to force use of the OpenCL CPU device: cl::sycl::queue Queue(cl::sycl::cpu_selector{});
Both
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false