User Guide
What is Intel® Distribution for GDB*?
The Intel® Distribution for GDB* is an application debugger that allows you to inspect and modify the program state. With the debugger, both the host part of your application and kernels that are offloaded to a device can be debugged seamlessly in the same debug session. The debugger supports the CPU, GPU, and FPGA-emulation devices. Major features of the tool include:
Automatically attaching to the GPU device to listen to debug events
Automatically detecting JIT-compiled, or dynamically loaded, kernel code for debugging
Defining breakpoints (both inside and outside of a kernel) to halt the execution of the program
Listing the threads; switching the current thread context
Listing SIMD lanes; switching the current SIMD lane context per thread
Evaluating and printing the values of expressions in multiple thread and SIMD lane contexts
Inspecting and changing register values
Disassembling the machine instructions
Displaying and navigating the function call-stack
Source- and instruction-level stepping
Non-stop and all-stop debug mode
Recording the execution using Intel Processor Trace (CPU only)
Printing of Intel PT PTWRITE payloads and asynchronous events in the instruction history and function-call history
Reading and writing Intel® Advanced Matrix Extensions (Intel® AMX) registers
Reading and writing of the Intel® CET Shadow Stack Pointer (pl3_ssp) register
Reading and writing of the Intel® APX registers (Extended GPRs $r16 - $r31) including byte, word and dword pseudo registers
For more information and links to full documentation for Intel Distribution for GDB, see Get Started with Intel® Distribution for GDB on a Linux* host and Get Started with Intel® Distribution for GDB on a Windows* host.
Debug GPU Execution Using Intel® Distribution for GDB* on compatible GPUs
Intel® Distribution for GDB* is extensively documented in Get Started with Intel® Distribution for GDB onLinux* host|Windows* host. Useful commands are briefly described in the Intel® Distribution for GDBCheat Sheet. However, since debugging applications with GDB* on a GPU differs slightly from the process on a host (some commands are used differently and you might see some unfamiliar output), some of those differences are summarized here.
The Debug Examples in Linux shows sample debug sessions where we start a debug session of a SYCL program, define a breakpoint inside the kernel, run the program to offload to the GPU, print the value of a local variable, switch to the SIMD lane 5 of the current thread, and print the variable again.
As in normal GDB*, for a command <CMD>, use the help <CMD> command of GDB to read the information text for <CMD>. For example:
(gdb) help info threads
Display currently known threads.
Usage: info threads [OPTION]... [ID]...
If ID is given, it is a space-separated list of IDs of threads to display.
Otherwise, all threads are displayed.
Options:
-gid
Show global thread IDs.
-stopped
Show stopped threads only.
Inferiors, Threads, and SIMD Lanes Referencing in GDB*
The threads of the application can be listed using the debugger. The printed information includes the thread ids and the locations that the threads are currently stopped at. For the GPU threads, the debugger also prints the active SIMD lanes.
In the example referenced above, you may see some unfamiliar formatting used when threads are displayed via the GDB info threads -stopped command:
(gdb) info threads -stopped
Id Target Id Frame
1.1 Thread 0x7ffff502ccc0 (LWP 124773) "array-transorm." 0x00008000001c3a10 in clock_gettime ()
1.3 Thread 0x7fffe8e6e640 (LWP 124787) "array-transorm." __futex_abstimed_wait_common64 (private=1, cancel=true, abstime=0x7fffe8e6dd30, op=137,
expected=0, futex_word=0x1bdaa60) at ./nptl/futex-internal.c:57
2.1:[0-15] ZE 0.0.0.0 main::{lambda(auto:1&)#1}::operator()<sycl::_V1::handler>(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<1>)#1}::operator()(sycl::_V1::id<1>) const (this=0xff000000002e0590, index=sycl::id = 16) at array-transform.cpp:61
* 2.9:[*0 1-15] ZE 0.0.1.0 main::{lambda(auto:1&)#1}::operator()<sycl::_V1::handler>(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<1>)#1}::operator()(sycl::_V1::id<1>) const (this=0xff000000002fc590, index=sycl::id = 48) at array-transform.cpp:61
2.33:[0-15] ZE 0.0.4.0 main::{lambda(auto:1&)#1}::operator()<sycl::_V1::handler>(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<1>)#1}::operator()(sycl::_V1::id<1>) const (this=0xff00000000350590, index=sycl::id = 0) at array-transform.cpp:61
2.41:[0-15] ZE 0.0.5.0 main::{lambda(auto:1&)#1}::operator()<sycl::_V1::handler>(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<1>)#1}::operator()(sycl::_V1::id<1>) const (this=0xff0000000036c590, index=sycl::id = 32) at array-transform.cpp:61
Here, GDB is displaying the threads with the following format: <inferior_number>.<thread_number>:<SIMD Lane/s>
So, for example, the thread id “2.33:[0-15]” refers to SIMD lanes 0, 1, …, 15 of thread 33 running on inferior 2. In the thread id “2.9:[*0 1-15]”, the selected lane 0 is additionally marked with an asterisk *.
An “inferior” in the GDB* terminology is the process that is being debugged. In the debug session of a program that offloads to the GPU, there will typically be two inferiors; one “native” inferior representing a host part of the program (inferior 1 above), and another “remote” inferior representing the GPU device (inferior 2 above). Intel® Distribution for GDB* automatically creates the GPU inferior - no extra steps are required.
When you print the value of an expression, the expression is evaluated in the context of the current thread’s current SIMD lane. You can switch the thread as well as the SIMD lane to change the context using the “thread” command such as “thread 3:4 “, “thread :6 “, or “thread 7 “. The first command makes a switch to the thread 3 and SIMD lane 4. The second command switches to SIMD lane 6 within the current thread. The third command switches to thread 7. The default lane selected will either be the previously selected lane, if it is active, or the first active lane within the thread.
The thread apply command may be similarly broad or focused (which can make it easier to limit the output from, for example, a command to inspect a variable). For more details and examples about debugging with SIMD lanes, see Debug Examples in Linux.
More information about threads and inferiors in GDB can be found at https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html and https://sourceware.org/gdb/current/onlinedocs/gdb/Inferiors-Connections-and-Programs.html#Inferiors-Connections-and-Programs.
Controlling the Scheduler
By default, when a thread hits a breakpoint, the debugger stops all the threads before displaying the breakpoint hit event to the user. This is the all-stop mode of GDB. In the non-stop mode, the stop event of a thread is displayed while the other threads run freely.
In all-stop mode, when a thread is resumed (for example, to resume normally with the continue command, or for stepping with the next command), all the other threads are also resumed. If you have some breakpoints set in threaded applications, this can quickly get confusing, as the next thread that hits the breakpoint may not be the thread you are following.
You can control this behavior using the set scheduler-locking command to prevent resuming other threads when the current thread is resumed. This is useful to avoid intervention of other threads while only the current thread executes instructions. Type help set scheduler-locking for the available options, and see Stopping and Starting Multi-thread Programs for more information. Note that SIMD lanes cannot be resumed individually; they are resumed together with their underlying thread.
In non-stop mode, by default, only the current thread is resumed. To resume all threads, pass the “-a” flag to the continue command.
Dumping Information about One or More Threads/Lanes (Thread Apply)
Commands for inspecting the program state are typically executed in the context of the current thread’s current SIMD lane. Sometimes it is desired to inspect a value in multiple contexts. For such needs, the thread apply command can be used. For instance, the following executes the print element command for the SIMD lanes 3-5 of Thread 2.5:
(gdb) thread apply 2.5:3-5 print element
Similarly, the following runs the same command in the context of SIMD lane 3, 5, and 6 of the current thread:
(gdb) thread apply :3 :5 :6 print element
Stepping GPU Code After a Breakpoint
To stop inside the kernel that is offloaded to the GPU, simply define a breakpoint at a source line inside the kernel. When a GPU thread hits that source line, the debugger stops the execution and shows the breakpoint hit. To single-step a thread over a source-line, use the step or next commands. The step command steps into functions while next steps over calls. Before stepping, we recommend to set scheduler-locking step to prevent intervention of other threads.
Building a SYCL Executable for Use with Intel® Distribution for GDB*
Much like when you want to debug a host application, you need to set some additional flags to create a binary that can be debugged on the GPU. See Get Started with Intel® Distribution for GDB on Linux* Host for details.
For a smooth debug experience when using the just-in-time (JIT) compilation flow, enable debug information emission from the compiler via the -g flag, and disable optimizations via the -O0 flag for both a host and JIT-compiled kernel of the application. The flags for the kernel are taken during link time. For example:
Compile your program using: icpx -fsycl -g -O0 -c myprogram.cpp
Link your program using: icpx -fsycl -g -O0 myprogram.o
If you are using CMake to configure the build of your program, use the Debug type for the CMAKE_BUILD_TYPE, and append -O0 to the CMAKE_CXX_FLAGS_DEBUG variable. For example: set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
Applications that are built for debugging may take a little longer to start up than when built with the usual “release” level of optimization. Thus, your program may appear to run a little more slowly when started in the debugger. If this causes problems, developers of larger applications may want to use ahead-of-time (AOT) compilation to JIT the offload code when their program is built, rather than when it is run (warning, this may also take longer to build when using -g -O0). For more information, see Compilation Flow Overview.
Building an OpenMP* Executable for use with Intel® Distribution for GDB*
Compile and link your program using the -g -O0 flags. For example:
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -c -g myprogram.cpp
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -g myprogram.o
Set the following environment variables to disable optimizations and enable debug info for the kernel: