Debug a SYCL* Application on a GPU
Tutorial: Debugging with Intel® Distribution for GDB*
This section describes a basic scenario of debugging a SYCL* program with the kernel offloaded to the GPU.
Before you proceed, make sure you have completed all necessary setup steps described in the Get Started Guide.
Basic Debugging
Consider the array-transform.cpp example again:
54 h.parallel_for(data_range, [=](id<1> index) {
55   size_t id0 = GetDim(index, 0);
56   int element = in[index]; // breakpoint-here
57   int result = element + 50;
58   if (id0 % 2 == 0) {
59     result = result + 50; // then-branch
60   } else {
61     result = -1; // else-branch
62   }
63   out[index] = result;
64 });
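To make the values seen later in the debugger easier to predict, the per-work-item logic of the kernel can be mirrored on the host. The following is a minimal sketch, not part of the sample: the function name `transform_element` is hypothetical, and `id0` stands in for the value that `GetDim(index, 0)` returns in the kernel.

```cpp
#include <cstddef>

// Host-side mirror of the kernel body (lines 55-63 of array-transform.cpp).
// Hypothetical helper for illustration only; it is not part of the sample.
int transform_element(std::size_t id0, int element) {
  int result = element + 50;  // line 57, executed by every work-item
  if (id0 % 2 == 0) {
    result = result + 50;     // then-branch (line 59): even work-item index
  } else {
    result = -1;              // else-branch (line 61): odd work-item index
  }
  return result;
}
```

Under these assumptions, `transform_element(8, 108)` returns 208 and `transform_element(9, 109)` returns -1, so even-indexed and odd-indexed work-items reach different breakpoints.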
If you have not already done so, start the debugger:
gdb-oneapi array-transform
Set two breakpoints inside the kernel, one for each conditional branch:
break 59
Expected output:
Breakpoint 1 at 0x40583c: file /path/to/array-transform.cpp, line 59.
break 61
Expected output:
Breakpoint 2 at 0x40584a: file /path/to/array-transform.cpp, line 61.
To start the program, execute:
run gpu
You should see the following output:
Starting program: /path/to/array-transform gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
intelgt: gdbserver-ze started for process 8194.
[New Thread 0x7fffed706700 (LWP 8213)]
[SYCL] Using device: [Intel(R) Data Center GPU Flex Series 140 [0x56c1]] from [Intel(R) Level-Zero]
[Switching to Thread 1.129 lane 1]
Thread 2.129 hit Breakpoint 2, with SIMD lanes [1 3 5 7], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:61
61 result = -1; // else-branch
(gdb)
The debugger has a mechanism called “auto-attach” that spawns an instance of gdbserver-ze to listen to and control the GPU for debugging. In the example above, the auto-attach mechanism is triggered and gdbserver-ze is added to the debugger as an inferior. An inferior in GDB represents the unit under debug. In this case, the host application process and the GPU device each correspond to an inferior.
Check the presence of gdbserver-ze as follows:
info inferiors
Expected output:
(gdb) info inferiors
  Num  Description       Connection                                 Executable
  1    process 8194      1 (native)                                 /path/to/array-transform
* 2    device [37:00.0]  2 (remote | gdbserver-ze --attach - 8194)
Type "info devices" to see details of the devices.
Execute the info devices command to see further details about the device.
info devices
Expected output:
  Location   Sub-device  Vendor Id  Target Id  Cores  Device Name
* [37:00.0]  -           0x8086     0x56c1     128    Intel(R) Data Center GPU Flex Series 140 [0x56c1]
The breakpoint event is received from inferior 2, which represents the GPU. The thread ID 2.129:1 refers to thread 129 of inferior 2 and indicates that SIMD lane 1, the first active lane, is now in focus.
The breakpoint at line 61 is hit first. The order of branch execution is defined by the Intel® Graphics Compiler.
Check which SIMD lanes are currently active with the following command. The -stopped flag filters out GPU threads that are currently unavailable (e.g. not utilized by the program). We recommend using it to obtain a more concise output.
info threads -stopped
In the example, thread 2.129 has 4 active SIMD lanes: 1, 3, 5, and 7. The asterisk ‘*’ marks the current SIMD lane. See the expected output below.
(gdb) info threads -stopped
  Id               Target Id                                          Frame
  1.1              Thread 0x7ffff598fb80 (LWP 8194) "array-transform" [...]
  1.2              Thread 0x7fffed706700 (LWP 8213) "array-transform" [...]
* 2.129:1          Thread 1.129                                       <frame> at array-transform.cpp:61
  2.129:[3 5 7]    Thread 1.129                                       <frame> at array-transform.cpp:61
  2.137:[1 3 5 7]  Thread 1.137                                       <frame> at array-transform.cpp:61
  2.145:[1 3 5 7]  Thread 1.145                                       <frame> at array-transform.cpp:61
  2.153:[1 3 5 7]  Thread 1.153                                       <frame> at array-transform.cpp:61
  2.193:[1 3 5 7]  Thread 1.193                                       <frame> at array-transform.cpp:61
  2.201:[1 3 5 7]  Thread 1.201                                       <frame> at array-transform.cpp:61
  2.209:[1 3 5 7]  Thread 1.209                                       <frame> at array-transform.cpp:61
  2.217:[1 3 5 7]  Thread 1.217                                       <frame> at array-transform.cpp:61
To switch the focus to a different SIMD lane, use the thread <thread_ID> command. Thread ID is specified by a triple: inferior.thread:lane. See examples of working with particular lanes:
thread 2.129:3
Example output:
[Switching to thread 2.129:3 (Thread 1.129 lane 3)]
#0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:61
61 result = -1; // else-branch
print element
Example output:
$1 = 111
thread 2.129:5
Example output:
[Switching to thread 2.129:5 (Thread 1.129 lane 5)]
#0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:61
61 result = -1; // else-branch
print element
Example output:
$2 = 113
thread :7
Expected output:
[Switching to thread 2.129:7 (Thread 1.129 lane 7)]
#0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:61
61 result = -1; // else-branch
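The inferior.thread:lane notation used in the thread commands above can be illustrated with a small parser. This is only a sketch for reasoning about the notation, assuming C++17; the names `LaneRef` and `parse_lane_ref` are hypothetical and are not part of the debugger. It handles the fully qualified triple only, not the shortened forms that the debugger itself accepts.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Fully qualified SIMD lane reference, for example 2.129:3.
struct LaneRef {
  unsigned inferior;
  unsigned thread;
  unsigned lane;
};

// Parses "inferior.thread:lane". Returns std::nullopt when the text is not
// a complete triple of unsigned decimal numbers (":7" is rejected here).
std::optional<LaneRef> parse_lane_ref(std::string_view text) {
  const char* p = text.data();
  const char* end = p + text.size();
  LaneRef ref{};

  // Reads one unsigned decimal number at p and advances p past it.
  auto read_number = [&](unsigned& out) {
    auto [next, ec] = std::from_chars(p, end, out);
    if (ec != std::errc{}) return false;
    p = next;
    return true;
  };
  // Consumes the separator character c at p, if it is present.
  auto read_char = [&](char c) {
    if (p == end || *p != c) return false;
    ++p;
    return true;
  };

  if (read_number(ref.inferior) && read_char('.') &&
      read_number(ref.thread) && read_char(':') &&
      read_number(ref.lane) && p == end) {
    return ref;
  }
  return std::nullopt;
}
```

For the ID 2.129:3, the sketch yields inferior 2, thread 129, and lane 3, which matches how the debugger reports the switch in the outputs above.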
As you are now inside the kernel running on the GPU, you can look into the assembly code and GPU registers, for example, to understand the cause of unexpected application behavior. Get the GPU assembly code to inspect generated instructions by executing the following command:
disassemble
See an example output below:
Dump of assembler code for function _ZZZ4mainENKUlRT_E_clIN4sycl3_V17handlerEEEDaS0_ENKUlNS4_2idILi1EEEE_clES7_:
0xffff8000ffe87200 <+0>: (W) shr (1|M16) a0.2<1>:ud r126.7<0;1,0>:ud 0x4:ud {F@1}
0xffff8000ffe87210 <+16>: (W) add (1|M16) r126.0<1>:ud r125.2<0;1,0>:ud 0x0:ud
0xffff8000ffe87220 <+32>: (W) send.ugm (1|M16) null r126 r125:1 a0.2 0x4200C504 {ExBSO,A@1,$0} // wr:1+1, rd:0; store.ugm.d32x8t.a32.ss[a0.2]
0xffff8000ffe87230 <+48>: (W) mov (1|M16) r125.3<1>:ud r125.2<0;1,0>:ud {$0.src}
0xffff8000ffe87240 <+64>: (W) add (1|M16) r125.2<1>:ud r125.2<0;1,0>:ud 0x180:ud
0xffff8000ffe87250 <+80>: (W) add (1|M16) r126.0<1>:ud r125.3<0;1,0>:ud 0x40:ud {I@2}
0xffff8000ffe87260 <+96>: (W) send.ugm (1|M16) null r126 r60:4 a0.2 0x4200E504 {ExBSO,A@1,$1} // wr:1+4, rd:0; store.ugm.d32x32t.a32.ss[a0.2]
0xffff8000ffe87270 <+112>: (W) add (1|M16) r126.0<1>:ud r125.3<0;1,0>:ud 0xC0:ud {$1.src}
0xffff8000ffe87280 <+128>: (W) send.ugm (1|M16) null r126 r64:4 a0.2 0x4200E504 {ExBSO,A@1,$2} // wr:1+4, rd:0; store.ugm.d32x32t.a32.ss[a0.2]
To learn more about GEN assembly and registers, refer to the “Introduction to GEN assembly” article.
To display a list of GPU registers, run the following command:
info registers
You can use registers to see the state of the application or inspect arithmetic instructions: which operands are used and where the result is located.
Additionally, you can inspect the execution mask ($emask register), which shows active lanes. To print the result in binary format, use the /t format flag as follows:
print/t $emask
Example output:
$3 = 10101010
Recall that you have stopped at line 61: the else-branch of the condition that checks evenness of the work item index. Hence, every other SIMD lane is inactive, as indicated by the $emask bit pattern.
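The relation between the $emask bit pattern and the active lanes can be sketched in a few lines of C++. The helper name `active_lanes` is hypothetical, and the sketch assumes that bit i of the mask corresponds to SIMD lane i, which is consistent with the value 10101010 and the lanes [1 3 5 7] reported above.

```cpp
#include <cstdint>
#include <vector>

// Returns the indices of the set bits in an execution mask, lowest lane first.
// simd_width is the number of lanes per thread (8 in this tutorial's output).
// Assumption: bit i of emask is set when SIMD lane i is active.
std::vector<int> active_lanes(std::uint32_t emask, int simd_width) {
  std::vector<int> lanes;
  for (int lane = 0; lane < simd_width && lane < 32; ++lane) {
    if ((emask >> lane) & 1u) {
      lanes.push_back(lane);
    }
  }
  return lanes;
}
```

With the mask from the example, `active_lanes(0b10101010, 8)` returns the lanes 1, 3, 5, and 7.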
To move forward and stop at the then-branch, set the scheduler-locking mode to step and execute the next command. The set scheduler-locking step command keeps the other threads stopped while the current thread is stepping:
set scheduler-locking step
next
You should see the following output:
[Switching to SIMD lane 0]
Thread 2.129 hit Breakpoint 1, with SIMD lanes [0 2 4 6], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = result + 50; // then-branch
Due to the breakpoint event, the SIMD lane focus switches to the first active lane in the then-branch, which is SIMD lane 0. The other threads of inferior 2 remain at line 61:
info threads -stopped
Example output:
  Id               Target Id                                          Frame
  1.1              Thread 0x7ffff598fb80 (LWP 8194) "array-transform" [...]
  1.2              Thread 0x7fffed706700 (LWP 8213) "array-transform" [...]
* 2.129:0          Thread 1.129                                       <frame> at array-transform.cpp:59
  2.129:[2 4 6]    Thread 1.129                                       <frame> at array-transform.cpp:59
  2.137:[1 3 5 7]  Thread 1.137                                       <frame> at array-transform.cpp:61
  2.145:[1 3 5 7]  Thread 1.145                                       <frame> at array-transform.cpp:61
  2.153:[1 3 5 7]  Thread 1.153                                       <frame> at array-transform.cpp:61
  2.193:[1 3 5 7]  Thread 1.193                                       <frame> at array-transform.cpp:61
  2.201:[1 3 5 7]  Thread 1.201                                       <frame> at array-transform.cpp:61
  2.209:[1 3 5 7]  Thread 1.209                                       <frame> at array-transform.cpp:61
  2.217:[1 3 5 7]  Thread 1.217                                       <frame> at array-transform.cpp:61
Since the thread is vectorized, you can also inspect the vector of a local variable:
x /8dw &result
Example output:
0xffffd556ab1627e0: 158 -1 160 -1
0xffffd556ab1627f0: 162 -1 164 -1
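The memory dump above can be reproduced with a small host-side model of the thread at this stop. The sketch assumes that lanes 0 to 7 of thread 2.129 hold the element values 108 to 115 (the values for lanes 0 to 6 appear in the print outputs of this tutorial; the value for lane 7 is inferred) and that lane parity matches work-item index parity. The name `result_at_line_59` is hypothetical.

```cpp
#include <array>

constexpr int kSimdWidth = 8;  // lanes per thread in this tutorial's output

// Per-lane values of `result` while the thread is stopped at line 59:
// every lane has executed line 57, odd lanes have also executed line 61,
// and even lanes have not yet executed line 59.
std::array<int, kSimdWidth> result_at_line_59(
    const std::array<int, kSimdWidth>& element) {
  std::array<int, kSimdWidth> result{};
  for (int lane = 0; lane < kSimdWidth; ++lane) {
    result[lane] = element[lane] + 50;  // line 57, all lanes
    if (lane % 2 != 0) {
      result[lane] = -1;                // line 61, odd lanes only
    }
  }
  return result;
}
```

For the assumed inputs 108 to 115, the model gives 158, -1, 160, -1, 162, -1, 164, -1, which is the same vector the x command shows.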
SIMD Lanes
To investigate the program state from the point of view of SIMD lanes without switching, use the thread apply command. You can specify a SIMD lane as a number:
thread apply 2.129:2 print element
Example output:
Thread 2.129:2 (Thread 1.129 lane 2):
$5 = 110
You can also specify a SIMD lane as a range. In this case, only active SIMD lanes from the range are considered:
thread apply 2.129:2-5 print element
Example output:
Thread 2.129:2 (Thread 1.129 lane 2):
$11 = 110
warning: SIMD lane 3 is inactive in thread 2.129
Thread 2.129:4 (Thread 1.129 lane 4):
$12 = 112
warning: SIMD lane 5 is inactive in thread 2.129
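How a lane range is split into lanes that run the command and lanes that only produce a warning can be modeled as follows. This is a sketch of the observed behavior, not the debugger's implementation; `LaneSelection` and `select_lane_range` are hypothetical names, and bit i of the mask is assumed to correspond to lane i.

```cpp
#include <cstdint>
#include <vector>

// Result of applying a lane range to one thread.
struct LaneSelection {
  std::vector<int> active;    // lanes on which the command runs
  std::vector<int> inactive;  // lanes that are skipped with a warning
};

// Splits the inclusive range [first, last] according to the execution mask.
// Assumption: bit i of emask is set when SIMD lane i is active.
LaneSelection select_lane_range(std::uint32_t emask, int first, int last) {
  LaneSelection selection;
  for (int lane = first; lane <= last; ++lane) {
    if (lane < 0 || lane >= 32) continue;  // outside the 32-bit mask
    if ((emask >> lane) & 1u) {
      selection.active.push_back(lane);
    } else {
      selection.inactive.push_back(lane);
    }
  }
  return selection;
}
```

With lanes 0, 2, 4, and 6 active (mask 01010101), the range 2-5 selects lanes 2 and 4 and reports lanes 3 and 5 as inactive, as in the output above.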
To denote all active SIMD lanes, use the wildcard:
thread apply 2.129:* print element
Example output:
Thread 2.129:0 (Thread 1.129 lane 0):
$13 = 108
Thread 2.129:2 (Thread 1.129 lane 2):
$14 = 110
Thread 2.129:4 (Thread 1.129 lane 4):
$15 = 112
Thread 2.129:6 (Thread 1.129 lane 6):
$16 = 114
To apply the command to all active SIMD lanes of all threads, use the all-lanes parameter:
thread apply all-lanes print element
Example output:
Thread 2.217:7 (Thread 1.217 lane 7):
$17 = 155
Thread 2.217:5 (Thread 1.217 lane 5):
$18 = 153
[...]
Thread 2.129:2 (Thread 1.129 lane 2):
$47 = 110
Thread 2.129:0 (Thread 1.129 lane 0):
$48 = 108
Thread 1.2 (Thread 0x7fffed706700 (LWP 8213) "array-transform"):
No symbol "element" in current context.
You can mix SIMD lane ranges with thread ranges and the thread wildcard. For example, to apply the command to all active lanes of all threads of inferior 2, you can use any of the following commands:
thread apply 2.127-129:*
thread apply 2.*:*
If the current inferior is 2, the inferior number can be skipped:
thread apply 127-129:*
thread apply *:*
Breakpoint Actions
You can define a set of actions for a breakpoint to be executed when the breakpoint is hit. By default, the actions are executed in the context of the SIMD lane selected after the hit.
Quit the current debugging session and start a new one:
quit
gdb-oneapi array-transform
Define two temporary breakpoints with actions for the if and else branches:
Set a temporary breakpoint:
tbreak 61
Example output:
Temporary breakpoint 1 at 0x40584a: file /path/to/array-transform.cpp, line 61.
Define an action:
commands
When you are asked to type commands, enter the following:
print element
end
The end keyword terminates the list of commands for the breakpoint.
Set another temporary breakpoint:
tbreak 59
Example output:
Temporary breakpoint 2 at 0x40583c: file /path/to/array-transform.cpp, line 59.
Define an action to be executed for all SIMD lanes by adding the /a modifier:
commands /a
When you are asked to type commands, enter the following:
print element
end
Start the program:
run gpu
Example output:
[...]
Thread 2.129 hit Temporary breakpoint 1, with SIMD lanes [1 3 5 7], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:61
61 result = -1; // else-branch
$1 = 109
Continue to hit both breakpoints:
continue
Example output:
Continuing.
Thread 2.129 hit Temporary breakpoint 2, with SIMD lanes [0 2 4 6], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = result + 50; // then-branch
$2 = 108
$3 = 110
$4 = 112
$5 = 114
The action for the breakpoint at the else-branch was executed for a single SIMD lane (lane 1), while the action at the then-branch was executed for all active SIMD lanes.
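The difference between the default behavior and the /a modifier can be summarized in a small model. This is a sketch of the behavior described in this section, not the debugger's implementation; the name `action_lanes` is hypothetical, and the lane selected after the hit is assumed to be the first lane listed in the hit message.

```cpp
#include <vector>

// Lanes in whose context a breakpoint's command list runs.
// hit_lanes: lanes from the "with SIMD lanes [...]" part of the hit message,
//            in ascending order.
// all_lanes: true when the commands were defined with the /a modifier.
std::vector<int> action_lanes(const std::vector<int>& hit_lanes,
                              bool all_lanes) {
  if (hit_lanes.empty()) return {};
  if (all_lanes) return hit_lanes;
  return {hit_lanes.front()};  // default: only the lane selected after the hit
}
```

For the two hits above, the model gives lane 1 for the breakpoint at line 61 and lanes 0, 2, 4, and 6 for the breakpoint at line 59, which matches the one value and the four values printed.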
Conditional Breakpoints
Quit the debugging session and start the program from the beginning:
quit
gdb-oneapi array-transform
This time set a breakpoint at line 59 with the condition element==106:
break 59 if element == 106
Example output:
Breakpoint 1 at 0x40583c: file /path/to/array-transform.cpp, line 59.
Run the program (execute the run gpu command) and check if the output looks as follows:
Starting program: <path_to_array-transform> gpu
[...]
[Switching to Thread 1.193 lane 6]
Thread 2.193 hit Breakpoint 1, with SIMD lane 6, main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = result + 50; // then-branch
(gdb)
The condition is true only for lane 6 in thread 2.193, so the breakpoint is reported for that single lane.
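A conditional breakpoint stops only where the location is reached and the condition holds. The following host-side sketch models this for line 59. It is an illustration under stated assumptions, not the debugger's logic: the name `conditional_hits` is hypothetical, and the input used in the usage note (in[i] = 100 + i) is assumed for the example rather than taken from this tutorial.

```cpp
#include <cstddef>
#include <vector>

// Work-item indices at which "break 59 if element == wanted" would stop:
// the work-item must reach line 59 (even index) and satisfy the condition.
std::vector<std::size_t> conditional_hits(const std::vector<int>& in,
                                          int wanted) {
  std::vector<std::size_t> hits;
  for (std::size_t id0 = 0; id0 < in.size(); ++id0) {
    const int element = in[id0];                  // line 56
    const bool reaches_line_59 = (id0 % 2 == 0);  // then-branch only
    if (reaches_line_59 && element == wanted) {
      hits.push_back(id0);
    }
  }
  return hits;
}
```

With the assumed input, the condition element == 106 holds for exactly one work-item (index 6). A condition such as element == 107 never triggers at line 59, because that element belongs to an odd-indexed work-item that takes the else-branch.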