Debug a SYCL* Application on a GPU
Tutorial: Debugging with Intel® Distribution for GDB*
Use a simple SYCL application, Array Transform, to perform basic debugging operations such as break, run, print, continue, info, disassemble, and next. This tutorial also describes how to interact with SIMD lanes, which the debugger treats as additional thread elements. The application is directed to run on a GPU by setting the ONEAPI_DEVICE_SELECTOR=level_zero:gpu environment variable.
The Array Transform application used in this tutorial is available in the Intel oneAPI samples repository or through the oneapi-cli sample browser tool. After you have installed and initialized the Intel oneAPI Base Toolkit (by sourcing setvars.sh), run oneapi-cli --help in your terminal for usage information. The sample includes a build script that creates an application that can be debugged and run on either a CPU or a GPU (the compiler debug flags are set during the build).
Before you proceed, make sure you have completed all necessary setup steps described in the Get Started Guide.
Basic Debugging
If you have not already done so, start the debugger.
You must set the following environment variables to ensure that the kernel is offloaded to the correct device and that GPU debugging is enabled:
Example output:
Exit gdb-oneapi by typing: quit
Consider the array-transform.cpp example again:
The code processes elements of the input array depending on whether they are even or odd and produces an output array.
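The branch structure can be pictured with a host-only C++ sketch. This is not the sample's source: the function names, the constants in each branch, and the choice to branch on the index (as later steps of this tutorial describe) are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element work: one path for even indices, another for odd.
// The constants are placeholders, not values taken from the sample.
int transform_element(std::size_t index, int element) {
    int result = element + 50;
    if (index % 2 == 0) {
        result += 50;   // "then" branch: even index
    } else {
        result = -1;    // "else" branch: odd index
    }
    return result;
}

// Applies transform_element to every input element, the way a kernel
// would with one work-item per element.
std::vector<int> transform_array(const std::vector<int>& input) {
    std::vector<int> output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output.push_back(transform_element(i, input[i]));
    }
    assert(output.size() == input.size());
    return output;
}
```

On a GPU, the two branches are executed by different subsets of SIMD lanes within the same thread, which is why a breakpoint in each branch reports a different set of active lanes.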
Start gdb-oneapi again and set the required environment variables:
Set two breakpoints inside the kernel (one for each conditional branch) as follows:
Expected output:
Expected output:
To start the program, execute:
You should see the following output:
The debugger has a mechanism called Auto-Attach that spawns an instance of gdbserver-ze to listen to and control the GPU for debugging. In the example above, the auto-attach mechanism is triggered and gdbserver-ze is added to the debugger as an inferior. An inferior in GDB represents the unit under debug. In this case, the host application process and the GPU device each correspond to an inferior.
Check the presence of gdbserver-ze as follows:
Expected output:
Execute the info devices command to see further details about the device.
Expected output:
The breakpoint event is received from inferior 2, which represents the GPU. The thread ID 2.129:1 points to thread 129 of inferior 2 and indicates that the first active SIMD lane (lane 1) is now in focus.
The breakpoint at line 59 is hit first. The order of branch execution is defined by the Intel® Graphics Compiler.
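To make the ID notation concrete, the following C++ sketch splits a fully qualified ID such as 2.129:1 into its inferior, thread, and lane parts. The LaneId struct and the parse_lane_id function are hypothetical names introduced for illustration; they are not part of the debugger, and the sketch handles only the full three-part form.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical holder for the three parts of a qualified ID.
struct LaneId {
    int inferior;
    int thread;
    int lane;
};

// Parses text of the form "inferior.thread:lane", for example "2.129:1".
// Returns false if fewer than three fields could be read.
bool parse_lane_id(const char* text, LaneId& out) {
    assert(text != nullptr);
    return std::sscanf(text, "%d.%d:%d",
                       &out.inferior, &out.thread, &out.lane) == 3;
}
```

For "2.129:1" the sketch yields inferior 2, thread 129, and lane 1, matching the reading given above.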
Check which SIMD lanes are currently active with the following command. The -stopped flag filters out GPU threads that are currently unavailable (e.g. not utilized by the program). We recommend using it to obtain more concise output. We also recommend the with print frame-arguments none -- prefix to reduce the overhead of the command, which can be noticeable because the debugger has to fetch the state of a large number of GPU threads.
In the example, thread 2.129 has 4 active SIMD lanes: 1, 3, 5, and 7. The asterisk ‘*’ marks the current SIMD lane. See the expected output below.
To switch the focus to a different SIMD lane, use the thread <thread_ID> command. Thread ID is specified by a triple: inferior.thread:lane. See examples of working with particular lanes:
Example output:
Example output:
Example output:
Example output:
Obtain the corresponding inferior number using the info inferiors command:
Run the info threads command and supply the obtained inferior numbers, followed by a star-wildcard thread range .*:
Expected output:
As you are now inside the kernel running on the GPU, you can look into the assembly code and GPU registers, for example, to understand the cause of unexpected application behavior. Get the GPU assembly code to inspect generated instructions by executing the following command:
See an example output below:
To learn more about GEN assembly and registers, refer to the “Introduction to GEN assembly” article.
To display a list of GPU registers, run the following command:
You can use registers to see the state of the application or inspect arithmetic instructions, such as which operands are used and where the result is located.
Additionally, you can inspect the execution mask ($emask register), which shows active lanes. To print the result in binary format, use the /t format flag as follows:
Example output:
Recall that you have stopped at line 59, the else-branch of the condition that checks evenness of the work-item index. Hence, every other SIMD lane is inactive, as indicated by the $emask bit pattern.
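The relationship between the mask bits and the lane numbers can be sketched in C++. The sketch assumes that bit 0 of the mask corresponds to SIMD lane 0 and that the thread is 8 lanes wide; both are assumptions for illustration, and active_lanes and to_binary are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Lists the lane numbers whose bit is set in an execution mask.
// Assumption: bit 0 corresponds to SIMD lane 0.
std::vector<unsigned> active_lanes(std::uint32_t emask, unsigned simd_width) {
    assert(simd_width <= 32);
    std::vector<unsigned> lanes;
    for (unsigned lane = 0; lane < simd_width; ++lane) {
        if ((emask >> lane) & 1u) {
            lanes.push_back(lane);
        }
    }
    return lanes;
}

// Renders the mask as binary text, highest lane first.
std::string to_binary(std::uint32_t emask, unsigned simd_width) {
    assert(simd_width <= 32);
    std::string bits;
    for (unsigned lane = simd_width; lane-- > 0;) {
        bits += ((emask >> lane) & 1u) ? '1' : '0';
    }
    return bits;
}
```

Under these assumptions, a mask of 0xAA reads as 10101010 in binary and decodes to lanes 1, 3, 5, and 7, the same set of active lanes reported for thread 2.129 earlier.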
To move forward and stop at the then-branch, set the scheduler-locking mode to step and execute the next command. The set scheduler-locking step command keeps the other threads stopped while the current thread is stepping:
You should see the following output:
Due to the breakpoint event, the SIMD lane focus switches to the first active lane in the then-branch, which is SIMD lane 0. The other threads of inferior 2 remain at line 59:
Example output:
Since the thread is vectorized, you can also inspect the vector of a local variable:
Example output:
SIMD Lanes
To investigate the program state from the point of view of SIMD lanes without switching, use the thread apply command. You can specify a SIMD lane as a number:
Example output:
You can also specify a SIMD lane as a range. In this case, only active SIMD lanes from the range are considered:
Example output:
To denote all active SIMD lanes, use the wildcard:
Example output:
To apply the command to all active SIMD lanes of all threads, use the all-lanes parameter:
Example output:
You can mix SIMD lane ranges with thread ranges and the thread wildcard. For example, to apply the command to all active lanes of all threads of inferior 2, you can use any of the following commands:
If the current inferior is 2, the inferior number can be skipped:
If you need formatted output for a set of threads, thread apply can be used together with the printf command, as in the following examples.
The following produces more compact output than thread apply *:* print element:
In the above command, the -q flag suppresses the standard thread information that thread apply usually prints. To print the thread context in a compact way, three convenience variables are used:
$_inferior to get the inferior number;
$_thread to get the thread number within the inferior;
$_simd_lane to get the SIMD lane.
To get a more hierarchical view, you can combine thread apply * (which applies a command to all threads of the current inferior) with the command thread apply :* <printf>. The latter applies the <printf> command to every active SIMD lane of a thread, selected by thread apply *. The result might look as follows:
Work-Item Coordinates
The GPGPU execution model defines a work-item as one of the parallel executions of a kernel function.
Use the convenience variables $_thread_workgroup, $_workitem_local_id, and $_workitem_global_id to get the coordinates of the work-item processed by the current context, defined by the current thread and its current lane.
Note that these convenience variables show work-item coordinates in X-Y-Z notation, as per the execution model of the device, while the SYCL execution model numbers the dimensions 1-2-3. The SYCL runtime often performs an optimization that transposes the SYCL dimensions, so that 1-2-3 corresponds to Z-Y-X.
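The transposition can be illustrated with a small C++ sketch. It assumes that the transposition is in effect and that it reverses only the dimensions the kernel actually uses, leaving unused coordinates at 0; whether this holds for a given kernel depends on the runtime. The sycl_to_xyz function is a hypothetical helper, not a debugger or SYCL API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Maps a SYCL index (dimensions 1-2-3 stored at positions 0-1-2) onto
// device X-Y-Z order by reversing the `dims` dimensions in use.
// Assumption: unused coordinates stay 0.
std::array<std::size_t, 3> sycl_to_xyz(const std::array<std::size_t, 3>& sycl_id,
                                       std::size_t dims) {
    assert(dims >= 1 && dims <= 3);
    std::array<std::size_t, 3> xyz{0, 0, 0};
    for (std::size_t d = 0; d < dims; ++d) {
        xyz[dims - 1 - d] = sycl_id[d];
    }
    return xyz;
}
```

With this mapping, a one-dimensional index 37 appears as {37,0,0}, while a three-dimensional index (1,2,3) appears as {3,2,1}.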
Find a Specific Work-item
Using the convenience variables, you can find the thread and lane that process a specific work-item.
The first option is to define a conditional breakpoint. However, for a program with many threads, it can take a while until the breakpoint is hit. In the following example, we set a conditional breakpoint for the work-item with the global ID {37,0,0}:
The second option is to use the thread apply command and store the found thread ID and lane number into the convenience variables $thr and $lane. The $found variable shows whether the search was successful. In the following example, we search for a work-item with the global ID {47,0,0}, and then switch to the found thread and lane:
Filter Threads by a Work-group
By combining thread apply and eval, we can filter threads by a specific expression. In the following example, we filter by $_thread_workgroup=={0,0,0}.
First, we construct a convenience variable $ids that holds a stringified list of qualified ids (<inferior num>.<thread num>), which belong to the work-group:
Note that the variable $ids must be initialized with an empty string first. The eval GDB command is used here to append the list of already found ids to the newly found one, or to leave the list unchanged if the condition does not hold.
Now the convenience variable $ids contains the list of filtered thread ids.
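The accumulation logic behind $ids can be sketched outside the debugger in C++. The ThreadInfo struct, the helper names, and the space separator are assumptions for illustration; the sketch mirrors only the idea of starting from an empty string and placing each newly found id in front of the ids found so far.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical snapshot of one GPU thread.
struct ThreadInfo {
    int inferior;
    int thread;
    std::array<int, 3> workgroup;
};

// Formats "<inferior num>.<thread num>"; GDB numbers both from 1.
std::string qualified_id(int inferior, int thread) {
    assert(inferior > 0 && thread > 0);
    return std::to_string(inferior) + "." + std::to_string(thread);
}

// Collects the ids of all threads that belong to the wanted work-group.
std::string collect_ids(const std::vector<ThreadInfo>& threads,
                        const std::array<int, 3>& wanted) {
    std::string ids;  // starts empty, like $ids
    for (const ThreadInfo& t : threads) {
        if (t.workgroup == wanted) {
            const std::string id = qualified_id(t.inferior, t.thread);
            // New id first, followed by the ids found so far.
            ids = ids.empty() ? id : id + " " + ids;
        }
        // Otherwise the list is left unchanged.
    }
    return ids;
}
```

For three threads 2.1, 2.2, and 2.3, of which only 2.1 and 2.3 belong to work-group {0,0,0}, the sketch returns "2.3 2.1".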
To call info threads for these ids, we need to use eval again, since the info threads command cannot take a list of threads stored in a convenience variable:
Breakpoint Actions
You can define a set of actions for a breakpoint to be executed when the breakpoint is hit. By default, the actions are executed in the context of the SIMD lane selected after the hit.
Quit the current debugging session and start a new one:
Define two temporary breakpoints with actions for the if and else branches:
Set a temporary breakpoint:
Example output:
Define an action:
When you are asked to type commands, enter the following:
When you are done with each command, finish with the end keyword.
Set another temporary breakpoint:
Example output:
Define an action to be executed for all SIMD lanes by adding the /a modifier:
When you are asked to type commands, enter the following:
Start the program:
Example output:
Continue to hit both breakpoints:
Example output:
The action for the breakpoint at the else branch was executed only for SIMD lane 1, while the action at the then branch was executed for all active SIMD lanes.
NOTE: For conditional breakpoints, the actions are executed only for SIMD lanes that meet the condition.
Conditional Breakpoints
Quit the debugging session and start the program from the beginning:
This time set a breakpoint at line 57 with the condition element==106:
Example output:
Run the program (execute the run command) and check if the output looks as follows:
The condition is true for lane 6 in thread 2.193.
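The per-lane nature of the condition can be illustrated with a C++ sketch. It assumes a thread with eight lanes holding the elements 100 through 107, of which only the even lanes are active, and it assumes that the condition is evaluated only for active lanes. The lanes_meeting_condition function is a hypothetical helper, not a debugger API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the lanes that are active AND whose element equals `wanted`.
// Inactive lanes are never reported, whatever value they hold.
std::vector<unsigned> lanes_meeting_condition(const std::vector<int>& elements,
                                              const std::vector<bool>& active,
                                              int wanted) {
    assert(elements.size() == active.size());
    std::vector<unsigned> hits;
    for (std::size_t lane = 0; lane < elements.size(); ++lane) {
        if (active[lane] && elements[lane] == wanted) {
            hits.push_back(static_cast<unsigned>(lane));
        }
    }
    return hits;
}
```

Under these assumptions, searching for element 106 reports only lane 6, while searching for 105 reports nothing because lane 5 is inactive.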