Multi-Device Debugging
The Intel® Distribution for GDB (gdb-oneapi) supports debugging applications on systems with multiple GPUs and/or sub-devices, subject to some important restrictions and limitations.
When you debug an application that includes GPU “offload kernels,” each kernel occupies an entire GPU sub-device, even if the kernel uses only part of that sub-device.
When a kernel being debugged is paused (at a breakpoint, while single-stepping, etc.), the kernel remains in place on the GPU and prevents other kernels from using the GPU sub-device.
Enabling debugging of your application’s offload kernels (ZET_ENABLE_PROGRAM_DEBUGGING=1) blocks parallel execution of kernels on the sub-device, so your application may take longer to run. While the kernel being debugged is paused, the GPU may appear to be hung.
There are essentially three multi-device debug scenarios to be aware of:
1. An application submits kernels to multiple devices.
2. Multiple applications submit kernels to different devices or sub-devices.
3. Multiple applications submit kernels to the same sub-device.
The number and type of GPUs available in a system can be listed using the sycl-ls command. The output below shows a system that has two GPU cards, which are available for use by “offload” kernels running on either the OpenCL™ backend or the Intel® oneAPI Level Zero backend.
$ sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
The example below shows the output of the sycl-ls command when the ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:* (in this example, restricting the application’s offload kernels to any GPU devices available to the Level Zero backend):
$ export ONEAPI_DEVICE_SELECTOR=level_zero:*
$ sycl-ls
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:*.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
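When scripting a debug setup, it can help to know how many GPUs each backend exposes. The following is a minimal sketch that counts GPU entries in captured sycl-ls text. It assumes the "[backend:type:index]" line format shown in the listings above (other sycl-ls versions may print a different format), and the function name count_gpus_per_backend is hypothetical.

```python
import re
from collections import Counter

# Leading "[backend:type:index]" tag, as printed in the listings above.
# Assumption: other sycl-ls versions may use a different tag format.
TAG = re.compile(r"^\[([^:\]]+):([^:\]]+):(\d+)\]")


def count_gpus_per_backend(sycl_ls_output: str) -> Counter:
    """Count GPU entries per backend in captured sycl-ls text (hypothetical helper)."""
    counts = Counter()
    for line in sycl_ls_output.splitlines():
        match = TAG.match(line.strip())
        # Warning lines do not match the tag; non-GPU entries are filtered out.
        if match and match.group(2) == "gpu":
            counts[match.group(1)] += 1
    return counts


SAMPLE = """\
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:*.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
"""

print(dict(count_gpus_per_backend(SAMPLE)))  # {'ext_oneapi_level_zero': 2}
```

The sketch works on text that has already been captured, so it does not depend on sycl-ls being installed where the script runs.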
Scenario 1: An Application Uses Multiple Devices
The debugger supports debugging a program that offloads multiple kernels to multiple GPU devices and/or sub-devices. Each sub-device appears in the debugger as a separate inferior. The auto-attach feature initializes the devices for debugging and creates the corresponding inferiors.
A possible output is as follows:
$ gdb-oneapi -q --args ./multi-device
Reading symbols from ./multi-device...
(gdb) break get_transformed
Breakpoint 1 at 0x40431a: file multi-device.cpp, line 27.
(gdb) run
Starting program: /path/to/multi-device
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
intelgt: gdbserver-ze started for process 581849.
[New Thread 0x7fffe4645700 (LWP 581871)]
[Switching to Thread 1.97 lane 0]
Thread 2.97 hit Breakpoint 1, with SIMD lanes [0-15], get_transformed (data=1, device_idx=0) at multi-device.cpp:27
27 return data * 3 + 11 * (device_idx + 1);
We can check the devices’ inferiors using the following command:
info inferiors
The output below lists five inferiors: the host process, followed by four device inferiors, one for each sub-device. Devices are enumerated in the following format: [<pci-location>].<sub-device-id>.
Num Description Connection Executable
1 process 581849 1 (native) /path/to/multi-device
* 2 device [0000:3a:00.0].0 2 (extended-remote | gdbserver-ze --multi --once -)
3 device [0000:3a:00.0].1 2 (extended-remote | gdbserver-ze --multi --once -)
4 device [0000:9a:00.0].0 2 (extended-remote | gdbserver-ze --multi --once -)
5 device [0000:9a:00.0].1 2 (extended-remote | gdbserver-ze --multi --once -)
Type "info devices" to see details of the devices.
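The device descriptions in the listing above follow the [<pci-location>].<sub-device-id> format. As an illustration only, the sketch below splits such a description into its two parts. The names DeviceId and parse_device_id are hypothetical, and the sketch assumes that a description may also appear without the sub-device suffix.

```python
import re
from typing import NamedTuple, Optional


class DeviceId(NamedTuple):
    """Hypothetical container for one parsed device description."""
    pci_location: str
    sub_device: Optional[int]  # None when no ".<sub-device-id>" suffix is present


# "[<pci-location>]" optionally followed by ".<sub-device-id>".
DEVICE = re.compile(r"^\[(?P<pci>[0-9a-fA-F:.]+)\](?:\.(?P<sub>\d+))?$")


def parse_device_id(text: str) -> DeviceId:
    """Parse a description such as "[0000:3a:00.0].1" (hypothetical helper)."""
    match = DEVICE.match(text.strip())
    if match is None:
        raise ValueError(f"not a device description: {text!r}")
    sub = match.group("sub")
    return DeviceId(match.group("pci"), int(sub) if sub is not None else None)


print(parse_device_id("[0000:3a:00.0].1"))
# DeviceId(pci_location='0000:3a:00.0', sub_device=1)
```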
We can display further information using the following command:
info devices
A possible output is shown below:
Num Location Sub-device Vendor Id Target Id Cores Device Name
* 1 [0000:3a:00.0] 0 0x8086 0x0bd5 512 Intel(R) Graphics [0x0bd5]
2 [0000:3a:00.0] 1 0x8086 0x0bd5 512 Intel(R) Graphics [0x0bd5]
3 [0000:9a:00.0] 0 0x8086 0x0bd5 512 Intel(R) Graphics [0x0bd5]
4 [0000:9a:00.0] 1 0x8086 0x0bd5 512 Intel(R) Graphics [0x0bd5]
Applications can be limited to a specific set of GPU devices and sub-devices by using the ZE_AFFINITY_MASK environment variable. For example, running the debug session above with ZE_AFFINITY_MASK=0.0 set in the environment gives the following output:
(gdb) info inferiors
Num Description Connection Executable
1 process 581966 1 (native) /path/to/multi-device
* 2 device [0000:3a:00.0] 2 (extended-remote | gdbserver-ze --multi --once -)
Type "info devices" to see details of the devices.
(gdb) info devices
Num Location Sub-device Vendor Id Target Id Cores Device Name
* 1 [0000:3a:00.0] - 0x8086 0x0bd5 512 Intel(R) Graphics [0x0bd5]
See the Level Zero Specification Environment Variables documentation for more details about the usage of the ZE_AFFINITY_MASK environment variable.
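A launcher script might assemble the mask value from device and sub-device ordinals instead of hard-coding it. The sketch below is one way to do that, assuming the comma-separated "device" or "device.sub-device" entry form used in this section. The helper names build_affinity_mask and debug_environment are hypothetical.

```python
from typing import Iterable, Mapping, Optional, Tuple


def build_affinity_mask(entries: Iterable[Tuple[int, Optional[int]]]) -> str:
    """Join (device, sub_device) pairs into a mask string (hypothetical helper).

    A sub_device of None selects the whole device.
    """
    parts = []
    for device, sub_device in entries:
        if device < 0 or (sub_device is not None and sub_device < 0):
            raise ValueError("device and sub-device ordinals must be non-negative")
        parts.append(str(device) if sub_device is None else f"{device}.{sub_device}")
    if not parts:
        raise ValueError("at least one entry is required")
    return ",".join(parts)


def debug_environment(base_env: Mapping[str, str], mask: str) -> dict:
    """Return a copy of base_env with ZE_AFFINITY_MASK set (hypothetical helper)."""
    env = dict(base_env)
    env["ZE_AFFINITY_MASK"] = mask
    return env


mask = build_affinity_mask([(0, 0)])
print(mask)  # 0.0

# Possible use, not executed here:
#   subprocess.run(["gdb-oneapi", "--args", "./multi-device"],
#                  env=debug_environment(os.environ, mask))
```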
Scenario 2: Multiple Applications Use Different Devices and Sub-Devices
Multiple applications can be debugged simultaneously, each under a separate instance of the debugger. For example, the Array Transform application from the Basic Debugging section can be started to utilize sub-device 0 of GPU 0 as follows:
$ ZE_AFFINITY_MASK=0.0 gdb-oneapi array-transform
...
(gdb) run gpu
...
While this first application is being debugged (e.g., GPU threads have hit a breakpoint and the application’s state is under investigation), another process, owned by the same or a different user, can freely use another sub-device and/or GPU, e.g., sub-device 1 of GPU 0 (note the change in the affinity mask compared to the previous example):
$ ZE_AFFINITY_MASK=0.1 gdb-oneapi array-transform
...
(gdb) run gpu
...
As long as the applications use different sub-devices, simultaneous debugging works.
As an alternative to setting ZE_AFFINITY_MASK as shown above, the applications may also select GPUs and sub-devices programmatically.
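Before starting a second debug session, a script could check that its affinity mask does not share a sub-device with a session that is already running. The sketch below illustrates such a check under two assumptions: masks use the "device" or "device.sub-device" entry form shown in this section, and an entry without a sub-device covers every sub-device of that device. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, Optional[int]]  # (device, sub_device); None means whole device


def parse_affinity_mask(mask: str) -> List[Entry]:
    """Split a mask such as "0.0,1" into (device, sub_device) pairs."""
    entries = []
    for item in mask.split(","):
        device, _, sub_device = item.strip().partition(".")
        entries.append((int(device), int(sub_device) if sub_device else None))
    return entries


def masks_overlap(mask_a: str, mask_b: str) -> bool:
    """Return True if the two masks can select a common sub-device."""
    for dev_a, sub_a in parse_affinity_mask(mask_a):
        for dev_b, sub_b in parse_affinity_mask(mask_b):
            if dev_a != dev_b:
                continue
            # A whole-device entry overlaps any sub-device of the same device.
            if sub_a is None or sub_b is None or sub_a == sub_b:
                return True
    return False


print(masks_overlap("0.0", "0.1"))  # False: different sub-devices of GPU 0
print(masks_overlap("0.0", "0"))    # True: "0" includes sub-device 0
```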
Scenario 3: Multiple Applications Use the Same Sub-Device
A restriction applies when different applications use the same sub-device. In this case, the kernel submitted by the application under debug occupies the entire sub-device for the duration of the debug session, until the kernel finishes. No other kernels can run on the same sub-device while a kernel is being debugged. Hence, other applications that submit kernels to that sub-device may appear to wait indefinitely.
When debugging an MPI application, it is recommended to assign at most one rank to a sub-device. Assigning more than one rank to a sub-device serializes the ranks, so ranks waiting in the queue stay paused during an interactive debug session.
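One way to follow this recommendation is to derive a distinct affinity mask for each rank on a node. The sketch below maps a node-local rank to a single sub-device and rejects ranks beyond the number of available sub-devices. It assumes every GPU on the node has the same number of sub-devices. The function name is hypothetical, and how the node-local rank is obtained depends on the MPI launcher, so it is not shown here.

```python
def affinity_mask_for_rank(local_rank: int,
                           num_devices: int,
                           sub_devices_per_device: int) -> str:
    """Map a node-local rank to one "device.sub-device" mask (hypothetical helper).

    Assumption: all devices expose the same number of sub-devices.
    """
    total = num_devices * sub_devices_per_device
    if not 0 <= local_rank < total:
        # More ranks than sub-devices would force two ranks to share one.
        raise ValueError(
            f"local rank {local_rank} does not fit on {total} sub-devices")
    device, sub_device = divmod(local_rank, sub_devices_per_device)
    return f"{device}.{sub_device}"


# Two GPUs with two sub-devices each, as in the listings in this section.
print([affinity_mask_for_rank(rank, 2, 2) for rank in range(4)])
# ['0.0', '0.1', '1.0', '1.1']
```

Each rank would then be started with ZE_AFFINITY_MASK set to its own value, so that no two ranks on the node share a sub-device.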