Multi-Device Debugging

Debug with Intel® Distribution for GDB* on Linux* OS Host

ID 766459

Date 10/31/2024

Debugging applications on systems with multiple GPUs and/or sub-devices is supported by the Intel® Distribution for GDB (aka gdb-oneapi), with some important restrictions and limitations.

When debugging an application that includes GPU “offload kernels,” each kernel uses an entire GPU sub-device, even if that kernel only utilizes a subset of the sub-device.
When a kernel being debugged is paused (at a breakpoint, single-stepping, etc.), the kernel remains in place on the GPU, preventing other kernels from using the GPU sub-device.

Enabling debugging (ZET_ENABLE_PROGRAM_DEBUGGING=1) of your application’s offload kernels blocks parallel execution of kernels on the sub-device, so your application may take longer to run. While the kernel being debugged is paused, the GPU may appear to be hung.
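
Because the slowdown applies whenever debugging is enabled, one option is to scope the variable to a single command instead of exporting it for the whole shell. The following is a minimal shell sketch of that idea; the function name run_with_gpu_debug is hypothetical and not part of the product.

```shell
# Run one command with GPU kernel debugging enabled.
# The variable is set only in that command's environment, so other
# jobs started from this shell are not affected.
# NOTE: run_with_gpu_debug is a hypothetical helper name.
run_with_gpu_debug() {
    env ZET_ENABLE_PROGRAM_DEBUGGING=1 "$@"
}

# Example (not executed here):
#   run_with_gpu_debug gdb-oneapi -q --args ./multi-device
```

Using env rather than export keeps the setting from leaking into later, non-debug runs in the same terminal.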

There are essentially three multi-device debug scenarios to be aware of:

An application submits kernels to multiple devices.
Multiple applications submit kernels to different devices or sub-devices.
Multiple applications submit kernels to the same sub-device.

The number and type of GPUs available in a system can be listed using the sycl-ls command. The output below shows a system that has two GPU cards, which are available for use by “offload” kernels running on either the OpenCL™ backend or the Intel® oneAPI Level Zero backend.

$ sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]

NOTE:

As of the 2023.0 oneAPI product release, debugging GPU kernels with the Intel® Distribution for GDB (gdb-oneapi) is only supported on Level Zero backends. Debugging GPU kernels on OpenCL backends is no longer supported by the gdb-oneapi debugger. The ONEAPI_DEVICE_SELECTOR environment variable can be used to restrict which GPU devices, sub-devices and backends are used by your application during a debugging session.

The example below shows the output of the sycl-ls command when the ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:* (in this example, restricting the application’s offload kernels to any GPU devices available to the Level Zero backend):

$ export ONEAPI_DEVICE_SELECTOR=level_zero:*
$ sycl-ls
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:*.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
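
Before starting a debug session, it can be useful to confirm from a script that at least one Level Zero GPU is visible. The sketch below counts Level Zero GPU entries in sycl-ls output read from standard input. It assumes the bracketed line format shown above, which may differ in other oneAPI releases; the function name count_level_zero_gpus is hypothetical.

```shell
# Count Level Zero GPU entries in sycl-ls output read from stdin.
# Assumes lines begin with "[...level_zero:gpu:N]" as in the listing above.
# NOTE: count_level_zero_gpus is a hypothetical helper name.
count_level_zero_gpus() {
    # grep -c prints 0 and returns non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -c '^\[[a-z_]*level_zero:gpu:[0-9]*]' || true
}

# Example (not executed here):
#   sycl-ls | count_level_zero_gpus
```

Warning lines and OpenCL entries do not match the pattern, so they are not counted.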

Scenario 1: An Application Uses Multiple Devices

The debugger supports debugging a program that offloads multiple kernels to multiple GPU devices and/or sub-devices. Each sub-device appears in the debugger as a separate inferior. The auto-attach feature initializes the devices for debugging and creates the corresponding inferiors.

A possible output is as follows:

$ gdb-oneapi -q --args ./multi-device
Reading symbols from ./multi-device...
(gdb) break get_transformed
Breakpoint 1 at 0x40431a: file multi-device.cpp, line 27.
(gdb) run
Starting program: /path/to/multi-device
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
intelgt: gdbserver-ze started for process 581849.
[New Thread 0x7fffe4645700 (LWP 581871)]
[Switching to Thread 1.97 lane 0]

Thread 2.97 hit Breakpoint 1, with SIMD lanes [0-15], get_transformed (data=1, device_idx=0) at multi-device.cpp:27
27        return data * 3 + 11 * (device_idx + 1);

We can check the devices’ inferiors using the following command:

info inferiors

The output below lists four device inferiors, one for each sub-device, in addition to the host process (inferior 1). Devices are enumerated in the format [<pci-location>].<sub-device-id>.

  Num  Description                   Connection                                  Executable
  1    process 581849                1 (native)                                  /path/to/multi-device
* 2    device [0000:3a:00.0].0       2 (extended-remote | gdbserver-ze --multi --once -)
  3    device [0000:3a:00.0].1       2 (extended-remote | gdbserver-ze --multi --once -)
  4    device [0000:9a:00.0].0       2 (extended-remote | gdbserver-ze --multi --once -)
  5    device [0000:9a:00.0].1       2 (extended-remote | gdbserver-ze --multi --once -)
Type "info devices" to see details of the devices.
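
The enumeration format is straightforward to split in a script, for example when post-processing a saved debugger log. The following is a minimal POSIX shell sketch, assuming the identifier has already been extracted from the Description column; the function name parse_device_id is hypothetical.

```shell
# Split a device identifier such as "[0000:3a:00.0].1" into its PCI
# location and sub-device id. Prints "<pci-location> <sub-device-id>",
# using "-" when the identifier carries no sub-device suffix.
# NOTE: parse_device_id is a hypothetical helper name.
parse_device_id() {
    dev_id=$1
    dev_pci=${dev_id#\[}        # drop the leading "["
    dev_pci=${dev_pci%%\]*}     # drop everything from "]" onward
    case $dev_id in
        *\].*) dev_sub=${dev_id##*\].} ;;   # text after "]."
        *)     dev_sub=- ;;                 # no sub-device suffix
    esac
    printf '%s %s\n' "$dev_pci" "$dev_sub"
}
```

An identifier without a suffix, as printed when only one sub-device is exposed, yields "-" for the sub-device field.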

We can display further information using the following command:

info devices

A possible output is shown below:

  Num   Location        Sub-device   Vendor Id   Target Id   Cores   Device Name
* 1     [0000:3a:00.0]  0            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  2     [0000:3a:00.0]  1            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  3     [0000:9a:00.0]  0            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  4     [0000:9a:00.0]  1            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]

NOTE:

Switching between the inferiors and threads is the same as explained in the Basic Debugging section.

Applications can be limited to a specific set of GPU devices and sub-devices by using the ZE_AFFINITY_MASK environment variable. For example, running the same debug session with ZE_AFFINITY_MASK=0.0 set in the environment gives the output below:

(gdb) info inferiors
  Num  Description                   Connection                                  Executable
  1    process 581966                1 (native)                                  /path/to/multi-device
* 2    device [0000:3a:00.0]         2 (extended-remote | gdbserver-ze --multi --once -)
Type "info devices" to see details of the devices.

(gdb) info devices
  Num   Location        Sub-device   Vendor Id   Target Id   Cores   Device Name
* 1     [0000:3a:00.0]  -            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]

See the Level Zero Specification Environment Variables documentation for more details about the usage of the ZE_AFFINITY_MASK environment variable.
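
As a quick reference, the sketch below lists a few common mask forms and adds a small syntax check that can catch typos before a session starts. It assumes the hierarchical device.sub-device form used in this section; the function name is_valid_affinity_mask is hypothetical, and the check validates syntax only, not whether the devices exist.

```shell
# Common ZE_AFFINITY_MASK forms (hierarchical device.sub-device notation):
#   0         all sub-devices of device 0
#   0.1       sub-device 1 of device 0
#   0.0,1.0   sub-device 0 of device 0 and sub-device 0 of device 1

# Return success if $1 is a comma-separated list of "N" or "N.M" entries.
# NOTE: is_valid_affinity_mask is a hypothetical helper name.
is_valid_affinity_mask() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]+(\.[0-9]+)?(,[0-9]+(\.[0-9]+)?)*$'
}

# Example (not executed here):
#   is_valid_affinity_mask "0.0" && ZE_AFFINITY_MASK=0.0 gdb-oneapi -q --args ./multi-device
```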

Scenario 2: Multiple Applications Use Different Devices and Sub-Devices

Simultaneous debugging of applications, where each application runs under a separate instance of the debugger, is supported. For example, the Array Transform application from the Basic Debugging section can be started to utilize sub-device 0 of GPU 0 as follows:

$ ZE_AFFINITY_MASK=0.0 gdb-oneapi array-transform
...
(gdb) run gpu
...

While this first application is being debugged (e.g., GPU threads hit a breakpoint and the application’s state is under investigation), another process of the same or a different user can freely utilize another sub-device and/or GPU, e.g. sub-device 1 of GPU 0 (note the change in the affinity mask compared to the previous example):

$ ZE_AFFINITY_MASK=0.1 gdb-oneapi array-transform
...
(gdb) run gpu
...

As long as the applications use different sub-devices, simultaneous debugging works.

As an alternative to using the ZE_AFFINITY_MASK above, the applications may also select GPUs and sub-devices programmatically.

Scenario 3: Multiple Applications Use the Same Sub-Device

A restriction to multi-device debugging occurs when different applications utilize the same sub-device. In this case, the kernel submitted by the application under debug occupies the entire sub-device during the debug session, until the kernel finishes. No other kernels can be run on the same sub-device while a kernel is being debugged. Hence, other applications submitting kernels to that sub-device may appear to be waiting indefinitely.

When debugging an MPI application, assign at most one rank to each sub-device. Assigning more than one rank to a sub-device serializes those ranks: during an interactive debug session, the ranks waiting in the queue are paused until the kernel being debugged finishes.
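
One way to keep ranks on separate sub-devices is to derive ZE_AFFINITY_MASK from the node-local rank in a small wrapper. The following is a minimal sketch, assuming every GPU on the node exposes the same number of sub-devices. The function name rank_to_affinity_mask is hypothetical, and the environment variable that holds the local rank depends on the MPI implementation (MPI_LOCALRANKID in the usage comment is an assumption for Intel MPI).

```shell
# Map a node-local MPI rank to a "device.sub-device" affinity mask so
# that consecutive ranks land on distinct sub-devices.
#   $1: node-local rank, 0-based
#   $2: number of sub-devices per GPU (2 on the example system above)
# NOTE: rank_to_affinity_mask is a hypothetical helper name.
rank_to_affinity_mask() {
    echo "$(( $1 / $2 )).$(( $1 % $2 ))"
}

# Example wrapper line (not executed here; variable name is an assumption):
#   ZE_AFFINITY_MASK=$(rank_to_affinity_mask "$MPI_LOCALRANKID" 2) ./app
```

The function does not check that the resulting device index exists, so the number of ranks per node should not exceed the number of sub-devices available.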
