Multi-Device Debugging
Debugging applications on systems with multiple GPUs and/or sub-devices is supported by the Intel® Distribution for GDB (also known as gdb-oneapi), with some important restrictions.
When debugging an application that includes GPU “offload kernels,” each kernel being debugged occupies an entire GPU sub-device, even if that kernel only utilizes a subset of the sub-device’s resources.
When a kernel being debugged is paused (at a breakpoint, while single-stepping, etc.), the kernel remains resident on the GPU, preventing other kernels from using that sub-device.
Enabling debugging of your application’s offload kernels (by setting ZET_ENABLE_PROGRAM_DEBUGGING=1) blocks parallel execution of kernels on the sub-device, which may increase your application’s run time. While a debugged kernel is paused, it may appear as if the GPU is hung.
There are essentially three multi-device debug scenarios to be aware of:
An application submits kernels to multiple devices.
Multiple applications submit kernels to different devices or sub-devices.
Multiple applications submit kernels to the same sub-device.
The number and type of GPUs available in a system can be listed using the sycl-ls command. The output below shows a system that has two GPU cards, which are available for use by “offload” kernels running on either the OpenCL™ backend or the Intel® oneAPI Level Zero backend.
The example below shows the output of the sycl-ls command when the ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:* (in this example, restricting the application’s offload kernels to any GPU devices available to the Level Zero backend):
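As a sketch, the variable can be set for a single shell session as follows. Only the environment setup is shown; the sycl-ls listing itself depends on the system, so the sycl-ls invocation is left commented out (it requires an installed oneAPI toolkit):

```shell
# Restrict offload kernels to GPU devices exposed by the Level Zero backend.
export ONEAPI_DEVICE_SELECTOR='level_zero:*'
echo "ONEAPI_DEVICE_SELECTOR=$ONEAPI_DEVICE_SELECTOR"
# sycl-ls    # lists the devices visible under this restriction
```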
Scenario 1: An Application Uses Multiple Devices
The debugger supports debugging a program that offloads multiple kernels to multiple GPU devices and/or sub-devices. Each sub-device appears in the debugger as a separate inferior. The auto-attach feature initializes the devices for debugging and creates the corresponding inferiors.
A possible output is as follows:
We can check the devices’ inferiors using the following command:
The output below presents four inferiors, one for each sub-device. Devices are enumerated using the format [<pci-location>].<sub-device-id>.
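The standard GDB command info inferiors lists them. The transcript below is illustrative only; the PCI locations are invented:

```
(gdb) info inferiors
  Num  Description                Executable
* 1    device [0000:18:00.0].0    /home/user/array-transform
  2    device [0000:18:00.0].1
  3    device [0000:51:00.0].0
  4    device [0000:51:00.0].1
```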
We can display further information using the following command:
A possible output is shown below:
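One such command is info devices, provided by gdb-oneapi. The transcript below is illustrative; the column values (PCI locations, target ids, core counts) are invented:

```
(gdb) info devices
  Location        Sub-device  Vendor Id  Target Id  Cores
* [0000:18:00.0]  0           0x8086     0x0bd5     448
  [0000:18:00.0]  1           0x8086     0x0bd5     448
  [0000:51:00.0]  0           0x8086     0x0bd5     448
  [0000:51:00.0]  1           0x8086     0x0bd5     448
```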
Applications can be limited to a specific set of GPU devices and sub-devices by using the ZE_AFFINITY_MASK environment variable. For example, the same debug session above gives the output below if run with ZE_AFFINITY_MASK=0.0:
See the Level Zero Specification Environment Variables documentation for more details about the usage of the ZE_AFFINITY_MASK environment variable.
Scenario 2: Multiple Applications Use Different Devices and Sub-Devices
Simultaneous debugging of applications, where each application runs under a separate instance of the debugger, is supported. For example, the Array Transform application from the Basic Debugging section can be started to utilize sub-device 0 of GPU 0 as follows:
While this first application is being debugged (e.g., GPU threads hit a breakpoint and the application’s state is under investigation), another process of the same or a different user can freely utilize another sub-device and/or GPU, e.g. sub-device 1 of GPU 0 (note the change in the affinity mask compared to the previous example):
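As a sketch, the two debug sessions might be started as follows. The application name is taken from the Array Transform example above; the gdb-oneapi invocations are commented out since they require a GPU system:

```shell
# Terminal 1: debug the first process on sub-device 0 of GPU 0.
export ZE_AFFINITY_MASK=0.0
echo "terminal 1: ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK"
# gdb-oneapi ./array-transform

# Terminal 2: a second, independent debug session on sub-device 1 of GPU 0.
export ZE_AFFINITY_MASK=0.1
echo "terminal 2: ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK"
# gdb-oneapi ./array-transform
```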
As long as the applications use different sub-devices, simultaneous debugging works.
As an alternative to using the ZE_AFFINITY_MASK above, the applications may also select GPUs and sub-devices programmatically.
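A minimal SYCL sketch of programmatic selection is shown below. It enumerates the GPU devices, partitions the first one into sub-devices by NUMA affinity (which may be unsupported on some hardware, in which case create_sub_devices throws), and builds a queue on sub-device 0 only. The device index and partitioning scheme are illustrative choices, not requirements:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Enumerate all GPU devices visible to the SYCL runtime.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.empty()) return 1;

  // Split the first GPU into sub-devices by NUMA affinity domain
  // (illustrative; not every device supports this partitioning).
  auto subs = gpus[0].create_sub_devices<
      sycl::info::partition_property::partition_by_affinity_domain>(
      sycl::info::partition_affinity_domain::numa);

  // Submit work only to sub-device 0 via this queue.
  sycl::queue q{subs.at(0)};
  std::cout << q.get_device().get_info<sycl::info::device::name>() << "\n";
}
```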
Scenario 3: Multiple Applications Use the Same Sub-Device
A restriction on multi-device debugging applies when different applications utilize the same sub-device. In this case, the kernel submitted by the application under debug occupies the entire sub-device for the duration of the debug session, until the kernel finishes. No other kernels can run on that sub-device while a kernel is being debugged. Hence, other applications submitting kernels to that sub-device may appear to be waiting indefinitely.
When debugging an MPI application, it is recommended to assign at most one rank to each sub-device. Assigning more than one rank to a sub-device serializes those ranks: during an interactive debug session, the ranks waiting in the queue will appear paused.
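One common way to achieve this one-rank-per-sub-device mapping is a per-rank wrapper script that derives a unique ZE_AFFINITY_MASK from the local rank id. The sketch below is hypothetical: it assumes the launcher exports MPI_LOCALRANKID (Intel MPI does; other launchers use different variables, e.g. OMPI_COMM_WORLD_LOCAL_RANK) and a node with two GPUs of two sub-devices each. Adapt the arithmetic to your topology:

```shell
#!/bin/sh
# Hypothetical wrapper; invoke as:  mpirun -n 4 ./wrapper.sh ./myapp
# Map local rank r to GPU (r / 2), sub-device (r % 2).
RANK=${MPI_LOCALRANKID:-0}
export ZE_AFFINITY_MASK="$((RANK / 2)).$((RANK % 2))"
echo "rank $RANK -> ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK"
# exec "$@"    # launch the application (or gdb-oneapi for the rank under debug)
```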