Debug a Page Fault on GPU
A page fault occurs when a thread attempts to access a memory location but the driver fails to map the request to an available page. For example, a thread that reads through a nullptr triggers a page fault. On platforms that support page fault detection, the debugger reports a page fault as a segmentation fault. If the debugger is not attached, the behavior depends on whether debugging is enabled. If debugging is not enabled, the application terminates and a “Segmentation fault” error is printed on stdout. If debugging is enabled, the error may be “DEVICE_LOST”.
If you get a “DEVICE_LOST” error while the debugger is not attached, run the application once with “ZET_ENABLE_PROGRAM_DEBUGGING=0” to see whether a “Segmentation fault” error is reported instead. If so, attach the debugger to find the page fault.
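The rerun described above can be scripted. The following is a minimal sketch, assuming a Linux host where Python's `subprocess` module reports a child killed by a signal as a negative return code. The helper name `dies_with_segfault` and the application path `./my_app` are hypothetical.

```python
import os
import signal
import subprocess


def dies_with_segfault(command):
    """Run `command` with program debugging disabled and report whether
    the process was terminated by SIGSEGV.

    `command` is an argument list such as ["./my_app"] (hypothetical path).
    """
    # Copy the current environment and disable program debugging for this run only.
    env = dict(os.environ, ZET_ENABLE_PROGRAM_DEBUGGING="0")
    result = subprocess.run(command, env=env)
    # On POSIX, a child killed by signal N is reported with returncode == -N.
    return result.returncode == -signal.SIGSEGV
```

If `dies_with_segfault(["./my_app"])` returns `True`, the run ended with a segmentation fault, and attaching the debugger is the next step. If the application is started through a wrapper script, the wrapper's exit code is reported instead, so this check may not apply as written.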
Environment variables
Examples
Memory access requests of a GPU thread are asynchronous. While a request is processed, the thread that triggered it may continue execution up to the point where the requested memory would be used or the thread would exit, whichever comes first. This means that when the exception is triggered and the thread is stopped, the thread's instruction pointer (IP) may already have advanced past the instruction that triggered the faulting request.
In the following example, the kernel attempts to read through a nullptr:
Note the warning “The location reported for the signal may be inaccurate”. The faulting read is requested at source line 36, but the value is used only at source line 37. Thus, it depends on timing whether the thread reaches line 37 before it is stopped and the SIGSEGV signal is triggered.
In the next example, the kernel attempts to write through a nullptr and then exits without accessing that variable again:
This time, the thread has already returned from its main function and is preparing to exit by the time it is stopped.
Identify the instruction triggering the faulting request
For a failing read request, the instruction that triggers the faulting request can be identified by setting the following environment variable:
With this setting, the thread stops immediately after executing the failing request.
In this example, the thread stops at 0x8000ffe58c00. This is the address of the instruction immediately following the one that triggers the exception.
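Because the stop address is that of the following instruction, the triggering instruction is the one that precedes it in the disassembly. The following is a minimal sketch of that lookup, assuming the instruction start addresses have been copied from the debugger's disassembly output into a sorted list. The helper name and the address list are hypothetical; only the stop address 0x8000ffe58c00 comes from the example.

```python
import bisect


def preceding_instruction(instruction_addresses, stop_address):
    """Return the start address of the instruction immediately before
    `stop_address`, or None if there is no earlier instruction.

    `instruction_addresses` must be sorted in ascending order.
    """
    # Index of the first listed address that is >= stop_address.
    idx = bisect.bisect_left(instruction_addresses, stop_address)
    if idx == 0:
        return None
    return instruction_addresses[idx - 1]


# Hypothetical instruction start addresses, for illustration only.
listing = [0x8000FFE58BE0, 0x8000FFE58BF0, 0x8000FFE58C00, 0x8000FFE58C10]

candidate = preceding_instruction(listing, 0x8000FFE58C00)
print(hex(candidate))
```

The lookup relies on the listing rather than on a fixed instruction size, so it makes no assumption about how instructions are encoded.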
However, write faults may be detected much later. In some cases, the fault is detected after the thread has returned from the kernel function and is about to exit, and finding the instruction that caused the page fault may require stepping from an earlier breakpoint.
All pending read and write requests of a thread are completed before the thread is stopped. Thus, we can use stepping to find the source line and the exact instruction that triggered the faulting request. If there are multiple threads, we should also set ‘scheduler-locking’ to avoid switching threads.
In the above example, we hit a breakpoint at source line 37, continue stepping by instruction until the exception occurs, and find that it was triggered by the instruction at 0x8000ffe18eb0.
Identify the faulting kernel
If the program has multiple kernels, we may need further steps to identify the faulting one. The examples above have only a single kernel, but they can still be used to show the required steps.
First, we use the command ‘info shared’ to get the addresses of the loaded modules. The thread IP gives the address where the faulting thread was stopped, so we use it to identify the right module. Then we use ‘info line’ to map source lines to modules.
In the above example, the thread stops at 0x8000ffe80980. This address falls within the module loaded in memory from 0x8000ffe70000 to 0x8000ffe90000, so we know the kernel was loaded there. Then we find that source line 35 is in memory from 0x8000ffe88b90 to 0x8000ffe88c40. This range lies within the memory range of the module, so we know that source line 35 belongs to a kernel in that module.
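The address comparison above can be written out as a short check. This is a minimal sketch that uses the addresses quoted in the text. The function names and the module name `kernel_module` are placeholders, and treating each range as half-open (start included, end excluded) is an assumption.

```python
def find_module(address, modules):
    """Return the name of the module whose [start, end) range contains
    `address`, or None if no module matches."""
    for name, start, end in modules:
        if start <= address < end:
            return name
    return None


def range_in_module(line_start, line_end, module_start, module_end):
    """Check that a source line's address range lies entirely inside a module."""
    return module_start <= line_start and line_end <= module_end


# Module range as reported by 'info shared'; the module name is a placeholder.
modules = [("kernel_module", 0x8000FFE70000, 0x8000FFE90000)]

stop_address = 0x8000FFE80980               # where the faulting thread stopped
line_35 = (0x8000FFE88B90, 0x8000FFE88C40)  # range reported by 'info line'

print(find_module(stop_address, modules))                          # kernel_module
print(range_in_module(*line_35, 0x8000FFE70000, 0x8000FFE90000))   # True
```

With several loaded modules, adding one tuple per module to `modules` lets the same lookup pick out the module that contains the stop address.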
Note that a module may contain multiple kernels, in which case the above method can only give a set of candidate kernels. To further decrease the number of candidates, you may try setting breakpoints in or before each such kernel, in order to find which one triggers the exception.
Another method is to use the kernel starting address in register DBG0.1. In the following example, the faulting kernel was launched in function ‘run3’.