Debug a Page Fault on GPU
A page fault occurs when a thread attempts to access a memory location, but the driver fails to map the request to an available page. For example, if a thread reads through a nullptr, this triggers a page fault. On platforms that support page fault detection, the debugger reports a page fault as a segmentation fault.
Environment variables
# Turn sync.bar into a polling loop and insert tile fence at the end
export IGC_VISAOptions="-enableBarrierWA"
# Disable caching to ensure environment variables take effect
export NEO_CACHE_PERSISTENT=0
export SYCL_CACHE_PERSISTENT=0
Examples
Memory access requests of a GPU thread are asynchronous. While a request is processed, the thread that triggered the request may continue execution, up to the point where the requested memory would be used or the thread would exit, which ever comes first. That means, when the exception is triggered and the thread is stopped, thread IP may have proceeded further from the instruction that triggered the faulting request.
In the following example, the kernel attempts to read through a nullptr:
(gdb) list 37
32
33 cgh.single_task<> ([=] ()
34 {
35 int *src = nullptr;
36 int num = *src;
37 numbers[0] = num;
38 });
39 });
40 deviceQueue.wait ();
41
(gdb) run
Thread 2.97 received signal SIGSEGV, Segmentation fault
Warning: The location reported for the signal may be inaccurate.
[Switching to thread 2.97:0 (ZE 0.1.4.0 lane 0)]
main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda()#1}:: ...
at kernel-pagefault-read.cpp:37
37 numbers[0] = num;
(gdb)
Note the warning “The location reported for the signal may be inaccurate”. The faulting read is requested at sourceline 36, but the value is used only at sourceline 37. Thus, it depends on timing whether the thread reaches the line 37 before the thread is stopped and the SIGSEGV signal is triggered.
In the next example, kernel attempts to write through a nullptr, and then exits without further accessing that variable:
(gdb) list 37
32
33 cgh.single_task<> ([=] ()
34 {
35 int *p = nullptr;
36 int num = numbers[0];
37 *p = num;
38 });
39 });
40 deviceQueue.wait ();
41
(gdb) run
Thread 2.97 received signal SIGSEGV, Segmentation fault
Warning: The location reported for the signal may be inaccurate.
[Switching to thread 2.97:0 (ZE 0.1.4.0 lane 0)]
0x00008000ffe80980 in _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlvE_ (_arg_numbers=sycl: ...
at /opt/intel/oneapi/compiler/2024.2/bin/compiler/../../include/sycl/handler.hpp:1535
1535 }
(gdb)
This time the thread has already returned from its main function and is preparing to exit, before the thread is stopped.
Identify the instruction triggering the faulting request
In case of a failing read request, the instruction triggering the faulting request can be identifyed by setting the following environment variable:
# Add a sync after every read request
export IGC_EnableForceDebugSWSB=1
With this setting, the thread stops immediatelly after executing the failing request.
(gdb) run
Thread 2.97 received signal SIGSEGV, Segmentation fault
Warning: The location reported for the signal may be inaccurate.
[Switching to thread 2.97:0 (ZE 0.1.4.0 lane 0)]
0x00008000ffe58c00 in main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) ...
at kernel-pagefault-read.cpp:36
36 int num = *src;
(gdb)
In this example, thread stops at 0x8000ffe58c00. This is the address of the instruction following immediately the one that triggers the exception.
However, write faults may be detected much later. In some cases, the fault is detected after the thread returned from the kernel function and is about to exit, and finding the instruction that caused the page fault may require stepping from an earlier breakpoint.
All pending read and write requested of a thread are completed before the thread is stopped. Thus, we can use stepping to find the sourceline and the exact instruction that triggered the faulting request. If we have multiple threads, we should also set ‘scheduler-locking’ to avoid switching threads.
(gdb) break 37
Breakpoint 1 at 0x40625b: file kernel-pagefault-write.cpp, line 37.
(gdb) run
Thread 2.97 hit Breakpoint 1.2, with SIMD lane 0, main::{lambda(sycl::_V1::handler&)#1}::ope ...
at kernel-pagefault-write.cpp:37
37 *p = num; /* pagefault-line */
(gdb) set scheduler-locking step
(gdb) stepi
0x00008000ffe18e90 37 *p = num;
(gdb) stepi
0x00008000ffe18ea0 37 *p = num;
(gdb) stepi
0x00008000ffe18eb0 37 *p = num;
(gdb) stepi
Thread 2.97 received signal SIGSEGV, Segmentation fault
Warning: The location reported for the signal may be inaccurate.
main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda()#1}:: ...
at kernel-pagefault-write.cpp:38
38 });
(gdb)
In the above example, we hit a breakpoint at sourceline 37, continue stepping by instruction until the exception occurs, and find that it was triggered by the instruction at 0x8000ffe18eb0.
Identify the faulting kernel
If the program has multiple kernels, we may need further steps to identify the faulting one. Above examples only had a single kernel, but they can be used to show the required steps.
First we use the command ‘info shared’ to get the addresses of loaded modules. The thread IP tells the address where the faulting thread was stopped, so we use that to identify the right module. Then we use ‘info line’ to map sources to modules.
Thread 2.97 received signal SIGSEGV, Segmentation fault
Warning: The location reported for the signal may be inaccurate.
[Switching to thread 2.97:0 (ZE 0.1.4.0 lane 0)]
0x00008000ffe80980 in _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlvE_ (_arg_numbers=sycl: ...
at /opt/intel/oneapi/compiler/2024.2/bin/compiler/../../include/sycl/handler.hpp:1535
1535 }
(gdb) info shared
From To Syms Read Shared Object Library
0x00008000ffe70000 0x00008000ffe90000 Yes <in-memory@0x26d3710-0x272e158>
(gdb) info line kernel-pagefault-write.cpp:35
Line 35 of "kernel-pagefault-write.cpp" starts at address 0x8000ffe88b90 <_ZZZ4mainENKUlRN4s ...
and ends at 0x8000ffe88c40 <_ZZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_ENKUlvE_clEv+784>.
(gdb)
In the above example, the thread stops at 0x8000ffe80980. This address matches with the module loaded in memory from 0x8000ffe70000 to 0x8000ffe90000, so we know the kernel was loaded there. Then we find that sourceline 35 is in memory from 0x8000ffe88b90 to 0x8000ffe88c40. This matches with the memory range of the module, so we know the sourceline 35 was linked to a kernel in that module.
Note that a module may contain multiple kernels, in which case the above method can only give a set of candidate kernels. To further decrease the number of candidates, one may try setting breakpoints in or before each such kernel, in order to find which one triggers the exception.