Utilizing Hardware Kernel Invocation Queue
The hardware kernel invocation queue is a first-in, first-out (FIFO) buffer that the SYCL* runtime uses to store the arguments of multiple kernel invocations on the device. When a kernel finishes executing, the queue lets the next invocation of that kernel start immediately. SYCL kernels are built with the invocation queue for this reason.
As illustrated in the following figure, when the invocation queue is used, the system and SYCL runtime overheads (responding to the kernel finish and sending the next set of invocation arguments) overlap with kernel execution. Kernels can therefore execute back to back, which maximizes system-level throughput.
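The benefit of this overlap can be estimated with a simple timing model. The sketch below is plain C++ and is not part of the SYCL runtime. The names InvocationCost, total_time_serialized, and total_time_overlapped are hypothetical, and the model assumes a fixed execution time and a fixed host overhead per invocation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-invocation costs, in microseconds.
struct InvocationCost {
    std::int64_t exec_us;      // time the kernel spends executing on the device
    std::int64_t overhead_us;  // host/runtime time to handle a finish and send the next arguments
};

// Without the invocation queue: every invocation pays overhead, then executes.
std::int64_t total_time_serialized(InvocationCost c, std::int64_t n) {
    if (n <= 0) return 0;
    return n * (c.overhead_us + c.exec_us);
}

// With the invocation queue: the overhead of invocation k+1 is hidden behind
// the execution of invocation k, so only the first overhead is fully exposed.
// In steady state, each further invocation costs the larger of the two times.
std::int64_t total_time_overlapped(InvocationCost c, std::int64_t n) {
    if (n <= 0) return 0;
    return c.overhead_us + c.exec_us +
           (n - 1) * std::max(c.exec_us, c.overhead_us);
}
```

With an assumed execution time of 100 µs and an assumed overhead of 20 µs, ten invocations take 1200 µs when serialized and 1020 µs when overlapped under this model. Actual gains depend on the real kernel and platform.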
A SYCL kernel invocation is queued in hardware when the same SYCL kernel function is already running on the device and all of the following are true:
- The SYCL kernel was not compiled with the hardware kernel invocation queue disabled (-Xsno-hardware-kernel-invocation-queue). See Disable Hardware Kernel Invocation Queue (-Xsno-hardware-kernel-invocation-queue).
- The SYCL kernel was not compiled with performance counters (-Xsprofile). See Instrument the Kernel Pipeline with Performance Counters (-Xsprofile).
- No host-to-device synchronization operation that requires the first enqueue to finish (such as creating a host accessor or destroying a buffer) is performed between the sequential kernel enqueues.
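These conditions can be summarized as a single predicate. The sketch below is plain C++ for illustration only. EnqueueContext and can_queue_in_hardware are hypothetical names and do not correspond to any SYCL runtime API. It treats a host synchronization between enqueues as blocking hardware queueing, which matches the behavior shown in the examples that follow.

```cpp
// Hypothetical description of the situation at the time of a kernel enqueue.
struct EnqueueContext {
    bool same_kernel_running;         // the same kernel function is already running on the device
    bool queue_disabled_at_compile;   // built with -Xsno-hardware-kernel-invocation-queue
    bool profiling_enabled;           // built with -Xsprofile
    bool host_sync_between_enqueues;  // e.g. host accessor created or buffer destroyed in between
};

// Returns true when the new invocation can be placed on the hardware queue.
bool can_queue_in_hardware(const EnqueueContext& ctx) {
    return ctx.same_kernel_running &&
           !ctx.queue_disabled_at_compile &&
           !ctx.profiling_enabled &&
           !ctx.host_sync_between_enqueues;
}
```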
Consider the following definitions of simple_kernel() and check_output() functions:
simple_kernel()
// N is assumed to be a constant defined elsewhere (the number of elements to process).
void simple_kernel(queue &deviceQueue,
                   buffer<cl_float, 1> &bufferA,
                   buffer<cl_float, 1> &bufferC)
{
  deviceQueue.submit([&](handler& cgh) {
    // bufferA is only read; bufferC is overwritten, so its old contents are not needed.
    accessor accessorA(bufferA, cgh, read_only);
    accessor accessorC(bufferC, cgh, read_write, no_init);
    cgh.single_task<class SimpleAdd>([=]() {
      for (int i = 0; i < N; i++) {
        accessorC[i] = accessorA[i] + accessorA[i];
      }
    });
  });
}
check_output()
void check_output(buffer<cl_float, 1> &outBuffer) {
  // Creating a host accessor blocks until pending kernels that write outBuffer finish.
  host_accessor output_buf_acc(outBuffer, read_only);
  ...
  // Check output
  ...
}
Given the definitions of simple_kernel() and check_output(), consider the following example, where the second kernel enqueue can be placed on the hardware kernel invocation queue:
// Example 1
int main()
{
  ...
  simple_kernel(device_queue, bufferA, bufferC);
  simple_kernel(device_queue, bufferX, bufferZ);
  check_output(bufferC);
  check_output(bufferZ);
  ...
}
Once the first enqueue of the SimpleAdd kernel is running, the second enqueue can be queued in hardware because the two invocations have no dependency on each other and no host synchronization occurs between them.
Now consider the following example, where the second kernel invocation cannot be queued in hardware:
// Example 2
int main()
{
  ...
  simple_kernel(device_queue, bufferA, bufferC);
  check_output(bufferC);
  simple_kernel(device_queue, bufferX, bufferZ);
  check_output(bufferZ);
  ...
}
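Creating the output_buf_acc host accessor for the output buffer in the check_output() function is a synchronization point: it blocks the SYCL runtime until the first enqueue of the SimpleAdd kernel is complete. By the time the second enqueue is submitted, the kernel is no longer running, so the invocation cannot be queued in hardware.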
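The difference between the two examples can be checked with a small host-side model. The sketch below is plain C++, not SYCL, and HostOp and hardware_queued are hypothetical names. It assumes that every enqueue launches the same kernel, that invocations complete in launch order, and that a kernel is still running when the next host operation is reached unless a host read has already forced it to finish.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side operation, listed in program order.
struct HostOp {
    enum Kind { Enqueue, HostRead } kind;
    std::string buffer;  // output buffer written (Enqueue) or read on the host (HostRead)
};

// For each Enqueue, reports whether an earlier invocation is still in flight,
// which is the model's stand-in for "can be queued in hardware".
std::vector<bool> hardware_queued(const std::vector<HostOp>& ops) {
    std::vector<std::string> in_flight;  // output buffers of unfinished invocations, oldest first
    std::vector<bool> result;
    for (const HostOp& op : ops) {
        if (op.kind == HostOp::Enqueue) {
            result.push_back(!in_flight.empty());
            in_flight.push_back(op.buffer);
        } else {
            // A host read blocks until the invocation writing this buffer is done.
            // With in-order completion, all earlier invocations are done as well.
            for (std::size_t i = in_flight.size(); i-- > 0;) {
                if (in_flight[i] == op.buffer) {
                    in_flight.erase(in_flight.begin(),
                                    in_flight.begin() + static_cast<std::ptrdiff_t>(i) + 1);
                    break;
                }
            }
        }
    }
    return result;
}
```

Under these assumptions, the operation order of Example 1 (enqueue, enqueue, read, read) reports the second enqueue as queued, while the order of Example 2 (enqueue, read, enqueue, read) reports it as not queued.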