Summary of Default Behavior Change
Starting with the 2023.2 compiler release, immediate command lists are the default submission mode on Intel® Data Center GPU Max Series running Linux. On all other platforms, the default continues to be regular command lists. This content applies to SYCL and C/C++/Fortran OpenMP offload programs using the Level Zero plugin.
Level Zero Immediate Command Lists
The Level Zero API provides two modes of submitting work to the GPU:
1. Regular command lists in combination with command queues
2. Immediate command lists where the command queue is implicit
In the first mode, programming (e.g., zeCommandListAppendLaunchKernel) and submission (zeCommandQueueExecuteCommandLists) are decoupled. Upper layers of software, such as the SYCL Level Zero plugin and the OpenMP target runtime, control when the actual submission occurs.
The advantage of this mode for Intel® Data Center GPU Max Series is:
- Submissions can be batched on the host, i.e., many operations may be collected in a command list and then submitted together, thus dividing the submission cost across many operations.
The disadvantages are:
- Multiple command lists cannot run concurrently when only a single hardware queue is used.
- When two SYCL queues are mapped to the same underlying hardware queue, dependencies between operations in one SYCL queue can impede progress of GPU operations in the other SYCL queue, even when there are no dependencies across the queues.
- Cache invalidation occurs at each submission.
- Requires managing command lists (create/reset), which adds host overhead.
In the second mode using immediate command lists, programming and submission occur together. The tradeoffs are different.
The advantages are:
- Multiple command lists can run concurrently on a single hardware queue.
- Allows batching of kernels on the GPU.
The disadvantage is:
- Has more host overhead on appending an operation to the command list because actual submission to the GPU occurs immediately.
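The difference between the two modes can be illustrated with a toy host-side model. This is only a sketch: MockDevice, RegularCommandList and ImmediateCommandList are hypothetical names invented for illustration and are not part of the Level Zero API. The model counts how many submissions reach the device, and says nothing about actual latency or GPU-side concurrency.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a hardware queue. It only records how many
// operations arrived in each submission; it is not part of Level Zero.
struct MockDevice {
    std::vector<std::size_t> submissions;
};

// Mode 1 (regular): append() only records the operation on the host.
// Nothing reaches the device until execute() is called.
class RegularCommandList {
public:
    void append(const std::string& op) { ops_.push_back(op); }

    void execute(MockDevice& device) {
        assert(!ops_.empty() && "execute() called on an empty command list");
        device.submissions.push_back(ops_.size());  // one submission for the whole batch
        ops_.clear();  // stands in for resetting the list before reuse
    }

private:
    std::vector<std::string> ops_;
};

// Mode 2 (immediate): there is no separate execute step, so every
// append() is also a submission.
class ImmediateCommandList {
public:
    explicit ImmediateCommandList(MockDevice& device) : device_(device) {}

    void append(const std::string& op) {
        (void)op;  // the toy model does not inspect the operation itself
        device_.submissions.push_back(1);  // one submission per operation
    }

private:
    MockDevice& device_;
};
```

In this model, appending three operations to a RegularCommandList and calling execute() once produces a single submission carrying three operations, while appending the same three operations to an ImmediateCommandList produces three submissions of one operation each. That is the host-side cost the regular mode spreads across a batch and the immediate mode pays on every append.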
Why Immediate Command Lists are Default on Intel® Data Center GPU Max Series/Linux Only
Immediate command lists are supported on Intel GPUs using a GPU driver feature known as Ultra Low Latency Submission (ULLS). This feature is currently enabled only on Linux platforms. Without ULLS support, submissions would take longer.
On Intel® Arc™ Graphics, Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series, a custom Linux kernel with VM_BIND support is used to support ULLS. Performance tuning of this feature has so far focused on Intel® Data Center GPU Max Series, with little tuning done on Intel® Arc™ Graphics and Intel® Data Center GPU Flex Series.
On Intel® Data Center GPU Max Series, using immediate command lists improves performance in most cases. However, if an application uses only one SYCL queue (in-order or out-of-order) and has very short-running kernels (on the order of less than 10 microseconds), as is typical of some AI workloads, then host submission time becomes very important and immediate command lists may cause performance regressions. In 2023.2, we recommend using the environment variables to go back to regular command lists if you encounter this problem (see below).
Forcing Use of Immediate Command Lists
The platform defaults can be overridden by setting environment variables to enable immediate command lists for SYCL and OpenMP offload programs (including OpenMP applications that use “omp dispatch” to call MKL).
SYCL control:
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
OpenMP control:
LIBOMPTARGET_LEVEL_ZERO_USE_IMMEDIATE_COMMAND_LIST=all
Forcing Use of Regular Command Lists
The platform defaults can be overridden by setting environment variables to use regular command lists for SYCL and OpenMP offload programs (including OpenMP applications that use “omp dispatch” to call MKL).
SYCL control:
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=0
OpenMP control:
LIBOMPTARGET_LEVEL_ZERO_USE_IMMEDIATE_COMMAND_LIST=0
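When comparing the two modes, it can help for an application to log which mode the environment requests. The following is a minimal sketch using only the C++ standard library. The function names are hypothetical, and the helper only interprets the values shown in this section (0 and 1 for SYCL, 0 and all for OpenMP). It reports what the environment asks for and does not query the SYCL or OpenMP runtime, so it cannot confirm which mode is actually in use.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Maps a raw environment value to the submission mode it requests.
// `immediate_value` is the value this section uses to force immediate
// command lists ("1" for SYCL, "all" for OpenMP); "0" forces regular ones.
std::string describe_request(const char* value, const char* immediate_value) {
    assert(immediate_value != nullptr);
    if (value == nullptr) return "platform default";  // variable is unset
    const std::string v(value);
    if (v == "0") return "regular command lists";
    if (v == immediate_value) return "immediate command lists";
    return "other value: " + v;  // values not covered in this section
}

// What the SYCL control variable requests.
std::string requested_sycl_mode() {
    return describe_request(
        std::getenv("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"), "1");
}

// What the OpenMP control variable requests.
std::string requested_openmp_mode() {
    return describe_request(
        std::getenv("LIBOMPTARGET_LEVEL_ZERO_USE_IMMEDIATE_COMMAND_LIST"), "all");
}
```

Printing the results of requested_sycl_mode() and requested_openmp_mode() at startup makes it easier to match a timing measurement to the setting under which it was taken.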
Future Plans for Immediate Command Lists
Performance analysis and tuning of immediate command lists are ongoing. Known optimization opportunities throughout the software stack are being addressed. Work is also in progress to enable immediate command lists on other Intel® GPUs.
Future releases will also support a SYCL language-level queue property applicable to an individual SYCL queue to choose between regular and immediate command lists. Applications may use the property instead of relying on environment variables to select the submission model.