Implicit Scaling
A root-device is built from multiple sub-devices, also known as stacks. These stacks form a shared memory space, which allows a root-device to be treated as a monolithic device without explicit communication between stacks. This section covers multi-stack programming principles using implicit scaling. With implicit scaling, the root-device driver is responsible for distributing work across all stacks when application code launches a kernel.
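To make the idea of driver-side work distribution more concrete, the following plain C++ sketch models one way a kernel's work-groups could be divided among stacks. It is a toy model only: the names `StackRange` and `split_across_stacks` are hypothetical, and the even contiguous split is an assumption made for illustration, not a description of what the driver actually does.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type: half-open range [begin, end) of work-group indices
// assigned to a single stack.
struct StackRange {
    std::size_t begin;
    std::size_t end;
};

// Toy model (assumption, not the driver's actual algorithm): divide
// num_groups work-groups into contiguous chunks, one per stack, as evenly
// as possible. Any remainder is spread over the first stacks.
std::vector<StackRange> split_across_stacks(std::size_t num_groups,
                                            std::size_t num_stacks) {
    std::vector<StackRange> ranges;
    if (num_stacks == 0) {
        return ranges;  // no stacks, nothing to assign
    }
    const std::size_t base = num_groups / num_stacks;
    const std::size_t extra = num_groups % num_stacks;
    std::size_t start = 0;
    for (std::size_t s = 0; s < num_stacks; ++s) {
        const std::size_t count = base + (s < extra ? 1 : 0);
        ranges.push_back({start, start + count});
        start += count;
    }
    return ranges;
}
```

Under this model, calling `split_across_stacks(10, 2)` would assign work-groups 0 to 4 to the first stack and 5 to 9 to the second. The application itself never performs such a split when implicit scaling is in use; it launches the kernel on the root-device as if it were a single device.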