Introduction
Usage
Implicit scaling is enabled by exporting the following environment variable:
export EnableImplicitScaling=1
This environment variable changes the meaning of a SYCL/OpenMP device to the root-device. No change in application code is required. A kernel submitted to a SYCL/OpenMP device utilizes all stacks, and a memory allocation on a SYCL/OpenMP device spans all stacks. The driver behavior is described in Work Scheduling and Memory Distribution.
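Below is a minimal sketch, assuming a SYCL 2020 compiler (for example, icpx) and a single Intel® Data Center GPU Max root-device; the array size and kernel body are illustrative only. With EnableImplicitScaling=1 set, this unchanged code transparently uses the compute and memory of all stacks.

// Minimal sketch: standard SYCL code, no implicit-scaling-specific API calls.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q{sycl::gpu_selector_v};            // SYCL device == root-device
  constexpr size_t n = 1 << 26;
  float *data = sycl::malloc_device<float>(n, q); // allocation spans all stacks

  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
     data[i[0]] = 2.0f * i[0];                    // work is distributed across stacks
   }).wait();

  std::printf("Ran on: %s\n",
              q.get_device().get_info<sycl::info::device::name>().c_str());
  sycl::free(data, q);
  return 0;
}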
Note:
EnableImplicitScaling=1 is set by default.
Implicit scaling should not be combined with SYCL/OpenMP sub-device semantics.
Do not use sub-device syntax in ZE_AFFINITY_MASK. That is, instead of exposing stack-0 from root-device-0 (ZE_AFFINITY_MASK=0.0), expose the entire root-device to the driver via ZE_AFFINITY_MASK=0 or by unsetting ZE_AFFINITY_MASK; see the sketch after this note.
Only one Compute Command Streamer (CCS) is available with implicit scaling, as it uses all VEs.
Only copy engines from stack-0 are used with implicit scaling. This may change in future driver versions.
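As a rough illustration of the affinity-mask requirement above, the following sketch (assuming SYCL 2020 on the Level Zero backend; the printed properties are illustrative only) enumerates the GPU devices visible to the runtime. With ZE_AFFINITY_MASK=0 or ZE_AFFINITY_MASK unset, each enumerated device is a full root-device; with sub-device syntax such as ZE_AFFINITY_MASK=0.0, only stack-0 would be exposed, which defeats implicit scaling.

// Sketch: list the devices the runtime sees; each should be a full root-device.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  for (const auto &dev :
       sycl::device::get_devices(sycl::info::device_type::gpu)) {
    std::cout << dev.get_info<sycl::info::device::name>() << " : "
              << dev.get_info<sycl::info::device::max_compute_units>()
              << " compute units\n";
  }
  return 0;
}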
Performance Expectations
Implicit scaling exposes the resources of all stacks to a single kernel launch. For a root-device with two stacks, a kernel has access to 2x peak compute, 2x memory bandwidth, and 2x memory capacity. In the ideal case, workload performance increases by 2x. Cache size and cache bandwidth are also doubled, which can lead to better-than-linear scaling if the workload fits in the increased cache capacity.
Each stack is equivalent to a NUMA domain, so memory access patterns and memory allocation are crucial to achieving optimal implicit scaling performance. Workloads with a concept of locality are expected to work best with this programming model, as cross-stack memory accesses are naturally minimized. Compute-bound kernels are not impacted by NUMA domains and are therefore expected to scale easily to multiple stacks with implicit scaling.
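As an illustration of a locality-friendly access pattern, the following sketch (assuming SYCL 2020 and device USM; the array size and kernel are illustrative only) performs a STREAM-like copy in which work-item i touches only element i. When the driver partitions the iteration range and the allocation across stacks, most accesses then stay within the local stack's memory.

// Sketch: contiguous, one-to-one access keeps cross-stack traffic minimal.
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t n = 1 << 28;
  float *a = sycl::malloc_device<float>(n, q);
  float *b = sycl::malloc_device<float>(n, q);

  q.fill(a, 1.0f, n).wait();
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
     b[i[0]] = a[i[0]];   // each work-item reads/writes only its own element
   }).wait();

  sycl::free(a, q);
  sycl::free(b, q);
  return 0;
}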
MPI applications are more efficient with implicit scaling than with an explicit scaling approach. A single rank can utilize the entire root-device, which eliminates explicit synchronization and communication between stacks. Implicit scaling automatically overlaps local and cross-stack memory accesses within a single kernel launch.
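A minimal sketch of the one-rank-per-root-device pattern is shown below, assuming MPI-3 and SYCL 2020; the node-local rank computation and the round-robin device selection are illustrative only. Kernels submitted to the selected queue use all stacks of that root-device without any application-level cross-stack code.

// Sketch: map one MPI rank to one root-device; implicit scaling uses all stacks.
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Node-local rank, used to pick a root-device on this node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank = 0;
  MPI_Comm_rank(node_comm, &local_rank);

  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::queue q{gpus[local_rank % gpus.size()]};  // one root-device per rank

  // ... submit kernels to q; they utilize all stacks of this root-device ...

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}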
Implicit scaling improves kernel execution time only. Serial bottlenecks will not speed up. Applications will observe no speed-up with implicit scaling if a large serial bottleneck is present. Common serial bottlenecks are:
high CPU usage
kernel launch latency
PCIe transfers
These become more pronounced as kernel execution time decreases with implicit scaling. Note that only stack-0 has a PCIe connection to the host. On Intel® Data Center GPU Max with implicit scaling enabled, kernel launch latency increases by about 3 microseconds.
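One way to check whether such bottlenecks dominate is to compare device kernel time against host wall time for a representative kernel. The following is a rough sketch assuming SYCL 2020 with event profiling; the kernel and sizes are illustrative only, and a wall time much larger than the kernel time points to launch latency, PCIe transfers, or CPU work rather than kernel execution.

// Rough sketch: kernel time (device) vs. wall time (host) for one launch.
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>

int main() {
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};
  constexpr size_t n = 1 << 24;
  float *x = sycl::malloc_device<float>(n, q);

  auto t0 = std::chrono::steady_clock::now();
  sycl::event e = q.parallel_for(sycl::range<1>{n},
                                 [=](sycl::id<1> i) { x[i[0]] = 1.0f; });
  e.wait();
  auto t1 = std::chrono::steady_clock::now();

  double kernel_ns =
      e.get_profiling_info<sycl::info::event_profiling::command_end>() -
      e.get_profiling_info<sycl::info::event_profiling::command_start>();
  double wall_ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
  std::printf("kernel: %.3f ms, wall: %.3f ms\n",
              kernel_ns * 1e-6, wall_ns * 1e-6);

  sycl::free(x, q);
  return 0;
}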