Introduction
Implicit scaling applies only to the multiple stacks inside a single GPU card; it does not span multiple cards.
In COMPOSITE mode, if the program offloads to a device that is the entire card, then the driver and language runtime are, by default, responsible for work distribution and multi-stack memory placement.
For implicit scaling, no change in application code is required. An OpenMP/SYCL kernel submitted to a device will utilize all the stacks on that device. Similarly, memory allocated on the device will be accessible across all the stacks. The driver behavior is described in Work Scheduling and Memory Distribution.
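The following SYCL sketch (illustrative, not taken from this guide) shows the point: the allocation and kernel are written exactly as for a single-stack device, and under implicit scaling the driver distributes both across the stacks of the root device.

#include <sycl/sycl.hpp>

int main() {
  // In COMPOSITE mode the selected GPU device is an entire card (root device).
  sycl::queue q{sycl::gpu_selector_v};

  constexpr size_t N = 1 << 26;
  // Device allocations are accessible from all stacks of the root device.
  double *a = sycl::malloc_device<double>(N, q);
  double *b = sycl::malloc_device<double>(N, q);

  q.fill(a, 1.0, N).wait();

  // A single kernel submission; the driver spreads the work-groups and the
  // memory placement across the stacks with no stack-specific code.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    b[i] = 2.0 * a[i];
  }).wait();

  sycl::free(a, q);
  sycl::free(b, q);
  return 0;
}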
Notes on implicit scaling:
Set ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE to allow implicit scaling; a verification sketch follows these notes.
Implicit scaling should not be combined with SYCL/OpenMP sub-device semantics.
Do not use sub-device syntax in ZE_AFFINITY_MASK. That is, instead of exposing stack 0 in root device 0 (ZE_AFFINITY_MASK=0.0), you must expose the entire root device to the driver via ZE_AFFINITY_MASK=0 or by unsetting ZE_AFFINITY_MASK.
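As a quick check, a sketch like the one below (device names and reported values depend on the system) can confirm what the runtime exposes when ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE is set and ZE_AFFINITY_MASK is unset or lists whole root devices: each reported GPU device is then an entire card, and a multi-stack card typically reports a nonzero maximum sub-device count.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Run with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE and ZE_AFFINITY_MASK unset
  // (or set to whole root devices such as "0").
  for (const auto &dev : sycl::device::get_devices(sycl::info::device_type::gpu)) {
    std::cout << dev.get_info<sycl::info::device::name>()
              << ", max sub-devices: "
              << dev.get_info<sycl::info::device::partition_max_sub_devices>()
              << "\n";  // a multi-stack root device typically reports its stack count here
  }
  return 0;
}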
Performance Expectations
In implicit scaling, the resources of all the stacks are exposed to a kernel. When using a root device with 2 stacks, a kernel can achieve 2x peak compute, 2x memory bandwidth, and 2x memory capacity, so in the ideal case workload performance increases by 2x. In addition, cache capacity and cache bandwidth are doubled, which can lead to better-than-linear scaling if the workload fits in the increased cache capacity.
Each stack is equivalent to a NUMA domain, so memory access pattern and memory allocation are crucial to achieving optimal implicit scaling performance. Workloads with a concept of locality are expected to work best with this programming model because cross-stack memory accesses are naturally minimized. Compute-bound kernels are not affected by NUMA domains and are therefore expected to scale easily to multiple stacks with implicit scaling. If the algorithm performs many cross-stack memory accesses, performance suffers; minimize them by exploiting locality in the algorithm.
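As an illustration (not from this guide), compare a streaming kernel, in which each work-item touches elements near its own global id and most accesses stay within one stack's NUMA domain, with a gather kernel whose data-dependent indices can land on either stack:

#include <sycl/sycl.hpp>

// Locality-friendly: contiguous, index-aligned accesses; with implicit scaling
// most reads and writes stay on the stack that owns that part of the allocation.
void stream_scale(sycl::queue &q, const double *in, double *out, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    out[i] = 2.0 * in[i];
  });
}

// Locality-hostile: indirect, data-dependent accesses; idx[i] may point anywhere
// in the allocation, so a large fraction of reads can cross stacks.
void gather(sycl::queue &q, const double *in, const size_t *idx, double *out, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    out[i] = in[idx[i]];
  });
}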
MPI applications are more efficient with implicit scaling compared to an explicit scaling approach. A single MPI rank can utilize the entire root device which eliminates explicit synchronization and communication between stacks. Implicit scaling automatically overlaps local memory accesses and cross-stack memory accesses in a single kernel launch.
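A minimal sketch of that pattern, assuming one MPI rank per card and at least one GPU root device visible to each rank, might look as follows; the rank-to-device mapping is illustrative, not prescriptive.

#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // In COMPOSITE mode each GPU device is a whole card; pick one per rank.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::queue q{gpus[rank % gpus.size()]};

  // One kernel launch per rank; the driver overlaps local and cross-stack
  // accesses inside the card, with no per-stack queues or synchronization.
  constexpr size_t N = 1 << 26;
  double *x = sycl::malloc_device<double>(N, q);
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) { x[i] = 1.0; }).wait();

  sycl::free(x, q);
  MPI_Finalize();
  return 0;
}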
Implicit scaling improves kernel execution time only. Serial bottlenecks will not speed up. Applications will observe no speedup with implicit scaling if a large serial bottleneck is present. Common serial bottlenecks are:
high CPU usage
kernel launch latency
PCIe transfers
These bottlenecks become more pronounced as implicit scaling reduces kernel execution time. Note that only stack 0 has a PCIe connection to the host. On Intel® Data Center GPU Max with implicit scaling enabled, kernel launch latency increases by about 3 microseconds.