Introduction
Usage
Implicit scaling can be enabled by exporting the following environment variable:
export EnableImplicitScaling=1
This environment variable changes the meaning of a SYCL/OpenMP device to the root-device. No change in application code is required. A kernel submitted to a SYCL/OpenMP device will utilize all stacks. Similarly, a memory allocation on a SYCL/OpenMP device will span all stacks. The driver behavior is described in Work Scheduling and Memory Distribution.
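For illustration, a minimal SYCL sketch (the array size n and the trivial kernel body are placeholders, not from the original text): the code is written exactly as for a single-stack device, and with EnableImplicitScaling=1 the same submission and allocation use all stacks of the root-device.
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};               // root-device when implicit scaling is enabled
    constexpr size_t n = 1 << 26;
    double *data = sycl::malloc_device<double>(n, q);  // allocation spans all stacks

    // Ordinary kernel submission; the driver distributes the range across stacks.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        data[i] = 1.0;
    }).wait();

    sycl::free(data, q);
    return 0;
}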
Note:
EnableImplicitScaling=1 is set by default.
Implicit scaling should not be combined with SYCL/OpenMP sub-device semantics (see the sketch after these notes).
Do not use sub-device syntax in ZE_AFFINITY_MASK. For example, instead of exposing stack-0 of root-device-0 (ZE_AFFINITY_MASK=0.0), expose the entire root-device to the driver via ZE_AFFINITY_MASK=0 or by unsetting ZE_AFFINITY_MASK.
Only one Compute Command Streamer (CCS) is available with implicit scaling, as it uses all VEs.
Only copy engines from stack-0 are used with implicit scaling. This may change in future driver versions.
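To illustrate the sub-device note above, a minimal SYCL sketch (the trivial kernel is only a placeholder): with implicit scaling enabled, create queues on the root-device and do not partition it into per-stack sub-devices.
#include <sycl/sycl.hpp>

int main() {
    sycl::device root{sycl::gpu_selector_v};   // root-device (all stacks)

    // Recommended with implicit scaling: create queues on the root-device only.
    sycl::queue q{root};
    q.single_task([] {}).wait();

    // Avoid with implicit scaling: partitioning into per-stack sub-devices
    // and submitting work to them explicitly, e.g.:
    //   auto stacks = root.create_sub_devices<
    //       sycl::info::partition_property::partition_by_affinity_domain>(
    //       sycl::info::partition_affinity_domain::numa);
    return 0;
}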
Performance Expectations
Implicit scaling exposes the resources of all stacks to a single kernel launch. For a root-device with two stacks, a kernel has access to 2x peak compute, 2x memory bandwidth, and 2x memory capacity. In the ideal case, workload performance increases by 2x. Moreover, cache size and cache bandwidth are doubled as well, which can lead to better-than-linear scaling if the workload fits in the increased cache capacity.
Each stack is equivalent to a NUMA domain, so memory access patterns and memory placement are crucial to achieving optimal implicit scaling performance. Workloads with a notion of locality are expected to work best with this programming model, since cross-stack memory accesses are naturally minimized. Note that compute-bound kernels are not impacted by NUMA domains and are therefore expected to scale easily to multiple stacks with implicit scaling.
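As a sketch of a locality-friendly access pattern (the triad kernel and its arguments are illustrative, not from the original text): unit-stride, contiguous accesses keep each work-item on data owned by the stack executing its part of the range, so cross-stack traffic stays low.
#include <sycl/sycl.hpp>

// Triad-style kernel with unit-stride accesses: each work-item touches only
// its own contiguous elements, so most memory traffic stays on the stack
// that owns that portion of the allocation.
void triad(sycl::queue &q, const double *a, const double *b, double *c,
           double scalar, size_t n) {
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        c[i] = a[i] + scalar * b[i];
    }).wait();
}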
MPI applications are more efficient with implicit scaling than with an explicit scaling approach. A single rank can utilize the entire root-device, which eliminates explicit synchronization and communication between stacks. Implicit scaling automatically overlaps local and cross-stack memory accesses within a single kernel launch.
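A hedged sketch of this setup, assuming one MPI rank per root-device and a simple round-robin rank-to-GPU mapping (both are assumptions, not from the original text):
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One rank per root-device; with implicit scaling each rank's kernels
    // and allocations automatically span all stacks of its GPU.
    auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
    sycl::queue q{gpus[rank % gpus.size()]};

    // ... rank-local kernels here; no per-stack synchronization is needed ...

    MPI_Finalize();
    return 0;
}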
Implicit scaling improves kernel execution time only; serial bottlenecks will not speed up. Applications with a large serial bottleneck will therefore observe no speed-up from implicit scaling. Common serial bottlenecks are:
high CPU usage
kernel launch latency
PCIe transfers
These become more pronounced as kernel execution time decreases with implicit scaling. Note that only stack-0 has a PCIe connection to the host. On Intel® Data Center GPU Max with implicit scaling enabled, kernel launch latency increases by about 3 microseconds.
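One way to check whether such bottlenecks dominate is to compare the device kernel time reported by SYCL event profiling against the host wall-clock time around the submission. The sketch below is illustrative only; the kernel and array size are arbitrary placeholders.
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v,
                  sycl::property::queue::enable_profiling{}};
    constexpr size_t n = 1 << 26;
    double *data = sycl::malloc_device<double>(n, q);

    auto t0 = std::chrono::steady_clock::now();
    sycl::event e = q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        data[i] = 1.0;
    });
    e.wait();
    auto t1 = std::chrono::steady_clock::now();

    // Device execution time from event profiling (reported in nanoseconds).
    auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end   = e.get_profiling_info<sycl::info::event_profiling::command_end>();
    double kernel_ms = (end - start) * 1e-6;
    double wall_ms   = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // A large gap between wall time and kernel time points to host-side
    // overheads (launch latency, PCIe transfers) rather than kernel execution.
    std::cout << "kernel: " << kernel_ms << " ms, wall: " << wall_ms << " ms\n";

    sycl::free(data, q);
    return 0;
}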