Exposing the Device Hierarchy
A multi-stack GPU card can be exposed as a single root device, or each stack can be exposed as a root device. This can be controlled via the environment variable ZE_FLAT_DEVICE_HIERARCHY. The allowed values for ZE_FLAT_DEVICE_HIERARCHY are FLAT, COMPOSITE, or COMBINED.
Our focus in this Guide is on FLAT and COMPOSITE modes.
Note that, in a system with one stack per GPU card, FLAT and COMPOSITE are the same.
ZE_FLAT_DEVICE_HIERARCHY=FLAT (Default)
FLAT mode is the default when ZE_FLAT_DEVICE_HIERARCHY is not set. In FLAT mode, each stack is exposed as a root device. FLAT mode is the recommended mode; it performs well for most applications.
In FLAT mode, the driver and language runtime provide tools that expose each stack as a root device that can be programmed independently of all the other stacks.
In FLAT mode, offloading is done using explicit scaling.
Whether the system has one GPU card or several, in FLAT mode the user can use all the stacks on all the GPU cards and offload to all the stacks (devices) simultaneously.
In OpenMP, the device clause on the target construct can be used to specify to which stack (device) the kernel should be offloaded.
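For illustration, the following minimal OpenMP sketch (not taken from this Guide) offloads half of the work to each of two stacks in FLAT mode; it assumes at least two stacks are exposed as devices 0 and 1, and the array size and the half-and-half split are arbitrary.

#include <omp.h>
#include <cstdio>

int main() {
    const int N = 1024;
    float a[1024], b[1024];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // In FLAT mode, each stack is a root device.
    std::printf("Number of devices (stacks): %d\n", omp_get_num_devices());

    // Offload the first half of the iterations to stack 0 ...
    #pragma omp target teams distribute parallel for device(0) nowait \
        map(tofrom: a[0:N/2]) map(to: b[0:N/2])
    for (int i = 0; i < N / 2; ++i)
        a[i] += b[i];

    // ... and the second half to stack 1, concurrently.
    #pragma omp target teams distribute parallel for device(1) nowait \
        map(tofrom: a[N/2:N/2]) map(to: b[N/2:N/2])
    for (int i = N / 2; i < N; ++i)
        a[i] += b[i];

    #pragma omp taskwait   // wait for both deferred target regions

    std::printf("a[0] = %f, a[N-1] = %f\n", a[0], a[N - 1]);
    return 0;
}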
In SYCL, platform::get_devices() can be called to get the stacks (devices) exposed.
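A corresponding SYCL sketch (illustrative only) enumerates the exposed stacks and creates one queue per stack:

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    // In FLAT mode, each stack appears as a separate root GPU device.
    std::vector<sycl::queue> queues;
    for (auto &platform : sycl::platform::get_platforms()) {
        for (auto &device : platform.get_devices(sycl::info::device_type::gpu)) {
            std::cout << "Found GPU device (stack): "
                      << device.get_info<sycl::info::device::name>() << "\n";
            queues.emplace_back(device);   // one queue per stack
        }
    }
    // Work can now be submitted to each queue (stack) independently.
    return 0;
}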
For more information about the FLAT mode, refer to the FLAT Mode Programming section.
ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
In COMPOSITE mode, each GPU card is exposed as a root device. If the card contains more than one stack, then the stacks on the GPU card are exposed as subdevices.
In COMPOSITE mode, offloading can be done using either explicit or implicit scaling.
Note that in earlier GPU drivers, the default was COMPOSITE mode and implicit scaling. Now the default is FLAT mode.
Explicit Scaling in COMPOSITE Mode:
In COMPOSITE mode, the driver and language runtime provide tools that expose each GPU card as a root device and the stacks as subdevices that can be programmed independently.
In OpenMP, the device and subdevice clauses on the target construct can be used to specify to which stack (subdevice) the kernel should be offloaded. (Note that the subdevice clause is an Intel extension to OpenMP.)
In SYCL, device::create_sub_devices() can be called to get the subdevices (stacks) of each card device.
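The following minimal SYCL sketch (illustrative only) partitions a card into its stacks; it assumes COMPOSITE mode and a multi-stack root device, otherwise create_sub_devices may throw.

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    // In COMPOSITE mode, a root device is a whole GPU card; its stacks are subdevices.
    sycl::device root(sycl::gpu_selector_v);
    std::cout << "Card: " << root.get_info<sycl::info::device::name>() << "\n";

    // Partition the card into its stacks (subdevices) by affinity domain.
    std::vector<sycl::device> stacks = root.create_sub_devices<
        sycl::info::partition_property::partition_by_affinity_domain>(
        sycl::info::partition_affinity_domain::next_partitionable);

    std::cout << "Number of stacks (subdevices): " << stacks.size() << "\n";

    // Create one queue per stack for explicit scaling.
    std::vector<sycl::queue> queues;
    for (auto &stack : stacks)
        queues.emplace_back(stack);
    return 0;
}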
For more information about explicit scaling in COMPOSITE mode, refer to the Explicit Scaling section.
Implicit Scaling in COMPOSITE Mode:
In COMPOSITE mode, if the program offloads to a device that is the entire card, then the driver and language runtime are, by default, responsible for work distribution and multi-stack memory placement.
The recommendation is to use explicit scaling. However, if the memory requirement is more than what is available in a single stack, then implicit scaling may be used.
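The following minimal OpenMP sketch (illustrative only) relies on implicit scaling; it assumes COMPOSITE mode, so that device 0 is the whole card, and the problem size is arbitrary.

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 1 << 28;                       // large arrays that may not fit on one stack
    std::vector<float> a(N, 1.0f), b(N, 2.0f);
    float *pa = a.data(), *pb = b.data();

    // Device 0 is the whole card in COMPOSITE mode. The driver distributes the
    // iterations and places the memory across the stacks (implicit scaling).
    #pragma omp target teams distribute parallel for device(0) \
        map(tofrom: pa[0:N]) map(to: pb[0:N])
    for (size_t i = 0; i < N; ++i)
        pa[i] += pb[i];

    std::printf("a[0] = %f\n", pa[0]);
    return 0;
}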
For more information about implicit scaling in COMPOSITE mode, refer to the Implicit Scaling section.
MPI Considerations
In an MPI application, each MPI rank may be configured to run on a GPU card or a GPU stack. Each rank can then use OpenMP or SYCL to offload work to its assigned card or stack.
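A common pattern, sketched below with hypothetical rank-to-device logic, is for each rank to select one of the exposed devices based on its rank; on multi-node runs, the node-local rank should be used instead of the global rank.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assign each rank to one of the exposed devices (stacks in FLAT mode,
    // cards in COMPOSITE mode). rank % num_devices assumes a single node or
    // a uniform round-robin placement of ranks per node.
    int num_devices = omp_get_num_devices();
    int my_device = rank % num_devices;
    omp_set_default_device(my_device);

    std::printf("Rank %d of %d using device %d of %d\n",
                rank, size, my_device, num_devices);

    // Subsequent target regions without a device clause go to my_device.
    #pragma omp target
    { /* kernel for this rank's device */ }

    MPI_Finalize();
    return 0;
}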
The table below shows common device configurations for MPI + OpenMP applications.
| FLAT or COMPOSITE | Device Exposed | MPI Rank Assignment | OpenMP Device(s) View | Implicit Scaling? | Recommended? |
|---|---|---|---|---|---|
| FLAT | Stack | 1 rank per stack, 2*N ranks in total | 1 stack as device0 | No | Yes |
| COMPOSITE | Card | 1 rank per stack, 2*N ranks in total | 1 stack as device0 | No | For expert users |
| COMPOSITE | Card | 1 rank per card, N ranks in total | 2 stacks as device0 and device1 | No | Yes |
| COMPOSITE | Card | 1 rank per card, N ranks in total | 1 card as device0 | Yes | If single-stack memory is not sufficient |
Obtaining System and Debugging Information
The following two schemes can be used to obtain information about the system and devices.
Before you run an application, it is recommended that you run the sycl-ls command on the command line to find out which devices are available on the platform. This information is especially useful when doing performance measurements.
Note that sycl-ls shows the devices seen or managed by all backends. For example, on a system with a single GPU card containing 2 stacks, sycl-ls shows 2 devices (one per stack) for each of the Level Zero and OpenCL backends.
$ sycl-ls
[level_zero:gpu][level_zero:0] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:1] ... Intel(R) Data Center GPU Max 1550 1.3
[opencl:gpu][opencl:0] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:1] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
Set the environment variable LIBOMPTARGET_DEBUG to 1 so that the runtime displays debugging information, including which devices were found and used. Note that LIBOMPTARGET_DEBUG is OpenMP-specific (it does not apply to SYCL). See the example in the FLAT Mode Example - OpenMP section.
Environment Variables to Control Device Exposure
The following environment variables can be used to control the hardware or devices that are exposed to the application.
ZE_FLAT_DEVICE_HIERARCHY=FLAT or COMPOSITE (default is FLAT). See Device Hierarchy in Level Zero Specification Documentation.
ONEAPI_DEVICE_SELECTOR. This environment variable is SYCL-specific (it does not apply to OpenMP). It controls what hardware is exposed to the application. For details, see ONEAPI_DEVICE_SELECTOR in the oneAPI DPC++ Compiler documentation.
ZE_AFFINITY_MASK. This environment variable controls what hardware is exposed by the Level Zero User-Mode Driver (UMD). For details, see Affinity Mask in Level Zero Specification Documentation.
LIBOMPTARGET_DEVICES=DEVICE or SUBDEVICE or SUBSUBDEVICE. This environment variable is OpenMP-specific (does not apply in SYCL). It can be used to map an OpenMP “device” to a GPU card (device), a stack (subdevice), or a Compute Command Streamer (subsubdevice). See Compiling and Running an OpenMP Application in oneAPI GPU Optimization Guide.
ZEX_NUMBER_OF_CCS. See example in the Advanced Topics section.
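To check the effect of these variables, a small sketch such as the following (not from this Guide) can print the devices each runtime actually sees. It assumes a compiler invocation that enables both SYCL and OpenMP offload for one source file; otherwise, split it into two programs.

#include <sycl/sycl.hpp>
#include <omp.h>
#include <iostream>

int main() {
    // Report what the OpenMP offload runtime sees after the environment variables above are set.
    std::cout << "OpenMP devices: " << omp_get_num_devices() << "\n";

    // Report what the SYCL runtime sees.
    for (const auto &d : sycl::device::get_devices(sycl::info::device_type::gpu))
        std::cout << "SYCL GPU device: "
                  << d.get_info<sycl::info::device::name>() << "\n";
    return 0;
}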