Visible to Intel only — GUID: GUID-5BBCC96F-67BA-4004-93E4-9665375A5E9F
Visible to Intel only — GUID: GUID-5BBCC96F-67BA-4004-93E4-9665375A5E9F
FLAT Mode Example - OpenMP
As previously mentioned, in FLAT mode, the stacks are exposed as devices.
Offloading to a single device (stack)
In this scheme, the default root device which is device 0 is used to offload. See code example below:
int device_id = omp_get_default_device();
#pragma omp target teams distribute parallel for device(device_id) map(...)
for (int i = 0, i < N; i++) {
...
}
Offloading to multiple devices (stacks)
In this scheme, we have multiple root devices (stacks) on which the code will run; the stacks may belong to one or more GPU cards. See code example below:
int num_devices = omp_get_num_devices();
#pragma omp parallel for
for (int device_id = 0; device_id < num_devices; device_id++) {
#pragma omp target teams distribute parallel for device(device_id) map(...)
for (int i = lb(device_id); I < ub(device_id); i++) {
...
}
}
We present below a full OpenMP program that offloads to multiple devices (stacks) in FLAT mode.
OpenMP Example
In the following program, flat_openmp_01.cpp, the array A is initialized on the device. First, we determine the number of devices (stacks) available, and then use the devices (stacks) to initialize different chunks of the array. The OpenMP device clause on the target pragma is used to specify which stack to use for a particular chunk. (If no device clause is specified, then the code will run on stack 0.)
omp_get_num_devices() returns the total number of devices (stacks) that are available. For example, on a 4-card system with 2 stacks each, the routine will return 8.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#define SIZE 320
int num_devices = omp_get_num_devices();
int chunksize = SIZE/num_devices;
int main(void)
{
int *A;
A = new int[sizeof(int) * SIZE];
printf ("num_devices = %d\n", num_devices);
for (int i = 0; i < SIZE; i++)
A[i] = -9;
#pragma omp parallel for
for (int id = 0; id < num_devices; id++) {
#pragma omp target teams distribute parallel for device(id) \
map(tofrom: A[id * chunksize : chunksize])
for (int i = id * chunksize; i < (id + 1) * chunksize; i++) {
A[i] = i;
}
}
for (int i = 0; i < SIZE; i++)
if (A[i] != i)
printf ("Error in: %d\n", A[i]);
else
printf ("%d\n", A[i]);
}
Compilation command:
$ icpx -fiopenmp -fopenmp-targets=spir64 flat_openmp_01.cpp
Run command:
$ OMP_TARGET_OFFLOAD=MANDATORY ./a.out
Notes:
OMP_TARGET_OFFLOAD=MANDATORY is used to make sure that the target region will run on the GPU. The program will fail if a GPU is not found.
There is no need to specify ZE_FLAT_DEVICE_HIERARCHY=FLAT with the run command, since FLAT mode is the default.
Running on a system with a single GPU card (2 stacks in total):
sycl-ls shows that there are 2 devices (corresponding to the 2 stacks):
$ sycl-ls
[level_zero:gpu][level_zero:0] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:1] ... Intel(R) Data Center GPU Max 1550 1.3
[opencl:gpu][opencl:0] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:1] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information.
$ OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log
We see the following in libomptarget_debug.log, showing that 2 devices (corresponding to the 2 stacks) have been found.
Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 2 root devices, 2 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0
Target LEVEL_ZERO RTL --> -- 1
Running on a system with 4 GPU cards (8 stacks in total)
sycl-ls shows that there are 8 devices (corresponding to the 8 stacks):
$ sycl-ls
[level_zero:gpu][level_zero:0] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:1] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:2] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:3] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:4] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:5] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:6] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:7] ... Intel(R) Data Center GPU Max 1550 1.3
[opencl:gpu][opencl:0] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:1] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:2] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:3] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:4] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:5] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:6] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:7] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information.
$ OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log
We see the following in libomptarget_debug.log, showing that 8 devices (corresponding to the 8 stacks) have been found:
Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 8 root devices, 8 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0
Target LEVEL_ZERO RTL --> -- 1
Target LEVEL_ZERO RTL --> -- 2
Target LEVEL_ZERO RTL --> -- 3
Target LEVEL_ZERO RTL --> -- 4
Target LEVEL_ZERO RTL --> -- 5
Target LEVEL_ZERO RTL --> -- 6
Target LEVEL_ZERO RTL --> -- 7