FLAT Mode Example

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 4/18/2024

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

FLAT Mode Example - OpenMP

As previously mentioned, in FLAT mode, the stacks are exposed as devices.

Offloading to a single device (stack)

In this scheme, the default root device which is device 0 is used to offload. See code example below:


int device_id = omp_get_default_device();

#pragma omp target teams distribute parallel for device(device_id) map(...)
for (int i = 0, i < N; i++) {
  ...
}

Offloading to multiple devices (stacks)

In this scheme, we have multiple root devices (stacks) on which the code will run; the stacks may belong to one or more GPU cards. See code example below:


int num_devices = omp_get_num_devices();

#pragma omp parallel for
for (int device_id = 0; device_id < num_devices; device_id++) {

    #pragma omp target teams distribute parallel for device(device_id) map(...)
    for (int i = lb(device_id); I < ub(device_id); i++) {
      ...
    }
}

We present below a full OpenMP program that offloads to multiple devices (stacks) in FLAT mode.

OpenMP Example

In the following program, flat_openmp_01.cpp, the array A is initialized on the device. First, we determine the number of devices (stacks) available, and then use the devices (stacks) to initialize different chunks of the array. The OpenMP device clause on the target pragma is used to specify which stack to use for a particular chunk. (If no device clause is specified, then the code will run on stack 0.)

omp_get_num_devices() returns the total number of devices (stacks) that are available. For example, on a 4-card system with 2 stacks each, the routine will return 8.


#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define SIZE 320

int num_devices = omp_get_num_devices();
int chunksize = SIZE/num_devices;

int main(void)
{
  int *A;
  A = new int[sizeof(int) * SIZE];

  printf ("num_devices = %d\n", num_devices);

  for (int i = 0; i < SIZE; i++)
      A[i] = -9;

  #pragma omp parallel for
  for (int id = 0; id < num_devices; id++) {
      #pragma omp target teams distribute parallel for device(id) \
	      map(tofrom: A[id * chunksize : chunksize])
      for (int i = id * chunksize; i < (id + 1) * chunksize; i++) {
          A[i] = i;
      }
  }

  for (int i = 0; i < SIZE; i++)
    if (A[i] != i)
      printf ("Error in: %d\n", A[i]);
    else
      printf ("%d\n", A[i]);
}

Compilation command:


$ icpx -fiopenmp -fopenmp-targets=spir64 flat_openmp_01.cpp

Run command:


$ OMP_TARGET_OFFLOAD=MANDATORY ./a.out

Notes:

OMP_TARGET_OFFLOAD=MANDATORY is used to make sure that the target region will run on the GPU. The program will fail if a GPU is not found.
There is no need to specify ZE_FLAT_DEVICE_HIERARCHY=FLAT with the run command, since FLAT mode is the default.

Running on a system with a single GPU card (2 stacks in total):

sycl-ls shows that there are 2 devices (corresponding to the 2 stacks):


$ sycl-ls
[level_zero:gpu][level_zero:0] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:1] ... Intel(R) Data Center GPU Max 1550 1.3
[opencl:gpu][opencl:0] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:1] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO

We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information.


$ OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log

We see the following in libomptarget_debug.log, showing that 2 devices (corresponding to the 2 stacks) have been found.


Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 2 root devices, 2 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0
Target LEVEL_ZERO RTL --> -- 1

Running on a system with 4 GPU cards (8 stacks in total)

sycl-ls shows that there are 8 devices (corresponding to the 8 stacks):


$ sycl-ls
[level_zero:gpu][level_zero:0] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:1] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:2] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:3] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:4] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:5] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:6] ... Intel(R) Data Center GPU Max 1550 1.3
[level_zero:gpu][level_zero:7] ... Intel(R) Data Center GPU Max 1550 1.3
[opencl:gpu][opencl:0] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:1] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:2] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:3] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:4] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:5] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:6] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO
[opencl:gpu][opencl:7] ... Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO

We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information.


$ OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log

We see the following in libomptarget_debug.log, showing that 8 devices (corresponding to the 8 stacks) have been found:


Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 8 root devices, 8 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0
Target LEVEL_ZERO RTL --> -- 1
Target LEVEL_ZERO RTL --> -- 2
Target LEVEL_ZERO RTL --> -- 3
Target LEVEL_ZERO RTL --> -- 4
Target LEVEL_ZERO RTL --> -- 5
Target LEVEL_ZERO RTL --> -- 6
Target LEVEL_ZERO RTL --> -- 7

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide

FLAT Mode Example - OpenMP

OpenMP Example