Explicit Scaling on Multi-GPU, Multi-Stack and Multi-C-Slice in OpenMP
Device Discovery
Before you run an application, especially for a performance measurement, it is recommended to run the sycl-ls command to find out which devices are available on the platform. This ensures that the run does not take a fallback path.
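In addition to checking the sycl-ls output, the application itself can verify at runtime that an offload device is present before any target region executes. The following is a minimal sketch using only standard OpenMP API calls; the messages and exit code are illustrative.

#include <cstdio>
#include <omp.h>

int main() {
  // If no offload device is available, target regions fall back to the host.
  const int num_devices = omp_get_num_devices();
  if (num_devices == 0) {
    std::printf("No offload devices found; target regions would run on the host.\n");
    return 1;
  }
  std::printf("Found %d offload device(s); default device = %d\n",
              num_devices, omp_get_default_device());
  return 0;
}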
Context
The OpenMP context is implicit. The runtime creates and manages the context, but an Intel extension to query the OpenMP context (omp_target_get_context()) has been added for interop support.
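As a rough illustration of how this extension might be used for interop, the sketch below assumes that omp_target_get_context() takes a device number and returns an opaque handle to the underlying context; the exact declaration is an assumption here, so consult the omp.h shipped with the Intel compiler before relying on it.

#include <cstdio>
#include <omp.h>

int main() {
  const int dev = omp_get_default_device();
  // Assumed signature: the extension takes a device number and returns an
  // opaque pointer to the backend context (e.g. for Level Zero interop).
  void *ctx = omp_target_get_context(dev);
  std::printf("context handle for device %d: %p\n", dev, ctx);
  return 0;
}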
Unified Shared Memory
Three new OpenMP APIs have been added as Intel extensions for USM memory allocation, and omp_target_alloc API support for USM has also been added. Please refer to the OpenMP USM Allocation API section for details.
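For illustration, the sketch below allocates USM device memory with the standard omp_target_alloc API mentioned above and passes it to a target region via is_device_ptr; the Intel extension allocators follow the same usage pattern, and their exact names and signatures are documented in the OpenMP USM Allocation API section.

#include <cstdio>
#include <omp.h>

int main() {
  const int N = 1024;
  const int dev = omp_get_default_device();

  // Device USM allocation with the standard OpenMP API.
  float *a = static_cast<float *>(omp_target_alloc(N * sizeof(float), dev));
  if (a == nullptr)
    return 1;

  // The allocation is passed to the target region as a device pointer.
  #pragma omp target teams distribute parallel for device(dev) is_device_ptr(a)
  for (int i = 0; i < N; ++i)
    a[i] = 2.0f * i;

  // Device memory is not host-accessible; copy one element back to verify.
  float result = 0.0f;
  omp_target_memcpy(&result, a, sizeof(float), 0, 100 * sizeof(float),
                    omp_get_initial_device(), dev);
  std::printf("a[100] = %f\n", result);

  omp_target_free(a, dev);
  return 0;
}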
Context Associated with Sub-sub-devices
In this scheme, a context is created with a single sub-sub-device in it, and the queue is attached to that sub-sub-device (compute slice). Execution and visibility are limited to that sub-sub-device only, which is expected to offer fine-grained partitioning and control at the compute-slice (C-slice) level. See the code example below:
#define DEVKIND 1 // C-Slice
int root_id = omp_get_default_device();
#pragma omp parallel for
for (int id = 0; id < NUM_SUBSUBDEVICES; ++id) {
  #pragma omp target teams distribute parallel for device(root_id) \
      subdevice(DEVKIND, id) map(…)
  for (int i = lb(id); i < ub(id); i++) {
    ...;
  }
}
The sub-sub-device scheme is recommended for running multiple small kernels concurrently on a stack at the compute-slice level, when these kernels do not have enough parallelism to utilize all compute-slices of a stack. Each stack of the Intel® Data Center GPU Max has four compute command streamers for dispatching kernels to their associated compute-slices, so each stack can run 4 smaller kernels concurrently. This means a 2-stack system can run 8 smaller kernels concurrently.
Context Associated with Sub-devices
In this scheme, a context is created with multiple sub-devices (multiple stacks) of the same root device, and queues are attached to the sub-devices, effectively implementing “explicit scaling.” The code runs on multiple sub-devices. See the code example below:
#define DEVKIND 0 // TILE
int root_id = omp_get_default_device();
#pragma omp parallel for
for (int id = 0; id < NUM_SUBDEVICES; ++id) {
  #pragma omp target teams distribute parallel for device(root_id) \
      subdevice(DEVKIND, id) map(…)
  for (int i = lb(id); i < ub(id); i++) {
    …
  }
}
Context Associated with Root Device
When a context is created with a single root device and a queue attached to it, the work is automatically distributed across all sub-devices/stacks via “implicit scaling” by the GPU driver. This is the simplest way to enable multi-stack utilization, but it does not offer the possibility to target specific stacks. See the code example below:
int root_id = omp_get_default_device();
#pragma omp target teams distribute parallel for device(root_id) map(…)
for (int i = 0; i < N; i++) {
  …
}
Context Associated with Multiple Root Devices
In this scheme, offload regions are launched on multiple root devices (multi-card). This is the most nonrestrictive context: with queues attached to different root devices, it offers the most sharing possibilities, at the cost of slow access through host memory or the explicit copies that are needed. See the code example below:
int num_devices = omp_get_num_devices();
#pragma omp parallel for
for (int root_id = 0; root_id < num_devices; root_id++) {
  #pragma omp target teams distribute parallel for device(root_id) map(…)
  for (int i = lb(root_id); i < ub(root_id); i++) {
    …
  }
}
Depending on the chosen explicit sub-device usage described above and the algorithm used, make sure to do proper memory allocation and synchronization. The following program is a full example using explicit sub-devices.
#include <assert.h>
#include <iostream>
#include <omp.h>
#include <stdint.h>
#include <stdlib.h>

#ifndef NUM_SUBDEVICES
#define NUM_SUBDEVICES 1
#endif

#ifndef DEVKIND
#define DEVKIND 0
#endif

template <int num_subdevices> struct mptr {
  float *p[num_subdevices];
};

int main(int argc, char **argv) {
  constexpr int SIZE = 8e6;
  constexpr int SIMD_SIZE = 32;
  constexpr std::size_t TOTAL_SIZE = SIZE * SIMD_SIZE;
  constexpr int num_subdevices = NUM_SUBDEVICES;

  mptr<num_subdevices> device_ptr_a;
  mptr<num_subdevices> device_ptr_b;
  mptr<num_subdevices> device_ptr_c;

  const int device_id = omp_get_default_device();
  std::cout << "device_id = " << device_id << std::endl;

  // Allocate host buffers and create a device allocation on each sub-device.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    device_ptr_a.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
    device_ptr_b.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
    device_ptr_c.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
#pragma omp target enter data map(alloc : device_ptr_a.p[sdev][0:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
#pragma omp target enter data map(alloc : device_ptr_b.p[sdev][0:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
#pragma omp target enter data map(alloc : device_ptr_c.p[sdev][0:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
  }

  std::cout << "memory footprint per GPU = "
            << 3 * (std::size_t)(TOTAL_SIZE) * sizeof(float) * 1E-9 << " GB"
            << std::endl;

  // Initialize the input arrays, one host thread per sub-device.
#pragma omp parallel for
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    float *a = device_ptr_a.p[sdev];
    float *b = device_ptr_b.p[sdev];
#pragma omp target teams distribute parallel for device(device_id) \
    subdevice(DEVKIND, sdev)
    for (int i = 0; i < TOTAL_SIZE; ++i) {
      a[i] = i + 0.5;
      b[i] = i - 0.5;
    }
  }

  // Timed vector-add loop: each host thread drives one sub-device.
  const int no_max_rep = 200;
  double time = 0.0;
  for (int irep = 0; irep < no_max_rep + 1; ++irep) {
    if (irep == 1)
      time = omp_get_wtime();
#pragma omp parallel for num_threads(num_subdevices)
    for (int sdev = 0; sdev < num_subdevices; ++sdev) {
      float *a = device_ptr_a.p[sdev];
      float *b = device_ptr_b.p[sdev];
      float *c = device_ptr_c.p[sdev];
#pragma omp target teams distribute parallel for device(device_id) \
    subdevice(DEVKIND, sdev)
      for (int i = 0; i < TOTAL_SIZE; ++i) {
        c[i] = a[i] + b[i];
      }
    }
  }
  time = omp_get_wtime() - time;
  time = time / no_max_rep;

  const std::size_t streamed_bytes =
      3 * (std::size_t)(TOTAL_SIZE) * num_subdevices * sizeof(float);

  std::cout << "bandwidth = " << (streamed_bytes / time) * 1E-9 << " GB/s"
            << std::endl;
  std::cout << "time = " << time << " s" << std::endl;
  std::cout.precision(10);

  // Copy the results back from each sub-device and print a few samples.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
#pragma omp target update from(device_ptr_c.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
    std::cout << "-GPU: device id = " << sdev << std::endl;
    std::cout << "target result:" << std::endl;
    std::cout << "c[" << 0 << "] = " << device_ptr_c.p[sdev][0] << std::endl;
    std::cout << "c[" << SIMD_SIZE - 1 << "] = "
              << device_ptr_c.p[sdev][SIMD_SIZE - 1] << std::endl;
    std::cout << "c[" << TOTAL_SIZE / 2 << "] = "
              << device_ptr_c.p[sdev][TOTAL_SIZE / 2] << std::endl;
    std::cout << "c[" << TOTAL_SIZE - 1 << "] = "
              << device_ptr_c.p[sdev][TOTAL_SIZE - 1] << std::endl;
  }

  // Copy a and b back as well, then verify c = a + b on the host.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
#pragma omp target update from(device_ptr_a.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
#pragma omp target update from(device_ptr_b.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
    for (int i = 0; i < TOTAL_SIZE; ++i) {
      assert((int)(device_ptr_c.p[sdev][i]) ==
             (int)(device_ptr_a.p[sdev][i] + device_ptr_b.p[sdev][i]));
    }
  }

  // Release the device allocations on each sub-device.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
#pragma omp target exit data map(release : device_ptr_a.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
#pragma omp target exit data map(release : device_ptr_b.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
#pragma omp target exit data map(release : device_ptr_c.p[sdev][:TOTAL_SIZE]) \
    device(device_id) subdevice(DEVKIND, sdev)
  }
}
With NUM_SUBDEVICES=2, explicit scaling of this OpenMP workload achieves linear scaling (~2x) on an Intel® Data Center GPU Max system, using Ahead-Of-Time (AOT) compilation. The build command is:
icpx -fp-contract=fast -O2 -ffast-math -DNUM_SUBDEVICES=2 -fopenmp \
    -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device xehp" \
    openmp_explicit_subdevice.cpp -o run.x
Besides the “subdevice” clause language extension, an environment variable is provided by the OpenMP offloading compiler and runtime to map an OpenMP “device” to a GPU card (device), a stack (sub-device), or a compute-slice (sub-sub-device). The environment variable is:
LIBOMPTARGET_DEVICES=[device | subdevice | subsubdevice]
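For example, to expose each stack (sub-device) as a separate OpenMP device when launching the binary built above (the pairing with run.x is illustrative), the variable can be set on the command line:

export LIBOMPTARGET_DEVICES=subdevice
./run.x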