Explicit Scaling on Multi-GPU, Multi-Stack, Multi-C-Slice in SYCL
In this section, we describe the SYCL explicit scaling language API and provide usage examples on an Arctic Sound-based platform for multi-GPU and multi-stack execution.
Device discovery
Before you run an application, it is recommended to run the sycl-ls command to find out which devices are available on the platform, especially when the run is intended for performance measurement. This ensures the run does not take a fallback path.
Root-device
Intel GPUs are represented as SYCL GPU devices, known as root-devices. The easiest way to discover root-devices is with the sycl-ls tool, for example:
$ sycl-ls
[opencl:0] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.35.020776]
[level_zero:0] GPU : Intel(R) Level-Zero 1.1 [1.1.20776]
[host:0] HOST: SYCL host platform 1.2 [1.2]
Note that sycl-ls shows devices from all platforms of all SYCL backends that are seen by the SYCL runtime. Thus, in the example above, the same single physical GPU appears as a root-device under both the OpenCL and the Level-Zero backend.
One can use the ONEAPI_DEVICE_SELECTOR environment variable, described in EnvironmentVariables.md, to limit which devices are exposed to the SYCL runtime, for example:
$ ONEAPI_DEVICE_SELECTOR=level_zero:* sycl-ls
[level_zero:0] GPU : Intel(R) Level-Zero 1.1 [1.1.20776]
If there are multiple GPUs in a system then they will be seen as multiple different root-devices. On Linux these would be multiple SYCL root-devices of the same SYCL platform (representing Level-Zero driver). On Windows* these would appear as root-devices of multiple different SYCL platforms (Level-Zero drivers).
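As an illustrative sketch (not part of the original example set), the same enumeration can be done programmatically: the following small program lists every GPU root-device seen by the SYCL runtime together with the platform it belongs to.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // Query every GPU root-device seen by the SYCL runtime.
  for (const auto &D : sycl::device::get_devices(sycl::info::device_type::gpu)) {
    std::cout << D.get_info<sycl::info::device::name>() << " ["
              << D.get_platform().get_info<sycl::info::platform::name>() << "]\n";
  }
  return 0;
}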
The CreateMultipleRootDevices=N and NEOReadDebugKeys=1 environment variables can be used to emulate multiple GPU cards. For example:
$ CreateMultipleRootDevices=2 NEOReadDebugKeys=1 \
  ONEAPI_DEVICE_SELECTOR=level_zero:* sycl-ls
[level_zero:0] GPU : Intel(R) Level-Zero 1.1 [1.1.20776]
[level_zero:1] GPU : Intel(R) Level-Zero 1.1 [1.1.20776]
Sub-device
Intel® Data Center GPU Max 1350 or 1550 has two stacks. The root-device, corresponding to the whole GPU, can be partitioned into two sub-devices, each corresponding to a physical stack.
try {
  vector<device> SubDevices = RootDevice.create_sub_devices<
      cl::sycl::info::partition_property::partition_by_affinity_domain>(
      cl::sycl::info::partition_affinity_domain::numa);
}
Each call to create_sub_devices returns exactly the same sub-devices, in a persistent order. To control which sub-devices are exposed by the Level-Zero UMD, one can use the ZE_AFFINITY_MASK environment variable. Note that partition_by_affinity_domain is the only partitioning supported for Intel GPUs.
Similarly, next_partitionable and numa are the only affinity domains supported (both doing the same thing). The CreateMultipleRootDevices=N NEOReadDebugKeys=1 environment variables can be used to emulate multiple stacks of a GPU.
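For example, one illustrative way to restrict visibility to stack 0 of the first card (see the Level-Zero documentation for the full ZE_AFFINITY_MASK syntax) is:

$ ZE_AFFINITY_MASK=0.0 sycl-ls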
Sub-sub-device
Each sub-device (stack) can be further decomposed into a set of sub-sub-devices (compute slices). One can create a context associated with a sub-sub-device. In this scheme, the execution resource is limited to that sub-sub-device, giving the program fine-grained control at the compute-slice level. The following code finds all sub-devices and sub-sub-devices of a device:
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::device d(sycl::gpu_selector{});

  std::vector<sycl::device> subdevices; // stacks (tiles)
  std::vector<sycl::device> CCS;        // compute slices

  // Partition the root-device into sub-devices (stacks), if supported.
  auto part_prop = d.get_info<sycl::info::device::partition_properties>();
  size_t num_of_tiles = 1;
  size_t num_of_ccs = 1;

  if (!part_prop.empty()) {
    for (size_t i = 0; i < part_prop.size(); i++) {
      if (part_prop[i] ==
          sycl::info::partition_property::partition_by_affinity_domain) {
        auto sub_devices = d.create_sub_devices<
            sycl::info::partition_property::partition_by_affinity_domain>(
            sycl::info::partition_affinity_domain::numa);
        num_of_tiles = sub_devices.size();
        for (size_t j = 0; j < num_of_tiles; j++)
          subdevices.push_back(sub_devices[j]);
        break;
      }
    }
  }
  // If the device cannot be partitioned, treat the root-device as the only tile.
  if (subdevices.empty()) {
    num_of_tiles = 1;
    subdevices.push_back(d);
  }

  std::cout << "List of Tiles:\n";
  for (size_t i = 0; i < num_of_tiles; i++) {
    std::cout << i << ") Device name: "
              << subdevices[i].get_info<sycl::info::device::name>() << "\n";
    std::cout << "   Max Compute Units: "
              << subdevices[i].get_info<sycl::info::device::max_compute_units>()
              << "\n";
  }

  // Partition each sub-device further into sub-sub-devices (compute slices).
  for (size_t j = 0; j < num_of_tiles; j++) {
    auto part_prop1 =
        subdevices[j].get_info<sycl::info::device::partition_properties>();
    if (part_prop1.empty()) {
      std::cout << "No partition support\n";
    } else {
      for (size_t i = 0; i < part_prop1.size(); i++) {
        if (part_prop1[i] ==
            sycl::info::partition_property::partition_by_affinity_domain) {
          auto ccses = subdevices[j].create_sub_devices<
              sycl::info::partition_property::partition_by_affinity_domain>(
              sycl::info::partition_affinity_domain::numa);
          num_of_ccs = ccses.size();
          for (size_t k = 0; k < num_of_ccs; k++)
            CCS.push_back(ccses[k]);
          break;
        }
      }
    }
  }

  std::cout << "List of Compute Command Streamers:\n";
  for (size_t i = 0; i < CCS.size(); i++) {
    std::cout << i << ") Device name: "
              << CCS[i].get_info<sycl::info::device::name>() << "\n";
    std::cout << "   Max Compute Units: "
              << CCS[i].get_info<sycl::info::device::max_compute_units>() << "\n";
  }
  return 0;
}
Context
Contexts are used for resource isolation and sharing. A SYCL context may consist of one or multiple devices. Both root-devices and sub-devices can be within a single context, but they all must belong to the same SYCL platform. A SYCL program (kernel_bundle) created against a context with multiple devices is built for each of the root-devices in the context. For a context that consists of multiple sub-devices of the same root-device, only a single build (for that root-device) is needed.
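As a minimal illustrative sketch (assuming SubDevices was obtained with create_sub_devices as shown earlier), a context spanning those sub-devices can be created and the kernel bundle built once for it:

// Sketch: a context over sub-devices of one root-device; building the
// executable kernel bundle for this context requires only a single
// device build, which then serves all the sub-devices.
sycl::context C(SubDevices);
auto Bundle =
    sycl::get_kernel_bundle<sycl::bundle_state::executable>(C);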
Unified shared memory
Memory allocated against a root-device is accessible by all of its sub-devices (stacks). So if you are operating on a context with multiple sub-devices of the same root-device, you can use malloc_device on that root-device instead of using the slower malloc_host. Remember that if using malloc_device you need an explicit copy out to the host if it is necessary to see the data there. Please refer to section Unified Shared Memory Allocations for the details on the three types of USM allocations.
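A brief illustrative sketch of the explicit copy-out that device allocations require (RootDevice refers to the earlier examples; N is a placeholder size):

sycl::queue Q(RootDevice);               // or a queue on one of its sub-devices
double *Data = sycl::malloc_device<double>(N, Q);

// Kernels submitted to this queue can read and write Data directly.
Q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { Data[i] = 1.0; }).wait();

// Unlike malloc_host/malloc_shared, the host cannot dereference Data;
// an explicit copy is needed to see the results on the host.
std::vector<double> Host(N);
Q.memcpy(Host.data(), Data, N * sizeof(double)).wait();
sycl::free(Data, Q);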
Buffer
SYCL buffers are also created against a context and are mapped to the Level-Zero USM allocations discussed above. The current mapping is as follows:
For an integrated device, the allocations are made on the host, and are accessible by the host and the device without any copying.
Memory buffers for a context with sub-devices of the same root-device (possibly including the root-device itself) are allocated on that root-device. Thus they are readily accessible by all the devices in such a context. Synchronization with the host is performed by the SYCL runtime, with map/unmap doing implicit copies when necessary (see the sketch after this list).
Memory buffers for a context with devices from different root-devices are allocated on the host (and thus made accessible to all devices).
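The following is an illustrative sketch (assuming SubDevices from the earlier examples; N is a placeholder size) of a buffer created against a context with sub-devices of one root-device and used from queues on both stacks:

constexpr size_t N = 1024;  // placeholder size
std::vector<float> V(N, 1.0f);
sycl::context C(SubDevices);
sycl::queue Q0(C, SubDevices[0]);
sycl::queue Q1(C, SubDevices[1]);
{
  sycl::buffer<float> Buf(V.data(), sycl::range<1>(N));
  // Produce on stack 0 ...
  Q0.submit([&](sycl::handler &h) {
    sycl::accessor A(Buf, h, sycl::write_only);
    h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { A[i] = float(i[0]); });
  });
  // ... and consume on stack 1; the runtime inserts the needed dependency.
  Q1.submit([&](sycl::handler &h) {
    sycl::accessor A(Buf, h, sycl::read_write);
    h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { A[i] += 1.0f; });
  });
} // Buffer destruction synchronizes the data back to V on the host.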
Queue
A SYCL queue is always attached to a single device in a possibly multi-device context. In order from most performant to least performant, here are some typical scenarios:
Context associated with a single sub-device
Create a context with a single sub-device in it and attach the queue to that sub-device (stack). In this scheme, execution and visibility are limited to that single sub-device, which is expected to offer the best performance per stack. See the code example:
try {
  vector<device> SubDevices = ...;
  for (auto &D : SubDevices) {
    // Each queue is in its own context, no data sharing across them.
    auto Q = queue(D);
    Q.submit([&](handler &cgh) { ... });
  }
}
Context associated with multiple sub-devices
Create a context with multiple sub-devices (multiple stacks) of the same root-device and attach a queue to each sub-device, effectively implementing “explicit scaling”. For better performance, the root-device should not be included in such a context. See the code example below:
try {
  vector<device> SubDevices = ...;
  auto C = context(SubDevices);
  for (auto &D : SubDevices) {
    // All queues share the same context, data can be shared across
    // queues.
    auto Q = queue(C, D);
    Q.submit([&](handler &cgh) { ... });
  }
}
Context associated with root device
Create a context with a single root-device in it and attach the queue to that root-device. In this scheme, the work is automatically distributed across all sub-devices/stacks via “implicit scaling” by the GPU driver. This is the simplest way to use multi-stack hardware, but it does not offer the ability to target specific stacks. See the code example below:
try {
  // The queue is attached to the root-device; the driver distributes work to
  // sub-devices, if any.
  auto D = device(gpu_selector{});
  auto Q = queue(D);
  Q.submit([&](handler &cgh) { ... });
}
Context associated with multiple root devices
Create a context with multiple root-devices (multi-card). This is the least restrictive scheme, with queues attached to different root-devices in one context; it offers the most sharing possibilities, at the cost of slower access through host memory or explicit copies. See the code example:
try {
  auto P = platform(gpu_selector{});
  auto RootDevices = P.get_devices();
  auto C = context(RootDevices);
  for (auto &D : RootDevices) {
    // The context has multiple root-devices; data can be shared across
    // multiple cards (requires explicit copying).
    auto Q = queue(C, D);
    Q.submit([&](handler &cgh) { ... });
  }
}
Depending on the chosen sub-device usage scheme described above and the algorithm used, make sure to do proper memory allocation and synchronization. The following program is a full example using explicit sub-devices, with one queue per stack:
#include <CL/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <iostream>
#include <limits>
#include <string>
using namespace sycl;

constexpr int num_runs = 10;
constexpr size_t scalar = 3;

cl_ulong triad(size_t array_size) {
  cl_ulong min_time_ns0 = std::numeric_limits<cl_ulong>::max();
  cl_ulong min_time_ns1 = std::numeric_limits<cl_ulong>::max();

  device dev = device(gpu_selector());

  // Partition the root-device into per-stack sub-devices.
  std::vector<device> subdev = {};
  subdev = dev.create_sub_devices<sycl::info::partition_property::
      partition_by_affinity_domain>(sycl::info::partition_affinity_domain::numa);

  // One queue per stack, with profiling enabled.
  queue q[2] = {queue(subdev[0], property::queue::enable_profiling{}),
                queue(subdev[1], property::queue::enable_profiling{})};

  std::cout << "Running on device: "
            << q[0].get_device().get_info<info::device::name>() << "\n";
  std::cout << "Running on device: "
            << q[1].get_device().get_info<info::device::name>() << "\n";

  // Each stack works on half of the arrays (sizes are in elements).
  double *A0 = malloc_shared<double>(array_size / 2, q[0]);
  double *B0 = malloc_shared<double>(array_size / 2, q[0]);
  double *C0 = malloc_shared<double>(array_size / 2, q[0]);
  double *A1 = malloc_shared<double>(array_size / 2, q[1]);
  double *B1 = malloc_shared<double>(array_size / 2, q[1]);
  double *C1 = malloc_shared<double>(array_size / 2, q[1]);

  for (size_t i = 0; i < array_size / 2; i++) {
    A0[i] = 1.0; B0[i] = 2.0; C0[i] = 0.0;
    A1[i] = 1.0; B1[i] = 2.0; C1[i] = 0.0;
  }

  for (int i = 0; i < num_runs; i++) {
    auto q0_event = q[0].submit([&](handler &h) {
      h.parallel_for(array_size / 2, [=](id<1> idx) {
        C0[idx] = A0[idx] + B0[idx] * scalar;
      });
    });
    auto q1_event = q[1].submit([&](handler &h) {
      h.parallel_for(array_size / 2, [=](id<1> idx) {
        C1[idx] = A1[idx] + B1[idx] * scalar;
      });
    });
    q[0].wait();
    q[1].wait();

    cl_ulong exec_time_ns0 =
        q0_event.get_profiling_info<info::event_profiling::command_end>() -
        q0_event.get_profiling_info<info::event_profiling::command_start>();
    std::cout << "Tile-0 Execution time (iteration " << i << ") [sec]: "
              << (double)exec_time_ns0 * 1.0E-9 << "\n";
    min_time_ns0 = std::min(min_time_ns0, exec_time_ns0);

    cl_ulong exec_time_ns1 =
        q1_event.get_profiling_info<info::event_profiling::command_end>() -
        q1_event.get_profiling_info<info::event_profiling::command_start>();
    std::cout << "Tile-1 Execution time (iteration " << i << ") [sec]: "
              << (double)exec_time_ns1 * 1.0E-9 << "\n";
    min_time_ns1 = std::min(min_time_ns1, exec_time_ns1);
  }

  // Check correctness.
  bool error = false;
  for (size_t i = 0; i < array_size / 2; i++) {
    if ((C0[i] != A0[i] + scalar * B0[i]) || (C1[i] != A1[i] + scalar * B1[i])) {
      std::cout << "\nResult incorrect (element " << i << " is " << C0[i] << ")!\n";
      error = true;
    }
  }

  sycl::free(A0, q[0]); sycl::free(B0, q[0]); sycl::free(C0, q[0]);
  sycl::free(A1, q[1]); sycl::free(B1, q[1]); sycl::free(C1, q[1]);

  if (error) return -1;

  std::cout << "Results are correct!\n\n";
  return std::max(min_time_ns0, min_time_ns1);
}

int main(int argc, char *argv[]) {
  size_t array_size;
  if (argc > 1) {
    array_size = std::stoi(argv[1]);
  } else {
    std::cout << "Run as ./<progname> <arraysize in elements>\n";
    return 1;
  }

  std::cout << "Running with stream size of " << array_size << " elements ("
            << (array_size * sizeof(double)) / (double)1024 / 1024 << "MB)\n";

  cl_ulong min_time = triad(array_size);
  // triad() signals an error by returning (cl_ulong)-1.
  if (min_time == static_cast<cl_ulong>(-1))
    return 1;

  size_t triad_bytes = 3 * sizeof(double) * array_size;
  std::cout << "Triad Bytes: " << triad_bytes << "\n";
  std::cout << "Time in sec (fastest run): " << min_time * 1.0E-9 << "\n";

  double triad_bandwidth = 1.0E-09 * triad_bytes / (min_time * 1.0E-9);
  std::cout << "Bandwidth of fastest run in GB/s: " << triad_bandwidth << "\n";
  return 0;
}
The build command using Ahead-Of-Time (AOT) compilation is:
icpx -fsycl -fsycl-targets=spir64_gen -O2 -ffast-math -Xs "-device xehp" explicit-subdevice.cpp -o run.exe