Advanced Topics
Compute Command Streamers (CCSs)
Each stack of the Intel® Data Center GPU Max contains 4 Compute Command Streamers (CCSs), which can be used to access a pool of Execution Units (EUs).
The hardware allows a specific distribution of EUs among the CCSs to be selected: the EUs in a stack may be assigned to a single CCS, to 2 CCSs, or to all 4 CCSs in the stack.
1-CCS mode (Default): In this mode, 1 CCS in each stack is exposed. The CCS has access to all the EUs in the stack. Other CCSs are disabled.
2-CCS mode: In this mode, 2 CCSs in each stack are exposed, each having half of the EUs in the stack assigned to it. If the EUs of one CCS are idle, those EUs cannot be used by the other CCS.
4-CCS mode: In this mode, all 4 CCSs of the stack are exposed, each having a quarter of the EUs in the stack assigned to it. As with the 2-CCS mode, EUs of idle CCSs cannot be used by other CCSs.
The default is 1-CCS mode.
Some applications may benefit from using 1 CCS per stack to access all the EUs in the stack, while other applications may benefit from using 2 or 4 CCSs per stack where a subset of the EUs are assigned to each CCS.
Using 2 or 4 CCSs per stack may be useful when running multiple small kernels concurrently on a stack, where each kernel does not require all the compute resources (EUs) of the stack. In this case, it may be advantageous to submit different kernels to different CCSs in the stack, thus allowing the kernels to run in parallel.
The environment variable ZEX_NUMBER_OF_CCS can be used to specify how many CCSs are exposed in each of the stacks in a GPU card.
The format for ZEX_NUMBER_OF_CCS is a comma-separated list of <Root Device Index>:<CCS Mode> pairs, i.e., ZEX_NUMBER_OF_CCS=<Root Device Index>:<CCS Mode>,<Root Device Index>:<CCS Mode>,… For instance, on a GPU card with 2 stacks, one would specify the following to set stack 0 in 4-CCS mode and stack 1 in 1-CCS mode:
ZEX_NUMBER_OF_CCS=0:4,1:1
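For example, this setting can be applied directly on the command line when launching an application (./my_app is a hypothetical binary name used only for illustration):
$ ZEX_NUMBER_OF_CCS=0:4,1:1 ./my_app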
Below we show examples of exposing CCSs in SYCL, OpenMP and MPI applications.
Using Multiple CCSs in SYCL
In SYCL, one can create a context associated with a CCS (subsubdevice), giving the program fine-grained control at the CCS level. The following example finds all stacks (subdevices) and CCSs (subsubdevices) on a GPU card (device):
#include <iostream>
#include <sycl/sycl.hpp>
#include <vector>
int main() {
// Find all GPU devices
auto devices = sycl::platform(sycl::gpu_selector_v).get_devices();
for (int n = 0; n < devices.size(); n++) {
std::cout << "\nGPU" << n << ": "
<< devices[n].get_info<sycl::info::device::name>() << " ("
<< devices[n].get_info<sycl::info::device::max_compute_units>()
<< ")\n";
std::vector<sycl::device> subdevices;
std::vector<sycl::device> subsubdevices;
auto part_prop =
devices[n].get_info<sycl::info::device::partition_properties>();
if (part_prop.empty()) {
std::cout << "No partition_properties\n";
} else {
for (int i = 0; i < part_prop.size(); i++) {
// Check if device can be partitioned into Tiles
if (part_prop[i] ==
sycl::info::partition_property::partition_by_affinity_domain) {
auto sub_devices =
devices[n]
.create_sub_devices<sycl::info::partition_property::
partition_by_affinity_domain>(
sycl::info::partition_affinity_domain::numa);
for (int j = 0; j < sub_devices.size(); j++) {
subdevices.push_back(sub_devices[j]);
std::cout << "\ntile" << j << ": "
<< subdevices[j].get_info<sycl::info::device::name>()
<< " ("
<< subdevices[j]
.get_info<sycl::info::device::max_compute_units>()
<< ")\n";
auto part_prop1 =
subdevices[j]
.get_info<sycl::info::device::partition_properties>();
if (part_prop1.empty()) {
std::cout << "No partition_properties\n";
} else {
for (int i = 0; i < part_prop1.size(); i++) {
// Check if Tile can be partitioned into Slices (CCS)
if (part_prop1[i] == sycl::info::partition_property::
ext_intel_partition_by_cslice) {
auto sub_devices =
subdevices[j]
.create_sub_devices<
sycl::info::partition_property::
ext_intel_partition_by_cslice>();
for (int k = 0; k < sub_devices.size(); k++) {
subsubdevices.push_back(sub_devices[k]);
std::cout
<< "slice" << k << ": "
<< subsubdevices[k].get_info<sycl::info::device::name>()
<< " ("
<< subsubdevices[k]
.get_info<
sycl::info::device::max_compute_units>()
<< ")\n";
}
break;
} else {
std::cout << "No ext_intel_partition_by_cslice\n";
}
}
}
}
break;
// Check if device can be partitioned into Slices (CCS)
} else if (part_prop[i] == sycl::info::partition_property::
ext_intel_partition_by_cslice) {
auto sub_devices =
devices[n]
.create_sub_devices<sycl::info::partition_property::
ext_intel_partition_by_cslice>();
for (int k = 0; k < sub_devices.size(); k++) {
subsubdevices.push_back(sub_devices[k]);
std::cout << "slice" << k << ": "
<< subsubdevices[k].get_info<sycl::info::device::name>()
<< " ("
<< subsubdevices[k]
.get_info<sycl::info::device::max_compute_units>()
<< ")\n";
}
break;
} else {
std::cout << "No ext_intel_partition_by_cslice or "
"partition_by_affinity_domain\n";
}
}
}
}
return 0;
}
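Once a CCS (subsubdevice) has been obtained, a sycl::queue can be constructed directly on it, so that kernels submitted to that queue use only the EUs assigned to that CCS. The following minimal sketch (not part of the original example) targets the first CCS when cslice partitioning is available on the root device, and otherwise falls back to the whole device:
#include <sycl/sycl.hpp>

int main() {
  sycl::device dev(sycl::gpu_selector_v);
  sycl::device target = dev; // fall back to the whole device if needed
  auto props = dev.get_info<sycl::info::device::partition_properties>();
  for (auto p : props) {
    if (p == sycl::info::partition_property::ext_intel_partition_by_cslice) {
      // Take the first CCS exposed on this device
      target = dev.create_sub_devices<
          sycl::info::partition_property::ext_intel_partition_by_cslice>()[0];
      break;
    }
  }
  sycl::queue q(target);
  int *data = sycl::malloc_shared<int>(1024, q);
  // This kernel runs only on the EUs assigned to the selected CCS
  q.parallel_for(sycl::range<1>(1024),
                 [=](sycl::id<1> i) { data[i] = static_cast<int>(i[0]); })
      .wait();
  sycl::free(data, q);
  return 0;
}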
The SYCL code below demonstrates how multiple kernels can be submitted to multiple CCSs to execute concurrently.
The example finds all CCSs, creates a sycl::queue for each CCS found on the GPU device, and submits a kernel to each CCS in a for-loop.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <sycl/sycl.hpp>
#include <vector>
static constexpr size_t N = 5280; // global size
static constexpr size_t B = 32; // WG size
void kernel_compute_mm(sycl::queue &q, float *a, float *b, float *c, size_t n,
size_t wg) {
q.parallel_for(
sycl::nd_range<2>(sycl::range<2>{n, n}, sycl::range<2>{wg, wg}),
[=](sycl::nd_item<2> item) {
const int i = item.get_global_id(0);
const int j = item.get_global_id(1);
float temp = 0.0f;
for (int k = 0; k < N; k++) {
temp += a[i * N + k] * b[k * N + j];
}
c[i * N + j] = temp;
});
}
int main() {
auto start =
std::chrono::high_resolution_clock::now().time_since_epoch().count();
// find all CCS / Tiles in GPU
auto device = sycl::device(sycl::gpu_selector_v);
std::cout << "\nGPU: " << device.get_info<sycl::info::device::name>() << " ("
<< device.get_info<sycl::info::device::max_compute_units>()
<< ")\n";
std::vector<sycl::device> subdevices;
std::vector<sycl::device> subsubdevices;
auto part_prop = device.get_info<sycl::info::device::partition_properties>();
if (part_prop.empty()) {
std::cout << "No partition_properties\n";
} else {
for (int i = 0; i < part_prop.size(); i++) {
// Check if device can be partitioned into Tiles
if (part_prop[i] ==
sycl::info::partition_property::partition_by_affinity_domain) {
auto sub_devices = device.create_sub_devices<
sycl::info::partition_property::partition_by_affinity_domain>(
sycl::info::partition_affinity_domain::numa);
for (int j = 0; j < sub_devices.size(); j++) {
subdevices.push_back(sub_devices[j]);
std::cout
<< "\nTile" << j << ": "
<< subdevices[j].get_info<sycl::info::device::name>() << " ("
<< subdevices[j].get_info<sycl::info::device::max_compute_units>()
<< ")\n";
auto part_prop1 =
subdevices[j]
.get_info<sycl::info::device::partition_properties>();
if (part_prop1.empty()) {
std::cout << "No partition_properties\n";
} else {
for (int i = 0; i < part_prop1.size(); i++) {
// Check if Tile can be partitioned into Slices (CCS)
if (part_prop1[i] == sycl::info::partition_property::
ext_intel_partition_by_cslice) {
auto sub_devices = subdevices[j]
.create_sub_devices<
sycl::info::partition_property::
ext_intel_partition_by_cslice>();
for (int k = 0; k < sub_devices.size(); k++) {
subsubdevices.push_back(sub_devices[k]);
std::cout
<< "Slice" << k << ": "
<< subsubdevices[k].get_info<sycl::info::device::name>()
<< " ("
<< subsubdevices[k]
.get_info<sycl::info::device::max_compute_units>()
<< ")\n";
}
break;
} else {
std::cout << "No ext_intel_partition_by_cslice\n";
}
}
}
}
break;
// Check if device can be partitioned into Slices (CCS)
} else if (part_prop[i] == sycl::info::partition_property::
ext_intel_partition_by_cslice) {
auto sub_devices = device.create_sub_devices<
sycl::info::partition_property::ext_intel_partition_by_cslice>();
for (int k = 0; k < sub_devices.size(); k++) {
subsubdevices.push_back(sub_devices[k]);
std::cout << "Slice" << k << ": "
<< subsubdevices[k].get_info<sycl::info::device::name>()
<< " ("
<< subsubdevices[k]
.get_info<sycl::info::device::max_compute_units>()
<< ")\n";
}
break;
} else {
std::cout << "No ext_intel_partition_by_cslice or "
"partition_by_affinity_domain\n";
}
}
}
// Set devices to submit compute kernel
std::vector<sycl::device> devices(1, device);
if (subsubdevices.size())
devices = subsubdevices;
else if (subdevices.size())
devices = subdevices;
auto num_devices = devices.size();
// Define matrices
float *matrix_a[num_devices];
float *matrix_b[num_devices];
float *matrix_c[num_devices];
float v1 = 2.f;
float v2 = 3.f;
for (int n = 0; n < num_devices; n++) {
matrix_a[n] = static_cast<float *>(malloc(N * N * sizeof(float)));
matrix_b[n] = static_cast<float *>(malloc(N * N * sizeof(float)));
matrix_c[n] = static_cast<float *>(malloc(N * N * sizeof(float)));
// Initialize matrices with values
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++) {
matrix_a[n][i * N + j] = v1++;
matrix_b[n][i * N + j] = v2++;
matrix_c[n][i * N + j] = 0.f;
}
}
float *da[num_devices];
float *db[num_devices];
float *dc[num_devices];
std::vector<sycl::queue> q(num_devices);
// create queues for each device
std::cout << "\nSubmitting Compute Kernel to Devices:\n";
for (int i = 0; i < num_devices; i++) {
q[i] = sycl::queue(devices[i]);
std::cout
<< "Device" << i << ": "
<< q[i].get_device().get_info<sycl::info::device::name>() << " ("
<< q[i].get_device().get_info<sycl::info::device::max_compute_units>()
<< ")\n";
}
// device mem alloc for matrix a,b,c for each device
for (int i = 0; i < num_devices; i++) {
da[i] = sycl::malloc_device<float>(N * N, q[i]);
db[i] = sycl::malloc_device<float>(N * N, q[i]);
dc[i] = sycl::malloc_device<float>(N * N, q[i]);
}
// warm up: kernel submit with zero size
for (int i = 0; i < num_devices; i++)
kernel_compute_mm(q[i], da[i], db[i], dc[i], 0, 0);
// kernel sync
for (int i = 0; i < num_devices; i++)
q[i].wait();
// memcpy for matrix and b to device alloc
for (int i = 0; i < num_devices; i++) {
q[i].memcpy(&da[i][0], &matrix_a[i][0], N * N * sizeof(float));
q[i].memcpy(&db[i][0], &matrix_b[i][0], N * N * sizeof(float));
}
// wait for copy to complete
for (int i = 0; i < num_devices; i++)
q[i].wait();
// submit matrix multiply kernels to all devices
for (int i = 0; i < num_devices; i++)
kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
// wait for compute complete
for (int i = 0; i < num_devices; i++)
q[i].wait();
// copy back result to host
for (int i = 0; i < num_devices; i++)
q[i].memcpy(&matrix_c[i][0], &dc[i][0], N * N * sizeof(float));
// wait for copy to complete
for (int i = 0; i < num_devices; i++)
q[i].wait();
// print first element of result matrix
std::cout << "\nMatrix Multiplication Complete\n";
for (int i = 0; i < num_devices; i++)
std::cout << "device" << i << ": matrix_c[0][0]=" << matrix_c[i][0] << "\n";
for (int i = 0; i < num_devices; i++) {
free(matrix_a[i]);
free(matrix_b[i]);
free(matrix_c[i]);
sycl::free(da[i], q[i]);
sycl::free(db[i], q[i]);
sycl::free(dc[i], q[i]);
}
auto duration =
std::chrono::high_resolution_clock::now().time_since_epoch().count() -
start;
std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";
return 0;
}
To build the examples, run:
$ icpx -fsycl -o ccs ccs.cpp
$ icpx -fsycl -o ccs_matrixmul ccs_matrixmul.cpp
The number of CCSs found in ccs and the number of kernels executing in parallel in ccs_matrixmul depend on the setting of the environment variable ZEX_NUMBER_OF_CCS.
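For example, to run both binaries with all 4 CCSs exposed on each stack of a card with 2 stacks:
$ ZEX_NUMBER_OF_CCS=0:4,1:4 ./ccs
$ ZEX_NUMBER_OF_CCS=0:4,1:4 ./ccs_matrixmul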
Using Multiple CCSs in OpenMP
In OpenMP, the CCSs in each stack can be exposed as devices to offer fine-grained partitioning and control at the CCS level.
In order to expose CCSs as devices, one of the following two environment variables should be set before running the program:
$ export ONEAPI_DEVICE_SELECTOR="*:*.*.*"
or
$ export LIBOMPTARGET_DEVICES=SUBSUBDEVICE
The following OpenMP program illustrates the use of CCSs in FLAT mode.
First, the program determines the number of devices that are available on the platform by calling omp_get_num_devices(). Then the program offloads kernels to each of the devices, where each kernel initializes a different chunk of array A.
omp_get_num_devices() returns the total number of devices that are available.
The device clause on the target directive is used to specify to which device a kernel should be offloaded.
At runtime, the environment variable ONEAPI_DEVICE_SELECTOR="*:*.*.*" (or LIBOMPTARGET_DEVICES=SUBSUBDEVICE) is set, along with ZEX_NUMBER_OF_CCS, to expose CCSs as devices.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#define SIZE 320
int num_devices = omp_get_num_devices();
int chunksize = SIZE/num_devices;
int main(void)
{
int *A;
A = new int[SIZE];
printf ("num_devices = %d\n", num_devices);
for (int i = 0; i < SIZE; i++)
A[i] = -9;
#pragma omp parallel for
for (int id = 0; id < num_devices; id++) {
#pragma omp target teams distribute parallel for device(id) \
map(tofrom: A[id * chunksize : chunksize])
for (int i = id * chunksize; i < (id + 1) * chunksize; i++) {
A[i] = i;
}
}
for (int i = 0; i < SIZE; i++)
if (A[i] != i)
printf ("Error in: %d\n", A[i]);
else
printf ("%d\n", A[i]);
}
Compilation command:
$ icpx -fiopenmp -fopenmp-targets=spir64 flat_openmp_02.cpp
Run command:
$ OMP_TARGET_OFFLOAD=MANDATORY ONEAPI_DEVICE_SELECTOR="*:*.*.*" \
ZEX_NUMBER_OF_CCS="0:4,1:4 ./a.out
Notes:
The program is identical to the one in the FLAT Mode Example - OpenMP. The only difference is that additional environment variables (ONEAPI_DEVICE_SELECTOR and ZEX_NUMBER_OF_CCS) are set before running the program to expose CCSs (instead of stacks) as devices.
Setting ONEAPI_DEVICE_SELECTOR="*:*.*.*" causes CCSs to be exposed to the application as root devices. Alternatively, LIBOMPTARGET_DEVICES=SUBSUBDEVICE may be set.
ZEX_NUMBER_OF_CCS="0:4,1:4" specifies that the 4 CCSs in stack 0, as well as the 4 CCSs in stack 1, are exposed; a minimal program to verify the resulting device count is sketched after these notes.
OMP_TARGET_OFFLOAD=MANDATORY is used to make sure that the target region will run on the GPU. The program will fail if a GPU is not found.
There is no need to specify ZE_FLAT_DEVICE_HIERARCHY=FLAT with the run command, since FLAT mode is the default.
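As a quick check that CCSs are indeed exposed as OpenMP devices, the following minimal probe program (a sketch, not part of the original example) prints the device count reported by omp_get_num_devices():
#include <stdio.h>
#include <omp.h>

int main(void)
{
  // With ONEAPI_DEVICE_SELECTOR="*:*.*.*" and ZEX_NUMBER_OF_CCS set,
  // each exposed CCS is reported as a separate OpenMP device.
  printf("num_devices = %d\n", omp_get_num_devices());
  return 0;
}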
Running on a system with a single GPU card (2 stacks in total):
We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information.
$ OMP_TARGET_OFFLOAD=MANDATORY ONEAPI_DEVICE_SELECTOR="*:*.*.*" \
ZEX_NUMBER_OF_CCS="0:4,1:4 LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log
We see the following in libomptarget_debug.log, showing that 8 devices corresponding to the 8 CCSs (4 CCSs in each of the 2 stacks) have been found.
Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 8 root devices, 8 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0.0.0
Target LEVEL_ZERO RTL --> -- 0.0.1
Target LEVEL_ZERO RTL --> -- 0.0.2
Target LEVEL_ZERO RTL --> -- 0.0.3
Target LEVEL_ZERO RTL --> -- 1.0.0
Target LEVEL_ZERO RTL --> -- 1.0.1
Target LEVEL_ZERO RTL --> -- 1.0.2
Target LEVEL_ZERO RTL --> -- 1.0.3
Running on a system with 4 GPU cards (8 stacks in total):
We add LIBOMPTARGET_DEBUG=1 to the run command to get libomptarget.so debug information. ONEAPI_DEVICE_SELECTOR and ZEX_NUMBER_OF_CCS must again be set (covering all stacks) so that the CCSs are exposed.
$ OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_DEBUG=1 ./a.out >& libomptarget_debug.log
We see the following in libomptarget_debug.log, showing that 32 devices corresponding to the 32 CCSs (4 CCSs in each of the 8 stacks) have been found:
Target LEVEL_ZERO RTL --> Found a GPU device, Name = Intel(R) Data Center GPU Max 1550
Target LEVEL_ZERO RTL --> Found 32 root devices, 32 total devices.
Target LEVEL_ZERO RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL_ZERO RTL --> -- 0.0.0
Target LEVEL_ZERO RTL --> -- 0.0.1
Target LEVEL_ZERO RTL --> -- 0.0.2
Target LEVEL_ZERO RTL --> -- 0.0.3
Target LEVEL_ZERO RTL --> -- 1.0.0
Target LEVEL_ZERO RTL --> -- 1.0.1
Target LEVEL_ZERO RTL --> -- 1.0.2
Target LEVEL_ZERO RTL --> -- 1.0.3
Target LEVEL_ZERO RTL --> -- 2.0.0
Target LEVEL_ZERO RTL --> -- 2.0.1
Target LEVEL_ZERO RTL --> -- 2.0.2
Target LEVEL_ZERO RTL --> -- 2.0.3
Target LEVEL_ZERO RTL --> -- 3.0.0
Target LEVEL_ZERO RTL --> -- 3.0.1
Target LEVEL_ZERO RTL --> -- 3.0.2
Target LEVEL_ZERO RTL --> -- 3.0.3
Target LEVEL_ZERO RTL --> -- 4.0.0
Target LEVEL_ZERO RTL --> -- 4.0.1
Target LEVEL_ZERO RTL --> -- 4.0.2
Target LEVEL_ZERO RTL --> -- 4.0.3
Target LEVEL_ZERO RTL --> -- 5.0.0
Target LEVEL_ZERO RTL --> -- 5.0.1
Target LEVEL_ZERO RTL --> -- 5.0.2
Target LEVEL_ZERO RTL --> -- 5.0.3
Target LEVEL_ZERO RTL --> -- 6.0.0
Target LEVEL_ZERO RTL --> -- 6.0.1
Target LEVEL_ZERO RTL --> -- 6.0.2
Target LEVEL_ZERO RTL --> -- 6.0.3
Target LEVEL_ZERO RTL --> -- 7.0.0
Target LEVEL_ZERO RTL --> -- 7.0.1
Target LEVEL_ZERO RTL --> -- 7.0.2
Target LEVEL_ZERO RTL --> -- 7.0.3
Using Multiple CCSs in MPI
A typical use case for running more than 1 CCS per GPU stack is in MPI applications where a large portion of the application time is spent in non-offloaded code running on the CPU. Running in 4-CCS mode allows the number of MPI ranks to be four times the number of GPU stacks, so the host processes can use more CPU cores.
An example of DGEMMs executed through MPI is shown in the following source:
#include "mkl.h"
#include "mkl_omp_offload.h"
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <mpi.h>
#include <omp.h>
#define FLOAT double
#define MPI_FLOAT_T MPI_DOUBLE
#define MKL_INT_T MKL_INT
#define index(i, j, ld) (((j) * (ld)) + (i))
#define RAND() ((FLOAT)rand() / (FLOAT)RAND_MAX * 2.0 - 1.0)
#define LD_ALIGN 256
#define LD_BIAS 8
#define HPL_PTR(ptr_, al_) ((((size_t)(ptr_) + (al_)-1) / (al_)) * (al_))
static inline MKL_INT_T getld(MKL_INT_T x) {
MKL_INT_T ld;
ld = HPL_PTR(x, LD_ALIGN);
if (ld - LD_BIAS >= x)
ld -= LD_BIAS;
else
ld += LD_BIAS;
return ld;
}
int main(int argc, char **argv) {
if ((argc < 4) || (argc > 4 && argc < 8)) {
printf("Performs a DGEMM test C = alpha*A*B + beta*C\n");
printf("A matrix is MxK and B matrix is KxN\n");
printf("All matrices are stored in column-major format\n");
printf("Run as ./dgemm <M> <K> <N> [<alpha> <beta> <iterations>]\n");
printf("Required inputs are:\n");
printf(" M: number of rows of matrix A\n");
printf(" K: number of cols of matrix A\n");
printf(" N: number of cols of matrix B\n");
printf("Optional inputs are (all must be provided if providing any):\n");
printf(" alpha: scalar multiplier (default: 1.0)\n");
printf(" beta: scalar multiplier (default: 0.0)\n");
printf(" iterations: number of blocking DGEMM calls to perform "
"(default: 10)\n");
return EXIT_FAILURE;
}
MKL_INT_T HA = (MKL_INT_T)(atoi(argv[1]));
MKL_INT_T WA = (MKL_INT_T)(atoi(argv[2]));
MKL_INT_T WB = (MKL_INT_T)(atoi(argv[3]));
FLOAT alpha, beta;
int niter;
if (argc > 4) {
sscanf(argv[4], "%lf", &alpha);
sscanf(argv[5], "%lf", &beta);
niter = atoi(argv[6]);
} else {
alpha = 1.0;
beta = 0.0;
niter = 10;
}
MKL_INT_T HB = WA;
MKL_INT_T WC = WB;
MKL_INT_T HC = HA;
MKL_INT_T ldA = getld(HA);
MKL_INT_T ldB = getld(HB);
MKL_INT_T ldC = getld(HC);
double tot_t = 0.0, best_t = std::numeric_limits<double>::max();
FLOAT *A = new FLOAT[ldA * WA];
FLOAT *B, *C, *local_B, *local_C;
MPI_Init(&argc, &argv);
int mpi_rank, mpi_size;
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
if (mpi_rank == 0) {
B = new FLOAT[ldB * WB];
C = new FLOAT[ldC * WC];
srand(2864);
for (int j = 0; j < WA; j++)
for (int i = 0; i < HA; i++)
A[index(i, j, ldA)] = RAND();
for (int j = 0; j < WB; j++)
for (int i = 0; i < HB; i++)
B[index(i, j, ldB)] = RAND();
if (beta != 0.0) {
for (int j = 0; j < WC; j++)
for (int i = 0; i < HC; i++)
C[index(i, j, ldC)] = RAND();
} else {
for (int j = 0; j < WC; j++)
for (int i = 0; i < HC; i++)
C[index(i, j, ldC)] = 0.0;
}
}
size_t sizea = (size_t)ldA * WA;
size_t local_sizeb, local_sizec;
int *displacements_b = new int[mpi_size];
int *send_counts_b = new int[mpi_size];
int *displacements_c = new int[mpi_size];
int *send_counts_c = new int[mpi_size];
int local_WB = WB / mpi_size;
send_counts_b[0] = ldB * (local_WB + WB % mpi_size);
send_counts_c[0] = ldC * (local_WB + WB % mpi_size);
displacements_b[0] = 0;
displacements_c[0] = 0;
for (int i = 1; i < mpi_size; i++) {
send_counts_b[i] = ldB * local_WB;
send_counts_c[i] = ldC * local_WB;
displacements_b[i] = displacements_b[i - 1] + send_counts_b[i - 1];
displacements_c[i] = displacements_c[i - 1] + send_counts_c[i - 1];
}
if (mpi_rank == 0) {
local_WB += WB % mpi_size;
}
local_sizeb = ldB * local_WB;
local_sizec = ldC * local_WB;
local_B = new FLOAT[local_sizeb];
local_C = new FLOAT[local_sizec];
MPI_Bcast(A, sizea, MPI_FLOAT_T, 0, MPI_COMM_WORLD);
MPI_Scatterv(B, send_counts_b, displacements_b, MPI_FLOAT_T, local_B,
local_sizeb, MPI_FLOAT_T, 0, MPI_COMM_WORLD);
MPI_Scatterv(C, send_counts_c, displacements_c, MPI_FLOAT_T, local_C,
local_sizec, MPI_FLOAT_T, 0, MPI_COMM_WORLD);
#if defined(OMP_AFFINITIZATION)
#if OMP_AFFINITIZATION == 1
int ndev = omp_get_num_devices();
int dnum = mpi_rank % ndev;
omp_set_default_device(dnum);
#endif
#endif
#pragma omp target data map(to: A[0:sizea], local_B[0:local_sizeb]) \
                        map(tofrom: local_C[0:local_sizec])
{
#pragma omp dispatch
dgemm("N", "N", &HA, &local_WB, &WA, &alpha, A, &ldA, local_B, &ldB, &beta,
local_C, &ldC);
for (int i = 0; i < niter; i++) {
auto start_t = std::chrono::high_resolution_clock::now();
#pragma omp dispatch
dgemm("N", "N", &HA, &local_WB, &WA, &alpha, A, &ldA, local_B, &ldB,
&beta, local_C, &ldC);
MPI_Barrier(MPI_COMM_WORLD);
auto end_t = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end_t - start_t;
tot_t += diff.count();
best_t = std::min(best_t, diff.count());
}
}
MPI_Gatherv(local_C, local_sizec, MPI_FLOAT_T, C, send_counts_c,
displacements_c, MPI_FLOAT_T, 0, MPI_COMM_WORLD);
delete[] local_B;
delete[] local_C;
delete[] displacements_b;
delete[] displacements_c;
delete[] send_counts_b;
delete[] send_counts_c;
MPI_Allreduce(MPI_IN_PLACE, &tot_t, 1, MPI_FLOAT_T, MPI_MAX, MPI_COMM_WORLD);
MPI_Allreduce(MPI_IN_PLACE, &best_t, 1, MPI_FLOAT_T, MPI_MAX, MPI_COMM_WORLD);
if (mpi_rank == 0) {
double tflop_count = (double)2.0 * HA * WB * WA;
if (beta != 0.0)
tflop_count += (double)HA * WB;
tflop_count *= 1.E-12;
printf("Total runtime for %d iterations: %f seconds.\n", niter, tot_t);
printf("Mean TFLOP/s: %f\n", (double)niter * tflop_count / tot_t);
printf("Best TFLOP/s: %f\n", (double)tflop_count / best_t);
delete[] B;
delete[] C;
}
delete[] A;
MPI_Finalize();
return EXIT_SUCCESS;
}
In this example, the DGEMMs are MKL calls executed through OpenMP offload. The matrices are statically partitioned among the MPI ranks.
In order to build the binary, execute:
$ cd examples/MPI/02_omp_mpi_onemkl_dgemm
$ make
With Intel® MPI, each process can bind to one or more GPU stacks. If more than one process is allocated to a GPU stack, the GPU driver will enqueue kernels to one of the CCSs associated with that stack. The environment variable I_MPI_OFFLOAD_CELL_LIST can be used to specify which device stacks are used.
For example, to run the application in 4-CCS mode with all four MPI ranks allocated to the first device's first stack:
$ export ZEX_NUMBER_OF_CCS=0:4
$ export I_MPI_OFFLOAD_CELL_LIST=0,0,0,0
$ mpirun -n 4 ./dgemm 8192 8192 8192
We can also run the application with the first four MPI ranks allocated to the first device's first stack and the next four MPI ranks allocated to the first device's second stack. The expectation is that this will deliver double the FLOP/s of the previous run:
$ export ZEX_NUMBER_OF_CCS=0:4,1:4
$ export I_MPI_OFFLOAD_CELL_LIST=0,0,0,0,1,1,1,1
$ mpirun -n 8 ./dgemm 8192 8192 8192
Note that the following Intel® MPI environment variables are set by default, but it may be useful to specify or modify them in some cases:
$ export I_MPI_OFFLOAD_CELL=tile # Associate MPI ranks with GPU tiles
$ export I_MPI_OFFLOAD=1 # Enable MPI work with device pointers
$ export I_MPI_OFFLOAD_TOPOLIB=level_zero # Use Level Zero for topology detection (GPU pinning)
With MPICH, MPI ranks are associated with a GPU stack explicitly through the environment variable ZE_AFFINITY_MASK. The driver will subsequently associate each rank with a CCS on that stack.
The same example application can be built with MPICH, provided an appropriate MPICH installation is loaded. An example script that binds MPI ranks in a similar manner is shown below:
#!/bin/bash
# Default to 1 CCS per stack, 1 GPU, and 1 stack per GPU if not set in the environment
if [ -z "${NCCS}" ]; then
  NCCS=1
fi
if [ -z "${NGPUS}" ]; then
  NGPUS=1
fi
if [ -z "${NSTACKS}" ]; then
  NSTACKS=1
fi
subdevices=$((NGPUS*NSTACKS))
# Map every NCCS consecutive local ranks to the same stack (subdevice)
export ZE_AFFINITY_MASK=$(((MPI_LOCALRANKID/NCCS)%subdevices))
echo MPI_LOCALRANKID = $MPI_LOCALRANKID ZE_AFFINITY_MASK = $ZE_AFFINITY_MASK
exec "$@"
Assuming that a node has 6 GPUs, each with 2 stacks, and you want to run with 4 CCSs per stack, the script is used as follows:
$ export NGPUS=6
$ export NSTACKS=2
$ export NCCS=4
$ export ZEX_NUMBER_OF_CCS=0:${NCCS},1:${NCCS}
$ mpiexec -n 48 ./gpu_rank_bind.sh ./dgemm 8192 8192 8192
Since the MKL DGEMM is executed through OpenMP offload, we can also associate the MPI ranks explicitly with specific CCSs through OpenMP, as shown in the excerpt below. Some hybrid MPI/OpenMP offload applications use this strategy.
#if OMP_AFFINITIZATION == 1
int ndev = omp_get_num_devices();
int dnum = mpi_rank % ndev;
omp_set_default_device(dnum);
#endif
...
#pragma omp target data map(to:A[0:sizea],local_B[0:local_sizeb]) map(tofrom:local_C[0:local_sizec])
{
...
#pragma omp dispatch
dgemm("N","N",&HA,&local_WB,&WA,&alpha,A,&ldA,local_B,&ldB,&beta,local_C,&ldC);
...
}
This variant of the binary can be built with the following commands:
$ cd examples/MPI/02_omp_mpi_onemkl_dgemm
$ make OMP_AFFINITIZATION=1
We can then run the application without explicitly specifying device affinity:
$ export ZEX_NUMBER_OF_CCS=0:4,1:4
$ export LIBOMPTARGET_DEVICES=SUBSUBDEVICE
$ mpirun -n 8 ./dgemm.out 8192 8192 8192