Explicit Scaling on Multi-GPU, Multi-Stack and Multi-C-Slice in OpenMP
Device Discovery
Before you run an application, it is recommended to run the sycl-ls command to find out which devices are available on the platform, especially when the run is intended as a performance measurement. This ensures the run does not take a fallback path.
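For example, on a system with Intel GPUs exposed through the Level Zero backend, an invocation may look similar to the following. The listing is illustrative only; the actual platform names, device names, and driver versions differ from system to system:

$ sycl-ls
[opencl:cpu:0]      ... (CPU OpenCL device)
[level_zero:gpu:0]  ... (first Intel GPU)
[level_zero:gpu:1]  ... (second Intel GPU)

If the expected level_zero:gpu entries are missing, an offload run may be taking a fallback path to a CPU or emulation device.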
Context
The OpenMP context is implicit. The runtime creates and manages the context, but one Intel extension to query the OpenMP context (omp_target_get_context()) has been added for interop support.
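A minimal sketch of querying the context follows. The prototype shown is an assumption (an opaque void * handle keyed by the OpenMP device number); check the omp.h shipped with your compiler for the actual declaration of this Intel extension.

#include <omp.h>
#include <stdio.h>

// Assumed prototype of the Intel extension; verify against omp.h:
//   void *omp_target_get_context(int device_num);

int main(void) {
  int dev = omp_get_default_device();

  // Query the opaque context handle associated with the OpenMP device.
  void *ctx = omp_target_get_context(dev);

  // The handle can then be handed to an interoperating runtime or library
  // that expects the underlying device context.
  printf("context handle for OpenMP device %d: %p\n", dev, ctx);
  return 0;
}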
Unified Shared Memory
Three new OpenMP APIs have been added as Intel extensions for USM memory allocation. USM support has also been added to the omp_target_alloc API. Refer to the OpenMP USM Allocation API section for details.
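A minimal sketch of the shared-allocation path follows, assuming the Intel extension omp_target_alloc_shared(size, device_num) described in the OpenMP USM Allocation API section; the same pattern applies to the device and host allocation variants.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int dev = omp_get_default_device();
  const int N = 1024;

  // Intel extension: USM shared allocation, accessible from host and device.
  float *x = (float *)omp_target_alloc_shared(N * sizeof(float), dev);

  for (int i = 0; i < N; ++i)   // initialize directly on the host
    x[i] = (float)i;

  // Pass the USM pointer to the target region without mapping it.
  #pragma omp target teams distribute parallel for device(dev) is_device_ptr(x)
  for (int i = 0; i < N; ++i)
    x[i] *= 2.0f;

  printf("x[10] = %f\n", x[10]); // read back on the host, no explicit copy

  omp_target_free(x, dev);      // standard OpenMP deallocation
  return 0;
}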
Context Associated with Sub-sub-devices
In this scheme, a context is created with a single sub-sub-device in it, and the queue is attached to that sub-sub-device (compute slice). Execution and visibility are limited to that sub-sub-device only, which is expected to offer fine-grained partitioning and control at the compute-slice (C-slice) level. See a code example:
#define DEVKIND 1 // C-Slice

int root_id = omp_get_default_device();

#pragma omp parallel for
for (int id = 0; id < NUM_SUBSUBDEVICES; ++id) {
  #pragma omp target teams distribute parallel for device(root_id) \
          subdevice(DEVKIND, id) map(…)
  for (int i = lb(id); i < ub(id); i++) {
    ...;
  }
}
The sub-sub-device scheme is recommended for running multiple small kernels concurrently on a stack at the compute-slice level, when these kernels do not have enough parallelism to utilize all compute slices of a stack. Each stack of the Intel® Data Center GPU Max Series has four compute command streamers for dispatching kernels to their associated compute slices, so four smaller kernels can run concurrently per stack, or eight smaller kernels concurrently on a 2-stack system.
Context Associated with Sub-devices
In this scheme, a context is created with multiple sub-devices (multiple stacks) of the same root device, and queues are attached to the sub-devices, effectively implementing “explicit scaling.” The code runs on multiple sub-devices. See a code example below:
#define DEVKIND 0 // TILE

int root_id = omp_get_default_device();

#pragma omp parallel for
for (int id = 0; id < NUM_SUBDEVICES; ++id) {
  #pragma omp target teams distribute parallel for device(root_id) \
          subdevice(DEVKIND, id) map(…)
  for (int i = lb(id); i < ub(id); i++) {
    …
  }
}
Context Associated with Root Device
When a context is created with a single root device and a queue attached to it, the work is automatically distributed across all sub-devices/stacks via “implicit scaling” by the GPU driver. This is the simplest way to enable multi-stack utilization, but it does not offer the possibility to target specific stacks. See a code example below:
int root_id = omp_get_default_device();

#pragma omp target teams distribute parallel for device(root_id) map(…)
for (int i = 0; i < N; i++) {
  …
}
Context Associated with Multiple Root Devices
In this scheme, offload regions are launched to multiple root devices (multi-card). The least restrictive context, with queues attached to different root devices, offers the most sharing possibilities, at the cost of slow access through host memory or the need for explicit copies. See a code example:
int num_devices = omp_get_num_devices();

#pragma omp parallel for
for (int root_id = 0; root_id < num_devices; root_id++) {
  #pragma omp target teams distribute parallel for device(root_id) map(…)
  for (int i = lb(root_id); i < ub(root_id); i++) {
    …
  }
}
Depending on the explicit sub-device scheme chosen and the algorithm used, make sure to do proper memory allocation and synchronization. The following program is a full example using explicit sub-devices.
#include <assert.h>
#include <iostream>
#include <omp.h>
#include <stdint.h>

#ifndef NUM_SUBDEVICES
#define NUM_SUBDEVICES 1
#endif

#ifndef DEVKIND
#define DEVKIND 0
#endif

template <int num_subdevices> struct mptr {
  float *p[num_subdevices];
};

int main(int argc, char **argv) {
  constexpr int SIZE = 8e6;
  constexpr int SIMD_SIZE = 32;
  constexpr std::size_t TOTAL_SIZE = SIZE * SIMD_SIZE;
  constexpr int num_subdevices = NUM_SUBDEVICES;

  mptr<num_subdevices> device_ptr_a;
  mptr<num_subdevices> device_ptr_b;
  mptr<num_subdevices> device_ptr_c;

  const int device_id = omp_get_default_device();
  std::cout << "device_id = " << device_id << std::endl;

  // Allocate host buffers and create a device allocation per sub-device.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    device_ptr_a.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
    device_ptr_b.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
    device_ptr_c.p[sdev] = static_cast<float *>(malloc(TOTAL_SIZE * sizeof(float)));
    #pragma omp target enter data map(alloc : device_ptr_a.p[sdev][0:TOTAL_SIZE]) \
            device(device_id) subdevice(DEVKIND, sdev)
    #pragma omp target enter data map(alloc : device_ptr_b.p[sdev][0:TOTAL_SIZE]) \
            device(device_id) subdevice(DEVKIND, sdev)
    #pragma omp target enter data map(alloc : device_ptr_c.p[sdev][0:TOTAL_SIZE]) \
            device(device_id) subdevice(DEVKIND, sdev)
  }
  std::cout << "memory footprint per GPU = "
            << 3 * (std::size_t)(TOTAL_SIZE) * sizeof(float) * 1E-9 << " GB"
            << std::endl;

  // Initialize the input arrays on each sub-device.
  #pragma omp parallel for
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    float *a = device_ptr_a.p[sdev];
    float *b = device_ptr_b.p[sdev];
    #pragma omp target teams distribute parallel for device(device_id) \
            subdevice(DEVKIND, sdev)
    for (int i = 0; i < TOTAL_SIZE; ++i) {
      a[i] = i + 0.5;
      b[i] = i - 0.5;
    }
  }

  // Timed STREAM-like triad: one host thread drives each sub-device.
  const int no_max_rep = 200;
  double time = 0.0;
  for (int irep = 0; irep < no_max_rep + 1; ++irep) {
    if (irep == 1)
      time = omp_get_wtime();
    #pragma omp parallel for num_threads(num_subdevices)
    for (int sdev = 0; sdev < num_subdevices; ++sdev) {
      float *a = device_ptr_a.p[sdev];
      float *b = device_ptr_b.p[sdev];
      float *c = device_ptr_c.p[sdev];
      #pragma omp target teams distribute parallel for device(device_id) \
              subdevice(DEVKIND, sdev)
      for (int i = 0; i < TOTAL_SIZE; ++i) {
        c[i] = a[i] + b[i];
      }
    }
  }
  time = omp_get_wtime() - time;
  time = time / no_max_rep;

  const std::size_t streamed_bytes =
      3 * (std::size_t)(TOTAL_SIZE) * num_subdevices * sizeof(float);

  std::cout << "bandwidth = " << (streamed_bytes / time) * 1E-9 << " GB/s"
            << std::endl;
  std::cout << "time = " << time << " s" << std::endl;
  std::cout.precision(10);

  // Copy results back from each sub-device and print a few samples.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    #pragma omp target update from(device_ptr_c.p[sdev][:TOTAL_SIZE]) \
            device(device_id) subdevice(DEVKIND, sdev)
    std::cout << "-GPU: device id = " << sdev << std::endl;
    std::cout << "target result:" << std::endl;
    std::cout << "c[" << 0 << "] = " << device_ptr_c.p[sdev][0] << std::endl;
    std::cout << "c[" << SIMD_SIZE - 1 << "] = "
              << device_ptr_c.p[sdev][SIMD_SIZE - 1] << std::endl;
    std::cout << "c[" << TOTAL_SIZE / 2 << "] = "
              << device_ptr_c.p[sdev][TOTAL_SIZE / 2] << std::endl;
    std::cout << "c[" << TOTAL_SIZE - 1 << "] = "
              << device_ptr_c.p[sdev][TOTAL_SIZE - 1] << std::endl;
  }

  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    for (int i = 0; i < TOTAL_SIZE; ++i) {
      assert((int)(device_ptr_c.p[sdev][i]) ==
             (int)(device_ptr_c.p[sdev][i] +
                   device_ptr_a.p[sdev][i] * device_ptr_b.p[sdev][i]));
    }
  }

  // Release the device allocations.
  for (int sdev = 0; sdev < num_subdevices; ++sdev) {
    #pragma omp target exit data map(release : device_ptr_a.p[sdev][:TOTAL_SIZE])
    #pragma omp target exit data map(release : device_ptr_b.p[sdev][:TOTAL_SIZE])
    #pragma omp target exit data map(release : device_ptr_c.p[sdev][:TOTAL_SIZE])
  }
}
With explicit scaling, this OpenMP workload achieves linear scaling of ~2x on an Intel® Data Center GPU Max system using Ahead-Of-Time (AOT) compilation. The build command is:
icpx -fp-contract=fast -O2 -ffast-math -DNUM_SUBDEVICES=2 -fopenmp \
     -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device xehp" \
     openmp_explicit_subdevice.cpp -o run.x
Besides the “subdevice” clause language extension, an environment variable is provided by the OpenMP offloading compiler and runtime to map an OpenMP “device” to a GPU card (device), a stack (sub-device), or a compute slice (sub-sub-device). The environment variable is LIBOMPTARGET_DEVICES=[device | subdevice | subsubdevice].
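For example, to expose each stack (sub-device) as a separate OpenMP device before launching the binary built above (run.x is the name used in the build command; the value must be one of the three listed above):

export LIBOMPTARGET_DEVICES=subdevice
./run.x

With this setting, omp_get_num_devices() reports one OpenMP device per stack, so the multi-root-device loop shown earlier can be reused to distribute work across stacks without the subdevice clause.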