Developer Guide and Reference

ID 767251
Date 10/31/2024
Public

OpenMP* Offloading SPMD/SIMT and SIMD Models

For Intel GPUs, OpenMP kernel generation supports two programming models: SPMD (Single Program Multiple Data) and SIMD (Single Instruction Multiple Data). The key differences are:

  • Data Parallelism: SPMD primarily exploits data parallelism, where multiple threads simultaneously operate on different data elements. SIMD, on the other hand, focuses on executing the same operation on multiple data elements.

  • Granularity: SPMD typically operates at a coarser granularity, where each thread might handle a significant portion of the overall computation. SIMD operates at a finer granularity, dealing with individual instructions and vectorized data elements.

  • Syntax: While both SPMD and SIMD can be implemented using OpenMP directives, their specific constructs (for example, the parallel for construct for SPMD or the simd construct for SIMD) reflect each model's characteristics.

The OpenMP SPMD model (also known as the SIMT (Single Instruction Multiple Threads) model) is a common GPU programming model. In the following code snippet, Loop-A and its entire body are vectorized at the kernel level. For Intel® Arc™ GPUs, which have native SIMD8 hardware support, Loop-A is vectorized with SIMD8, SIMD16, or SIMD32; that is, the compiler generates SIMD8, SIMD16, or SIMD32 kernels. For the Intel® GPU Max Series, which has native SIMD16 hardware support, Loop-A and its entire body are vectorized with SIMD16 or SIMD32; that is, the compiler generates SIMD16 or SIMD32 kernels for SIMT16 or SIMT32 thread execution.

#pragma omp target teams distribute parallel for  // Loop-A is vectorized with SIMD8, SIMD16, or SIMD32 based on the width of the GPU SIMD hardware unit.
for (int a = 0; a < M; a++) {      // Loop-A
    code 1;
    for (int b = 0; b < N; b++) {  // Loop-B
        code 2;
    }
    for (int c = 0; c < K; c++) {  // Loop-C
        code 3;
    }
    code 4;
}

The OpenMP offloading SPMD model is the default for OpenMP offloading and is enabled with the compiler options -fiopenmp -fopenmp-targets=spir64. The compiler generates SIMD8, SIMD16, or SIMD32 kernels, which provide 8-way, 16-way, or 32-way SIMT parallelism for the outer Loop-A.

The OpenMP SIMD model is a common CPU programming model. The Intel GPU has a SIMD engine in its execution unit, which allows the compiler to generate explicit SIMD code and thus supports a seamless transition from well-tuned CPU code that uses the outer-parallel-inner-simd scheme. In the following snippet, Loop-A and its entire body are not vectorized at the kernel level; that is, the compiler generates a SIMD1 kernel for thread execution.

#pragma omp target teams distribute parallel for  // SIMD1 kernel is generated for Loop-A
for (int a = 0; a < M; a++) {
    code 1;
    #pragma omp simd simdlen(32)                  // Loop-B is vectorized with SIMD32 in the kernel
    for (int b = 0; b < N; b++) {
        code 2;
    }
    #pragma omp simd simdlen(8)                   // Loop-C is vectorized with SIMD8 in the kernel
    for (int c = 0; c < K; c++) {
        code 3;
    }
    code 4;
}

The OpenMP offloading SIMD model is controlled by the compiler options -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-simd; these options direct the compiler to generate a SIMD1 kernel with explicit SIMD code inside the kernel for OpenMP SIMD loops. This model allows more flexibility in register allocation and SIMD-width control for OpenMP SIMD loops inside target regions (kernels).