Developer Reference for Intel® oneAPI Math Kernel Library for C

ID 766684
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use Intel® oneAPI Math Kernel Library (oneMKL) and OpenMP* offload to run standard oneMKL computations on Intel GPUs. You can find the list of oneMKL features that support OpenMP offload in the mkl_omp_offload.h header file, which includes:

  • All Level 1, 2, and 3 BLAS functions through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
  • BLAS-like extensions: cblas_?axpby, cblas_?axpy_batch{_strided}, cblas_?copy_batch{_strided}, cblas_?gemv_batch{_strided}, cblas_?dgmm_batch{_strided}, cblas_hgemm, cblas_gemm_bf16bf16f32, cblas_gemm_s8u8s32, cblas_?gemm_batch{_strided}, cblas_?trsm_batch{_strided}, and cblas_?gemmt functionality through the CBLAS and BLAS interfaces as well as mkl_?omatcopy_batch_strided, mkl_?imatcopy_batch_strided, and mkl_?omatadd_batch_strided, supporting both synchronous and asynchronous execution
  • LAPACK, supporting both synchronous and asynchronous execution
    • {s,d,c,z}getrf
    • {s,d,c,z}getri
    • {s,d,c,z}getrs
    • {s,d,c,z}potrf
    • {s,d,c,z}potri
    • {s,d,c,z}potrs
    • {s,d}orgqr, {c,z}ungqr
    • {s,d}ormqr, {c,z}unmqr
    • {s,d,c,z}trtrs
    • {s,d,c,z}steqr
    • {s,d,c,z}geqrf
    • {s,d,c,z}gesvd
    • {s,d,c,z}gebrd
    • {s,d}syev, {c,z}heev
    • {s,d}syevd, {c,z}heevd
    • {s,d}syevx, {c,z}heevx
    • {s,d}sygvd, {c,z}hegvd
    • {s,d}sygvx, {c,z}hegvx
    • {s,d}sytrd, {c,z}hetrd
  • LAPACK-like extension: {s,d,c,z}getrf_batch_strided, {s,d,c,z}getrs_batch_strided, {s,d,c,z}getrfnp_batch_strided, {s,d,c,z}getrsnp_batch_strided
  • Vector Statistics
    • Distributions: v?RngUniform, v?RngGaussian, v?RngExponential, v?RngLognormal, v?RngCauchy, v?RngGumbel, v?RngLaplace, v?RngRayleigh, v?RngWeibull, viRngBernoulli, viRngGeometric, viRngPoisson, viRngUniformBits, viRngUniformBits32/64
    • Basic Random Number Generators: VSL_BRNG_MCG31, VSL_BRNG_MCG59, VSL_BRNG_PHILOX4X32X10, VSL_BRNG_MRG32K3A
  • FFTs through both DFTI and FFTW3 interfaces in one, two, and three dimensions.
    • For COMPLEX_STORAGE, only the DFTI_COMPLEX_COMPLEX format is currently supported on CPU and GPU devices.
    • Out-of-place R2C and C2C transforms offloaded to the GPU currently require INPUT_DISTANCE and OUTPUT_DISTANCE to be equal. The use of different INPUT_DISTANCE and OUTPUT_DISTANCE values is not currently supported.
    • Both synchronous and asynchronous computations are supported.
    • For R2C/C2R transforms on the GPU, only DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX is supported (implying DFTI_PACKED_FORMAT=DFTI_CCE_FORMAT).
    • NOTE:
      INCONSISTENT_CONFIGURATION errors at compute time indicate an invalid descriptor or invalid data pointer. Double check your data mapping if you encounter such errors.
    • Arbitrary strides and batch distances are not supported for multi-dimensional R2C transforms offloaded to the GPU. Considering the last dimension of the data, every element must be separated from its two nearest peers (along another dimension and/or in another batch) by a constant distance. For example, to compute a batched, two-dimensional R2C FFT of size [N2, N1] with input strides [0, S2, 1] (row-major layout with unit elementary stride and no offset), INPUT_DISTANCE must be equal to N2*S2 so that every element is separated from its nearest last-dimension counterpart(s) by a distance S2 (in this example), even across batches.
    • For batched 3D C2C problems, only INPUT_DISTANCE and OUTPUT_DISTANCE equal to the problem size are currently supported.
    • Due to the variadic implementation of DftiComputeForward and DftiComputeBackward, out-of-place compute calls using the DFTI API with the OpenMP 5.1 dispatch construct differ from common dispatch construct usage by requiring a "need_device_ptr" clause. The oneMKL examples provided on installation demonstrate this usage.
    • Transforms on GPU devices may overwrite FFT-irrelevant, padding entries in the output data.
  • Sparse BLAS
    • mkl_sparse_{s, d}_create_csr
    • mkl_sparse_{s, d}_export_csr
    • mkl_sparse_destroy
    • mkl_sparse_order
      • Currently supports only CSR matrix format.
    • mkl_sparse_set_mv_hint
      • Currently supports only SPARSE_OPERATION_NON_TRANSPOSE with CSR matrix format for general MV (SPARSE_MATRIX_TYPE_GENERAL) and triangular MV (SPARSE_MATRIX_TYPE_TRIANGULAR with fill modes SPARSE_FILL_MODE_LOWER/SPARSE_FILL_MODE_UPPER).
    • mkl_sparse_set_sv_hint
    • mkl_sparse_optimize
      • Supports optimization for mkl_sparse_{s, d}_mv functionality based on supported hints added through mkl_sparse_set_mv_hint offload.
      • Supports optimization for mkl_sparse_{s, d}_trsv functionality based on supported hints added through mkl_sparse_set_sv_hint offload.
      • Both synchronous and asynchronous executions are supported.
        NOTE:
        Note that although you can run the mkl_sparse_optimize offload function asynchronously, you are responsible for the data dependency between the optimization routine and the execution routines.
    • mkl_sparse_{s, d}_mv:
      • Currently supports only SPARSE_OPERATION_NON_TRANSPOSE with the following combinations of matrix types:
        • SPARSE_MATRIX_TYPE_GENERAL
        • SPARSE_MATRIX_TYPE_TRIANGULAR with fill modes SPARSE_FILL_MODE_LOWER/SPARSE_FILL_MODE_UPPER and diagonal types SPARSE_DIAG_UNIT/SPARSE_DIAG_NON_UNIT
        • SPARSE_MATRIX_TYPE_SYMMETRIC fill modes SPARSE_FILL_MODE_LOWER/SPARSE_FILL_MODE_UPPER and diagonal type SPARSE_DIAG_NON_UNIT (currently, SPARSE_DIAG_UNIT is not supported)
      • Both synchronous and asynchronous computations are supported.
    • mkl_sparse_{s, d}_mm:
      • Currently supported only with SPARSE_MATRIX_TYPE_GENERAL and SPARSE_OPERATION_NON_TRANSPOSE.
      • Both SPARSE_LAYOUT_ROW_MAJOR and SPARSE_LAYOUT_COLUMN_MAJOR are supported.
      • Both synchronous and asynchronous computations are supported.
    • mkl_sparse_{s, d}_trsv
      • Currently only supported with SPARSE_MATRIX_TYPE_TRIANGULAR, SPARSE_OPERATION_NON_TRANSPOSE, and alpha == 1.0
      • Both synchronous and asynchronous computations are supported
    • mkl_sparse_sp2m
      • Currently supported only with SPARSE_MATRIX_TYPE_GENERAL.
      • Both synchronous and asynchronous computations are supported with Level Zero backend, and currently only synchronous computations are supported with OpenCL backend.
      • Note that you can run the mkl_sparse_sp2m offload function asynchronously, but you are responsible for the data dependency between the first stage and the second stage of mkl_sparse_sp2m.
      • mkl_sparse_sp2m internally creates arrays for the sparse C matrix output. As they may be expected to be used subsequently on both host and device, they are created internally using USM shared memory. The arrays are managed by the library and will be cleaned up when the corresponding C matrix handle is destroyed; however, direct access to the arrays is provided by the mkl_sparse_{s,d}_export_csr() OpenMP offload function. Users are recommended to make a copy to their own arrays if they want to have such data beyond the scope of the C matrix handle. The choice of USM shared memory for C arrays is made for functional support of the OpenMP Offload paradigm and has a performance impact over choosing USM device memory, which would be more performant but not functional in all subsequent use cases.
      • The created C matrix in the provided handle is not guaranteed to be sorted, so the mkl_sparse_order() OpenMP offload API is provided for user convenience if that property is needed.
      • The input matrix handle A is not required to be sorted on input, but the input matrix handle B is required to be sorted on input.
    • In Sparse BLAS, the usage model consists of the creation stage, the inspection stage, the execution stage, and the destruction stage. For Sparse BLAS with C OpenMP Offload, all stages can be asynchronously executed, provided any data dependencies are already respected.

The OpenMP offload feature from Intel® oneAPI Math Kernel Library enables you to run oneMKL computations on Intel GPUs through the standard oneMKL APIs within an omp target variant dispatch section. For example, the standard CBLAS API for single precision real data type matrix multiply is:

void cblas_sgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const MKL_INT M, const MKL_INT N,
    const MKL_INT K, const float alpha, const float *A, const MKL_INT lda,
    const float *B, const MKL_INT ldb, const float beta, float *C,
    const MKL_INT ldc);

If the oneMKL function (for example, cblas_sgemm) is called outside of an omp target variant dispatch section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the oneMKL function within a dispatch variant construct is synchronous. OpenMP offload computations may be done asynchronously by adding the nowait clause to the target variant dispatch construct. This ensures that the host thread encountering the task region generated by this target construct will not be blocked by the oneMKL call. Rather, the host thread is returned to the caller for further use. To finish the asynchronous (nowait) computations and ensure memory and execution model consistency (for example, that the results of a computation will be ready in memory to map), the last such nowait computation is followed by the stand-alone construct #pragma omp taskwait.

From the OpenMP Application Programming Interface version 5.0 specification: "The taskwait region binds to the current task region [i.e., in this case, the last nowait computation]. The current task region is suspended at an implicit task scheduling point associated with the construct. The current task region remains suspended until all child tasks that it generated before the taskwait region complete execution [currently, depend clause is not supported]."

Example

Examples for using the OpenMP offload for oneMKL are located in the Intel® oneAPI Math Kernel Library installation directory, under:

examples/c_offload

The following code snippet shows how to use OpenMP offload for single-call oneMKL features such as most dense linear algebra functionality.

#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h" // MKL header file for OpenMP offload
int dnum = 0; 
int main() {
    float *a, *b, *c, alpha = 1.0, beta = 1.0;
    MKL_INT m = 150, n = 200, k = 128, lda = m, ldb = k, ldc = m;
    MKL_INT sizea = lda * k, sizeb = ldb * n, sizec = ldc * n;
    // allocate matrices and check pointers
    a = (float *)mkl_malloc(sizea * sizeof(float), 64);
    ...
    // initialize matrices
#pragma omp target map(c[0:sizec])
    {
        for (i = 0; i < sizec; i++) {
            c[i] = 42;
        }
        ...
    }
    // run gemm on host, use standard MKL interface
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, 
lda, b, ldb, beta, c, ldc);
    // map the a, b and c matrices on the device memory
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) 
device(dnum)
    {
        // run gemm on gpu, use standard MKL interface within a variant dispatch 
construct
        // if offload is not possible, default to cpu 
        // use the use_device_ptr clause to specify that a, b and c are device 
memory
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
        {
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, 
alpha, a, lda, b, ldb, beta, c, ldc);
        }
    }
    // Free matrices
    mkl_free(a);
        …
}

Some of the oneMKL functionality requires to call a set of functions to perform the corresponding computation. This is the case, for example, for the Discrete Fourier Transform which for a typical computation involves calling the functions.

DFTI_EXTERN MKL_LONG DftiCreateDescriptor(DFTI_DESCRIPTOR_HANDLE*,
                              enum DFTI_CONFIG_VALUE,
                              enum DFTI_CONFIG_VALUE,
                              MKL_LONG, ...);
DFTI_EXTERN MKL_LONG DftiCommitDescriptor(DFTI_DESCRIPTOR_HANDLE);
DFTI_EXTERN MKL_LONG DftiComputeForward(DFTI_DESCRIPTOR_HANDLE, void*, ...);
DFTI_EXTERN MKL_LONG DftiComputeBackward(DFTI_DESCRIPTOR_HANDLE, void*, ...);
DFTI_EXTERN MKL_LONG DftiFreeDescriptor(DFTI_DESCRIPTOR_HANDLE*);

In that case, only a subset of the calls must be wrapped in an omp target variant dispatch construct as shown in the following code snippet for DFTI.

#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"
int main(void)
{
    const int devNum = 0;
    const MKL_LONG N = 64;      // Size of 1D transform
    MKL_LONG status    = 0;
    MKL_LONG statusGPU = 0;
    DFTI_DESCRIPTOR_HANDLE descHandle    = NULL;
    DFTI_DESCRIPTOR_HANDLE descHandleGPU = NULL;
    MKL_Complex8 *x    = NULL;
    MKL_Complex8 *xGPU = NULL;
    printf("Create DFTI descriptor\n");
    status = DftiCreateDescriptor(&descHandle, DFTI_SINGLE, DFTI_COMPLEX, 1, N);
    printf("Create GPU DFTI descriptor\n");
    statusGPU = DftiCreateDescriptor(&descHandleGPU, DFTI_SINGLE, DFTI_COMPLEX, 
1, N);
    printf("Commit DFTI descriptor\n");
    status = DftiCommitDescriptor(descHandle);
    printf("Commit GPU DFTI descriptor\n");
#pragma omp target variant dispatch device(devNum)
    {
        statusGPU = DftiCommitDescriptor(descHandleGPU);
    }
    printf("Allocate memory for input array\n");
    x = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64);
    printf("Allocate memory for GPU input array\n");
    xGPU = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64);
    printf("Initialize input for forward FFT\n");
    // init x and xGPU ...
    printf("Compute forward FFT in-place\n");
    status = DftiComputeForward(descHandle, x);
    printf("Compute GPU forward FFT in-place\n");
#pragma omp target data map(tofrom:xGPU[0:N]) device(devNum)
    {
#pragma omp target variant dispatch use_device_ptr(xGPU) device(devNum)
        {
            statusGPU = DftiComputeForward(descHandleGPU, xGPU);
        }
    }
    // use results now in x and xGPU ...
 cleanup:
    DftiFreeDescriptor(&descHandle);
    DftiFreeDescriptor(&descHandleGPU);
    mkl_free(x);
    mkl_free(xGPU);
}

For asynchronous execution of multi-call oneMKL computation, the nowait clause needs to be used only on the call to the function performing the actual computation (for example, DftiCompute{Forward,Backward}). For instance, the following snippet shows how the DFTI example above could be changed to have two, back-to-back, asynchronous (nowait) computations dispatched, with a taskwait at the end of the second to ensure the completion of both computations before their results are accessed:

printf("Compute Intel GPU forward FFT 1 in-place\n");
#pragma omp target data map(tofrom:x1GPU[0:N1], x2GPU[0:N2]) device(devNum)
    {
#pragma omp target variant dispatch use_device_ptr(x1GPU) device(devNum) nowait
        {
            status1GPU = DftiComputeForward(descHandle1GPU, x1GPU);
        }
    printf("Compute Intel GPU forward FFT 2 in-place\n");
#pragma omp target variant dispatch use_device_ptr(x2GPU) device(devNum) nowait
        {
            status2GPU = DftiComputeForward(descHandle2GPU, x2GPU);
        }
#pragma omp taskwait
    }
    if (status1GPU != DFTI_NO_ERROR) goto failed;
    if (status2GPU != DFTI_NO_ERROR) goto failed;

For sparse BLAS computations, the workflow ‘create a CSR matrix handle’ → ‘compute’ → ‘destroy the CSR matrix handle’ must be done so that the offloaded data arrays are alive through the full workflow. For instance, if you are using a target data map, then the workflow must be contained in a single target data region. On the other hand, if the arrays were allocated directly using omp_target_alloc() or the Intel Extensions omp_target_alloc_host/omp_target_alloc_device/omp_target_alloc_shared, then the workflow must be contained at least in a subset of the scope where those arrays are usable; that is, before the corresponding calls to omp_target_free. The following snippet shows how the Sparse BLAS OpenMP Offload example for mkl_sparse_s_mv() could be run using a target data map region, where N is the number of rows, M is the number of columns, and NNZ is the number of non-zero entries of the sparse matrix csrA_gpu, x is the input vector, and the output is stored in the z array:

#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M]) map(tofrom:z[0:N]) device(devNum)
{                                                                            
#pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a)
        {                                                                   
            status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a);                            
        }                                                                            
#pragma omp target variant dispatch device(devNum) use_device_ptr(x, z)
        {                                                              
            status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z);                                            
        }                                                                            
#pragma omp target variant dispatch device(devNum)
        {                                         
            status_gpu3 = mkl_sparse_destroy(csrA_gpu);
        }                                              
    }

For asynchronous execution of multi-call oneMKL Sparse BLAS computation, the nowait clause can be added to the call of the function performing the actual computation (for example, calls to the mkl_sparse_{s,d}_mv() function).

As an example, the following snippet shows how the Sparse BLAS example above could be changed to have two asynchronous (nowait) computations using the same matrix handle, csrA_gpu, but unrelated vector data so there is no read/write dependency between them. Add a taskwait at the end of the second execution to ensure the completion of both computations before the mkl_sparse_destroy() function is called:

#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M],w[0:M]) map(tofrom:y[0:N],z[0:N]) device(devNum)
    {

#pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a)
        {
            status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a);
        }

#pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) nowait
        {
            status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z);
        }

#pragma omp target variant dispatch device(devNum) use_device_ptr(w, y) nowait
        {
            status_gpu3 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, w, beta, y);
        }
#pragma omp taskwait
       
#pragma omp target variant dispatch device(devNum)
        {
            status_gpu4 = mkl_sparse_destroy(csrA_gpu);
        }
    }