Developer Reference for Intel® oneAPI Math Kernel Library for Fortran

ID 766686
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use Intel® oneAPI Math Kernel Library (oneMKL) and OpenMP* offload to run standard oneMKL computations on Intel GPUs. You can find the list of oneMKL features that support OpenMP offload in the mkl_omp_offload.f90 interface module file which includes:

  • All Level 1, 2, and 3 BLAS functions, supporting both synchronous and asynchronous execution
  • BLAS-like extensions: ?axpby, ?axpy_batch_strided, ?gemv_batch_strided, ?dgmm_batch_strided, gemm_s8u8s32, ?gemm_batch_strided, ?trsm_batch_strided, ?gemmt, mkl_?omatcopy_batch_strided, mkl_?imatcopy_batch_strided, and mkl_?omatadd_batch_strided, supporting synchronous and asynchronous execution
  • LAPACK, supporting both synchronous and asynchronous execution
    • {s,d,c,z}getrf
    • {s,d,c,z}getri
    • {s,d,c,z}getrs
    • {s,d,c,z}potrf
    • {s,d,c,z}potri
    • {s,d,c,z}potrs
    • {s,d}orgqr, {c,z}ungqr
    • {s,d}ormqr, {c,z}unmqr
    • {s,d,c,z}trtrs
    • {s,d,c,z}steqr
    • {s,d,c,z}geqrf
    • {s,d,c,z}gesvd
    • {s,d,c,z}gebrd
    • {s,d}syev, {c,z}heev
    • {s,d}syevd, {c,z}heevd
    • {s,d}syevx, {c,z}heevx
    • {s,d}sygvd, {c,z}hegvd
    • {s,d}sygvx, {c,z}hegvx
    • {s,d}sytrd, {c,z}hetrd
  • LAPACK-like extension:{s,d,c,z}getrf_batch, {s,d,c,z}getrf_batch_strided, {s,d,c,z}getri_oop_batch, {s,d,c,z}getri_oop_batch_strided, {s,d,c,z}getrs_batch_strided, {s,d,c,z}getrfnp_batch_strided, {s,d,c,z}getrsnp_batch_strided
  • FFTs through both DFTI and FFTW3 interfaces in one, two, and three dimensions.
    • For COMPLEX_STORAGE, only the DFTI_COMPLEX_COMPLEX format is currently supported on CPU and GPU devices.
    • Out-of-place R2C and C2C transforms offloaded to the GPU currently require INPUT_DISTANCE and OUTPUT_DISTANCE to be equal. The use of different INPUT_DISTANCE and OUTPUT_DISTANCE values is not currently supported.
    • Both synchronous and asynchronous computations are supported.
    • For R2C/C2R transforms on the GPU, only DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX is supported (implying DFTI_PACKED_FORMAT=DFTI_CCE_FORMAT).
    • 2D and 3D FFTs are supported using rank-2/rank-3 Fortran arrays for input and output only through Fortran OpenMP offload interface.
    • NOTE:
      INCONSISTENT_CONFIGURATION errors at compute time indicate an invalid descriptor or invalid data pointer. Double check your data mapping if you encounter such errors.
    • Arbitrary strides and batch distances are not supported for multi-dimensional R2C transforms offloaded to the GPU. Considering the first dimension of the data, every element must be separated from its two nearest peers (along another dimension and/or in another batch) by a constant distance. For example, to compute a batched, two-dimensional R2C FFT of size [N1, N2] with input strides [0, 1, S2] (column-major layout with unit elementary stride and no offset), INPUT_DISTANCE must be equal to N2*S2 so that every element is separated from its nearest first-dimension counterpart(s) by a distance S2 (in this example), even across batches.
    • For batched 3D C2C problems, only INPUT_DISTANCE and OUTPUT_DISTANCE equal to the problem size are currently supported.
    • Transforms on GPU devices may overwrite FFT-irrelevant, padding entries in the output data.
  • Vector Statistics
    • Distributions: v?rnguniform, v?rnggaussian, v?rngexponential, v?rnglognormal, v?rngcauchy, v?rnggumbel, v?rnglaplace, v?rngrayleigh, v?rngweibull, virngbernoulli, virnggeometric, virngpoisson, virnguniformbits, virnguniformbits32/64, v?rngbeta, v?rngchisquare, v?rnggamma, v?rnggaussianmv, v?rnggumbel, virngpoissonv, virngnegbinomial, virngbinomial, virngmultinomial (for (kntrial * 16 && ntrial ≤ 16), where k is the probability vector length
    • Basic Random Number Generators: VSL_BRNG_MCG31, VSL_BRNG_MCG59, VSL_BRNG_PHILOX4X32X10, VSL_BRNG_MRG32K3A, VSL_BRNG_SOBOL, VSL_BRNG_MT19937, VSL_BRNG_MT2203
    • Summary Statistics: vslsSSCompute, vsldSSCompute with the following estimates:
      • VSL_SS_MIN, VSL_SS_MAX, VSL_SS_SUM, VSL_SS_(2, 3, 4)R_SUM, VSL_SS_MEAN, VSL_SS_(2, 3, 4)R_MOM, VSL_SS_(2, 3, 4)C_SUM, VSL_SS(2, 3, 4)C_MOM, VSL_SS_VARIATION, VSL_SS_SKEWNESS, VSL_SS_KURTOSIS
      • VSL_SS_METHOD_FAST and VSL_SS_METHOD_FAST_USER_MEAN are supported

The OpenMP offload feature from Intel® oneAPI Math Kernel Library allows you to run oneMKL computations on Intel GPUs through the standard oneMKL APIs within an omp target variant dispatch section. For example, the standard CBLAS API for single precision real data type matrix multiply is:

subroutine sgemm ( transa, transb, m, n, k, alpha, a, lda,        &
          &b, ldb, beta, c, ldc ) BIND(C)
       character*1,intent(in)              :: transa, transb
       integer,intent(in)                  :: m, n, k, lda, ldb, ldc
       real,intent(in)                     :: alpha, beta
       real,intent(in)                     :: a( lda, * ), b( ldb, * )
       real,intent(in)                     :: c( ldc, * )
     end subroutine sgemm

If sgemm is called outside of an omp target variant dispatch section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the oneMKL function within a dispatch variant construct is synchronous, the nowait clause can be used on the dispatch variant construct to specify that asynchronous execution is desired. In that case, synchronization needs to be handled by the application using standard OpenMP synchronization functionality, for example the omp taskwait construct.

In order to offload to a device, arguments to the oneMKL function must be mapped to the device memory if they represent a return value (marked with intent(out) or intent(inout) in the subroutine declaration) or if they point to an array of data (such as a matrix or vector, even if it is an input array). Users must map these arguments to the device using the omp target data construct before calling the oneMKL routine, and in addition must use the use_device_ptr clause in the omp target variant dispatch construct to specify variables that have been mapped to device memory.

Example

Examples for using the OpenMP offload for oneMKL are located in the Intel® oneAPI Math Kernel Library installation directory, under:

examples/f_offload
include "mkl_omp_offload.f90"

program sgemm_example
use onemkl_blas_omp_offload_lp64
use common_blas  

character*1 :: transa = 'N', transb = 'N'
integer :: i, j, m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
real :: alpha = 1.0, beta = 1.0
real,allocatable :: a(:,:), b(:,:), c(:,:)

! initialize leading dimension and allocate and initialize arrays
lda = m
…
allocate(a(lda,k))
…
 
! initialize matrices
call sinit_matrix(transa, m, k, lda, a)
…

! Calling sgemm on the CPU using standard oneMKL Fortran interface
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

! map the a, b and c matrices on the device memory
!$omp target data map(a,b,c)

! Calling sgemm on the GPU using standard oneMKL Fortran interface within a variant dispatch construct
! Use the use_device_ptr clause to specify that a, b and c are device memory
!$omp target variant dispatch use_device_ptr(a,b,c)
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch

!$omp end target data

! Free memory
deallocate(a)
…
stop
end program