Visible to Intel only — GUID: GUID-ED82C368-19BC-408F-A21B-56944ABF1C95
Introduction to the Intel® oneAPI Math Kernel Library (oneMKL) BLAS and LAPACK with DPC++
This guide provides an overview of the Intel® oneAPI Math Kernel Library (oneMKL) BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) application programming interfaces for the Data Parallel C++ (DPC++) implementation of SYCL. It is aimed at users who have had some prior experience with the standard BLAS and LAPACK APIs.
In general, the DPC++ APIs for BLAS and LAPACK are similar to the standard BLAS and LAPACK APIs, sharing the same routine names and argument orders. Unlike standard routines, however, DPC++ routines are designed to run asynchronously on a compute device (CPU or GPU) and typically use device memory for inputs and outputs. To support this functionality, the data types of many arguments have changed and each routine takes an additional argument (a DPC++ queue), which specifies where the routine should be executed. There are several smaller API changes that are described below.
In oneMKL, all DPC++ routines and associated data types belong to the oneapi::mkl namespace. CPU-based oneMKL routines are still available via the C interface (which uses the global namespace). Additionally, each BLAS-specific routine is in the oneapi::mkl::blas, oneapi::mkl::blas::column_major, and oneapi::mkl::blas::row_major namespaces. Each LAPACK-specific routine is in the oneapi::mkl::lapack namespace. Currently, LAPACK DPC++ APIs do not support matrices stored using row major layout.
By default, column major layout is assumed for all BLAS functions in the oneapi::mkl::blas namespace. BLAS functions in the oneapi::mkl::blas::column_major namespace can also be used when matrices are stored using column major layout. To use row major layout, BLAS functions in the oneapi::mkl::blas::row_major namespace must be used. For example, oneapi::mkl::blas::gemm is the DPC++ routine for matrix multiplication using column major layout for storing matrices, while ::{cblas_}{s, d, c, z}gemm is the traditional CPU-based version.
Differences between Standard BLAS/LAPACK and DPC++ oneMKL APIs
Naming
DPC++ BLAS and LAPACK APIs are overloaded on precision. For example, the standard BLAS API has four different routines for GEMM computation, named according to precision (sgemm, dgemm, cgemm, and zgemm), whereas DPC++ BLAS has a single entry point named gemm that accepts the float, double, half, bfloat16, std::complex<float>, and std::complex<double> data types.
References
All DPC++ objects (buffers and queues) are passed by reference, rather than by pointer. Other parameters are typically passed by value.
Queues
Every DPC++ BLAS and LAPACK routine has an extra parameter at the beginning: a DPC++ queue (type queue&), to which computational tasks are submitted. A queue can be associated with a CPU device or a GPU device. Both CPU and GPU devices are supported for all BLAS functions. Refer to the documentation for individual LAPACK functions to see which devices are supported.
Vector and Matrix Types
DPC++ has two APIs for storing data on a device and sharing data between devices and the host: the buffer API and the unified shared memory (USM) API. DPC++ BLAS and LAPACK routines support both APIs.
With the buffer API, vector and matrix inputs to DPC++ BLAS and LAPACK routines are DPC++ buffer types. Currently, all buffers must be one-dimensional, but you can use DPC++’s buffer::reinterpret() member function to convert a higher-dimensional buffer to a one-dimensional one.
For the USM API, vector and matrix inputs to DPC++ BLAS and LAPACK routines are pointers of the appropriate type, but the pointers must point to memory allocated by one of the DPC++ USM allocation routines (e.g., malloc_host, malloc_shared, or malloc_device). Memory that is allocated with the usual malloc or new routines cannot be used in the oneMKL DPC++ interfaces.
For example, the gemv routine takes a matrix A and vectors x and y. For the real double precision case, each of these parameters has type:
double* in standard BLAS;
buffer<double,1>& in DPC++ BLAS with the buffer API;
double* in DPC++ BLAS with the USM API, with the restriction that the memory the pointer refers to must be allocated in a device-accessible way using a DPC++ USM allocation routine.
Scalars
Scalar inputs are passed by value for all BLAS functions.
Complex Numbers
In DPC++, complex numbers are represented with C++ std::complex types. For instance, MKL_Complex8 can be replaced by std::complex<float>.
This is true for scalar, vector, and matrix arguments. For instance, a double-precision complex vector would have type buffer<std::complex<double>,1>.
Return Values
Some BLAS and LAPACK routines (dot, nrm2, asum, iamax) return a scalar result as their return value. In DPC++, to support asynchronous computation, these routines take an additional argument that occurs at the end of the argument list. The result value is stored in this buffer (for buffer API) or pointer (for USM API) when the computation completes. These routines, like the other DPC++ routines, have a return type of void for buffer API or sycl::event for USM API.
Computation Options (Character Parameters)
Standard BLAS and LAPACK use special alphabetic characters to control operations: transposition of matrices, storage of symmetric and triangular matrices, etc. In DPC++, these special characters are replaced by scoped enum types for extra type safety.
For example, the BLAS matrix-vector multiplication dgemv takes a character argument trans, which can be one of N or T, specifying whether the input matrix A should be transposed before multiplication.
In DPC++, trans is a member of the scoped enum type oneapi::mkl::transpose. You can use the traditional character-based names oneapi::mkl::transpose::N and oneapi::mkl::transpose::T, or the equivalent, more descriptive names oneapi::mkl::transpose::nontrans and oneapi::mkl::transpose::trans.
See Data Types for more information on the new types.
Matrix Layout (Row Major and Column Major)
The standard BLAS and LAPACK APIs require a Fortran layout for matrices (column major), where matrices are stored column-by-column in memory and the entries in each column are stored in consecutive memory locations. By default, oneMKL for DPC++ likewise assumes this matrix layout. The oneapi::mkl::blas::row_major namespace must be used for row major layout for BLAS. Row major layout is not supported directly for LAPACK, but you can transpose row major input matrices, call the desired DPC++ LAPACK routines, and then transpose the output matrices back to row major layout.
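The two layouts, and the transposition step suggested for LAPACK, can be sketched in plain C++ without any oneMKL calls. The helper names below are hypothetical, and the conversion assumes tightly packed matrices (leading dimension equal to the number of rows for column major and the number of columns for row major).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column major matrix with leading dimension lda.
inline std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t lda) {
    return i + j * lda;
}

// Offset of element (i, j) in a row major matrix with leading dimension lda.
inline std::size_t row_major_index(std::size_t i, std::size_t j, std::size_t lda) {
    return i * lda + j;
}

// Copies a rows-by-cols row major matrix into column major storage, for
// example before passing it to a DPC++ LAPACK routine.
std::vector<double> row_to_col_major(const std::vector<double>& a,
                                     std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<double> out(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            out[col_major_index(i, j, rows)] = a[row_major_index(i, j, cols)];
        }
    }
    return out;
}
```

Converting the output back to row major layout is the same loop with the two index functions swapped.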
Example for BLAS
Below is a short excerpt of a program calling standard BLAS dgemm:
double *A = …, *B = …, *C = …;
double alpha = 2.0, beta = 3.0;
int m = 16, n = 20, k = 24;
int lda = m, ldb = n, ldc = m;
dgemm("N", "T", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
The DPC++ equivalent of this excerpt would be as follows:
using namespace sycl;
using namespace oneapi::mkl;
queue Q(…);
buffer A = …, B = …, C = …;
int m = 16, n = 20, k = 24;
int lda = m, ldb = n, ldc = m;
blas::gemm(Q, transpose::N, transpose::T, m, n, k, 2.0, A, lda, B, ldb, 3.0, C, ldc);