Developer Reference for Intel® oneAPI Math Kernel Library for C

ID 766684
Date 3/22/2024
Public

Overview

Intel® oneAPI Math Kernel Library (oneMKL) is optimized for performance on Intel processors. oneMKL also runs on non-Intel x86-compatible processors.

NOTE:

oneMKL provides limited input validation to minimize the performance overheads. It is your responsibility when using oneMKL to ensure that input data has the required format and does not contain invalid characters. These can cause unexpected behavior of the library. Examples of the inputs that may result in unexpected behavior:

  • Not-a-number (NaN) and other special floating point values
  • Large inputs that may lead to accumulator overflow

As the oneMKL API accepts raw pointers, it is your application's responsibility to validate the buffer sizes before passing them to the library. The library requires subroutine and function parameters to be valid before being passed. While some oneMKL routines do limited checking of parameter errors, your application should check for NULL pointers, for example.
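The checks described above can be sketched as a small guard that runs before any numeric call. This is a minimal illustration in standard C, not part of oneMKL: `vector_is_valid` and `checked_sum` are hypothetical helper names, and the plain summation loop only stands in for a library routine.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a oneMKL function): true only if the buffer is
   non-NULL, holds at least `needed` elements, and each of the first
   `needed` elements is finite (no NaN, no +/-Inf). */
bool vector_is_valid(const double *x, size_t capacity, size_t needed)
{
    if (x == NULL || needed > capacity) {
        return false;
    }
    for (size_t i = 0; i < needed; ++i) {
        if (!isfinite(x[i])) {
            return false;
        }
    }
    return true;
}

/* Example call site: validate first, then compute. The loop stands in for
   a library call. Returns false and leaves *out untouched on bad input. */
bool checked_sum(const double *x, size_t capacity, size_t n, double *out)
{
    assert(out != NULL); /* programmer error, not a data error */
    if (!vector_is_valid(x, capacity, n)) {
        return false;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    *out = sum;
    return true;
}
```

Scanning for non-finite values costs one pass over the data, so an application may prefer to run it only at trust boundaries, such as on data read from files or received from other components.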

The Intel® oneAPI Math Kernel Library includes Fortran routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. In addition to the Fortran interface, Intel® oneAPI Math Kernel Library (oneMKL) includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematics, Vector Statistics, and many other functions. For hardware and software requirements to use Intel® oneAPI Math Kernel Library (oneMKL), see Intel® oneAPI Math Kernel Library (oneMKL) Release Notes.

NOTE:

Function calls at runtime for Intel® oneAPI Math Kernel Library (oneMKL) libraries on the Microsoft Windows* operating system can utilize the function LoadLibrary() and related loading functions in static, dynamic, and single-dynamic library linking models. These functions attempt to access the loader lock, which, when used within or at the same time as another DllMain function call, can lead to a deadlock. If possible, avoid making your calls to Intel® oneAPI Math Kernel Library (oneMKL) in a DllMain function or at the same time as other calls to DllMain, even on separate threads. Refer to the Microsoft documentation about DllMain and Dynamic-Link Library Best Practices for more details.

BLAS Routines

The BLAS routines and functions are divided into the following groups according to the operations they perform:

  • BLAS Level 1 Routines perform operations of both addition and reduction on vectors of data. Typical operations include scaling and dot products.

  • BLAS Level 2 Routines perform matrix-vector operations, such as matrix-vector multiplication, rank-1 and rank-2 matrix updates, and solution of triangular systems.

  • BLAS Level 3 Routines perform matrix-matrix operations, such as matrix-matrix multiplication, rank-k update, and solution of triangular systems.
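To make the three levels concrete, the sketch below gives naive reference loops for one operation from each level. These are illustrative only and are not how oneMKL implements them; the `ref_` names are hypothetical, row-major storage is an assumption, and the optimized counterparts in the C interface are routines such as cblas_ddot, cblas_dgemv, and cblas_dgemm.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): dot product, O(n) work. */
double ref_dot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Level 2 (matrix-vector): y = A*x, A is m-by-n in row-major order,
   O(m*n) work. Each output element is one dot product. */
void ref_gemv(size_t m, size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        y[i] = ref_dot(n, a + i * n, x);
    }
}

/* Level 3 (matrix-matrix): C = A*B, A is m-by-k, B is k-by-n, all
   row-major, O(m*n*k) work. */
void ref_gemm(size_t m, size_t n, size_t k,
              const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The growing ratio of arithmetic to data moved, from Level 1 to Level 3, is what gives optimized Level 3 routines the most room for cache blocking and vectorization.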

Starting from release 8.0, Intel® oneAPI Math Kernel Library (oneMKL) also supports the Fortran 95 interface to the BLAS routines.

Starting from release 10.1, a number of BLAS-like Extensions are added to enable the user to perform certain data manipulation, including matrix in-place and out-of-place transposition operations combined with simple matrix arithmetic operations.

Sparse BLAS Routines

The Sparse BLAS Level 1 Routines and Functions and the Sparse BLAS Level 2 and Level 3 Routines operate on sparse vectors and matrices. These routines perform operations similar to the BLAS Level 1, 2, and 3 routines. The Sparse BLAS routines take advantage of vector and matrix sparsity: they allow you to store only non-zero elements of vectors and matrices. Intel® oneAPI Math Kernel Library (oneMKL) also supports the Fortran 95 interface to Sparse BLAS routines.
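As an illustration of storing only the non-zero elements, the sketch below multiplies a matrix in compressed sparse row (CSR) form by a dense vector. It is a plain C reference loop, not a oneMKL routine: `csr_matvec` is a hypothetical name, and zero-based indexing with the three-array CSR layout is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x for a matrix with m rows stored in zero-based, three-array CSR:
     values  : the non-zero entries, row by row
     col_idx : column index of each entry in `values`
     row_ptr : m+1 offsets; row i occupies values[row_ptr[i] .. row_ptr[i+1]-1] */
void csr_matvec(size_t m, const size_t *row_ptr, const size_t *col_idx,
                const double *values, const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        assert(row_ptr[i] <= row_ptr[i + 1]); /* offsets must not decrease */
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += values[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

For the 3-by-3 matrix with rows (1, 0, 2), (0, 3, 0), and (4, 0, 5), the arrays are values = {1, 2, 3, 4, 5}, col_idx = {0, 2, 1, 0, 2}, and row_ptr = {0, 2, 3, 5}, so only five of the nine entries are stored and visited.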

Sparse QR

Sparse QR in Intel® oneAPI Math Kernel Library (oneMKL) is a set of routines used to solve systems with sparse matrices that have real coefficients and general structure. All Sparse QR routines can be divided into three steps: reordering, factorization, and solving. Currently, only the CSR format is supported for the input matrix, and Sparse QR operates on the matrix handle used in all SpBLAS IE routines. (For details on how to create a matrix handle, refer to mkl-sparse-create-csr.)

LAPACK Routines

The Intel® oneAPI Math Kernel Library fully supports the LAPACK 3.7 set of computational, driver, auxiliary and utility routines.

The original versions of LAPACK from which that part of Intel® oneAPI Math Kernel Library (oneMKL) was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.

The LAPACK routines can be divided into the following groups according to the operations they perform:

Starting from release 8.0, Intel® oneAPI Math Kernel Library (oneMKL) also supports the Fortran 95 interface to LAPACK computational and driver routines. This interface provides an opportunity for simplified calls of LAPACK routines with fewer required arguments.

Sparse Solver Routines

Direct sparse solver routines in Intel® oneAPI Math Kernel Library (oneMKL) (see Sparse Solver Routines) solve linear systems with symmetric and symmetrically structured sparse matrices that have real or complex coefficients. For symmetric matrices, these Intel® oneAPI Math Kernel Library (oneMKL) subroutines can solve both positive-definite and indefinite systems. Intel® oneAPI Math Kernel Library (oneMKL) includes a solver based on the PARDISO* sparse solver, referred to as Intel® oneAPI Math Kernel Library (oneMKL) PARDISO, as well as an alternative set of user-callable direct sparse solver routines.

If you use the Intel® oneAPI Math Kernel Library (oneMKL) PARDISO sparse solver, please cite:

O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of linear equations with PARDISO. J. of Future Generation Computer Systems, 20(3):475-487, 2004.

Intel® oneAPI Math Kernel Library (oneMKL) also provides an iterative sparse solver (see Sparse Solver Routines) that uses Sparse BLAS Level 2 and 3 routines and works with different sparse data formats.

Extended Eigensolver Routines

The Extended Eigensolver RCI Routines are a set of high-performance numerical routines for solving standard (Ax = λx) and generalized (Ax = λBx) eigenvalue problems, where A and B are symmetric or Hermitian. They yield all the eigenvalues and eigenvectors within a given search interval. They are based on the Feast algorithm, an innovative fast and stable numerical algorithm presented in [Polizzi09], which deviates fundamentally from the traditional Krylov subspace iteration based techniques (Arnoldi and Lanczos algorithms [Bai00]) or other Davidson-Jacobi techniques [Sleijpen96]. The Feast algorithm is inspired by the density-matrix representation and contour integration technique in quantum mechanics.

It is free from orthogonalization procedures. Its main computational tasks consist of solving very few inner independent linear systems with multiple right-hand sides and one reduced eigenvalue problem orders of magnitude smaller than the original one. The Feast algorithm combines simplicity and efficiency and offers many important capabilities for achieving high performance, robustness, accuracy, and scalability on parallel architectures. This algorithm is expected to significantly augment numerical performance in large-scale modern applications.

Some of the characteristics of the Feast algorithm [Polizzi09] are:

  • Converges quickly in 2-3 iterations with very high accuracy

  • Naturally captures all eigenvalue multiplicities

  • No explicit orthogonalization procedure

  • Can reuse the basis of pre-computed subspace as suitable initial guess for performing outer-refinement iterations

    This capability can also be used for solving a series of eigenvalue problems that are close to one another.

  • The number of internal iterations is independent of the size of the system and the number of eigenpairs in the search interval

  • The inner linear systems can be solved either iteratively (even with modest relative residual error) or directly

VM Functions

The Vector Mathematics functions (see Vector Mathematical Functions) include a set of highly optimized implementations of certain computationally expensive core mathematical functions (power, trigonometric, exponential, hyperbolic, etc.) that operate on vectors of real and complex numbers.

Application programs that might significantly improve performance with VM include nonlinear programming software, integrals computation, and many others. VM provides interfaces both for Fortran and C languages.

Statistical Functions

Vector Statistics (VS) contains three sets of functions (see Statistical Functions) providing:
  • Pseudorandom, quasi-random, and non-deterministic random number generator subroutines implementing basic continuous and discrete distributions. To provide best performance, the VS subroutines use calls to highly optimized Basic Random Number Generators (BRNGs) and a set of vector mathematical functions.
  • A wide variety of convolution and correlation operations.
  • Initial statistical analysis of raw single and double precision multi-dimensional datasets.

Fourier Transform Functions

The Intel® oneAPI Math Kernel Library (oneMKL) multidimensional Fast Fourier Transform (FFT) functions with mixed radix support (see Fourier Transform Functions) provide uniformity of discrete Fourier transform computation and combine functionality with ease of use. Both Fortran and C interface specifications are given. There is also a cluster version of FFT functions, which runs on distributed-memory architectures and is provided only for Intel® 64 architectures.

The FFT functions provide fast computation via the FFT algorithms for arbitrary lengths. See the Intel® oneAPI Math Kernel Library (oneMKL) Developer Guide for the specific radices supported.

Partial Differential Equations Support

Intel® oneAPI Math Kernel Library (oneMKL) provides tools for solving Partial Differential Equations (PDE) (see Partial Differential Equations Support). These tools are the Trigonometric Transform interface routines and the Poisson Solver.

The Trigonometric Transform routines may be helpful to users who implement their own solvers similar to the Intel® oneAPI Math Kernel Library (oneMKL) Poisson Solver. The users can improve performance of their solvers by using fast sine, cosine, and staggered cosine transforms implemented in the Trigonometric Transform interface.

The Poisson Solver is designed for fast solving of simple Helmholtz, Poisson, and Laplace problems. The Trigonometric Transform interface, which underlies the solver, is based on the Intel® oneAPI Math Kernel Library (oneMKL) FFT interface (refer to Fourier Transform Functions), optimized for Intel® processors.

Support Functions

The Intel® oneAPI Math Kernel Library (oneMKL) support functions (see Support Functions) are used to support the operation of the Intel® oneAPI Math Kernel Library (oneMKL) software and provide basic information on the library and library operation, such as the current library version, timing, setting and measuring of CPU frequency, error handling, and memory allocation.

Starting from release 10.0, the Intel® oneAPI Math Kernel Library (oneMKL) support functions provide additional threading control.

Starting from release 10.1, Intel® oneAPI Math Kernel Library (oneMKL) selectively supports a Progress Routine feature to track progress of a lengthy computation and/or interrupt the computation using a callback function mechanism. The user application can define a function called mkl_progress that is regularly called from the Intel® oneAPI Math Kernel Library (oneMKL) routine supporting the progress routine feature. See Progress Routine in Support Functions for reference. Refer to a specific LAPACK or DSS/PARDISO function description to see whether the function supports this feature.
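The callback mechanism described above can be sketched generically as follows. This is not the mkl_progress interface: the `progress_fn` type, its parameters, the "factorize" stage label, and `run_with_progress` are all hypothetical stand-ins chosen for illustration, and the actual mkl_progress signature should be taken from the Progress Routine reference. The sketch only shows the control flow, where a long-running routine reports each step and stops when the callback returns a nonzero value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: receives the current step and a stage label,
   returns nonzero to request that the computation be interrupted. */
typedef int (*progress_fn)(int step, const char *stage);

/* Stand-in for a lengthy library routine. Performs up to `total_steps`
   steps, reports each completed step to the callback (if any), and stops
   early when the callback returns nonzero. Returns the number of steps
   actually completed. */
int run_with_progress(int total_steps, progress_fn callback)
{
    assert(total_steps >= 0);
    int done = 0;
    for (int step = 1; step <= total_steps; ++step) {
        /* ... one unit of real work would go here ... */
        done = step;
        if (callback != NULL && callback(step, "factorize") != 0) {
            break;
        }
    }
    return done;
}

/* Example callback: request interruption once three steps have completed. */
int stop_after_three(int step, const char *stage)
{
    (void)stage; /* the stage label is unused in this example */
    return step >= 3;
}
```

Calling `run_with_progress(10, stop_after_three)` completes three steps and then returns, while passing NULL as the callback lets all steps run.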

oneMKL Initialization on CPU

When a user first invokes any oneMKL functions, there is an initialization cost to keep in mind. Here are some details about running oneMKL C/Fortran functions:

When we run an application with oneMKL C/Fortran functions on CPU, we spend time on some service routines. Here's what is happening inside the library when we call oneMKL C/Fortran functions:

  • The first step is setting xerbla. It's a oneMKL routine that acts as an error handler for BLAS, LAPACK, VS, and VM domains if an input parameter has an invalid value. See xerbla for more information.

  • The next step is to check which oneMKL verbose mode was chosen. oneMKL verbose mode is needed to profile oneMKL usage in the application. You can read more about oneMKL Verbose mode in the oneMKL documentation.

    The oneMKL Verbose feature is enabled only for certain domains such as BLAS (and BLAS-like extensions), LAPACK, selected functionality in ScaLAPACK and FFT, and (in the DPC++ API only) RNG.

  • The next item in the list is the oneMKL dispatcher. The oneMKL dispatcher checks the hardware used for running the application and the available instruction set. Based on the results from the dispatcher, different function implementations (optimized for different hardware and instruction sets) will be called. More details can be found in the oneMKL documentation.

  • During the function run (or even before), you may need to allocate memory. oneMKL has a memory manager that provides a list of support functions, the ability to redefine memory functions, and internal fast memory allocations with memory reuse. See the oneMKL documentation for more information.

  • If you're in threading mode, oneMKL will also call its own threading manager, which checks various environment variables and sets the number of threads. You can read more about this in the oneMKL documentation.
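The dispatcher step in the list above can be pictured as selecting a function pointer once, based on what the hardware supports. The sketch below is a conceptual illustration only and does not reflect oneMKL internals: the `isa_level` enum, both kernels, and `select_sum_kernel` are hypothetical, and the detected level is passed in as an argument rather than queried from the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction-set levels a dispatcher might distinguish. */
typedef enum { ISA_BASELINE = 0, ISA_WIDE = 1 } isa_level;

typedef double (*sum_kernel)(const double *x, size_t n);

/* Baseline variant: one running sum. */
static double sum_baseline(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Stand-in for a variant tuned for wider vector units: four independent
   partial sums, followed by a scalar loop for the leftover elements. */
static double sum_unrolled4(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) {
        s0 += x[i];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Pick the implementation that matches the detected level. */
sum_kernel select_sum_kernel(isa_level detected)
{
    return detected >= ISA_WIDE ? sum_unrolled4 : sum_baseline;
}
```

Because the selection happens once and later calls go through the chosen pointer, the detection cost is paid only at initialization, which fits the sub-millisecond dispatcher time reported in the dgemm example in this section.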

As an example, BLAS dgemm was run on a 4th Gen Intel® Xeon® Scalable processor system. Sizes of matrices A and B were 10000x10000. Running the dgemm function in sequential mode took 32.5 seconds (32500 milliseconds), of which:

  • Setting oneMKL xerbla took 0.001 milliseconds.
  • Setting/checking oneMKL verbose mode took 0.009 milliseconds.
  • Checking for MKL_CBWR settings and detecting CPU using MKL dispatcher took 0.004 milliseconds.
  • Additional internal memory allocations in dgemm took 0.009 milliseconds followed by 0.002 milliseconds of deallocation.

As you can see in the example, before the dgemm function runs there are several mkl_malloc calls to allocate memory for the A, B, and C matrices. Overall memory allocation took around 0.084 milliseconds. After the dgemm function completes, there are several mkl_free calls to free the A, B, and C matrix memory. This took around 5.159 milliseconds.

If you run dgemm with Intel OpenMP threading, you'll spend 24 milliseconds in the oneMKL threading manager. If you run dgemm with TBB threading, you'll spend around 5 milliseconds in the oneMKL threading manager.
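To put these figures in proportion, the arithmetic can be sketched as below. The timings are copied from the sequential example above; the helper names (`sum_ms`, `percent_of`, `in_call_overhead_ms`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a list of timings given in milliseconds. */
double sum_ms(const double *timings_ms, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        total += timings_ms[i];
    }
    return total;
}

/* Express `part_ms` as a percentage of `whole_ms`. */
double percent_of(double part_ms, double whole_ms)
{
    assert(whole_ms > 0.0);
    return 100.0 * part_ms / whole_ms;
}

/* Service steps measured inside the sequential dgemm call, in order:
   xerbla, verbose check, dispatcher, internal allocation, internal
   deallocation. Values are taken from the example in the text. */
static const double in_call_ms[] = { 0.001, 0.009, 0.004, 0.009, 0.002 };

/* Total of the in-call service steps, in milliseconds. */
double in_call_overhead_ms(void)
{
    return sum_ms(in_call_ms, sizeof in_call_ms / sizeof in_call_ms[0]);
}
```

With these inputs the in-call service steps total about 0.025 milliseconds, roughly 0.00008% of the 32500-millisecond run, and the mkl_malloc and mkl_free calls around the call add about 5.24 milliseconds. For a call of this size the initialization cost is negligible; it becomes more visible for short calls on small matrices.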

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201