oneMKL Initialization on GPU
When we run an application with oneMKL functions on GPU, we spend time on some service routines as well. Here’s what happens inside the library when we call oneMKL functions on GPU:
The first step is to check which oneMKL verbose mode was chosen. oneMKL verbose mode is needed to profile oneMKL usage in the application. You can read more about oneMKL Verbose mode in the documentation here:
Using oneMKL Verbose Mode (Linux)
Using oneMKL Verbose Mode (Windows)
The oneMKL Verbose feature is supported by BLAS (and BLAS-like extensions), LAPACK, FFT, and (in the DPC++ API only) RNG.
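To make the verbose-mode check concrete, here is a minimal sketch of reading the MKL_VERBOSE environment variable and mapping it to a level. It assumes the values 0, 1, and 2 described in the verbose-mode documentation. The function names parse_verbose_level and verbose_level_from_env are hypothetical, and this is not the library's internal code.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: maps the raw text of MKL_VERBOSE to a level.
// Unset, empty, or unrecognized values are treated as 0 (verbose off).
// Assumption: only the single-digit values 0, 1, and 2 are meaningful.
inline int parse_verbose_level(const char* text) {
    if (text == nullptr || text[0] == '\0' || text[1] != '\0') {
        return 0;
    }
    const int level = text[0] - '0';
    return (level >= 0 && level <= 2) ? level : 0;
}

// Hypothetical helper: reads the environment once and returns the level.
inline int verbose_level_from_env() {
    const int level = parse_verbose_level(std::getenv("MKL_VERBOSE"));
    assert(level >= 0 && level <= 2);
    return level;
}
```

A check like this is cheap because it only inspects an environment string, which is consistent with it being the smallest of the service costs.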
The next item in the list is the oneMKL GPU information detector, which checks which GPU is present on the system in order to run the optimal code implementation. It looks for the architecture, stepping, number of tiles, backend, and other important parameters.
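As an illustration of why the detected parameters matter, the sketch below selects an implementation variant from a device description. Everything in it is an assumption made for illustration: the GpuInfo structure, the Backend values, the architecture string, and the variant naming do not come from oneMKL.

```cpp
#include <cassert>
#include <string>

// Hypothetical backend identifiers, for illustration only.
enum class Backend { LevelZero, OpenCL };

// Hypothetical record of the parameters the detector looks for.
struct GpuInfo {
    std::string architecture;  // for example "xe_hpc" (made-up label)
    int stepping;
    int tiles;
    Backend backend;
};

// Picks a variant name from the detected parameters.
// The naming scheme is invented; a real library would map to kernels.
inline std::string select_gemm_variant(const GpuInfo& gpu) {
    assert(gpu.tiles >= 1);
    std::string variant = "gemm_" + gpu.architecture;
    if (gpu.tiles > 1) {
        variant += "_multitile";
    }
    return variant;
}
```

The point of the sketch is only that the selection depends on data that must first be queried from the device, which is where the detection time goes.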
The next step is to create a kernel and all that it requires (creating a program cache, adding a new kernel to the cache, searching for existing kernels, and so forth).
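The cache operations named above (searching for an existing kernel, and adding a new one on a miss) can be sketched as follows. This is a simplified model and not oneMKL's implementation: the Kernel and KernelCache types, the string key, and the builder callback are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a compiled GPU kernel.
struct Kernel {
    std::string key;
};

// Hypothetical cache: looks a kernel up by key and builds it only on a miss.
class KernelCache {
public:
    using Builder = std::function<std::shared_ptr<Kernel>(const std::string&)>;

    std::shared_ptr<Kernel> get_or_build(const std::string& key,
                                         const Builder& build) {
        auto it = cache_.find(key);          // search for an existing kernel
        if (it != cache_.end()) {
            ++hits_;
            return it->second;
        }
        std::shared_ptr<Kernel> kernel = build(key);  // the expensive step
        assert(kernel && "builder must return a kernel");
        cache_.emplace(key, kernel);         // add the new kernel to the cache
        ++misses_;
        return kernel;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<std::string, std::shared_ptr<Kernel>> cache_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

In this model, a second request with the same key returns the stored kernel without calling the builder again, so the creation cost is paid once per distinct kernel.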
Finally, some time is spent in the oneMKL memory manager, where the required memory is allocated or freed using the internal oneMKL implementation. oneMKL has a memory manager that provides a list of support functions, the ability to redefine memory functions, and internal fast memory allocations with memory reuse.
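Memory reuse of the kind described above can be sketched with a small pool that keeps freed buffers and hands them back for later requests of the same size. This is a host-side model using malloc and free. The ReusePool class is hypothetical and does not reflect how the oneMKL memory manager is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical pool: freed buffers are kept by size and reused later.
class ReusePool {
public:
    ReusePool() = default;
    ReusePool(const ReusePool&) = delete;
    ReusePool& operator=(const ReusePool&) = delete;
    ~ReusePool() { release_all(); }

    // Returns a previously freed buffer of the same size if one exists,
    // otherwise falls back to a fresh allocation.
    void* allocate(std::size_t bytes) {
        assert(bytes > 0);
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            ++reused_;
            return p;
        }
        return std::malloc(bytes);
    }

    // Keeps the buffer for reuse instead of returning it to the system.
    void deallocate(void* p, std::size_t bytes) {
        if (p != nullptr) {
            free_[bytes].push_back(p);
        }
    }

    // Returns every pooled buffer to the system.
    void release_all() {
        for (auto& entry : free_) {
            for (void* p : entry.second) {
                std::free(p);
            }
        }
        free_.clear();
    }

    std::size_t reused() const { return reused_; }

private:
    std::map<std::size_t, std::vector<void*>> free_;
    std::size_t reused_ = 0;
};
```

The trade-off in this sketch is the usual one for pooling: repeated allocations become cheaper, but memory stays reserved until release_all runs.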
Let’s look at the details of running a oneMKL gemm example from ${ONEAPI_ROOT}/share/doc/mkl/examples/examples_sycl.tgz/sycl/blas/source/gemm.cpp (Linux) or %ONEAPI_ROOT%\share\doc\mkl\examples\examples_sycl.zip\sycl\blas\source\gemm.cpp (Windows). The execution of BLAS gemm alone, with the single-precision real data type on an Intel® Data Center GPU Max Series card, took 48.795 milliseconds:
Checking for oneMKL Verbose mode settings took 0.012 milliseconds.
The sum of all times for getting the information about the GPU device took around 1.568 milliseconds.
Creating the gemm kernel took around 0.958 milliseconds.
Kernel cache allocation took around 0.01 milliseconds.
At the end of the application when the oneMKL library is unloading, we also clean caches. This took 0.111 milliseconds.
In addition to the time above, we need to make sure that before running the gemm function, we have all required memory allocated for the A, B, and C matrices. This took an additional 0.084 milliseconds. After running gemm, we need to free all allocated memory, which took around 0.14 milliseconds.
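For a sense of scale, the service times reported above add up to about 2.883 milliseconds. The short sketch below performs that sum. The Step structure and the function names are hypothetical, and the numbers are copied from the text rather than measured by the code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: one service step and its reported time.
struct Step {
    const char* name;
    double ms;
};

// Times quoted in the text above, in milliseconds.
inline std::vector<Step> reported_steps() {
    return {
        {"verbose mode check", 0.012},
        {"GPU device information", 1.568},
        {"gemm kernel creation", 0.958},
        {"kernel cache allocation", 0.010},
        {"cache cleanup at unload", 0.111},
        {"allocation of A, B, and C", 0.084},
        {"freeing allocated memory", 0.140},
    };
}

// Adds up the reported times.
inline double total_ms(const std::vector<Step>& steps) {
    double sum = 0.0;
    for (const Step& step : steps) {
        assert(step.ms >= 0.0);
        sum += step.ms;
    }
    return sum;
}
```

Device detection and kernel creation together account for 2.526 of those 2.883 milliseconds, so they dominate the service overhead in this run.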
If we run gemm on CPU, the classic C/Fortran implementation will be used. Refer to the C/Fortran version of the developer reference for a corresponding discussion of those initialization costs.