Offloading oneMKL Computations onto the GPU
The Intel® oneAPI Math Kernel Library (oneMKL) improves performance with math routines for software applications that solve large computational problems. oneMKL provides BLAS, Sparse BLAS, and LAPACK linear algebra routines, fast Fourier transforms, vectorized math functions, random number generation functions, and other functionality.
The oneMKL distribution includes an examples directory which contains examples of various calls to oneMKL routines.
For more information about the Intel® oneAPI Math Kernel Library, see:
Developer Reference for Intel® oneAPI Math Kernel Library - C
Developer Reference for Intel® oneAPI Math Kernel Library - Fortran
Compile and Link Commands when Using oneMKL OpenMP Offload
The information given in this section is specific to Linux. For information specific to Windows, and for more details, refer to the Intel® oneAPI Math Kernel Library Link Line Advisor.
Notes:
The link commands shown below will dynamically link to the oneMKL library.
The Intel® oneMKL LP64 libraries index arrays with the 32-bit integer type; whereas the Intel® oneMKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 231 - 1 elements).
C/C++ (Linux)
The compile and link commands for a C/C++ program that uses OpenMP threading and calls oneMKL C/C++ API with 32-bit integers are as follows.
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -c source.cpp Link: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
If the program calls oneMKL C/C++ API with 64-bit integers, the compile and link commands are:
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DMKL_ILP64 -c source.cpp Link: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
Fortran (Linux)
The compile and link commands for a Fortran program that uses OpenMP threading and calls oneMKL Fortran API with 32-bit integers are as follows.
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c source.f Link: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
If the program calls oneMKL Fortran API with 64-bit integers, the compile and link commands are:
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -i8 -fpp -free -c source.f Link: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
A Note on the -qmkl Compiler Option
Use the -qmkl option (equivalent to -qmkl=parallel) to link with a certain Intel® oneAPI Math Kernel Library threading layer depending on the threading option provided:
For -fiopenmp, the OpenMP threading layer for Intel® Compilers
For -tbb, the Intel® Threading Building Blocks (Intel® TBB) threading layer
Use -qmkl=sequential to link with the sequential version of Intel® oneAPI Math Kernel Library.
Note that -qmkl=parallel/sequential affects threading on the CPU only. Offloaded MKL computations will always be parallelized as appropriate, and will occupy as many Execution Units (EUs) on the GPU as possible.
OpenMP Directives to Offload oneMKL Computations
You can use OpenMP directives to offload oneMKL computations onto the GPU. There are two ways to do this.
One way involves using the Intel-specific OpenMP extension target variant dispatch directive. You would place the call to the oneMKL routine inside a target variant dispatch construct, as shown in the example below. In this example, matrices a, b, and c used in the multiplication are mapped to the device before the call to the oneMKL routine cblas_dgemm. The use_device_ptr(A,B,C) clause is used on the target variant dispatch directive to indicate that a, b, and c point to objects that have corresponding storage on the device. When cblas_dgemm is called, the corresponding device pointers for a, b, and c will be passed as arguments, and the device copies of a, b, and c will be used in the computation.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define EPSILON 0.0001 int main() { double *A, *B, *C, *C_fl; int64_t m, n, k; double alpha, beta; double sum; int64_t i, j, q; int fail; printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n" " Intel oneMKL function dgemm, where A, B, and C are matrices and \n" " alpha and beta are double precision scalars\n\n"); m = 2000, k = 200, n = 1000; printf (" Initializing data for matrix multiplication C=A*B for matrix \n" " A(%li x %li) and matrix B(%li x %li)\n\n", m, k, k, n); alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A = (double *)mkl_malloc( m * k * sizeof( double ), 64 ); B = (double *)mkl_malloc( k * n * sizeof( double ), 64 ); C = (double *)mkl_malloc( m * n * sizeof( double ), 64 ); C_fl = (double *)mkl_malloc( m*n*sizeof( double ), 64 ); if (A == NULL || B == NULL || C == NULL || C_fl == NULL) { printf( "\n ERROR: Cannot allocate memory for matrices. Exiting... \n\n"); return 1; } printf (" Intializing matrices \n\n"); for (i = 0; i < (m*k); i++) { A[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C[i] = 0.0; C_fl[i] = 0.0; } printf (" Computing matrix product using Intel oneMKL dgemm function via CBLAS interface \n\n"); #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n]) { #pragma omp target variant dispatch use_device_ptr(A, B, C) { cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, k, B, n, beta, C, n); } } printf ("\n Top left corner of matrix C: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C[j+i*n]); } printf ("\n"); } printf (" Computing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) { sum += A[k*i+q] * B[n*q+j]; } C_fl[n*i+j] = alpha * sum + beta * C_fl[n*i+j]; } } printf ("\n Top left corner of matrix C_fl: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C_fl[j+i*n]); } printf ("\n"); } printf (" Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (fabs(C[i] - C_fl[i]) > EPSILON) { fail = 1; break; } } if (fail) printf ("\n **** FAIL **** \n"); else printf ("\n **** PASS **** \n"); printf ("\n Deallocating memory \n\n"); mkl_free(A); mkl_free(B); mkl_free(C); return fail; }
Another way to inform the compiler that oneMKL computations should be offloaded onto the GPU is by using the OpenMP 5.1 dispatch directive, as shown in the example below. In this example too, matrices a, b, and c are mapped to the device before the call to the oneMKL routine cblas_dgemm. When cblas_dgemm is called, the corresponding device pointers for a, b, and c will be passed as arguments, so the device copies of a, b, and c will be used in the computation.
The use_device_ptr clause is not needed on the dispatch directive. With OpenMP 5.1, the list of device pointers needed by the oneMKL routines is given in the oneMKL OpenMP offload header file, mkl_omp_offload.h, where the GPU variant function is declared. The user should carefully review the list of device pointers required in the oneMKL header file and make sure that the corresponding matrices are accessible from the device before calling the oneMKL routine.
Note that, depending on the version of the compiler you are using, you may need to add the compiler option -fopenmp-version=51 in order for the dispatch directive to be accepted.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define EPSILON 0.0001 int main() { double *A, *B, *C, *C_fl; int64_t m, n, k; double alpha, beta; double sum; int64_t i, j, q; int fail; printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n" " Intel oneMKL function dgemm, where A, B, and C are matrices and \n" " alpha and beta are double precision scalars\n\n"); m = 2000, k = 200, n = 1000; printf (" Initializing data for matrix multiplication C=A*B for matrix \n" " A(%li x %li) and matrix B(%li x %li)\n\n", m, k, k, n); alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A = (double *)mkl_malloc( m * k * sizeof( double ), 64 ); B = (double *)mkl_malloc( k * n * sizeof( double ), 64 ); C = (double *)mkl_malloc( m * n * sizeof( double ), 64 ); C_fl = (double *)mkl_malloc( m*n*sizeof( double ), 64 ); if (A == NULL || B == NULL || C == NULL || C_fl == NULL) { printf( "\n ERROR: Cannot allocate memory for matrices. Exiting... \n\n"); return 1; } printf (" Intializing matrices \n\n"); for (i = 0; i < (m*k); i++) { A[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C[i] = 0.0; C_fl[i] = 0.0; } printf (" Computing matrix product using Intel oneMKL dgemm function via CBLAS interface \n\n"); #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n]) { #pragma omp target variant dispatch use_device_ptr(A, B, C) cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, k, B, n, beta, C, n); } printf ("\n Top left corner of matrix C: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C[j+i*n]); } printf ("\n"); } printf (" Computing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) { sum += A[k*i+q] * B[n*q+j]; } C_fl[n*i+j] = alpha * sum + beta * C_fl[n*i+j]; } } printf ("\n Top left corner of matrix C_fl: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C_fl[j+i*n]); } printf ("\n"); } printf (" Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (fabs(C[i] - C_fl[i]) > EPSILON) { fail = 1; break; } } if (fail) printf ("\n **** FAIL **** \n"); else printf ("\n **** PASS **** \n"); printf ("\n Deallocating memory \n\n"); mkl_free(A); mkl_free(B); mkl_free(C); return fail; }
Notes:
oneMKL routines expect the arrays/matrices to be on the device before the computation will be run. So the user has to map the data to the device, or allocate the data directly on the device, before calling a oneMKL routine.
If a oneMKL routine is not called from a target variant dispatch (or dispatch) region, or if offload is disabled, then the oneMKL computations will be executed on the CPU.
Only one call to a oneMKL routine can be issued from an OpenMP target variant dispatch (or dispatch) construct. If there are two consecutive calls to oneMKL routines, then the calls should be placed in separate target variant dispatch (or dispatch) constructs.
Fortran
When calling oneMKL routines from Fortran code, be sure to add the following include statement:
include "mkl_omp_offload.f90"
Also, if calling oneMKL Fortran API with 32-bit integers, add the following module use statement:
use onemkl_blas_omp_offload_lp64
On the other hand, if calling oneMKL Fortran API with 64-bit integers, add the following module use statement:
use onemkl_blas_omp_offload_ilp64
The following Fortran example illustrates how DGEMM is called from a Fortran program, and the include and use statements mentioned above.
include "mkl_omp_offload.f90" program DGEMM_MAIN #if defined(MKL_ILP64) use onemkl_blas_omp_offload_ilp64 #else use onemkl_blas_omp_offload_lp64 #endif use omp_lib use iso_fortran_env implicit none integer, parameter :: m = 20 integer, parameter :: k = 5 integer, parameter :: n = 10 double precision a(m,k), b(k,n), c1(m,n), c2(m,n) double precision alpha, beta integer i, j print* print*,' D G E M M EXAMPLE PROGRAM' ! Initialize alpha = 1.025 beta = 0.75 do i = 1, m do j = 1, k a(i,j) = (i-1) - (0.25 * k) end do end do do i = 1, k do j = 1, n b(i,j) = -((i-1) + j) end do end do do i = 1, m do j = 1, n c1(i,j) = 0.2 + i - j c2(i,j) = 0.2 + i - j end do end do ! Execute DGEMM on host. call DGEMM('N','N',m,n,k,alpha,a,m,b,k,beta,c1,m) print * print *, 'c1 - After DGEMM host execution' do i=1,m print 110, (c1(i,j),j=1,n) end do print* ! Execute DGEMM on device !$omp target data map(to: a, b) map(tofrom: c2) !$omp dispatch call DGEMM('N','N',m,n,k,alpha,a,m,b,k,beta,c2,m) !$omp end target data print * print *, 'c2 - After DGEMM device execution' do i=1,m print 110, (c2(i,j),j=1,n) end do print * 110 format(7x,10(f10.2,2x)) end
To compile and link the above Fortran example with 32-bit integers:
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c dgemm_dispatch_f.f90 ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fsycl -L${MKLROOT}/lib/intel64 -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -lmkl_sycl dgemm_dispatch_f.o
To compile and link the above Fortran example with 64-bit integers:
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -m64 -DMKL_ILP64 -i8 -fpp -free -c dgemm_dispatch_f.f90 ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fsycl -L${MKLROOT}/lib/intel64 -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -lmkl_sycl dgemm_dispatch_f.o
After generating the executable (a.out), from a C/C++ or Fortran program, you can run the executable under onetrace and look for the heading “Device Timing Results” in the generated trace. Below that heading we should see the oneMKL kernels listed. This way we confirm that oneMKL computations have been offloaded onto the GPU.
Example run command:
OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 onetrace -h -d ./a.out
Batching of oneMKL GEMM Calls
The oneMKL library includes “batch” routines that allow the user to batch several oneMKL calls into a single oneMKL call. At runtime, oneMKL will intelligently execute all of the matrix operations to optimize overall performance.
For example, the cblas_dgemm routine computes a matrix-matrix product of two general matrices a and b, returning the result in a matrix c. The cblas_dgemm interface is shown below.
void cblas_dgemm (const CBLAS_LAYOUT layout, const CBLAS_TRANSPOSE transa, const CBLAS_TRANSPOSE transb, const MKL_INT m, const MKL_INT n, const MKL_INT k, const double alpha, const double *a, const MKL_INT lda, const double *b, const MKL_INT ldb, const double beta, double *c, const MKL_INT ldc);
The cblas_dgemm_batch routine is similar to the cblas_dgemm routine, but the cblas_dgemm_batch routine performs matrix-matrix operations on groups of matrices, processing a number of groups at once.
The cblas_dgemm_batch interface is shown below. Note that the interface resembles the cblas_dgemm interface. However, it involves passing matrix arguments as arrays of pointers to matrices, and passing parameters as arrays of parameters.
void cblas_dgemm_batch (const CBLAS_LAYOUT layout, const CBLAS_TRANSPOSE* transa_array, const CBLAS_TRANSPOSE* transb_array, const MKL_INT* m_array, const MKL_INT* n_array, const MKL_INT* k_array, const double* alpha_array, const double **a_array, const MKL_INT* lda_array, const double **b_array, const MKL_INT* ldb_array, const double* beta_array, double **c_array, const MKL_INT* ldc_array, const MKL_INT group_count, const MKL_INT* group_size);
The batch operation is defined as follows:
idx = 0 for i = 0 .. group_count - 1 alpha and beta in alpha_array[i] and beta_array[i] for j = 0 .. group_size[i] - 1 a, b, and c matrices in a_array[idx], b_array[idx], and c_array[idx], respectively c := alpha*op(a)*op(b) + beta*c, idx = idx + 1 end for end for
where:
op(X) is one of op(X) = X, or op(X) = XT, or op(X) = XH,
alpha and beta are scalar elements of alpha_array and beta_array,
a, b and c are matrices such that for m, n, and k which are elements of m_array, n_array, and k_array:
op(a) is an m-by-k matrix,
op(b) is a k-by-n matrix,
C is an m-by-n matrix.
a, b, and c represent matrices stored at addresses pointed to by a_array, b_array, and c_array, respectively. The number of entries in a_array, b_array, and c_array, or total_batch_count, is equal to the sum of all of the group_size entries.
It is possible to batch the multiplications of different shapes and parameters by packaging them into groups, where each group consists of multiplications of matrices of the same shapes (same m, n, and k) and the same parameters.
The basic assumption for the batch API are that all operations in a batch (whether in the same group or different groups) are independent of one another. So oneMKL does not guarantee any particular ordering between operations in a batch, and will try to execute multiple operations in parallel.
In general, the larger you can make the batch size, the better. This allows oneMKL to better parallelize the operations and distribute the work across the GPU.
We illustrate how two calls to cblas_dgemm can be replaced with one call to cblas_dgemm_batch. The following example includes two calls to cblas_dgemm.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define epsilon 0.0000001f bool compare(double x, double y) { // returns true if x and y are the same return fabs(x - y) <= epsilon; } int main() { double *A1, *B1, *C1, *C1_fl; double *A2, *B2, *C2, *C2_fl; int m, n, k, i, j, q; double alpha, beta; double sum; int fail; double t_start, t_end; m = 2000, k = 200, n = 1000; alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); return 1; } printf (" Intializing matrix data \n\n"); for (i = 0; i < (m*k); i++) { A1[i] = A2[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B1[i] = B2[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C1[i] = C2[i] = 0.0; C1_fl[i] = C2_fl[i] = 0.0; } printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); t_start = omp_get_wtime(); #pragma omp target data \ map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ map(tofrom: C1[0:m*n], C2[0:m*n]) { #pragma omp dispatch cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A1, k, B1, n, beta, C1, n); #pragma omp dispatch cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A2, k, B2, n, beta, C2, n); } t_end = omp_get_wtime(); printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2[j+i*n]); } printf ("\n"); } printf (" \nComputing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A1[k*i+q] * B1[n*q+j]; C1_fl[n*i+j] = sum; } } for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A2[k*i+q] * B2[n*q+j]; C2_fl[n*i+j] = sum; } } printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1_fl[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2_fl[j+i*n]); } printf ("\n"); } printf ("\n Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { fail = 1; break; } } if (fail) { printf (" **** FAIL **** \n"); } else { printf(" time = %lf seconds\n", t_end - t_start); printf (" **** PASS **** \n"); } mkl_free(A1); mkl_free(B1); mkl_free(C1); mkl_free(C1_fl); mkl_free(A2); mkl_free(B2); mkl_free(C2); mkl_free(C2_fl); return 0; }
The two calls to cblas_dgemm in the above example can be batched together, resulting in one call to cblas_dgemm_batch, as shown in the following example. Note that the batch is composed of one group of size 2, since we have two matrix multiplications with the same set of parameters (layout, transa, transb, m, n, k, alpha, lda, ldb, beta, and ldc). total_batch_size in this case is 2.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define epsilon 0.0000001f bool compare(double x, double y) { // return true is x and y are the same return (fabs(x - y) <= epsilon); } int main() { double *A1, *B1, *C1, *C1_fl; double *A2, *B2, *C2, *C2_fl; int m, n, k, i, j, q; double alpha, beta; double sum; int fail; double t_start, t_end; m = 2000, k = 200, n = 1000; alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); return 1; } printf (" Intializing matrix data \n\n"); for (i = 0; i < (m*k); i++) { A1[i] = A2[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B1[i] = B2[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C1[i] = C2[i] = 0.0; C1_fl[i] = C2_fl[i] = 0.0; } printf (" \nComputing matrix product using Intel MKL cblas_dgemm_batch function \n"); #define GRP_COUNT 1 // 1 group MKL_INT group_count = GRP_COUNT; MKL_INT group_sizes[GRP_COUNT] = {2}; // 8 matrix multiplications CBLAS_TRANSPOSE transa_array[GRP_COUNT] = {CblasNoTrans}; CBLAS_TRANSPOSE transb_array[GRP_COUNT] = {CblasNoTrans}; MKL_INT m_array[GRP_COUNT] = {m}; MKL_INT n_array[GRP_COUNT] = {n}; MKL_INT k_array[GRP_COUNT] = {k}; MKL_INT lda_array[GRP_COUNT] = {k}; MKL_INT ldb_array[GRP_COUNT] = {n}; MKL_INT ldc_array[GRP_COUNT] = {n}; double alpha_array[GRP_COUNT] = {alpha}; double beta_array[GRP_COUNT] = {beta}; // Number of matrix multiplications = 2 double **a_array, **b_array, **c_array; a_array = (double **)mkl_calloc(2, sizeof( double* ), 64); b_array = (double **)mkl_calloc(2, sizeof( double* ), 64); c_array = (double **)mkl_calloc(2, sizeof( double* ), 64); t_start = omp_get_wtime(); // Call cblas_dgemm_batch #pragma omp target enter data \ map(to: A1[0:m*k], B1[0:k*n], C1[0:m*n]) \ map(to: A2[0:m*k], B2[0:k*n], C2[0:m*n]) #pragma omp target data use_device_ptr(A1, B1, C1, A2, B2, C2) { a_array[0] = A1, a_array[1] = A2; b_array[0] = B1, b_array[1] = B2; c_array[0] = C1, c_array[1] = C2; } #pragma omp target data \ map(to:a_array[0:2], b_array[0:2], c_array[0:2]) { #pragma omp dispatch cblas_dgemm_batch ( CblasRowMajor, transa_array, transb_array, m_array, n_array, k_array, alpha_array, (const double **)a_array, lda_array, (const double **)b_array, ldb_array, beta_array, c_array, ldc_array, group_count, group_sizes); } // end target data map #pragma omp target exit data \ map(from: C1[0:m*n], C2[0:m*n]) t_end = omp_get_wtime(); printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2[j+i*n]); } printf ("\n"); } printf (" \nComputing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A1[k*i+q] * B1[n*q+j]; C1_fl[n*i+j] = sum; } } for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A2[k*i+q] * B2[n*q+j]; C2_fl[n*i+j] = sum; } } printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1_fl[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2_fl[j+i*n]); } printf ("\n"); } printf ("\n Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { fail = 1; break; } } if (fail) { printf (" **** FAIL **** \n"); } else { printf(" time = %lf seconds\n", t_end - t_start); printf (" **** PASS **** \n"); } mkl_free(A1); mkl_free(B1); mkl_free(C1); mkl_free(C1_fl); mkl_free(A2); mkl_free(B2); mkl_free(C2); mkl_free(C2_fl); return 0; }
The performance of the above two examples when running on the particular GPU used (1-stack only) was as follows:
dgemm_example_01_c.cpp (two calls to cblas_dgemm): 2.976183 seconds dgemm_batch_example_01_c.cpp (one call to cblas_dgemm_batch): 1.881641 seconds
A more complex example of batching is shown below. In this example, we have a batch composed of 3 groups (GROUP_COUNT=3). The size of each group is a randomly chosen number between 1 and 10. Several parameters (layout, transA, transB, m, n, and k) are chosen randomly, but in each group the parameters are the same for all the multiplications. The total_batch_size is equal to the sum of all the group sizes.
#include <stdio.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #include "common.h" #define GROUP_COUNT 3 int dnum = 0; int main() { CBLAS_LAYOUT layout = (rand_int(0,1) == 0) ? CblasColMajor : CblasRowMajor; CBLAS_TRANSPOSE *transA, *transB; MKL_INT *m, *n, *k, *lda, *ldb, *ldc; double *alpha, *beta; MKL_INT *group_size, *sizea_array, *sizeb_array, *sizec_array, total_batch_size = 0, sizea, sizeb, sizec; double **a_array, **b_array, **c_array, **c_ref_array; double **a_array_dev, **b_array_dev, **c_array_dev; transA = (CBLAS_TRANSPOSE *)mkl_malloc(GROUP_COUNT * sizeof(CBLAS_TRANSPOSE), 64); transB = (CBLAS_TRANSPOSE *)mkl_malloc(GROUP_COUNT * sizeof(CBLAS_TRANSPOSE), 64); m = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); n = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); k = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); lda = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); ldb = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); ldc = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); group_size = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); alpha = (double *)mkl_malloc(GROUP_COUNT * sizeof(double), 64); beta = (double *)mkl_malloc(GROUP_COUNT * sizeof(double), 64); if ((m == NULL) || (n == NULL) || (k == NULL) || (lda == NULL) || (ldb == NULL) || (ldc == NULL) || (group_size == NULL) || (alpha == NULL) || (beta == NULL)) { printf("Cannot allocate input arrays\n"); return 1; } MKL_INT i, j, p, idx; for (i = 0; i < GROUP_COUNT; i++) { transA[i] = (rand_int(0,1) == 0) ? CblasNoTrans : CblasTrans; transB[i] = (rand_int(0,1) == 0) ? CblasNoTrans : CblasTrans; alpha[i] = rand_double_scalar(); beta[i] = rand_double_scalar(); m[i] = rand_int(1, 20); n[i] = rand_int(1, 20); k[i] = rand_int(1, 20); lda[i] = MAX(m[i], k[i]); ldb[i] = MAX(k[i], n[i]); ldc[i] = MAX(m[i], n[i]); group_size[i] = rand_int(1, 10); total_batch_size += group_size[i]; #ifdef MKL_ILP64 printf("Group %lld: layout = %s, transA = %s, transB = %s, m = %lld, n = %lld, k = %lld, lda = %lld, ldb = %lld, ldc = %lld, alpha = %lf, beta = %lf, group_size = %lld\n", i, (layout == CblasColMajor) ? "Column Major" : "Row Major", (transA[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", (transB[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", m[i], n[i], k[i], lda[i], ldb[i], ldc[i], alpha[i], beta[i], group_size[i]); #else printf("Group %d: layout = %s, transA = %s, transB = %s, m = %d, n = %d, k = %d, lda = %d, ldb = %d, ldc = %d, alpha = %lf, beta = %lf, group_size = %d\n", i, (layout == CblasColMajor) ? "Column Major" : "Row Major", (transA[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", (transB[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", m[i], n[i], k[i], lda[i], ldb[i], ldc[i], alpha[i], beta[i], group_size[i]); #endif } sizea_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); sizeb_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); sizec_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); a_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); b_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); c_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); a_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); b_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); c_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); c_ref_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); if ((sizea_array == NULL) || (sizeb_array == NULL) || (sizec_array == NULL) || (a_array == NULL) || (b_array == NULL) || (c_array == NULL) || (a_array_dev == NULL) || (b_array_dev == NULL) || (c_array_dev == NULL) || (c_ref_array == NULL)) { printf("Cannot allocate matrices and size arrays\n"); return 1; } idx = 0; for (i = 0; i < GROUP_COUNT; i++) { sizea = (((layout == CblasRowMajor) && (transA[i] == CblasTrans)) || ((layout == CblasColMajor) && (transA[i] == CblasNoTrans))) ? lda[i] * k[i] : m[i] * lda[i]; sizeb = (((layout == CblasRowMajor) && (transB[i] == CblasTrans)) || ((layout == CblasColMajor) && (transB[i] == CblasNoTrans))) ? ldb[i] * n[i] : k[i] * ldb[i]; sizec = (layout == CblasColMajor) ? ldc[i] * n[i] : ldc[i] * m[i]; for (j = 0; j < group_size[i]; j++) { a_array[idx] = (double *)mkl_malloc(sizeof(double) * sizea, 64); a_array_dev[idx] = a_array[idx]; sizea_array[idx] = sizea; if (a_array[idx] == NULL) { printf("cannot allocate a matrices\n"); return 1; } b_array[idx] = (double *)mkl_malloc(sizeof(double) * sizeb, 64); b_array_dev[idx] = b_array[idx]; sizeb_array[idx] = sizeb; if (b_array[idx] == NULL) { printf("cannot allocate b matrices\n"); return 1; } c_array[idx] = (double *)mkl_malloc(sizeof(double) * sizec, 64); c_array_dev[idx] = c_array[idx]; sizec_array[idx] = sizec; if (c_array[idx] == NULL) { printf("cannot allocate c matrices\n"); return 1; } c_ref_array[idx] = (double *)mkl_malloc(sizeof(double) * sizec, 64); if (c_ref_array[idx] == NULL) { printf("cannot allocate c_ref matrices\n"); return 1; } init_double_array(sizea, a_array[idx], 1); init_double_array(sizeb, b_array[idx], 1); init_double_array(sizec, c_array[idx], 1); for (p = 0; p < sizec_array[idx]; p++) c_ref_array[idx][p] = c_array[idx][p]; idx++; } } // run gemm_batch on host, use standard oneMKL interface cblas_dgemm_batch(layout, transA, transB, m, n, k, alpha, (const double **) a_array, lda, (const double **) b_array, ldb, beta, c_ref_array, ldc, GROUP_COUNT, group_size); double *a, *b, *c; for (i = 0; i < total_batch_size; i++) { a = a_array[i]; b = b_array[i]; c = c_array[i]; #pragma omp target enter data map(to:a[0:sizea_array[i]],b[0:sizeb_array[i]],c[0:sizec_array[i]]) #pragma omp target data use_device_ptr(a,b,c) { a_array_dev[i] = a; b_array_dev[i] = b; c_array_dev[i] = c; } } #pragma omp target data map(to:a_array_dev[0:total_batch_size], \ b_array_dev[0:total_batch_size], \ c_array_dev[0:total_batch_size]) device(dnum) { #pragma omp dispatch cblas_dgemm_batch(layout, transA, transB, m, n, k, alpha, (const double **) a_array_dev, lda, (const double **) b_array_dev, ldb, beta, c_array_dev, ldc, GROUP_COUNT, group_size); } for (i = 0; i < total_batch_size; i++) { a = a_array[i]; b = b_array[i]; c = c_array[i]; #pragma omp target exit data map(from:a[0:sizea_array[i]],b[0:sizeb_array[i]],c[0:sizec_array[i]]) } double computed, reference, diff; MKL_INT l; idx = 0; for (p = 0; p < GROUP_COUNT; p++) { for (l = 0; l < group_size[p]; l++) { for (i = 0; i < m[p]; i++) { for (j = 0; j < n[p]; j++) { if (layout == CblasColMajor) { computed = c_array[idx][i + j * ldc[p]]; reference = c_ref_array[idx][i + j * ldc[p]]; } else { computed = c_array[idx][j + i * ldc[p]]; reference = c_ref_array[idx][j + i * ldc[p]]; } diff = computed - reference; diff = (diff > 0) ? diff : -diff; if (diff > 0.0001) { #ifdef MKL_ILP64 printf("Error in matrix %lld (group = %lld, matrix index in group = %lld) at index [%lld][%lld], computed = %lf, reference = %lf, difference = %lf\n", idx, p, l, i, j, computed, reference, diff); #else printf("Error in matrix %d at index [%d][%d], computed = %lf, reference = %lf, difference = %lf\n", idx, i, j, computed, reference, diff); #endif free_double_matrices(a_array, total_batch_size); free_double_matrices(b_array, total_batch_size); free_double_matrices(c_array, total_batch_size); free_double_matrices(c_ref_array, total_batch_size); mkl_free(a_array); mkl_free(b_array); mkl_free(c_array); mkl_free(c_ref_array); mkl_free(a_array_dev); mkl_free(b_array_dev); mkl_free(c_array_dev); mkl_free(sizea_array); mkl_free(sizeb_array); mkl_free(sizec_array); mkl_free(transA); mkl_free(transB); mkl_free(m); mkl_free(n); mkl_free(k); mkl_free(lda); mkl_free(ldb); mkl_free(ldc); mkl_free(group_size); mkl_free(alpha); mkl_free(beta); return 1; } } } idx++; } } printf("Validation PASSED\n"); free_double_matrices(a_array, total_batch_size); free_double_matrices(b_array, total_batch_size); free_double_matrices(c_array, total_batch_size); free_double_matrices(c_ref_array, total_batch_size); mkl_free(a_array); mkl_free(b_array); mkl_free(c_array); mkl_free(c_ref_array); mkl_free(a_array_dev); mkl_free(b_array_dev); mkl_free(c_array_dev); mkl_free(sizea_array); mkl_free(sizeb_array); mkl_free(sizec_array); mkl_free(transA); mkl_free(transB); mkl_free(m); mkl_free(n); mkl_free(k); mkl_free(lda); mkl_free(ldb); mkl_free(ldc); mkl_free(group_size); mkl_free(alpha); mkl_free(beta); return 0; }
Speeding Up Independent, Consecutive GEMM Calls
There are various ways to speed up the execution of consecutive GEMM calls that can be executed independently. One way is to batch the GEMM calls by calling the batch version of GEMM as shown above.
Another way is to enclose the calls to GEMM by an OpenMP parallel construct, so each OpenMP thread executing the parallel region dispatches one of the GEMM calls. This parallel approach is illustrated in the following example.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define epsilon 0.0000001f bool compare(double x, double y) { // returns true if x and y are the same return fabs(x - y) <= epsilon; } int main() { double *A1, *B1, *C1, *C1_fl; double *A2, *B2, *C2, *C2_fl; int m, n, k, i, j, q; double alpha, beta; double sum; int fail; double t_start, t_end; m = 2000, k = 200, n = 1000; alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); return 1; } printf (" Intializing matrix data \n\n"); for (i = 0; i < (m*k); i++) { A1[i] = A2[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B1[i] = B2[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C1[i] = C2[i] = 0.0; C1_fl[i] = C2_fl[i] = 0.0; } printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); t_start = omp_get_wtime(); #pragma omp target data \ map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ map(tofrom: C1[0:m*n], C2[0:m*n]) { #pragma omp parallel num_threads(2) { int id = omp_get_thread_num(); if (id == 0) { #pragma omp dispatch cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A1, k, B1, n, beta, C1, n); } else if (id == 1) { #pragma omp dispatch cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A2, k, B2, n, beta, C2, n); } } } t_end = omp_get_wtime(); printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2[j+i*n]); } printf ("\n"); } printf (" \nComputing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A1[k*i+q] * B1[n*q+j]; C1_fl[n*i+j] = sum; } } for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A2[k*i+q] * B2[n*q+j]; C2_fl[n*i+j] = sum; } } printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1_fl[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2_fl[j+i*n]); } printf ("\n"); } printf ("\n Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { fail = 1; break; } } if (fail) { printf (" **** FAIL **** \n"); } else { printf(" time = %lf seconds\n", t_end - t_start); printf (" **** PASS **** \n"); } mkl_free(A1); mkl_free(B1); mkl_free(C1); mkl_free(C1_fl); mkl_free(A2); mkl_free(B2); mkl_free(C2); mkl_free(C2_fl); return 0; }
Yet another way to speed up the execution of independent, consecutive GEMM calls is to use the nowait clause on the dispatch construct so the host thread does not have to wait for a dispatched GEMM call to complete before dispatching the next one. After the last GEMM call, we insert an OpenMP taskwait directive to guarantee that all the dispatched MKL calls complete before the host thread proceeds any further. This nowait approach is illustrated in the following example.
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" #define min(x,y) (((x) < (y)) ? (x) : (y)) #define epsilon 0.0000001f bool compare(double x, double y) { // returns true if x and y are the same return fabs(x - y) <= epsilon; } int main() { double *A1, *B1, *C1, *C1_fl; double *A2, *B2, *C2, *C2_fl; int m, n, k, i, j, q; double alpha, beta; double sum; int fail; double t_start, t_end; m = 2000, k = 200, n = 1000; alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); return 1; } printf (" Intializing matrix data \n\n"); for (i = 0; i < (m*k); i++) { A1[i] = A2[i] = (double)(i+1); } for (i = 0; i < (k*n); i++) { B1[i] = B2[i] = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C1[i] = C2[i] = 0.0; C1_fl[i] = C2_fl[i] = 0.0; } printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); t_start = omp_get_wtime(); #pragma omp target data \ map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ map(tofrom: C1[0:m*n], C2[0:m*n]) { #pragma omp dispatch nowait cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A1, k, B1, n, beta, C1, n); #pragma omp dispatch nowait cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A2, k, B2, n, beta, C2, n); #pragma omp taskwait } t_end = omp_get_wtime(); printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2[j+i*n]); } printf ("\n"); } printf (" \nComputing matrix product using for-loops \n"); for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A1[k*i+q] * B1[n*q+j]; C1_fl[n*i+j] = sum; } } for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { sum = 0.0; for (q = 0; q < k; q++) sum += A2[k*i+q] * B2[n*q+j]; C2_fl[n*i+j] = sum; } } printf ("\n Top left corner of matrix C1: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C1_fl[j+i*n]); } printf ("\n"); } printf ("\n Top left corner of matrix C2: \n"); for (i=0; i<min(m,6); i++) { for (j=0; j<min(n,6); j++) { printf ("%12.5G", C2_fl[j+i*n]); } printf ("\n"); } printf ("\n Computations completed. Verifying... \n\n"); fail = 0; for (i = 0; i < (m*n); i++) { if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { fail = 1; break; } } if (fail) { printf (" **** FAIL **** \n"); } else { printf(" time = %lf seconds\n", t_end - t_start); printf (" **** PASS **** \n"); } mkl_free(A1); mkl_free(B1); mkl_free(C1); mkl_free(C1_fl); mkl_free(A2); mkl_free(B2); mkl_free(C2); mkl_free(C2_fl); return 0; }
Padding of Matrices Used in GEMM Computations
GEMM calls can be sped up by padding the leading dimensions, lda, ldb, and ldc, of the matrices a, b, and c, respectively.
The leading dimension of a matrix depends on the layout of the matrix in memory:
Column major layout (C/C++): The leading dimension of a matrix is the number of columns of the matrix.
Row major layout (Fortran): The leading dimension of a matrix is the number of rows of the matrix.
There are two rules to keep in mind for choosing the sizes of matrices passed to DGEMM calls.
Rule 1: For best performance, the leading dimensions of matrices passed to GEMM calls should be a multiple of 64 bytes (full cache line size). For single precision data, this means the leading dimension should be a multiple of (64 / sizeof(float)), which is equal to 16. For double precision data, this means the leading dimension should be a multiple of (64 / sizeof(double)), which is equal to 8.
Preferably, all matrices (a, b, and c) should be padded. However, padding matrix c is less important than padding a and b.
Rule 2: For best performance, leading dimensions should not be a multiple of a large power of 2 (e.g. 4096 bytes). Increasing the leading dimension slightly (e.g. from 4096 bytes to 4096+64 bytes) can improve performance in some cases.
Padding Example (Fortran)
The following Fortran example illustrates how matrices passed to DGEMM calls may be padded for improved performance.
include "mkl_omp_offload.f90" ! This subroutine reads command line arguments m1, k1, and n1. subroutine get_arguments (m1, k1, n1) implicit none integer :: m1, k1, n1 character(len=32) :: m1_char, k1_char, n1_char ! First, make sure that the right number of command line arguments ! have been provided. if (command_argument_count() .ne. 3) then print *, "ERROR: Three command-line arguments expected; stopping." stop endif ! Get command line arguments. call get_command_argument(1, m1_char) call get_command_argument(2, k1_char) call get_command_argument(3, n1_char) ! Convert arguments to integers. read (m1_char,*) m1 read (k1_char,*) k1 read (n1_char,*) n1 end subroutine get_arguments ! This function returns the smallest multiple of 8 that is >= n. ! Examples: ! if n = 3, then get_mul8 = 8 ! if n = 9, then get_mul8 = 16 ! if n = 30, then get_mul8 = 32 ! if n = 80, then get_mul8 = 8 integer function get_mul8 (n) implicit none integer :: n integer :: mod if (mod(n,8) .eq. 0) then get_mul8 = n else get_mul8 = ((n/8) + 1) * 8 endif end function get_mul8 ! This subroutine initializes matrices. subroutine init_matrix (m, k, n, a, b, c) implicit none integer :: m, k, n double precision :: a(m,k), b(k,n), c(m,n) integer :: i, j do i = 1, m do j = 1, k a(i,j) = (i-1) - (0.25 * k) end do end do do i = 1, k do j = 1, n b(i,j) = -((i-1) + j) end do end do do i = 1, m do j = 1, n c(i,j) = 0.2 + i - j end do end do end subroutine init_matrix program DGEMM_MAIN #if defined(MKL_ILP64) use onemkl_blas_omp_offload_ilp64 #else use onemkl_blas_omp_offload_lp64 #endif use omp_lib use iso_fortran_env implicit none interface integer function get_mul8 (n) implicit none integer :: n end function get_mul8 end interface double precision :: alpha, beta integer :: m1, k1, n1, m2, k2, n2 double precision, allocatable :: a1(:,:) double precision, allocatable :: b1(:,:) double precision, allocatable :: c1(:,:) double precision, allocatable :: a2(:,:) double precision, allocatable :: b2(:,:) double precision, allocatable :: c2(:,:) double precision :: start_t1, end_t1 double precision :: start_t2, end_t2 ! Read command line arguments m1, k1, and n1. call get_arguments (m1, k1, n1) ! ! Initialize alpha, beta, and m2, k2, n2 ! alpha = 1.025 beta = 0.75 m2 = get_mul8(m1) k2 = get_mul8(k1) n2 = get_mul8(n1) ! ! Allocate and initialize matrices. ! allocate( a1(1:m1,1:k1) ) allocate( b1(1:k1,1:n1) ) allocate( c1(1:m1,1:n1) ) allocate( a2(1:m2,1:k2) ) allocate( b2(1:k2,1:n2) ) allocate( c2(1:m2,1:n1) ) call init_matrix (m1, k1, n1, a1, b1, c1) call init_matrix (m2, k2, n2, a2, b2, c2) !$omp target data map(to: a1, b1, a2, b2) map(tofrom: c1, c2) ! Warm up run on device !$omp dispatch call DGEMM('N','N',m1,n1,k1,alpha,a1,m1,b1,k1,beta,c1,m1) ! ! Run DGEMM on device (using matrices a1, b1, and c1) ! start_t1 = omp_get_wtime() !$omp dispatch call DGEMM('N','N',m1,n1,k1,alpha,a1,m1,b1,k1,beta,c1,m1) end_t1 = omp_get_wtime() ! Warm up run on device !$omp dispatch call DGEMM('N','N',m2,n2,k2,alpha,a2,m2,b2,k2,beta,c2,m2) ! ! Run DGEMM on device (using padded matrices a2, b2, and c2) ! start_t2 = omp_get_wtime() !$omp dispatch call DGEMM('N','N',m2,n2,k2,alpha,a2,m2,b2,k2,beta,c2,m2) end_t2 = omp_get_wtime() !$omp end target data print 100, alpha, beta print * print 101, m1, n1, k1 print 111, (end_t1 - start_t1) print * print 102, m2, n2, k2 print 112, (end_t2 - start_t2) 100 format(7x, "ALPHA =", f10.4, " BETA =",f10.4) 101 format(7x, "M1 =", i5," N1 =", i5, " K1 =",i5) 111 format(7x, "Time (non-padded arrays) =", f10.4, " sec") 102 format(7x, "M2 =", i5," N2 =", i5, " K2 =",i5) 112 format(7x, "Time (padded arrays) =", f10.4, " sec") end
In the above example, the array bounds (m1, k1, and n1) are input as command line arguments. The matrices a1(1:m1, 1:k1), b1(1:k1, 1:n1), and c1(1:m1, 1:n1) are allocated and initialized.
Also, the padded matrices a2(1:m2, 1:k2), b2(1:k2, 1:n2), and c1(1:m2, 1:n2) are allocated and initialized. m2 is the smallest multiple of 8 that is >= m1. Similarly, k2 is the smallest multiple of 8 that is >= k1, and n2 is the smallest multiple of 8 that is >= n1.
The program compares the time taken by the DGEMM computation on a1, b1, and c1 versus the time taken by the DGEMM computation on a2, b2, and c2.
The compilation, link and run commands used are shown below.
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c dgemm_pad_f_01.f Link: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_f_01.o Run: ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./a.out 12001 12001 12001
The output on the particular GPU used (1-stack only) was as follows:
ALPHA = 1.0250 BETA = 0.7500 M1 =12001 N1 =12001 K1 =12001 Time (non-padded arrays) = 0.2388 sec M2 =12008 N2 =12008 K2 =12008 Time (padded arrays) = 0.1648 sec
The above shows that padding arrays a, b, and c, the time taken by the DGEMM calls was reduced from 0.2388 seconds to 0.1648 seconds.
Padding Example (C/C++)
The following is a C/C++ example that illustrates how matrices passed to DGEMM calls may be padded for improved performance.
#include "mkl.h" #include "mkl_omp_offload.h" #include <float.h> #include <math.h> #include <omp.h> #include <stdio.h> #include <stdlib.h> #include <sys/time.h> #if defined(PRECISION) #if PRECISION == 1 #define FLOAT double #else #define FLOAT float #endif #else #define PRECISION 1 #define FLOAT double #endif #define index(i, j, ld) (((j) * (ld)) + (i)) #define RAND() ((FLOAT)rand() / (FLOAT)RAND_MAX * 2.0 - 1.0) #define MALLOC(x) mkl_malloc((x), 64); #define FREE mkl_free #define MALLOC_CHECK(p) \ if (p == NULL) { \ fprintf(stderr, "%s:%d: memory allocation error\n", __FILE__, __LINE__); \ return EXIT_FAILURE; \ } #ifndef max #define max(a, b) (((a) > (b)) ? (a) : (b)) #endif #ifndef min #define min(a, b) (((a) < (b)) ? (a) : (b)) #endif void printMat(FLOAT *P, int m, int n, int ld) { int i, j; for (i = 0; i < m; i++) { printf("\n"); for (j = 0; j < n; j++) printf("%f ", P[index(i, j, ld)]); } printf("\n"); } void gemm_naive(int m, int n, int k, FLOAT alpha, const FLOAT *A, int lda, const FLOAT *B, int ldb, FLOAT beta, FLOAT *C, int ldc) { int i, j, kk; FLOAT temp; for (j = 0; j < n; j++) { for (i = 0; i < m; i++) { temp = 0.0; for (kk = 0; kk < k; kk++) { temp += alpha * A[index(i, kk, lda)] * B[index(kk, j, ldb)]; } C[index(i, j, ldc)] = temp + beta * C[index(i, j, ldc)]; } } } FLOAT infinity_norm_error(int m, int n, int ld, const FLOAT *ans, const FLOAT *ref) { int i, j, ind; FLOAT norm = 0.0, temp; for (i = 0; i < m; i++) { temp = 0.0; for (j = 0; j < n; j++) { ind = index(i, j, ld); temp += fabs(ref[ind] - ans[ind]); } norm = max(norm, temp); } return norm; } double mysecond() { struct timeval tp; struct timezone tzp; int i; i = gettimeofday(&tp, &tzp); if (i != 0) { fprintf(stderr, "%s:%d: timing error %d\n", __FILE__, __LINE__, i); return EXIT_FAILURE; } return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6); } #if defined(USE_MKL) int dnum = 0; #endif #if PRECISION == 1 #define LD_ALIGN 256 #define LD_BIAS 8 #else #define LD_ALIGN 512 #define LD_BIAS 16 #endif #define HPL_PTR(ptr_, al_) ((((size_t)(ptr_) + (al_)-1) / (al_)) * (al_)) #if defined(PAD_LD) static inline int getld(int x) { int ld; ld = HPL_PTR(x, LD_ALIGN); // Rule 1 if (ld - LD_BIAS >= x) ld -= LD_BIAS; else ld += LD_BIAS; // Rule 2 return ld; } #else static inline int getld(int x) { return x; } #endif int main(int argc, char **argv) { int i, j; if ((argc < 4) || (argc > 4 && argc < 8)) { printf("Performs a DGEMM test C = alpha*A*B + beta*C\n"); printf("A matrix is MxK and B matrix is KxN\n"); printf("All matrices are stored in column-major format\n"); printf("Run as ./dgemm_cublas <M> <K> <N> [<alpha> <beta> <iterations> " "<verify>]\n"); printf("Required inputs are:\n"); printf(" M: number of rows of matrix A\n"); printf(" K: number of cols of matrix A\n"); printf(" N: number of cols of matrix B\n"); printf("Optional inputs are (all must be provided if providing any):\n"); printf(" alpha: scalar multiplier (default: 1.0)\n"); printf(" beta: scalar multiplier (default: 0.0)\n"); printf(" iterations: number of blocking DGEMM calls to perform " "(default: 10)\n"); printf(" verify: set to 1 to check solution against CPU reference, " "not recommended for large M|K|N (default: 0)\n"); return EXIT_FAILURE; } FLOAT alpha, beta; int niter, verify; int HA = atoi(argv[1]); int WA = atoi(argv[2]); int WB = atoi(argv[3]); if ((HA == 0) || (WA == 0) || (WB == 0)) exit(1); if (argc > 4) { #if PRECISION == 1 sscanf(argv[4], "%lf", &alpha); sscanf(argv[5], "%lf", &beta); #else sscanf(argv[4], "%f", &alpha); sscanf(argv[5], "%f", &beta); #endif niter = atoi(argv[6]); verify = atoi(argv[7]); } else { alpha = 1.0; beta = 0.0; niter = 10; verify = 0; } #if PRECISION == 1 printf("DGEMM performance test\n"); #else printf("SGEMM performance test\n"); #endif int HB = WA; int WC = WB; int HC = HA; int ldA = getld(HA); int ldB = getld(HB); int ldC = getld(HC); printf("M = %d, K = %d, N = %d, ldA = %d, ldB = %d, ldC = %d, alpha = %f, " "beta = %f, iterations = %d, verify? = %d\n", HA, WA, WB, ldA, ldB, ldC, alpha, beta, niter, verify); double start_t, end_t, tot_t = 0.0, best_t = DBL_MAX; /*ALLOCATE HOST ARRAYS*/ FLOAT *A = (FLOAT *)MALLOC(ldA * WA * sizeof(FLOAT)); MALLOC_CHECK(A); FLOAT *B = (FLOAT *)MALLOC(ldB * WB * sizeof(FLOAT)); MALLOC_CHECK(B); FLOAT *C = (FLOAT *)MALLOC(ldC * WC * sizeof(FLOAT)); MALLOC_CHECK(C); FLOAT *Cref = NULL; if (verify) { Cref = (FLOAT *)MALLOC(ldC * WC * sizeof(FLOAT)); MALLOC_CHECK(Cref); } printf("\n---------------------\n"); printf("Array A: %d x %d\n", ldA, WA); printf("Array B: %d x %d\n", ldB, WB); printf("Array C: %d x %d\n", ldC, WC); printf("---------------------\n"); /*INITIALIZE WITH PSEUDO-RANDOM DATA*/ srand(2864); for (j = 0; j < WA; j++) for (i = 0; i < HA; i++) A[index(i, j, ldA)] = RAND(); for (j = 0; j < WB; j++) for (i = 0; i < HB; i++) B[index(i, j, ldB)] = RAND(); if (beta != 0.0) { for (j = 0; j < WC; j++) for (i = 0; i < HC; i++) C[index(i, j, ldC)] = RAND(); } else { for (j = 0; j < WC; j++) for (i = 0; i < HC; i++) C[index(i, j, ldC)] = 0.0; } if (verify) { for (j = 0; j < WC; j++) for (i = 0; i < HC; i++) Cref[index(i, j, ldC)] = C[index(i, j, ldC)]; } #if defined(USE_MKL) size_t sizea = (size_t)ldA * WA; size_t sizeb = (size_t)ldB * WB; size_t sizec = (size_t)ldC * WC; #pragma omp target data map(to: A [0:sizea], B [0:sizeb]) map(tofrom: C [0:sizec]) { // warm-up run #pragma omp dispatch #if PRECISION == 1 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, ldA, B, ldB, beta, C, ldC); #else cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, ldA, B, ldB, beta, C, ldC); #endif // run gemm on gpu, using dispatch construct for (i = 0; i < niter; i++) { start_t = mysecond(); #pragma omp dispatch #if PRECISION == 1 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, ldA, B, ldB, beta, C, ldC); #else cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, ldA, B, ldB, beta, C, ldC); #endif end_t = mysecond(); tot_t += end_t - start_t; best_t = min(best_t, end_t - start_t); } // end for } #endif // end #pragma omp target data double tflop_count; tflop_count = (double)2.0 * HA * WB * WA; if (beta != 0.0) tflop_count += (double)HA * WB; tflop_count *= 1.E-12; printf("Total runtime for %d iterations: %f seconds.\n", niter, tot_t); printf("Mean TFLOP/s: %f\n", (double)niter * tflop_count / tot_t); printf("Best TFLOP/s: %f\n", (double)tflop_count / best_t); if (verify) { // compute reference solution on host (1 added to niter to account for the // warm-up run) for (i = 0; i < (niter + 1); i++) { gemm_naive(HA, WB, WA, alpha, A, ldA, B, ldB, beta, Cref, ldC); } printf("Error Matrix Infinity Norm = %f\n", infinity_norm_error(HA, WB, ldC, C, Cref)); } FREE(A); FREE(B); FREE(C); if (verify) FREE(Cref); return EXIT_SUCCESS; }
The program reads command line arguments that specify the dimensions of the matrices a, b, and c, as well as some additional (optional) parameters. If the PAD_LD macro is defined, then the leading dimensions of the matrices are padded, according to Tul1 and Rule2 above. Then the program calls cblas_dgemm or cblas_sgemmniter times, and outputs the total runtime.
Compile and run (``LD_PAD`` macro is not defined)
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DUSE_MKL -DPRECISION=1 -c dgemm_pad_c_01.cpp icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_c_01.o -o dgemm_nopad.exe Run: ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./dgemm_nopad.exe 4001 4001 4001 1.0 0.0 100 0
The performance (with no padding) was as follows on the particular GPU used (1-stack only):
DGEMM performance test M = 15000, K = 15000, N = 15000, ldA = 15000, ldB = 15000, ldC = 15000, alpha = 1.000000, beta = 0.000000, iterations = 100, verify? = 0 --------------------- Array A: 15000 x 15000 Array B: 15000 x 15000 Array C: 15000 x 15000 --------------------- Total runtime for 100 iterations: 138.368065 seconds. Mean TFLOP/s: 4.878293 Best TFLOP/s: 5.657242
Compile and run (``LD_PAD`` macro is defined)
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DUSE_MKL -DPRECISION=1 -DPAD_LD -c dgemm_pad_c_01.cpp Link: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_c_01.o -o dgemm_pad.exe Run: ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./dgemm_pad.exe 4001 4001 4001 1.0 0.0 100 0
The performance with padding, where the leading dimensions of matrices a, b, and c was increased from 15000 to 15096), was as follows on the particular GPU used (1-stack only).
DGEMM performance test M = 15000, K = 15000, N = 15000, ldA = 15096, ldB = 15096, ldC = 15096, alpha = 1.000000, beta = 0.000000, iterations = 100, verify? = 0 --------------------- Array A: 15096 x 15000 Array B: 15096 x 15000 Array C: 15096 x 15000 --------------------- Total runtime for 100 iterations: 145.219459 seconds. Mean TFLOP/s: 4.648137 Best TFLOP/s: 5.525268