oneMKL Code Sample
To demonstrate a typical workflow for the oneMKL SYCL* interfaces, the following source code performs a double-precision matrix-matrix multiplication on a GPU device.
// Standard SYCL header
#include <CL/sycl.hpp>
// STL classes
#include <exception>
#include <iostream>
// Declarations for Intel oneAPI Math Kernel Library SYCL/DPC++ APIs
#include "oneapi/mkl.hpp"
int main(int argc, char *argv[]) {
    //
    // User obtains data here for A, B, C matrices, along with setting
    // m, n, k, ldA, ldB, ldC and the scalars alpha and beta.
    //
    // For this example, A, B and C should be initially stored in a std::vector,
    // or a similar container having data() and size() member functions.
    //
    // Create GPU device
    sycl::device my_device;
    try {
        my_device = sycl::device(sycl::gpu_selector());
    }
    catch (...) {
        std::cout << "Warning: GPU device not found! Using default device instead." << std::endl;
    }
    // Create asynchronous exceptions handler to be attached to queue.
    // Not required; can provide helpful information in case the system isn’t correctly configured.
    auto my_exception_handler = [](sycl::exception_list exceptions) {
        for (std::exception_ptr const& e : exceptions) {
            try {
                std::rethrow_exception(e);
            }
            catch (sycl::exception const& e) {
                std::cout << "Caught asynchronous SYCL exception:\n"
                          << e.what() << std::endl;
            }
            catch (std::exception const& e) {
                std::cout << "Caught asynchronous STL exception:\n"
                          << e.what() << std::endl;
            }
        }
    };
    // create execution queue on my gpu device with exception handler attached
    sycl::queue my_queue(my_device, my_exception_handler);
    // create sycl buffers of matrix data for offloading between device and host
    sycl::buffer<double, 1> A_buffer(A.data(), A.size());
    sycl::buffer<double, 1> B_buffer(B.data(), B.size());
    sycl::buffer<double, 1> C_buffer(C.data(), C.size());
    // add oneapi::mkl::blas::gemm to execution queue and catch any synchronous exceptions
    try {
        using oneapi::mkl::blas::gemm;
        using oneapi::mkl::transpose;
        gemm(my_queue, transpose::nontrans, transpose::nontrans, m, n, k, alpha, A_buffer, ldA, B_buffer,
             ldB, beta, C_buffer, ldC);
    }
    catch (sycl::exception const& e) {
        std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n"
                  << e.what() << std::endl;
    }
    catch (std::exception const& e) {
        std::cout << "\t\tCaught synchronous STL exception during GEMM:\n"
                  << e.what() << std::endl;
    }
    // ensure any asynchronous exceptions caught are handled before proceeding
    my_queue.wait_and_throw();
    //
    // post process results
    //
    // Access data from C buffer and print out part of C matrix
    auto C_accessor = C_buffer.template get_access<sycl::access::mode::read>();
    // Print the label "C" as a string; the container C itself cannot be streamed
    std::cout << "\t" << "C" << " = [ " << C_accessor[0] << ", "
              << C_accessor[1] << ", ... ]\n";
    std::cout << "\t [ " << C_accessor[1 * ldC + 0] << ", "
              << C_accessor[1 * ldC + 1] << ", ... ]\n";
    std::cout << "\t [ " << "... ]\n";
    std::cout << std::endl;
    return 0;
}
Consider double-precision matrices A (of size m-by-k), B (of size k-by-n), and C (of size m-by-n) stored in arrays on the host machine with leading dimensions ldA, ldB, and ldC, respectively. Given double-precision scalars alpha and beta, compute the matrix-matrix multiplication (oneapi::mkl::blas::gemm):
C = alpha * A * B + beta * C
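For reference, the same operation can be written as a plain host loop, which can be handy for checking a small GPU result. The sketch below assumes column-major storage (understood to be the default layout for oneapi::mkl::blas::gemm), so element (i, j) of a matrix with leading dimension ld sits at offset i + j * ld. The function name reference_gemm is hypothetical and is not part of oneMKL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not a oneMKL API):
// computes C = alpha * A * B + beta * C with no transposes.
// Assumes column-major storage: element (i, j) is at i + j * ld.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    double alpha, const std::vector<double>& A, std::size_t ldA,
                    const std::vector<double>& B, std::size_t ldB,
                    double beta, std::vector<double>& C, std::size_t ldC) {
    // Each leading dimension must cover a full column of its matrix.
    assert(ldA >= m && ldB >= k && ldC >= m);
    // Conservative size check: ld * (number of columns) elements per array.
    assert(A.size() >= ldA * k && B.size() >= ldB * n && C.size() >= ldC * n);

    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i + p * ldA] * B[p + j * ldB];
            }
            C[i + j * ldC] = alpha * sum + beta * C[i + j * ldC];
        }
    }
}
```

Comparing a few entries of the device result against this loop, within a floating-point tolerance, is one simple way to sanity-check the arguments passed to gemm.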
Include the standard SYCL header and the oneMKL SYCL/DPC++ header that declares the oneapi::mkl::blas::gemm API:
// Standard SYCL header
#include <CL/sycl.hpp>
// STL classes
#include <exception>
#include <iostream>
// Declarations for Intel oneAPI Math Kernel Library SYCL/DPC++ APIs
#include "oneapi/mkl.hpp"
Next, load or instantiate the matrix data on the host machine as usual, then create the GPU device and an asynchronous exception handler to attach to the queue. Exceptions that occur on the host can be caught with standard C++ exception handling mechanisms; exceptions that occur on a device, however, are treated as asynchronous errors and stored in an exception list to be processed later by this user-provided handler.
// Create GPU device
sycl::device my_device;
try {
    my_device = sycl::device(sycl::gpu_selector());
}
catch (...) {
    std::cout << "Warning: GPU device not found! Using default device instead." << std::endl;
}
// Create asynchronous exceptions handler to be attached to queue.
// Not required; can provide helpful information in case the system isn’t correctly configured.
auto my_exception_handler = [](sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
        try {
            std::rethrow_exception(e);
        }
        catch (sycl::exception const& e) {
            std::cout << "Caught asynchronous SYCL exception:\n"
                      << e.what() << std::endl;
        }
        catch (std::exception const& e) {
            std::cout << "Caught asynchronous STL exception:\n"
                      << e.what() << std::endl;
        }
    }
};
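The core pattern in this handler, rethrowing each stored std::exception_ptr inside a try block so that it can be caught by type, can be tried out without a device. The sketch below is plain C++: a std::vector<std::exception_ptr> stands in for sycl::exception_list, and collect_messages is a hypothetical helper, not a SYCL or oneMKL function.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for processing a sycl::exception_list:
// rethrows each stored exception and records its message.
std::vector<std::string> collect_messages(const std::vector<std::exception_ptr>& exceptions) {
    std::vector<std::string> messages;
    for (std::exception_ptr const& eptr : exceptions) {
        if (!eptr) {
            continue;  // rethrowing a null exception_ptr is undefined behavior
        }
        try {
            std::rethrow_exception(eptr);
        }
        catch (std::exception const& e) {
            messages.push_back(e.what());
        }
        catch (...) {
            // Anything not derived from std::exception lands here
            messages.push_back("unknown exception type");
        }
    }
    return messages;
}
```

As in the handler above, more specific catch clauses go first; a catch for a derived exception type placed after the std::exception clause would never be reached.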
Create the queue on the device with the exception handler attached, then wrap the matrix data in SYCL buffers, which lets the runtime move the data to the device and back to the host as needed. Finally, call oneapi::mkl::blas::gemm with the buffers, sizes, and transpose operations; this enqueues the matrix multiply kernel and its data on the queue.
// create execution queue on my gpu device with exception handler attached
sycl::queue my_queue(my_device, my_exception_handler);
// create sycl buffers of matrix data for offloading between device and host
sycl::buffer<double, 1> A_buffer(A.data(), A.size());
sycl::buffer<double, 1> B_buffer(B.data(), B.size());
sycl::buffer<double, 1> C_buffer(C.data(), C.size());
// add oneapi::mkl::blas::gemm to execution queue and catch any synchronous exceptions
try {
    using oneapi::mkl::blas::gemm;
    using oneapi::mkl::transpose;
    gemm(my_queue, transpose::nontrans, transpose::nontrans, m, n, k, alpha, A_buffer, ldA, B_buffer,
         ldB, beta, C_buffer, ldC);
}
catch (sycl::exception const& e) {
    std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n"
              << e.what() << std::endl;
}
catch (std::exception const& e) {
    std::cout << "\t\tCaught synchronous STL exception during GEMM:\n"
              << e.what() << std::endl;
}
The gemm kernel executes at some point after it has been enqueued. The call to wait_and_throw() makes the queue wait for all kernels to finish and then passes any caught asynchronous exceptions to the exception handler. The runtime handles the transfer of buffer data between the host and the GPU device and back. By the time an accessor is created for C_buffer, the buffer data will have been transferred back to the host machine if necessary. Here, the accessor is used to print a 2x2 submatrix of C_buffer.
// ensure any asynchronous exceptions caught are handled before proceeding
my_queue.wait_and_throw();
// Access data from C buffer and print out part of C matrix
auto C_accessor = C_buffer.template get_access<sycl::access::mode::read>();
// Print the label "C" as a string; the container C itself cannot be streamed
std::cout << "\t" << "C" << " = [ " << C_accessor[0] << ", "
          << C_accessor[1] << ", ... ]\n";
std::cout << "\t [ " << C_accessor[1 * ldC + 0] << ", "
          << C_accessor[1 * ldC + 1] << ", ... ]\n";
std::cout << "\t [ " << "... ]\n";
std::cout << std::endl;
return 0;
Note that the resulting data is still held by the C_buffer object. Unless it is explicitly copied elsewhere (for example, back to the original C container), it remains available only through accessors until C_buffer goes out of scope.
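To make the flat indexing used in the print statements concrete, the sketch below copies the leading 2x2 block out of a flat array using its leading dimension. It assumes column-major storage (understood to be the oneapi::mkl::blas::gemm default), where element (i, j) lives at offset i + j * ld; under that assumption, offset 1 * ld + 0 holds row 0, column 1. The helper top_left_2x2 is hypothetical and works on a std::vector rather than a SYCL accessor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns {c00, c01, c10, c11} from column-major data,
// where element (i, j) is stored at offset i + j * ld.
std::array<double, 4> top_left_2x2(const std::vector<double>& data, std::size_t ld) {
    // Need at least two rows and enough elements to reach offset 1 + 1 * ld.
    assert(ld >= 2 && data.size() >= ld + 2);
    return {{data[0 + 0 * ld], data[0 + 1 * ld],
             data[1 + 0 * ld], data[1 + 1 * ld]}};
}
```

With row-major storage the roles of the two indices swap and element (i, j) would sit at i * ld + j instead, so it is worth confirming which layout applies before reading the printed rows as matrix rows.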