Programming Intel® XMX Using SYCL Joint Matrix Extension
Joint matrix is a new SYCL extension for programming tensor hardware. It unifies targets such as Intel® Advanced Matrix Extensions (Intel® AMX) on CPUs, Intel® Xe Matrix Extensions (Intel® XMX) on GPUs, and NVIDIA* Tensor Cores. For many users and applications, frameworks like TensorFlow and libraries like oneDNN are the right answer. However, for users who want to build their own neural network applications, these libraries and frameworks can be too high-level, because they do not allow custom optimizations, and too heavyweight, because of their large size. Moreover, new operations keep appearing and changing in the machine learning domain, and frameworks and libraries do not always provide timely, well-performing implementations of them. For such cases, joint matrix offers a lower level of abstraction than the frameworks to provide performance, productivity, and fusion capabilities, while at the same time offering portability: one code base can target different tensor hardware.
The detailed specification of this extension can be found here.
The joint matrix extension consists of a new joint_matrix type, explicit memory operations joint_matrix_load and joint_matrix_store, joint_matrix_fill for matrix initialization, joint_matrix_mad for the actual multiply-and-add operations, and get_wi_data()[] for element-wise operations.
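As a minimal sketch of the initialization and element-wise entry points (assuming the same TM, TN, and SG_SZ constants as the full example below and an alpha scaling factor in scope; the exact spelling of get_wi_data has varied across revisions of the experimental extension):

using namespace sycl::ext::oneapi::experimental::matrix;
// Inside a kernel body executed by the work-items of sub-group sg:
joint_matrix<sycl::sub_group, float, use::accumulator, TM, TN> sub_c;
// Initialize every element of the accumulator tile to zero.
joint_matrix_fill(sg, sub_c, 0.0f);
// Element-wise operation: each work-item updates only the slice of the tile it owns.
auto wi_data = sub_c.get_wi_data();
for (size_t i = 0; i < wi_data.length(); ++i)
  wi_data[i] = wi_data[i] * alpha;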
In the code below, the kernel uses the joint matrix interface by declaring three joint_matrix matrices (sub_a, sub_b, and sub_c) and computing the operation C += A * B. DPAS (Dot Product and Accumulate Systolic) is the name of the elementary operation performed by Intel® XMX. For more examples that use joint_matrix, please refer to the Intel llvm-test-suite repo.
In order for this example to run successfully on Intel® XMX, the GPU must have Intel® XMX hardware. There is no emulation or fallback strategy in the joint matrix implementation.
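Because there is no fallback, it can be useful to confirm at run time which device the queue selected before launching the kernel. The following sketch uses only standard SYCL device queries; mapping the reported name to an XMX-capable part (for example, an Intel® Data Center GPU Max Series device) is a manual check, not an official capability query:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  sycl::device dev = q.get_device();
  // Print the selected device; verify manually that it has Intel® XMX hardware,
  // since joint_matrix has no emulation path on devices without it.
  std::cout << "Running on: " << dev.get_info<sycl::info::device::name>() << "\n";
  return 0;
}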
size_t NDRangeM = M / TM;
size_t NDRangeN = N / TN;
sycl::buffer<bfloat16, 2> bufA(A.get_data(), sycl::range<2>(M, K));
sycl::buffer<bfloat16, 2> bufB(B.get_data(), sycl::range<2>(K, N));
sycl::buffer<float, 2> bufC((float *)C.get_data(), sycl::range<2>(M, N));

sycl::queue q;
q.submit([&](sycl::handler &cgh) {
   sycl::accessor accC(bufC, cgh, sycl::read_write, sycl::no_init);
   sycl::accessor accA(bufA, cgh, sycl::read_only, sycl::no_init);
   sycl::accessor accB(bufB, cgh, sycl::read_only, sycl::no_init);

   cgh.parallel_for(
       sycl::nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
       [=](sycl::nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
         // The joint matrix API has to be accessed by all the work-items in a
         // sub-group: these functions are called once per sub-group, with no
         // code divergence between the work-items.
         const auto global_idx = spmd_item.get_global_id(0);
         const auto global_idy = spmd_item.get_global_id(1);
         const auto sg_startx = global_idx - spmd_item.get_local_id(0);
         const auto sg_starty = global_idy - spmd_item.get_local_id(1);

         sycl::sub_group sg = spmd_item.get_sub_group();
         sycl::ext::oneapi::experimental::matrix::joint_matrix<
             sycl::sub_group, bfloat16, use::a, TM, TK, layout::row_major>
             sub_a;
         // For B, we assume B has already been VNNIed.
         sycl::ext::oneapi::experimental::matrix::joint_matrix<
             sycl::sub_group, bfloat16, use::b, TK, TN,
             sycl::ext::intel::experimental::matrix::layout::packed>
             sub_b;
         sycl::ext::oneapi::experimental::matrix::joint_matrix<
             sycl::sub_group, float, use::accumulator, TM, TN>
             sub_c;

         joint_matrix_load(sg, sub_c,
                           accC.get_pointer() + (sg_startx * TM) * N +
                               sg_starty / SG_SZ * TN,
                           N, layout::row_major);
         for (int k = 0; k < K / TK; k += 1) {
           joint_matrix_load(
               sg, sub_a, accA.get_pointer() + (sg_startx * TM) * K + k * TK,
               K);
           joint_matrix_load(sg, sub_b,
                             accB.get_pointer() + (k * TK / 2) * (N * 2) +
                                 sg_starty / SG_SZ * TN * 2,
                             N * 2);
           sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
         }
         joint_matrix_store(sg, sub_c,
                            accC.get_pointer() + (sg_startx * TM) * N +
                                sg_starty / SG_SZ * TN,
                            N, layout::row_major);
       }); // parallel for
 }).wait();
In order to get optimal performance on Intel® XMX, the GEMM kernel has to be written in a way that keeps feeding Intel® XMX with the data it needs to perform the maximum number of multiply-and-add operations per cycle. Some of these tuning techniques are the following:
One sub-group can perform multiple DPAS operations.
Blocking for cache locality should be done on the three dimensions i, j, and k. Blocking on the first two dimensions is included in the global range of the kernel; the k-level cache blocking is done within the kernel body. For example, by choosing block factors of 256x256x32, the global range becomes:
range<2> global{M / 256, N / 256 * SG_SIZE};
Then, by choosing one sub-group to perform a 64x32x32 block of elements, the local range becomes (the two ranges are combined into an nd_range launch in the sketch after this list):
range<2> local{256 / 64, 256 / 32 * SG_SIZE};
A large register file per thread gives the best results for the GEMM kernel. This can be explicitly requested in the compilation command as follows:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device pvc -internal_options -ze-opt-large-register-file" joint-matrix-bf16.cpp
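As a sketch of how the two ranges above can be assembled into a kernel launch (assuming each sub-group computes a 64x32 tile of C, that M, N, and SG_SIZE divide evenly, and that the global size here counts work-items, so it is expressed per sub-group tile rather than per 256x256 block):

// Hypothetical block factors matching the example above: 256x256 cache blocks,
// 64x32 elements of C per sub-group, k blocked by 32 inside the kernel body.
constexpr size_t BM = 256, BN = 256;
constexpr size_t SG_M = 64, SG_N = 32;
sycl::range<2> global{M / SG_M, N / SG_N * SG_SIZE};
sycl::range<2> local{BM / SG_M, BN / SG_N * SG_SIZE};
q.parallel_for(
    sycl::nd_range<2>{global, local},
    [=](sycl::nd_item<2> item) [[intel::reqd_sub_group_size(SG_SIZE)]] {
      // Each work-group computes one 256x256 block of C; each of its
      // sub-groups computes a 64x32 tile, looping over k in chunks of 32
      // as in the main example above.
    });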