Visible to Intel only — GUID: GUID-D47549A3-903D-482A-87D8-499E7F0BCB9C
Visible to Intel only — GUID: GUID-D47549A3-903D-482A-87D8-499E7F0BCB9C
Run a GPU Roofline Analysis
To estimate actual performance of a GPU application against hardware-imposed ceilings, you can use the GPU Roofline Insights feature. Intel® Advisor can generate a roofline model for kernels running on Intel GPUs. The GPU Roofline model offers a very efficient way to characterize your kernels and visualize how far you are from ideal performance. For details about the GPU Roofline, see the Intel Advisor User Guide.
Prerequisites: It is recommended to run the GPU Roofline with root privileges on Linux* OS or as an administrator on Windows* OS.
Linux OS Users
If you do not have root permissions on Linux, configure your system to enable collecting GPU metrics for non-root users:
Add your username to the video group. To check if you are already in the video group:
groups | grep video
If you are not part of the video group, add your username to it:
sudo usermod -a -G video <username>
Set the value of the dev.i915.perf_stream_paranoid sysctl option to 0:
sysctl -w dev.i915.perf_stream_paranoid=0
Disable time limits to run the OpenCL kernel for a longer period:
sudo sh -c "echo N> /sys/module/i915/parameters/enable_hangcheck"
All Users
Make sure that your SYCL code runs correctly on the GPU. To check which hardware you are running on, add the following to your SYCL code and run it:
sycl::default_selector selector; sycl::queue queue(delector); auto d = queue.get_device(); std::cout<<Running on :<<d.get_info<cl::sycl::info::device::name>()<<std::endl;
Set up the Intel Advisor environment for Linux OS:
source <advisor_install_dir>/env/vars.sh
and for Windows OS:
<install-dir>/advisor-vars.bat
To run the GPU Roofline analysis in the Intel Advisor CLI:
Run the Survey analysis with the profile-gpu option:
advisor -collect=survey --profile-gpu --project-dir=./advisor-project --search-dir src:r=./matrix_multiply -- matrix_multiply
Run the Trip Count and FLOP analysis with --profile-gpu::
advisor --collect=tripcounts --stacks --flop --profile-gpu --project-dir=./advisor-project --search-dir src:r=./matrix_multiply -- matrix_multiply
Open the generated GPU Roofline report in the Intel Advisor GUI. Review the following metrics for the DPC++ Matrix Multiply application:
In the Summary tab, view top hotspots and the memory layout in the Top Hotspots pane.
See how efficiently your application uses execution units for each GPU Kernel in the GPU Kernels pane. Clicking on a top hotspot in the Top Hotspot will show the XVE usage in GPU Kernels pane.
In the GPU Roofline tab, see the GPU Roofline chart and performance metrics.
The Matrix Multiply application achieves 396.97 GFLOPS of performance. It uses global memory and is not optimized for local (SLM) memory because the application uses a global accessor.
This Roofline view shows 3 dots for a single loop, each dot having the same performance (vertical position) and different arithmetic intensities (horizontal position).
The application is bounded by L3 bandwidth as represented by the green circle on the Roofline.
The gray circle on the left is CARM dot and it considers memory transfers between registers and L1.
The purple circle on the right is HBM dot and it considers memory transfers between L3 and HBM.
As the GPU Roofline chart suggests, several possible optimizations might result in more efficient memory usage:
Use local memory (SLM).
Use the cache blocking technique to better use SLM.
The following code is the optimized version of the Matrix Multiply application. In this version, we declare two tiles and define them as sycl::access::target:local. We also modify the kernel to process these tiles in some inner loops.
// Replaces accessorC reference with a local variable void multiply1_1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM]) { int i, j, k; // Declare a deviceQueue sycl::queue q(sycl::default_selector_v, exception_handler); cout << "Running on " << q.get_device().get_info<sycl::info::device::name>() << "\n"; // Declare a 2 dimensional range sycl::range<2> matrix_range{NUM, NUM}; // Declare 3 buffers and Initialize them sycl::buffer<TYPE, 2> bufferA((TYPE *)a, matrix_range); sycl::buffer<TYPE, 2> bufferB((TYPE *)b, matrix_range); sycl::buffer<TYPE, 2> bufferC((TYPE *)c, matrix_range); // Submit our job to the queue q.submit([&](auto &h) { // Declare 3 accessors to our buffers. The first 2 read and the last // read_write sycl::accessor accessorA(bufferA, h, sycl::read_only); sycl::accessor accessorB(bufferB, h, sycl::read_only); sycl::accessor accessorC(bufferC, h); // Execute matrix multiply in parallel over our matrix_range // ind is an index into this range h.parallel_for(matrix_range, [=](sycl::id<2> ind) { int k; TYPE acc = 0.0; for (k = 0; k < NUM; k++) { // Perform computation ind[0] is row, ind[1] is col acc += accessorA[ind[0]][k] * accessorB[k][ind[1]]; } accessorC[ind[0]][ind[1]] = acc; }); }).wait_and_throw(); } // Replaces accessorC reference with a local variable and adds matrix tiling void multiply1_2(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM]) { int i, j, k; // Declare a deviceQueue sycl::queue q(sycl::default_selector_v, exception_handler); cout << "Running on " << q.get_device().get_info<sycl::info::device::name>() << "\n"; // Declare a 2 dimensional range sycl::range<2> matrix_range{NUM, NUM}; sycl::range<2> tile_range{MATRIXTILESIZE, MATRIXTILESIZE}; // Declare 3 buffers and Initialize them sycl::buffer<TYPE, 2> bufferA((TYPE *)a, matrix_range); sycl::buffer<TYPE, 2> bufferB((TYPE *)b, matrix_range); sycl::buffer<TYPE, 2> bufferC((TYPE *)c, matrix_range); // Submit our job to the queue q.submit([&](auto &h) { // Declare 3 accessors to our buffers. The first 2 read and the last // read_write sycl::accessor accessorA(bufferA, h, sycl::read_only); sycl::accessor accessorB(bufferB, h, sycl::read_only); sycl::accessor accessorC(bufferC, h); // Create matrix tiles sycl::local_accessor<TYPE, 2> aTile( sycl::range<2>(MATRIXTILESIZE, MATRIXTILESIZE), h); sycl::local_accessor<TYPE, 2> bTile( sycl::range<2>(MATRIXTILESIZE, MATRIXTILESIZE), h); // Execute matrix multiply in parallel over our matrix_range // ind is an index into this range h.parallel_for(sycl::nd_range<2>(matrix_range, tile_range), [=](cl::sycl::nd_item<2> it) { int k; const int numTiles = NUM / MATRIXTILESIZE; const int row = it.get_local_id(0); const int col = it.get_local_id(1); const int globalRow = MATRIXTILESIZE * it.get_group(0) + row; const int globalCol = MATRIXTILESIZE * it.get_group(1) + col; TYPE acc = 0.0; for (int t = 0; t < numTiles; t++) { const int tiledRow = MATRIXTILESIZE * t + row; const int tiledCol = MATRIXTILESIZE * t + col; aTile[row][col] = accessorA[globalRow][tiledCol]; bTile[row][col] = accessorB[tiledRow][globalCol]; it.barrier(sycl::access::fence_space::local_space); for (k = 0; k < MATRIXTILESIZE; k++) { // Perform computation ind[0] is row, ind[1] is col acc += aTile[row][k] * bTile[k][col]; } it.barrier(sycl::access::fence_space::local_space); } accessorC[globalRow][globalCol] = acc; }); }).wait_and_throw(); } // multiply1_2
Save the optimized version as multiply_1_2 and rerun the GPU Roofline. As the new chart shows:
The optimized application achieves performance of 3994.98 GFLOPS.
The application uses global and SLM memory, which represents the 16x16 tile. This increases memory bandwidth.