Intel® VTune™ Profiler

User Guide

ID 766319
Date 3/22/2024
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Window: Graphics - GPU Compute/Media Hotspots

Use this window for GPU analysis with Intel® VTune™ Profiler to identify GPU tasks with high GPU utilization and estimate the effectiveness of this utilization. This view is particularly useful for analysis of OpenCL™, SYCL, and Intel Media SDK applications doing substantial computation work on the GPU.

To access this window: Select the GPU Compute/Media Hotspots viewpoint and click the Graphics sub-tab in the result tab.

Along with the regular bottom-up analysis and stack data, the Graphics window correlates CPU / GPU busyness and displays the distribution of the GPU metrics over time:

Grid. Analyze basic performance metrics per program unit and identify the most time-consuming units. If your application uses the OpenCL software technology and you ran the analysis with the Trace GPU Programming APIs option enabled, the grid is grouped by Computing Task Purpose granularity by default.

Analyze and optimize hot kernels with the longest Total Time values first. These include kernels characterized by long Average Time values and kernels whose Average Time values are not long, but they are invoked more frequently than the others (see Instance Count values). Both groups deserve attention. For more details, see GPU OpenCL™ Application Analysis.

To understand the CPU activity (which module/function was executed and its CPU time) while the GPU execution units were idle, queued, or busy executing some code, use the GPU Render and EU Engine State grouping level:

Thread. Explore CPU and GPU utilization by a particular thread. The Platform tab displays the thread name as a name of the module where the thread function resides. For example, if you have a myFoo function that belongs to MyMegaFoo (Linux*) or MyMegaFoo.dll (Windows*) function, the thread name is displayed as MyMegaFoo (Linux*) or MyMegaFoo.dll (Windows*) . This approach helps easily identify the location of the thread code producing the work displayed on the timeline.

Windows* targets only: Correlate CPU and GPU usage and estimate whether your application is GPU bound. GPU Engines Usage bars show DMA packets on CPU threads originating GPU tasks. The bars are colored according to the type of used GPU engine (yellow bars in the example above correspond to the Render and GPGPU engine).

GPU hardware metrics. If you enabled the Analyze Processor Graphics hardware events option for GPU analysis on the processors with the Intel® HD and Intel® Iris® Graphics, the VTune Profiler displays the statistics for the selected group of metrics over time.

For example, for the default Overview group of metrics, you may start with GPU Execution Units: EU Array Idle metric. Idle cycles are wasted cycles. No threads are scheduled and the EUs' precious computational resources are not being utilized. If EU Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them.

In most cases the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU Array Active. The exception is memory bandwidth-bound algorithms and workloads where optimization should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than maximize EU Array Active).

Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory accesses cannot be overestimated. If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.

Sampler accesses are expensive and can easily cause stalls. Sampler accesses are measured by the Sampler Is Bottleneck and Sampler Busy metrics.

NOTE:

To analyze Intel Graphics hardware events on a GPU, make sure to set up your system for GPU analysis.

Computing Queue. Analyze details on OpenCL kernels submission, in particular distinguish the order of submission and execution, and identify the time spent in the queue, zoom in and explore the Computing Queue data. VTune Profiler displays kernels with the same name and global/local size in the same color. Synchronization tasks are marked with vertical hatching . Data transfers are marked with cross-diagonal hatching .

You can click a kernel task to highlight the whole queue to the execution displayed at the top layer. Hover over an object in the queue to see kernel execution parameters.

Windows targets only: Switch to the Platform window to explore how the execution path of the OpenCL device queue correlates to the DMA packets software queue.

GPU Usage metrics. GPU usage bars are colored according to the type of used GPU engine.

Theoretically, if the Platform tab shows that the GPU is busy most of the time and having small idle gaps between busy intervals and the GPU software queue is rarely decreased to zero, your application is GPU bound. If the gaps between busy intervals are big and the CPU is busy during these gaps, your application is CPU bound. But such obvious situations are rare and you need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engines usage is serialized (for example, when GPU engines responsible for video processing and for rendering are loaded in turns). In this case, an ineffective scheduling on the GPU results from the application code running on the CPU.

For further OpenCL kernel analysis, select a computing task you are interested in (for example, AdvancedPaths) and switch to the Architecture Diagram tab. VTune Profiler displays performance data per GPU hardware metrics for the time range when the selected kernel was executed:

Flagged values signal a performance issue. In this example, ~50% of the GPU time was spent in stalls. This means that performance is limited by memory or sampler accesses.