Visible to Intel only — GUID: GUID-55B76678-0837-4780-909A-60526DF2A39B
XPU Offload Analysis
Use the XPU Offload analysis to profile and optimize artificial intelligence (AI) workloads running on Intel architectures such as Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
XPU refers collectively to NPUs, GPUs, and CPU device cores. GPUs are a popular hardware choice for compute-intensive or graphics-intensive applications. An NPU accelerates AI workloads that the operating system has explicitly offloaded onto it; NPUs are designed specifically to improve the performance of AI and machine-learning (ML) workloads.
Use the Intel® Distribution of OpenVINO™ toolkit to offload popular ML models (such as speech or image recognition tasks) to Intel NPUs. Then use the XPU Offload analysis to profile these AI/ML workloads, collect performance data, and optimize their performance.
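Before profiling, an application typically selects the device it offloads to. The sketch below is illustrative: the `pick_device` helper and its preference order are hypothetical, though OpenVINO device names such as "NPU", "GPU", and "CPU" are the ones a compiled model would target.

```python
# Hypothetical helper: choose an offload target for an ML model.
# The NPU-first preference order is an assumption, not an OpenVINO API.
def pick_device(available_devices):
    """Prefer NPU, then GPU, then fall back to CPU."""
    for device in ("NPU", "GPU", "CPU"):
        if device in available_devices:
            return device
    return "CPU"

# With OpenVINO installed, the chosen name could be passed along the
# lines of core.compile_model(model, pick_device(core.available_devices)).
print(pick_device(["CPU", "GPU", "NPU"]))  # NPU preferred when present
print(pick_device(["CPU"]))
```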
Default Settings for XPU Data Collection
When you run the XPU Offload analysis to collect data for an XPU device, Intel® VTune™ Profiler collects the following information in Time-based mode:

| | Time-based mode |
|---|---|
| Data collection | Intel® VTune™ Profiler collects metrics system-wide, similar to CPU uncore metrics. |
| Size of typical workload | Large |
| Execution time of instance | >5 ms |
| Sampling interval | 1 ms |
| Benefits | Use this mode for larger workloads. Optimize applications with reasonable efficiency and reduced overhead. |
| Usage considerations | This mode imposes less overhead on the application. It requires the Level Zero backend to be installed, with standard NPU drivers. However, the application does not need to use Level Zero for metric collection, except for computing tasks. |
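As a quick back-of-the-envelope check on why the >5 ms instance duration and 1 ms sampling interval in the table fit together (the numbers come from the table; the helper itself is illustrative):

```python
# Estimate how many time-based samples land inside one task instance.
def expected_samples(instance_ms, interval_ms):
    return instance_ms // interval_ms

# A >5 ms instance sampled every 1 ms yields at least 5 samples,
# enough to attribute time to the instance with reasonable confidence.
print(expected_samples(5, 1))  # 5
```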
Configure and Run Analysis
In the VTune Profiler user interface, in the Accelerators group of the Analysis Tree, select XPU Offload (preview).
In the WHAT pane, specify the path to the AI/ML application in the Application field.
If necessary, specify relevant Application parameters as well.
In the HOW pane, select your Target Devices.
Set these collection options as needed:
- Trace computing programming APIs - Set this option to analyze SYCL, Level Zero, OpenCL™, and Intel® Video Processing Library (Intel® VPL) programs that run on Intel architectures (such as GPUs or NPUs). Selecting this option can impact CPU performance.
- Collect host stacks - Set this option to analyze call stacks executed on the CPU and identify critical paths. Examine the CPU-side stacks for GPU and NPU computing tasks to investigate the efficiency of your XPU offload. When results display, use the Call Stack mode in the filter bar to sort through SYCL*, Level Zero, or OpenCL™ runtime call stacks.
- Show GPU performance insights - Set this option to collect metrics based on the analysis of Processor Graphics events. Use these GPU performance metrics to estimate the efficiency of hardware usage and learn about next steps.
Click the Start button to run the analysis.
The XPU Offload analysis profiles these metrics related to the performance of your GPU:
| Performance Metric | Description |
|---|---|
| EU Array | Shows the breakdown of GPU core array cycles. |
| EU Threads Occupancy | Shows the normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled. |
| Computing Threads Started | Shows the number of threads started across all EUs for compute work. |
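The EU Threads Occupancy definition above can be sketched as a ratio: slot-cycles with a thread scheduled, divided by total slot-cycle capacity. This helper and its sample numbers are illustrative, not VTune Profiler's internal formula:

```python
# Illustrative occupancy calculation (assumed formula, not VTune internals).
def eu_threads_occupancy(scheduled_slot_cycles, total_cycles, slots_per_eu, num_eus):
    """Fraction of (cycle, thread-slot) pairs that had a thread scheduled."""
    capacity = total_cycles * slots_per_eu * num_eus
    return scheduled_slot_cycles / capacity

# 4 EUs with 7 thread slots each over 1000 cycles -> 28,000 slot-cycles
# of capacity; 14,000 scheduled slot-cycles gives 50% occupancy.
print(eu_threads_occupancy(14000, 1000, 7, 4))  # 0.5
```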
To run the XPU Offload analysis from the command line, type:
$ vtune -collect xpu-offload [-knob <knob_name=knob_option>] -- <target> [target_options]
To generate the command line for any analysis configuration, use the Command Line button at the bottom of the user interface.
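For scripted runs (for example, in CI), the same command line can be assembled programmatically. This sketch only reproduces the `vtune -collect xpu-offload` invocation shown above; the `./my_app` target and its arguments are placeholders:

```python
# Build the vtune command line shown above as an argument list.
# "./my_app" and its options are hypothetical placeholders.
def vtune_cmd(target, *target_args):
    return ["vtune", "-collect", "xpu-offload", "--", target, *target_args]

# The list form can be passed to subprocess.run() without shell quoting.
print(" ".join(vtune_cmd("./my_app", "--iterations", "10")))
```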
Once VTune Profiler completes data collection, the results of the XPU Offload analysis appear in the XPU Offload viewpoint.