Large Language Models (LLMs) are the fundamental building blocks of many Generative AI (GenAI) applications, and the efficiency of such workloads largely depends on that of the constituent LLMs. With Intel® VTune™ Profiler, you can identify and fix the performance bottlenecks of LLMs to build highly optimized GenAI applications. This performance analysis and debugging tool allows you to examine many aspects of your workloads, such as I/O performance, hotspots, memory allocation and consumption, HPC performance characterization, and more, with support for several programming languages and frameworks, including C/C++, SYCL*, Python*, Java*, OpenVINO™, and OpenCL*.
Analyzing LLMs to fix performance issues is challenging for several reasons:
- Dynamic behavior of LLMs: LLMs respond dynamically based on input data and may generate different outputs for the same input, making it difficult to predict their performance across various tasks.
- Scale and complexity: LLMs may contain billions of parameters, complicating performance analysis.
- Tuning and optimization: Analyzing the impact of various hyperparameter tuning and optimization techniques requires expertise and extensive experimentation.
- Bias, interpretability, and environmental variability: LLMs may reflect biases in the training data and behave unpredictably in different environments. Interpreting why a model performs a certain way on specific tasks is often non-trivial.
This blog discusses a recently added recipe in the VTune Profiler Cookbook that demonstrates how to use the tool to profile the performance of LLMs in a GenAI application deployed with the Intel® Distribution of OpenVINO™ Toolkit and run on the latest Intel® Core™ Ultra 200V series processors (code-named Lunar Lake).
Check out the full recipe: Profiling Large Language Models on Intel Core Ultra 200V.
Prerequisites and Environment Setup
The recipe is based on the following hardware configurations and software tools:
- GenAI software: Phi-3 application and the Hugging Face* phi-3-mini-4k-instruct LLM
- Performance analysis tool: VTune Profiler (v2025.0 or newer)
- Inference engine: Intel Distribution of OpenVINO Toolkit (v2024.3 or newer – direct download)
- Hardware: Intel Core Ultra 5 Processor 238V
- Operating system: Microsoft* Windows
Setting up the environment involves building OpenVINO from source, followed by building and executing the LLM application.
Refer to the ‘Set Up Your Environment’ section in the recipe for detailed steps.
Analyze LLM Performance using Intel® VTune™ Profiler
The GPU Offload Analysis feature of VTune Profiler allows you to analyze code execution on the CPU and GPU and correlate activity between them. For instance, the Graphics window of the GPU Offload Analysis result shows the total time spent on each computation task for various operations (resource allocation, data transfer, task execution, etc.). The Platform window of the analysis result shows which models have been deployed on the CPU and which on the GPU. It also shows various metrics (such as GPU Memory Access, CPU Time, GPU Frequency, and more), as shown in Fig. 1 below. You can correlate these metrics to understand GPU resource utilization.
Fig.1: GPU Offload Analysis
Correlating the above metrics lets you identify which operation consumed the most time per kernel on the CPU or the GPU (the 'hotspot'). You can then run the GPU Compute/Media Hotspots Analysis to understand the behavior of the hotspots in detail. The Summary window of the analysis result lets you examine bandwidth utilization at each level of the memory hierarchy, such as the read/write bandwidth of GPU memory and shared local memory. Fig. 2 below shows an example histogram of the bandwidth utilized by GPU memory read operations.
Fig.2: Bandwidth Utilization Histogram
VTune Profiler also allows you to analyze memory-bound performance issues through a Memory Hierarchy Diagram. The Platform window shows trends for various metrics such as GPU L3 cache bandwidth and misses, GPU frequency, GPU memory access, and more.
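Both analyses attribute time to individual computation tasks. If you also want VTune Profiler to show the time spent in specific phases of your own application code, you can mark those phases with the Instrumentation and Tracing Technology (ITT) APIs, the same mechanism VTune Profiler relies on to time OpenVINO inference stages (see the next section). Below is a minimal, hypothetical sketch; it assumes you include ittnotify.h and link against the ittnotify library that ships with VTune Profiler, and the domain and task names are arbitrary labels chosen for illustration:

#include <ittnotify.h>

// The domain and task names are arbitrary labels; they appear in the VTune timeline.
static __itt_domain* domain = __itt_domain_create("GenAI.Sample");
static __itt_string_handle* generation_task = __itt_string_handle_create("token_generation");

void generate_next_token() {
    __itt_task_begin(domain, __itt_null, __itt_null, generation_task);
    // ... run one LLM generation step here ...
    __itt_task_end(domain);
}

Time spent between __itt_task_begin and __itt_task_end then shows up as a named task in the analysis timelines alongside the tasks reported for OpenVINO itself.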
Optimize LLM Performance at various GenAI Development Stages using OpenVINO™
VTune Profiler uses the Instrumentation and Tracing Technology (ITT) APIs to measure the time spent on LLM inference tasks executed through OpenVINO. The recipe discusses four scenarios where VTune Profiler can help you optimize the GenAI workload:
- Optimize model compilation: The GPU Compute/Media Hotspots Analysis result shows the compile time for every inference. You can accelerate model compilation using the model caching feature of OpenVINO, as follows:
ov::Core core;
core.set_property(ov::cache_dir("/path/to/cache/dir"));  // cache compiled blobs in this directory
auto compiled = core.compile_model(modelPath, device, config);  // subsequent runs load from the cache
Learn more about the OpenVINO model caching feature.
- Scaled Dot-Product Attention (SDPA) subgraph optimization: 'Attention' in neural networks refers to the model's mechanism of dynamically focusing on specific parts of the input data while making predictions. In Transformer models, SDPA is the core mechanism for computing attention scores. The Multi-Head Attention (MHA) mechanism extends SDPA by running several attention heads in parallel and combining their SDPA results.
OpenVINO provides a ScaledDotProductAttention operator for SDPA fusion (a process of combining subgraphs within a larger graph for improved performance). The operator helps reduce inference time and mitigate memory-bound issues in SDPA. It also helps improve parallelism and memory access patterns in MHA. Using the VTune Profiler, you can compare the inference times and memory bottlenecks before and after optimization with OpenVINO.
- Optimize matrix multiplications (MatMul) and logits: The term 'vocabulary size' in LLMs refers to the number of unique tokens the model can recognize. The larger the vocabulary size, the greater the number of MatMul computations and the amount of memory required. The recipe shows how to reduce MatMul computations and the memory used for logits (the unnormalized scores computed by the last layer of the neural network) by applying an additional graph optimization. Using VTune Profiler, you can then verify the performance improvements and the reduction in inference time.
- Key-Value (KV) cache optimization: While generating text, an LLM computes key and value (KV) vectors for each input token. The KV-cache optimization technique avoids recomputing the same keys and values repeatedly by storing those from previous steps. OpenVINO provides a Stateful API that keeps the KV-cache as an internal state of the model between two consecutive inference calls, which significantly reduces inference time and memory-copy overhead (a minimal sketch of using this API appears after this list).
What’s Next?
Check out the complete recipe for step-by-step details on enhancing LLM performance with VTune Profiler for optimized GenAI applications. Get started with VTune Profiler today to mitigate or resolve hardware- and software-level performance bottlenecks in your workloads.
We also encourage you to explore our other oneAPI-powered toolkits for AI and HPC.
Get The Software
Install VTune Profiler as part of the Intel® oneAPI Base Toolkit, or download its standalone version.