Analyze Memory Access

Intel® VTune™ Profiler

Performance Analysis Tutorial for Windows* OS

ID 762031

Date 12/20/2024

Run the Memory Access analysis to understand why the sample application is memory-bound.

Run Memory Access Analysis

In the Performance Snapshot result, click the Memory Access icon. You can also click Configure Analysis in the Intel® VTune™ Profiler welcome screen and select Memory Access in the HOW pane.
In the HOW pane, disable the Analyze OpenMP regions option. This option is not necessary for the matrix application.
Click the Start button to run the analysis.

Interpret Memory Access Data

When the analysis is finished and Intel® VTune™ Profiler finalizes the result, the Summary pane opens in the Memory Usage viewpoint.

This result again shows that the sample application is severely bound by memory accesses. However, the system is not bound by DRAM bandwidth alone. This detail indicates that the application is limited by frequent but small requests to memory, rather than by saturated physical DRAM bandwidth.

Switch to the Bottom-up pane to see the exact metrics for the multiply1 function.

The multiply1 function is at the top of the grid, with high CPU Time and Memory Bound metric values.

Note that the LLC Miss Count metric is also very high. This value indicates that the application's memory access pattern uses the cache poorly. As a result, the processor frequently misses in the last-level cache (LLC) and requests data from DRAM, which is expensive in terms of latency.
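To make the link between access pattern and cache misses concrete, the following minimal sketch counts how many distinct cache lines a sequence of reads touches. The function name cache_lines_touched and the 64-byte line size are assumptions for illustration (64 bytes is typical for current x86 processors); neither comes from the tutorial or from VTune Profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper: counts the distinct cache lines touched when reading
// `count` doubles that are `stride_elems` elements apart, starting at byte
// offset 0. The 64-byte default line size is an assumption.
std::size_t cache_lines_touched(std::size_t count, std::size_t stride_elems,
                                std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < count; ++i) {
        // Byte offset of the i-th access, then the cache line it falls into.
        std::size_t byte_offset = i * stride_elems * sizeof(double);
        lines.insert(byte_offset / line_bytes);
    }
    return lines.size();
}
```

With 8-byte doubles, reading 1024 consecutive elements touches 128 lines, while reading 1024 elements that are 1024 elements apart touches 1024 lines: every access needs a new line, so far more data has to come from DRAM for the same amount of useful work.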

Double-click the multiply1 function in the Bottom-up grid to open the Source window.

In the Source window, the most time-consuming line is in the loop that performs the matrix multiplication in the multiply1 function.

As explained in the code sample, the loop steps through matrix 'b' with a large stride, so consecutive accesses land far apart in memory. This stride can cause poor cache utilization.
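The tutorial's source code is not reproduced on this page, so the following is only a sketch of what a loop with this stride problem typically looks like. The function multiply_naive and the flat row-major std::vector layout are assumptions for illustration; the actual multiply1 function may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tutorial's multiply1; the real source may differ.
// Computes c += a * b for n x n row-major matrices stored in flat vectors.
void multiply_naive(std::size_t n, const std::vector<double>& a,
                    const std::vector<double>& b, std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) {
                // a[i*n + k] advances by one element per k (unit stride).
                // b[k*n + j] advances by n elements per k (large stride).
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

In the innermost loop, each step of k moves the access into 'b' forward by a whole row (n elements), so for large n almost every read of 'b' falls on a different cache line.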

One way to resolve this issue is to apply the Loop Interchange technique. In this example, the technique changes the way the rows and columns of the matrices are addressed in the main loop. This change eliminates the inefficient memory access pattern and enables the processor to make better use of the LLC.
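As an illustration of loop interchange, the sketch below swaps the two inner loops of a conventional i, j, k matrix multiplication so that the innermost loop walks along rows. The function multiply_interchanged and the flat row-major std::vector layout are assumptions for illustration, not the tutorial's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical loop-interchanged multiplication (i, k, j order).
// Computes c += a * b for n x n row-major matrices stored in flat vectors.
void multiply_interchanged(std::size_t n, const std::vector<double>& a,
                           const std::vector<double>& b, std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            // a[i*n + k] does not depend on j, so read it once per k.
            const double a_ik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                // Both b[k*n + j] and c[i*n + j] now advance by one element
                // per j, so the innermost loop has unit stride throughout.
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
}
```

Each element of 'c' still accumulates the same products in the same k order, so the result is unchanged; only the order in which memory is visited differs. Neighboring accesses now share cache lines, which is what lets the processor reuse data already in the cache.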
