Intel® VTune™ Profiler

Performance Analysis Tutorial for Linux* OS

ID 762029
Date 12/20/2024
Public

Analyze Memory Access

Run the Memory Access analysis to understand why the sample application is memory-bound.

Run Memory Access Analysis

  1. In the Performance Snapshot result, click the Memory Access icon. You can also click Configure Analysis in the Intel® VTune™ Profiler welcome screen and select Memory Access in the HOW pane.

  2. In the HOW pane, set these options:

    • Set CPU sampling interval to 1 ms. Since the matrix sample is a small workload, a small sampling interval drives Intel® VTune™ Profiler to collect more samples for higher precision.
    • Select Analyze dynamic memory objects. If the Intel® VTune™ Profiler driver is available, this setting exposes the latency of specific memory allocations in the results.
    • Disable the Analyze OpenMP regions option. This option is not necessary for the matrix application.

  3. Click the Start button to run the analysis.
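The same collection can also be launched from the command line with the vtune CLI. This is a sketch, not the tutorial's own steps: the knob names and the result-directory name r001ma are assumptions, and available knobs vary by VTune version (check `vtune -help collect memory-access` on your installation).

```shell
# Collect Memory Access data with a 1 ms sampling interval and
# dynamic memory object analysis enabled. Knob names may differ
# between VTune versions; verify with: vtune -help collect memory-access
vtune -collect memory-access \
      -knob sampling-interval=1 \
      -knob analyze-mem-objects=true \
      -result-dir r001ma \
      -- ./matrix

# After the collection finalizes, print the summary without the GUI.
vtune -report summary -result-dir r001ma
```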

Interpret Memory Access Data

When the analysis is finished and Intel® VTune™ Profiler finalizes the result, the Summary pane opens in the Memory Usage viewpoint.

This result confirms that the sample application is severely bound by memory accesses. Although the load operations are served primarily from the L3 cache (rather than the slower DRAM), the application should be able to use the L1 or L2 cache. But this is not happening.

The Bandwidth Utilization Histogram shows that the application makes no use of DRAM bandwidth.

The Top Memory Objects by Latency section shows that the total latency is caused by loads from one object in particular: the allocation at line 115 in the matrix.c file.

Switch to the Bottom-up pane to see the exact metrics for the multiply1 function.

In the bottom grid, select the grouping Function / Memory Object / Allocation Stack.

The multiply1 function is at the top of the grid with the high CPU Time and Memory Bound metric values. Expand the multiply1 function to see the collected memory objects.

Note that the Average Latency (cycles) value is high. This is primarily due to the object allocated at line 115 in matrix.c. Double-click this allocation to view it in the source code.

This buf2 allocation is ultimately assigned to matrix 'b' at line 133 below.

Switch back to the Bottom-up pane. Double-click the multiply1 function to see where loads from this matrix occur.

As explained in the code sample, the iteration over matrix 'b' uses a large stride, which can cause poor cache utilization.

One way to resolve this issue is to apply the Loop Interchange technique. In this example, the technique changes the way the rows and columns of the matrices are addressed in the main loop. This change eliminates the inefficient memory access pattern and enables the processor to make better use of the LLC.