Intel® VTune™ Profiler

Performance Analysis Tutorial for Windows* OS

ID 762031
Date 12/20/2024
Public

Assess the Performance Improvement

After resolving the memory access issue, run the HPC Performance Characterization analysis to assess how much the performance improved. Your Performance Snapshot result also recommended this analysis.

Run HPC Performance Characterization Analysis

  1. In the Intel® VTune™ Profiler welcome screen, click Configure Analysis.
  2. Click anywhere in the HOW pane to open the Analysis Tree.
  3. In the Parallelism group, select HPC Performance Characterization.
  4. Click Start to run the analysis.

When configuring the analysis, you may need to browse to the executable that was generated during recompilation in the previous step. Its location depends on your compiler and IDE. For example, by default, Visual Studio* places the executable in [matrix]\vc15\x64\Release.

Interpret Your Result

When the HPC Performance Characterization analysis completes, the result opens in the Summary window.

In the Summary window, you can observe that:

  • The Elapsed Time has decreased significantly because you removed the memory access bottleneck. Before the fix, the memory access pattern caused the processor to miss the cache frequently and request data from DRAM, which is very expensive in terms of latency.

  • The overall Vectorization metric is 100%, which indicates that the code was vectorized. However, the metric is still flagged in red as a bottleneck because the vectorization was not optimal. This analysis ran on an Intel® processor that supports the AVX instruction set, yet the metric shows that 100% of instructions were executed using 128-bit registers, so none of the 256-bit registers were used. Intel® VTune™ Profiler therefore flags the 100% utilization of 128-bit vector registers as an issue.
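
To illustrate why the memory access pattern matters so much for Elapsed Time, here is a minimal sketch in C of two loop orders for a matrix multiply. It is not the tutorial sample's source code: the size N and the function names multiply_naive, multiply_interchanged, and check_same are hypothetical and chosen only for illustration. In the first order, the inner loop reads b with a stride of N elements, which tends to miss the cache. In the second order, the inner loop reads b and c along a row, so consecutive iterations touch adjacent memory.

```c
#include <assert.h>
#include <stddef.h>

#define N 64 /* hypothetical matrix size, for illustration only */

/* Loop order (i, j, k): the inner loop walks b down a column, so
   consecutive iterations touch addresses N doubles apart. */
void multiply_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Loop order (i, k, j): the inner loop walks b and c along a row, so
   consecutive iterations touch adjacent doubles. */
void multiply_interchanged(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Sanity check: both loop orders must produce the same product. */
void check_same(double x[N][N], double y[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(x[i][j] == y[i][j]);
}
```

Both functions accumulate into c, so the caller is expected to zero c first. The row-wise inner loop is also the form that compilers generally find easier to vectorize, which is relevant to the Vectorization metric discussed above.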

In the Vectorization section, focus on the Top Loops/Functions with FPU Usage by CPU Time subsection.

Note that the main loop of the multiply2 function was vectorized using the older SSE2 instruction set, while compilation and analysis were performed on a processor that supports AVX. Therefore, a portion of hardware resources remains underutilized.
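
One way to see the gap between SSE2 and AVX code generation is to ask the compiler which vector instruction set it was allowed to target. The following C sketch relies on predefined macros (__AVX512F__, __AVX__, and __SSE2__ on GCC-compatible compilers; __AVX__, _M_X64, and _M_IX86_FP on Microsoft Visual C++). The function names target_vector_bits and doubles_per_vector are hypothetical. The result reflects only the build options, not what a particular loop actually used or what the processor supports, so it complements the VTune Profiler metric rather than replacing it.

```c
#include <assert.h>

/* Width, in bits, of the widest vector registers the compiler was
   allowed to target for this translation unit (0 if none detected). */
int target_vector_bits(void)
{
#if defined(__AVX512F__)
    return 512;
#elif defined(__AVX__)
    return 256;
#elif defined(__SSE2__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 128;
#else
    return 0;
#endif
}

/* Number of 64-bit double values that fit in one vector register. */
int doubles_per_vector(int bits)
{
    assert(bits >= 0 && bits % 64 == 0);
    return bits / 64;
}
```

A 128-bit SSE2 register holds two doubles and a 256-bit AVX register holds four, so a loop compiled for SSE2 processes half as many elements per vector instruction as the same loop compiled for AVX.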

The next step is to enable platform-appropriate vectorization.