Intel® VTune™ Profiler

Performance Analysis Tutorial for Linux* OS

ID 762029
Date 12/20/2024
Public

Interpret Your Performance Snapshot Result

Identify the main problem areas in the matrix application.

Once the Performance Snapshot analysis is finished, the Summary window displays the result.

Understand the Summary

In the Summary window:

  • In the Analysis Tree diagram, look for analysis types you should consider to investigate performance issues. These analysis types are highlighted in red.
  • To estimate the severity of an issue, see the highlighted metrics in the right pane. Expand a metric to see the lower-level metrics that contribute to it.
  • To learn about the system you used to run Performance Snapshot, expand Collection and Platform Info. This information can be useful when you compare results across different hardware platforms.

Identify Problem Areas

In the matrix sample, observe these indicators that highlight some performance bottlenecks:

  • The Elapsed Time for this application is low. This is because the workload is small for the capabilities of the platform.

  • The IPC (Instructions per Cycle) metric value is very low for a modern superscalar processor, which can typically complete about 4 instructions per cycle. This low IPC value indicates that the processor was stalled for most of the run time.

  • The Vectorization section reports no vectorization, even though the sample application performs floating-point operations.

  • The Logical Core Utilization metric is low. This indicates inadequate threading.
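To make the IPC indicator concrete, here is a minimal sketch of the arithmetic behind it: retired instructions divided by elapsed core cycles, compared against the roughly 4 instructions per cycle mentioned above. The counter values are made-up placeholders, not measurements from the matrix sample, and the function names are hypothetical helpers rather than anything VTune Profiler provides.

```python
def ipc(instructions_retired: int, cpu_cycles: int) -> float:
    """Return instructions per cycle for a pair of raw counts."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return instructions_retired / cpu_cycles


def fraction_of_peak(ipc_value: float, peak_ipc: float = 4.0) -> float:
    """Compare an IPC value with a nominal peak (about 4, per the text above)."""
    if peak_ipc <= 0:
        raise ValueError("peak_ipc must be positive")
    return ipc_value / peak_ipc


# Hypothetical counts, chosen only to illustrate a heavily stalled run.
instructions = 2_000_000_000
cycles = 10_000_000_000

value = ipc(instructions, cycles)
print(f"IPC = {value:.2f} ({fraction_of_peak(value):.0%} of nominal peak)")
```

An IPC this far below the nominal peak means that most cycles retire no instructions, which is why the text reads it as a stalled processor.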

At this point, you have identified the following potential performance issues, each paired with an analysis type that can help you investigate it. Additionally, Performance Snapshot recommends one more analysis type: the Hotspots analysis.

Performance Issue    Analysis Type for Further Investigation
(none)               Hotspots analysis
Poor Threading       Threading analysis
No Vectorization     HPC Performance Characterization analysis
Memory Access        Memory Access analysis

The Hotspots analysis identifies hot spots, which are areas of code that contributed the most to the elapsed time. In large applications, this analysis is a good starting point to understand algorithm flow and identify the hottest functions in different sections of code. Since the matrix sample is small and has only one primary function, the hot spot is likely to be in the primary function. Rather than running the Hotspots analysis to confirm this detail, you may find it more useful to examine the root cause behind the performance problem.

You can run the Threading analysis to identify bottlenecks in the threading implementation, such as poor concurrency due to locks or overhead related to spin time. In the matrix sample, the thread limit is set to 16, which is far lower than the CPU counts typically available on Linux servers. Because you can raise the thread limit to spread the work across the available cores, this is not a critical issue to fix.
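The effect of a fixed thread limit can be sketched with simple arithmetic. The example below assumes each thread keeps at most one logical core busy, so the share of cores that can be in use is capped by the thread count. This is a simplification, not how VTune Profiler computes Logical Core Utilization, and the core counts are hypothetical; only the limit of 16 comes from the text above.

```python
import os


def max_core_utilization(thread_count: int, logical_cores: int) -> float:
    """Upper bound on the fraction of logical cores that thread_count threads can occupy."""
    if thread_count < 0 or logical_cores <= 0:
        raise ValueError("thread_count must be >= 0 and logical_cores must be > 0")
    return min(thread_count, logical_cores) / logical_cores


MATRIX_THREAD_LIMIT = 16  # thread limit stated for the matrix sample

# Hypothetical server sizes, for illustration only.
for cores in (16, 64, 128):
    bound = max_core_utilization(MATRIX_THREAD_LIMIT, cores)
    print(f"{cores:>3} logical cores: at most {bound:.0%} utilized")

# The same bound for the machine running this script.
local_cores = os.cpu_count() or 1
print(f"this machine ({local_cores} logical cores): "
      f"at most {max_core_utilization(MATRIX_THREAD_LIMIT, local_cores):.0%}")
```

Under this model, raising the thread limit toward the logical core count lifts the bound, which matches the reasoning that the issue is easy to address.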

Vectorization lets the processor execute more operations in parallel. However, the low IPC metric value means that all instructions, vectorized or not, execute slowly. Therefore, improving vectorization before improving the IPC rate would not necessarily improve application performance.
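A rough model helps show why the order of fixes matters. The sketch below assumes that run time splits into compute cycles, which vectorization shortens, and stall cycles, which it leaves unchanged, and that the two do not overlap. Both assumptions are simplifications, and all cycle counts and the 4x vector speedup are hypothetical.

```python
def overall_speedup(compute_cycles: float, stall_cycles: float,
                    vector_speedup: float) -> float:
    """Whole-run speedup when only the compute portion gets faster."""
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    before = compute_cycles + stall_cycles
    if before <= 0:
        raise ValueError("total cycles must be positive")
    after = compute_cycles / vector_speedup + stall_cycles
    return before / after


# Hypothetical (compute, stall) splits of a 100-cycle budget.
for compute, stall in ((10, 90), (50, 50), (90, 10)):
    gain = overall_speedup(compute, stall, vector_speedup=4.0)
    print(f"compute={compute:>2} stall={stall:>2}: overall speedup {gain:.2f}x")
```

In this model, the more the run is dominated by stalls, the less a vectorized compute portion changes the total, so reducing stalls first gives vectorization more to work with later.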

For this reason, prioritize improving the IPC metric. A processor that stalls for most of the run time is often waiting on memory, so run the Memory Access analysis to understand why the application is memory-bound.
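One common way a matrix multiplication becomes memory-bound is a large stride in its innermost loop. The sketch below is an assumption about a generic row-major multiply of the form c[i][j] += a[i][k] * b[k][j] with k as the inner loop; it is not taken from the matrix sample's source. Whether this pattern applies to the sample is what the Memory Access analysis would help confirm. The matrix size, element size, and 64-byte cache line are hypothetical values.

```python
def inner_loop_strides(n: int, elem_bytes: int = 8) -> dict:
    """Byte distance between consecutive k iterations for each operand
    of c[i][j] += a[i][k] * b[k][j] on row-major n x n matrices."""
    if n <= 0 or elem_bytes <= 0:
        raise ValueError("n and elem_bytes must be positive")
    return {"a[i][k]": elem_bytes, "b[k][j]": n * elem_bytes}


def line_utilization(stride_bytes: int, elem_bytes: int = 8,
                     line_bytes: int = 64) -> float:
    """Fraction of each fetched cache line that a strided walk actually uses."""
    if stride_bytes < elem_bytes or elem_bytes <= 0 or line_bytes <= 0:
        raise ValueError("expected stride_bytes >= elem_bytes > 0 and line_bytes > 0")
    return elem_bytes / min(stride_bytes, line_bytes)


n = 2048  # hypothetical matrix dimension
for operand, stride in inner_loop_strides(n).items():
    print(f"{operand}: stride {stride} bytes, "
          f"{line_utilization(stride):.1%} of each cache line used")
```

Under these assumptions, the walk over b touches a new cache line on every iteration and uses only one element of it, which is the kind of pattern that leaves a processor waiting on memory.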