Intel® VTune™ Profiler

Performance Analysis Tutorial for Linux* OS

ID 762029
Date 10/31/2024
Public

Summary

You have completed the Finding Common Bottlenecks tutorial. Here are some important things to remember when using the Intel® VTune™ Profiler to analyze your code for hotspots and hardware issues:

For each step below, the tutorial recap comes first, followed by the key tutorial takeaways.

1. Find the bottleneck

You started with Performance Snapshot to determine the main limiting factors and the next steps for optimization:

  1. Use the Hotspots analysis to isolate the problem to a specific code area.
  2. Use the Memory Access analysis to understand the exact mechanics behind the bottleneck.

  • When you first analyze an application, start with the Performance Snapshot analysis to determine the main problem areas and next steps.
  • Use the Hotspots analysis to isolate the performance issue to a specific area of code. Click the hotspot function name in the Bottom-up window to see the code lines responsible for the bottleneck.
  • Use the Memory Access analysis to identify issues related to inefficient DRAM accesses, one of the most common limiting factors in software.

2. Resolve issue and recompile application

You edited the code and recompiled the application to eliminate the cache-unfriendly DRAM access pattern.

This significantly reduced the application's running time.

You set compiler options to use a different optimization level to see how compiler options can influence vectorization.

  • Using efficient, cache-friendly DRAM access patterns can result in a significant increase in performance.
  • Compiler options can influence the behavior of the application in non-obvious ways, especially when several different compilers are used. VTune Profiler can help identify cases where the application is vectorized improperly and therefore underutilizes the available hardware resources.
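
To illustrate the first takeaway, here is a minimal C sketch of a cache-unfriendly and a cache-friendly access pattern for a square matrix multiplication. It is not the tutorial's source code: the function names multiply_strided and multiply_contiguous, the flat row-major layout, and the requirement that the caller zero c beforehand are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Cache-unfriendly loop order (i, j, k): the inner loop reads b[k * n + j]
   with a stride of n doubles, so consecutive iterations tend to touch
   different cache lines. The caller must zero c before the call. */
void multiply_strided(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Cache-friendlier loop order (i, k, j): the inner loop walks b and c
   contiguously along a row, so each fetched cache line is fully used.
   The result is the same as above; only the traversal order changes. */
void multiply_contiguous(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a_ik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a_ik * b[k * n + j];
        }
}
```

Both functions compute the same product. How much faster the second one runs depends on the matrix size, the compiler, and the cache hierarchy, so measure both versions and compare them rather than assuming a fixed speedup.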

3. Resolve vectorization issues

You recompiled the application with a different optimization level, and the code was vectorized.

However, while using Performance Snapshot, you noticed that only the 128-bit vector registers were used, while the 256-bit registers were not used at all.

Using the HPC Performance Characterization analysis, you saw that the code used SSE2, an older vector instruction set extension, so a portion of the hardware resources remained underutilized.

You recompiled the application again with different options to ensure that vectorization used the full capability of the platform.

  • Both the Performance Snapshot and the HPC Performance Characterization analysis types can help identify issues related to improper vectorization.
  • Although compiler options are well documented and their behavior is known, it is easy to miss a peculiarity of an option. As a result, an application may not be compiled to make the best use of hardware resources at first, regardless of the compiler used. VTune Profiler can help catch such issues at all stages of development.
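
As a rough illustration of the kind of loop these takeaways concern, the C sketch below shows a simple candidate for vectorization. The function name scaled_add is hypothetical and the code is not taken from the tutorial sample. The options named in the comments (-march=native for GCC and Clang, -xHost for the Intel compiler) are examples of target-selection options; whether and how the loop is actually vectorized depends on the compiler, its version, and the optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += alpha * x[i] for i in [0, n).
   The restrict qualifiers tell the compiler that x and y do not overlap,
   which removes one common obstacle to vectorization.

   When a build targets only the baseline x86-64 instruction set, the
   compiler can use at most the 128-bit SSE2 registers for this loop.
   Target options such as -march=native (GCC, Clang) or -xHost (Intel
   compiler) allow wider vector instructions if the build machine
   supports them. */
void scaled_add(size_t n, double alpha,
                const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

After rebuilding with different options, rerun the analysis to confirm which vector instruction set the hot loop actually uses instead of relying on the build options alone.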

4. Analyze Microarchitecture Usage

As recommended by Performance Snapshot, you used the Microarchitecture Exploration analysis to identify next optimization steps.

Using this analysis type, you saw that the best way to optimize the application further was to apply the cache blocking technique.

  • VTune Profiler provides a large number of microarchitecture metrics tuned by Intel architects to enable you to make an informed optimization decision.
  • Use the metrics and the µPipe diagram to decide on the next optimization step.
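
The cache blocking technique mentioned in this step can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the tutorial's implementation: the function name multiply_blocked, the helper min_size, the flat row-major layout, and the block parameter are all hypothetical, and a suitable block size depends on the cache sizes of the target processor.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked (tiled) multiplication of two n x n row-major matrices.
   The three outer loops pick block x block tiles; the three inner loops
   multiply within the tiles. Each tile of b is reused for every row i of
   the current tile of a while it is still likely to be in cache.
   The caller must zero c before the call. */
void multiply_blocked(size_t n, size_t block,
                      const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block)
        for (size_t kk = 0; kk < n; kk += block)
            for (size_t jj = 0; jj < n; jj += block) {
                size_t i_end = min_size(ii + block, n);
                size_t k_end = min_size(kk + block, n);
                size_t j_end = min_size(jj + block, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a_ik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
            }
}
```

A reasonable starting point is a block size for which a few tiles fit in a cache level together. Treat any such value as a guess to be checked by profiling again.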

5. Check your work

You used the Compare Results feature to compare the performance of the application at different optimization stages.

Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Profiler toolbar. From the command line, use the vtune command.

Next step: Prepare your own application(s) for analysis. Then use the VTune Profiler to find and eliminate performance problems.