User Guide

Intel® VTune™ Profiler User Guide

ID 766319
Date 11/07/2023
Public

Summary Report

Like the Summary window in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Profiler generates the summary report automatically when data collection completes. To disable this report, add the no-summary option to the command line when you run a collect or collect-with action.

Use the following syntax to generate the Summary report from a preexisting result:

vtune -report summary -result-dir <result_path>
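
As a minimal sketch of the deferred workflow described above (collect with the automatic summary suppressed, then generate the summary later from the saved result), the two command lines can be assembled in Python. The helper names, the `hotspots` analysis type, the `r000hs` directory, and `./my_app` are illustrative assumptions; only the no-summary, `-report summary`, and `-result-dir` options are taken from this page.

```python
import shutil
import subprocess

def collect_command(analysis, result_dir, app_argv):
    """Build a collect command that skips the automatic summary report."""
    return ["vtune", "-collect", analysis, "-no-summary",
            "-result-dir", result_dir, "--", *app_argv]

def summary_command(result_dir):
    """Build the command that generates the summary from an existing result."""
    return ["vtune", "-report", "summary", "-result-dir", result_dir]

def run_if_available(argv):
    """Run the command only when vtune is on PATH; return its stdout or None."""
    if shutil.which(argv[0]) is None:
        return None
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout

# Hypothetical analysis type, result directory, and target application.
collect_argv = collect_command("hotspots", "r000hs", ["./my_app"])
report_argv = summary_command("r000hs")
print(" ".join(collect_argv))
print(" ".join(report_argv))
```

Passing either list to `run_if_available` executes it only on a machine where `vtune` is installed; elsewhere the sketch just prints the command lines.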

The summary report output depends on the collection type:

User-mode Sampling and Tracing Collection Summary Report

For User-Mode Sampling and Tracing Collection results, the summary report includes the following sections:

  • Collection and Platform Information

  • CPU Information

  • Summary per basic analysis metrics

Example 1: User-Mode Sampling Hotspots Summary

This example generates the summary report for the r000hs Hotspots analysis result on Windows*:

vtune -report summary -r r000hs
Elapsed Time: 1.857s
    CPU Time: 10.069s
        Effective Time: 10.069s
            Idle: 0.000s
            Poor: 1.294s
            Ok: 6.381s
            Ideal: 2.395s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 9
    Paused Time: 0s
Top Hotspots
Function   Module      CPU Time
---------  ----------  --------
multiply1  matrix.exe  10.069s
Collection and Platform Info
    Application Command Line: C:\temp\samples\en\C++\matrix_vtune\matrix\vc14\Win32\Release\matrix.exe
    Operating System: Microsoft Windows 10
    Computer Name: my-computer
    Result Size: 5 MB
    Collection start time: 09:41:57 06/09/2018 UTC
    Collection stop time: 09:41:58 06/09/2018 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Processor code named Skylake
        Frequency: 4.008 GHz
        Logical CPU Count: 8
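
The time metrics in this report can be read programmatically. The sketch below parses the `Name: <value>s` lines and checks two relationships that the numbers suggest: the Idle, Poor, Ok, Ideal, and Over buckets appear to partition Effective Time, and CPU Time divided by Elapsed Time estimates how many logical CPUs were busy on average. The regular expression and the `parse_seconds` helper are assumptions based on the sample output only, not a documented report grammar.

```python
import re

# Metric lines copied from the Example 1 report.
REPORT = """\
Elapsed Time: 1.857s
CPU Time: 10.069s
Effective Time: 10.069s
Idle: 0.000s
Poor: 1.294s
Ok: 6.381s
Ideal: 2.395s
Over: 0s
Spin Time: 0s
Overhead Time: 0s
"""

def parse_seconds(text):
    """Map each 'Name: <number>s' line to a float number of seconds."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*([0-9.]+)s\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics

metrics = parse_seconds(REPORT)
buckets = sum(metrics[name] for name in ("Idle", "Poor", "Ok", "Ideal", "Over"))
busy_cpus = metrics["CPU Time"] / metrics["Elapsed Time"]
print(f"utilization buckets: {buckets:.3f}s")
print(f"Effective Time:      {metrics['Effective Time']:.3f}s")
print(f"average busy logical CPUs: {busy_cpus:.2f}")
```

The bucket total agrees with Effective Time to within rounding, and the ratio of roughly 5.4 busy CPUs on an 8-CPU system is consistent with most of the time falling in the Ok bucket.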

Example 2: Threading Summary

This example generates a summary report for the Threading analysis result r003tr. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:

vtune -report summary -r r003tr
Summary
-------
Average Concurrency: 1.073
Elapsed Time: 13.911
CPU Time: 11.031
Wait Time: 64.468
Average CPU Usage: 0.768

To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.
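
A short worked calculation puts the Example 2 numbers in proportion. It assumes that Wait Time is accumulated across all threads, which is the only way it can exceed Elapsed Time; the variable names are illustrative.

```python
# Values copied from the Example 2 summary (all in seconds).
elapsed_time = 13.911
cpu_time = 11.031
wait_time = 64.468

# Wait Time summed over threads divided by wall-clock time estimates
# how many threads were waiting at a typical moment.
avg_waiting_threads = wait_time / elapsed_time

# Share of accumulated thread time (running plus waiting) spent waiting.
wait_share = wait_time / (wait_time + cpu_time)

print(f"{avg_waiting_threads:.2f} threads waiting on average")
print(f"{wait_share:.1%} of accumulated thread time spent waiting")
```

Under that assumption, between four and five threads were blocked at any given time while barely more than one was running, which matches the low Average Concurrency.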

Hardware Event-based Sampling Collection Summary Report

For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):

  • Collection and Platform information
  • Microarchitecture Exploration metrics
  • CPU information
  • GPU information
  • Summary per basic analysis metrics
  • Event summary
  • Uncore Event summary

For some analysis types, the command-line summary report includes an issue description for each metric that exceeds its predefined threshold. To omit issue descriptions from the summary report, do one of the following:

  • Use the -report-knob show-issues=false option when generating the report, for example: vtune -report summary -r r001hpc -report-knob show-issues=false

  • Use the -format=csv option to view the report in the CSV format, for example: vtune -report summary -r r001hpc -format=csv
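
The two ways of omitting issue descriptions listed above can be sketched as switches on a small command builder. The function name and its keyword arguments are hypothetical; the options it emits are the ones shown on this page.

```python
def summary_report_argv(result_dir, show_issues=True, csv=False):
    """Build a 'vtune -report summary' command line as an argument list."""
    argv = ["vtune", "-report", "summary", "-r", result_dir]
    if not show_issues:
        # Suppress issue descriptions in the text report.
        argv += ["-report-knob", "show-issues=false"]
    if csv:
        # Emit the report in CSV format instead of plain text.
        argv.append("-format=csv")
    return argv

print(" ".join(summary_report_argv("r001hpc", show_issues=False)))
print(" ".join(summary_report_argv("r001hpc", csv=True)))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting concerns.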

Example 3: Hardware Event-Based Sampling Hotspots Summary

This example generates the summary report for the r001hs result of a Hotspots analysis (hardware event-based sampling mode) on Windows* OS:

vtune -report summary -r r001hs
Elapsed Time: 3.986s
    CPU Time: 1.391s
    CPI Rate: 0.860
    Wait Time: 65.023s
    Inactive Time: 14.819s
    Total Thread Count: 25
    Paused Time: 0s
Hardware Events
Hardware Event Type                  Hardware Event Count  Hardware Event Sample Count  Events Per Sample
-----------------------------------  --------------------  ---------------------------  -----------------
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE   24,832,593            8                            1000030
CPU_CLK_UNHALTED.REF_TSC             3,471,208,416         120                          24000000
CPU_CLK_UNHALTED.REF_XCLK            43,877,874            14                           1000030
CPU_CLK_UNHALTED.THREAD              3,903,569,890         127                          24000000
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE  943,046,424           14                           20000030
INST_RETIRED.ANY                     4,536,715,682         140                          24000000
UOPS_EXECUTED.THREAD                 5,282,967,942         72                           20000030
UOPS_RETIRED.RETIRE_SLOTS            5,587,595,565         76                           20000030
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe C:\samples\tachyon\dat\balls.dat
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 13 MB
    Collection start time: 12:12:52 24/07/2018 UTC
    Collection stop time: 12:13:03 24/07/2018 UTC
    Collector Type: Event-based sampling driver
    CPU
        Name: Intel(R) Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4
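
The CPI Rate in this report can be cross-checked against the Hardware Events counts. The sketch assumes the usual definition of cycles per instruction, unhalted thread clockticks divided by retired instructions; the variable names are illustrative.

```python
# Event counts copied from the Hardware Events table in Example 3.
cpu_clk_unhalted_thread = 3_903_569_890   # CPU_CLK_UNHALTED.THREAD
inst_retired_any = 4_536_715_682          # INST_RETIRED.ANY

# Cycles per instruction: clockticks divided by instructions retired.
cpi_rate = cpu_clk_unhalted_thread / inst_retired_any
print(f"CPI Rate: {cpi_rate:.3f}")
```

Rounded to three decimals, the quotient reproduces the CPI Rate value of 0.860 reported in the summary.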

Use the Elapsed Time metric as the performance baseline against which to measure the effect of your optimizations.

Example 4: HPC Performance Characterization Summary

This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:

vtune -report summary -r r001hpc -report-knob show-issues=false
Elapsed Time: 23.182s
    GFLOPS: 14.748
Effective Physical Core Utilization: 58.0%
Effective Logical Core Utilization: 13.920 Out of 24 logical CPUs
    Serial Time: 0.069s (0.3%)
    Parallel Region Time: 23.113s (99.7%)
        Estimated Ideal Time: 14.010s (60.4%)
        OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
    Cache Bound: 0.175
    DRAM Bound: 0.216
    NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
    GFLOPS: 14.748
        Scalar GFLOPS: 4.801
        Packed GFLOPS: 9.947
Collection and Platform Info
    Application Command Line: ./sp.B.x
    User Name: vtune
    Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
    Computer Name: nntvtune235
    Result Size: 1 GB
    Collection start time: 19:04:30 13/06/2017 UTC
    Collection stop time: 19:04:53 13/06/2017 UTC
    CPU
        Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
        Frequency: 2.694 GHz
        Logical CPU Count: 24
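
The Example 4 metrics are related by simple sums, which makes the report easier to read. The sketch below checks those relationships on the reported values; the grouping of metrics is inferred from the numbers themselves rather than taken from a formal definition, and the `close` helper with its tolerance is an assumption to absorb rounding.

```python
import math

# Values copied from the Example 4 report.
elapsed = 23.182
serial, parallel = 0.069, 23.113
ideal, omp_gain = 14.010, 9.103
gflops, scalar, packed = 14.748, 4.801, 9.947
logical_util, logical_cpus = 13.920, 24

def close(a, b, tol=0.002):
    """Compare two reported values, allowing for three-decimal rounding."""
    return math.isclose(a, b, abs_tol=tol)

checks = {
    "serial + parallel == elapsed": close(serial + parallel, elapsed),
    "ideal + OpenMP gain == parallel": close(ideal + omp_gain, parallel),
    "scalar + packed == GFLOPS": close(scalar + packed, gflops),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")

# Percentages in parentheses appear to be shares of Elapsed Time.
print(f"OpenMP potential gain share: {omp_gain / elapsed:.1%}")
print(f"logical core utilization:    {logical_util / logical_cpus:.1%}")
```

The last two lines reproduce the 39.3% shown next to OpenMP Potential Gain and the 58.0% core utilization, which suggests that removing OpenMP inefficiencies could save up to about 9 seconds of the 23-second run.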

Example 5: Memory Access Summary

This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:

vtune -report summary -r r001macc
Elapsed Time: 7.917s
    CPU Time: 6.473s
    Memory Bound: 21.9% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 8.0% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 3.0% of Clockticks
        L3 Bound: 5.0% of Clockticks
         | This metric shows how often CPU was stalled on L3 cache, or contended
         | with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
         | improves the latency and increases performance.
         |
        DRAM Bound: 4.1% of Clockticks
            DRAM Bandwidth Bound: 0.4% of Elapsed Time
            Memory Latency: 0.000
    Loads: 10,137,704,122
    Stores: 3,208,896,264
    LLC Miss Count: 1,750,105
    Average Latency (cycles): 11
    Total Thread Count: 21
    Paused Time: 0s
System Bandwidth
    Max DRAM System Bandwidth: 15 GB
Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                11.300            2.836              0.4%
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat"
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 31 MB
    Collection start time: 09:33:44 07/06/2017 UTC
    Collection stop time: 09:33:52 07/06/2017 UTC
    CPU
        Name: Intel(R) Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4
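
In this sample, issue descriptions are the lines that begin with a `|` character. When a text report has already been saved with issues included, a small filter can strip them after the fact. This is a sketch that relies only on that observed prefix, not on a documented format guarantee; `strip_issue_lines` is a hypothetical helper and `SAMPLE` is a shortened excerpt of the report.

```python
def strip_issue_lines(report_text):
    """Drop lines whose first non-blank character is '|'."""
    kept = [line for line in report_text.splitlines()
            if not line.lstrip().startswith("|")]
    return "\n".join(kept)

# Shortened excerpt of the Example 5 report.
SAMPLE = """\
Memory Bound: 21.9% of Pipeline Slots
 | The metric value is high. This may indicate that a significant fraction
 | of execution pipeline slots could be stalled due to demand memory load
 | and stores.
 |
    L2 Bound: 3.0% of Clockticks
"""

print(strip_issue_lines(SAMPLE))
```

Regenerating the report with `-report-knob show-issues=false` remains the more direct route when the result directory is still available.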

The Bandwidth Utilization section in the summary report shows the following metrics:

  • Platform Maximum: Expected maximum bandwidth for the system. This value is either estimated automatically by a micro-benchmark at the start of the analysis or hard-coded based on theoretical bandwidth limits.

  • Observed Maximum: Maximum bandwidth observed during the analysis. If the value is close to the Platform Maximum, your workload is probably bandwidth-limited.

  • Average Bandwidth: Average bandwidth utilization during the analysis.

  • % of Elapsed Time with High BW Utilization: Percentage of Elapsed Time during which the system bandwidth was heavily utilized.

This information is provided for every bandwidth domain present in the result (DRAM, MCDRAM, QPI, and so on).
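
The comparison between Observed Maximum and Platform Maximum can be expressed as a ratio. The sketch below applies it to the DRAM row of Example 5; the function name and the 0.9 cutoff for "close to the Platform Maximum" are assumptions chosen for illustration, not thresholds defined by VTune Profiler.

```python
def bandwidth_utilization(observed_max, platform_max, average, near=0.9):
    """Return peak and average utilization ratios plus a near-limit flag."""
    peak_ratio = observed_max / platform_max
    return {
        "peak_ratio": peak_ratio,
        "average_ratio": average / platform_max,
        # Hypothetical cutoff: treat peaks at or above 'near' as bandwidth-limited.
        "near_platform_max": peak_ratio >= near,
    }

# DRAM row from the Example 5 Bandwidth Utilization table (GB/sec).
dram = bandwidth_utilization(observed_max=11.300, platform_max=15, average=2.836)
print(f"peak:    {dram['peak_ratio']:.1%} of platform maximum")
print(f"average: {dram['average_ratio']:.1%} of platform maximum")
print(f"likely bandwidth-limited: {dram['near_platform_max']}")
```

For this result the peak reaches about three quarters of the platform maximum and the average is under one fifth, which agrees with the low 0.4% of Elapsed Time reported with high bandwidth utilization.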