Application Performance Snapshot User Guide for Linux* OS

ID 772048
Date 10/31/2024
Public

Metrics Reference

The following list describes the metrics supported by Application Performance Snapshot. If data for a metric is available in the statistics files, the data is displayed in the analysis summary on the command line and in the HTML report. Some metrics are platform-specific, while others are available only if the application uses MPI or OpenMP*.

Cache Stalls

This metric indicates how often the machine was stalled on the L1, L2, and L3 caches. While cache hits are serviced much more quickly than accesses to DRAM, they can still incur a significant performance penalty. This metric also includes coherence penalties for shared data.

CCL Time

The amount of time spent inside the Intel® oneAPI Collective Communications Library (oneCCL). Values exceeding 15% may require further investigation into the efficiency of oneCCL communication. Possible reasons for high values include:

  • High wait times within the library
  • Active communications
  • Suboptimal oneCCL settings
Check the oneCCL Time/Hotspots metric histogram and outliers to determine whether your application has load balancing issues.

CPI (Cycles per Instruction Retired) Rate

The average number of cycles each retired instruction took. A CPI of 1 is considered acceptable for high performance computing (HPC) applications, but different application domains have different expected values. The CPI value tends to be greater when there are long-latency memory, floating-point, or SIMD operations, when instructions do not retire due to branch mispredictions, or when the front end starves for instructions.
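
As a worked example with hypothetical counts, CPI is the ratio of cycles to retired instructions:

    CPI = unhalted clock cycles / instructions retired
        = 2.6e9 / 2.0e9
        = 1.3

so this hypothetical run retires an instruction every 1.3 cycles on average, slightly above the 1.0 guideline for HPC codes.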

CPU Utilization

This metric helps evaluate the parallel efficiency of your application. It estimates the utilization of all the logical CPU cores in the system by your application. 100% utilization means that your application keeps all the logical CPU cores busy for the entire time that it runs. Note that the metric does not distinguish between useful application work and the time that is spent in parallel runtimes.

DP GFLOPS

Number of double-precision giga-floating-point operations executed per second. DP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.
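
As a worked example with hypothetical numbers, an application that executes 8.0e10 double-precision operations over a 10-second run reports:

    DP GFLOPS = DP operations / (elapsed seconds * 1e9)
              = 8.0e10 / (10 * 1e9)
              = 8.0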

DRAM Bandwidth

The metrics in this section indicate how intensively the system used DRAM bandwidth during the elapsed time. They include:

  • Average Bandwidth - Average memory bandwidth used by the system during elapsed time.
  • Peak - Maximum memory bandwidth used by the system during elapsed time.
  • Bound - The portion of elapsed time during which memory bandwidth utilization exceeded a 70% threshold of the theoretical maximum memory bandwidth for the platform (see the worked example after this list).
Some applications can execute in phases that use memory bandwidth in a non-uniform manner. For example, an application that has an initialization phase may use more memory bandwidth initially. Use these metrics to identify how the application uses memory through the duration of execution.
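
To make the Bound threshold concrete, here is a worked example assuming a hypothetical platform maximum of 200 GB/s:

    threshold = 0.70 * 200 GB/s = 140 GB/s
    Bound     = portion of elapsed time with bandwidth above 140 GB/s

A Bound value of 25% would therefore mean the application spent a quarter of its elapsed time drawing more than 140 GB/s from DRAM.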

DRAM Stalls

This metric indicates how often the CPU was stalled on the main memory (DRAM) because of demand loads or stores.

Elapsed Time

Execution time of the specified application, in seconds.

FP Arith/Mem Rd Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory read instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
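
The sketch below is a minimal C illustration (not from this guide) of one common remedy: allocating data on a cache-line boundary so that compiler-generated vector loads are aligned. aligned_alloc is standard C11; the 64-byte alignment target is an assumption matching typical x86 cache lines.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024

    int main(void) {
        /* Allocate on a 64-byte boundary (assumed cache-line size) so that
           compiler-generated vector loads do not straddle cache lines.
           aligned_alloc requires size to be a multiple of the alignment. */
        double *a = aligned_alloc(64, N * sizeof(double));
        double *b = aligned_alloc(64, N * sizeof(double));
        if (!a || !b) return 1;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        double sum = 0.0;
        /* With aligned, unit-stride data the compiler can emit packed FP
           arithmetic per vector load, raising the FP Arith/Mem Rd ratio. */
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("%f\n", sum);
        free(a);
        free(b);
        return 0;
    }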

FP Arith/Mem Wr Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory write instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.

GPU Metrics

This section contains metrics that enable you to analyze the efficiency of GPU utilization within your application.

GPU Accumulated Time

This is the total of all time intervals during which each GPU stack had at least one thread scheduled.

GPU IPC (Instructions Per Cycle)

This is the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics.

GPU Occupancy

This is the normalized sum of all cycles on all core and thread slots when a slot has a thread scheduled.

GPU Stack Utilization

The average portion of time during which at least one GPU XVE thread was scheduled on each GPU stack. This metric is a percentage of the GPU Accumulated Time. It has a second-level breakdown by state:

  • XVE Active: The normalized sum of all cycles on all cores spent actively executing instructions.
  • XVE Idle: The normalized sum of all cycles on all cores when no threads were scheduled on a core.
  • XVE Stalled: The normalized sum of all cycles on all cores spent stalled. At least one thread was loaded, but the core remained stalled.

Instruction Mix

This section contains the breakdown of micro-operations into single precision floating point (SP FLOPs), double precision floating point (DP FLOPs), and non-floating point (non-FP) operations. SP and DP FLOPs contain next-level metrics that enable you to estimate the fractions of packed and scalar operations. Packed operations can be analyzed by the vector length (128-, 256-, or 512-bit) used in the application.

Intel® Omni-Path Fabric Interconnect Bandwidth and Packet Rate

(Available for compute nodes equipped with Intel® Omni-Path Fabric (Intel® OP Fabric) and with Intel® VTune™ Profiler drivers installed)

Average interconnect bandwidth and packet rate per compute node, broken down by outgoing and incoming values. Values close to the interconnect limit might lead to higher-latency network communications.

Memory Stalls

This metric indicates how memory subsystem issues affect performance. It measures the fraction of pipeline slots that could be stalled due to demand load or store instructions. A high value means that a significant fraction of execution pipeline slots could be stalled on demand memory loads and stores. See the second-level metrics to determine whether the application is cache- or DRAM-bound, and to evaluate NUMA efficiency.
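
As a hypothetical illustration of the access patterns behind this metric, the C sketch below sums the same matrix twice: the unit-stride walk mostly hits in cache, while the stride-N walk misses heavily and leaves the pipeline stalled on memory.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 2048

    int main(void) {
        double *m = malloc((size_t)N * N * sizeof(double));
        if (!m) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1.0;

        double sum = 0.0;

        /* Cache-friendly: unit-stride walk touches each cache line once. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[(size_t)i * N + j];

        /* Cache-hostile: stride-N walk; most loads miss in cache and the
           pipeline stalls waiting on memory, inflating Memory/DRAM Stalls. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[(size_t)i * N + j];

        printf("%f\n", sum);
        free(m);
        return 0;
    }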

MPI Imbalance

The mean unproductive wait time per process spent in MPI library calls while a process waits for data.
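
The contrived C/MPI sketch below (not from this guide) shows how the metric arises: each rank "computes" for a different amount of time, so the faster ranks sit in MPI_Barrier waiting for the slowest one, and that wait is reported as imbalance.

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Contrived imbalance: each rank works for a different time. */
        sleep((unsigned)rank + 1);

        /* Faster ranks wait here for the slowest one; that unproductive
           wait is what shows up as MPI Imbalance. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }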

MPI Time

Time spent inside the MPI library. Values exceeding 15% might require further investigation into MPI communication efficiency. High values can be caused by high wait times inside the library, active communications, or non-optimal settings of the MPI library. See the MPI Imbalance metric to determine whether the application has a load balancing problem.

NUMA: % of Remote Accesses

In non-uniform memory architecture (NUMA) machines, memory requests that miss the last level cache may be serviced by either local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. Keep as much frequently accessed data local as possible. This metric indicates the percentage of remote accesses; the lower the value, the better.
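
One common remedy on Linux is first-touch placement, sketched below in C with OpenMP (a hypothetical example, not from this guide): a page is typically allocated on the NUMA node of the thread that first writes it, so initializing data with the same parallel pattern as the compute phase keeps accesses local. Pinning memory with a tool such as numactl is an alternative.

    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc((size_t)N * sizeof(double));
        if (!a) return 1;

        /* First touch: each thread writes the pages it will use later,
           placing them on its own NUMA node. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 0.0;

        /* The compute phase uses the same static schedule, so threads
           touch mostly local pages and the remote-access percentage drops. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] += 1.0;

        free(a);
        return 0;
    }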

OpenMP Imbalance

The metric indicates the percentage of elapsed time that your application wastes at OpenMP* synchronization barriers because of load imbalance.
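
As a hypothetical C illustration, the loop below has iteration costs that grow with i. Under schedule(static) the threads owning the high-numbered chunks finish last while the others wait at the implicit barrier; schedule(dynamic) hands out iterations on demand and is one way to even the load.

    #include <stdio.h>

    /* Iteration cost grows with i, creating a triangular work profile. */
    static double work(int i) {
        double s = 0.0;
        for (int j = 0; j <= i; j++)
            s += (double)j;
        return s;
    }

    int main(void) {
        double sum = 0.0;
        /* schedule(dynamic, 64) evens the load across threads; with
           schedule(static) the last thread does most of the work. */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
        for (int i = 0; i < 20000; i++)
            sum += work(i);
        printf("%f\n", sum);
        return 0;
    }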

PCIe Metrics

Average bandwidth of inbound read and write operations initiated by PCIe devices. The data is shown for GPU and network controller devices.

Serial Time

Time spent by the application outside any OpenMP region in the master thread during collection. This directly impacts application Collection Time and scaling. High values might signal a performance problem that can be addressed through code parallelization or algorithm tuning.

SP GFLOPS

Number of single-precision giga-floating-point operations executed per second. SP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.

Vectorization

The percentage of packed (vectorized) floating point operations. The higher the value, the larger the vectorized portion of the code. This metric does not account for the actual vector length used when executing vector instructions. As a result, if the code is fully vectorized but uses a legacy instruction set that utilizes only half of the vector length, the Vectorization metric still equals 100% (see the sketch below).
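
The C sketch below (a hypothetical example) shows why the metric ignores vector length: the loop vectorizes fully whether it is compiled for 128-bit SSE2 (for example, -O2 -msse2) or 512-bit AVX-512 (for example, -O2 -mavx512f), and APS reports 100% Vectorization for both builds even though the SSE2 build uses a quarter of the available width.

    #include <stdio.h>

    #define N 100000

    int main(void) {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* A trivially vectorizable loop: the compiler emits packed SP
           operations at whatever vector width the build targets. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("%f\n", (double)c[N - 1]);
        return 0;
    }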

Xe Link Outgoing Bandwidth

The average memory bandwidth from one GPU to another GPU through an Xe Link.

Xe Link Outgoing Traffic

The normalized sum of all data transferred from one GPU to another GPU through an Xe Link.