Metrics Reference
The following list describes the metrics supported by Application Performance Snapshot. If data for a metric is available in the statistics files, the data is displayed in the analysis summary on the command line and in the HTML report. Some metrics are platform-specific, while others are available only if the application uses MPI or OpenMP*.
Cache Stalls
This metric indicates how often the machine was stalled on L1, L2, and L3 cache. While cache hits are serviced much more quickly than hits in DRAM, they can still incur a significant performance penalty. This metric also includes coherence penalties for shared data.
CCL Time
The amount of time spent inside the Intel® oneAPI Collective Communications Library (oneCCL). Values exceeding 15% may require further investigation into the efficiency of oneCCL communication. Possible reasons include:
- High wait times within the library
- Active communications
- Suboptimal oneCCL settings
CPI (Cycles per Instruction Retired) Rate
The average number of cycles each executed instruction took. A CPI of 1 is considered acceptable for high performance computing (HPC) applications, but different application domains have different expected values. The CPI value tends to be greater when there are long-latency memory, floating-point, or SIMD operations; non-retired instructions due to branch mispredictions; or instruction starvation at the front end.
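As a minimal sketch, the CPI rate is the ratio of unhalted CPU cycles to retired instructions. The counter names and values below are illustrative assumptions, not the exact events that Application Performance Snapshot collects:

```python
# Hypothetical counter values; a real collector reads them from hardware events.
cycles_unhalted = 2.4e11         # total unhalted CPU cycles (assumed value)
instructions_retired = 1.6e11    # total instructions retired (assumed value)

cpi = cycles_unhalted / instructions_retired
print(f"CPI rate: {cpi:.2f}")    # 1.50 -- above the ~1.0 guideline for HPC codes
```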
CPU Utilization
This metric helps evaluate the parallel efficiency of your application. It estimates the utilization of all the logical CPU cores in the system by your application. 100% utilization means that your application keeps all the logical CPU cores busy for the entire time that it runs. Note that the metric does not distinguish between useful application work and the time that is spent in parallel runtimes.
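As a minimal sketch, CPU utilization can be thought of as the busy time summed over all logical cores, divided by the elapsed time multiplied by the number of logical cores. The figures below are assumptions for illustration, not output produced by the tool:

```python
# Illustrative sketch: aggregate busy time relative to total available core time.
elapsed_time_s = 100.0                   # wall-clock run time (assumed)
logical_cores = 64                       # logical CPUs in the system (assumed)
busy_time_s = [92.0] * 48 + [5.0] * 16   # busy seconds per logical core (assumed)

cpu_utilization = sum(busy_time_s) / (elapsed_time_s * logical_cores)
print(f"CPU Utilization: {cpu_utilization:.1%}")   # roughly 70%
```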
DP GFLOPS
Number of double precision giga-floating point operations calculated per second. DP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.
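As a minimal sketch, GFLOPS divides the number of floating point operations by the elapsed time and by 10^9. The operation count and run time below are assumptions for illustration:

```python
# Illustrative sketch: GFLOPS = FP operations / (elapsed seconds * 1e9).
dp_flop_count = 3.2e12     # double precision FP operations executed (assumed)
elapsed_time_s = 40.0      # wall-clock run time (assumed)

dp_gflops = dp_flop_count / elapsed_time_s / 1e9
print(f"DP GFLOPS: {dp_gflops:.1f}")   # 80.0
```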
DRAM Bandwidth
The metrics in this section indicate the extent of high DRAM bandwidth utilization by the system during elapsed time. They include:
- Average Bandwidth - Average memory bandwidth used by the system during elapsed time.
- Peak - Maximum memory bandwidth used by the system during elapsed time.
- Bound - The portion of elapsed time during which the utilization of memory bandwidth was above a 70% threshold value of the theoretical maximum memory bandwidth for the platform.
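As a minimal sketch of how the Bound value relates to the other two, the fraction of elapsed time above the 70% threshold can be estimated from periodic bandwidth samples. The platform peak and sample values below are assumptions for illustration:

```python
# Illustrative sketch: share of samples above 70% of the theoretical peak bandwidth.
theoretical_max_gb_s = 200.0                      # platform peak memory bandwidth (assumed)
samples_gb_s = [60, 150, 180, 175, 90, 40, 160]   # bandwidth samples over the run (assumed)

threshold = 0.7 * theoretical_max_gb_s            # 140 GB/s
bound = sum(s > threshold for s in samples_gb_s) / len(samples_gb_s)
average = sum(samples_gb_s) / len(samples_gb_s)
peak = max(samples_gb_s)
print(f"Average: {average:.0f} GB/s, Peak: {peak} GB/s, Bound: {bound:.0%}")
```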
DRAM Stalls
This metric indicates how often the CPU was stalled on the main memory (DRAM) because of demand loads or stores.
Elapsed Time
Execution time of the specified application, in seconds.
FP Arith/Mem Rd Instr. Ratio
This metric represents the ratio between arithmetic floating point instructions and memory read instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
FP Arith/Mem Wr Instr. Ratio
This metric represents the ratio between arithmetic floating point instructions and memory write instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
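As a minimal sketch, both ratios divide the count of floating point arithmetic instructions by the count of memory read or write instructions. The instruction counts below are assumptions for illustration:

```python
# Illustrative sketch: FP arithmetic instructions relative to memory accesses.
fp_arith_instr = 4.0e10    # FP arithmetic instructions retired (assumed)
mem_read_instr = 1.0e11    # memory read instructions retired (assumed)
mem_write_instr = 5.0e10   # memory write instructions retired (assumed)

rd_ratio = fp_arith_instr / mem_read_instr    # 0.4 -- below 0.5, check data alignment
wr_ratio = fp_arith_instr / mem_write_instr   # 0.8
print(f"FP Arith/Mem Rd: {rd_ratio:.2f}, FP Arith/Mem Wr: {wr_ratio:.2f}")
```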
GPU Metrics
This section contains metrics that enable you to analyze the efficiency of GPU utilization within your application.
This is the sum total of all times when each GPU stack had at least one thread scheduled.
This is the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics.
This is the normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled.
The average portion of time during which at least one GPU XVE thread was scheduled on each GPU stack. This metric is a percentage of the GPU Accumulated Time. This metric has a second-level breakdown by state:
- XVE Active: The normalized sum of all cycles on all cores spent actively executing instructions.
- XVE Idle: The normalized sum of all cycles on all cores when no threads were scheduled on a core.
- XVE Stalled: The normalized sum of all cycles on all cores spent stalled. At least one thread was loaded, but the core remained stalled.
Instruction Mix
This section contains the breakdown of micro-operations by single precision (SP FLOPs) and double precision (DP FLOPs) floating point and non-floating point (non-FP) operations. SP and DP FLOPs contain next-level metrics that enable you to estimate the fractions of packed and scalar operations. Packed operations can be analyzed by the vector length (128-, 256-, or 512-bit) used in the application.
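As a minimal sketch, the breakdown reports each class as a share of all micro-operations, and each FP class as packed versus scalar. The micro-operation counts below are assumptions for illustration:

```python
# Illustrative sketch: shares of SP, DP, and non-FP micro-operations,
# plus the packed fraction within each FP class.
sp_flops = {"packed_128": 1.0e10, "packed_256": 6.0e10, "packed_512": 0.0, "scalar": 1.0e10}
dp_flops = {"packed_128": 0.0, "packed_256": 2.0e10, "packed_512": 0.0, "scalar": 2.0e10}
non_fp = 8.0e10

total = sum(sp_flops.values()) + sum(dp_flops.values()) + non_fp
for name, group in (("SP FLOPs", sp_flops), ("DP FLOPs", dp_flops)):
    group_total = sum(group.values())
    packed = group_total - group["scalar"]
    print(f"{name}: {group_total / total:.0%} of uops, {packed / group_total:.0%} packed")
print(f"non-FP: {non_fp / total:.0%} of uops")
```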
Intel® Omni-Path Fabric Interconnect Bandwidth and Packet Rate
(Available for compute nodes equipped with Intel® Omni-Path Fabric (Intel® OP Fabric) and with Intel® VTune™ Profiler drivers installed)
Average interconnect bandwidth and packet rate per compute node, broken down by outgoing and incoming values. High values close to the interconnect limit might lead to higher-latency network communications.
Memory Stalls
This metric indicates how memory subsystem issues affect performance. It measures the fraction of pipeline slots that could be stalled due to demand load or store instructions. See the second-level metrics to determine whether the application is cache- or DRAM-bound and to assess NUMA efficiency.
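As a minimal sketch, the metric is the share of execution pipeline slots stalled on demand memory accesses. The slot counts below are assumptions for illustration:

```python
# Illustrative sketch: memory stalls as a fraction of execution pipeline slots.
total_pipeline_slots = 8.0e11    # total issue slots during the run (assumed)
memory_stall_slots = 2.4e11      # slots stalled on demand loads or stores (assumed)

memory_stalls = memory_stall_slots / total_pipeline_slots
print(f"Memory Stalls: {memory_stalls:.0%} of pipeline slots")   # 30%
```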
MPI Imbalance
Mean unproductive wait time per process spent in MPI library calls while a process is waiting for data.
MPI Time
Time spent inside the MPI library. Values exceeding 15% might require further investigation into MPI communication efficiency. Possible causes include high wait times inside the library, active communications, or non-optimal settings of the MPI library. See the MPI Imbalance metric to determine whether the application has a load balancing problem.
NUMA: % of Remote Accesses
In non-uniform memory architecture (NUMA) machines, memory requests missing the last level cache may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric indicates the percentage of remote accesses; the lower the value, the better.
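As a minimal sketch, the percentage is the number of requests served from remote DRAM over all DRAM-serviced requests. The access counts below are assumptions for illustration:

```python
# Illustrative sketch: share of DRAM-serviced requests that went to remote DRAM.
local_dram_accesses = 9.0e9     # requests served from local DRAM (assumed)
remote_dram_accesses = 1.0e9    # requests served from remote DRAM (assumed)

remote_pct = remote_dram_accesses / (local_dram_accesses + remote_dram_accesses)
print(f"NUMA remote accesses: {remote_pct:.0%}")   # 10% -- lower is better
```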
OpenMP Imbalance
The metric indicates the percentage of elapsed time that your application wastes at OpenMP* synchronization barriers because of load imbalance.
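As a minimal sketch, the imbalance relates the time wasted waiting at barriers to the elapsed time. The barrier wait values below are assumptions for illustration:

```python
# Illustrative sketch: wasted barrier wait time relative to elapsed time.
elapsed_time_s = 120.0                    # wall-clock run time (assumed)
barrier_wait_s = [3.0, 9.0, 14.0, 10.0]   # average wasted wait per barrier, across threads (assumed)

openmp_imbalance = sum(barrier_wait_s) / elapsed_time_s
print(f"OpenMP Imbalance: {openmp_imbalance:.0%} of elapsed time")   # 30%
```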
PCIe Metrics
Average bandwidth of inbound read and write operations initiated by PCIe devices. The data is shown for GPU and network controller devices.
Serial Time
Time spent by the application outside any OpenMP region in the master thread during collection. This directly impacts application Collection Time and scaling. High values might signal a performance problem to be solved via code parallelization or algorithm tuning.
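To see why a large serial portion limits scaling, a minimal Amdahl's law sketch (an illustration, not something the tool reports) shows how the achievable speedup flattens as the thread count grows:

```python
# Illustrative Amdahl's law sketch: the serial fraction caps parallel speedup.
serial_fraction = 0.25    # Serial Time as a fraction of single-thread run time (assumed)

for threads in (2, 8, 32, 128):
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)
    print(f"{threads:>3} threads: max speedup {speedup:.2f}x")
```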
SP GFLOPS
Number of single precision giga-floating point operations calculated per second. SP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.
Vectorization
The percentage of packed (vectorized) floating point operations. The higher the value, the bigger the vectorized portion of the code. This metric does not account for the actual vector length used for executing vector instructions. As a result, if the code is fully vectorized but uses a legacy instruction set that utilizes only half of the vector length, the Vectorization metric is still equal to 100%.
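As a minimal sketch, the metric is the share of packed floating point operations among all floating point operations, regardless of vector width. The operation counts below are assumptions for illustration:

```python
# Illustrative sketch: Vectorization = packed FP operations / all FP operations.
packed_fp_ops = 7.0e10    # packed (vector) FP operations, any vector width (assumed)
scalar_fp_ops = 3.0e10    # scalar FP operations (assumed)

vectorization = packed_fp_ops / (packed_fp_ops + scalar_fp_ops)
print(f"Vectorization: {vectorization:.0%}")   # 70%; vector width is not accounted for
```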
Xe Link Outgoing Bandwidth
The average memory bandwidth from one GPU to another GPU through an Xe link.
Xe Link Outgoing Traffic
The normalized sum of all data transfers from one GPU to another GPU through an Xe link.