Application Performance Snapshot User Guide for Linux* OS

ID 772048
Date 10/31/2024

Outlier Detection

Application Performance Snapshot offers a mechanism to detect individual metric values that exceed a defined threshold or do not fit the overall distribution.

Outliers are individual metric values that differ significantly from the other values contributing to the same average metric.

For example, the value presented for the DRAM Stalls metric is the arithmetic mean of the individual DRAM Stalls values reported by all nodes. If an MPI workload runs on multiple nodes and one or more of the nodes report a DRAM Stalls value that differs significantly from the other nodes or exceeds a defined threshold, APS marks this metric as having an outlier.

There are two types of outliers in APS:

  • Statistical: an individual metric value contributing to the overall average metric differs significantly from the overall distribution.
  • Threshold: an individual metric value exceeds a pre-defined threshold, even though the average metric value does not.

If APS indicates the presence of outliers, you can use the HTML report or the command-line interface to see the rank or node responsible for the outlier.

Outliers can:

  • Cause MPI Imbalance during your MPI application run.
  • Distort average metric values, making them less representative of real application performance.

Analyze Outliers from the HTML Report

To check your workload for outliers using the HTML report:

  1. Analyze your application with APS to obtain profiling data (see the example at the end of this section).
  2. Open the HTML report in your browser of choice.
  3. Hover over a metric name to see the Metric Tooltip.

    If any outliers are present, APS shows up to three outliers on the metric tooltip, along with the responsible node or rank.

To see more than three outliers, use the command-line interface of APS.
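
For reference, a minimal end-to-end sequence might look like the following; the rank count and application name are placeholders, not required values:

mpirun -n 4 aps ./my_app
aps --report /<aps_result>

Here /<aps_result> is the path to the result directory produced by the first command. The second command prints the summary to the terminal and also generates an HTML report file (typically named aps_report_<date>.html) in the current directory, which you can then open in your browser.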

Analyze Outliers from the Command Line

To check for outliers from the command line:

  1. Analyze your application to obtain profiling data.
  2. Generate a summary APS report for the collected result (a concrete example follows these steps) using the command:
    aps --report /<aps_result>

    APS prints out the summary report with metrics relevant to your application.

  3. If a metric has any kind of outlier, a warning message is printed next to this metric, for example:
    |Some of the individual metric values contributing to this average metric are
    |statistical outliers that can significantly distort the average metric value.
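
For instance, assuming the collection result is stored in a directory named aps_result_20241031 (a hypothetical name), the summary report command is:

aps --report ./aps_result_20241031

Any metric that contains outliers is flagged with a warning like the one shown above.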

To determine the exact node or rank where an outlier occurred, print a detailed report for a specific metric using the command:

aps --report --metrics="Metric Name" /<aps_result>

APS prints a table of all individual metric values that contributed to this average metric, showing each value, the type of outlier, and the specific node or rank where applicable. You can use this data to troubleshoot the root cause of the outlier. For example, if a single node consistently produces outliers across several hardware metrics, there may be a hardware or software issue with that particular node.
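
For example, to drill down into the DRAM Stalls metric discussed earlier (the result directory name is again hypothetical), the command would be:

aps --report --metrics="DRAM Stalls" ./aps_result_20241031

The resulting table makes it straightforward to spot a node whose DRAM Stalls value stands apart from the rest.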

For a comprehensive report on all metrics and their outliers, use:

aps --report --counters /<aps_result>