Intel® VTune™ Profiler

Cookbook

ID 766316
Date 6/24/2024
Public


Page Faults

Identify and measure the impact of page faults on application performance. Use Microarchitecture Exploration, System Overview, and Memory Consumption analyses in Intel® VTune™ Profiler.

Content Expert: Jeffrey Reinemann

A page fault occurs when a running program accesses a memory page that is not currently mapped to the virtual address space of the process. The Memory-Management Unit (MMU) handles the mapping and uses a Translation Lookaside Buffer (TLB) as a cache to reduce the time taken to access a memory location. When a TLB miss occurs, the page may be accessible to the process but simply not mapped yet. Alternatively, the page content may need to be loaded from a storage device, and a page fault exception is issued. While page faults are a common mechanism for handling virtual memory, their impact on the performance of your application can be significant. Increasing the page size is one way to reduce that impact.
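As a quick cross-check outside Intel® VTune™ Profiler, the Linux kernel keeps per-process fault counters. The sketch below is not part of the recipe. It assumes a Linux system with /proc mounted, and fault_counts is a hypothetical helper name. It prints the minor faults (the page was available but not yet mapped) and the major faults (the page content had to be loaded from storage) of a process.

```shell
#!/bin/sh
# fault_counts PID: print "<minor faults> <major faults>" for a process.
# Assumes Linux; the values are fields 10 (minflt) and 12 (majflt) of
# /proc/PID/stat as described in proc(5).
fault_counts() {
    pid=$1
    # The whole stat record is a single line.
    read -r stat < "/proc/$pid/stat" || return 1
    # Field 2 (the command name) may contain spaces, so drop everything
    # up to the last ") " before splitting; field 3 then becomes $1.
    rest=${stat##*) }
    set -- $rest
    echo "$8 ${10}"
}

# Example: counters of the current shell.
fault_counts $$
```

Sampling these counters before and after a run gives a rough fault count, but not the time spent handling the faults, which is what the analyses in this recipe measure.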

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

  • Application: matrix application available in the product directory (<install-dir>/samples/en/C++). For this recipe:

    1. Change the size of matrices. In src/multiply.h, modify the NUM value from 2048 to 8192.
    2. Rebuild the matrix application. Run make from the /linux directory.

  • Performance analysis tools: Intel® VTune™ Profiler - Microarchitecture Exploration, System Overview, and Memory Consumption analysis types

  • Operating System: Ubuntu* 22.04.1 LTS 64-bit

Identify TLB Issues with Microarchitecture Exploration Analysis

Assess the usage of hardware resources by your application. Run the Microarchitecture Exploration analysis:

  1. Open Intel® VTune™ Profiler. By default, the sample (matrix) project opens as the current project. Make sure this project is configured to launch the matrix application with NUM=8192 in src/multiply.h. Otherwise, create a new project for the updated application.

  2. On the Welcome page, click Configure Analysis.

  3. In the HOW pane, select Microarchitecture Exploration from the Microarchitecture analysis group.

  4. Click the Start button to run the analysis.

When the analysis completes, Intel® VTune™ Profiler finalizes results and opens the Summary window with application-level statistics.

Explore the Back-End Bound issues caused by TLB misses:

The DTLB Overhead metric estimates the performance penalty paid for TLB misses. Most of the overhead is attributed to the Load STLB Hit metric, which counts first-level (DTLB) misses that hit the second-level TLB (STLB).

The Load STLB Miss metric, which represents the fraction of cycles spent performing a hardware page walk, has a small value. Note that these metrics do not account for the overall time spent within page fault exceptions. While the Microarchitecture Exploration analysis helps you diagnose TLB-related issues, you still need to estimate the impact of page fault exceptions on the application elapsed time.

Trace Kernel Activity with System Overview Analysis

A page fault triggers an interrupt caught by the Linux kernel. To measure the exact CPU time spent within the Linux kernel, you need an analysis that is more granular. The System Overview analysis in the Hardware Tracing mode uses Intel® Processor Trace technology to capture all the retired branch instructions on CPU cores. In particular, this analysis enables accurate tracing of all kernel activities, including interrupts.

Even with the Launch Application target configuration, this analysis performs a system-wide data collection.

Because of the large number of branch instructions, this analysis collects a lot of raw data. You can run the analysis from the command line and limit data collection to the first 3 seconds. Before you run the analysis from the command line, set up the environment variables by running this script from the product installation directory: source env/vars.sh.

Next, run the analysis:

vtune -collect system-overview -knob collecting-mode=hw-tracing -d 3 -r matrix-so ./matrix

Open the result in the VTune Profiler GUI:

vtune-gui ./matrix-so

When the result opens, switch to the Platform tab and filter the collected data by the matrix process using the filter bar drop-down menu.

In the Timeline pane, you can see that most of the CPU time is spent within the matrix module executing the multiply function. This function is not executed continuously. Every few milliseconds, the multiply function is interrupted, and the heaviest interrupts are caused by page faults.

The grid view shows that the overall time spent by the sample application within the Linux kernel is 6.1%, with 439K kernel entries occurring within just the first 3 seconds of the application execution. To reduce this overhead, consider using huge pages.

Calculate the Amount of Allocated Memory with Memory Consumption Analysis

To switch to huge pages, define the number of pages you need.

To find this number, calculate the amount of memory allocated by the application. For simple applications like matrix, you can inspect the source code. For more complex applications, run the Memory Consumption analysis to find the exact allocated memory size or identify objects that should use huge pages.

  1. Click Configure Analysis to open your matrix project configuration.

  2. In the HOW pane, select Memory Consumption from the Hotspots analysis group.

  3. Set the "Minimal dynamic memory object size to track" option to 1.

  4. Click the Start button to run the analysis.

    Once Intel® VTune™ Profiler completes data collection, the results are finalized and displayed in the Summary window with application-level statistics.

  5. Click the Bottom-up tab. In the Allocation Size column, right-click and select Show Data As > Counts to display sizes in bytes.

  6. Right-click the grid again and choose Select All (alternatively, press Ctrl-A) to see the total allocation size.

    The application allocates 2147557472 bytes.
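For a simple application, the reported total can be cross-checked against a source-level estimate. The sketch below assumes that the sample works with four NUM x NUM arrays of 8-byte doubles. The array count is an assumption that is not stated in this recipe, so verify it against the sample source before relying on the result.

```shell
#!/bin/sh
# Rough estimate of the matrix data size from the source parameters.
# ASSUMPTION: four NUM x NUM arrays of 8-byte doubles; the array count
# is not stated in this recipe and should be checked against the source.
NUM=8192
ARRAYS=4          # assumed number of matrices
ELEM_BYTES=8      # size of a double on x86-64 Linux

estimate=$(( ARRAYS * NUM * NUM * ELEM_BYTES ))
echo "estimated matrix data: $estimate bytes"
```

If the assumption holds, the estimate is 73824 bytes below the 2147557472 bytes reported by the analysis, which would be consistent with a small number of additional allocations.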

Reduce Page Faults with Huge Pages

By default, the page size is 4 KB. With huge pages, the default page size is 2 MB and can be increased up to 1 GB. To switch to huge pages, use libhugetlbfs.

First, calculate the number of 2 MB pages you need. The sample matrix application allocates 2147557472 bytes, so you need 2147557472 / 2097152 ≈ 1024.04 pages, which rounds up to 1025 pages of 2 MB.
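The rounding can be scripted with integer ceiling division. This is a minimal sketch. The pages_needed helper is a hypothetical name, and the three page sizes are the 4 KB, 2 MB, and 1 GB values mentioned above, expressed in bytes.

```shell
#!/bin/sh
# pages_needed BYTES PAGE_BYTES: number of pages required to hold BYTES,
# rounded up with integer arithmetic (ceiling division).
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

ALLOCATED=2147557472   # total reported by the Memory Consumption analysis

for page in 4096 2097152 1073741824; do
    echo "$page-byte pages: $(pages_needed "$ALLOCATED" "$page")"
done
```

For the sample allocation, this yields 524307 pages of 4 KB, 1025 pages of 2 MB, or 3 pages of 1 GB, which illustrates how much fewer mappings the kernel has to establish with larger pages.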

To switch to huge pages:

  1. Configure the number of pages:

    sudo hugeadm --pool-pages-min 2Mb:1025

  2. Create a matrix.sh script with this content:

    #!/bin/bash
    LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./matrix

  3. Set the executable mode for the script:

    chmod u+x ./matrix.sh

  4. Repeat the System Overview analysis.

    vtune -collect system-overview -knob collecting-mode=hw-tracing -d 3 -r matrix-so-hp ./matrix.sh

  5. Open the result in the Intel® VTune™ Profiler GUI:

    vtune-gui ./matrix-so-hp
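Before repeating the collection, it can be worth confirming that the kernel actually reserved the requested pool, because a reservation can fall short when physical memory is fragmented. The sketch below reads the standard huge page fields from /proc/meminfo on Linux. The hugepage_summary helper is a hypothetical name and is not part of libhugetlbfs or Intel® VTune™ Profiler.

```shell
#!/bin/sh
# hugepage_summary [FILE]: print the huge page pool lines from a
# meminfo-style file (defaults to /proc/meminfo on Linux).
hugepage_summary() {
    awk '/^(HugePages_Total|HugePages_Free|Hugepagesize):/ { print $1, $2 }' \
        "${1:-/proc/meminfo}"
}

# For this recipe, HugePages_Total is expected to be at least 1025
# and Hugepagesize to be 2048 (the value is reported in kB).
if [ -r /proc/meminfo ]; then
    hugepage_summary
fi
```

If HugePages_Total is lower than the requested number of pages, part of the application heap may still be backed by regular pages.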

The Platform view shows a 3.3% reduction in kernel CPU time and an 8.1x reduction in kernel-mode entries.

With huge pages, the elapsed time of the matrix application drops from 106.4s to 100.5s, an improvement of about 5.5% in overall elapsed time that requires no code changes.
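The relative improvement can be recomputed from the two elapsed times. This is a small sketch that uses awk for floating-point arithmetic; improvement_pct is a hypothetical helper name.

```shell
#!/bin/sh
# improvement_pct BEFORE AFTER: relative reduction in percent,
# printed with one decimal place.
improvement_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Elapsed times from this recipe, in seconds.
improvement_pct 106.4 100.5
```

The same helper can be applied to your own before and after measurements when you evaluate huge pages on a different workload.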