Intel® VTune™ Profiler

Cookbook

ID 766316
Date 6/03/2024
Public


OS Thread Migration

Identify OS thread migration on the NUMA architecture with the Hotspots analysis in Intel® VTune™ Profiler.

Complex operating systems use a scheduler to assign application threads to processor cores. These threads are called software threads. The scheduler may choose the placement of the application threads on the physical cores depending on a number of different factors such as system state or system policies.

A software thread executes on a core for some period of time before it gets swapped out to wait. A thread can wait for several reasons; being blocked on I/O is one of them. While it waits, another software thread, if one is available, may be given a chance to execute on this core. When the original software thread is ready to execute again, the scheduler may migrate it to another core to ensure timely execution.

Software thread migration is a problem on modern computing architectures because it separates the thread from data that has already been fetched into the caches, resulting in longer data access latencies. The problem is amplified on Non-Uniform Memory Access (NUMA) architectures, where each processor has its own local memory module that it can access directly with a distinct performance advantage. On a NUMA architecture, when a software thread migrates to a core on another processor, the data stored in the local memory of the first processor becomes remote and memory access times increase significantly. Hence, thread migration can hurt performance.

Follow this recipe to see if thread migration occurs in your application.

Content Expert: Jeffrey Reinemann

  1. INGREDIENTS

  2. DIRECTIONS:

    1. Run Hotspots Analysis with Hardware Event-Based Sampling.

    2. Identify thread migration.

    3. Correct thread migration.

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

  • Application: Sample OpenMP* application. The application is used as a demo and not available for download.

  • Performance analysis tools: Intel® VTune™ Profiler version 2018 or newer - Hotspots analysis

  • Operating system: Linux*, Ubuntu* 22.04 64-bit

  • CPU: Intel® Core™ i7-6700K processor

Run Hotspots Analysis with Hardware Event-Based Sampling

  1. In the Intel® VTune™ Profiler UI, select Hotspots Analysis from the Analysis Tree.
  2. Configure the analysis and select the sample application as the target.
  3. Select Hardware Event-Based Sampling mode with a CPU sampling interval of 1 ms.
  4. Run the analysis.
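
The same collection can probably be started from the command line. The sketch below is a minimal example under stated assumptions, not the recipe's own command: APP and RESULT_DIR are hypothetical placeholders, and the sampling-mode and sampling-interval knobs should be checked against the output of `vtune -help collect hotspots` for your VTune Profiler version.

```shell
# Hypothetical placeholders, not part of the recipe: adjust to your setup.
APP=${APP:-./my_omp_app}                    # application to profile
RESULT_DIR=${RESULT_DIR:-./r_hotspots_hw}   # directory for the VTune result

collect_hotspots_hw() {
    # Hotspots analysis in hardware event-based sampling mode
    # with a 1 ms CPU sampling interval (knob names are assumptions,
    # verify them with: vtune -help collect hotspots).
    vtune -collect hotspots \
          -knob sampling-mode=hw \
          -knob sampling-interval=1 \
          -result-dir "$RESULT_DIR" \
          -- "$APP"
}

# Collect only when both VTune and the application are present.
if command -v vtune >/dev/null 2>&1 && [ -x "$APP" ]; then
    collect_hotspots_hw
else
    echo "vtune or $APP not found; nothing collected" >&2
fi
```

Wrapping the command in a function keeps the knobs in one place, so the same settings can be reused when you repeat the collection after changing thread affinity.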

Identify Thread Migration

Once the analysis completes, the Summary window opens with a list of top hotspots in your application.

Examine this list and then switch to the Bottom-up window.

Follow these steps:

Select the Core/Thread/Function/Call Stack grouping.

Expand core nodes to see the number of software threads. The number of software threads should be less than or equal to the total number of hardware threads that the CPU supports. Also, the software threads should be evenly distributed across the cores. If you see a higher count of software threads under any core in your result, thread migration is occurring in your application. In this example, there are 12 OpenMP* worker threads where 2 threads are expected. The example uses an Intel® Xeon® processor that supports Intel® Hyper-Threading Technology, so each core provides two hardware threads. The core_8 node shows that thread migration is happening.

Next, analyze thread migration in the Timeline pane. Select the Thread/Logical Core grouping.

Expand the thread nodes to see the CPUs on which each thread executed. Analyze thread execution over time. In this example, OpenMP thread #0 was executing on cpu_23 and then migrated to cpu_47.

To generate a report with the same grouping from the command line, run this command on the collected result:

vtune -group-by thread,cpuid -report hotspots -r /temp/test/omp -s "Logical Core" -q | less

Thread                          Logical Core  CPU Time:Self
------------------------------  ------------  -------------
OMP Worker Thread #5 (0x3d86)   cpu_0                 0.004
matmul-intel64 (0x3d52)         cpu_1                 0.013
OMP Worker Thread #15 (0x3d90)  cpu_10                2.418
matmul-intel64 (0x3d52)         cpu_10                2.023
OMP Worker Thread #8 (0x3d89)   cpu_10                0.687
OMP Worker Thread #13 (0x3d8e)  cpu_10                0.097
OMP Worker Thread #6 (0x3d87)   cpu_10                0.065
OMP Worker Thread #4 (0x3d85)   cpu_10                0.059
OMP Worker Thread #1 (0x3d82)   cpu_10                0.048
OMP Worker Thread #9 (0x3d8a)   cpu_10                0.034
OMP Worker Thread #11 (0x3d8c)  cpu_10                0.009

As in the GUI, the report shows a large number of OpenMP worker threads running on cpu_10, which indicates thread migration.
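
For longer reports, counting the rows per logical core can make crowded cores easier to spot. The following is a minimal sketch, assuming the report keeps the layout shown above, with one row per thread and logical core pair and core names of the form cpu_N. The function name count_threads_per_core and the sample_report variable are hypothetical helpers introduced only for illustration.

```shell
# Count how many distinct report rows (threads) appear under each logical core.
# Reads a report in the layout shown above on standard input.
count_threads_per_core() {
    awk '{
        # The logical core is the first field that looks like cpu_<number>;
        # header and separator lines have no such field and are skipped.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^cpu_[0-9]+$/) { n[$i]++; break }
    }
    END { for (c in n) print n[c], c }' | sort -rn
}

# A few rows copied from the report above, used as sample input.
sample_report='Thread                          Logical Core  CPU Time:Self
------------------------------  ------------  -------------
OMP Worker Thread #5 (0x3d86)   cpu_0                 0.004
matmul-intel64 (0x3d52)         cpu_1                 0.013
OMP Worker Thread #15 (0x3d90)  cpu_10                2.418
matmul-intel64 (0x3d52)         cpu_10                2.023
OMP Worker Thread #8 (0x3d89)   cpu_10                0.687'

printf '%s\n' "$sample_report" | count_threads_per_core
```

For the sample rows, the first output line is `3 cpu_10`. To apply the same count to a real result, pipe the report into the function, for example `vtune -group-by thread,cpuid -report hotspots -r /temp/test/omp -q | count_threads_per_core`.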

Correct Thread Migration

You can correct the effects of thread migration by setting thread affinity. Thread affinity restricts the execution of certain threads to a subset of the physical processing units in a multiprocessor computer.

To set thread affinity for your OpenMP application, use the Intel® OpenMP* runtime library, which can bind OpenMP threads to physical processing units. You can control the binding with one of these environment variables:

  • OMP_PROC_BIND
  • OMP_PLACES
  • KMP_AFFINITY (specific to the Intel runtime)
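
As an illustration, the variables above might be set as follows before launching the application. This is a sketch under stated assumptions: the values shown are common choices rather than settings taken from this recipe, APP is a hypothetical placeholder, and OMP_DISPLAY_AFFINITY requires a runtime that implements OpenMP 5.0.

```shell
# Standard OpenMP controls: define one place per physical core and
# bind threads to places close to the primary thread.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# Optional, OpenMP 5.0 runtimes only: print each thread's binding at startup.
export OMP_DISPLAY_AFFINITY=true

# Intel-runtime-specific alternative to the two standard variables above:
# export KMP_AFFINITY=granularity=fine,compact

# Hypothetical application path, not part of the recipe.
APP=${APP:-./my_omp_app}
if [ -x "$APP" ]; then
    "$APP"
fi
```

After changing the affinity settings, repeat the Hotspots analysis and check the Core/Thread/Function/Call Stack grouping again to confirm that each thread stays on its assigned core.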