Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 10/31/2024
Public

Managing Performance with Heterogeneous Cores

A hybrid architecture offers heterogeneous CPU cores. For example, the 12th Gen Intel® Core™ processor (Alder Lake) contains two types of cores: Performance-cores (P-cores) and Efficient-cores (E-cores).

Achieving the best performance on a hybrid architecture is harder because load balancing with heterogeneous cores is more complicated. Therefore, for hybrid architectures like Alder Lake, we recommend running threads on the P-cores only. This approach might not yield the best performance, but it is simple and predictable.

To restrict OpenMP threads to the P-cores, set the environment variable KMP_HW_SUBSET. For a detailed description of this environment variable, refer to the Intel® C++ Compiler Classic Developer Guide and Reference. On an Alder Lake processor with eight P-cores, either of the following two commands restricts threads to run only on the P-cores:

export KMP_HW_SUBSET=8c:intel_core

—or—

export KMP_HW_SUBSET=8c:eff1
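
The core count in the commands above (8) is specific to the example processor. As a minimal sketch, the P-core count could be derived from sysfs rather than hard-coded. This assumes a Linux kernel that exposes the P-core CPU list at /sys/devices/cpu_core/cpus (typical on hybrid Intel systems, but not stated in this guide) and a single-socket machine where core_id values are unique. The helper names expand_cpulist and count_cores are hypothetical.

```shell
# Expand a Linux cpulist string such as "0-3,8,10-11" into one CPU id per line.
expand_cpulist() {
    local part lo hi
    local IFS=','
    for part in $1; do
        lo=${part%-*}   # text before the dash (or the whole item if no dash)
        hi=${part#*-}   # text after the dash (or the whole item if no dash)
        seq "$lo" "$hi"
    done
}

# Count distinct physical cores among the logical CPUs in a cpulist.
# $1: cpulist, $2: sysfs root (defaults to /sys; overridable for testing).
count_cores() {
    local root=${2:-/sys} cpu
    for cpu in $(expand_cpulist "$1"); do
        cat "$root/devices/system/cpu/cpu$cpu/topology/core_id"
    done | sort -un | wc -l
}

# Assumption: this file exists only on hybrid systems with a recent kernel.
if [ -r /sys/devices/cpu_core/cpus ]; then
    n=$(count_cores "$(cat /sys/devices/cpu_core/cpus)")
    export KMP_HW_SUBSET="${n}c:intel_core"
fi
```

Counting distinct core_id values rather than logical CPUs keeps the result correct whether or not Intel® Hyper-Threading Technology is enabled.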

Note that for higher performance, Intel® Hyper-Threading Technology on P-cores should be disabled. You can achieve this either by changing the BIOS setting or by using KMP_HW_SUBSET to specify P-cores and one thread per core with either of the following commands:

export KMP_HW_SUBSET=8c:intel_core,1t

—or—

export KMP_HW_SUBSET=8c:eff1,1t
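
The export commands above change the setting for the whole shell session. As an alternative sketch, the setting can be scoped to a single command so that other programs started from the same shell are unaffected. The wrapper name run_on_pcores and the application name my_mkl_app are hypothetical.

```shell
# Run a command with OpenMP threads limited to N P-cores, one thread per core.
# $1: number of P-cores; remaining arguments: the command to run.
run_on_pcores() {
    local ncores=$1
    shift
    # The assignment applies only to the environment of this one command.
    KMP_HW_SUBSET="${ncores}c:intel_core,1t" "$@"
}

# Print the value a child process would see:
run_on_pcores 8 sh -c 'echo "$KMP_HW_SUBSET"'
```

A typical invocation would then look like `run_on_pcores 8 ./my_mkl_app`.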

If the user decides to adopt the more difficult approach of running on both P-cores and E-cores to maximize performance, there are a few aspects to take into consideration:

  • Static versus dynamic load balancing
  • Problem size
  • Number of P-cores and E-cores
  • OpenMP versus oneTBB

If the numbers of P-cores and E-cores are similar or equal and both core types are used, static load balancing for splitting the work items is likely to result in lower performance, because the E-cores take longer to complete the work items assigned to them. For large GEMMs and {S,D}GETRF routines, oneMKL implements dynamic load balancing with OpenMP and automatically selects the best load balancing scheme. In most cases with small or regular problem sizes, static load balancing on P-cores only is likely to give better performance. For very large problems, the overhead of dynamic scheduling is small compared to the overall computation time, and dynamic load balancing makes more efficient use of both P-cores and E-cores.
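
A rough illustration of the static case, using assumed numbers rather than measurements: take W equal work items, 8 P-cores that each process s items per second, and 8 E-cores that each process s/2 items per second. Scheduling overhead is ignored, and the dynamic case is idealized.

```latex
T_{\text{P-only}} = \frac{W/8}{s} = \frac{W}{8s}, \qquad
T_{\text{static, all cores}} = \frac{W/16}{s/2} = \frac{W}{8s}, \qquad
T_{\text{dynamic, ideal}} = \frac{W}{8s + 8 \cdot (s/2)} = \frac{W}{12s}
```

Under these assumptions, an even static split across all 16 cores finishes no sooner than the P-cores alone, because the total time is set by the E-cores. It would finish later if the E-cores were slower than s/2. Only a dynamic scheme gets close to the combined throughput of 12s.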

If the number of P-cores is much smaller than the number of E-cores, running on all cores may outperform limiting computations to only P-cores. Measure both configurations to determine the best strategy for a given workload.
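
One way to compare the two configurations is a small timing harness. The sketch below assumes GNU date and env (standard on Linux). The benchmark path ./my_mkl_benchmark is a hypothetical placeholder, as are the helper names time_run and best_ms.

```shell
# Print the wall-clock time of a command in milliseconds (needs GNU date for %N).
time_run() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Run a command N times and print the fastest time in milliseconds.
# $1: number of runs; remaining arguments: the command to run.
best_ms() {
    local runs=$1 best= ms i
    shift
    for i in $(seq "$runs"); do
        ms=$(time_run "$@")
        if [ -z "$best" ] || [ "$ms" -lt "$best" ]; then
            best=$ms
        fi
    done
    echo "$best"
}

# Hypothetical benchmark binary; the comparison runs only if it exists.
bench=./my_mkl_benchmark
if [ -x "$bench" ]; then
    echo "P-cores only: $(best_ms 5 env KMP_HW_SUBSET=8c:intel_core,1t "$bench") ms"
    echo "All cores:    $(best_ms 5 env -u KMP_HW_SUBSET "$bench") ms"
fi
```

Taking the fastest of several runs reduces noise from other system activity. The P-core count of 8 matches the example processor above and should be adjusted for the machine under test.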

As an alternative to OpenMP, users can also try oneTBB, which might give better results for a given set of supported operations.
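
A minimal sketch of trying oneTBB threading follows. It assumes the application is linked against the oneMKL single dynamic library (-lmkl_rt), so the threading layer can be chosen at run time. The application name my_mkl_app is hypothetical. Because KMP_HW_SUBSET is a setting of the OpenMP runtime, it is not expected to restrict oneTBB threads; the commented taskset line shows one possible way to pin the process to P-cores instead, assuming the sysfs file it reads is available.

```shell
# Select the oneTBB threading layer for an application linked with -lmkl_rt.
export MKL_THREADING_LAYER=TBB

# Hypothetical application; time it and compare against the OpenMP runs.
# ./my_mkl_app

# Optional, assumption-based: restrict the process to the P-core logical CPUs.
# taskset -c "$(cat /sys/devices/cpu_core/cpus)" ./my_mkl_app
```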
