Visible to Intel only — GUID: GUID-8BA55F4A-D5AE-4E27-8C25-058B68D280A4
Visible to Intel only — GUID: GUID-8BA55F4A-D5AE-4E27-8C25-058B68D280A4
Thread Affinity Interface
The Intel® runtime library has the ability to bind OpenMP* threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Depending on the system (machine) topology, application, and operating system, thread affinity can have a dramatic effect on the application speed.
Thread affinity restricts execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer. Depending upon the topology of the machine, thread affinity can have a dramatic effect on the execution speed of a program.
Thread affinity is supported on Windows* systems and versions of Linux* systems that have kernel support for thread affinity, but is not supported by macOS.
The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. There are three types of interfaces you can use to specify this binding, which are collectively referred to as the Intel OpenMP Thread Affinity Interface:
The high-level affinity interface uses an environment variable to determine the machine topology and assigns OpenMP threads to the processors based upon their physical location in the machine. This interface is controlled entirely by the KMP_AFFINITY environment variable.
The mid-level affinity interface uses an environment variable to explicitly specifies which processors (labeled with integer IDs) are bound to OpenMP threads. This interface provides compatibility with the GCC* GOMP_AFFINITY environment variable, but you can also invoke it by using the KMP_AFFINITY environment variable. The GOMP_AFFINITY environment variable is supported on Linux systems only, but users on Windows or Linux systems can use the similar functionality provided by the KMP_AFFINITY environment variable.
The low-level affinity interface uses APIs to enable OpenMP threads to make calls into the OpenMP runtime library to explicitly specify the set of processors on which they are to be run. This interface is similar in nature to sched_setaffinity and related functions on Linux systems or to SetThreadAffinityMask and related functions on Windows systems. In addition, you can specify certain options of the KMP_AFFINITY environment variable to affect the behavior of the low-level API interface. For example, you can set the affinity type KMP_AFFINITY to disabled, which disables the low-level affinity interface, or you could use the KMP_AFFINITY or GOMP_AFFINITY environment variables to set the initial affinity mask, and then retrieve the mask with the low-level API interface.
The following terms are used in this section:
The total number of processing elements on the machine is referred to as the number of OS thread contexts.
Each processing element is referred to as an Operating System processor, or OS proc.
Each OS processor has a unique integer identifier associated with it, called an OS proc ID.
The term package refers to a single or multi-core processor chip.
The term OpenMP Global Thread ID (GTID) refers to an integer which uniquely identifies all threads known to the Intel OpenMP runtime library. The thread that first initializes the library is given GTID 0. In the normal case where all other threads are created by the library and when there is no nested parallelism, then n-threads-var - 1 new threads are created with GTIDs ranging from 1 to ntheads-var - 1, and each thread's GTID is equal to the OpenMP thread number returned by function omp_get_thread_num(). The high-level and mid-level interfaces rely heavily on this concept. Hence, their usefulness is limited in programs containing nested parallelism. The low-level interface does not make use of the concept of a GTID and can be used by programs containing arbitrarily many levels of parallelism.
Some environment variables are available for both Intel® microprocessors and non-Intel microprocessors, but may perform additional optimizations for Intel® microprocessors than for non-Intel microprocessors.
The KMP_AFFINITY Environment Variable
You must set the KMP_AFFINITY environment variable before the first parallel region, or certain API calls including omp_get_max_threads(), omp_get_num_procs() and any affinity API calls, as described in Low Level Affinity API, below.
The KMP_AFFINITY environment variable uses the following general syntax:
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
For example, to list a machine topology map, specify KMP_AFFINITY=verbose,none to use a modifier of verbose and a type of none.
The following table describes the supported specific arguments.
Argument |
Default |
Description |
---|---|---|
noverbose respect granularity=core |
Optional. String consisting of keyword and specifier.
The syntax for <proc-list> is explained in mid-level affinity interface.
NOTE:
On Windows with multiple processor groups, the norespect affinity modifier is assumed when the process affinity mask equals a single processor group (which is default on Windows). Otherwise, the respect affinity modifier is used.
|
|
none |
Required string. Indicates the thread affinity to use.
The logical and physical types are deprecated but supported for backward compatibility. |
|
0 |
Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. |
|
0 |
Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. |
Affinity Types
Type is the only required argument.
type = none (default)
Does not bind OpenMP threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology. Specify KMP_AFFINITY=verbose,none to list a machine topology map.
type = balanced
Places threads on separate cores until all cores have at least one thread, similar to the scatter type. However, when the runtime must use multiple hardware thread contexts on the same core, the balanced type ensures that the OpenMP thread numbers are close to each other, which scatter does not do. This affinity type is supported on the CPU only for single socket systems.
The OpenMP* environment variable OMP_PROC_BIND=spread is similar to KMP_AFFINITY=balanced and is available on all platforms, including multi-socket CPU systems.
type = compact
Specifying compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n> OpenMP thread was placed. For example, in a topology map, the nearer a node is to the root, the more significance the node has when sorting the threads.
type = disabled
Specifying disabled completely disables the thread affinity interfaces. This forces the OpenMP runtime library to behave as if the affinity interface was not supported by the operating system. This includes the low-level API interfaces such as kmp_set_affinity and kmp_get_affinity, which have no effect and will return a nonzero error code.
type = explicit
Specifying explicit assigns OpenMP threads to a list of OS proc IDs that have been explicitly specified by using the proclist= modifier, which is required for this affinity type. See Explicitly Specifying OS Proc IDs (GOMP_CPU_AFFINITY).
type = scatter
Specifying scatter distributes the threads as evenly as possible across the entire system. scatter is the opposite of compact; so the leaves of the node are most significant when sorting through the machine topology map.
Deprecated Types: logical and physical
Types logical and physical are deprecated and may become unsupported in a future release. Both are supported for backward compatibility.
For logical and physical affinity types, a single trailing integer is interpreted as an offset specifier instead of a permute specifier. In contrast, with compact and scatter types, a single trailing integer is interpreted as a permute specifier.
Specifying logical assigns OpenMP threads to consecutive logical processors, which are also called hardware thread contexts. The type is equivalent to compact, except that the permute specifier is not allowed. Thus, KMP_AFFINITY=logical,n is equivalent to KMP_AFFINITY=compact,0,n (this equivalence is true regardless of the whether or not a granularity=fine modifier is present).
Specifying physical assigns threads to consecutive physical processors (cores). For systems where there is only a single thread context per core, the type is equivalent to logical. For systems where multiple thread contexts exist per core, physical is equivalent to compact with a permute specifier of 1; that is, KMP_AFFINITY=physical,n is equivalent to KMP_AFFINITY=compact,1,n (regardless of the whether or not a granularity=fine modifier is present). This equivalence means that when the compiler sorts the map it should permute the innermost level of the machine topology map to the outermost, presumably the thread context level. This type does not support the permute specifier.
Examples of Types compact and scatter
The following figure illustrates the topology for a machine with two processors, and each processor has two cores; further, each core has Intel® Hyper-Threading Technology (Intel® HT Technology) enabled.
The following figure also illustrates the binding of OpenMP thread to hardware thread contexts when specifying KMP_AFFINITY=granularity=fine,compact.
Specifying scatter on the same system as shown in the figure above, the OpenMP threads would be assigned the thread contexts as shown in the following figure, which shows the result of specifying KMP_AFFINITY=granularity=fine,scatter.
permute and offset Combinations
For both compact and scatter, permute and offset are allowed; however, if you specify only one integer, the compiler interprets the value as a permute specifier. Both permute and offset default to 0.
The permute specifier controls which levels are most significant when sorting the machine topology map. A value for permute forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.
The offset specifier indicates the starting position for thread assignment.
The following figure illustrates the result of specifying KMP_AFFINITY=granularity=fine,compact,0,5.
Consider the hardware configuration from the previous example, running an OpenMP application which exhibits data sharing between consecutive iterations of loops. We would therefore like consecutive threads to be bound close together, as is done with KMP_AFFINITY=compact, so that communication overhead, cache line invalidation overhead, and page thrashing are minimized. Now, suppose the application also had a number of parallel regions which did not utilize all of the available OpenMP threads. It is desirable to avoid binding multiple threads to the same core and leaving other cores not utilized, since a thread normally executes faster on a core where it is not competing for resources with another active thread on the same core. Since a thread normally executes faster on a core where it is not competing for resources with another active thread on the same core, you might want to avoid binding multiple threads to the same core while leaving other cores unused. The following figure illustrates this strategy of using KMP_AFFINITY=granularity=fine,compact,1,0 as a setting.
The OpenMP thread n+1 is bound to a thread context as close as possible to OpenMP thread n, but on a different core. Once each core has been assigned one OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but they are assigned on different thread contexts.
Modifier Values for Affinity Types
Modifiers are optional arguments that precede type. If you do not specify a modifier, the noverbose, respect, and granularity=core modifiers are used automatically.
Modifiers are interpreted in order from left to right, and they may conflict. Following conflicting modifier is ignored. For example, specifying KMP_AFFINITY=verbose,noverbose,scatter is therefore equivalent to setting KMP_AFFINITY=verbose,scatter.
modifier = noverbose (default)
Does not print verbose messages.
modifier = verbose
Prints messages concerning the supported affinity. The messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts.
Information about binding OpenMP threads to physical thread contexts is indirectly shown in the form of the mappings between hardware thread contexts and the operating system (OS) processor (proc) IDs. The affinity mask for each OpenMP thread is printed as a set of OS processor IDs.
For example, specifying KMP_AFFINITY=verbose,scatter on a dual core system with two processors, with Intel® Hyper-Threading Technology (Intel® HT Technology) disabled, results in a message listing similar to the following when then program is executed:
... KMP_AFFINITY: Initial OS proc set respected: 0,1,2,3 KMP_AFFINITY: affinity capable, using hwloc. KMP_AFFINITY: 4 available OS procs KMP_AFFINITY: Uniform topology KMP_AFFINITY: 2 sockets x 2 cores/socket x 1 threads/core (4 total cores) KMP_AFFINITY: OS proc to physical thread map: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0 KMP_AFFINITY: OS proc 1 maps to socket 3 core 0 thread 0 KMP_AFFINITY: OS proc 3 maps to socket 3 core 1 thread 0 KMP_AFFINITY: pid 79739 tid 79739 thread 0 bound to OS proc set 0 KMP_AFFINITY: pid 79739 tid 79740 thread 2 bound to OS proc set 2 KMP_AFFINITY: pid 79739 tid 79741 thread 3 bound to OS proc set 3 KMP_AFFINITY: pid 79739 tid 79742 thread 1 bound to OS proc set 1
The verbose modifier generates several standard, general messages. The following table summarizes how to read the messages.
Message String |
Description |
---|---|
"affinity capable" |
Indicates that all components (compiler, operating system, and hardware) support affinity, so thread binding is possible. |
"decoding x2APIC ids" |
Indicates that the machine topology was discovered by binding a thread to each operating system processor and decoding the output of the cpuid instruction. |
"using hwloc" |
Indicates that the Portable Hardware Locality* (hwloc) library used to determine machine topology. |
"using /proc/cpuinfo" |
Linux only. Indicates that cpuinfo is being used to determine machine topology. |
"using flat" |
Operating system processor ID is assumed to be equivalent to physical package ID. This method of determining machine topology is used if none of the other methods will work, but may not accurately detect the actual machine topology. |
"uniform topology" |
The machine topology map is a full tree with no missing leaves at any level. |
The mapping from the operating system processors to thread context ID is printed next. The binding of OpenMP thread context ID is printed next unless the affinity type is none. For more information, see Determining Machine Topology.
modifier = granularity
Binding OpenMP threads to particular packages and cores will often result in a performance gain on systems with Intel processors with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled; however, it is usually not beneficial to bind each OpenMP thread to a particular thread context on a specific core. Granularity describes the lowest levels that OpenMP threads are allowed to float within a topology map.
This modifier supports the following additional specifiers.
Specifier |
Description |
---|---|
core |
Default. Allows all the OpenMP threads bound to a core to float between the different thread contexts. |
fine or thread |
The finest granularity level. Causes each OpenMP thread to be bound to a single thread context. The two specifiers are functionally equivalent. |
tile, or die, or node, or group, or socket |
Allows all the OpenMP threads bound to a tile, or die, or NUMA node, or group, or socket to float between the different thread contexts of cores the tile, or die, or NUMA node, or group, or socket consists of. |
Specifying KMP_AFFINITY=verbose,granularity=core,compact on the same dual core system with two processors as in the previous section, but with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, results in a message listing similar to the following when the program is executed:
KMP_AFFINITY: Initial OS proc set respected: 0-7 KMP_AFFINITY: decoding x2APIC ids. KMP_AFFINITY: 8 available OS procs KMP_AFFINITY: Uniform topology KMP_AFFINITY: 2 sockects x 2 cores/socket x 2 threads/core (4 total cores) KMP_AFFINITY: OS proc to physical thread map: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 KMP_AFFINITY: OS proc 4 maps to socket 0 core 0 thread 1 KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0 KMP_AFFINITY: OS proc 6 maps to socket 0 core 1 thread 1 KMP_AFFINITY: OS proc 1 maps to socket 3 core 0 thread 0 KMP_AFFINITY: OS proc 5 maps to socket 3 core 0 thread 1 KMP_AFFINITY: OS proc 3 maps to socket 3 core 1 thread 0 KMP_AFFINITY: OS proc 7 maps to socket 3 core 1 thread 1 KMP_AFFINITY: pid 40880 tid 40880 thread 0 bound to OS proc set 0,4 KMP_AFFINITY: pid 40880 tid 40881 thread 1 bound to OS proc set 0,4 KMP_AFFINITY: pid 40880 tid 40882 thread 2 bound to OS proc set 2,6 KMP_AFFINITY: pid 40880 tid 40883 thread 3 bound to OS proc set 2,6 KMP_AFFINITY: pid 40880 tid 40884 thread 4 bound to OS proc set 1,5 KMP_AFFINITY: pid 40880 tid 40885 thread 5 bound to OS proc set 1,5 KMP_AFFINITY: pid 40880 tid 40886 thread 6 bound to OS proc set 3,7 KMP_AFFINITY: pid 40880 tid 40887 thread 7 bound to OS proc set 3,7
The affinity mask for each OpenMP thread is shown in the listing (above) as the set of operating system processor to which the OpenMP thread is bound.
The following figure illustrates the machine topology map, for the above listing, with OpenMP thread bindings.
In contrast, specifying KMP_AFFINITY=verbose,granularity=fine,compact or KMP_AFFINITY=verbose,granularity=thread,compact binds each OpenMP thread to a single hardware thread context when the program is executed:
KMP_AFFINITY: Initial OS proc set respected: 0-7 KMP_AFFINITY: decoding x2APIC ids. KMP_AFFINITY: 8 available OS procs KMP_AFFINITY: Uniform topology KMP_AFFINITY: 2 sockets x 2 cores/socket x 2 threads/core (4 total cores) KMP_AFFINITY: OS proc to physical thread map: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 KMP_AFFINITY: OS proc 4 maps to socket 0 core 0 thread 1 KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0 KMP_AFFINITY: OS proc 6 maps to socket 0 core 1 thread 1 KMP_AFFINITY: OS proc 1 maps to socket 3 core 0 thread 0 KMP_AFFINITY: OS proc 5 maps to socket 3 core 0 thread 1 KMP_AFFINITY: OS proc 3 maps to socket 3 core 1 thread 0 KMP_AFFINITY: OS proc 7 maps to socket 3 core 1 thread 1 KMP_AFFINITY: pid 40895 tid 40895 thread 0 bound to OS proc set 0 KMP_AFFINITY: pid 40895 tid 40896 thread 1 bound to OS proc set 4 KMP_AFFINITY: pid 40895 tid 40897 thread 2 bound to OS proc set 2 KMP_AFFINITY: pid 40895 tid 40898 thread 3 bound to OS proc set 6 KMP_AFFINITY: pid 40895 tid 40899 thread 4 bound to OS proc set 1 KMP_AFFINITY: pid 40895 tid 40900 thread 5 bound to OS proc set 5 KMP_AFFINITY: pid 40895 tid 40901 thread 6 bound to OS proc set 3 KMP_AFFINITY: pid 40895 tid 40902 thread 7 bound to OS proc set 7
The OpenMP to hardware context binding for this example was illustrated in the first example.
Specifying granularity=fine will always cause each OpenMP thread to be bound to a single OS processor. This is equivalent to granularity=thread, currently the finest granularity level.
modifier = respect (Default)
Respect the process' original affinity mask, or more specifically, the affinity mask in place for the thread that initializes the OpenMP runtime library. The behavior differs between Linux and Windows:
Linux
Respect the affinity mask for the thread that initializes the OpenMP runtime library.
Windows
Respect original affinity mask for the process.
Specifying KMP_AFFINITY=verbose,compact for the same system used in the previous example, with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, and invoking the library with an initial affinity mask of {4,5,6,7} (thread context 1 on every core) causes the compiler to model the machine as a dual core, two-processor system with Intel® HT Technology disabled.
KMP_AFFINITY: Initial OS proc set respected: 4-7 KMP_AFFINITY: decoding x2APIC ids. KMP_AFFINITY: 4 available OS procs KMP_AFFINITY: Uniform topology KMP_AFFINITY: 2 sockets x 2 cores/socket x 1 threads/core (4 total cores) KMP_AFFINITY: OS proc to physical thread map: KMP_AFFINITY: OS proc 4 maps to socket 0 core 0 thread 1 KMP_AFFINITY: OS proc 6 maps to socket 0 core 1 thread 1 KMP_AFFINITY: OS proc 5 maps to socket 3 core 0 thread 1 KMP_AFFINITY: OS proc 7 maps to socket 3 core 1 thread 1 KMP_AFFINITY: pid 41032 tid 41032 thread 0 bound to OS proc set 4 KMP_AFFINITY: pid 41032 tid 41033 thread 1 bound to OS proc set 6 KMP_AFFINITY: pid 41032 tid 41034 thread 2 bound to OS proc set 5 KMP_AFFINITY: pid 41032 tid 41035 thread 3 bound to OS proc set 7
Because there are four thread contexts accessible on the machine, by default the compiler created four threads for an OpenMP parallel construct.
The following figure illustrates the corresponding machine topology map and threads placement in case eight OpenMP threads requested via OMP_NUM_THREADS=8
When using the local cpuid information to determine the machine topology, it is not always possible to distinguish between a machine that does not support Intel® Hyper-Threading Technology (Intel® HT Technology) and a machine that supports it, but has it disabled. Therefore, the compiler does not include a level in the map if the elements (nodes) at that level had no siblings, with the exception that the package level is always modeled. As mentioned earlier, the package level will always appear in the topology map, even if there only a single package in the machine.
modifier = norespect
Do not respect original affinity mask for the process. Binds OpenMP threads to all operating system processors.
In early versions of the OpenMP runtime library that supported only the physical and logical affinity types, norespect was the default and was not recognized as a modifier.
The default was changed to respect when types compact and scatter were added; therefore, thread bindings may have changed with the newer compilers in situations where the application specified a partial initial thread affinity mask.
modifier = nowarnings
Do not print warning messages from the affinity interface.
modifier = warnings (Default)
Print warning messages from the affinity interface (default).
modifier = noreset (Default)
Do not reset the primary thread's affinity after each outermost parallel region is complete. This setting preserves the primary thread's OpenMP affinity setting between parallel regions. For example, if KMP_AFFINITY=compact,granularity=core, then the primary thread's affinity is set to the first core for the first parallel region and kept that way for the thread's lifetime, even during serial regions.
modifier = reset
Reset the primary thread's affinity after each outermost parallel region is complete. This setting will reset the primary thread's affinity back to the initial affinity before OpenMP was initialized after each outermost parallel region is complete.
Determine Machine Topology
If the package has an APIC (Advanced Programmable Interrupt Controller), the compiler will use the cpuid instruction to obtain the package id, core id, and thread context id. Under normal conditions, each thread context on the system is assigned a unique APIC ID at boot time. The compiler obtains other pieces of information obtained by using the cpuid instruction, which together with the number of OS thread contexts (total number of processing elements on the machine), determine how to break the APIC ID down into the package ID, core ID, and thread context ID.
There are several ways to specify the APIC ID in the cpuid instruction - the legacy method in leaf 4, and the more modern method in leaf 11 and leaf 31. Only 256 unique APIC IDs are available in leaf 4. Leaf 11 and leaf 31 have no such limitation.
Normally, all core ids on a package and all thread context ids on a core are contiguous; however, numbering assignment gaps are common for package ids, as shown in the figure above.
If the compiler cannot determine the machine topology using any other method, but the operating system supports affinity, a warning message is printed, and the topology is assumed to be flat. For example, a flat topology assumes the operating system process N maps to package N, and there exists only one thread context per core and only one core for each package.
If the machine topology cannot be accurately determined as described above, the user can manually copy /proc/cpuinfo to a temporary file, correct any errors, and specify the machine topology to the OpenMP runtime library via the environment variable KMP_CPUINFO_FILE=<temp_filename>, as described in the section KMP_CPUINFO_FILE and /proc/cpuinfo.
Regardless of the method used in determining the machine topology, if there is only one thread context per core for every core on the machine, the thread context level will not appear in the topology map. If there is only one core per package for every package in the machine, the core level will not appear in the machine topology map. The topology map need not be a full tree, because different packages may contain a different number of cores, and different cores may support a different number of thread contexts.
The package level will always appear in the topology map, even if there only a single package in the machine.
KMP_CPUINFO_FILE and /proc/cpuinfo
One of the methods the OpenMP runtime library can use to detect the machine topology on Linux systems is to parse the contents of /proc/cpuinfo. If the contents of this file (or a device mapped into the Linux file system) are insufficient or erroneous, you can consider copying its contents to a writable temporary file <temp_file>, correct it or extend it with the necessary information, and set KMP_CPUINFO_FILE=<temp_file>.
If you do this, the OpenMP runtime library will read the <temp_file> location pointed to by KMP_CPUINFO_FILE instead of the information contained in /proc/cpuinfo or attempting to detect the machine topology by decoding the APIC IDs. That is, the information contained in the <temp_file> overrides these other methods. You can use the KMP_CPUINFO_FILE interface on Windows systems, where /proc/cpuinfo does not exist.
The content of /proc/cpuinfo or <temp_file> should contain a list of entries for each processing element on the machine. Each processor element contains a list of entries (descriptive name and value on each line). A blank line separates the entries for each processor element. Only the following fields are used to determine the machine topology from each entry, either in <temp_file> or /proc/cpuinfo:
Field |
Description |
---|---|
processor : |
Specifies the OS ID for the processing element. The OS ID must be unique. The processor and physical id fields are the only ones that are required to use the interface. |
physical id : |
Specifies the package ID, which is a physical chip ID. Each package may contain multiple cores. The package level always exists in the compiler's OpenMP runtime library model of the machine topology. |
core id : |
Specifies the core ID. If it does not exist, it defaults to 0. If every package on the machine contains only a single core, the core level will not exist in the machine topology map (even if some of the core ID fields are non-zero). |
apicid : |
Specifies the thread ID. If it does not exist, it defaults to 0. If every core on the machine contains only a single thread, the thread level will not exist in the machine topology map (even if some thread ID fields are non-zero). |
node_n id : |
This is a extension to the normal contents of /proc/cpuinfo that can be used to specify the nodes at different levels of the memory interconnect on Non-Uniform Memory Access (NUMA) systems. Arbitrarily many levels n are supported. The node_0 level is closest to the package level; multiple packages comprise a node at level 0. Multiple nodes at level 0 comprise a node at level 1, and so on. |
Each entry must be spelled exactly as shown, in lowercase, followed by optional whitespace, a colon (:), more optional whitespace, then the integer ID. Fields other than those listed are simply ignored.
It is common for the thread id field to be missing from /proc/cpuinfo on many Linux variants, and for a field labeled siblings to specify the number of threads per node or number of nodes per package. However, the Intel OpenMP runtime library ignores fields labeled siblings so it can distinguish between the thread id and siblings fields. When this situation arises, the warning message Physical node/pkg/core/thread ids not unique appears (unless the type specified is nowarnings).
Windows Processor Groups
On a 64-bit Windows operating system, it is possible for multiple processor groups to accommodate more than 64 processors. Each group is limited in size, up to a maximum value of sixty-four (64) processors.
If multiple processor groups are detected, the default is to model the machine as a 2-level tree, where level 0 are for the processors in a group, and level 1 are for the different groups. Threads are assigned to a group until there are as many OpenMP threads bound to the groups as there are processors in the group. Subsequent threads are assigned to the next group, and so on.
By default, threads are allowed to float among all processors in a group, that is to say, granularity equals the group [granularity=group]. You can override this binding and explicitly use another affinity type like compact, scatter, and so on. If you do so, the granularity must be sufficiently fine to prevent a thread from being bound to multiple processors in different groups.
Use a Specific Machine Topology Modeling Method (KMP_TOPOLOGY_METHOD)
You can set the KMP_TOPOLOGY_METHOD environment variable to force OpenMP to use a particular machine topology modeling method.
Value |
Description |
---|---|
cpuid_leaf11 |
Decodes the APIC identifiers as specified by leaf 11 of the cpuid instruction. |
cpuid_leaf4 |
Decodes the APIC identifiers as specified in leaf 4 of the cpuid instruction. |
cpuinfo |
If KMP_CPUINFO_FILE is not specified, forces OpenMP to parse /proc/cpuinfo to determine the topology (Linux only). If KMP_CPUINFO_FILE is specified as described above, uses it (Windows or Linux). |
group |
Models the machine as a 2-level map, with level 0 specifying the different processors in a group, and level 1 specifying the different groups (Windows 64-bit only) . |
flat |
Models the machine as a flat (linear) list of processors. |
hwloc |
Models the machine as the Portable Hardware Locality* (hwloc) library does. This model is the most detailed and includes, but is not limited to: numa nodes, packages, cores, hardware threads, caches, and Windows processor groups. |
Explicitly Specify OS Processor IDs (GOMP_CPU_AFFINITY)
You must set the GOMP_CPU_AFFINITY environment variable before the first parallel region, or certain API calls including omp_get_max_threads(), omp_get_num_procs() and any affinity API calls, as described in Low Level Affinity API, below.
Instead of allowing the library to detect the hardware topology and automatically assign OpenMP threads to processing elements, the user may explicitly specify the assignment by using a list of operating system (OS) processor (proc) IDs. However, this requires knowledge of which processing elements the OS proc IDs represent.
On Linux systems, when using the Intel OpenMP compatibility libraries enabled by the compiler option -qopenmp-lib=compat, you can use the GOMP_AFFINITY environment variable to specify a list of OS processor IDs. Its syntax is identical to that accepted by libgomp (assume that <proc_list> produces the entire GOMP_AFFINITY environment string):
Value |
Description |
---|---|
<proc_list> := |
<entry> | <elem> , <list> | <elem> <whitespace> <list> |
<elem> := |
<proc_spec> | <range> |
<proc_spec> := |
<proc_id> |
<range> := |
<proc_id> - <proc_id> | <proc_id> - <proc_id> : <int> |
<proc_id> := |
<positive_int> |
OS processors specified in this list are then assigned to OpenMP threads, in order of OpenMP Global Thread IDs. If more OpenMP threads are created than there are elements in the list, then the assignment occurs modulo the size of the list. That is, OpenMP Global Thread ID n is bound to list element n mod <list_size>.
Consider the machine previously mentioned: a dual core, dual-package machine without Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, where the OS proc IDs are assigned in the same manner as the example in a previous figure. Suppose that the application creates six OpenMP threads instead of 4 (the default), oversubscribing the machine. If GOMP_AFFINITY=3,0-2, then OpenMP threads are bound as shown in the figure below, just as should happen when compiling with gcc and linking with libgomp:
The same syntax can be used to specify the OS proc ID list in the proclist=[<proc_list>] modifier in the KMP_AFFINITY environment variable string. There is a slight difference: in order to have strictly the same semantics as in the gcc OpenMP runtime library libgomp: the GOMP_AFFINITY environment variable implies granularity=fine. If you specify the OS proc list in the KMP_AFFINITY environment variable without a granularity= specifier, then the default granularity is not changed. That is, OpenMP threads are allowed to float between the different thread contexts on a single core. Thus GOMP_AFFINITY=<proc_list> is an alias for KMP_AFFINITY="granularity=fine,proclist=[<proc_list>],explicit".
In the KMP_AFFINITY environment variable string, the syntax is extended to handle operating system processor ID sets. The user may specify a set of operating system processor IDs among which an OpenMP thread may execute ("float") enclosed in brackets:
Value |
Description |
---|---|
<proc_list> := |
<proc_id> | { <float_list> } |
<float_list> := |
<proc_id> | <proc_id> , <float_list> |
This allows functionality similarity to the granularity= specifier, but it is more flexible. The OS processors on which an OpenMP thread executes may exclude other OS processors nearby in the machine topology, but include other distant OS processors. Building upon the previous example, we may allow OpenMP threads 2 and 3 to "float" between OS processor 1 and OS processor 2 by using KMP_AFFINITY="granularity=verbose,fine,proclist=[3,0,{1,2},{1,2}],explicit", as shown in the figure below:
If verbose were also specified, the output when the application is executed would include:
KMP_AFFINITY: Initial OS proc set respected: 0,1,2,3 KMP_AFFINITY: decoding x2APIC ids. KMP_AFFINITY: 4 available OS procs KMP_AFFINITY: Uniform topology KMP_AFFINITY: 2 sockets x 2 cores/socket x 1 threads/core (4 total cores) KMP_AFFINITY: OS proc to physical thread map: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0 KMP_AFFINITY: OS proc 1 maps to socket 3 core 0 thread 0 KMP_AFFINITY: OS proc 3 maps to socket 3 core 1 thread 0 KMP_AFFINITY: pid 41464 tid 41464 thread 0 bound to OS proc set 3 KMP_AFFINITY: pid 41464 tid 41465 thread 1 bound to OS proc set 0 KMP_AFFINITY: pid 41464 tid 41466 thread 2 bound to OS proc set 1,2 KMP_AFFINITY: pid 41464 tid 41467 thread 3 bound to OS proc set 1,2 KMP_AFFINITY: pid 41464 tid 41468 thread 4 bound to OS proc set 3 KMP_AFFINITY: pid 41464 tid 41469 thread 5 bound to OS proc set 0
Low Level Affinity API
Instead of relying on the user to specify the OpenMP thread to OS proc binding by setting an environment variable before program execution starts (or by using the kmp_settings interface before the first parallel region is reached), each OpenMP thread can determine the desired set of OS procs on which it is to execute and bind to them with the kmp_set_affinity API call.
When you use this affinity interface you take complete control of the hardware resources on which your threads run. To do that sensibly you need to understand in detail how the logical CPUs, the enumeration of hardware threads controlled by the OS, map to the physical hardware of the specific machine on which you are running. That mapping can be, and likely is, different on different machines, so you risk binding machine-specific information into your code, which can result in explicitly forcing bad affinities when your code runs on a different machine. And if you are concerned with optimization at this level of detail, your code is probably valuable, and therefore will probably move to another machine.
This interface may also allow you to ignore the resource limitations that were set by the program startup mechanism, such as Message Passing Interface (MPI), specifically to prevent multiple OpenMP processes on the same node from using the same hardware threads. Again, this can result in explicitly forcing affinities that cause bad performance, and the OpenMP runtime will neither prevent this from happening, nor warn you when it does. These are expert interfaces and you must use them with caution.
It is recommended, therefore, to use the higher level affinity settings if you possibly can, because they are more portable and do not require this low level knowledge.
The Fortran API interfaces follow, where the type name kmp_affinity_mask_kind is defined in omp_lib.h or omp_lib.mod:
Syntax |
Description |
---|---|
integer function kmp_set_affinity(mask) |
Sets the affinity mask for the current OpenMP thread to mask, where mask is a set of OS proc IDs that has been created using the API calls listed below, and the thread will only execute on OS procs in the set. Returns either a zero (0) upon success or a nonzero error code. |
integer function kmp_get_affinity(mask) |
Retrieves the affinity mask for the current OpenMP thread, and stores it in mask, which must have previously been initialized with a call to kmp_create_affinity_mask(). Returns either a zero (0) upon success or a nonzero error code. |
integer function kmp_get_affinity_max_proc() |
Returns the maximum OS proc ID that is on the machine, plus 1. All OS proc IDs are guaranteed to be between 0 (inclusive) and kmp_get_affinity_max_proc() (exclusive). |
subroutine kmp_create_affinity_mask(mask) |
Allocates a new OpenMP thread affinity mask, and initializes mask to the empty set of OS procs. The implementation is free to use an object of kmp_affinity_mask_t either as the set itself, a pointer to the actual set, or an index into a table describing the set. Do not make any assumption as to what the actual representation is. |
subroutine kmp_destroy_affinity_mask(mask) |
Deallocates the OpenMP thread affinity mask. For each call to kmp_create_affinity_mask(), there should be a corresponding call to kmp_destroy_affinity_mask(). |
integer function kmp_set_affinity_mask_proc(proc, mask) |
Adds the OS proc ID proc to the set mask, if it is not already. Returns either a zero (0) upon success or a nonzero error code. |
integer function kmp_unset_affinity_mask_proc(proc, mask) |
If the OS proc ID proc is in the set mask, it removes it. Returns either a zero (0) upon success or a nonzero error code. |
integer function kmp_get_affinity_mask_proc(proc, mask)integer proc |
Returns 1 if the OS proc ID proc is in the set mask; if not, it returns 0. |
Once an OpenMP thread has set its own affinity mask via a successful call to kmp_set_affinity(), then that thread remains bound to the corresponding OS proc set until at least the end of the parallel region, unless reset via a subsequent call to kmp_set_affinity().
Between parallel regions, the affinity mask (and the corresponding OpenMP thread to OS proc bindings) can be considered thread private data objects, and have the same persistence as described in the OpenMP Application Program Interface. For more information, see the OpenMP API specification (http://www.openmp.org), some relevant parts of which are provided below:
In order for the affinity mask and thread binding to persist between two consecutive active parallel regions, all three of the following conditions must hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions."
Therefore, by creating a parallel region at the start of the program whose sole purpose is to set the affinity mask for each thread, you can mimic the behavior of the KMP_AFFINITY environment variable with low-level affinity API calls, if program execution obeys the three aforementioned rules from the OpenMP specification.
The following example shows how these low-level interfaces can be used. This code binds the executing thread to the specified logical CPU:
! Force the executing thread to execute on logical CPU i ! Returns .TRUE. on success, .FALSE. on failure function forceAffinity (i) use omp_lib logical forceAffinity integer, intent(in) :: i integer(kmp_affinity_mask_kind) :: mask call kmp_create_affinity_mask(mask) forceAffinity = (kmp_set_affinity_mask_proc(i, mask) == 0) if (.not. forceAffinity) return forceAffinity = (kmp_set_affinity_mask(mask) == 0) return end function forceAffinity
This program fragment was written with knowledge about the mapping of the OS proc IDs to the physical processing elements of the target machine. On another machine, or on the same machine with a different OS installed, the program would still run, but the OpenMP thread to physical processing element bindings could differ and you might be explicitly force a bad distribution.