Intel® MPI Library Developer Reference for Linux* OS

ID 768732
Date 6/24/2024
Public


GPU Pinning

Use this feature to distribute Intel GPU devices between MPI ranks.

To enable GPU pinning, set I_MPI_OFFLOAD_PIN=1. Alternatively, set I_MPI_OFFLOAD=1 to enable all GPU features and optimizations, including GPU pinning.
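As an illustration of the setting above, the following minimal sketch prepares an mpirun launch with GPU pinning enabled. The helper name pinning_launch and the binary ./my_app are hypothetical; I_MPI_DEBUG is described later in this section. The sketch only builds the command and environment and does not start any processes.

```python
import os

def pinning_launch(ranks, app, debug_level=3):
    """Return (command, environment) for an mpirun launch with GPU pinning on."""
    env = dict(os.environ)                   # copy, so the caller's environment is untouched
    env["I_MPI_OFFLOAD_PIN"] = "1"           # enable GPU pinning only
    env["I_MPI_DEBUG"] = str(debug_level)    # level 3 or higher prints the pinning table
    cmd = ["mpirun", "-n", str(ranks), app]
    return cmd, env

cmd, env = pinning_launch(4, "./my_app")     # "./my_app" is a placeholder binary
print(" ".join(cmd))
# To launch for real: subprocess.run(cmd, env=env, check=True)
```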

This feature requires the Level-Zero* library to be installed on the nodes.

NOTE:
This feature is not yet supported by the CUDA backend.

Default Settings

I_MPI_OFFLOAD_CELL=tile

I_MPI_OFFLOAD_DOMAIN_SIZE=-1

I_MPI_OFFLOAD_DEVICES=all

By default, all available resources are distributed between MPI ranks as equally as possible given the position of the ranks; that is, the distribution of resources takes into account on which NUMA node the rank and the resource are located. Ideally, the rank will have resources only on the same NUMA node on which the rank is located.

When GPU pinning is applied, only the pinned devices are available to the rank. All other devices are hidden from the rank and cannot be used or discovered. For more details, see the Level-Zero* Core Programming Guide.

NOTE:
If you set the ZE_AFFINITY_MASK variable, GPU pinning is automatically disabled.
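The enabling rules stated so far can be summarized in a small check. This is only a sketch of the documented rules, not Intel MPI code: the function name is hypothetical, and two points are assumptions that this section does not state, namely that an explicit I_MPI_OFFLOAD_PIN value takes precedence over I_MPI_OFFLOAD and that an empty ZE_AFFINITY_MASK counts as unset.

```python
def gpu_pinning_requested(env):
    """Return True if the given environment (a dict) asks for GPU pinning."""
    if env.get("ZE_AFFINITY_MASK"):
        return False                              # ZE_AFFINITY_MASK disables GPU pinning
    if "I_MPI_OFFLOAD_PIN" in env:
        return env["I_MPI_OFFLOAD_PIN"] == "1"    # assumption: explicit setting wins
    return env.get("I_MPI_OFFLOAD") == "1"        # I_MPI_OFFLOAD=1 enables all GPU features

print(gpu_pinning_requested({"I_MPI_OFFLOAD_PIN": "1"}))
print(gpu_pinning_requested({"I_MPI_OFFLOAD": "1", "ZE_AFFINITY_MASK": "0"}))
```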

Starting with I_MPI_DEBUG=3, Intel(R) MPI prints the GPU topology with the number of detected devices (GPUs) and sub-devices (stacks or tiles).

NOTE:
The node topology reported by Intel(R) MPI and by other tools, such as sycl-ls or clinfo, may differ in the FLAT hierarchy mode (see ZE_FLAT_DEVICE_HIERARCHY). In this mode, the Level-Zero* driver exposes each sub-device (stack or tile) as a separate device. Intel(R) MPI, however, treats all sub-devices as if the COMPOSITE mode were in use.

Examples

The examples below represent a machine configuration with two NUMA nodes and two GPUs, each with two stacks (tiles).

Figure 1. Four MPI Ranks

Debug output I_MPI_DEBUG=3:

[0] MPI startup(): ===== GPU pinning on host1 =====

[0] MPI startup(): Rank Pin stack (tile)

[0] MPI startup(): 0       {0}

[0] MPI startup(): 1       {1}

[0] MPI startup(): 2       {2}

[0] MPI startup(): 3       {3}

Figure 2. Three MPI Ranks

Debug output I_MPI_DEBUG=3:

[0] MPI startup(): ===== GPU pinning on host1 =====

[0] MPI startup(): Rank Pin stack (tile)

[0] MPI startup(): 0       {0}

[0] MPI startup(): 1       {1}

[0] MPI startup(): 2       {2,3}
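The two debug outputs above can be reproduced with a toy model of NUMA-aware distribution. This is not Intel MPI's actual algorithm. It assumes that each GPU sits on its own NUMA node, that tiles are numbered 0-3 across the two GPUs, and that ranks are placed on NUMA nodes as shown in the code; none of these details is stated in this section.

```python
def distribute(tiles_by_numa, ranks_by_numa):
    """Split each NUMA node's tiles among the ranks located on that node."""
    pinning = {}
    for numa, ranks in ranks_by_numa.items():
        if not ranks:
            continue                              # no ranks on this node
        tiles = tiles_by_numa.get(numa, [])
        base, extra = divmod(len(tiles), len(ranks))
        start = 0
        for i, rank in enumerate(ranks):
            count = base + (1 if i < extra else 0)   # spread any remainder over the first ranks
            pinning[rank] = tiles[start:start + count]
            start += count
    return pinning

tiles = {0: [0, 1], 1: [2, 3]}                    # assumed: GPU 0 on NUMA 0, GPU 1 on NUMA 1
print(distribute(tiles, {0: [0, 1], 1: [2, 3]}))  # four ranks
print(distribute(tiles, {0: [0, 1], 1: [2]}))     # three ranks
```

Under these assumptions, the three-rank case gives rank 2 both tiles of the second GPU because it is the only rank on that NUMA node, which matches the {2,3} entry in the output above.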

I_MPI_OFFLOAD_PIN

Control whether GPU pinning is enabled.

Syntax

I_MPI_OFFLOAD_PIN=<value>

Arguments

Value Description
<0> Disabled.
<1>   Enabled.

Description

By default, GPU pinning is disabled. To enable, set I_MPI_OFFLOAD_PIN=1 or I_MPI_OFFLOAD=1.

I_MPI_OFFLOAD_TOPOLIB

Set the interface for GPU topology recognition.

Syntax

I_MPI_OFFLOAD_TOPOLIB=<arg>

Arguments

<arg> is a string parameter

Value Description
level_zero    Use Level-Zero library for GPU topology recognition.
none Disable GPU recognition and GPU pinning.

Description

Set this environment variable to define the interface for GPU topology recognition.

I_MPI_OFFLOAD_CELL

Set this variable to define the base unit: tile (stack, sub-device) or device (GPU).

Syntax

I_MPI_OFFLOAD_CELL=<cell>

Arguments

Value Description
<cell> Specify the base unit.
tile One tile (stack, sub-device). This is the default value.
device Whole device (GPU) with all sub-devices.

Description

Set this variable to define the base unit. This variable may affect other GPU pinning variables.
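To make the notion of a base unit concrete, the sketch below lists which tiles each base unit covers for the two values of I_MPI_OFFLOAD_CELL. The function name and the tile numbering (tiles 0-1 on the first GPU, tiles 2-3 on the second) are assumptions made for illustration.

```python
def base_units(gpus, cell="tile"):
    """gpus: one list of tile numbers per GPU. Return the tiles covered by each base unit."""
    if cell == "tile":
        return [[t] for gpu in gpus for t in gpu]   # every tile is its own base unit
    if cell == "device":
        return [list(gpu) for gpu in gpus]          # a base unit is a whole GPU
    raise ValueError(f"unknown cell value: {cell!r}")

gpus = [[0, 1], [2, 3]]             # two GPUs with two tiles each
print(base_units(gpus, "tile"))     # four base units
print(base_units(gpus, "device"))   # two base units
```

Variables that count or index base units, such as I_MPI_OFFLOAD_DOMAIN_SIZE, therefore refer to tiles in the first case and to whole GPUs in the second.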

Example

Figure 3. Four MPI ranks, I_MPI_OFFLOAD_CELL=device

I_MPI_OFFLOAD_DOMAIN_SIZE

Control the number of base units per MPI rank.

Syntax

I_MPI_OFFLOAD_DOMAIN_SIZE=<value>

Arguments

<value> is an integer number. 

Value Description
-1 Auto. Each MPI rank may have a different domain size to use all available resources. This is the default value.
> 0   Custom domain size.

Description

Set this variable to define how many base units are pinned to each MPI rank. The I_MPI_OFFLOAD_CELL variable defines the base unit: stack (tile) or device.

Examples

Figure 4. Three MPI ranks, I_MPI_OFFLOAD_DOMAIN_SIZE=1

I_MPI_OFFLOAD_DEVICES

Define a list of available devices.

Syntax

I_MPI_OFFLOAD_DEVICES=<devicelist>

Arguments

<devicelist> is a comma-separated list of available devices.

Value Description
all All devices are available. This is the default value.
<l> Device with logical number <l>.
<l>-<m> Range of devices with logical numbers from <l> to <m>.
<k>,<l>-<m> Device <k> and devices from <l> to <m>.

Description

Set this variable to define the available devices. This variable also gives you the ability to exclude devices.
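The list syntax in the table above can be expanded as follows. This is a minimal sketch of the syntax only, with a hypothetical function name; it does not reflect how Intel MPI parses the value internally and performs no validation against the devices that are actually present.

```python
def parse_device_list(value):
    """Expand a list such as '0,2-3' into [0, 2, 3]; return None for 'all'."""
    if value.strip() == "all":
        return None                                  # no restriction
    devices = []
    for item in value.split(","):
        item = item.strip()
        if "-" in item:
            first, last = (int(x) for x in item.split("-", 1))
            devices.extend(range(first, last + 1))   # ranges are inclusive
        else:
            devices.append(int(item))
    return devices

print(parse_device_list("0"))
print(parse_device_list("0,2-3"))
```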

Example

Figure 5. Four MPI ranks, I_MPI_OFFLOAD_DEVICES=0

I_MPI_OFFLOAD_CELL_LIST

Define a list of base units to pin for each MPI rank.

Syntax

I_MPI_OFFLOAD_DEVICE_LIST=<base_units_list>

Arguments

<base_units_list> is a comma-separated list of base units. The process with the i-th rank is pinned to the i-th base unit in the list.

Value Description
<l> Base unit with logical number <l>.
<l>-<m> Range of base units with logical numbers from <l> to <m>.
<k>,<l>-<m> Base unit <k> and base units from <l> to <m>.

Description

Set this variable to define the list of base units to pin for each MPI rank. The process with the i-th rank is pinned to the i-th base unit in the list.

  • The I_MPI_OFFLOAD_CELL variable defines the base unit: stack (tile) or device.
  • The I_MPI_OFFLOAD_DEVICE_LIST variable has lower priority than the I_MPI_OFFLOAD_DOMAIN variable.

Example

Figure 6. Four MPI ranks, I_MPI_OFFLOAD_DEVICE_LIST=3,2,0,1
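The rule that the i-th rank is pinned to the i-th base unit in the list can be sketched as below, using the value 3,2,0,1 from Figure 6. The function name is hypothetical and the code only illustrates the mapping rule.

```python
def pin_by_list(value):
    """Map rank i to the i-th base unit of a list such as '3,2,0,1' or '0-3'."""
    units = []
    for item in value.split(","):
        if "-" in item:
            first, last = (int(x) for x in item.split("-", 1))
            units.extend(range(first, last + 1))     # inclusive range
        else:
            units.append(int(item))
    return dict(enumerate(units))                    # rank -> base unit

print(pin_by_list("3,2,0,1"))
```

For the list 3,2,0,1, rank 0 is pinned to base unit 3, rank 1 to base unit 2, rank 2 to base unit 0, and rank 3 to base unit 1.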

I_MPI_OFFLOAD_DOMAIN

Define the domain of each MPI rank through a comma-separated list of hexadecimal bit masks.

Syntax

I_MPI_OFFLOAD_DOMAIN=<masklist>

Arguments

<masklist> is a comma-separated list of hexadecimal numbers.

Value Description
[m1,...,mn] For <masklist>, each mi is a hexadecimal bit mask defining an individual domain.

The following rule is used: the j-th base unit is included in the i-th domain if the j-th bit of mi is set to 1.

Description

Set this variable to define the list of hexadecimal bit masks. For the i-th bit mask, if the j-th bit is set to 1, then the j-th base unit is pinned to the i-th MPI rank.

The I_MPI_OFFLOAD_CELL variable defines the base unit: stack (tile) or device.

The I_MPI_OFFLOAD_DOMAIN variable has higher priority than the I_MPI_OFFLOAD_DEVICE_LIST variable.

Example

Figure 7. Four MPI ranks, I_MPI_OFFLOAD_DOMAIN=[B,2,5,C]. Parsed bit masks, written with the least significant bit (base unit 0) first: [1101,0100,1010,0011]
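The mask rule can be checked with a short sketch that decodes the value from Figure 7. The function names are hypothetical. Bit j of a mask corresponds to base unit j, so the bit strings in the figure caption read with bit 0 on the left.

```python
def parse_domain(value):
    """Return, for each rank, the base units selected by its hexadecimal mask."""
    masks = [int(m, 16) for m in value.strip("[] ").split(",")]
    return [[j for j in range(m.bit_length()) if (m >> j) & 1] for m in masks]

def lsb_first(mask_hex, width=4):
    """Render a mask with bit 0 on the left, as in the figure caption."""
    return format(int(mask_hex, 16), f"0{width}b")[::-1]

print(parse_domain("[B,2,5,C]"))                     # base units per rank
print([lsb_first(m) for m in ["B", "2", "5", "C"]])  # bit strings, bit 0 first
```

For [B,2,5,C] this yields base units {0,1,3} for rank 0, {1} for rank 1, {0,2} for rank 2, and {2,3} for rank 3. Note that a base unit may appear in more than one domain.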

I_MPI_OFFLOAD_PRINT_TOPOLOGY

Print GPU pinning and GPU topology regardless of the I_MPI_DEBUG level.

Syntax

I_MPI_OFFLOAD_PRINT_TOPOLOGY=<value>

Arguments

Value Description
0 GPU pinning and GPU topology printing depend on the I_MPI_DEBUG level:

  • If I_MPI_DEBUG >= 3, print GPU information only from the first host.
  • If I_MPI_DEBUG >= 120, print GPU information from all hosts.

1 Print GPU pinning and GPU topology from all hosts.

Description

Set this environment variable to enable GPU pinning and GPU topology printing regardless of the I_MPI_DEBUG level.
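The interaction between I_MPI_OFFLOAD_PRINT_TOPOLOGY and I_MPI_DEBUG described in the table can be expressed as a small decision function. The function name and the returned labels are hypothetical. The "none" result for debug levels below 3 is inferred from the statement that printing starts with I_MPI_DEBUG=3.

```python
def topology_print_scope(print_topology=0, debug=0):
    """Return which hosts print GPU pinning and topology information."""
    if print_topology == 1 or debug >= 120:
        return "all hosts"
    if debug >= 3:
        return "first host"
    return "none"          # inferred: nothing is printed below debug level 3

print(topology_print_scope(print_topology=0, debug=3))
print(topology_print_scope(print_topology=1, debug=0))
```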