Intel® MPI Library Developer Reference for Linux* OS

ID 768732
Date 3/31/2023
Public

GPU Buffers Support

Short Description

This feature enables handling of device buffers in MPI functions such as MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce, and so on by using the Level Zero* library specified in the I_MPI_OFFLOAD_LEVEL_ZERO_LIBRARY variable.

To pass a pointer to an offloaded memory region to MPI, you may need to use specific compiler directives or obtain it from the corresponding acceleration runtime API. For example, use_device_ptr and use_device_addr are useful clauses for obtaining device pointers in the OpenMP environment, as shown in the following code sample:

/* Copy data from host to device */
#pragma omp target data map(to: rank, values[0:num_values]) use_device_ptr(values)
{
    /* Compute something on device */
    #pragma omp target parallel for is_device_ptr(values)
    for (unsigned i = 0; i < num_values; ++i) {
        values[i] *= (rank + 1);
    }
    /* Send device buffer to another rank */
    MPI_Send(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);
}

To achieve the best performance, use the same GPU buffer in MPI communications where possible. This lets Intel® MPI Library cache the structures it needs to communicate with the device and reuse them in subsequent iterations.

Set I_MPI_OFFLOAD=0 to disable this feature if you do not provide device buffers to MPI primitives, since handling of device buffers can affect performance.

I_MPI_OFFLOAD_MEMCPY

Set this environment variable to select the GPU memcpy kind.

Syntax

I_MPI_OFFLOAD_MEMCPY=<value>

Arguments

Value Description
cached Cache created objects for communication with GPU so that they can be reused if the same device buffer is later provided to the MPI function. Default value.
blocked Copy device buffer to host and wait for the copy to be completed inside MPI function.
nonblocked Copy device buffer to host and do not wait for the copy to be completed inside MPI function. Wait for the operation completion in MPI_Wait.

Description

Set this environment variable to select the GPU memcpy kind. The best-performing option is chosen by default. Nonblocked memcpy can be used with MPI nonblocking point-to-point operations to overlap communication with computation. Blocked memcpy can be used if the other kinds are not stable.

I_MPI_OFFLOAD_PIPELINE

Set this environment variable to enable the pipeline algorithm.

Syntax

I_MPI_OFFLOAD_PIPELINE=<value>

Arguments

Value Description
0 Disable pipeline algorithm.
1 Enable pipeline algorithm. Default value.

Description

Set this environment variable to enable the pipeline algorithm, which can improve performance for large message sizes. The main idea of the algorithm is to split the user buffer into several segments, copy each segment to the host, and send it to another rank.
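The segmentation idea described above can be sketched in plain C. This is only an illustration, not the library's implementation: memcpy stands in for the device-to-host copy, the hypothetical segment_sink_fn callback stands in for the send, and the names pipeline_transfer, append_ctx, and append_segment are invented for this example. The sketch processes segments one after another; a real pipeline would presumably overlap the copy of one segment with the transfer of the previous one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback standing in for "send this host segment to another rank". */
typedef void (*segment_sink_fn)(const char *segment, size_t length, void *ctx);

/*
 * Split `src` (standing in for a device buffer) into segments of at most
 * `segment_size` bytes. Each segment is copied into `staging` (standing in
 * for a host staging buffer of at least `segment_size` bytes) and then
 * handed to `sink`. Returns the number of segments processed.
 */
size_t pipeline_transfer(const char *src, size_t total_size,
                         char *staging, size_t segment_size,
                         segment_sink_fn sink, void *ctx)
{
    assert(segment_size > 0);
    size_t segments = 0;
    for (size_t offset = 0; offset < total_size; offset += segment_size) {
        size_t remaining = total_size - offset;
        size_t length = remaining < segment_size ? remaining : segment_size;
        memcpy(staging, src + offset, length); /* stand-in for device-to-host copy */
        sink(staging, length, ctx);            /* stand-in for the send */
        ++segments;
    }
    return segments;
}

/* Example sink: appends each received segment to a destination buffer. */
struct append_ctx {
    char *dst;
    size_t written;
};

void append_segment(const char *segment, size_t length, void *ctx)
{
    struct append_ctx *out = (struct append_ctx *)ctx;
    memcpy(out->dst + out->written, segment, length);
    out->written += length;
}
```

With a 10-byte source and a 4-byte staging buffer, the loop produces segments of 4, 4, and 2 bytes, and the sink reassembles the original contents.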

I_MPI_OFFLOAD_PIPELINE_THRESHOLD

Set this environment variable to control the threshold for the pipeline algorithm.

Syntax

I_MPI_OFFLOAD_PIPELINE_THRESHOLD=<value>

Arguments

Value Description
<value> Threshold in bytes. The default value is 65536.
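As a rough illustration of how a byte threshold like this might be interpreted, the following C sketch parses a threshold string and applies it to a message size. It is not the library's parsing or selection logic. The function names are hypothetical, and the comparison direction (messages of at least the threshold size use the pipeline) is an assumption, since this section does not state it. Only the default of 65536 bytes comes from the table above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_PIPELINE_THRESHOLD 65536L /* documented default, in bytes */

/* Parse a threshold string; fall back to the default when unset or malformed. */
long parse_pipeline_threshold(const char *text)
{
    if (text == NULL || *text == '\0')
        return DEFAULT_PIPELINE_THRESHOLD;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno == ERANGE || *end != '\0' || value < 0)
        return DEFAULT_PIPELINE_THRESHOLD;
    return value;
}

/*
 * ASSUMPTION: the pipeline applies to messages of at least `threshold` bytes.
 * The reference text does not spell out the comparison.
 */
bool would_use_pipeline(bool pipeline_enabled, long threshold, size_t message_size)
{
    assert(threshold >= 0);
    return pipeline_enabled && message_size >= (size_t)threshold;
}
```

A caller could pass getenv("I_MPI_OFFLOAD_PIPELINE_THRESHOLD") to parse_pipeline_threshold to mirror the setting in application-side diagnostics.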

I_MPI_OFFLOAD_RDMA

Set this environment variable to enable GPU RDMA.

Syntax

I_MPI_OFFLOAD_RDMA=<value>

Arguments

Value Description
0 Disable RDMA. Default value.
1 Enable RDMA.

Description

Set this environment variable to enable GPU direct transfer using GPU RDMA. When this capability is supported by the network, enabling this environment variable enables direct data transfer between two GPUs.

I_MPI_OFFLOAD_FAST_MEMCPY

Set this environment variable to enable/disable fast memcpy for GPU buffers.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY=<value>

Arguments

Value Description
0 Disable fast memcpy.
1 Enable fast memcpy. Default value.

Description

Set this environment variable to enable/disable fast memcpy to optimize performance for small message sizes.

NOTE:
GPU fast memcpy does not support implicit scaling. Implicit scaling can be disabled by setting these environment variables to the corresponding values:

NEOReadDebugKeys=1

EnableImplicitScaling=0

I_MPI_OFFLOAD_IPC

Set this environment variable to enable/disable GPU IPC.

Syntax

I_MPI_OFFLOAD_IPC=<value>

Arguments

Value Description
0 Disable IPC path.
1 Enable IPC path. Default value.

Description

Set this environment variable to enable/disable GPU IPC. When this capability is supported by the system and devices, enabling this environment variable enables direct data transfer between two GPUs on the same node.

I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE

NOTE:
The I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE variable is under technology preview.

Set this environment variable to control the threshold up to which copy-in/copy-out is used for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE=<value>

Arguments

Value Description
<value> Threshold in bytes. The default value is -1 (all sizes).

Description

Set this environment variable to control the maximum message size for which copy-in/copy-out is used for collectives on GPU buffers. When CBWR is disabled using I_MPI_OFFLOAD_CBWR=0, for message sizes <= I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE, GPU buffers are copied to the host before the collective executes and back to the device after the collective completes.

When CBWR mode is enabled (default), this environment variable has no effect.
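The rule described above can be written as a small predicate. This is a sketch of the documented conditions only, not the library's selection code. The function name and the COPY_COLL_ALL_SIZES macro are hypothetical, and other internal conditions may also apply.

```c
#include <assert.h>
#include <stdbool.h>

#define COPY_COLL_ALL_SIZES (-1LL) /* documented default: applies to all sizes */

/*
 * Returns true when, per the description above, a collective on GPU buffers
 * would use copy-in/copy-out:
 *   - CBWR must be disabled (I_MPI_OFFLOAD_CBWR=0), otherwise the variable
 *     has no effect;
 *   - the message size must be <= I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE,
 *     where -1 means "all sizes".
 */
bool would_use_copy_in_out(bool cbwr_enabled, long long max_size, long long message_size)
{
    assert(max_size >= COPY_COLL_ALL_SIZES);
    assert(message_size >= 0);

    if (cbwr_enabled)
        return false;
    if (max_size == COPY_COLL_ALL_SIZES)
        return true;
    return message_size <= max_size;
}
```

For example, with CBWR disabled and a maximum size of 8192, an 8192-byte message qualifies while an 8193-byte message does not.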

I_MPI_OFFLOAD_FAST_MEMCPY_COLL

NOTE:
The I_MPI_OFFLOAD_FAST_MEMCPY_COLL variable is under technology preview.

Set this environment variable to enable the fast-copy for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY_COLL=<value>

Arguments

Value Description
0 Disabled. Default value.
1 Enabled. Collectives with GPU buffers use the fast-copy if applicable.

Description

Set this environment variable to enable the fast-copy for collectives on GPU buffers.

I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE

NOTE:
The I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE variable is under technology preview.

Set this environment variable to control the threshold up to which fast-copy is used for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE=<value>

Arguments

Value Description
<value> Threshold in bytes. The default value is 1024.

Description

Set this environment variable to control the maximum message size for which fast-copy is used for collectives on GPU buffers.

When you enable the fast-copy using I_MPI_OFFLOAD_FAST_MEMCPY_COLL=1, it is used for message sizes <= I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE.
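The interaction between the two fast-copy variables can be summarized as a predicate. This is a minimal sketch of the documented conditions, with a hypothetical function name. The library may apply further applicability checks that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_FAST_MEMCPY_COLL_MAX_SIZE 1024LL /* documented default, in bytes */

/*
 * Returns true when, per the description above, the fast-copy would be
 * considered for a collective on GPU buffers:
 *   - I_MPI_OFFLOAD_FAST_MEMCPY_COLL must be 1 (it is 0 by default);
 *   - the message size must be <= I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE.
 */
bool would_use_fast_copy(bool fast_memcpy_coll_enabled, long long max_size, long long message_size)
{
    assert(max_size >= 0);
    assert(message_size >= 0);
    return fast_memcpy_coll_enabled && message_size <= max_size;
}
```

With the default limit of 1024 bytes, a 1024-byte message qualifies when the feature is enabled, and a 1025-byte message does not.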