Intel® MPI Library Developer Reference for Linux* OS

ID 768732
Date 6/24/2024
Public
Document Table of Contents

GPU Buffers Support

This feature enables handling of device buffers in MPI functions such as MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce, and so on by using the Level Zero* library or the CUDA* Library.

To pass a pointer of an offloaded memory region to MPI, you may need to use specific compiler directives or get it from corresponding acceleration runtime API. For example, use_device_ptr and use_device_addr are useful keywords to obtain device pointers in the OpenMP environment, as shown in the following example.

OpenMP Example

/* Copy data from host to device */
#pragma omp target data map(to: rank, values[0:num_values]) use_device_ptr(values)
{
    /* Compute something on device */
    #pragma omp target parallel for is_device_ptr(values)
    for (unsigned i = 0; i < num_values; ++i) {
        values[i] *= (rank + 1);
    }
    /* Send device buffer to another rank */
    MPI_Send(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);
}

To achieve the best performance, use the same GPU buffer in MPI communications if possible. It helps Intel® MPI Library cache necessary structures to communicate with the device and reuse them in next iterations.

Set I_MPI_OFFLOAD=0 to disable this feature if you do not provide device buffers to MPI primitives, since handling of device buffers can affect performance.

Device buffer may be used as a parameter of MPI_Win_create function to create a RMA-window placed in device memory. Alternatively, device buffer and/or device-allocated window may be passed to the MPI_put or MPI_Get primitives.

OpenMP Example

/* Allocate device memory */
char win_buffer = (char) omp_target_alloc(win_size, device_id);
/* Create MPI One-sided Window uisng GPU memory */
mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

SYCL* Example

sycl::queue q;
/* Allocate device memory */
char *win_buffer = sycl::malloc_device<char>(win_size, q);
/* Create MPI One-sided Window uisng GPU memory */
mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win); 

NOTE:
Only contiguous MPI datatypes are supported.

I_MPI_OFFLOAD_MEMCPY_KIND

Set this environment variable to select the GPU memcpy kind.

Syntax

I_MPI_OFFLOAD_MEMCPY_KIND=<value>

Arguments

Value Description
cached Cache created objects for communication with GPU so that they can be reused if the same device buffer is later provided to the MPI function. Default value.
blocked Copy device buffer to host and wait for the copy to be completed inside MPI function.
nonblocked Copy device buffer to host and do not wait for the copy to be completed inside MPI function. Wait for the operation completion in MPI_Wait.

Description

Set this environment variable to select the GPU memcpy kind. The best performed option is chosen by default. Nonblocked memcpy can be used with MPI non-blocked point-to-point operations to achieve the overlap with compute part. Blocked memcpy can be used if other types are not stable.

I_MPI_OFFLOAD_PIPELINE

Set this environment variable to enable pipeline algorithm.

Syntax

I_MPI_OFFLOAD_PIPELINE=<value>

Arguments

Value Description
0 Disable pipeline algorithm.
1 Enable pipeline algorithm. Default value.

Description

Set this environment variable to enable pipeline algorithm, which can improve performance for large message sizes. The main idea of the algorithm is to split user buffer into several segment, and copy the segments to the host and send them to another rank.

I_MPI_OFFLOAD_PIPELINE_THRESHOLD

Set this environment variable to control the threshold for pipeline algorithm.

Syntax

I_MPI_OFFLOAD_PIPELINE_THRESHOLD=<value>

Arguments

Value Description
0 Threshold in bytes. The default value is 65536.

I_MPI_OFFLOAD_COLL_PIPELINE

Turn on GPU pipelining implementations for MPI collective functions.

Syntax

I_MPI_OFFLOAD_COLL_PIPELINE=<arg>

Arguments

Value Description
disable | no | off | 0 Disables GPU pipelining for MPI collective functions.
enable | yes | on | 1 Enables GPU pipelining for MPI collective functions. This is the default value.

I_MPI_OFFLOAD_COLL_PIPELINE_ALLREDUCE_SEGMENTS_SIZE

Explicitly set the GPU pipelining data size in bytes per stage for the Allreduce operation.

Syntax

I_MPI_OFFLOAD_COLL_PIPELINE_ALLREDUCE_SEGMENTS_SIZE=<arg>

Arguments

Value Description
Positive integer Number of bytes per iteration in the pipeline.
-1 Dynamically set the optimal size in runtime by Intel MPI. This is the default value.

I_MPI_OFFLOAD_RDMA

Set this environment variable to enable GPU RDMA.

Syntax

I_MPI_OFFLOAD_RDMA=<value>

Arguments

Value Description
0 Disable RDMA. This is the default value.
1 Enable RDMA.

Description

Set this environment variable to enable GPU-direct transfer using GPU RDMA. When this capability is supported by the network, enabling this environment variable enables direct data transfer between two GPUs.

NOTE:
Before setting this environment variable to use GPU-direct transfer, ensure that a GPU-aware provider with FI_HMEM capability is available in the FI_PROVIDER_PATH.

I_MPI_OFFLOAD_FAST_MEMCPY

Set this environment variable to enable/disable fast memcpy for GPU buffers.

NOTE:
This feature is not yet supported by CUDA backend.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY=<value>

Arguments

Value Description
0 Disable fast memcpy.
1 Enable fast memcpy. This is the default value.

Description

Set this environment variable to enable/disable fast memcpy to optimize performance for small message sizes.

I_MPI_OFFLOAD_IPC

Set this environment variable to enable/disable GPU IPC.

Syntax

I_MPI_OFFLOAD_IPC=<value>

Arguments

Value Description
0 Disable IPC path.
1 Enable IPC path. This is the default value.

Description

Set this environment variable to enable/disable GPU IPC. When this capability is supported by the system and devices, enabling this environment variable enables direct data transfer between two GPUs on the same node.

NOTE:
The default permissions for the cross-process access on Linux* OS Ubuntu* may prevent GPU IPC from working. To resolve this, enable the CAP_SYS_PTRACE capability on the system. Otherwise, you can disable GPU IPC by setting I_MPI_OFFLOAD_IPC=0, but it affects the intra-node performance.

I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE

NOTE:
The I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE variable is under technology preview.

Set this environment variable to control the threshold, over which copy-in/copy-out is used for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE=<value>

Arguments

Value Description
Threshold in bytes The default value is -1 (all sizes).

Description

Set this environment variable to control the message size, over which copy-in/copy-out is used for collectives on GPU buffers. When CBWR is disabled using I_MPI_OFFLOAD_CBWR=0, for message sizes <= I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE, GPU buffers are copied into the host before executing the collective and back to the device after the collective complete.

When CBWR mode is enabled (default), this environment variable has no effect.

I_MPI_OFFLOAD_FAST_MEMCPY_COLL

NOTE:
The I_MPI_OFFLOAD_FAST_MEMCPY_COLL variable is under technology preview.

Set this environment variable to control the threshold, over which copy-in/copy-out is used for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY_COLL=<value>

Arguments

Value Description
0 Disabled.
1 Enabled. Collectives with GPU buffers use the fast-copy if applicable. This is the default value.

Description

Set this environment variable to enable the fast-copy for collectives on GPU buffers.

I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE

NOTE:
The I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE variable is under technology preview.

Set this environment variable to control the threshold, over which fast-copy is used for collectives on GPU buffers.

Syntax

I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE=<value>

Arguments

Value Description
Threshold in bytes The default value is 512.

Description

Set this environment variable to control the message size, over which fast-copy is used for collectives on GPU buffers.

When you enable it using I_MPI_OFFLOAD_FAST_MEMCPY_COLL=1, the fast-copy is used for message sizes <= I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE.

I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED

NOTE:
The I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED variable is under technology preview.

Set this environment variable to enable device-initiated MPI one-sided communications.

Syntax

I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=<value>

Arguments

Value Description
0 Disabled. Default value.
1 Enabled.

Description

Set this environment variable to enable device-initiated MPI one-sided communications. This feature allows direct calls to MPI primitives listed below from OpenMP (both C and Fortran are supported) offload section or SYCL* kernel.

Supported primitives:

  • MPI_Put
  • MPI_Get
  • MPI_Win_lock
  • MPI_Win_lock_all
  • MPI_Win_unlock
  • MPI_Win_unlock_all
  • MPI_Win_flush/MPI_Win_flush_all
  • MPI_Win_fence
  • MPI_Win_get_group
  • MPI_Group_rank
  • MPI_Group_size
  • MPI_Group_free
NOTE:
Only contiguous MPI datatypes are supported.

Host-Initiated Communications Example

sycl::queue q;
/* Allocate device memory */
char *win_buffer = sycl::malloc_device<char>(win_size, q);
char *local_buffer = sycl::malloc_device<char>(win_size, q);

/* Create MPI One-sided Window uisng GPU memory */
mpi_errno = MPI_Win_create(win_buffer, win_size, 1,
                                         MPI_INFO_NULL, MPI_COMM_WORLD, &win);

for (int iteration =0; iteration < num_iterations; ++iteration ) {
   q.submit([&](sycl::handler &h) {
      h.single_task([=]() {
          /* Perform compute using local_buffer */
            ...
      });
   }).wait();
   /* Lock target window copy to place data */
   mpi_errno = MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0,  win);

   /* Perform one-sided communication using device-allocated window. */
   mpi_errno = MPI_Put(local_buffer, win_count, MPI_CHAR, target_rank, 0, win_size, MPI_CHAR, win);

   /* Unlock target window */
   mpi_errno = MPI_Win_unlock(target_rank, win);
 }

Device-Initiated Communications Example

sycl::queue q;
/* Allocate device memory */
char *win_buffer = sycl::malloc_device<char>(win_size, q);
char *local_buffer = sycl::malloc_device<char>(win_size, q);

/* Create MPI One-sided Window uisng GPU memory */
mpi_errno = MPI_Win_create(win_buffer, win_size, 1,
                                          MPI_INFO_NULL, MPI_COMM_WORLD, &win);

q.submit([&](sycl::handler &h) {
   h.single_task([=]() {
          for (int iteration =0; iteration < num_iterations; ++iteration ) {
                 /* Perform compute using local_buffer */
                 ...
                 /* Lock target window copy to place data */
                 mpi_errno = MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0,  win);

                 /* Perform one-sided communication using device-allocated window. */
                 mpi_errno = MPI_Put(local_buffer, win_count, MPI_CHAR, target_rank, 0, win_size, MPI_CHAR, win);

                 /* Unlock target window */
                 mpi_errno = MPI_Win_unlock(target_rank, win);
         }
    });
}).wait();

Device-Initiated Communications Example (Fortran90 with the OpenMP offload)

!$omp target data map(to: count, ierr, win, target_rank) use_device_ptr(buf)
 !$omp target parallel do is_device_ptr(buf)
 do loop_count = 1, 10
   err = 0
     ! Compute part of a loop
     ...
     ! Communication part of a loop
     call MPI_Put(in_buf(1), count, MPI_INT, target_rank,  1, count, MPI_INT, win, ierr)
     call MPI_Win_flush(target_rank, win, ierr)
     ...
end do
!$omp end target data