GPU Buffers Support
This feature enables handling of device buffers in MPI functions such as MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce, and so on by using the Level Zero* library or the CUDA* Library.
To pass a pointer of an offloaded memory region to MPI, you may need to use specific compiler directives or get it from corresponding acceleration runtime API. For example, use_device_ptr and use_device_addr are useful keywords to obtain device pointers in the OpenMP environment, as shown in the following example.
OpenMP Example
/* Copy data from host to device */ #pragma omp target data map(to: rank, values[0:num_values]) use_device_ptr(values) { /* Compute something on device */ #pragma omp target parallel for is_device_ptr(values) for (unsigned i = 0; i < num_values; ++i) { values[i] *= (rank + 1); } /* Send device buffer to another rank */ MPI_Send(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD); }
To achieve the best performance, use the same GPU buffer in MPI communications if possible. It helps Intel® MPI Library cache necessary structures to communicate with the device and reuse them in next iterations.
Set I_MPI_OFFLOAD=0 to disable this feature if you do not provide device buffers to MPI primitives, since handling of device buffers can affect performance.
Device buffer may be used as a parameter of MPI_Win_create function to create a RMA-window placed in device memory. Alternatively, device buffer and/or device-allocated window may be passed to the MPI_put or MPI_Get primitives.
OpenMP Example
/* Allocate device memory */ char win_buffer = (char) omp_target_alloc(win_size, device_id); /* Create MPI One-sided Window uisng GPU memory */ mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
SYCL* Example
sycl::queue q; /* Allocate device memory */ char *win_buffer = sycl::malloc_device<char>(win_size, q); /* Create MPI One-sided Window uisng GPU memory */ mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
I_MPI_OFFLOAD_MEMCPY_KIND
Set this environment variable to select the GPU memcpy kind.
Syntax
I_MPI_OFFLOAD_MEMCPY_KIND=<value>
Arguments
Value | Description |
---|---|
cached | Cache created objects for communication with GPU so that they can be reused if the same device buffer is later provided to the MPI function. Default value. |
blocked | Copy device buffer to host and wait for the copy to be completed inside MPI function. |
nonblocked | Copy device buffer to host and do not wait for the copy to be completed inside MPI function. Wait for the operation completion in MPI_Wait. |
Description
Set this environment variable to select the GPU memcpy kind. The best performed option is chosen by default. Nonblocked memcpy can be used with MPI non-blocked point-to-point operations to achieve the overlap with compute part. Blocked memcpy can be used if other types are not stable.
I_MPI_OFFLOAD_PIPELINE
Set this environment variable to enable pipeline algorithm.
Syntax
I_MPI_OFFLOAD_PIPELINE=<value>
Arguments
Value | Description |
---|---|
0 | Disable pipeline algorithm. |
1 | Enable pipeline algorithm. Default value. |
Description
Set this environment variable to enable pipeline algorithm, which can improve performance for large message sizes. The main idea of the algorithm is to split user buffer into several segment, and copy the segments to the host and send them to another rank.
I_MPI_OFFLOAD_PIPELINE_THRESHOLD
Set this environment variable to control the threshold for pipeline algorithm.
Syntax
I_MPI_OFFLOAD_PIPELINE_THRESHOLD=<value>
Arguments
Value | Description |
---|---|
0 | Threshold in bytes. The default value is 65536. |
I_MPI_OFFLOAD_COLL_PIPELINE
Turn on GPU pipelining implementations for MPI collective functions.
Syntax
I_MPI_OFFLOAD_COLL_PIPELINE=<arg>
Arguments
Value | Description |
---|---|
disable | no | off | 0 | Disables GPU pipelining for MPI collective functions. |
enable | yes | on | 1 | Enables GPU pipelining for MPI collective functions. This is the default value. |
I_MPI_OFFLOAD_COLL_PIPELINE_ALLREDUCE_SEGMENTS_SIZE
Explicitly set the GPU pipelining data size in bytes per stage for the Allreduce operation.
Syntax
I_MPI_OFFLOAD_COLL_PIPELINE_ALLREDUCE_SEGMENTS_SIZE=<arg>
Arguments
Value | Description |
---|---|
Positive integer | Number of bytes per iteration in the pipeline. |
-1 | Dynamically set the optimal size in runtime by Intel MPI. This is the default value. |
I_MPI_OFFLOAD_RDMA
Set this environment variable to enable GPU RDMA.
Syntax
I_MPI_OFFLOAD_RDMA=<value>
Arguments
Value | Description |
---|---|
0 | Disable RDMA. This is the default value. |
1 | Enable RDMA. |
Description
Set this environment variable to enable GPU-direct transfer using GPU RDMA. When this capability is supported by the network, enabling this environment variable enables direct data transfer between two GPUs.
I_MPI_OFFLOAD_FAST_MEMCPY
Set this environment variable to enable/disable fast memcpy for GPU buffers.
Syntax
I_MPI_OFFLOAD_FAST_MEMCPY=<value>
Arguments
Value | Description |
---|---|
0 | Disable fast memcpy. |
1 | Enable fast memcpy. This is the default value. |
Description
Set this environment variable to enable/disable fast memcpy to optimize performance for small message sizes.
I_MPI_OFFLOAD_IPC
Set this environment variable to enable/disable GPU IPC.
Syntax
I_MPI_OFFLOAD_IPC=<value>
Arguments
Value | Description |
---|---|
0 | Disable IPC path. |
1 | Enable IPC path. This is the default value. |
Description
Set this environment variable to enable/disable GPU IPC. When this capability is supported by the system and devices, enabling this environment variable enables direct data transfer between two GPUs on the same node.
I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE
Set this environment variable to control the threshold, over which copy-in/copy-out is used for collectives on GPU buffers.
Syntax
I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE=<value>
Arguments
Value | Description |
---|---|
Threshold in bytes | The default value is -1 (all sizes). |
Description
Set this environment variable to control the message size, over which copy-in/copy-out is used for collectives on GPU buffers. When CBWR is disabled using I_MPI_OFFLOAD_CBWR=0, for message sizes <= I_MPI_OFFLOAD_COPY_COLL_MAX_SIZE, GPU buffers are copied into the host before executing the collective and back to the device after the collective complete.
When CBWR mode is enabled (default), this environment variable has no effect.
I_MPI_OFFLOAD_FAST_MEMCPY_COLL
Set this environment variable to control the threshold, over which copy-in/copy-out is used for collectives on GPU buffers.
Syntax
I_MPI_OFFLOAD_FAST_MEMCPY_COLL=<value>
Arguments
Value | Description |
---|---|
0 | Disabled. |
1 | Enabled. Collectives with GPU buffers use the fast-copy if applicable. This is the default value. |
Description
Set this environment variable to enable the fast-copy for collectives on GPU buffers.
I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE
Set this environment variable to control the threshold, over which fast-copy is used for collectives on GPU buffers.
Syntax
I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE=<value>
Arguments
Value | Description |
---|---|
Threshold in bytes | The default value is 512. |
Description
Set this environment variable to control the message size, over which fast-copy is used for collectives on GPU buffers.
When you enable it using I_MPI_OFFLOAD_FAST_MEMCPY_COLL=1, the fast-copy is used for message sizes <= I_MPI_OFFLOAD_FAST_MEMCPY_COLL_MAX_SIZE.
I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED
Set this environment variable to enable device-initiated MPI one-sided communications.
Syntax
I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=<value>
Arguments
Value | Description |
---|---|
0 | Disabled. Default value. |
1 | Enabled. |
Description
Set this environment variable to enable device-initiated MPI one-sided communications. This feature allows direct calls to MPI primitives listed below from OpenMP offload section or SYCL* kernel.
Supported primitives:
- MPI_Put
- MPI_Get
- MPI_Win_lock
- MPI_Win_lock_all
- MPI_Win_unlock
- MPI_Win_unlock_all
- MPI_Win_flush/MPI_Win_flush_all
- MPI_Win_fence
Host-initiated Communications Example
sycl::queue q; /* Allocate device memory */ char *win_buffer = sycl::malloc_device<char>(win_size, q); char *local_buffer = sycl::malloc_device<char>(win_size, q); /* Create MPI One-sided Window uisng GPU memory */ mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win); for (int iteration =0; iteration < num_iterations; ++iteration ) { q.submit([&](sycl::handler &h) { h.single_task([=]() { /* Perform compute using local_buffer */ ... }); }).wait(); /* Lock target window copy to place data */ mpi_errno = MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win); /* Perform one-sided communication using device-allocated window. */ mpi_errno = MPI_Put(local_buffer, win_count, MPI_CHAR, target_rank, 0, win_size, MPI_CHAR, win); /* Unlock target window */ mpi_errno = MPI_Win_unlock(target_rank, win); }
Device-initiated Communications Example
sycl::queue q; /* Allocate device memory */ char *win_buffer = sycl::malloc_device<char>(win_size, q); char *local_buffer = sycl::malloc_device<char>(win_size, q); /* Create MPI One-sided Window uisng GPU memory */ mpi_errno = MPI_Win_create(win_buffer, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win); q.submit([&](sycl::handler &h) { h.single_task([=]() { for (int iteration =0; iteration < num_iterations; ++iteration ) { /* Perform compute using local_buffer */ ... /* Lock target window copy to place data */ mpi_errno = MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win); /* Perform one-sided communication using device-allocated window. */ mpi_errno = MPI_Put(local_buffer, win_count, MPI_CHAR, target_rank, 0, win_size, MPI_CHAR, win); /* Unlock target window */ mpi_errno = MPI_Win_unlock(target_rank, win); } }); }).wait();