Intel® MPI Library Developer Guide for Linux* OS

ID 768728
Date 10/31/2024
Public
Document Table of Contents

Device-Initiated Communications

Intel® MPI Library supports device-initiated one-sided communications, which enable you to perform communication directly from kernels executing on a GPU without interrupting the kernel.

This feature allows you to use the existing MPI primitives with unchanged semantics from SYCL* and OpenMP* offload code on Intel® Data Center GPU Max Series. To enable device-initiated communications, set I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1 in addition to the I_MPI_OFFLOAD=1 environment variable.
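For example, both variables can be exported before an mpirun launch (the application name ./myapp below is a placeholder, not taken from this guide):

```shell
# Enable GPU support in Intel MPI Library
export I_MPI_OFFLOAD=1
# Additionally enable device-initiated one-sided communications
export I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1
# Launch as usual; "./myapp" stands in for your application binary
mpirun -n 2 ./myapp
```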

The current version supports the following set of primitives on the device:

  • Communication primitives:
    • MPI_Put
    • MPI_Get
  • Passive-target synchronization primitives:
    • MPI_Win_lock
    • MPI_Win_lock_all
    • MPI_Win_unlock
    • MPI_Win_unlock_all
    • MPI_Win_flush
    • MPI_Win_flush_all
  • Active-target synchronization primitives:
    • MPI_Win_fence
  • Window-query primitives and group management:
    • MPI_Win_get_attr
    • MPI_Win_shared_query
    • MPI_Win_get_group
    • MPI_Group_free
    • MPI_Group_size
  • Intel® MPI notified one-sided communication extension:
    • MPIX_Win_set_notify
    • MPIX_Win_get_notify
    • MPIX_Get_notify
    • MPIX_Put_notify

Language Support

  Language       Communication  Passive-Target   Active-Target    Window-Query      Intel® MPI Notified
                 Primitives     Synchronization  Synchronization  Primitives and    One-Sided
                                Primitives       Primitives       Group Management  Communication Extension
  SYCL*          Supported      Supported        Supported        Supported         Supported
  C/OpenMP*      Supported      Supported        Supported        Supported         Supported
  F77*/OpenMP*   Supported      Supported        Supported        N/A               N/A
  F90*/OpenMP*   Supported      Supported        Supported        N/A               N/A

Examples

You can modify your code as follows to incorporate device-initiated communications:

#include <mpi.h>
#include <sycl/sycl.hpp>

sycl::queue q;
MPI_Win win;
int peer_rank = X;
// Allocate device memory local to a process
int *buf = sycl::malloc_device<int>(size, q);
// Create an RMA window over the device memory
MPI_Win_create(buf, (MPI_Aint) size * sizeof(int),
                sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
// SYCL automatically captures local variables
q.submit([&](sycl::handler &h) {
    h.single_task([=]() {
        int var = 0;
        // Perform computations updating "var"
        // ...

        // Initiate communication from the device
        MPI_Win_lock(MPI_LOCK_SHARED, peer_rank, 0, win);
        MPI_Put(&var, 1, MPI_INT, peer_rank, 0, 1, MPI_INT, win);
        MPI_Win_unlock(peer_rank, win);
        // Continue computations on GPU
    });
}).wait();

According to the MPI-4.0 standard, you can call MPI_Win_shared_query on an RMA window created with any available window creation method. Using MPI_Win_shared_query, an application can efficiently implement arbitrary communication patterns through direct access to the device memory of a peer process.
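As a minimal sketch (assuming the win, peer_rank, and q variables from the example above, and that the peer's window memory is directly accessible from this process), the peer's base address can be queried once on the host and then used for direct stores inside a kernel:

```cpp
// Sketch: query a peer's window base address for direct access.
// "win", "peer_rank", and "q" are assumed from the previous example.
MPI_Aint peer_size = 0;
int peer_disp_unit = 0;
int *peer_base = nullptr;
// MPI_Win_shared_query returns the address at which the target rank's
// window memory can be accessed by the calling process
MPI_Win_shared_query(win, peer_rank, &peer_size, &peer_disp_unit,
                     (void **) &peer_base);
if (peer_base != nullptr) {
    q.submit([&](sycl::handler &h) {
        h.single_task([=]() {
            // Direct store into the peer's device memory; the usual RMA
            // synchronization (e.g. MPI_Win_lock/MPI_Win_unlock) is still
            // required for correctness and is omitted here for brevity
            peer_base[0] = 42;
        });
    }).wait();
}
```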