Intra-Device and Inter-Device Data Transfers for MPI+OpenMP Programs
Some applications are pure OpenMP programs in which a single process uses all available GPU resources. Many other applications combine MPI and OpenMP: multiple MPI ranks are used, and each rank in turn uses OpenMP to access part of the GPU resources (either a full GPU or a GPU subdevice).
Typically, such applications need at some point to exchange data between the GPU resources used by the different ranks. Normally, this requires first transferring the data from the GPU to the CPU that performs the MPI transfer and, once it is received by the target CPU, transferring it again to the target GPU.
In principle, transferring data between subdevices of the same GPU should not require moving the data back to the CPU first. The same is true for systems that use Intel® Xe Link, which allows GPUs to exchange data directly.
Besides the Intel® MPI Library, the Intel® Data Center GPU Max Series enabled MPICH library supports direct transfers between GPUs and GPU subdevices to improve communication efficiency.
The MPI primitives of this library can determine, even when called from the CPU, whether a pointer refers to data that resides on the GPU. They can therefore activate a different communication path that transfers the data directly between GPU devices or subdevices, without routing it through the CPUs.
With this in mind, after installing the library we need to use OpenMP APIs and directives to obtain device pointers that we can then pass to the MPI calls, which activates the direct GPU communication path. There are two possible scenarios:
1. Data was directly allocated on the device by means of omp_target_alloc.
When the data is allocated on the device using the omp_target_alloc routine (or a similar routine), the returned pointer is a device pointer that can be used directly in the MPI calls. For example:
double *dst_buff = (double *) omp_target_alloc(1000 * sizeof(double), device_id);
MPI_Recv(dst_buff,...);
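For reference, a minimal, self-contained sketch of this first scenario might look as follows. It assumes an MPI library with device-to-device support and two ranks; the buffer size, peer ranks, and message tag are illustrative choices, not part of the original example.
// Minimal sketch of scenario 1 (assumptions: 2 ranks, N elements, tag 0).
#include <mpi.h>
#include <omp.h>

#define N 1000

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int device_id = omp_get_default_device();

  // omp_target_alloc takes the size first and the device number second,
  // and returns a device pointer.
  double *buff = (double *) omp_target_alloc(N * sizeof(double), device_id);

  if (rank == 0) {
    // Initialize the buffer on the device; is_device_ptr tells the
    // compiler that buff already holds a device address.
    #pragma omp target teams distribute parallel for is_device_ptr(buff)
    for (int i = 0; i < N; i++)
      buff[i] = i;
    // The device pointer is passed directly to MPI, enabling the
    // direct GPU communication path.
    MPI_Send(buff, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(buff, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  omp_target_free(buff, device_id);
  MPI_Finalize();
  return 0;
}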
2. Data that was mapped on the device.
If the data was allocated on the device using a map clause or one of the OpenMP implicit mapping rules, then we need to use the use_device_ptr or use_device_addr clause of the target data directive to get a device pointer that we can use in the MPI calls. For example:
#pragma omp target data map(dst_buff)
{
  #pragma omp target data use_device_addr(dst_buff)
  {
    MPI_Recv(dst_buff,...);
  }
}
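A hedged, self-contained sketch of this second scenario, with two ranks exchanging a mapped buffer, could look like the following; the buffer size, peer ranks, and tag are again illustrative assumptions.
// Sketch of scenario 2: the buffer is mapped with a map clause and the
// device address is obtained with use_device_addr for the MPI calls.
#include <mpi.h>

#define N 1000

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double buff[N];
  for (int i = 0; i < N; i++)
    buff[i] = rank;                         // initialize the host copy

  #pragma omp target data map(tofrom: buff) // create the device copy
  {
    #pragma omp target data use_device_addr(buff)
    {
      // Inside this region buff refers to the device address, so a
      // device-aware MPI library can move the data GPU to GPU.
      if (rank == 0)
        MPI_Send(buff, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
        MPI_Recv(buff, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
  } // on exit, the received data is copied back to the host copy

  MPI_Finalize();
  return 0;
}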
Now take a look at a more complex example that lets the user select, by means of a flag, whether device pointers (when the Intel® Data Center GPU Max Series enabled MPICH library is being used) or host pointers should be used in the MPI calls. We will also use this code to showcase the performance difference that device-to-device transfers can achieve. The code simply keeps rotating a set of buffers across a number of MPI ranks while increasing their values.
We control whether to use device-to-device transfers or regular MPI transfers by means of the mpi_aware variable.
int mpi_aware = 0;
if ( argc > 1 ) {
  mpi_aware = 1;
  printf("MPI device aware path enabled\n");
} // argc check
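Assuming the example is built into an executable named rotate (an illustrative name), the two paths can then be compared with standard mpirun launch lines such as the following; any GPU-support settings required by the MPI library itself are not shown here.
# Host-staged transfers: no extra argument, so mpi_aware stays 0
mpirun -n 2 ./rotate

# Direct GPU-to-GPU transfers: any extra argument sets mpi_aware to 1
# (requires an MPI library with device-to-device support)
mpirun -n 2 ./rotate 1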
The application uses two buffers, buf1 and buf2, which we first map normally on the device with a target data construct. Next, we use a conditional target data construct to convert the addresses of buf1 and buf2 to device addresses only if mpi_aware is true. So, if mpi_aware is true, the curr and next pointers will hold device addresses; otherwise, they will hold host addresses. This can be observed with the printf statement.
#pragma omp target data map(buf1,buf2)
{
  #pragma omp target data use_device_addr(buf1,buf2) if(mpi_aware)
  {
    curr = buf1;
    next = buf2;
  }
  printf("curr=%p next=%p\n",curr,next);
If mpi_aware is false, printf will print values similar to the following, which are host addresses:
curr=0x7ffdffc11850 next=0x7ffdff470650
On the other hand, if mpi_aware is true, printf will print values similar to the following, which are device addresses:
curr=0xff00000000200000 next=0xff00000000a00000
Finally, before and after the MPI communication calls we use two conditional target update constructs to synchronize the host and device copies of the buffers only if mpi_aware is false, since these transfers are not needed when device-to-device transfers are used.
if ( nranks > 1 ) {
  #pragma omp target update from(curr[0:N]) if(!mpi_aware)
  MPI_Request srq;
  MPI_Isend(curr,N,MPI_DOUBLE,next_rank,0,MPI_COMM_WORLD,&srq);
  // we need to make sure that the MPI_Isend of the previous
  // iteration finished before doing the MPI_Recv of this
  // iteration
  if ( step > 0 ) MPI_Wait(&psrq,MPI_STATUS_IGNORE);
  psrq = srq;
  MPI_Recv(next,N,MPI_DOUBLE,prev_rank,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
  #pragma omp target update to(next[0:N]) if(!mpi_aware)
} // nranks
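For completeness, the sketch below shows one way the fragments above might fit together into a runnable program. Only the pieces already shown come from the original example; the buffer size N, the number of steps, and the kernel that increments the buffer values are illustrative assumptions.
// Hedged sketch assembling the fragments above; N, STEPS, and the
// increment kernel are assumptions, not part of the original example.
#include <mpi.h>
#include <stdio.h>

#define N     1000
#define STEPS 500

int main(int argc, char **argv)
{
  int mpi_aware = 0;
  if ( argc > 1 ) {
    mpi_aware = 1;
    printf("MPI device aware path enabled\n");
  } // argc check

  MPI_Init(&argc, &argv);

  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  int next_rank = (rank + 1) % nranks;
  int prev_rank = (rank + nranks - 1) % nranks;

  double buf1[N], buf2[N];
  for (int i = 0; i < N; i++) { buf1[i] = 0.0; buf2[i] = 0.0; }

  double *curr, *next;
  MPI_Request psrq;

  #pragma omp target data map(buf1,buf2)
  {
    #pragma omp target data use_device_addr(buf1,buf2) if(mpi_aware)
    {
      curr = buf1;
      next = buf2;
    }
    printf("curr=%p next=%p\n",curr,next);

    for (int step = 0; step < STEPS; step++) {
      // Increase the values of the current buffer on the device.
      if (mpi_aware) {
        // curr already holds a device address
        #pragma omp target teams distribute parallel for is_device_ptr(curr)
        for (int i = 0; i < N; i++) curr[i] += 1.0;
      } else {
        // curr holds the host address of an already mapped buffer, so the
        // array section below reuses the existing device copy
        double *c = curr;
        #pragma omp target teams distribute parallel for map(c[0:N])
        for (int i = 0; i < N; i++) c[i] += 1.0;
      }

      if ( nranks > 1 ) {
        #pragma omp target update from(curr[0:N]) if(!mpi_aware)
        MPI_Request srq;
        MPI_Isend(curr,N,MPI_DOUBLE,next_rank,0,MPI_COMM_WORLD,&srq);
        // we need to make sure that the MPI_Isend of the previous
        // iteration finished before doing the MPI_Recv of this
        // iteration
        if ( step > 0 ) MPI_Wait(&psrq,MPI_STATUS_IGNORE);
        psrq = srq;
        MPI_Recv(next,N,MPI_DOUBLE,prev_rank,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
        #pragma omp target update to(next[0:N]) if(!mpi_aware)
      } // nranks

      // rotate the buffers for the next step
      double *tmp = curr; curr = next; next = tmp;
    }

    // complete the last outstanding MPI_Isend
    if ( nranks > 1 && STEPS > 0 ) MPI_Wait(&psrq,MPI_STATUS_IGNORE);
  }

  MPI_Finalize();
  return 0;
}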
We first use this program to evaluate the performance difference on a system with a single GPU divided into two subdevices, one per MPI rank. The table below shows a significant time difference, which is a direct consequence of the reduction in the number of memory operations from the host to the device (M2D operations) and vice versa (D2M operations), as reported by the unitrace tool.
| Version | Time (s) | M2D operations (per rank) | D2M operations (per rank) |
|---|---|---|---|
| mpi_aware = 0 | 3.07 | 1002 | 1002 |
| mpi_aware = 1 | 0.13 | 2 | 2 |
We can now use the same program to evaluate the performance difference on a system with two GPU devices connected with Intel® Xe Link. In this case there are 4 MPI ranks, as each GPU is programmed as two subdevices. We observe a similar reduction in time and in the number of memory operations between the host and the device as in the previous case.
| Version | Time (s) | M2D operations (per rank) | D2M operations (per rank) |
|---|---|---|---|
| mpi_aware = 0 | 3.45 | 1002 | 1002 |
| mpi_aware = 1 | 0.44 | 2 | 2 |
As we have seen, a significant improvement in communication efficiency between GPUs can be achieved when they are connected with Intel® Xe Link and the Intel® Data Center GPU Max Series enabled MPICH library is used.
Note: Using the mpi_aware path with an MPI library that does not support device-to-device transfers may result in an abnormal termination of the program.