Intel® MPI for GPU Clusters
GPGPUs have become common in HPC clusters, and the Intel® MPI Library now supports scale-up and scale-out execution of large applications on such heterogeneous machines. GPU support was first introduced in Intel® MPI Library 2019 U8 (Intel® MPI Library Release Notes), and several enhancements have been added since. Unless otherwise stated, all tests in this section were run with Intel® MPI Library 2021.5, the latest public release available at the time of writing.
This section covers the following GPU-specific details of Intel® MPI Library:
Topology Detection
Execution Models
GPU Pinning
Multi-Stack, Multi-GPU and Multi-Node Support
GPU Benchmarks in IMB (Intel® MPI Benchmarks)
Profiling MPI Applications on GPUs
Recommendations for better performance
Topology Detection
The figure below illustrates a typical GPU compute node, with two CPU sockets. Each CPU socket has N cores and is connected to a single GPU card with M stacks, making a total of 2M stacks per node.
Variations of the above topology are possible. Use the cpuinfo command to see details:
$ cpuinfo -g
Intel(R) processor family information utility, Version 2021.5 Build 20211102 (id: 9279b7d62)
Copyright (C) 2005-2021 Intel Corporation.  All rights reserved.

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R) Gold 6346
Packages(sockets) : 2
Cores             : 32
Processors(CPUs)  : 64
Cores per package : 16
Threads per core  : 2
In this example, we have a dual socket node with 16 cores per socket, making a total of 32 physical cores per node.
$ clinfo -l
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz
Platform #2: Intel(R) OpenCL HD Graphics
 +-- Device #0: Intel(R) Graphics [0x020a]
 `-- Device #1: Intel(R) Graphics [0x020a]
Here, under Platform #2, we see that two GPU cards with PCI ID 020a are present. To check the associated code names for released platforms, please refer to the Intel® Graphics Processor Table.
$ clinfo
…
  Number of devices                               2
  Device Name                                     Intel(R) Graphics [0x020a]
…
  Device Partition                                (core)
    Max number of sub-devices                     2
…
The above truncated output shows that there are 2 sub-devices or stacks per GPU card. Putting together the information from the above commands, one can understand the node topology.
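The same topology can also be queried programmatically through the Intel® oneAPI Level Zero API. The following is a minimal sketch (not part of the Intel MPI Library; error checking omitted, and the build line in the comment is an assumption) that enumerates the GPU cards and the stacks (sub-devices) within each card:

/* Minimal Level Zero topology sketch (illustrative only).
   Assumed build line: icx lz_topo.c -o lz_topo -lze_loader */
#include <stdio.h>
#include <level_zero/ze_api.h>

int main(void) {
    zeInit(ZE_INIT_FLAG_GPU_ONLY);               /* initialize Level Zero for GPU drivers only */

    uint32_t ndrv = 0;
    zeDriverGet(&ndrv, NULL);                    /* first call returns the driver count */
    ze_driver_handle_t drv[8];
    if (ndrv > 8) ndrv = 8;
    zeDriverGet(&ndrv, drv);

    for (uint32_t d = 0; d < ndrv; ++d) {
        uint32_t ndev = 0;
        zeDeviceGet(drv[d], &ndev, NULL);        /* number of GPU cards under this driver */
        ze_device_handle_t dev[16];
        if (ndev > 16) ndev = 16;
        zeDeviceGet(drv[d], &ndev, dev);

        for (uint32_t i = 0; i < ndev; ++i) {
            uint32_t nsub = 0;
            zeDeviceGetSubDevices(dev[i], &nsub, NULL);   /* stacks (sub-devices) per card */
            printf("driver %u, GPU %u: %u stack(s)\n", d, i, nsub);
        }
    }
    return 0;
}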
Execution Models
Developers may use the Intel MPI Library in a GPU environment in many ways. Some of the most common scenarios are described below.
Naïve
In this model, the MPI functions execute on the host (the CPU, for all practical purposes) alone, and the underlying MPI library does not need to be device (here, GPU) aware. As shown in the figure below, developers are responsible for the data movement between host and device (the to: and tofrom: map clauses). Applications that communicate frequently through MPI may therefore suffer from excessive data movement between host and device, although this approach can still be useful for very small message sizes.
Please note that MPI_Send appears outside the scope of both offload pragmas in the figure above. Both SYCL and OpenMP offload constructs may be used for the required data movements between the host and device.
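A minimal C + OpenMP offload sketch of this model is shown below (illustrative only; the buffer name, kernels, and message size are placeholders). The buffers passed to MPI_Send and MPI_Recv are ordinary host buffers, and all host-device copies are expressed explicitly through map clauses:

#include <mpi.h>
#include <stdio.h>
#define N 1024

int main(int argc, char **argv) {
    int rank;
    double values[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* compute on the device; map(tofrom:) copies the results back to the host */
        #pragma omp target teams distribute parallel for map(tofrom: values[0:N])
        for (int i = 0; i < N; ++i)
            values[i] = i * 2.0;

        /* MPI_Send runs on the host and receives a host buffer */
        MPI_Send(values, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double out[N];
        MPI_Recv(values, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* map(to:) copies the received data to the device, map(from:) brings results back */
        #pragma omp target teams distribute parallel for map(to: values[0:N]) map(from: out[0:N])
        for (int i = 0; i < N; ++i)
            out[i] = values[i] + 1.0;

        printf("out[0] = %f\n", out[0]);
    }
    MPI_Finalize();
    return 0;
}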
GPU Aware MPI
In this model, the underlying MPI library is aware of GPU buffers, so developers are relieved of explicit data movement between the host and device. However, to pass a pointer to an offload-able memory region to MPI, developers may need to use specific compiler directives or obtain it from the corresponding acceleration runtime API. Apart from moving the data as required, the Intel MPI Library also optimizes data movement between the host and device to hide the transfer costs as much as possible. The MPI calls still run on the host. A new environment variable called I_MPI_OFFLOAD was introduced for GPU buffer support. As shown in the figure below, MPI_Send appears outside the scope of the omp target parallel for pragma but inside the scope of the target pragma responsible for data movement. Also note that the values pointer is marked with the use_device_ptr and is_device_ptr keywords so that it is processed correctly by the Intel MPI Library. The GPU-aware features of the Intel MPI Library can be used from both SYCL and OpenMP offload based applications.
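A GPU-aware variant of the earlier exchange might look like the following C + OpenMP offload sketch (illustrative only; the buffer name and kernels are placeholders, and I_MPI_OFFLOAD is assumed to be enabled at run time). The pointer handed to MPI inside the use_device_ptr region is a device pointer:

#include <mpi.h>
#include <stdlib.h>
#define N 1024

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *values = (double *)malloc(N * sizeof(double));

    /* keep the buffer resident on the device for the whole exchange */
    #pragma omp target data map(tofrom: values[0:N])
    {
        if (rank == 0) {
            /* kernel works on the already-mapped device copy */
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < N; ++i)
                values[i] = i * 2.0;

            /* inside use_device_ptr, 'values' refers to the device allocation,
               which a GPU-aware MPI library can handle directly */
            #pragma omp target data use_device_ptr(values)
            {
                MPI_Send(values, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            }
        } else if (rank == 1) {
            #pragma omp target data use_device_ptr(values)
            {
                MPI_Recv(values, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < N; ++i)
                values[i] += 1.0;
        }
    }
    free(values);
    MPI_Finalize();
    return 0;
}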
Build command
$ mpiicc -cc=icx -qopenmp -fopenmp-targets=spir64 test.c -o test
where -cc=icx selects the offload-capable icx compiler driver instead of the traditional icc driver, -qopenmp enables the LLVM-based OpenMP support, and -fopenmp-targets=spir64 instructs the compiler to generate an x86 + SPIR64 fat binary containing the GPU device code.
Run command
$ export I_MPI_OFFLOAD=1
$ mpiexec.hydra -n 2 ./test
I_MPI_OFFLOAD controls the handling of device buffers in MPI functions. A value of 1 enables device buffer handling under the assumption that libze_loader.so is already loaded; if it is not (which is unexpected), GPU buffer support is not enabled. To force the Intel MPI Library to load libze_loader.so itself, set I_MPI_OFFLOAD to 2.
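For example, to force the library load and then run the same test (illustrative command lines):

$ export I_MPI_OFFLOAD=2
$ mpiexec.hydra -n 2 ./test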
Kernel Based Collectives
Unlike the GPU-aware MPI model, this model supports execution of MPI scale-up kernels directly on the device. This approach helps eliminate the host-device data transfers that would otherwise be made because the MPI calls execute on the host. MPI-like libraries such as Intel® oneAPI Collective Communications Library (Intel® oneCCL) already support such functionality for collective calls.
GPU Pinning
For a heterogeneous (CPU + device) MPI program, apart from allocating CPU cores to MPI ranks, it is also necessary to allocate the available GPU resources to ranks. In the context of the Intel® MPI Library, the default base unit of work on Intel's discrete GPUs is a stack. Several environment variables have been introduced to achieve the desired pinning scheme. By default, the Intel MPI Library enforces a balanced pinning scheme: if the number of ranks per node equals the number of stacks, one rank is assigned per stack; in case of oversubscription, a round-robin scheme is used for the most balanced distribution possible. The Intel MPI Library needs GPU topology awareness to enforce GPU pinning, and this is achieved with the help of Intel® oneAPI Level Zero. The following environment variable enables the use of Level Zero for topology detection:
$ export I_MPI_OFFLOAD_TOPOLIB=level_zero
Starting with Intel MPI Library 2021 U6, I_MPI_OFFLOAD > 0 will automatically set I_MPI_OFFLOAD_TOPOLIB to level_zero.
At debug level 3, Intel MPI Library prints the GPU pinning scheme used for the application.
$ export I_MPI_DEBUG=3
Following is a sample debug output for an application running with 4 ranks on a 4-stack machine:
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}
[0] MPI startup(): 2    {2}
[0] MPI startup(): 3    {3}
At debug level 120, placement related details are presented.
$ export I_MPI_DEBUG=120
Following is a sample debug output for an application running with 4 ranks on a 4-stack machine:
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0,1
[0] MPI startup(): 1        1       (2,3)   2,3
The information in the “GPU Placement on packages” section should not be treated as GPU pinning information; rather, it shows how the ranks are placed on NUMA nodes. For a node with just one NUMA domain per socket, NUMA nodes can be treated as CPU sockets. Under this assumption, the debug output above can be interpreted as follows:
GPU 0 is connected to CPU socket 0.
GPU 0 contains stacks 0 and 1.
Ranks 0 and 1 are placed on CPU socket 0.
Similarly,
GPU 1 is connected to CPU socket 1.
GPU 1 contains stacks 2 and 3.
Ranks 2 and 3 are placed on CPU socket 1.
For more clarity, let’s consider another example. For the node depicted in the figure above, let’s assume that one would like to restrict execution to GPU 0 alone. This can be achieved using the following environment variable:
$ export I_MPI_OFFLOAD_DEVICES=0
As shown in the debug level 120 output below, ranks 0, 1 and ranks 2, 3 reside on CPU sockets 0 and 1, respectively. However, since I_MPI_OFFLOAD_DEVICES was set to 0, all ranks were pinned in a round-robin manner to the stacks of GPU 0 only. Please refer to the figure above for a pictorial representation.
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0,1
[0] MPI startup(): 1        1       (2,3)   2,3
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}
[0] MPI startup(): 2    {0}
[0] MPI startup(): 3    {1}
Like the I_MPI_PIN_PROCESSOR_LIST environment variable, which defines a custom pinning scheme on CPU cores, a similar variable called I_MPI_OFFLOAD_DEVICE_LIST is available for GPU pinning. For more fine-grained control, additional variables such as I_MPI_OFFLOAD_CELL, I_MPI_OFFLOAD_DOMAIN_SIZE, and I_MPI_OFFLOAD_DOMAIN are available; see the example below and refer to GPU Pinning for more details.
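For example, the I_MPI_OFFLOAD_CELL variable can be used to change the base pinning unit from a single stack to a whole GPU card (exact accepted values are described in the GPU Pinning reference; the setting below is illustrative):

$ export I_MPI_OFFLOAD_CELL=device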
Multi-Stack, Multi-GPU and Multi-Node Support
Modern heterogeneous (CPU + GPU) clusters present various GPU-specific scaling possibilities:
Stack scaling: Scaling across stacks within the same GPU card (intra-GPU, inter-stack).
Scale-up: Scaling across multiple GPU cards connected to the same node (intra-node, inter-GPU).
Scale-out: Scaling across GPU cards connected to different nodes (inter-GPU, inter-node).
All of the above scenarios are transparently handled by the Intel MPI Library and application developers are not required to make additional source changes for this purpose. Sample command lines for each scenario are presented below.
Baseline: All ranks run on stack 0
$ I_MPI_OFFLOAD_DEVICE_LIST=0 I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 2 ./mpi-binary
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0
[0] MPI startup(): 1        1       (2,3)   1
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {0}
Stack scaling: 1 rank per stack on GPU 0
$ I_MPI_OFFLOAD_DEVICE_LIST=0,1 I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 2 ./mpi-binary
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0
[0] MPI startup(): 1        1       (2,3)   1
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}
Scale-up: 1 rank per stack on GPU 0 and GPU 1
$ I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 4 ./mpi-binary
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0,1
[0] MPI startup(): 1        1       (2,3)   2,3
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}
[0] MPI startup(): 2    {2}
[0] MPI startup(): 3    {3}
Scale-out: 1 rank per stack on GPUs 0, 1 of host1 and GPUs 0, 1 of host2
$ I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 8 -ppn 4 -hosts host1,host2 ./mpi-binary
[0] MPI startup(): ===== GPU Placement on packages =====
[0] MPI startup(): NUMA Id  GPU Id  Stacks  Ranks
[0] MPI startup(): 0        0       (0,1)   0,1
[0] MPI startup(): 1        1       (2,3)   2,3
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin stack
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}
[0] MPI startup(): 2    {2}
[0] MPI startup(): 3    {3}
Similar pinning occurred on host2 as well (not shown here).
GPU Benchmarks in IMB
The Intel® MPI Benchmarks (IMB) perform a set of measurements for point-to-point and collective communication operations across a range of message sizes. The generated benchmark data helps assess the performance characteristics of a cluster and the efficiency of the MPI implementation. Traditionally, the IMB benchmarks ran on the CPU only; GPU support was introduced recently. A pre-built binary called IMB-MPI1-GPU ships with the Intel MPI Library package and can be found in the $I_MPI_ROOT/bin folder on Linux machines. This binary runs in the GPU-aware MPI model described earlier.
The following is an example command line for running IMB-MPI1-GPU for the default 0 - 4 MB message size range with 200 iterations per message size on a single node with 2 GPU cards. In this case, just the MPI_Allreduce function was executed. Many other MPI functions are also supported.
$ I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 4 -ppn 4 IMB-MPI1-GPU allreduce -mem_alloc_type device -npmin 4 -iter 200 -iter_policy off
The -mem_alloc_type flag controls where the MPI buffers reside. Since Intel® MPI Library 2021.4, the default value for -mem_alloc_type is device.
-npmin 4 restricts the minimum number of MPI ranks used by IMB to 4 (otherwise, IMB works its way up to the target number of ranks specified by -n in an incremental manner).
-iter 200 sets the number of iterations per message size to 200.
-iter_policy off switches off the automatic reduction of the iteration count for large message sizes. In other words, it ensures that the -iter 200 request is honored for all message sizes.
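Several benchmarks can also be requested in a single run, and the message-size range can be narrowed with the -msglog flag, for example (illustrative command line):

$ I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 mpiexec.hydra -n 4 -ppn 4 IMB-MPI1-GPU allreduce bcast -mem_alloc_type device -msglog 10:22 -iter 200 -iter_policy off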
Line Saturation Test
In this sub-section, we demonstrate the Intel® MPI Library's ability to saturate the interconnect bandwidth when GPU buffers are employed. Tests were run on two nodes, each with two Intel® Xeon® Platinum 8480+ processors (dual socket) and 2 Intel® Data Center GPU Max 1550 cards (code name: Ponte Vecchio). The nodes were connected by a Mellanox Quantum HDR interconnect rated at 200 Gb/s (25 GB/s). A total of 2 MPI ranks were launched (1 rank per node), and the PingPong test from the IMB-MPI1-GPU binary was selected, as shown in the following command:
$ I_MPI_DEBUG=5 I_MPI_OFFLOAD=2 mpiexec.hydra -n 2 -ppn 1 -hosts host1,host2 IMB-MPI1-GPU pingpong -iter 100 -iter_policy off -msglog 0:28
Intel® MPI Library 2021.8 was used for this line saturation test. A peak bandwidth of 23,948.98 MB/s was achieved, which corresponds to a line-rate efficiency of 93.6%.
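The efficiency figure follows directly from the measured peak and the 25 GB/s line rate, assuming the line rate is expressed in binary megabytes (25 × 1024 = 25,600 MB/s): 23,948.98 / 25,600 ≈ 0.936, i.e. roughly 93.6%.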
Profiling MPI Applications on GPUs
Tools like Intel® VTune™ Profiler, Intel® Advisor, and Intel® Inspector have been enhanced to work on Intel GPUs. As with CPU profiling, these tools remain available for MPI applications running on GPUs. Some of their capabilities are presented below.
The IMB-MPI1-GPU benchmark of the previous section is considered here. One way to confirm whether the MPI buffers were really allocated in GPU memory is to check the PCIe traffic, which is bound to increase (versus the -mem_alloc_type CPU case) because the GPUs are connected to the CPUs via PCIe. Intel® VTune™ Profiler's Input and Output analysis captures and displays several PCIe metrics.
Several modes of working with Intel® VTune™ Profiler on remote machines are available. Here the profiling data was collected from the command line interface of a remote machine. The result folder was then moved to a local machine for viewing on a local GUI. The following command lines were executed on the remote machine which hosted the GPU node.
Buffer allocation on CPU:
$ I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 4 -ppn 4 -f hostfile -gtool "vtune -collect io -r ./vtune_data1:0" IMB-MPI1-GPU allreduce -npmin 4 -iter 200 -iter_policy off -mem_alloc_type CPU
Buffer allocation on GPU:
$ I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 4 -ppn 4 -f hostfile -gtool "vtune -collect io -r ./vtune_data2:0" IMB-MPI1-GPU allreduce -npmin 4 -iter 200 -iter_policy off
Here the -gtool flag was used to selectively launch VTune on rank 0 only (as specified by the :0 suffix of the -gtool argument).
As can be observed from the Bandwidth Utilization Histogram in the Summary tab of the Input and Output Analysis (figures below), PCIe traffic is significantly higher in the case where the MPI buffers were allocated on the device. The observed maximum bandwidth in the -mem_alloc_type device case was 24,020 MB/s versus 27 MB/s for the CPU buffer case.
To see the amount of data transferred between host and device one may use Intel® VTune™ Profiler’s GPU Offload analysis. Here Host-to-Device and Device-to-Host transfer times and transfer volumes are visible under the Graphics tab, as shown in Figure 7. The following command line was used for running the GPU Offload analysis:
$ I_MPI_DEBUG=120 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_TOPOLIB=level_zero mpiexec.hydra -n 4 -ppn 4 -f hostfile -gtool "vtune -collect gpu-offload -knob collect-host-gpu-bandwidth=true -r ./vtune_data3:0" IMB-MPI1-GPU allreduce -npmin 4 -iter 200 -iter_policy off
The purpose of this sub-section was to demonstrate the use of analysis tools with MPI applications running in a heterogeneous environment. For more details on the various GPU-specific capabilities of Intel® VTune™ Profiler, please refer to the Intel® VTune™ Profiler User Guide. Other analysis tools such as Intel® Advisor and Intel® Inspector can likewise be used with MPI applications via the -gtool flag, as demonstrated earlier. Please refer to their respective user guides (Intel® Advisor User Guide, Intel® Inspector User Guide for Linux) for more details.
Recommendations
It is good practice to reuse the same GPU buffers in MPI communications as much as possible. Buffer creation in the Intel® MPI Library is an expensive operation, so unnecessary allocation and deallocation of buffers should be avoided, as sketched below.
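A minimal C + OpenMP offload sketch of this buffer-reuse pattern follows (illustrative only; the buffer size, iteration count, and kernel are placeholders, and I_MPI_OFFLOAD is assumed to be enabled). The device buffer is allocated once, passed to MPI in every iteration, and freed only at the end:

#include <mpi.h>
#include <omp.h>
#define N (1 << 20)
#define STEPS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int dev = omp_get_default_device();

    /* allocate the device buffer once, outside the communication loop */
    double *d_buf = (double *)omp_target_alloc(N * sizeof(double), dev);

    for (int step = 0; step < STEPS; ++step) {
        /* device kernel working directly on the device allocation */
        #pragma omp target teams distribute parallel for is_device_ptr(d_buf)
        for (int i = 0; i < N; ++i)
            d_buf[i] = (double)i + step;

        /* the same device buffer is passed to MPI in every iteration,
           so the library's internal handles and caches can be reused */
        MPI_Allreduce(MPI_IN_PLACE, d_buf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    omp_target_free(d_buf, dev);   /* free once at the end */
    MPI_Finalize();
    return 0;
}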
If your application uses several GPU buffers for non-blocking point-to-point operations, it is possible to increase the number of parallel buffers used by the Intel® MPI Library for better performance. An environment variable named I_MPI_OFFLOAD_BUFFER_NUM is available for this purpose. Its default value is 4, i.e., the Intel® MPI Library can handle four user device buffers in parallel. However, increasing this value also increases memory consumption.
If your application uses large device buffers, then increasing the size of the scratch buffers used by Intel® MPI Library might improve performance in some cases. An environment variable named I_MPI_OFFLOAD_BUFFER_SIZE is available for this purpose. Its default value is 8 MB. Increasing the value of this variable will also result in increased memory consumption.
In the GPU-aware model, the Intel® MPI Library uses pipelining for better performance. This technique, however, is better suited to large message sizes, and an environment variable called I_MPI_OFFLOAD_PIPELINE_THRESHOLD is available to control when the pipelining algorithm is activated. Its default value is 524288 bytes. Users may tune this control to the value best suited to their environment.
Another variable called I_MPI_OFFLOAD_CACHE_TOTAL_MAX_SIZE is available to increase the size of all internal caches used by Intel® MPI Library. Its default value is 67108864 bytes. If an application uses many GPU buffers, it might not be possible to cache all of them and increasing the value of this variable can potentially improve performance.
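As an illustration, the tuning controls from this section could be combined as follows (the values shown are arbitrary examples for a hypothetical application, not recommendations; always measure in your own environment):

$ export I_MPI_OFFLOAD_BUFFER_NUM=8
$ export I_MPI_OFFLOAD_BUFFER_SIZE=16777216
$ export I_MPI_OFFLOAD_PIPELINE_THRESHOLD=262144
$ export I_MPI_OFFLOAD_CACHE_TOTAL_MAX_SIZE=134217728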