Intel® oneAPI Collective Communications Library Benchmark User Guide

ID 816913
Date 3/22/2024
Public

oneCCL Benchmark User Guide

The oneCCL benchmark provides performance measurements for the collective operations in oneCCL, such as:

  • allreduce

  • reduce

  • allgather

  • alltoall

  • alltoallv

  • reduce-scatter

  • broadcast

The benchmark is distributed with the oneCCL package. You can find it in the examples directory within the oneCCL installation path.

Build oneCCL Benchmark

CPU-Only

To build the benchmark, complete the following steps:

  1. Configure your environment. Source the installed oneCCL library for CPU-only support:

    source <ccl installation dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu
  2. Navigate to <oneCCL install location>/share/doc/ccl/examples.

  3. Build the benchmark with the following command:

    cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
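
After the build completes, the benchmark binary is installed under the prefix passed to CMAKE_INSTALL_PREFIX. A quick way to verify the build is to print the benchmark help text; the examples/benchmark subdirectory shown below is an assumption about the install layout and may differ between oneCCL releases:

    ./build/_install/examples/benchmark/benchmark --help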

CPU-GPU

  1. Configure your environment.

    • Source the Intel® oneAPI DPC++/C++ Compiler. See the compiler documentation for instructions.

    • Source the installed oneCCL library for CPU-GPU support:

      source <ccl installation dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp
  2. Navigate to <oneCCL install location>/share/doc/ccl/examples.

  3. Build the SYCL benchmark with the following command:

    cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install
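
Before running the SYCL benchmark, it can be helpful to confirm that the SYCL runtime sees the intended GPU. The sycl-ls tool that ships with the Intel® oneAPI DPC++/C++ Compiler lists the available backends and devices:

    sycl-ls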

Run oneCCL Benchmark

To run the benchmark, use the following command:

mpirun -np <N> -ppn <P> benchmark [arguments]

Where:

  • N is the overall number of processes

  • P is the number of processes within a node
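
For example, the following command (a minimal illustration; the process counts and the collective are placeholders) runs the allreduce collective with 4 processes in total and 2 processes per node, using the default benchmark settings:

    mpirun -np 4 -ppn 2 benchmark -l allreduce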

The benchmark reports:

  • #bytes - the message size in bytes

  • #repetitions - the number of iterations

  • t_min - the time of the fastest process in each iteration, averaged across iterations

  • t_max - the time of the slowest process in each iteration, averaged across iterations

  • t_avg - the average time across processes and iterations

  • stddev - standard deviation

  • wait_t_avg - the average time spent waiting from when the collective call returns until the collective completes. To enable this metric, use the -x option.

Note that t_min, t_max, and t_avg measure the total collective execution time: the timer starts before the oneCCL API call and stops once the collective completes. In contrast, wait_t_avg measures only the wait time: the timer starts after the collective API call returns control to the host (calling) thread and stops once the collective completes. Thus, wait_t_avg does not include the time spent in the oneCCL API call itself, while t_min, t_max, and t_avg do. Time is reported in μsec.
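
The distinction can be illustrated with the following simplified sketch. It is not the actual benchmark source: it assumes float send and receive buffers, an element count, and an initialized ccl::communicator named comm, and it only shows where the two timers conceptually start and stop:

    #include <chrono>
    #include "oneapi/ccl.hpp"

    // Assumed to be set up elsewhere: float* send_buf, float* recv_buf,
    // size_t count, ccl::communicator comm.
    auto total_start = std::chrono::steady_clock::now();  // t_min/t_max/t_avg timer starts here

    // The oneCCL API call; it returns an event and hands control back to the host thread.
    ccl::event ev = ccl::allreduce(send_buf, recv_buf, count,
                                   ccl::reduction::sum, comm);

    auto wait_start = std::chrono::steady_clock::now();   // wait_t_avg timer starts here
    ev.wait();                                             // the collective completes here

    auto end = std::chrono::steady_clock::now();           // both timers stop here
    auto total_us = std::chrono::duration_cast<std::chrono::microseconds>(end - total_start).count();
    auto wait_us  = std::chrono::duration_cast<std::chrono::microseconds>(end - wait_start).count();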

Benchmark Arguments

To see the benchmark arguments, use the --help argument.

The benchmark accepts the following arguments:

-b, --backend
  Specify the backend. The possible values are host and sycl. For a CPU-only build, the backend is automatically set to host, and only the host option is available. For a CPU-GPU build, both host and sycl are available, and sycl is the default. The host value allocates buffers in host (CPU) memory, while the sycl value allocates buffers in device (GPU) memory.
  Default value: sycl

-i, --iters
  Specify the number of iterations executed by the benchmark.
  Default value: 16

-w, --warmup_iters
  Specify the number of warmup iterations, that is, the number of iterations the benchmark runs before it starts timing the iterations specified with the -i argument.
  Default value: 16

-j, --iter_policy
  Specify the iteration policy. Possible values are off and auto. With off, the number of iterations is the same across all message sizes. With auto, the number of iterations is reduced based on the message size of the collective operation.
  Default value: auto

-n, --buf_count
  Specify the number of collective operations the benchmark calls in each iteration. Each collective uses different send and receive buffers. An explicit wait call is placed for each collective after all of them have been called.
  Default value: 1

NOTE:
The -t and -f options specify the count in number of elements, so the total message size in bytes is obtained by multiplying the element count by the size in bytes of the datatype the collective uses. For instance, with -f 128 and the fp32 datatype, the message size is 512 bytes (128 elements * 4 bytes per FP32 element). The benchmark runs and reports times for the message sizes that correspond to the -f and -t arguments and for all message sizes that are powers of two in between.
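
As an illustration of how these arguments combine (the process counts and values below are placeholders, not tuning recommendations), the following command runs the benchmark on the host backend with 32 timed iterations after 4 warmup iterations, a fixed iteration count across message sizes, and two collective operations per iteration:

    mpirun -np 4 -ppn 2 benchmark -b host -i 32 -w 4 -j off -n 2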

Example

GPU

The following example shows how to run the benchmark with the GPU buffers:

mpirun -n <N> -ppn <P> benchmark -a gpu -m usm -u device -l allreduce -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0 -e in_order

The above command runs:

  • The allreduce collective operation

  • With a total of N processes

  • With P processes per node allocating the memory in the GPU

  • Using SYCL Unified Shared Memory (USM) of the device type

  • 20 iterations

  • With the element count from 1024 to 67108864 (the benchmark runs with all powers of two in that range) of the float32 datatype, assuming the collective is not persistent and using a SYCL in-order queue

Similarly, to run both allreduce and reduce_scatter:

mpirun -n <N> -ppn <P> benchmark -a gpu -m usm -u device -l allreduce,reduce_scatter -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0 -e in_order
NOTE:
In this case, the reported time is the accumulated execution time of allreduce and reduce_scatter.

CPU

mpirun -n <N> -ppn <P> benchmark -b host -l allreduce -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0

The above command runs:

  • The allreduce collective operation

  • With a total of N processes

  • With P processes per node

  • 20 iterations

  • With the element count from 1024 to 67108864 (the benchmark runs with all powers of two in that range) of the float32 datatype, assuming the collective is not persistent

Similarly, to run both allreduce and reduce_scatter:

mpirun -n <N> -ppn <P> benchmark -b host -l allreduce,reduce_scatter -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0