Get Started with Intel® oneAPI Collective Communications Library

ID 772605
Date 6/30/2025
Public

Intel® oneAPI Collective Communications Library (oneCCL) is a scalable and high-performance communication library for Deep Learning (DL) and Machine Learning (ML) workloads. The library builds upon the ideas introduced in the Intel® Machine Learning Scaling Library, enhancing the design and API to encompass new features and use cases.

oneCCL exposes a collective API for scaling ML and DL workloads in multi-GPU distributed environments:

  • Commonly used collective operations found in ML and DL workloads such as ALLREDUCE, BROADCAST, ALLGATHER

  • Interoperability with SYCL from the Khronos Group

  • Integrated with PyTorch

The runtime implementation of the oneCCL library enables several optimizations, including:

  • Asynchronous progress for compute communication overlap

  • Dedication of one or more cores to ensure optimal network use

  • Support for low-precision data types

Complete the following steps to get started with the oneCCL library:

  1. Install oneCCL

  2. Set the Environment Variables

Install oneCCL

The oneCCL library is available as a stand-alone product and as part of the Intel® oneAPI Base Toolkit.

Prerequisites

See oneCCL System Requirements to learn about hardware and software requirements for oneCCL.

Download and install the library using your preferred distribution option.

Set the Environment Variables

After installing oneCCL, set the environment variables:

  • To load the oneCCL package, run:

    source <install_dir>/ccl/latest/env/vars.sh
  • To load all installed oneAPI components, run:

    source <install_dir>/setvars.sh
NOTE:
To set up the standalone package environment with the setvars.sh script, install the Intel® oneAPI DPC++/C++ Compiler, which provides the SYCL support that oneCCL requires. oneCCL installed as part of the Intel® oneAPI Base Toolkit does not require any external dependencies.

You can also modify the oneCCL setup by using two flags when sourcing the vars.sh script:

  • ccl-configuration=[cpu_gpu_dpcpp|cpu] - Selects between the SYCL-based version, cpu_gpu_dpcpp (default), and a CPU-only version that does not require SYCL runtime libraries.

  • ccl-bundled-mpi=[yes|no] - Controls whether the bundled Intel® MPI is used. The default value is yes.

    • To use Intel® MPI, run:

      source <install_dir>/ccl/latest/env/vars.sh --ccl-bundled-mpi=yes

      oneCCL then uses the bundled Intel® MPI implementation, which may override a user-supplied MPI setup.

    • To use an MPI implementation other than Intel® MPI, such as MPICH, run:

      source <install_dir>/ccl/latest/env/vars.sh --ccl-bundled-mpi=no

For more information about setvars.sh, see Use the setvars and oneapi-vars Scripts with Linux*.

After the environment variable setup is complete, you can build and execute an example.

Build and Run a Sample Application

The following example demonstrates how to use the oneCCL API to perform an ALLREDUCE communication operation on SYCL Unified Shared Memory (USM).

Prerequisites

  • oneCCL with SYCL support is installed and oneCCL environment is set up (see installation instructions)

  • Intel® MPI Library is installed and MPI environment is set up

Steps

  1. Create a sample.cpp file in your project.

  2. Copy the following code into the file.

    #include <cstdlib>
    #include <iostream>
    #include <vector>

    #include <mpi.h>

    #include "oneapi/ccl.hpp"

    void mpi_finalize() {
        int is_finalized = 0;
        MPI_Finalized(&is_finalized);
        if (!is_finalized) {
            MPI_Finalize();
        }
    }

    int main(int argc, char* argv[]) {
        constexpr size_t count = 10 * 1024 * 1024;

        int size = 0;
        int rank = 0;

        ccl::init();

        MPI_Init(nullptr, nullptr);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        atexit(mpi_finalize);

        sycl::default_selector device_selector;
        sycl::queue q(device_selector);
        std::cout << "Running on " << q.get_device().get_info<sycl::info::device::name>() << "\n";

        /* create kvs */
        ccl::shared_ptr_class<ccl::kvs> kvs;
        ccl::kvs::address_type main_addr;
        if (rank == 0) {
            kvs = ccl::create_main_kvs();
            main_addr = kvs->get_address();
            MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
        }
        else {
            MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
            kvs = ccl::create_kvs(main_addr);
        }

        /* create communicator */
        auto dev = ccl::create_device(q.get_device());
        auto ctx = ccl::create_context(q.get_context());
        auto comm = ccl::create_communicator(size, rank, dev, ctx, kvs);

        /* create stream */
        auto stream = ccl::create_stream(q);

        /* create buffers */
        auto send_buf = sycl::malloc_device<int>(count, q);
        auto recv_buf = sycl::malloc_device<int>(count, q);

        /* open buffers and modify them on the device side */
        auto e = q.submit([&](auto& h) {
            h.parallel_for(count, [=](auto id) {
                send_buf[id] = rank + id + 1;
                recv_buf[id] = -1;
            });
        });

        int check_sum = 0;
        for (int i = 1; i <= size; ++i) {
            check_sum += i;
        }

        /* do not wait for kernel completion; pass the kernel event as a dependency for the operation */
        std::vector<ccl::event> deps;
        deps.push_back(ccl::create_event(e));

        /* invoke allreduce */
        auto attr = ccl::create_operation_attr<ccl::allreduce_attr>();
        ccl::allreduce(send_buf, recv_buf, count, ccl::reduction::sum, comm, stream, attr, deps).wait();

        /* open recv_buf and check its correctness on the device side */
        sycl::buffer<int> check_buf(count);
        q.submit([&](auto& h) {
            sycl::accessor check_buf_acc(check_buf, h, sycl::write_only);
            h.parallel_for(count, [=](auto id) {
                if (recv_buf[id] != static_cast<int>(check_sum + size * id)) {
                    check_buf_acc[id] = -1;
                }
            });
        });

        q.wait_and_throw();

        /* print out the result of the test on the host side */
        {
            sycl::host_accessor check_buf_acc(check_buf, sycl::read_only);
            size_t i;
            for (i = 0; i < count; i++) {
                if (check_buf_acc[i] == -1) {
                    std::cout << "FAILED\n";
                    break;
                }
            }
            if (i == count) {
                std::cout << "PASSED\n";
            }
        }

        sycl::free(send_buf, q);
        sycl::free(recv_buf, q);
    }
  3. Use the icpx C++ compiler with the -fsycl option to build the sample:

    icpx -fsycl -o sample sample.cpp -lccl -lmpi
  4. Run the sample:

    mpiexec <parameters> ./sample

Where <parameters> represents optional mpiexec parameters, such as the number of nodes, processes per node, and hosts. For example, mpiexec -n 2 ./sample launches two ranks on the local host.

A successful run prints PASSED. If you encounter an error, verify that the oneCCL environment is configured correctly.

Integrate oneCCL

To improve the performance and scalability of your application, integrate oneCCL into your project. The pkg-config tool simplifies the integration and handles oneCCL's dependencies.

Compile and Build Applications with pkg-config

The pkg-config tool is widely used to simplify building software with library dependencies. It provides command line options for compiling and linking applications to a library. Intel® oneAPI Collective Communications Library provides pkg-config metadata files for this tool starting with the oneCCL 2021.4 release.

The oneCCL pkg-config metadata files cover both configurations of oneCCL: with and without SYCL support.

Compile

To compile a test sample.cpp program with oneCCL, run:

icpx -fsycl -o sample sample.cpp $(pkg-config --libs --cflags ccl)

--cflags provides the include path to the API directory:

pkg-config --cflags ccl

The output:

-I/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//include/ -I/opt/intel/oneapi/ccl/latest/lib/pkgconfig/../..//include/cpu_gpu_icpx

--libs provides the oneCCL library name, all other dependencies (such as SYCL and MPI), and the search paths to find them:

pkg-config --libs ccl

The output:

-L/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//lib/ -L/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//lib/release/ -L/opt/intel/oneapi/ccl/latest/lib/pkgconfig/../..//lib/cpu_gpu_icpx -lccl -lsycl -lmpi -lmpicxx -lmpifort

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.