Get Started with Intel® oneAPI Collective Communications Library

ID 772605
Date 6/24/2024
Public

Intel® oneAPI Collective Communications Library (oneCCL) is a scalable, high-performance communication library for Deep Learning (DL) and Machine Learning (ML) workloads. It builds on the ideas that originated in the Intel® Machine Learning Scaling Library and extends the design and API to cover new features and use cases.

System Requirements

Refer to the oneCCL System Requirements page.

Install

See Intel® oneAPI Toolkits Installation Guide for Linux* OS to learn about oneCCL installation.

Before You Begin

After installing oneCCL, set the environment variables:

  • To load the oneCCL package, run:

    source <install_dir>/ccl/latest/env/vars.sh
  • To load all installed oneAPI components, run:

    source <install_dir>/setvars.sh
NOTE:
To set up the standalone package environment with the setvars.sh script, install the Intel® oneAPI DPC++/C++ Compiler for oneCCL with SYCL support. oneCCL installed as part of the Intel® oneAPI Base Toolkit does not require external dependencies.

You can also modify the oneCCL setup by using two flags when sourcing the vars.sh script:

  • ccl-configuration=[cpu_gpu_dpcpp|cpu] - lets you choose between the SYCL-based version, cpu_gpu_dpcpp (default), and a CPU-only version that does not require SYCL runtime libraries.

  • ccl-bundled-mpi=[yes|no] - controls whether the bundled Intel® MPI is used. The default value is yes.

    • To use Intel(R) MPI, run:

      source intel/oneapi/ccl/2021.11/env/vars.sh --ccl-bundled-mpi=yes

      oneCCL uses the bundled Intel® MPI implementation, which may override a user-supplied MPI setup.

    • To use an MPI implementation other than Intel® MPI, such as MPICH, run:

      source intel/oneapi/ccl/2021.11/env/vars.sh --ccl-bundled-mpi=no
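The configuration flag works the same way. As an illustration (the flag spelling follows the list above; adjust the path to your installation), the CPU-only configuration can be selected like this:

```shell
# Load the CPU-only oneCCL configuration, which does not require
# SYCL runtime libraries (assumes the flag syntax listed above).
source <install_dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu
```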

For more information about setvars.sh, see Use the setvars and oneapi-vars Scripts with Linux*.

Sample Application

The sample code below shows how to use the oneCCL API to perform an allreduce operation on SYCL Unified Shared Memory (USM).

#include <cstdlib> /* for atexit */
#include <iostream>
#include <mpi.h>
#include "oneapi/ccl.hpp"

void mpi_finalize() {
    int is_finalized = 0;
    MPI_Finalized(&is_finalized);

    if (!is_finalized) {
        MPI_Finalize();
    }
}

int main(int argc, char* argv[]) {
    constexpr size_t count = 10 * 1024 * 1024;

    int size = 0;
    int rank = 0;

    ccl::init();

    MPI_Init(nullptr, nullptr);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    atexit(mpi_finalize);

    /* sycl::default_selector is deprecated in SYCL 2020; use the selector value */
    sycl::queue q(sycl::default_selector_v);
    std::cout << "Running on " << q.get_device().get_info<sycl::info::device::name>() << "\n";

    /* create kvs */
    ccl::shared_ptr_class<ccl::kvs> kvs;
    ccl::kvs::address_type main_addr;
    if (rank == 0) {
        kvs = ccl::create_main_kvs();
        main_addr = kvs->get_address();
        MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
    }
    else {
        MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
        kvs = ccl::create_kvs(main_addr);
    }

    /* create communicator */
    auto dev = ccl::create_device(q.get_device());
    auto ctx = ccl::create_context(q.get_context());
    auto comm = ccl::create_communicator(size, rank, dev, ctx, kvs);

    /* create stream */
    auto stream = ccl::create_stream(q);

    /* create buffers */
    auto send_buf = sycl::malloc_device<int>(count, q);
    auto recv_buf = sycl::malloc_device<int>(count, q);

    /* open buffers and modify them on the device side */
    auto e = q.submit([&](auto& h) {
        h.parallel_for(count, [=](auto id) {
            send_buf[id] = rank + id + 1;
            recv_buf[id] = -1;
        });
    });

    int check_sum = 0;
    for (int i = 1; i <= size; ++i) {
        check_sum += i;
    }

    /* do not wait for kernel completion; instead pass the kernel event as a dependency for the operation */
    std::vector<ccl::event> deps;
    deps.push_back(ccl::create_event(e));

    /* invoke allreduce */
    auto attr = ccl::create_operation_attr<ccl::allreduce_attr>();
    ccl::allreduce(send_buf, recv_buf, count, ccl::reduction::sum, comm, stream, attr, deps).wait();

    /* open recv_buf and check its correctness on the device side */
    sycl::buffer<int> check_buf(count);
    q.submit([&](auto& h) {
        sycl::accessor check_buf_acc(check_buf, h, sycl::write_only);
        h.parallel_for(count, [=](auto id) {
            if (recv_buf[id] != static_cast<int>(check_sum + size * id)) {
                check_buf_acc[id] = -1;
            }
            else {
                check_buf_acc[id] = 0; /* initialize: the buffer contents start out undefined */
            }
        });
    });

    q.wait_and_throw();

    /* print out the result of the test on the host side */
    {
        sycl::host_accessor check_buf_acc(check_buf, sycl::read_only);
        size_t i;
        for (i = 0; i < count; i++) {
            if (check_buf_acc[i] == -1) {
                std::cout << "FAILED\n";
                break;
            }
        }
        if (i == count) {
            std::cout << "PASSED\n";
        }
    }

    sycl::free(send_buf, q);
    sycl::free(recv_buf, q);
}

Prerequisites

  • oneCCL with SYCL support is installed and the oneCCL environment is set up (see the installation instructions)

  • Intel® MPI Library is installed and the MPI environment is set up

Run the Sample

  1. Build the sample with the Intel® oneAPI DPC++/C++ Compiler, using the -fsycl option:

     icpx -fsycl -o sample sample.cpp -lccl -lmpi
  2. Run the sample:

     mpiexec <parameters> ./sample

Where <parameters> represents optional mpiexec parameters, such as node count, processes per node, hosts, and so on.
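For example, a single-node launch can use the standard -n option to set the process count (shown here with two processes; substitute any supported mpiexec options):

```shell
# Launch the sample with two MPI processes on the local node.
mpiexec -n 2 ./sample
```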

Compile and Build Applications with pkg-config

The pkg-config tool is widely used to simplify building software with library dependencies. It provides command line options for compiling and linking applications to a library. Intel® oneAPI Collective Communications Library provides pkg-config metadata files for this tool starting with the oneCCL 2021.4 release.

The oneCCL pkg-config metadata files cover both configurations of oneCCL: with and without SYCL support.

Compile

To compile a test sample.cpp program with oneCCL, run:

icpx -fsycl -o sample sample.cpp $(pkg-config --libs --cflags ccl)

--cflags provides the include path to the API directory:

pkg-config --cflags ccl

The output:

-I/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//include/ -I/opt/intel/oneapi/ccl/latest/lib/pkgconfig/../..//include/cpu_gpu_icpx

--libs provides the oneCCL library name, all other dependencies (such as SYCL and MPI), and the search paths needed to find them:

pkg-config --libs ccl

The output:

-L/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//lib/ -L/opt/intel/oneapi/mpi/latest/lib/pkgconfig/../..//lib/release/ -L/opt/intel/oneapi/ccl/latest/lib/pkgconfig/../..//lib/cpu_gpu_icpx -lccl -lsycl -lmpi -lmpicxx -lmpifort
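If your build system passes compile and link flags separately, the same metadata can be split into two queries. A minimal sketch, assuming the oneCCL environment has already been sourced:

```shell
# Query compile and link flags separately, then build as before.
CFLAGS=$(pkg-config --cflags ccl)
LIBS=$(pkg-config --libs ccl)
icpx -fsycl $CFLAGS -o sample sample.cpp $LIBS
```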

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.