oneAPI Collective Communications Library Release Notes

ID 763686
Updated 10/15/2024
Version 2021.14
Public

Overview

The Intel® oneAPI Collective Communications Library (oneCCL) enables developers and researchers to more quickly train newer and deeper models. This is done by using optimized communication patterns to distribute model training across multiple nodes.

The library is designed for easy integration into deep learning (DL) frameworks, whether you are implementing them from scratch or customizing existing ones.

  • Built on top of lower-level communication middleware, MPI and OFI (libfabric), which transparently support many interconnects, such as Cornelis Networks, InfiniBand, and Ethernet.
  • Optimized for high performance on Intel® CPUs and GPUs.
  • Allows trading compute for communication performance, which helps drive the scalability of communication patterns.
  • Enables efficient implementations of collectives that are heavily used for neural network training, including allreduce and allgather.

2021.14 Release

Major Features Supported

  Table 1

  Functionality          Subitems                CPU                 GPU
  Collective operations  Allgather               X                   X
                         Allgatherv              X                   X
                         Allreduce               X                   X
                         Alltoall                X                   X
                         Alltoallv               X                   X
                         Barrier                 X                   X
                         Broadcast               X                   X
                         Reduce                  X                   X
                         ReduceScatter           X                   X
  Point to Point         Send                    X                   X
                         Recv                    X                   X
  Data types             [u]int[8, 16, 32, 64]   X                   X
                         fp[16, 32, 64], bf16    X                   X
  Scaling                Scale-up                X                   X
                         Scale-out               X                   X
  Programming model      Rank = device           1 rank per process  1 rank per process

     

  • Service functionality

    • Interoperability with SYCL*:
      • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
      • Construction of a oneCCL stream object from a SYCL queue
      • Construction of a oneCCL event object from a SYCL event
      • Retrieval of the SYCL event associated with a oneCCL collective operation
      • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New 2021.14

  • Optimized key-value store support to scale up to 3,000 nodes
  • New APIs for Allgather, Broadcast, and group API calls
  • Performance optimizations for Allgather, Allreduce, and ReduceScatter, covering both scale-up and scale-out
  • Performance optimizations for single-node CPU runs
  • Optimizations to reuse Level Zero events
  • Changed the default IPC exchange mechanism to pidfd

System Requirements

Please see the oneCCL System Requirements.

 

Previous Releases

2021.13.0 and 2021.13.1

2021.13.1

Bug fixes.

2021.13.0

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Point to Point         Send                    X                   X
                       Recv                    X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Optimizations to limit the memory consumed by oneCCL.
  • Optimizations to limit the number of file descriptors that oneCCL keeps open.
  • Aligned in-place support for the Allgatherv and ReduceScatter collectives to follow the same behavior as NCCL.
    In particular, the Allgatherv collective is in-place when:
    send_buff == recv_buff + rank_offset, where rank_offset = sum(recv_counts[i]) for all i < rank.
    ReduceScatter is in-place when recv_buff == send_buff + rank * recv_count.
  • When the environment variable CCL_WORKER_AFFINITY is used, oneCCL enforces that the length of the list equals the number of workers.
  • Bug fixes.


 

2021.12

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Point to Point         Send                    X                   X
                       Recv                    X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Scale-up performance improvements for Allreduce, Allgather, and ReduceScatter across all message sizes,
    including the small message sizes that appear in inference applications.
  • Scale-out performance improvements for Allreduce, Reduce, Allgather, and ReduceScatter.
  • Optimized memory usage of oneCCL.
  • Support for PMIx 4.2.6.
  • For the 2021.12 release, the Third Party Programs file has been included as a section in this product’s release notes rather than as a separate text file.
  • Bug fixes.

Removals

  • oneCCL 2021.12 removes support for PMIx 4.2.2

 

2021.11.0, 2021.11.1 and 2021.11.2

2021.11.2 Release

This update provides bug fixes to maintain driver compatibility for Intel® Data Center GPU Max Series.

2021.11.1 Release

This update addresses stability issues with distributed training and inference workloads on Intel® Data Center GPU Max Series.

2021.11.0 Release

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Point to Point         Send                    X                   X
                       Recv                    X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Added blocking point-to-point send and receive operations.
  • Performance optimizations for ReduceScatter.
  • Improved profiling through the Intel® Instrumentation and Tracing Technology (ITT) profiling level.

Removals

  • Support for the Intel® C++ Compiler Classic (icc) has been removed

2021.10.0

 

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Improved scaling efficiency of the scale-up algorithms for ReduceScatter
  • Optimized performance of oneCCL scale-up collectives by utilizing the embedded Intel® Data Streaming Accelerator in 4th Gen Intel® Xeon® Scalable processors (formerly code-named Sapphire Rapids)

Removals

  • The sockets provider is removed starting with the 2021.10 release
  • Support for the Intel® C++ Compiler Classic (icc) will be removed starting with the 2021.11 release

2021.9

 

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Improved scaling efficiency of the scale-up algorithms for Alltoall and Allgather
  • Added scale-out algorithm selection for collectives on device (GPU) buffers

2021.8

 

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Optimized performance for the Intel® Data Center GPU Max Series.
  • Enabled support for Allreduce, Allgather, Reduce, and Alltoall across GPUs on the same node.

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

2021.7 and 2021.7.1

2021.7.1 Release

Intel® oneAPI Collective Communications Library 2021.7.1 has been updated to include functional and security updates. Users should update to the latest version as it becomes available.

2021.7 Release

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • No changes from the previous release.

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

 

2021.6

 

Major Features Supported

Table 1

Functionality          Subitems                CPU                 GPU
Collective operations  Allgatherv              X                   X
                       Allreduce               X                   X
                       Alltoall                X                   X
                       Alltoallv               X                   X
                       Barrier                 X                   X
                       Broadcast               X                   X
                       Reduce                  X                   X
                       ReduceScatter           X                   X
Data types             [u]int[8, 16, 32, 64]   X                   X
                       fp[16, 32, 64], bf16    X                   X
Scaling                Scale-up                X                   X
                       Scale-out               X                   X
Programming model      Rank = device           1 rank per process  1 rank per process

Service functionality

  • Interoperability with SYCL*:
    • Construction of a oneCCL communicator object from a SYCL context and a SYCL device
    • Construction of a oneCCL stream object from a SYCL queue
    • Construction of a oneCCL event object from a SYCL event
    • Retrieval of the SYCL event associated with a oneCCL collective operation
    • Passing a SYCL buffer as the source/destination parameter of a oneCCL collective operation

What's New

  • Intel® oneAPI Collective Communications Library now supports Intel® Instrumentation and Tracing Technology (ITT) profiling
  • Intel® oneAPI Collective Communications Library can now be used on Windows platforms through WSL2 (Windows Subsystem for Linux 2)
  • Enhanced application stability with a runtime dependency check for Level Zero

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

2021.5

 

What's New

  • Added support for an output SYCL event to track the status of a oneCCL operation
  • Added the OFI/verbs provider with dmabuf support to the package
  • Bug fixes

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

2021.4

 

What's New

  • Memory binding of worker threads is now supported
  • NIC filtering by name is now supported for OFI-based multi-NIC
  • IPv6 is now supported for key-value store (KVS)

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

 

2021.3

 

What's New

  • Added OFI-based multi-NIC support
  • Added OFI/psm3 provider support
  • Bug fixes

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

 

2021.2

 

What's New

  • Added float16 datatype support.
  • Added an ip-port hint for customizing KVS creation.
  • Optimized the communicator creation phase.
  • Optimized multi-GPU collectives for the single-node case.
  • Bug fixes

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

 

2021.1

What's New

  • Added [u]int16 support
  • Added initial support for an external launch mechanism
  • Bug fixes

Known issues and limitations

  • The 'using namespace oneapi;' directive is not recommended, as it may result in compilation errors when oneCCL is used with other oneAPI libraries. Instead, create a namespace alias for oneCCL, e.g.:

namespace oneccl = ::oneapi::ccl;
oneccl::allreduce(…);

  • Limitations imposed by the Intel® oneAPI DPC++ Compiler:
    • SYCL buffers cannot be used from different queues

 

 

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Third Party Programs File