Environment Variables

Intel® oneAPI Collective Communications Library (oneCCL) Developer Guide and Reference

Download PDF

ID 772609

Date 10/31/2024

Version

Public

Visible to Intel only — GUID: GUID-889FE8F2-838E-448A-B4FB-93215864D1EB

View Details

Environment Variables

Collective Algorithms Selection

oneCCL supports collective operations for the host (CPU) memory buffers and device (GPU) memory buffers. In addition, oneCCL has two different paths to support collectives with GPU buffers; one directly uses Level Zero, and the other uses SYCL. The SYCL path is a new code being developed and not all collectives are supported.

For the Level Zero implementation, in the case of GPU buffers, oneCCL collectives are optimized to execute a hierarchical algorithm composed of an optimized scale-up phase (communication between ranks/processes in the same node) and a scaleout phase (communication between ranks/processes on different nodes). In the case of CPU buffers, the current collective algorithms do not have support for scale-up and scaleout phases; only a non-hierarchical algorithm can be chosen.

With CCL_<coll_name> = <algo_name>, you can select the algorithm for the collective in <coll_name>. For GPU buffers, the default algorithm is topo, which refers to the scale-up algorithm. If you select an algorithm different from topo, oneCCL will implement a non-hierarchical algorithm, where it will copy the GPU buffers to the Host (CPU) and will run the specified algorithm.

For CPU buffers, topo is not available; you can only select one of the other algorithms in the table for a given collective.

If the collective uses GPU buffers, you can select whether the implementation of the scale-up algorithm should use copy engines or kernels. There is also the option to select the scaleout algorithm using CCL_<coll_name>_SCALEOUT=<algo_name>.

Next, environment variables for collective algorithm selection are explained based on the code path (Level Zero or SYCL), the collective being called, and the type of buffer (GPU or CPU).

Level Zero Path

ALLGATHER/ALLGATHERV

CCL_ALLGATHER/CCL_ALLGATHERV

Syntax

For the whole message size:

CCL_ALLGATHER=<algo_name>
CCL_ALLGATHERV=<algo_name>

For a specific message size range:

CCL_ALLGATHER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of the available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLGATHER="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV="direct:0-8192;ring:8193-max"

Arguments

<algo_name>	Description
`topo`	Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on `MPI_Iallgather`/ `MPI_Iallgatherv`.
`naive`	Send to all, receive from all.
`flat`	alltoall-based algorithm.
`multi_bcast`	Series of broadcast operations with different root ranks.
`ring`	ring-based algorithm.

Description

Use this environment variable to specify the algorithm for ALLGATHER/ ALLGATERV.

If using GPU buffers, select CCL_ALLGATHER=topo or CCL_ALLGATHERV=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL

Syntax

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the allgather phase of `ALLGATHER` / `ALLGATHERV`. The default value.
`0`	Uses copy engines to transfer data across GPUs for the allgather phase of the `ALLGATHER` / `ALLGATHERV` collectives.

Description

Set this environment variable to use GPU buffers to specify the scale-up phase of the algorithm for ALLGATHER/ ALLGATHERV. This environment variable allows the user to choose between using compute kernels or copy engines.

This option is only available if CCL_ALLGATHER = topo or CCL_ALLGATHERV=topo (the default for GPU buffers).

NOTE:

This environment variable applies to both ALLGATHER and ALLGATHERV.

CCL_ALLGATHER_SCALEOUT/CCL_ALLGATHERV_SCALEOUT

Syntax

For the whole message size:

CCL_ALLGATHER_SCALEOUT=<algo_name>
CCL_ALLGATHERV_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLGATHER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
CCL_ALLGATHERV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of the available scaleout collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLGATHER_SCALEOUT="direct:0-8192;ring:8193-max"
CCL_ALLGATHERV_SCALEOUT="direct:0-8192;ring:8193-max"

Arguments

<algo_name>	Description
`direct`	Based on `MPI_Iallgather`/ `MPI_Iallgatherv`.
`naive`	Send to all, receive from all.
`flat`	alltoall-based algorithm.
`multi_bcast`	Series of broadcast operations with different root ranks.
`ring`	ring-based algorithm. The default value.

Description

Set this environment variable to use GPU buffers to specify the scaleout phase of the algorithm for ALLGATHER / ALLGATHERV. This option is only available if CCL_ALLGATHERV = topo or CCL_ALLGATHER = topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

ALLREDUCE

CCL_ALLREDUCE

Syntax

For the whole message size:

CCL_ALLREDUCE=<algo_name>

For a specific message size range:

CCL_ALLREDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLREDUCE="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max"

Arguments

<algo_name>	Description
`topo`	Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on `MPI_Iallreduce`.
`rabenseifner`	Rabenseifner algorithm.
`nreduce`	May be beneficial for imbalanced workloads.
`ring`	reduce_scatter + allgather ring. Use `CCL_RS_CHUNK_COUNT` and `CCL_RS_MIN_CHUNK_SIZE` to control pipelining on reduce_scatter phase.
`double_tree`	double-tree algorithm.
`recursive_doubling`	Recursive doubling algorithm.
`2d`	Two-dimensional algorithm (reduce_scatter + allreduce + allgather).

Description

Use this environment variable to specify the algorithm for ALLREDUCE.

If using GPU buffers, select CCL_ALLREDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the `ALLREDUCE` collectives. The default value.
`0`	Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the `ALLREDUCE`.

Description

Set this environment variable to use GPU buffers to specify how to perform the reduce_scatter portion of the scale-up ALLREDUCE collective. This variable allows you to choose between using compute kernels or copy engines.

This option is only available if CCL_ALLREDUCE=topo (the default for GPU buffers).

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)

Syntax

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the allgather phase of `ALLREDUCE`. The default value.
`0`	Uses copy engines to transfer data across GPUs for the allgather phase of the `ALLREDUCE` collective.

Description

ALLREDUCE is implemented as a reduce-scatter phase followed by an allgather phase.

Set this environment variable to use GPU buffers to specify how to perform the allgather portion of the scale-up ALLREDUCE collective. This environment variable allows the user to choose between using compute kernels or using copy engines. This option is only available if CCL_ALLGATHERV=topo (the default for GPU buffers).

CCL_ALLREDUCE_SCALEOUT (GPU buffers only)

Syntax

For the whole message size:

CCL_ALLREDUCE_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLREDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLREDUCE_SCALEOUT="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max

Arguments

`direct`	Based on `MPI_allreduce`
`rabenseifner`	Rabenseifner algorithm.
`nreduce`	May be beneficial for imbalanced workloads.
`ring`	reduce_scatter + allgather ring. Use `CCL_RS_CHUNK_COUNT` and `CCL_RS_MIN_CHUNK_SIZE` to control pipelining on reduce_scatter phase. The default value.
`double_tree`	double-tree algorithm.
`ring`	Recursive doubling algorithm.

Description

Set this environment variable to use GPU buffers to specify the scaleout algorithm for ALLREDUCE. This option is only available if CCL_ALLREDUCE = topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

ALLTOALL, ALLTOALLV

CCL_ALLTOALL, CCL_ALLTOALLV

Syntax

For the whole message size:

CCL_ALLTOALL=<algo_name>  or CCL_ALLTOALLV=<algo_name>

For a specific message size range:

CCL_ALLTOALL="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

CCL_ALLTOALLV="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLTOALL="naive:0-8192;scatter:8193-max"

CCL_ALLTOALLV="naive:0-8192;scatter:8193-max"

Arguments

`topo`	Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on `MPI_Ialltoall`
`naive`	Send to all, receive from all.
`scatter`	scatter-based algorithm.

CCL_ALLTOALLV_MONOLITHIC_KERNEL

Syntax

CCL_ALLTOALLV_MONOLITHIC_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the allgather phase of the `ALLTOALL` and `ALLTOALLV` collectives. The default value.
`0`	Uses copy engines to transfer data across GPUs for the allgather phase of the `ALLTOALL` and `ALLTOALLV` collectives.

Description

Set this environment variable to use GPU buffers to specify the scale-up algorithm for ALLTOALL or ALLTOALLV This environment variable allows the user to choose between using compute kernels or using copy engines.

This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo. The default for GPU buffers.

CCL_ALLTOALL_SCALEOUT, CCL_scaleout_ALLTOALLV_scaleout

Syntax

For the whole message size:

CCL_ALLTOALL_SCALEOUT=<algo_name>  or CCL_ALLTOALLV_SCALEOUT=<algo_name>

For a specific message size range:

CCL_ALLTOALL_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

CCL_ALLTOALLV_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in a format <left>-<right>. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_ALLTOALL_SCALEOUT="naive:0-8192;scatter:8193-max"

CCL_ALLTOALLV_SCALEOUT="naive:0-8192;scatter:8193-max"

Arguments

<algo_name>	Description
`naive`	Send to all, receive from all.
`scatter`	scatter-based algorithm. The default value.

Description

Set this environment variable to use GPU buffers to specify the scaleout algorithm for ALLTOALL or ALLTOALLV. This option is only available if CCL_ALLTOALL=topo or CCL_ALLTOALLV=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

BARRIER

CCL_BARRIER

Syntax

CCL_BARRIER=<algo_name>

Arguments

<algo_name>	Description
`direct`	Based on `MPI_Ibarrier`.
`ring`	Ring-based algorithm.

Description

Use this environment variable to select the barrier algorithm.

BROADCAST

CCL_BCAST

Syntax

CCL_BCAST=<algo_name>

Arguments

<algo_name>	Description
`topo`	Topology-aware algorithm. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on MPI_Ibcast.
`ring`	ring-based algorithm.
`double_tree`	double-tree algorithm.
`naive`	Send to all from root rank.

Description

Use this environment variable to select the algorithm used for broadcast.

NOTE:

The BCAST algorithm does not yet support the CCL_BCAST_scaleout environment variable. To change the algorithm for BCAST, use the CCL_BCAST environment variable.

REDUCE

CCL_REDUCE

Syntax

For the whole message size:

CCL_REDUCE=<algo_name>

For a specific message size range:

CCL_REDUCE="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in the <left>-<right> format. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE="direct:0-8192;double_tree:1048577-max"

Arguments

<algo_name>	Description
`topo`	Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on `MPI_Ireduce`.
`rabenseifner`	Rabenseifner algorithm.
`tree`	tree algorithm
`double_tree`	double-tree algorithm.

Description

Set this environment variable to specify the algorithm for REDUCE.

If using GPU buffers, select CCL_REDUCE=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers, when selecting an algorithm different from topo, oneCCL copies the data to the host and follows the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the `REDUCE` collective. The default value.
`0`	Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the `REDUCE` collective.

Description

Set this environment variable to use GPU buffers to specify the scale-up algorithm for ALLREDUCE. This environment variable allows the user to choose between using compute kernels or using copy engines.

This option is only available if CCL_REDUCE=topo (the default for GPU buffers).

CCL_REDUCE_SCALEOUT (GPU buffers only)

Syntax

For the whole message size:

CCL_REDUCE_SCALEOUT=<algo_name>

For a specific message size range:

CCL_REDUCE_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in a format <left>-<right>. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCALEOUT="direct:0-8192;double_tree:1048577-max"

Arguments

<algo_name>	Description
`direct`	Based on `MPI_Ireduce`.
`rabenseifner`	Rabenseifner algorithm.
`tree`	tree algorithm.
`double_tree`	double-tree algorithm. The default value.

Description

Set this environment variable to use GPU buffers to specify the scaleout algorithm for REDUCE. This option is only available if CCL_REDUCE=topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

REDUCE_SCATTER

CCL_REDUCE_SCATTER

Syntax

For the whole message size:

CCL_REDUCE_SCATTER=<algo_name>

For a specific message size range:

CCL_REDUCE_SCATTER="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in a format <left>-<right>. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCATTER="direct:0-8192;ring:1048577-max"

Arguments

<algo_name>	Description
`topo`	Topology-aware algorithm for scale-up. The default for GPU buffers. Not available for CPU buffers.
`direct`	Based on `MPI_Ireduce_scatter_block`.
`naive`	Send to all, receive, and reduce from all.
`ring`	ring-based algorithm. Use `CCL_RS_CHUNK_COUNT` and `CCL_RS_MIN_CHUNK_SIZE` to control pipelining.

Description

Use this environment variable to specify the algorithm for reduce. If using GPU buffers, select CCL_REDUCE_SCATTER=topo (the default) to use a hierarchical algorithm for scale-up data transfer across GPUs in the same node. For GPU buffers,when selecting an algorithm different from topo, oneCCL copies the data to the host and follow the specified CPU algorithm.

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL (GPU buffers only)

Syntax

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=<value>

Arguments

<value>	Description
`1`	Uses compute kernels to transfer data across GPUs for the reduce-scatter phase of the `REDUCE_SCATTER` collective. The default value.
`0`	Uses copy engines to transfer data across GPUs for the reduce-scatter phase of the `REDUCE_SCATTER` collective.

Description

Set this environment variable to use GPU buffers to specify how to perform the reduce-scatter portion of the scale-up REDUCE_SCATTER collective. This environment variable allows the user to choose between using compute kernels or using copy engines.

This option is only available if CCL_REDUCE_SCATTER=topo (the default for GPU buffers).

CCL_REDUCE_SCATTER_SCALEOUT (GPU buffers only)

Syntax

For the whole message size:

CCL_REDUCE_SCATTER_SCALEOUT=<algo_name>

For a specific message size range:

CCL_REDUCE_SCATTER_SCALEOUT="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"

Where:

<algo_name> is selected from the list of available collective algorithms.
<size_range> is described by the left and the right size borders in a format <left>-<right>. The size is specified in bytes. To specify the maximum message size, use the reserved word max.

Example

CCL_REDUCE_SCATTER_SCALEOUT="direct:0-8192;double_tree:1048577-max"

Arguments

<algo_name>	Description
`direct`	Based on `MPI_Ireduce_scatter_block`.
`naive`	Send to all, receive, and reduce from all. The default value.
`ring`	Ring-based algorithm. Use `CCL_RS_CHUNK_COUNT` and `CCL_RS_MIN_CHUNK_SIZE` to control pipelining.

Description

Set this environment variable to use GPU buffers to specify the scaleout algorithm for ALLREDUCE. This option is only available if CCL_REDUCE_SCATTER = topo (the default for GPU buffers).

oneCCL internally fills the algorithm selection table with appropriate defaults. Your input complements the selection table.

To see the actual table values, set CCL_LOG_LEVEL=info.

SYCL PATH (Default with 2021.14)

All collectives

CCL_ENABLE_SYCL_KERNELS

Syntax

CCL_ENABLE_SYCL_KERNELS=<value>

Arguments

<value>	Description
`1`	Enable SYCL kernels. The default value.
`0`	Disable SYCL kernels.

Description

Setting this environment variable to 1 enables SYCL kernel-based implementations for ALLGATHERV, ALLREDUCE, and REDUCE_SCATTER.

This new optimization optimizes all message sizes and supports the following data types:

int32
fp32
fp16
bf16
sum operations

oneCCL falls back to other implementations when the support is unavailable with SYCL kernels, so that you can set up this environment variable safely.

NOTE:

The name of this variable in 2021.12 was CCL_SKIP_SCHEDULER. Starting with 2021.13, the variable has been renamed to CCL_ENABLE_SYCL_KERNELS.

ALLGATHER/ALLGATHERV

CCL_SYCL_ALLGATHERV_TMP_BUF

Syntax

CCL_SYCL_ALLGATHERV_TMP_BUF=<value>

Arguments

<value>	Description
`1`	Uses a persistent temporary buffer to perform the `ALLGATHER`/ `ALLGATHERV`.
`0`	Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies if the ALLGATHER/ ALLGATHERV implementation should use a persistent temporary buffer or not. The implementation with temporary buffers makes the collective fully asynchronous, but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC to avoid the copies to the temporary buffer.

CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD

Syntax

CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify the small size algorithm. Default value `131072`.

Description

ALLGATHER/ ALLGATHERV collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.

CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD

Syntax

CCL_SYCL_ALLGATHERV_SCALEOUT_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify when scale-out `ALLGATHER`/ `ALLGATHERV` uses SYCL kernel-based implementation. Default value is `1048576`.

Description

For ALLGATHER/ ALLGATHERV collectives, with the total message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation will switch to the Level Zero Path. The total message size is the number of bytes received from all participating processes.

ALLREDUCE

CCL_SYCL_ALLREDUCE_TMP_BUF

Syntax

CCL_SYCL_ALLREDUCE_TMP_BUF=<value>

Arguments

<value>	Description
`1`	Uses a persistent temporary buffer to perform the `ALLREDUCE` operation.
`0`	Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies whether the ALLREDUCE implementation should use a persistent temporary buffer. The implementation with temporary buffers makes the collective fully asynchronous, but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.

CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD

Syntax

CCL_ALLREDUCE_SMALL_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify the small size algorithm. Default value is `524288`.

Description

ALLREDUCE collective with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.

CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD

Syntax

CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify when scale-out allreduce uses SYCL kernel-based implementation. Default value is `1048576`.

Description

For ALLREDUCE collectives, with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation will switch to the Level Zero Path.

CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD

Syntax

CCL_SYCL_ALLREDUCE_SCALEOUT_DIRECT_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify when allreduce collective selects direct `MPI_Iallreduce` for the scale-out phase of the collective. Default value is `1048576`.

Description

For allreduce collectives with message sizes below this threshold in bytes, MPI_Iallreduce direct algorithm is selected as scale-out phase of the colllective. For message sizes above this threshold and under the CCL_SYCL_ALLREDUCE_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.

REDUCE_SCATTER

CCL_SYCL_REDUCE_SCATTER_TMP_BUF

Syntax

CCL_SYCL_REDUCE_SCATTER_TMP_BUF=<value>

Arguments

<value>	Description
`1`	Uses a persistent temporary buffer to perform the `REDUCE_SCATTER` operation.
`0`	Performs an IPC handle exchange, avoiding copies to temporary buffers. Default value.

Description

Specifies if the REDUCE_SCATTER implementation should use a persistent temporary buffer or not. The implementation with temporary buffers makes the collective fully asynchronous, but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. The current default uses Level Zero IPC support to avoid the copies to the temporary buffer.

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD

Syntax

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_DIRECT_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify when reduce-scatter collective selects direct `MPI_Ireduce_scatter` as scale-out phase algorithm. Default value is `1048576`.

Description

For reduce-scatter collectives with message sizes below this threshold in bytes, MPI_Ireduce_scatter direct algorithm is selected for the scale-out phase of the collective. For message sizes above this threshold and under the CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD, the default algorithm (ring) is selected.

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD

Syntax

CCL_SYCL_REDUCE_SCATTER_SCALEOUT_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify when scale-out `REDUCE_SCATTER` uses SYCL kernel-based implementation. Default value is `4294967296`.

Description

For REDUCE_SCATTER collectives with message sizes below this threshold in bytes, the SYCL path is chosen to execute the collective operation. For message sizes exceeding this threshold, the implementation will switch to the Level Zero Path.

CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD

Syntax

CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD=<value>

Arguments

<value>	Description
`>=0`	Threshold in bytes to specify the small-size algorithm. Default value `2097152`.

Description

REDUCE_SCATTER collectives with message sizes smaller than the specified threshold will use an algorithm specialized for small-sized messages.

Workers

The group of environment variables to control worker threads.

CCL_WORKER_COUNT

Syntax

CCL_WORKER_COUNT=<value>

Arguments

<value>	Description
`N`	The number of worker threads for oneCCL rank (`1` if not specified).

Description

Set this environment variable to specify the number of oneCCL worker threads. For GPU buffers, currently we do not recommend setting this variable to values larger than 1.

CCL_WORKER_AFFINITY

Syntax

CCL_WORKER_AFFINITY=<cpulist>

Arguments

<cpulist>	Description
`auto`	Workers are automatically pinned to last cores of pin domain. Pin domain depends from process launcher. If `mpirun` from oneCCL package is used then pin domain is MPI process pin domain. Otherwise, pin domain is all cores on the node.
`<cpulist>`	A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example `<a>,<b>-<c>` defines list of cores containing core with number `<a>` and range of cores with numbers from `<b>` to `<c>`. The core number should not exceed the number of cores available on the system. The length of the list should be equal to the number of workers.

Description

auto

Workers are automatically pinned to last cores of pin domain. Pin domain depends from process launcher. If mpirun from oneCCL package is used then pin domain is MPI process pin domain. Otherwise, pin domain is all cores on the node.

<cpulist>

A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example <a>,<b>-<c> defines list of cores containing core with number <a> and range of cores with numbers from <b> to <c>. The core number should not exceed the number of cores available on the system. The length of the list should be equal to the number of workers.

Description

Set this environment variable to specify cpu affinity for oneCCL worker threads.

CCL_WORKER_MEM_AFFINITY

Syntax

CCL_WORKER_MEM_AFFINITY=<nodelist>

Arguments

<nodelist>	Description
`auto`	Workers are automatically pinned to NUMA nodes that correspond to CPU affinity of workers.
`<nodelist>`	A comma-separated list of NUMA node numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th NUMA node in the list. The number should not exceed the number of NUMA nodes available on the system.

Description

Set this environment variable to specify memory affinity for oneCCL worker threads.

KVS

CCL_KVS_MODE

Syntax

CCL_KVS_MODE=<value>

Arguments

<cpulist>	Description
`pmi`	PMI transport (default)
`mpi`	MPI transport

Description

Set the environment variable to specify the transport used to establish a connection between ranks during the oneCCL communicator creation. Currently, the mpi value is only supported when the MPI transport is used (see CCL_ATL_TRANSPORT). For large scale runs, we recommend setting KVS_MODE to mpi.

ATL

The group of environment variables to control ATL (abstract transport layer).

CCL_ATL_TRANSPORT

Syntax

CCL_ATL_TRANSPORT=<value>

Arguments

<value>	Description
`mpi`	MPI transport (default).
`ofi`	OFI (libfabric*) transport.

Description

Set this environment variable to select the transport for inter-process communications.

CCL_ATL_HMEM

Syntax

CCL_ATL_HMEM=<value>

Arguments

<value>	Description
`1`	Enable heterogeneous memory support on the transport layer.
`0`	Disable heterogeneous memory support on the transport layer (default).

Description

Set this environment variable to enable handling of HMEM/GPU buffers by the transport layer. The actual HMEM support depends on the limitations on the transport level and system configuration.

CCL_ATL_SHM

Syntax

CCL_ATL_SHM=<value>

Arguments

<value>	Description
`0`	Disables the OFI shared memory provider. The default value.
`1`	Enables the OFI shared memory provider.

Description

Set this environment variable to enable the OFI shared memory provider to communicate between ranks in the same node of the host (CPU) buffers. This capability requires OFI as the transport (CCL_ATL_TRANSPORT=ofi).

The OFI/SHM provider has support to utilize the Intel(R) Data Streaming Accelerator* (DSA). To run it with DSA*, you need: * Linux* OS kernel support for the DSA* shared work queues * Libfabric* 1.17 or later

To enable DSA, set the following environment variables:

FI_SHM_DISABLE_CMA=1
FI_SHM_USE_DSA_SAR=1

Refer to Libfabric* Programmer’s Manual for the additional details about DSA* support in the SHM provider: https://ofiwg.github.io/libfabric/main/man/fi_shm.7.html.

CCL_PROCESS_LAUNCHER

Syntax

CCL_PROCESS_LAUNCHER=<value>

Arguments

<value>	Description
`hydra`	Uses the MPI hydra job launcher. The default value.
`torchrun`	Uses torchrun <https://pytorch.org/docs/stable/elastic/run.html> as a job launcher.
`pmix`	Is used with the PALS job launcher that uses the pmix API. The `mpiexec` command should be similar to: `CCL_PROCESS_LAUNCHER=pmix CCL_ATL_TRANSPORT=mpi mpiexec -np 2 -ppn 2 --pmi=pmix ...`
`none`	No job launcher is used. You should specify the values for `CCL_LOCAL_SIZE and CCL_LOCAL_RANK`.

Description

Set this environment variable to specify the job launcher.

CCL_LOCAL_SIZE

Syntax

CCL_LOCAL_SIZE=<value>

Arguments

<value>	Description
`SIZE`	A total number of ranks on the local host.

Description

Set this environment variable to specify a total number of ranks on a local host.

CCL_LOCAL_RANK

Syntax

CCL_LOCAL_RANK=<value>

Arguments

<value>	Description
`RANK`	Rank number of the current process on the local host.

Description

Set this environment variable to specify the rank number of the current process in the local host.

Multi-NIC

CCL_MNIC, CCL_MNIC_NAME and CCL_MNIC_COUNT define filters to select multiple NICs. oneCCL workers will be pinned on selected NICs in a round-robin way.

CCL_MNIC

Syntax

CCL_MNIC=<value>

Arguments

<value>	Description
`global`	Select all NICs available on the node.
`local`	Select all NICs local for the NUMA node that corresponds to process pinning.
`none`	Disable special NIC selection, use a single default NIC (default).

Description

Set this environment variable to control multi-NIC selection by NIC locality.

CCL_MNIC_NAME

Syntax

CCL_MNIC_NAME=<namelist>

Arguments

<namelist>	Description
`<namelist>`	A comma-separated list of NIC full names or prefixes to filter NICs. Use the `^` symbol to exclude NICs starting with the specified prefixes. For example, if you provide a list `mlx5_0,mlx5_1,^mlx5_2`, NICs with the names `mlx5_0` and `mlx5_1` will be selected, while `mlx5_2` will be excluded from the selection.

Description

Set this environment variable to control multi-NIC selection by NIC names.

CCL_MNIC_COUNT

Syntax

CCL_MNIC_COUNT=<value>

Arguments

<value>	Description
`N`	The maximum number of NICs that should be selected for oneCCL workers. If not specified then equal to the number of oneCCL workers.

Description

Set this environment variable to specify the maximum number of NICs to be selected. The actual number of NICs selected may be smaller due to limitations on transport level or system configuration.

Inter Process Communication (IPC)

CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD

Syntax

CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=<value>

<value>	Description
`N`	The number IPC handles in the receiver cache. The default value is 1000.

Description

Use this environment variable to change the number of IPC handles opened with zeMemOpenIpcHandle() that oneCCL maintains in its receiving cache. IPC handles refer to Level Zero Memory IPCs.

The IPC handles opened with zeMemOpenIpcHandle() are stored by oneCCL in the receiving cache. However, when the number of opened IPC handles exceeds the specified threshold, the cache will evict a handle using a LRU (Last Recently Used) policy. Starting with version 2021.10, the default value is 1000.

CCL_ZE_IPC_EXCHANGE

Syntax

CCL_ZE_IPC_EXCHANGE=<value>

Arguments

<value>	Description
`drmfd`	Uses the DRM mechanism for Level Zero IPC exchange. This is an experimental mechanism that is used with OS kernels previous to SP4. Default value for 2021.13 and before. To use the DRM mechanism, the `libdrm` and `drm` headers must be available on the system.
`pidfd`	Uses `pidfd` mechanism for Level Zero IPC exchange. It requires OS kernel SP4 or above as it requires Linux 5.6 kernel or above. Default with 2021.14.
`sockets`	Uses socket mechanism for Level Zero IPC exchange.
`none`	This mode is used by oneCCL when built on a system without `drmfd` support.

Description

Set this environment variable to specify the mechanism to use for Level Zero IPC exchange.

CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD

Syntax

CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD=<value>

<value>	Description
`N`	The number IPC handles in the receiver cache. The default value is 1000.

Description

Use this environment variable to change the number of IPC handles obtained with zeMemGetIpcHandle() that oneCCL maintains in its sender cache. IPC handles refer to Level Zero Memory IPCs.

The IPC handles obtained with zeMemGetIpcHandle() are stored by oneCCL in the sender cache. However, when the number of get IPC handles exceeds the specified threshold, the cache will evict a handle using a LRU (Last Recently Used) policy. The default value is 1000.

Low-precision Data Types

The group of environment variables to control processing of low-precision data types.

CCL_BF16

Syntax

CCL_BF16=<value>

Arguments

<value>	Description
`avx512f`	Select implementation based on `AVX512F` instructions.
`avx512bf`	Select implementation based on `AVX512_BF16` instructions.

Description

Set this environment variable to select implementation for BF16 <-> FP32 conversion on reduction phase of collective operation. The default value depends on instruction set support on specific CPU. AVX512_BF16-based implementation has precedence over AVX512F-based one.

CCL_FP16

Syntax

CCL_FP16=<value>

Arguments

<value>	Description
`f16c`	Select implementation based on `F16C` instructions.
`avx512f`	Select implementation based on `AVX512F` instructions.
`avx512fp16`	Select implementation based on `AVX512FP16` instructions.

Description

Set this environment variable to select implementation for on reduction phase of collective operation. AVX512FP16 uses native FP16 numeric operations for reduction. AVX512F and F16C use FP16 <-> FP32 conversion operations to perform the reduction. The default value depends on instruction set support on specific CPU. AVX512FP16-based implementation has precedence over AVX512F and F16C-based one.

CCL_ATL_MPI_FP16

Syntax

CCL_ATL_MPI_FP16=<value>

Arguments

<value>	Description
`0`	Disables the Intel MPI native FP16 support.
`1`	Enables the Intel MPI native FP16 support (default for version 2021.14).

Description

Set this environment variable to enable or disable Intel MPI native FP16 support. Requires Intel MPI newer than 2021.13. This variable can be enabled with MPI implementation that is not Intel MPI, such as MPICH, but it will have no impact.

CCL_ATL_MPI_BF16

Syntax

CCL_ATL_MPI_BF16=<value>

Arguments

<value>	Description
`0`	Disables the Intel MPI native BF16 support.
`1`	Enables the Intel MPI native BF16 support (default for version 2021.14).

Description

Set this environment variable to enable or disable Intel MPI native BF16 support. Requires Intel MPI newer than 2021.13. This variable can be enabled with MPI implementation that is not Intel MPI, such as MPICH, but it will have no impact.

CCL_LOG_LEVEL

Syntax

CCL_LOG_LEVEL=<value>

Arguments

<value>
`error`
`warn` (default)
`info`
`debug`
`trace`

Description

Set this environment variable to control logging level.

CCL_ITT_LEVEL

Syntax

CCL_ITT_LEVEL=<value>

Arguments

<value>	Description
`1`	Enable support for ITT profiling.
`0`	Disable support for ITT profiling (default).

Description

Set this environment variable to specify Intel® Instrumentation and Tracing Technology (ITT) profiling level. Once the environment variable is enabled (value > 0), it is possible to collect and display profiling data for oneCCL using tools such as Intel® VTune™ Profiler.

Fusion

The group of environment variables to control fusion of collective operations.

CCL_FUSION

Syntax

CCL_FUSION=<value>

Arguments

<value>	Description
`1`	Enable fusion of collective operations
`0`	Disable fusion of collective operations (default)

Description

Set this environment variable to control fusion of collective operations. The real fusion depends on additional settings described below.

CCL_FUSION_BYTES_THRESHOLD

Syntax

CCL_FUSION_BYTES_THRESHOLD=<value>

Arguments

<value>	Description
`SIZE`	Bytes threshold for a collective operation. If the size of a communication buffer in bytes is less than or equal to `SIZE`, then oneCCL fuses this operation with the other ones.

Description

Set this environment variable to specify the threshold of the number of bytes for a collective operation to be fused.

CCL_FUSION_COUNT_THRESHOLD

Syntax

CCL_FUSION_COUNT_THRESHOLD=<value>

Arguments

<value>	Description
`COUNT`	The threshold for the number of collective operations. oneCCL can fuse together no more than `COUNT` operations at a time.

Description

Set this environment variable to specify count threshold for a collective operation to be fused.

CCL_FUSION_CYCLE_MS

Syntax

CCL_FUSION_CYCLE_MS=<value>

Arguments

<value>	Description
`MS`	The frequency of checking for collectives operations to be fused, in milliseconds: Small `MS` value can improve latency. Large `MS` value can help to fuse larger number of operations at a time.

Description

Set this environment variable to specify the frequency of checking for collectives operations to be fused.

CCL_PRIORITY

Syntax

CCL_PRIORITY=<value>

Arguments

<value>	Description
`direct`	You have to explicitly specify the priority using `priority`.
`lifo`	Priority is implicitly increased on each collective call. You do not have to specify priority.
`none`	Disable prioritization (default).

Description

Set this environment variable to control priority mode of collective operations.

CCL_MAX_SHORT_SIZE

Syntax

CCL_MAX_SHORT_SIZE=<value>

Arguments

<value>	Description
`SIZE`	Bytes threshold for a collective operation (`0` if not specified). If the size of a communication buffer in bytes is less than or equal to `SIZE`, then oneCCL does not split operation between workers. Applicable for `ALLREDUCE`, `REDUCE` and `BROADCAST`.

Description

Set this environment variable to specify the threshold of the number of bytes for a collective operation to be split.

CCL_SYCL_OUTPUT_EVENT

Syntax

CCL_SYCL_OUTPUT_EVENT=<value>

Arguments

<value>	Description
`1`	Enable support for SYCL output event (default).
`0`	Disable support for SYCL output event.

Description

Set this environment variable to control support for SYCL output event. Once the support is enabled, you can retrieve SYCL output event from oneCCL event using get_native() method. oneCCL event must be associated with oneCCL communication operation.

CCL_ZE_LIBRARY_PATH

Syntax

CCL_ZE_LIBRARY_PATH=<value>

Arguments

<value>	Description
`PATH/NAME`	Specify the name and full path to the `Level-Zero` library for dynamic loading by oneCCL.

Description

Set this environment variable to specify the name and full path to Level-Zero library. The path should be absolute and validated. Set this variable if Level-Zero is not located in the default path. By default oneCCL uses the libze_loader.so name for dynamic loading.

Point-To-Point Operations

CCL_RECV

Syntax

CCL_RECV=<value>

Arguments

<value>	Description
`direct`	Based on the MPI/OFI transport layer.
`topo`	Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers.
`offload`	Based on the MPI/OFI transport layer and GPU RDMA when supported by the hardware.

CCL_SEND

Syntax

CCL_SEND=<value>

Arguments

<value>	Description
`direct`	Based on the MPI/OFI transport layer.
`topo`	Uses Intel(R) Xe Link technology across GPUs in a multi-GPU node. The default for GPU buffers.
`offload`	Based on the MPI/OFI transport layer and GPU RDMA when supported by the hardware.

CCL_ZE_TMP_BUF_SIZE

Syntax

CCL_ZE_TMP_BUF_SIZE=<value>

Arguments

<value>	Description
`N`	Size of the temporary buffer (in bytes) oneCCL uses to perform collective operations with topo algorithm and Level Zero path. Default is `536870912`, that is, `512 MBs`.

Description

Set this environment variable to change the size of the temporary buffer used by the topo algorithm in the Level Zero path. The value is specified in bytes. The default value is 536870912.

You can tune the value of this variable depending on the system memory available, the memory the application requires, and the message size of the collectives used. With larger values, oneCCL consumes more memory but can provide higher performance. Similarly, small values will reduce memory utilization, but can degrade performance.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Intel® oneAPI Collective Communications Library (oneCCL) Developer Guide and Reference

Environment Variables

Collective Algorithms Selection

Workers

KVS

ATL

Multi-NIC

Inter Process Communication (IPC)

Low-precision Data Types

CCL_LOG_LEVEL

CCL_ITT_LEVEL

Fusion

CCL_PRIORITY

CCL_MAX_SHORT_SIZE

CCL_SYCL_OUTPUT_EVENT

CCL_ZE_LIBRARY_PATH

Point-To-Point Operations

CCL_ZE_TMP_BUF_SIZE