Intel® MPI Library Developer Reference for Linux* OS

ID 768732
Date 12/16/2022
Public



I_MPI_ADJUST Family Environment Variables

I_MPI_ADJUST_<opname>

Control collective operation algorithm selection.

Syntax

I_MPI_ADJUST_<opname>="<presetid>[:<conditions>][;<presetid>:<conditions>[...]]"

Arguments

<presetid> Preset identifier
>= 0 Set a number to select the desired algorithm. A value of 0 applies the basic logic of collective algorithm selection.
<conditions> A comma-separated list of conditions. An empty list selects all message sizes and process combinations
<l> Messages of size <l>
<l>-<m> Messages of size from <l> to <m>, inclusive
<l>@<p> Messages of size <l> and number of processes <p>
<l>-<m>@<p>-<q> Messages of size from <l> to <m> and number of processes from <p> to <q>, inclusive

Description

Set this environment variable to select the desired algorithm(s) for the collective operation <opname> under particular conditions. Each collective operation has its own environment variable and algorithms.

Environment Variables, Collective Operations, and Algorithms
Environment Variable Collective Operation Algorithms
I_MPI_ADJUST_ALLGATHER MPI_Allgather
  1. Recursive doubling
  2. Bruck's
  3. Ring
  4. Topology aware Gatherv + Bcast
  5. Knomial
I_MPI_ADJUST_ALLGATHERV MPI_Allgatherv
  1. Recursive doubling
  2. Bruck's
  3. Ring
  4. Topology aware Gatherv + Bcast
I_MPI_ADJUST_ALLREDUCE MPI_Allreduce
  1. Recursive doubling
  2. Rabenseifner's
  3. Reduce + Bcast
  4. Topology aware Reduce + Bcast
  5. Binomial gather + scatter
  6. Topology aware binomial gather + scatter
  7. Shumilin's ring
  8. Ring
  9. Knomial
  10. Topology aware SHM-based flat
  11. Topology aware SHM-based Knomial
  12. Topology aware SHM-based Knary
I_MPI_ADJUST_ALLTOALL MPI_Alltoall
  1. Bruck's
  2. Isend/Irecv + waitall
  3. Pair wise exchange
  4. Plum's
I_MPI_ADJUST_ALLTOALLV MPI_Alltoallv
  1. Isend/Irecv + waitall
  2. Plum's

The message size calculation rules for the collective operations are described in the following table, where "n/a" means that the corresponding interval <l>-<m> should be omitted.

NOTE:
The I_MPI_ADJUST_SENDRECV_REPLACE=2 ("Uniform") algorithm can be used only when the datatype and object count are the same across all ranks.

To get the maximum number (range) of presets available for each collective operation, use the impi_info command:

$ impi_info -v I_MPI_ADJUST_ALLREDUCE
I_MPI_ADJUST_ALLREDUCE
  MPI Datatype:
    MPI_CHAR
  Description:
    Control selection of MPI_Allreduce algorithm presets.
    Arguments
    <presetid> - Preset identifier
    range: 0-27          

Message Collective Functions
Collective Function Message Size Formula
MPI_Allgather recv_count*recv_type_size
MPI_Allgatherv total_recv_count*recv_type_size
MPI_Allreduce count*type_size
MPI_Alltoall send_count*send_type_size
MPI_Alltoallv n/a
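As an illustration of the formulas above, the message size for an MPI_Allgather receiving 1024 elements of MPI_DOUBLE (8 bytes each) per rank can be computed as follows (the counts here are hypothetical):

```shell
# Hypothetical values: 1024 received elements of MPI_DOUBLE (8 bytes each)
recv_count=1024
recv_type_size=8

# MPI_Allgather message size formula: recv_count * recv_type_size
msg_size=$((recv_count * recv_type_size))
echo "$msg_size"   # prints 8192
```

A call with this message size would match, for example, the condition range in I_MPI_ADJUST_ALLGATHER="3:0-8192", selecting the Ring algorithm.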

Examples

Use the following setting to select the second algorithm for the MPI_Reduce operation: I_MPI_ADJUST_REDUCE=2

Use the following setting to define the algorithms for the MPI_Reduce_scatter operation: I_MPI_ADJUST_REDUCE_SCATTER="4:0-100,5001-10000;1:101-3200;2:3201-5000;3"

In this case, algorithm 4 is used for message sizes from 0 to 100 bytes and from 5001 to 10000 bytes, algorithm 1 is used for message sizes from 101 to 3200 bytes, algorithm 2 is used for message sizes from 3201 to 5000 bytes, and algorithm 3 is used for all other message sizes.

I_MPI_ADJUST_<opname>_LIST

Syntax

I_MPI_ADJUST_<opname>_LIST=<presetid1>[-<presetid2>][,<presetid3>][,<presetid4>-<presetid5>]

Description

Set this environment variable to specify the set of algorithms to be considered by the Intel MPI runtime for a specified <opname>. This variable is useful in autotuning scenarios, as well as tuning scenarios where users would like to select a certain subset of algorithms.

NOTE:
Setting an empty string disables autotuning for the <opname> collective.
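For example, the autotuner can be restricted to a subset of MPI_Allreduce algorithms. The preset IDs below refer to the MPI_Allreduce table above; treat the particular subset as an illustrative sketch:

```shell
# Consider only presets 1 (recursive doubling), 2 (Rabenseifner's),
# and 7-9 (Shumilin's ring, ring, Knomial) for MPI_Allreduce
export I_MPI_ADJUST_ALLREDUCE_LIST=1,2,7-9
echo "$I_MPI_ADJUST_ALLREDUCE_LIST"
```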

I_MPI_COLL_INTRANODE

Syntax

I_MPI_COLL_INTRANODE=<mode>

Arguments

<mode>  Intranode collectives type
pt2pt Use only point-to-point communication-based collectives
shm Use shared memory collectives. This is the default value

Description

Set this environment variable to switch the intranode communication type for collective operations. If the application uses a large set of communicators, you can switch off SHM collectives to avoid memory overconsumption.
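For example, a job that creates many communicators could fall back to point-to-point intranode collectives. The launch line below is a placeholder (./myapp and the rank count are hypothetical):

```shell
# Switch off SHM-based intranode collectives to limit shared memory usage
export I_MPI_COLL_INTRANODE=pt2pt
# Hypothetical launch; ./myapp is a placeholder application
# mpiexec -n 64 ./myapp
echo "$I_MPI_COLL_INTRANODE"
```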

I_MPI_COLL_INTRANODE_SHM_THRESHOLD

Syntax

I_MPI_COLL_INTRANODE_SHM_THRESHOLD=<nbytes>

Arguments

<nbytes>  Define the maximum data block size processed by shared memory collectives
> 0 Use the specified size. The default value is 16384 bytes.

Description

Set this environment variable to define the size of the shared memory area available to each rank for data placement. Messages larger than this value are not processed by SHM-based collective operations and fall back to point-to-point based collective operations. The value must be a multiple of 4096.
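Because the value must be a multiple of 4096, a raised threshold can be checked before being exported. The chosen size below (double the 16384-byte default) is illustrative:

```shell
# Double the default 16384-byte SHM threshold
threshold=32768
# Sanity check: the value must be a multiple of 4096 before exporting it
[ $((threshold % 4096)) -eq 0 ] && export I_MPI_COLL_INTRANODE_SHM_THRESHOLD=$threshold
echo "$I_MPI_COLL_INTRANODE_SHM_THRESHOLD"
```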

I_MPI_COLL_EXTERNAL

Syntax

I_MPI_COLL_EXTERNAL=<arg>

Arguments

<arg>  Description
enable | yes | on | 1 Enable the external collective operations functionality using available collectives libraries.
disable | no | off | 0 Disable the external collective operations functionality. This is the default value.
hcoll Enable the external collective operations functionality using HCOLL library.

Description

Set this environment variable to enable external collective operations. To reach better performance, run the autotuner after enabling I_MPI_COLL_EXTERNAL; this process obtains the optimal collective settings.

To force external collective operations usage, use the following I_MPI_ADJUST_<opname> values: I_MPI_ADJUST_ALLREDUCE=24, I_MPI_ADJUST_BARRIER=11, I_MPI_ADJUST_BCAST=16, I_MPI_ADJUST_REDUCE=13, I_MPI_ADJUST_ALLGATHER=6, I_MPI_ADJUST_ALLTOALL=5, I_MPI_ADJUST_ALLTOALLV=5, I_MPI_ADJUST_SCAN=3, I_MPI_ADJUST_EXSCAN=3, I_MPI_ADJUST_GATHER=5, I_MPI_ADJUST_GATHERV=4, I_MPI_ADJUST_SCATTER=5, I_MPI_ADJUST_SCATTERV=4, I_MPI_ADJUST_ALLGATHERV=5, I_MPI_ADJUST_ALLTOALLW=2, I_MPI_ADJUST_REDUCE_SCATTER=6, I_MPI_ADJUST_REDUCE_SCATTER_BLOCK=4, I_MPI_ADJUST_IALLGATHER=5, I_MPI_ADJUST_IALLGATHERV=5, I_MPI_ADJUST_IGATHERV=3, I_MPI_ADJUST_IALLREDUCE=9, I_MPI_ADJUST_IALLTOALLV=2, I_MPI_ADJUST_IBARRIER=2, I_MPI_ADJUST_IBCAST=5, I_MPI_ADJUST_IREDUCE=4.
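For instance, to enable HCOLL and force its use for MPI_Allreduce and MPI_Bcast specifically, the presets from the list above can be combined as follows (whether HCOLL is actually used at run time depends on the library being installed):

```shell
# Enable external collectives through the HCOLL library
export I_MPI_COLL_EXTERNAL=hcoll
# Force the external-collectives presets for selected operations
export I_MPI_ADJUST_ALLREDUCE=24
export I_MPI_ADJUST_BCAST=16
echo "$I_MPI_COLL_EXTERNAL $I_MPI_ADJUST_ALLREDUCE $I_MPI_ADJUST_BCAST"
```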

For more information on HCOLL tuning, refer to NVIDIA* documentation.

I_MPI_COLL_DIRECT

Syntax

I_MPI_COLL_DIRECT=<arg>

Arguments

<arg> Description
on Enable direct collectives. This is the default value.
off Disable direct collectives.

Description

Set this environment variable to control direct collectives usage. Disable this variable to eliminate OFI* usage for intranode communications when using the shm:ofi fabric.

I_MPI_CBWR

Control the reproducibility of floating-point operation results across different platforms, networks, and topologies, given the same number of processes.

Syntax

I_MPI_CBWR=<arg>

Arguments

<arg> CBWR compatibility mode
0 None. Do not use CBWR in a library-wide mode. CNR-safe communicators may be created explicitly with MPI_Comm_dup_with_info. This is the default value.
1 Weak mode. Disable topology-aware collectives. The result of a collective operation does not depend on the rank placement. This mode guarantees reproducible results across different runs on the same cluster (independent of the rank placement).
2 Strict mode. Disable topology-aware collectives and ignore the CPU architecture and interconnect during algorithm selection. This mode guarantees reproducible results across different runs on different clusters (independent of the rank placement, CPU architecture, and interconnect).

Description

Conditional Numerical Reproducibility (CNR) provides controls for obtaining reproducible floating-point results for collective operations. With this feature, Intel MPI collective operations are designed to return the same floating-point results from run to run, given the same number of MPI ranks.

Control this feature with the I_MPI_CBWR environment variable in a library-wide manner, where all collectives on all communicators are guaranteed to have reproducible results. To control floating-point reproducibility in a more precise, per-communicator way, pass the {"I_MPI_CBWR", "yes"} key-value pair to the MPI_Comm_dup_with_info call.

NOTE:

Setting I_MPI_CBWR in a library-wide mode using the environment variable leads to a performance penalty.

CNR-safe communicators created using MPI_Comm_dup_with_info always work in the strict mode. For example:

MPI_Info hint;
MPI_Comm cbwr_safe_world, cbwr_safe_copy;
MPI_Info_create(&hint);
/* Request a CNR-safe communicator via the I_MPI_CBWR info key */
MPI_Info_set(hint, "I_MPI_CBWR", "yes");
MPI_Comm_dup_with_info(MPI_COMM_WORLD, hint, &cbwr_safe_world);
/* Duplicates of a CNR-safe communicator are also CNR-safe */
MPI_Comm_dup(cbwr_safe_world, &cbwr_safe_copy);
MPI_Info_free(&hint);

In the example above, both cbwr_safe_world and cbwr_safe_copy are CNR-safe. Use cbwr_safe_world and its duplicates to get reproducible results for critical operations.

Note that MPI_COMM_WORLD itself may be used for performance-critical operations without reproducibility limitations.