Intel® MPI Library Developer Reference for Linux* OS

ID 768732
Date 3/22/2024
Public


I_MPI_ADJUST Family Environment Variables

I_MPI_ADJUST_<opname>

Control collective operation algorithm selection.

Syntax

I_MPI_ADJUST_<opname>="<presetid>[:<conditions>][;<presetid>:<conditions>[...]]"

Arguments

<presetid> Preset identifier
>= 0 Set a number to select the desired algorithm. The value 0 uses the default logic of collective algorithm selection.
<conditions> A comma-separated list of conditions. An empty list selects all message sizes and process combinations
<l> Messages of size <l>
<l>-<m> Messages of size from <l> to <m>, inclusive
<l>@<p> Messages of size <l> and number of processes <p>
<l>-<m>@<p>-<q> Messages of size from <l> to <m> and number of processes from <p> to <q>, inclusive

Description

Set this environment variable to select the desired algorithm(s) for the collective operation <opname> under particular conditions. Each collective operation has its own environment variable and algorithms.

Environment Variables, Collective Operations, and Algorithms
Environment Variable Collective Operation Algorithms
I_MPI_ADJUST_ALLGATHER MPI_Allgather
  1. Recursive doubling
  2. Bruck's
  3. Ring
  4. Topology aware Gatherv + Bcast
  5. Knomial
I_MPI_ADJUST_ALLGATHERV MPI_Allgatherv
  1. Recursive doubling
  2. Bruck's
  3. Ring
  4. Topology aware Gatherv + Bcast
I_MPI_ADJUST_ALLREDUCE MPI_Allreduce
  1. Recursive doubling
  2. Rabenseifner's
  3. Reduce + Bcast
  4. Topology aware Reduce + Bcast
  5. Binomial gather + scatter
  6. Topology aware binomial gather + scatter
  7. Shumilin's ring
  8. Ring
  9. Knomial
  10. Topology aware SHM-based flat
  11. Topology aware SHM-based Knomial
  12. Topology aware SHM-based Knary
I_MPI_ADJUST_ALLTOALL MPI_Alltoall
  1. Bruck's
  2. Isend/Irecv + waitall
  3. Pairwise exchange
  4. Plum's
I_MPI_ADJUST_ALLTOALLV MPI_Alltoallv
  1. Isend/Irecv + waitall
  2. Plum's
I_MPI_ADJUST_ALLTOALLW MPI_Alltoallw Isend/Irecv + waitall
I_MPI_ADJUST_BARRIER MPI_Barrier
  1. Dissemination
  2. Recursive doubling
  3. Topology aware dissemination
  4. Topology aware recursive doubling
  5. Binomial gather + scatter
  6. Topology aware binomial gather + scatter
  7. Topology aware SHM-based flat
  8. Topology aware SHM-based Knomial
  9. Topology aware SHM-based Knary
I_MPI_ADJUST_BCAST MPI_Bcast
  1. Binomial
  2. Recursive doubling
  3. Ring
  4. Topology aware binomial
  5. Topology aware recursive doubling
  6. Topology aware ring
  7. Shumilin's
  8. Knomial
  9. Topology aware SHM-based flat
  10. Topology aware SHM-based Knomial
  11. Topology aware SHM-based Knary
  12. NUMA aware SHM-based (SSE4.2)
  13. NUMA aware SHM-based (AVX2)
  14. NUMA aware SHM-based (AVX512)
I_MPI_ADJUST_EXSCAN MPI_Exscan
  1. Partial results gathering
  2. Partial results gathering regarding layout of processes
I_MPI_ADJUST_GATHER MPI_Gather
  1. Binomial
  2. Topology aware binomial
  3. Shumilin's
  4. Binomial with segmentation
I_MPI_ADJUST_GATHERV MPI_Gatherv
  1. Linear
  2. Topology aware linear
  3. Knomial
I_MPI_ADJUST_REDUCE_SCATTER MPI_Reduce_scatter
  1. Recursive halving
  2. Pairwise exchange
  3. Recursive doubling
  4. Reduce + Scatterv
  5. Topology aware Reduce + Scatterv
I_MPI_ADJUST_REDUCE MPI_Reduce
  1. Shumilin's
  2. Binomial
  3. Topology aware Shumilin's
  4. Topology aware binomial
  5. Rabenseifner's
  6. Topology aware Rabenseifner's
  7. Knomial
  8. Topology aware SHM-based flat
  9. Topology aware SHM-based Knomial
  10. Topology aware SHM-based Knary
  11. Topology aware SHM-based binomial
I_MPI_ADJUST_SCAN MPI_Scan
  1. Partial results gathering
  2. Topology aware partial results gathering
I_MPI_ADJUST_SCATTER MPI_Scatter
  1. Binomial
  2. Topology aware binomial
  3. Shumilin's
I_MPI_ADJUST_SCATTERV MPI_Scatterv
  1. Linear
  2. Topology aware linear
I_MPI_ADJUST_SENDRECV_REPLACE MPI_Sendrecv_replace
  1. Generic
  2. Uniform (with restrictions)
I_MPI_ADJUST_IALLGATHER MPI_Iallgather
  1. Recursive doubling
  2. Bruck’s
  3. Ring
I_MPI_ADJUST_IALLGATHERV MPI_Iallgatherv
  1. Recursive doubling
  2. Bruck’s
  3. Ring
I_MPI_ADJUST_IALLREDUCE MPI_Iallreduce
  1. Recursive doubling
  2. Rabenseifner’s
  3. Reduce + Bcast
  4. Ring (Patarasuk)
  5. Knomial
  6. Binomial
  7. Reduce scatter allgather
  8. SMP
  9. Nreduce
I_MPI_ADJUST_IALLTOALL MPI_Ialltoall
  1. Bruck’s
  2. Isend/Irecv + Waitall
  3. Pairwise exchange
I_MPI_ADJUST_IALLTOALLV MPI_Ialltoallv Isend/Irecv + Waitall
I_MPI_ADJUST_IALLTOALLW MPI_Ialltoallw Isend/Irecv + Waitall
I_MPI_ADJUST_IBARRIER MPI_Ibarrier Dissemination
I_MPI_ADJUST_IBCAST MPI_Ibcast
  1. Binomial
  2. Recursive doubling
  3. Ring
  4. Knomial
  5. SMP
  6. Tree knomial
  7. Tree kary
I_MPI_ADJUST_IEXSCAN MPI_Iexscan
  1. Recursive doubling
  2. SMP
I_MPI_ADJUST_IGATHER MPI_Igather
  1. Binomial
  2. Knomial
I_MPI_ADJUST_IGATHERV MPI_Igatherv
  1. Linear
  2. Linear ssend
I_MPI_ADJUST_IREDUCE_SCATTER MPI_Ireduce_scatter
  1. Recursive halving
  2. Pairwise
  3. Recursive doubling
I_MPI_ADJUST_IREDUCE MPI_Ireduce
  1. Rabenseifner’s
  2. Binomial
  3. Knomial
I_MPI_ADJUST_ISCAN MPI_Iscan
  1. Recursive Doubling
  2. SMP
I_MPI_ADJUST_ISCATTER MPI_Iscatter
  1. Binomial
  2. Knomial
I_MPI_ADJUST_ISCATTERV MPI_Iscatterv Linear

The message size calculation rules for the collective operations are described in the following table, where "n/a" means that the corresponding interval <l>-<m> should be omitted.

NOTE:
The I_MPI_ADJUST_SENDRECV_REPLACE=2 ("Uniform") algorithm can be used only when the datatype and object count are the same across all ranks.

To get the maximum number (range) of presets available for each collective operation, use the impi_info command:

$ impi_info -v I_MPI_ADJUST_ALLREDUCE
I_MPI_ADJUST_ALLREDUCE
  MPI Datatype:
    MPI_CHAR
  Description:
    Control selection of MPI_Allreduce algorithm presets.
    Arguments
    <presetid> - Preset identifier
    range: 0-27          

Message Collective Functions
Collective Function Message Size Formula
MPI_Allgather recv_count*recv_type_size
MPI_Allgatherv total_recv_count*recv_type_size
MPI_Allreduce count*type_size
MPI_Alltoall send_count*send_type_size
MPI_Alltoallv n/a
MPI_Alltoallw n/a
MPI_Barrier n/a
MPI_Bcast count*type_size
MPI_Exscan count*type_size
MPI_Gather recv_count*recv_type_size if MPI_IN_PLACE is used, otherwise send_count*send_type_size
MPI_Gatherv n/a
MPI_Reduce_scatter total_recv_count*type_size
MPI_Reduce count*type_size
MPI_Scan count*type_size
MPI_Scatter send_count*send_type_size if MPI_IN_PLACE is used, otherwise recv_count*recv_type_size
MPI_Scatterv n/a
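As a worked example of the formulas above: an MPI_Bcast of 1024 MPI_DOUBLE elements has a message size of count*type_size = 1024 * 8 = 8192 bytes. The following sketch maps that range to a dedicated algorithm (the threshold and algorithm choices are illustrative, not a tuning recommendation):

```shell
# MPI_Bcast message size = count * type_size.
# 1024 MPI_DOUBLE elements -> 1024 * 8 = 8192 bytes.
# Use algorithm 1 (binomial) for messages up to 8192 bytes
# and algorithm 3 (ring) for everything larger.
export I_MPI_ADJUST_BCAST="1:0-8192;3"
```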

Examples

Use the following setting to select the second algorithm for the MPI_Reduce operation: I_MPI_ADJUST_REDUCE=2

Use the following setting to define the algorithms for the MPI_Reduce_scatter operation: I_MPI_ADJUST_REDUCE_SCATTER="4:0-100,5001-10000;1:101-3200;2:3201-5000;3"

In this case, algorithm 4 is used for message sizes from 0 to 100 bytes and from 5001 to 10000 bytes, algorithm 1 for message sizes from 101 to 3200 bytes, algorithm 2 for message sizes from 3201 to 5000 bytes, and algorithm 3 for all other messages.
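The same settings can also be supplied per run instead of through the login environment; a minimal sketch, assuming a hypothetical application binary ./myapp:

```shell
# Select the binomial algorithm (preset 2) for MPI_Reduce.
export I_MPI_ADJUST_REDUCE=2
# Equivalently, pass the value on the launch command line with -genv
# (./myapp is a placeholder for your application):
#   mpirun -n 4 -genv I_MPI_ADJUST_REDUCE 2 ./myapp
```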

I_MPI_ADJUST_<opname>_LIST

Syntax

I_MPI_ADJUST_<opname>_LIST=<presetid1>[-<presetid2>][,<presetid3>][,<presetid4>-<presetid5>]

Description

Set this environment variable to specify the set of algorithms the Intel MPI runtime considers for the specified <opname>. This variable is useful for autotuning, as well as for tuning scenarios where you want to restrict selection to a certain subset of algorithms.

NOTE:
Setting an empty string disables autotuning for the <opname> collective.
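For example, to restrict the runtime and autotuner to a subset of the MPI_Allreduce presets listed above (the particular subset here is illustrative):

```shell
# Consider only presets 1-4 (recursive doubling through topology aware
# Reduce + Bcast) and preset 8 (ring) for MPI_Allreduce.
export I_MPI_ADJUST_ALLREDUCE_LIST=1-4,8
```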

I_MPI_COLL_INTRANODE

Syntax

I_MPI_COLL_INTRANODE=<mode>

Arguments

<mode>  Intranode collectives type
pt2pt Use only point-to-point communication-based collectives
shm Use shared memory collectives. This is the default value

Description

Set this environment variable to switch the intranode communication type for collective operations. If there is a large set of communicators, you can switch off SHM collectives to avoid excessive memory consumption.
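For example, an application that creates a large number of communicators might switch to point-to-point intranode collectives:

```shell
# Disable SHM collectives to avoid per-communicator shared memory overhead.
export I_MPI_COLL_INTRANODE=pt2pt
```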

I_MPI_COLL_EXTERNAL

Syntax

I_MPI_COLL_EXTERNAL=<arg>

Arguments

<arg>  Description
enable | yes | on | 1 Enable the external collective operations functionality using available collectives libraries.
disable | no | off | 0 Disable the external collective operations functionality. This is the default value.
hcoll Enable the external collective operations functionality using HCOLL library.

Description

Set this environment variable to enable external collective operations. To achieve better performance, run the autotuner after enabling I_MPI_COLL_EXTERNAL; this obtains the optimal collective operation settings.

To force external collective operations usage, use the following I_MPI_ADJUST_<opname> values: I_MPI_ADJUST_ALLREDUCE=24, I_MPI_ADJUST_BARRIER=11, I_MPI_ADJUST_BCAST=16, I_MPI_ADJUST_REDUCE=13, I_MPI_ADJUST_ALLGATHER=6, I_MPI_ADJUST_ALLTOALL=5, I_MPI_ADJUST_ALLTOALLV=5, I_MPI_ADJUST_SCAN=3, I_MPI_ADJUST_EXSCAN=3, I_MPI_ADJUST_GATHER=5, I_MPI_ADJUST_GATHERV=4, I_MPI_ADJUST_SCATTER=5, I_MPI_ADJUST_SCATTERV=4, I_MPI_ADJUST_ALLGATHERV=5, I_MPI_ADJUST_ALLTOALLW=2, I_MPI_ADJUST_REDUCE_SCATTER=6, I_MPI_ADJUST_REDUCE_SCATTER_BLOCK=4, I_MPI_ADJUST_IALLGATHER=5, I_MPI_ADJUST_IALLGATHERV=5, I_MPI_ADJUST_IGATHERV=3, I_MPI_ADJUST_IALLREDUCE=9, I_MPI_ADJUST_IALLTOALLV=2, I_MPI_ADJUST_IBARRIER=2, I_MPI_ADJUST_IBCAST=5, I_MPI_ADJUST_IREDUCE=4.

For more information on HCOLL tuning, refer to NVIDIA* documentation.
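Putting this together, a minimal sketch that enables HCOLL and forces its MPI_Allreduce preset (value 24 from the list above):

```shell
# Enable external collectives through the HCOLL library.
export I_MPI_COLL_EXTERNAL=hcoll
# Force the HCOLL-backed MPI_Allreduce implementation.
export I_MPI_ADJUST_ALLREDUCE=24
```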

I_MPI_COLL_DIRECT

Syntax

I_MPI_COLL_DIRECT=<arg>

Arguments

<arg> Description
on Enable direct collectives. This is the default value.
off Disable direct collectives.

Description

Set this environment variable to control direct collectives usage. Disable direct collectives to avoid OFI* usage for intranode communication when the shm:ofi fabric is used.
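For example, combined with the shm:ofi fabric selection (I_MPI_FABRICS is described elsewhere in this reference):

```shell
# Keep intranode traffic in shared memory instead of going through OFI.
export I_MPI_FABRICS=shm:ofi
export I_MPI_COLL_DIRECT=off
```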

I_MPI_CBWR

Control the reproducibility of floating-point operation results across different platforms, networks, and topologies when the number of processes is the same.

Syntax

I_MPI_CBWR=<arg>

Arguments

<arg> CBWR compatibility mode
0 None. Do not use CBWR in a library-wide mode. CNR-safe communicators may be created with MPI_Comm_dup_with_info explicitly. This is the default value.
1 Weak mode. Disable topology aware collectives. The result of a collective operation does not depend on the rank placement. This mode guarantees reproducible results across different runs on the same cluster (independent of the rank placement).
2 Strict mode. Disable topology aware collectives and ignore the CPU architecture and interconnect during algorithm selection. This mode guarantees reproducible results across different runs on different clusters (independent of the rank placement, CPU architecture, and interconnect).

Description

Conditional Numerical Reproducibility (CNR) provides controls for obtaining reproducible floating-point results in collective operations. With this feature, Intel MPI collective operations are designed to return the same floating-point results from run to run when the number of MPI ranks is the same.

Control this feature with the I_MPI_CBWR environment variable in a library-wide manner, where all collectives on all communicators are guaranteed to have reproducible results. To control the floating-point operations reproducibility in a more precise and per-communicator way, pass the {"I_MPI_CBWR", "yes"} key-value pair to the MPI_Comm_dup_with_info call.

NOTE:

Setting I_MPI_CBWR in a library-wide mode using the environment variable leads to a performance penalty.
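For example, a library-wide weak-mode setting, accepting the performance cost noted above:

```shell
# Weak mode: reproducible results across runs on the same cluster,
# independent of rank placement.
export I_MPI_CBWR=1
```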

CNR-safe communicators created using MPI_Comm_dup_with_info always work in the strict mode. For example:

MPI_Info hint;
MPI_Comm cbwr_safe_world, cbwr_safe_copy;
MPI_Info_create(&hint);
MPI_Info_set(hint, "I_MPI_CBWR", "yes");
MPI_Comm_dup_with_info(MPI_COMM_WORLD, hint, &cbwr_safe_world);
MPI_Info_free(&hint);
MPI_Comm_dup(cbwr_safe_world, &cbwr_safe_copy);

In the example above, both cbwr_safe_world and cbwr_safe_copy are CNR-safe. Use cbwr_safe_world and its duplicates to get reproducible results for critical operations.

Note that MPI_COMM_WORLD itself may be used for performance-critical operations without reproducibility limitations.