I_MPI_ADJUST Family Environment Variables
I_MPI_ADJUST_<opname>
Control collective operation algorithm selection.
Syntax
I_MPI_ADJUST_<opname>="<presetid>[:<conditions>][;<presetid>:<conditions>[...]]"
Arguments
<presetid> | Preset identifier |
>= 0 | Set a number to select the desired algorithm. The value 0 selects the basic logic of collective algorithm selection. |
<conditions> | A comma-separated list of conditions. An empty list selects all message sizes and process combinations. |
<l> | Messages of size <l> |
<l>-<m> | Messages of size from <l> to <m>, inclusive |
<l>@<p> | Messages of size <l> and number of processes <p> |
<l>-<m>@<p>-<q> | Messages of size from <l> to <m> and number of processes from <p> to <q>, inclusive |
Description
Set this environment variable to select the desired algorithm(s) for the collective operation <opname> under particular conditions. Each collective operation has its own environment variable and algorithms.
Environment Variable | Collective Operation | Algorithms |
---|---|---|
I_MPI_ADJUST_ALLGATHER | MPI_Allgather | |
I_MPI_ADJUST_ALLGATHERV | MPI_Allgatherv | |
I_MPI_ADJUST_ALLREDUCE | MPI_Allreduce | |
I_MPI_ADJUST_ALLTOALL | MPI_Alltoall | |
I_MPI_ADJUST_ALLTOALLV | MPI_Alltoallv | |
I_MPI_ADJUST_ALLTOALLW | MPI_Alltoallw | Isend/Irecv + waitall |
I_MPI_ADJUST_BARRIER | MPI_Barrier | |
I_MPI_ADJUST_BCAST | MPI_Bcast | |
I_MPI_ADJUST_EXSCAN | MPI_Exscan | |
I_MPI_ADJUST_GATHER | MPI_Gather | |
I_MPI_ADJUST_GATHERV | MPI_Gatherv | |
I_MPI_ADJUST_REDUCE_SCATTER | MPI_Reduce_scatter | |
I_MPI_ADJUST_REDUCE | MPI_Reduce | |
I_MPI_ADJUST_SCAN | MPI_Scan | |
I_MPI_ADJUST_SCATTER | MPI_Scatter | |
I_MPI_ADJUST_SCATTERV | MPI_Scatterv | |
I_MPI_ADJUST_SENDRECV_REPLACE | MPI_Sendrecv_replace | 1. Generic 2. Uniform (with restrictions) |
I_MPI_ADJUST_IALLGATHER | MPI_Iallgather | |
I_MPI_ADJUST_IALLGATHERV | MPI_Iallgatherv | |
I_MPI_ADJUST_IALLREDUCE | MPI_Iallreduce | |
I_MPI_ADJUST_IALLTOALL | MPI_Ialltoall | |
I_MPI_ADJUST_IALLTOALLV | MPI_Ialltoallv | Isend/Irecv + Waitall |
I_MPI_ADJUST_IALLTOALLW | MPI_Ialltoallw | Isend/Irecv + Waitall |
I_MPI_ADJUST_IBARRIER | MPI_Ibarrier | Dissemination |
I_MPI_ADJUST_IBCAST | MPI_Ibcast | |
I_MPI_ADJUST_IEXSCAN | MPI_Iexscan | |
I_MPI_ADJUST_IGATHER | MPI_Igather | |
I_MPI_ADJUST_IGATHERV | MPI_Igatherv | |
I_MPI_ADJUST_IREDUCE_SCATTER | MPI_Ireduce_scatter | |
I_MPI_ADJUST_IREDUCE | MPI_Ireduce | |
I_MPI_ADJUST_ISCAN | MPI_Iscan | |
I_MPI_ADJUST_ISCATTER | MPI_Iscatter | |
I_MPI_ADJUST_ISCATTERV | MPI_Iscatterv | Linear |
The message size calculation rules for the collective operations are described in the table below. In that table, "n/a" means that the corresponding interval <l>-<m> should be omitted.
To get the maximum number (range) of presets available for each collective operation, use the impi_info command:
$ impi_info -v I_MPI_ADJUST_ALLREDUCE
I_MPI_ADJUST_ALLREDUCE
    MPI Datatype: MPI_CHAR
    Description: Control selection of MPI_Allreduce algorithm presets.
    Arguments
        <presetid> - Preset identifier
            range: 0-27
Collective Function | Message Size Formula |
---|---|
MPI_Allgather | recv_count*recv_type_size |
MPI_Allgatherv | total_recv_count*recv_type_size |
MPI_Allreduce | count*type_size |
MPI_Alltoall | send_count*send_type_size |
MPI_Alltoallv | n/a |
MPI_Alltoallw | n/a |
MPI_Barrier | n/a |
MPI_Bcast | count*type_size |
MPI_Exscan | count*type_size |
MPI_Gather | recv_count*recv_type_size if MPI_IN_PLACE is used, otherwise send_count*send_type_size |
MPI_Gatherv | n/a |
MPI_Reduce_scatter | total_recv_count*type_size |
MPI_Reduce | count*type_size |
MPI_Scan | count*type_size |
MPI_Scatter | send_count*send_type_size if MPI_IN_PLACE is used, otherwise recv_count*recv_type_size |
MPI_Scatterv | n/a |
Examples
Use the following setting to select the second algorithm for the MPI_Reduce operation:

I_MPI_ADJUST_REDUCE=2
Use the following setting to define the algorithms for the MPI_Reduce_scatter operation:

I_MPI_ADJUST_REDUCE_SCATTER="4:0-100,5001-10000;1:101-3200;2:3201-5000;3"

In this case, algorithm 4 is used for message sizes from 0 to 100 bytes and from 5001 to 10000 bytes, algorithm 1 is used for message sizes from 101 to 3200 bytes, algorithm 2 is used for message sizes from 3201 to 5000 bytes, and algorithm 3 is used for all other message sizes.
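The size-based selection in this example can be sketched as a small shell function. Note that `pick_preset` is a hypothetical helper, not part of Intel MPI; it only mirrors the ranges in the example string above:

```shell
# Hypothetical helper mirroring the preset selection implied by
# I_MPI_ADJUST_REDUCE_SCATTER="4:0-100,5001-10000;1:101-3200;2:3201-5000;3"
pick_preset() {
    size=$1
    if [ "$size" -le 100 ]; then
        echo 4                      # 4:0-100
    elif [ "$size" -le 3200 ]; then
        echo 1                      # 1:101-3200
    elif [ "$size" -le 5000 ]; then
        echo 2                      # 2:3201-5000
    elif [ "$size" -le 10000 ]; then
        echo 4                      # 4:5001-10000
    else
        echo 3                      # 3 (all remaining sizes)
    fi
}

pick_preset 50      # prints 4
pick_preset 2048    # prints 1
pick_preset 20000   # prints 3
```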
I_MPI_ADJUST_<opname>_LIST
Syntax
I_MPI_ADJUST_<opname>_LIST=<presetid1>[-<presetid2>][,<presetid3>][,<presetid4>-<presetid5>]
Description
Set this environment variable to specify the set of algorithms to be considered by the Intel MPI runtime for a specified <opname>. This variable is useful in autotuning scenarios, as well as tuning scenarios where users would like to select a certain subset of algorithms.
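For example, the search space for MPI_Allreduce could be restricted to a subset of presets. The preset numbers below are placeholders chosen for illustration; query the valid range for your installation with impi_info first:

```shell
# Limit algorithm selection for MPI_Allreduce to presets 1-4, 7, and 9-11
# (placeholder values; ranges and single values may be mixed).
export I_MPI_ADJUST_ALLREDUCE_LIST=1-4,7,9-11
echo "$I_MPI_ADJUST_ALLREDUCE_LIST"
```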
I_MPI_COLL_INTRANODE
Syntax
I_MPI_COLL_INTRANODE=<mode>
Arguments
<mode> | Intranode collectives type |
pt2pt | Use only point-to-point communication-based collectives |
shm | Enables shared memory collectives. This is the default value |
Description
Set this environment variable to switch the intranode communication type for collective operations. If there is a large set of communicators, you can switch off SHM collectives to avoid memory overconsumption.
I_MPI_COLL_INTRANODE_SHM_THRESHOLD
Syntax
I_MPI_COLL_INTRANODE_SHM_THRESHOLD=<nbytes>
Arguments
<nbytes> | Define the maximum data block size processed by shared memory collectives |
> 0 | Use the specified size. The default value is 16384 bytes. |
Description
Set this environment variable to define the size of the shared memory area available to each rank for data placement. Messages larger than this value are not processed by SHM-based collective operations and fall back to point-to-point based implementations. The value must be a multiple of 4096.
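Since the value must be a multiple of 4096, a desired threshold can be rounded up before exporting it. A minimal sketch (20000 bytes is an arbitrary example value):

```shell
# Round a desired threshold up to the nearest multiple of 4096
# before exporting it, as required by the variable.
want=20000
nbytes=$(( (want + 4095) / 4096 * 4096 ))
export I_MPI_COLL_INTRANODE_SHM_THRESHOLD=$nbytes
echo "$nbytes"    # prints 20480
```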
I_MPI_COLL_EXTERNAL
Syntax
I_MPI_COLL_EXTERNAL=<arg>
Arguments
<arg> | Description |
enable | yes | on | 1 | Enable the external collective operations functionality using available collectives libraries. |
disable | no | off | 0 | Disable the external collective operations functionality. This is the default value. |
hcoll | Enable the external collective operations functionality using HCOLL library. |
Description
Set this environment variable to enable external collective operations. To reach better performance, run the autotuner after enabling I_MPI_COLL_EXTERNAL; this obtains the optimal collective operation settings.
To force external collective operations usage, use the following I_MPI_ADJUST_<opname> values:

I_MPI_ADJUST_ALLREDUCE=24
I_MPI_ADJUST_BARRIER=11
I_MPI_ADJUST_BCAST=16
I_MPI_ADJUST_REDUCE=13
I_MPI_ADJUST_ALLGATHER=6
I_MPI_ADJUST_ALLTOALL=5
I_MPI_ADJUST_ALLTOALLV=5
I_MPI_ADJUST_SCAN=3
I_MPI_ADJUST_EXSCAN=3
I_MPI_ADJUST_GATHER=5
I_MPI_ADJUST_GATHERV=4
I_MPI_ADJUST_SCATTER=5
I_MPI_ADJUST_SCATTERV=4
I_MPI_ADJUST_ALLGATHERV=5
I_MPI_ADJUST_ALLTOALLW=2
I_MPI_ADJUST_REDUCE_SCATTER=6
I_MPI_ADJUST_REDUCE_SCATTER_BLOCK=4
I_MPI_ADJUST_IALLGATHER=5
I_MPI_ADJUST_IALLGATHERV=5
I_MPI_ADJUST_IGATHERV=3
I_MPI_ADJUST_IALLREDUCE=9
I_MPI_ADJUST_IALLTOALLV=2
I_MPI_ADJUST_IBARRIER=2
I_MPI_ADJUST_IBCAST=5
I_MPI_ADJUST_IREDUCE=4
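Combining the two controls, a job script might enable the external-collectives path and force the corresponding presets. The preset values below are taken from this section; the choice of which operations to force is illustrative:

```shell
# Enable external collective operations and force the external presets
# for a few common operations (values from the list in this section).
export I_MPI_COLL_EXTERNAL=1
export I_MPI_ADJUST_ALLREDUCE=24
export I_MPI_ADJUST_BCAST=16
export I_MPI_ADJUST_BARRIER=11
export I_MPI_ADJUST_REDUCE=13
```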
For more information on HCOLL tuning, refer to NVIDIA* documentation.
I_MPI_COLL_DIRECT
Syntax
I_MPI_COLL_DIRECT=<arg>
Arguments
<arg> | Description |
on | Enable direct collectives. This is the default value. |
off | Disable direct collectives. |
Description
Set this environment variable to control direct collectives usage. Disable direct collectives to eliminate OFI* usage for intra-node communication when the shm:ofi fabric is used.
I_MPI_CBWR
Control reproducibility of floating-point operations results across different platforms, networks, and topologies in case of the same number of processes.
Syntax
I_MPI_CBWR=<arg>
Arguments
<arg> | CBWR compatibility mode | Description |
0 | None | Do not use CBWR in a library-wide mode. CNR-safe communicators may be created with MPI_Comm_dup_with_info explicitly. This is the default value. |
1 | Weak mode | Disable topology-aware collectives. The result of a collective operation does not depend on the rank placement. This mode guarantees result reproducibility across different runs on the same cluster (independent of the rank placement). |
2 | Strict mode | Disable topology-aware collectives and ignore the CPU architecture and interconnect during algorithm selection. This mode guarantees result reproducibility across different runs on different clusters (independent of the rank placement, CPU architecture, and interconnect). |
Description
Conditional Numerical Reproducibility (CNR) provides controls for obtaining reproducible floating-point results on collectives operations. With this feature, Intel MPI collective operations are designed to return the same floating-point results from run to run in case of the same number of MPI ranks.
Control this feature with the I_MPI_CBWR environment variable in a library-wide manner, where all collectives on all communicators are guaranteed to have reproducible results. To control floating-point reproducibility in a more precise, per-communicator way, pass the {"I_MPI_CBWR", "yes"} key-value pair to the MPI_Comm_dup_with_info call.
Setting I_MPI_CBWR in a library-wide mode using the environment variable leads to a performance penalty.
CNR-safe communicators created using MPI_Comm_dup_with_info always work in the strict mode. For example:
MPI_Info hint;
MPI_Comm cbwr_safe_world, cbwr_safe_copy;
MPI_Info_create(&hint);
MPI_Info_set(hint, "I_MPI_CBWR", "yes");
MPI_Comm_dup_with_info(MPI_COMM_WORLD, hint, &cbwr_safe_world);
MPI_Comm_dup(cbwr_safe_world, &cbwr_safe_copy);
In the example above, both cbwr_safe_world and cbwr_safe_copy are CNR-safe. Use cbwr_safe_world and its duplicates to get reproducible results for critical operations.
Note that MPI_COMM_WORLD itself may be used for performance-critical operations without reproducibility limitations.