Command-line Control
You can control all the aspects of the Intel(R) MPI Benchmarks through the command line. The general command-line syntax is the following:
The command line is repeated in the output. The options may appear in any order.
Get out-of-cache data for PingPong:
Run a very large configuration, with the following parameters:
Maximum iterations: 20
Maximum run time per message: 1.5 seconds
Maximum message buffer size: 2 GBytes
Run the P_Read_shared benchmark with the minimum number of processes set to seven:
Run the IMB-MPI1 benchmarks including PingPongAnySource and PingPingAnySource, but excluding the Alltoall and Alltoallv benchmarks. Set the transfer message sizes as 0, 4, 8, 16, 32, 64, 128:
Run the PingPong, PingPing, PingPongAnySource, and PingPingAnySource benchmarks with the transfer message sizes 0, 2^0, 2^1, 2^2, ..., 2^16:
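The runs described above can be sketched as command lines assembled in Python. This is only an illustration: the `mpirun -np <N>` launcher, the process counts, the `imb_command` helper, and the choice of IMB-MPI1 for the large configuration are assumptions, not taken from the text. The assignment of P_Read_shared to the IMB-IO component is likewise assumed. The options themselves are the ones documented in the sections that follow.

```python
import shlex

# Assumption: "mpirun -np <N>" is the launcher on this system, and the
# process counts below are purely illustrative.
def imb_command(nprocs, component, *args):
    """Return a shell-quoted command line for one benchmark run."""
    return shlex.join(["mpirun", "-np", str(nprocs), component, *args])

examples = {
    # Out-of-cache data for PingPong, cache parameters from IMB_mem_info.h
    "off_cache": imb_command(2, "IMB-MPI1", "PingPong", "-off_cache", "-1"),
    # 20 iterations, 1.5 seconds per message size, 2 GB of buffer memory
    "large": imb_command(16, "IMB-MPI1", "-iter", "20", "-time", "1.5", "-mem", "2"),
    # P_Read_shared with at least seven processes
    "npmin": imb_command(14, "IMB-IO", "P_Read_shared", "-npmin", "7"),
    # Add two benchmarks, drop two, message sizes 0, 4, 8, ..., 128
    "include_exclude": imb_command(
        16, "IMB-MPI1", "-msglog", "2:7",
        "-include", "PingPongAnySource", "PingPingAnySource",
        "-exclude", "Alltoall", "Alltoallv"),
    # Message sizes 0, 2^0, ..., 2^16 for four point-to-point benchmarks
    "msglog": imb_command(
        4, "IMB-MPI1", "-msglog", "16",
        "PingPong", "PingPing", "PingPongAnySource", "PingPingAnySource"),
}

for name, command in examples.items():
    print(f"{name}: {command}")
```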
Benchmark Selection Arguments
Benchmark selection arguments are a sequence of blank-separated strings. Each string must match a benchmark name exactly, except that the comparison is case-insensitive.
For example, the string IMB-MPI1 PingPong Allreduce specifies that you want to run PingPong and Allreduce benchmarks only:
By default, all benchmarks of the selected component are run.
-npmin Option
Specifies the minimum number of processes P_min to run all selected benchmarks on. The P_min value after -npmin must be an integer.
Given P_min, the benchmarks run on the processes with the numbers selected as follows:
P_min, 2P_min, 4P_min, ..., largest 2^x P_min < P, P
You may set P_min to 1. If you set P_min > P, Intel MPI Benchmarks interprets this value as P_min = P.
For example, to run the IMB-EXT benchmarks with minimum number of processes set to five, call:
By default, all active processes are selected as described in the Running Intel(R) MPI Benchmarks section.
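The process-count rule for -npmin can be sketched as a small Python function. This is an illustration of the rule as stated above, not the benchmark's own code; the function name `npmin_process_counts` is hypothetical.

```python
def npmin_process_counts(total, p_min):
    """Process counts used for a run on `total` ranks with -npmin `p_min`.

    Follows the stated rule: P_min, 2*P_min, 4*P_min, ... while the value
    stays below P, then P itself. P_min > P is treated as P_min = P.
    """
    if total < 1 or p_min < 1:
        raise ValueError("total and p_min must be positive integers")
    p_min = min(p_min, total)
    counts = []
    n = p_min
    while n < total:
        counts.append(n)
        n *= 2
    counts.append(total)
    return counts

print(npmin_process_counts(16, 5))   # [5, 10, 16]
print(npmin_process_counts(16, 12))  # [12, 16]
```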
-multi Option
Defines whether the benchmark runs in multiple mode. In this mode MPI_COMM_WORLD is split into several groups, which run simultaneously. The argument after -multi is a meta-symbol <outflag> that can take an integer value of 0 or 1:
outflag = 0: display only the maximum timings (minimum throughputs) over all active groups.
outflag = 1: report on all groups separately. The report may be long in this case.
This flag controls only the output style of the benchmark results. The running procedure is the same for both the -multi 0 and -multi 1 options.
When the number of processes running the benchmark is more than half of the overall number of ranks in MPI_COMM_WORLD, the multiple mode benchmark execution coincides with the non-multiple one, as not more than one process group can be created.
For example, if you run this command:
The benchmark will run in non-multiple mode, as the benchmarking starts from 12 processes, which is more than half of MPI_COMM_WORLD.
When a benchmark is set to be run on a set of different numbers of processes, its launch mode is determined based on the number of processes for each run. It is easy to tell if the benchmark is running in multiple mode by looking at the benchmark results header. When the name of the benchmark is printed out with the Multi- prefix, it is a multiple mode run.
For example, in the case of the same Bcast benchmark execution without the -npmin parameter:
the benchmark will be executed 4 times: for 2, 4 and 8 processes in multiple mode, and for 16 processes in standard (non-multiple) mode. The benchmark results headers will look as follows:
For each but the last execution the header contains:
Multi- prefix before the benchmark name
The list of MPI_COMM_WORLD ranks, aggregated in each group
By default, Intel(R) MPI Benchmarks run non-multiple benchmark flavors.
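The way the launch mode depends on the process count can be sketched as follows. The sketch assumes that the number of groups is the integer quotient of the MPI_COMM_WORLD size and the per-run process count, which agrees with the 16-rank Bcast example above; the function name `multi_mode_plan` is hypothetical.

```python
def multi_mode_plan(total, counts):
    """For each process count, return (count, groups, is_multiple_mode).

    Assumption: with -multi, MPI_COMM_WORLD of size `total` is split into
    total // count groups; a run is in multiple mode only if at least two
    groups fit, that is, if count is at most half of `total`.
    """
    plan = []
    for count in counts:
        if not 1 <= count <= total:
            raise ValueError("process count must be between 1 and total")
        groups = total // count
        plan.append((count, groups, groups >= 2))
    return plan

for count, groups, is_multi in multi_mode_plan(16, [2, 4, 8, 16]):
    mode = "multiple" if is_multi else "standard"
    print(f"{count} processes: {groups} group(s), {mode} mode")
```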
-off_cache cache_size[,cache_line_size] Option
Use the -off_cache flag to avoid cache re-use. If you do not use this flag (default), the same communications buffer is used for all repetitions of one message size sample. In this case, Intel(R) MPI Benchmarks reuses the cache, so throughput results might be unrealistic.
The argument after -off_cache can be a single number (cache_size), two comma-separated numbers (cache_size,cache_line_size), or -1:
cache_size is a float for an upper bound of the size of the last level cache, in MB.
cache_line_size is assumed to be the size of a last level cache line (can be an upper estimate).
-1 uses values defined in IMB_mem_info.h. In this case, make sure to define values for cache_size and cache_line_size in IMB_mem_info.h.
The sent/received data is stored in buffers of size ~2x MAX(cache_size, message_size). When repetitively using messages of a particular size, their addresses are advanced within those buffers so that a single message is at least 2 cache lines after the end of the previous message. When these buffers are filled up, they are reused from the beginning.
-off_cache is effective for IMB-MPI1 and IMB-EXT. Avoid using this option for IMB-IO.
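The buffer layout described above can be modeled roughly in Python. This is a simplified sketch under stated assumptions, not the actual implementation: it takes 1 MB as 2^20 bytes, uses a buffer of exactly 2 x MAX(cache_size, message_size), and advances each message by exactly its size plus two cache lines. The function name `off_cache_offsets` is hypothetical.

```python
def off_cache_offsets(msg_size, cache_size_mb, cache_line_size, repetitions):
    """Return (buffer_size, offsets) for successive messages of one size.

    Assumptions: 1 MB = 2**20 bytes, and the gap between the end of one
    message and the start of the next is exactly two cache lines.
    """
    cache_bytes = int(cache_size_mb * 2**20)
    buffer_size = 2 * max(cache_bytes, msg_size)
    stride = msg_size + 2 * cache_line_size
    offsets = []
    offset = 0
    for _ in range(repetitions):
        if offset + msg_size > buffer_size:
            offset = 0  # buffer is filled up: reuse it from the beginning
        offsets.append(offset)
        offset += stride
    return buffer_size, offsets

size, offsets = off_cache_offsets(1024, 2.5, 64, 3)
print(size, offsets)  # 5242880 [0, 1152, 2304]
```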
Use the default values defined in IMB_mem_info.h:
-off_cache -1
2.5 MB last level cache, default line size:
-off_cache 2.5
16 MB last level cache, line size 128:
-off_cache 16,128
The -off_cache mode might also be influenced by internal caching in the Intel(R) MPI Library, if any. This can complicate the interpretation of results.
Default: no cache control.
-iter Option
Use this option to control the number of iterations executed by every benchmark.
By default, the number of iterations is controlled through parameters MSGSPERSAMPLE, OVERALL_VOL, MSGS_NONAGGR, and ITER_POLICY defined in IMB_settings.h.
You can optionally add one or more arguments after the -iter flag, to override the default values defined in IMB_settings.h. Use the following guidelines for the optional arguments:
To override the MSGSPERSAMPLE value, use a single integer.
To override the OVERALL_VOL value, use two comma-separated integers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value.
To override the MSGS_NONAGGR value, use three comma-separated integer numbers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value. The third overrides the MSGS_NONAGGR value.
To override the ITER_POLICY value, enter the policy name (see -iter_policy) after the integer arguments, or right after the -iter flag if you do not use any other arguments.
To define MSGSPERSAMPLE as 2000, and OVERALL_VOL as 100, use the following string:
-iter 2000,100
To define MSGS_NONAGGR as 150, you need to define values for MSGSPERSAMPLE and OVERALL_VOL as shown in the following string:
-iter 1000,40,150
To define MSGSPERSAMPLE as 2000 and set the multiple_np policy, use the following string (see -iter_policy):
-iter 2000,multiple_np
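How the -iter argument maps onto the settings can be sketched with a small parser. This follows the guidelines above and is not the benchmark's own parsing code; the function name `parse_iter_arg` is hypothetical, and the policy names are the ones listed for -iter_policy.

```python
POLICIES = {"dynamic", "multiple_np", "auto", "off"}
SETTINGS = ("MSGSPERSAMPLE", "OVERALL_VOL", "MSGS_NONAGGR")

def parse_iter_arg(arg):
    """Map an -iter argument string to the settings it overrides."""
    overrides = {}
    fields = arg.split(",")
    # An optional policy name comes last, after any integer arguments.
    if fields[-1] in POLICIES:
        overrides["ITER_POLICY"] = fields.pop()
    if len(fields) > len(SETTINGS):
        raise ValueError("at most three integers are accepted")
    # Integers are positional: first MSGSPERSAMPLE, then OVERALL_VOL,
    # then MSGS_NONAGGR.
    for name, field in zip(SETTINGS, fields):
        overrides[name] = int(field)
    return overrides

print(parse_iter_arg("2000,100"))
print(parse_iter_arg("2000,multiple_np"))
```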
-iter_policy Option
Use this option to set a policy for automatic calculation of the number of iterations. Use one of the following arguments to override the default ITER_POLICY value defined in IMB_settings.h:
Policy         Description
dynamic        Reduces the number of iterations when the maximum run time per sample (see -time) is expected to be reached. Using this policy ensures faster execution, but may lead to inaccuracy of the results.
multiple_np    Reduces the number of iterations when the message size is getting bigger. Using this policy ensures the accuracy of the results, but may lead to longer execution time. You can control the execution time through the -time option.
auto           Automatically chooses which policy to use: applies multiple_np to collective operations where one of the processes acts as the root of the operation (for example, MPI_Bcast), and applies dynamic to all other types of operations.
off            The number of iterations does not change during the execution.
You can also set the policy through the -iter option. See -iter.
By default, the ITER_POLICY defined in IMB_settings.h is used.
-time Option
Specifies the number of seconds for the benchmark to run per message size. The argument after -time is a floating-point number.
The combination of this flag with the -iter flag or its default alternative ensures that the Intel(R) MPI Benchmarks always chooses the maximum number of repetitions that conform to all restrictions.
A rough number of repetitions per sample to fulfill the -time request is estimated in preparatory runs that use ~1 second overhead.
Default: -time is activated. The floating-point value specifying the run-time seconds per sample is set in the SECS_PER_SAMPLE variable defined in IMB_settings.h, or IMB_settings_io.h.
-mem Option
Specifies the number of GB to be allocated per process for the message buffers. If the size is exceeded, a warning is returned, stating how much memory is required for the overall run.
The argument after -mem is a floating-point number.
Default: the memory is restricted by MAX_MEM_USAGE defined in IMB_mem_info.h.
-input <File> Option
Use the ASCII input file to select the benchmarks. For example, the IMB_SELECT_EXT file looks as follows:
With the help of this file, the following command runs only Unidir_Get and Accumulate benchmarks of the IMB-EXT component:
-msglen <File> Option
Enter any set of non-negative message lengths to an ASCII file, line by line, and call the Intel(R) MPI Benchmarks with arguments:
-msglen Lengths
The Lengths value overrides the default message lengths. For IMB-IO, the file defines the I/O portion lengths.
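A message-length file of this kind can be generated with a few lines of Python. The helper name `write_msglen_file` and the chosen lengths are illustrative only, and the launcher shown in the comment is an assumption.

```python
from pathlib import Path

def write_msglen_file(path, lengths):
    """Write one non-negative message length per line to an ASCII file."""
    if any(n < 0 for n in lengths):
        raise ValueError("message lengths must be non-negative")
    text = "".join(f"{n}\n" for n in lengths)
    Path(path).write_text(text, encoding="ascii")

# Illustrative lengths; the file name matches the example above.
write_msglen_file("Lengths", [0, 100, 1000, 10000])
# The file can then be passed as: -msglen Lengths
# (for instance after "mpirun -np 2 IMB-MPI1 PingPong", launcher assumed)
```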
-map PxQ Option
Use this option to re-number the ranks for parallel processes in MPI_COMM_WORLD along rows of the matrix:
0      P       …    (Q-2)P      (Q-1)P
1
…
P-1    2P-1    …    (Q-1)P-1    QP-1
For example, to run Multi-PingPong between two nodes, P processes on each (ppn=P), with each process on one node communicating with its counterpart on the other, call:
The P*Q product must not be less than the total number of ranks; otherwise, a command-line parsing error is issued. The P=1 and Q=1 cases are meaningless and are ignored.
See the examples below for a more detailed explanation of the -map option.
Example 1. PingPong benchmark with a 4x2 map, 8 ranks in total on 2 nodes.
a) -map 4x2 combined with -multi <outflag>, multiple mode:
The MPI_COMM_WORLD communicator originally consists of 8 ranks:
{ 0, 1, 2, 3, 4, 5, 6, 7 }
The given option -map 4x2 reorders this set of ranks into the following set (in terms of MPI_COMM_WORLD ranks):
{ 0, 4, 1, 5, 2, 6, 3, 7 }
The -multi <outflag> option makes Intel(R) MPI Benchmarks split the communicator into 4 subgroups, 2 ranks in each, with an MPI_Comm_split call. As a result, the communicator looks like this:
{ { 0, 1 }, { 0, 1 }, { 0, 1 }, { 0, 1 } }
In terms of the original MPI_COMM_WORLD rank numbers, this means that there are 4 groups of ranks, and the benchmark is executed simultaneously for each:
Group 1: { 0, 4 }; Group 2: { 1, 5 }; Group 3: { 2, 6 }; Group 4: { 3, 7 }
This grouping is shown in the benchmark output header and can be easily verified:
As can be seen in the output, ranks in the pairs belong to different nodes, so this benchmark execution will measure inter-node communication parameters.
b) -map 4x2 without -multi <outflag>, non-multiple mode:
The same rank renumbering rules are applied in this case, but since the multiple mode is not set, communicator splitting is not performed. Only two ranks will participate in actual communication, as the PingPong benchmark covers a pair of ranks only. The benchmark will cover only the first group:
Group: { 0, 4 }
and the other ranks from MPI_COMM_WORLD will be idle. This is reflected in the benchmark results output:
Example 2. Biband benchmark with the 2x4 map, 8 ranks in total on 2 nodes.
a) -map 2x4 combined with -multi <outflag>, multiple mode:
The MPI_COMM_WORLD communicator originally consists of 8 ranks:
{ 0, 1, 2, 3, 4, 5, 6, 7 }
The given option -map 2x4 reorders this set of ranks into the following set (in terms of MPI_COMM_WORLD ranks):
{ 0, 2, 4, 6, 1, 3, 5, 7 }
The communicator splitting, which is required by the -multi <outflag> option, then depends on the number of processes to be used for execution. In this case, 2-process, 4-process and 8-process run cycles will be executed:
1) NP=2: Reordered communicator is split into 4 groups of 2 processes because of the multiple mode:
{ { 0, 1}, { 0, 1}, { 0, 1 }, { 0, 1 } }
In terms of MPI_COMM_WORLD ranks, the groups are:
Group 1: { 0, 2 }; Group 2: { 1, 3 }; Group 3: { 4, 6 }; Group 4: { 5, 7 }
All the pairs belong to a single node here, so no cross-node benchmarking is performed in this case.
2) NP=4: Reordered communicator is split into 2 groups of 4 processes because of the multiple mode:
{ { 0, 1, 2, 3 }, { 0, 1, 2, 3 } }
In terms of MPI_COMM_WORLD ranks, the groups are:
Group 1: { 0, 2, 4, 6 }; Group 2: { 1, 3, 5, 7 };
Execution groups mix ranks from different nodes in this case, and due to the Biband benchmark pairs ordering rules (see Biband), only inter-node pairs will be tested.
3) NP=8: No communicator splitting can be performed, since the ranks can fit only a single group:
Group: { 0, 2, 4, 6, 1, 3, 5, 7 }
The group is spread evenly across the 2 execution nodes, but as a result of reordering, all the pairs in the Biband test (see Biband) are intra-node ones, which is the opposite of the default case (no -map option) and the NP=4 case.
b) -map 2x4 without the -multi <outflag> option, non-multiple mode:
The same rank renumbering rules are applied in this case, but since the multiple mode is not set, no communicator splitting is performed. The set of ranks that are covered by the benchmark depends on the number of processes to be used for execution. In this case, 2-process, 4-process and 8-process run cycles will be executed, and they use the first 2, 4 and 8 ranks of the reordered communicator for actual benchmark execution:
1) NP=2: first 2 ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2 };
2) NP=4: first 4 ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2, 4, 6 };
3) NP=8: all the ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2, 4, 6, 1, 3, 5, 7 }
As can be seen in the output, the NP=2 and NP=4 executions of the Biband test launched with and without the -multi <outflag> option are almost the same. The only difference is that in the non-multiple mode only one group is active, and all other processes are idle. For the NP=8 case, the Biband benchmark executions performed with and without the -multi <outflag> option are completely identical.
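The reordering and grouping used in both examples can be reproduced with a short Python sketch. It is inferred from the rank sets listed above rather than from the benchmark sources; the function names are hypothetical, the handling of a P*Q product larger than the number of ranks is an assumption, and the order in which groups are returned is not meant to match any particular listing.

```python
def map_order(p, q, n_ranks):
    """MPI_COMM_WORLD ranks in the order produced by -map PxQ."""
    if p * q < n_ranks:
        raise ValueError("P*Q must not be less than the number of ranks")
    # Walk the P-by-Q matrix row by row; the entry in row r, column c is r + c*P.
    order = [row + col * p for row in range(p) for col in range(q)]
    # Assumption: matrix entries beyond the last rank are simply skipped.
    return [rank for rank in order if rank < n_ranks]

def split_groups(order, np_run, multi):
    """Groups (in MPI_COMM_WORLD ranks) that run the benchmark on np_run processes."""
    n_groups = len(order) // np_run
    if not multi or n_groups < 2:
        return [order[:np_run]]  # one active group, remaining ranks idle
    return [order[g * np_run:(g + 1) * np_run] for g in range(n_groups)]

order_4x2 = map_order(4, 2, 8)
print(order_4x2)                               # [0, 4, 1, 5, 2, 6, 3, 7]
print(split_groups(order_4x2, 2, multi=True))  # [[0, 4], [1, 5], [2, 6], [3, 7]]

order_2x4 = map_order(2, 4, 8)
print(split_groups(order_2x4, 4, multi=True))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```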
-include [[benchmark1] benchmark2 …]
Specifies the list of additional benchmarks to run. For example, to add PingPongAnySource and PingPingAnySource benchmarks, call:
-exclude [[benchmark1] benchmark2 …]
Specifies the list of benchmarks to be excluded from the run. For example, to exclude Alltoall and Allgather, call:
-msglog [<minlog>:]<maxlog>
This option allows you to control the lengths of the transfer messages. This setting overrides the MINMSGLOG and MAXMSGLOG values. The new message sizes are 0, 2^minlog, ..., 2^maxlog.
For example, if you run the following command line:
Intel(R) MPI Benchmarks selects the lengths 0, 8, 16, 32, 64, 128, as shown below:
Alternatively, you can specify only the maxlog value, enter:
In this case Intel(R) MPI Benchmarks selects the lengths 0,1,2,4,8:
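The mapping from a -msglog argument to message lengths can be sketched as follows. The function name `msglog_lengths` is hypothetical; a missing minlog is taken as 0, which matches the maxlog-only example above.

```python
def msglog_lengths(arg):
    """Message lengths selected by -msglog [<minlog>:]<maxlog>."""
    minlog_text, sep, maxlog_text = arg.rpartition(":")
    minlog = int(minlog_text) if sep else 0  # minlog omitted: start at 2^0
    maxlog = int(maxlog_text)
    if minlog < 0 or minlog > maxlog:
        raise ValueError("expected 0 <= minlog <= maxlog")
    # Length 0 is always included, followed by the powers of two.
    return [0] + [2 ** k for k in range(minlog, maxlog + 1)]

print(msglog_lengths("3:7"))  # [0, 8, 16, 32, 64, 128]
print(msglog_lengths("3"))    # [0, 1, 2, 4, 8]
```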
-thread_level Option
This option specifies the desired thread level for MPI_Init_thread(). See description of MPI_Init_thread() for details. The option is available only if the Intel(R) MPI Benchmarks is built with the USE_MPI_INIT_THREAD macro defined. Possible values for <level> are single, funneled, serialized, and multiple.
-sync Option
This option is relevant only for benchmarks measuring collective operations. It controls whether all ranks are synchronized after every iteration step by means of the MPI_Barrier operation. The -sync option can take the following arguments:
Argument                  Description
0 | off | disable | no    Disables process synchronization at each iteration step.
1 | on | enable | yes     Enables process synchronization at each iteration step. This is the default value.
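The same set of on/off spellings is accepted by several options (-sync, -imb_barrier, -root_shift, -zero_size). A minimal sketch of how such an argument could be interpreted is shown below; the function name `parse_switch` is hypothetical, and exact, case-sensitive matching is an assumption.

```python
ON_VALUES = {"1", "on", "enable", "yes"}
OFF_VALUES = {"0", "off", "disable", "no"}

def parse_switch(value):
    """Interpret an on/off option argument as a boolean.

    Assumption: spellings are matched exactly as listed in the table.
    """
    if value in ON_VALUES:
        return True
    if value in OFF_VALUES:
        return False
    raise ValueError(f"unrecognized switch value: {value!r}")

print(parse_switch("enable"))  # True
print(parse_switch("0"))       # False
```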
-imb_barrier Option
Implementation of the MPI_Barrier operation may vary depending on the MPI implementation. Each MPI implementation might use a different algorithm for the barrier, with possibly different synchronization characteristics, so the Intel(R) MPI Benchmarks results may vary significantly as a result of MPI_Barrier implementation differences. The internal, MPI-independent barrier function IMB_barrier is provided to make the synchronization effect more reproducible.
Use this option to use the IMB_barrier function to get consistent results of collective operation benchmarks.
Argument                  Description
0 | off | disable | no    Use the standard MPI_Barrier operation. This is the default value.
1 | on | enable | yes     Use the internal barrier implementation for synchronization.
-root_shift Option
This option is relevant only for benchmarks measuring collective operations that utilize the root concept (for example, MPI_Bcast, MPI_Reduce, MPI_Gather, and so on). It defines whether the root is changed at every iteration step. The -root_shift option can take the following arguments:
Argument                  Description
0 | off | disable | no    Disables root change at each iteration step. Rank 0 acts as a root at each iteration step. This is the default value.
1 | on | enable | yes     Enables root change at each iteration step. Root rank is changed in a round-robin fashion.
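The round-robin root selection can be illustrated with a one-line rule. This sketch assumes that the rotation starts at rank 0 and advances by one rank per iteration, which the text does not state explicitly; the function name `root_for_iteration` is hypothetical.

```python
def root_for_iteration(iteration, n_procs, root_shift):
    """Root rank used at a given iteration step.

    Assumption: with root shifting enabled, the root starts at rank 0 and
    moves to the next rank at every iteration, wrapping around.
    """
    if not root_shift:
        return 0  # rank 0 is the root at each iteration step
    return iteration % n_procs

print([root_for_iteration(i, 4, True) for i in range(6)])   # [0, 1, 2, 3, 0, 1]
print([root_for_iteration(i, 4, False) for i in range(6)])  # [0, 0, 0, 0, 0, 0]
```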
-data_type Option
Specifies the data type to be used. The -data_type option accepts one of the following arguments: byte, char, int, float, float16, or bfloat16. The default value is byte.
The option is available for MPI-1 only.
-red_data_type Option
Specifies the type of reduction to be used. The -red_data_type option accepts one of the following arguments: char, int, float, float16, or bfloat16. The default value is float.
The option is available for MPI-1 only.
-contig_type Option
Specifies the predefined type to be used.
Argument      Description
base          A simple MPI type (for example, MPI_INT, MPI_CHAR). This is the default value.
base_vec      A vector of base
resize        A simple MPI type with an extent (type) = 2*size (type)
resize_vec    A vector of resize
The option is available for MPI-1 only.
-zero_size Option
Do not run benchmarks with the message size 0.
Argument                  Description
0 | off | disable | no    Allows running benchmarks with the zero message size.
1 | on | enable | yes     Does not allow running benchmarks with the zero message size. This is the default value.
The option is available for MPI-1 only.
-mem_alloc_type Option
Argument    Description
device      Allocates device memory. This is the default value.
host        Allocates host memory registered on GPU device.
shared      Allocates shared memory.
cpu         Allocates host memory.
The option is available for MPI-1 with GPU support only.