Actual Benchmarking

Intel® MPI Benchmarks User Guide

ID 766171

Date 3/26/2021

To reduce measurement errors caused by insufficient clock resolution, every benchmark is run repeatedly. The repetition count is determined as follows:

For IMB-MPI1, IMB-NBC, and the aggregate flavors of the IMB-EXT, IMB-IO, and IMB-RMA benchmarks, the repetition count is MSGSPERSAMPLE. This constant is defined in IMB_settings.h and IMB_settings_io.h, with the values 1000 and 50, respectively.

To avoid excessive run times for large transfer sizes X, the repetition count is also bounded above by OVERALL_VOL/X. The OVERALL_VOL constant is defined in IMB_settings.h and IMB_settings_io.h, with the values 4 MB and 16 MB, respectively.

Given transfer size X, the repetition count for all aggregate benchmarks is defined as follows:

n_sample = MSGSPERSAMPLE (X=0)

n_sample = max(1,min(MSGSPERSAMPLE,OVERALL_VOL/X)) (X>0)

The repetition count for non-aggregate benchmarks is defined analogously, with MSGSPERSAMPLE replaced by MSGS_NONAGGR. A smaller repetition count is recommended here, because non-aggregate run times are usually much longer.

In the following examples, an elementary transfer means a single call to a pure communication or I/O function (MPI_[Send, ...], MPI_Put, MPI_Get, MPI_Accumulate, MPI_File_write_XX, MPI_File_read_XX), without any further function call. The call that assures transfer completion is:

IMB-EXT benchmarks: MPI_Win_fence
IMB-IO Write benchmarks: a triplet MPI_File_sync/MPI_Barrier(file_communicator)/MPI_File_sync
IMB-RMA benchmarks: MPI_Win_flush, MPI_Win_flush_all, MPI_Win_flush_local, or MPI_Win_flush_local_all
Other benchmarks: empty

MPI-1 Benchmarks

for (i = 0; i < N_BARR; i++)
    MPI_Barrier(MY_COMM);
time = MPI_Wtime();
for (i = 0; i < n_sample; i++) {
    /* execute MPI pattern */
}
time = (MPI_Wtime() - time) / n_sample;

IMB-EXT and Blocking I/O Benchmarks

For aggregate benchmarks, the kernel loop issues all n_sample elementary transfers and assures their completion only once, after the loop:

for (i = 0; i < N_BARR; i++)
    MPI_Barrier(MY_COMM);
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime();
for (i = 0; i < n_sample; i++) {
    /* execute elementary transfer */
}
/* assure completion of all transfers */
time = (MPI_Wtime() - time) / n_sample;

For non-aggregate benchmarks, every transfer is completed before going on to the next transfer:

for (i = 0; i < N_BARR; i++)
    MPI_Barrier(MY_COMM);
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime();
for (i = 0; i < n_sample; i++) {
    /* execute elementary transfer */
    /* assure completion of transfer */
}
time = (MPI_Wtime() - time) / n_sample;

Non-blocking I/O Benchmarks

A nonblocking benchmark has to provide three timings:

t_pure - blocking pure I/O time
t_ovrl - nonblocking I/O time concurrent with CPU activity
t_CPU - pure CPU activity time

The actual benchmark consists of the following stages:

Calling the equivalent blocking benchmark, as defined in Actual Benchmarking, and taking the benchmark time as t_pure.
Closing and re-opening the related file(s).
Re-synchronizing the processes.
Running the nonblocking case concurrently with CPU activity (which takes t_CPU when running undisturbed), and taking the effective time as t_ovrl.

You can set the desired CPU time t_CPU in IMB_settings_io.h:

#define TARGET_CPU_SECS 0.1 /* unit seconds */
