Actual Benchmarking
To reduce measurement errors caused by insufficient clock resolution, every benchmark is run repeatedly. The repetition count is as follows:
For IMB-MPI1, IMB-NBC, and the aggregate flavors of the IMB-EXT, IMB-IO, and IMB-RMA benchmarks, the repetition count is MSGSPERSAMPLE. This constant is defined in IMB_settings.h and IMB_settings_io.h, with values of 1000 and 50, respectively.
To avoid excessive run times for large transfer sizes X, the repetition count is bounded above by OVERALL_VOL/X. The OVERALL_VOL value is defined in IMB_settings.h and IMB_settings_io.h, with values of 4 MB and 16 MB, respectively.
Given transfer size X, the repetition count for all aggregate benchmarks is defined as follows:
n_sample = MSGSPERSAMPLE (X=0)
n_sample = max(1,min(MSGSPERSAMPLE,OVERALL_VOL/X)) (X>0)
The repetition count for non-aggregate benchmarks is defined in the same way, with MSGSPERSAMPLE replaced by MSGS_NONAGGR. Because non-aggregate runs usually take much longer, it is recommended to use a smaller repetition count for them.
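The rule above can be written as a small C helper. This is a minimal sketch, not the IMB source: the function name n_sample and the EX_ macros are hypothetical, the macro values are the IMB_settings.h values quoted above, 4 MB is assumed to mean 4 * 2^20 bytes, and OVERALL_VOL/X is assumed to be an integer division.

```c
#include <assert.h>

/* Illustrative stand-ins for the IMB_settings.h constants quoted in the text.
   The EX_ prefix marks them as hypothetical; the real header is not included. */
#define EX_MSGSPERSAMPLE 1000L
#define EX_OVERALL_VOL   (4L * 1024L * 1024L)  /* assumption: 4 MB = 4 * 2^20 bytes */

/* Repetition count for a transfer size of x bytes.
   msgs is MSGSPERSAMPLE for aggregate runs or MSGS_NONAGGR for non-aggregate runs. */
static long n_sample(long x, long msgs, long overall_vol)
{
    assert(x >= 0 && msgs >= 1 && overall_vol >= 0);

    if (x == 0)
        return msgs;                     /* X = 0: no volume bound applies */

    long bound = overall_vol / x;        /* assumption: integer division */
    long n = msgs < bound ? msgs : bound;
    return n > 1 ? n : 1;                /* always run at least once */
}
```

With these values, n_sample(8192, EX_MSGSPERSAMPLE, EX_OVERALL_VOL) gives 512, and any X of 4 MB or more gives 1.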
In the following examples, elementary transfer means a pure function call (MPI_[Send, ...], MPI_Put, MPI_Get, MPI_Accumulate, MPI_File_write_XX, MPI_File_read_XX), without any further function call. Assured completion of transfers means:
IMB-EXT benchmarks: MPI_Win_fence
IMB-IO Write benchmarks: a triplet MPI_File_sync/MPI_Barrier(file_communicator)/MPI_File_sync
IMB-RMA benchmarks: MPI_Win_flush, MPI_Win_flush_all, MPI_Win_flush_local, or MPI_Win_flush_local_all
Other benchmarks: empty
MPI-1 Benchmarks
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
    execute MPI pattern
time = (MPI_Wtime()-time)/n_sample
IMB-EXT and Blocking I/O Benchmarks
For aggregate benchmarks, the kernel loop looks as follows:
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
    execute elementary transfer
assure completion of all transfers
time = (MPI_Wtime()-time)/n_sample
For non-aggregate benchmarks, every transfer is completed before going on to the next transfer:
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    execute elementary transfer
    assure completion of transfer
}
time = (MPI_Wtime()-time)/n_sample
Non-blocking I/O Benchmarks
A nonblocking benchmark has to provide three timings:
t_pure - blocking pure I/O time
t_ovrl - nonblocking I/O time concurrent with CPU activity
t_CPU - pure CPU activity time
The actual benchmark consists of the following stages:
Calling the equivalent blocking benchmark, as defined in Actual Benchmarking, and taking the benchmark time as t_pure.
Closing and re-opening the related file(s).
Re-synchronizing the processes.
Running the nonblocking case concurrently with CPU activity (which takes t_CPU when running undisturbed) and taking the effective time as t_ovrl.
You can set the desired CPU time t_CPU in IMB_settings_io.h:
#define TARGET_CPU_SECS 0.1 /* unit seconds */
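The three timings t_pure, t_ovrl, and t_CPU can be combined into a single overlap figure. The sketch below is an assumption for illustration, not a formula stated in this section: it relies on the reasoning that no overlap means t_ovrl = t_pure + t_CPU, while perfect overlap means t_ovrl = max(t_pure, t_CPU), and normalizes between the two. The function name overlap_percent is hypothetical.

```c
#include <assert.h>

/* Overlap estimate in percent from the three timings, all in seconds.
   Assumed normalization: 0 when t_ovrl = t_pure + t_cpu (no overlap),
   100 when t_ovrl = max(t_pure, t_cpu) (perfect overlap). */
static double overlap_percent(double t_pure, double t_ovrl, double t_cpu)
{
    assert(t_pure > 0.0 && t_cpu > 0.0 && t_ovrl >= 0.0);

    /* Time saved by overlapping, relative to the most that could be saved. */
    double max_saving = t_pure < t_cpu ? t_pure : t_cpu;
    double ratio = (t_pure + t_cpu - t_ovrl) / max_saving;

    /* Clamp to [0, 1] so measurement noise cannot yield values outside 0..100. */
    if (ratio < 0.0) ratio = 0.0;
    if (ratio > 1.0) ratio = 1.0;
    return 100.0 * ratio;
}
```

For example, with t_pure = 0.5, t_CPU = 0.25, and t_ovrl = 0.625, half of the possible saving is achieved and the function returns 50.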