Managing Performance of the Cluster Fourier Transform Functions
Performance of Intel® oneAPI Math Kernel Library (oneMKL) Cluster FFT (CFFT) in different applications depends mainly on the cluster configuration, the performance of message-passing interface (MPI) communications, and the configuration of the run. Note that MPI communications usually take approximately 70% of the overall CFFT compute time. For more flexible control over the time-consuming aspects of CFFT algorithms, oneMKL provides the MKL_CDFT environment variable to set special values that affect CFFT performance. To improve the performance of an application that calls CFFT intensively, you can use this environment variable to set optimal values for your cluster, application, MPI, and so on.
The MKL_CDFT environment variable has the following syntax, explained in the table below:
MKL_CDFT=option1[=value1],option2[=value2],…,optionN[=valueN]
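As a minimal sketch of this syntax, the value can be assembled from a list of options and joined with commas. The particular options chosen here are only an illustration, not a recommended configuration; the code avoids bash-only features so that it also runs under a POSIX sh.

```shell
#!/bin/sh
# Illustrative choice of options (an assumption, not a recommendation).
# Each entry is either option=value or a bare flag without a value.
set -- "wo_omatcopy=1" "alltoallv=4" "enable_soi"

# Join the entries with commas, as the MKL_CDFT syntax requires.
# Inside the subshell, "$*" joins the entries with the first character of IFS.
MKL_CDFT=$(IFS=,; echo "$*")
export MKL_CDFT

echo "MKL_CDFT=${MKL_CDFT}"
```

Keeping one option per entry makes it easy to comment a single option out while experimenting.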
While this table explains the settings that usually improve performance under certain conditions, the actual performance depends heavily on the configuration of your cluster. Therefore, experiment with the listed values to speed up your computations.
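One way to run such an experiment is a small sweep that launches the same command once per setting and reports the elapsed time. The sketch below is an assumption about how you might organize it: the function names `cdft_settings` and `cdft_sweep` are hypothetical helpers, and the timing uses `date +%s`, which only resolves whole seconds, so it suits runs that take at least several seconds.

```shell
#!/bin/sh
# Hypothetical helper: print every combination of the documented
# alltoallv values (0, 1, 4) and wo_omatcopy values (-1, 0, 1).
cdft_settings() {
    for cdft_a in 0 1 4; do
        for cdft_w in -1 0 1; do
            echo "alltoallv=${cdft_a},wo_omatcopy=${cdft_w}"
        done
    done
}

# Hypothetical helper: run the given launch command once per setting,
# with MKL_CDFT set for that command only, and print elapsed seconds.
cdft_sweep() {
    for cdft_setting in $(cdft_settings); do
        cdft_start=$(date +%s)
        MKL_CDFT="$cdft_setting" "$@"
        cdft_end=$(date +%s)
        echo "${cdft_setting} $((cdft_end - cdft_start))s"
    done
}
```

With these functions defined, a call such as `cdft_sweep mpirun -ppn 2 -n 16 ./mkl_cdft_app` would run the application nine times. Whether the variable reaches every MPI rank depends on your MPI launcher, so check its documentation for environment propagation.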
| Option | Possible Values | Description |
|---|---|---|
| alltoallv | 0 (default) | Configures CFFT to use the standard MPI_Alltoallv function to perform global transpositions. |
| | 1 | Configures CFFT to use a series of calls to MPI_Isend and MPI_Irecv instead of the MPI_Alltoallv function. |
| | 4 | Configures CFFT to merge global transposition with data movements in the local memory. CFFT performs global transpositions by calling MPI_Isend and MPI_Irecv in this case. Use this value in a hybrid case (MPI + OpenMP), especially when the number of processes per node equals one. |
| wo_omatcopy | 0 | Configures CFFT to perform local FFT and local transpositions separately. CFFT usually performs faster with this value than with wo_omatcopy = 1 if the configuration parameter DFTI_TRANSPOSE has the value of DFTI_ALLOW. See the Intel® oneAPI Math Kernel Library (oneMKL) Developer Reference for details. |
| | 1 | Configures CFFT to merge local FFT calls with local transpositions. CFFT usually performs faster with this value than with wo_omatcopy = 0 if DFTI_TRANSPOSE has the value of DFTI_NONE. |
| | -1 (default) | Enables CFFT to decide which of the two above values to use depending on the value of DFTI_TRANSPOSE. |
| enable_soi | Not applicable | A flag that enables low-communication Segment Of Interest FFT (SOI FFT) algorithm for one-dimensional complex-to-complex CFFT, which requires fewer MPI communications than the standard nine-step (or six-step) algorithm. CAUTION: While using fewer MPI communications, the SOI FFT algorithm incurs a minor loss of precision (about one decimal digit). |
The following example illustrates usage of the environment variable assuming the bash shell:
export MKL_CDFT=wo_omatcopy=1,alltoallv=4,enable_soi
mpirun -ppn 2 -n 16 ./mkl_cdft_app
Product and Performance Information |
---|
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Notice revision #20201201 |