Developer Reference for Intel® oneAPI Math Kernel Library for Fortran

ID 766686
Date 11/07/2023

Distributing Data Among Processes

The Intel® oneAPI Math Kernel Library (oneMKL) cluster FFT functions store all input and output multi-dimensional arrays (matrices) in one-dimensional arrays (vectors). The arrays are stored in column-major order. For example, a two-dimensional matrix A of size (m,n) is stored in a vector B of size m*n so that

B((j-1)*m+i) = A(i,j), i=1, ..., m, j=1, ..., n.

NOTE:

The order of FFT dimensions is the same as the order of array dimensions in the programming language. For example, a 3-dimensional FFT with Lengths=(m,n,l) can be computed over an array AR(m,n,l).
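
The following small sketch (illustration only, standard Fortran with no library calls) verifies the column-major mapping given above for a 3-by-4 matrix:

  ! Illustration only: verifies the mapping B((j-1)*m+i) = A(i,j) for a small
  ! matrix stored in column-major order.
  program column_major_mapping
    implicit none
    integer, parameter :: m = 3, n = 4
    real :: A(m, n), B(m*n)
    integer :: i, j

    ! Fill A with recognizable values.
    do j = 1, n
      do i = 1, m
        A(i, j) = 10.0*i + j
      end do
    end do

    ! Store the matrix in a one-dimensional array in column-major order.
    B = reshape(A, (/ m*n /))

    ! Check the mapping used by the cluster FFT functions.
    do j = 1, n
      do i = 1, m
        if (B((j-1)*m + i) /= A(i, j)) stop 'mapping mismatch'
      end do
    end do
    print *, 'B((j-1)*m+i) = A(i,j) holds for all i, j'
  end program column_major_mapping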

All MPI processes involved in a cluster FFT computation operate on their own portions of the data. These local arrays make up the virtual global array that the fast Fourier transform is applied to. It is your responsibility to properly allocate the local arrays (if needed), fill them with initial data, and either gather the resulting data into an actual global array or process the resulting data differently. To do this, see the sections below on how the virtual global array is composed of the local ones.

Multi-dimensional transforms

If the dimension of a transform is greater than one, the cluster FFT function library splits the data in the dimension whose index changes most slowly, so that each part contains all elements with several consecutive values of this index. This is the first dimension in C and the last dimension in Fortran. If the global array is two-dimensional, each process therefore receives several consecutive rows (in C) or columns (in Fortran). The term "rows" will be used for these parts regardless of the array dimension and programming language. Local arrays are placed in memory allocated for the virtual global array consecutively, in the order determined by process ranks. For example, in the case of two processes, during the computation of a three-dimensional transform whose matrix has size (12,15,11), the processes may store local arrays of sizes (12,15,6) and (12,15,5), respectively.

If p is the number of MPI processes and the matrix of a transform to be computed has size (m,n,l), each MPI process works with a local data array of size (m, n, l_q), where Σl_q=l, q=0, ..., p-1. Local input arrays must contain the appropriate parts of the actual global input array, and then the local output arrays will contain the appropriate parts of the actual global output array. You can determine which particular rows of the global array the local array must contain from the following configuration parameters of the cluster FFT interface: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, and CDFT_LOCAL_SIZE. To retrieve the values of these parameters, use the DftiGetValueDM function:

  • CDFT_LOCAL_NX specifies how many rows of the global array the current process receives.
  • CDFT_LOCAL_START_X specifies which row of the global input or output array corresponds to the first row of the local input or output array. If A is a global array and L is the appropriate local array, then

    L(i,j,k) = A(i,j,k+CDFT_LOCAL_START_X-1), where i=1, ..., m, j=1, ..., n, k=1, ..., l_q.

Example "2D Out-of-place Cluster FFT Computation" shows how the data is distributed among processes for a two-dimensional cluster FFT computation.

One-dimensional transforms

In this case, input and output data are distributed among processes differently, and even the numbers of elements stored in a particular process before and after the transform may differ. Each local array stores a segment of consecutive elements of the appropriate global array. Such a segment is determined by the number of elements and a shift with respect to the first array element. So, to specify the segments of the global input and output arrays that a particular process receives, four configuration parameters are needed: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, CDFT_LOCAL_OUT_NX, and CDFT_LOCAL_OUT_START_X. Use the DftiGetValueDM function to retrieve their values. The meaning of the four configuration parameters depends on the type of the transform, as shown in Table "Data Distribution Configuration Parameters for 1D Transforms":

Data Distribution Configuration Parameters for 1D Transforms

Meaning of the Parameter              Forward Transform        Backward Transform
Number of elements in input array     CDFT_LOCAL_NX            CDFT_LOCAL_OUT_NX
Elements shift in input array         CDFT_LOCAL_START_X       CDFT_LOCAL_OUT_START_X
Number of elements in output array    CDFT_LOCAL_OUT_NX        CDFT_LOCAL_NX
Elements shift in output array        CDFT_LOCAL_OUT_START_X   CDFT_LOCAL_START_X
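
A similar hedged sketch for a one-dimensional transform (same assumptions as above: the MKL_CDFT module, default integer kinds, no error checks) retrieves all four parameters; for a forward transform the first pair describes the local segment of the input array and the second pair the local segment of the output array, with the roles swapped for a backward transform, as the table shows:

  ! Minimal sketch for a one-dimensional transform; assumes the MKL_CDFT module,
  ! an MPI environment, and default integer kinds; error checking is omitted.
  program query_local_layout_1d
    use MKL_CDFT
    implicit none
    include 'mpif.h'

    type(DFTI_DESCRIPTOR_DM), pointer :: desc
    integer :: status, ierr
    integer :: length
    integer :: nx, start_x, out_nx, out_start_x

    call MPI_INIT(ierr)

    length = 1024    ! 1D transform length (must satisfy the restriction described below)
    status = DftiCreateDescriptorDM(MPI_COMM_WORLD, desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, length)

    ! For a forward transform, (nx, start_x) describe the local segment of the input
    ! array and (out_nx, out_start_x) the local segment of the output array; for a
    ! backward transform the two pairs swap roles (see the table above).
    status = DftiGetValueDM(desc, CDFT_LOCAL_NX, nx)
    status = DftiGetValueDM(desc, CDFT_LOCAL_START_X, start_x)
    status = DftiGetValueDM(desc, CDFT_LOCAL_OUT_NX, out_nx)
    status = DftiGetValueDM(desc, CDFT_LOCAL_OUT_START_X, out_start_x)

    print *, 'input segment: ', nx, ' elements starting at', start_x
    print *, 'output segment:', out_nx, ' elements starting at', out_start_x

    status = DftiFreeDescriptorDM(desc)
    call MPI_FINALIZE(ierr)
  end program query_local_layout_1d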

Memory size for local data

The memory size needed for the local arrays cannot be calculated simply from CDFT_LOCAL_NX (CDFT_LOCAL_OUT_NX), because the cluster FFT functions sometimes require a little more memory for local data than just the size of the appropriate sub-array. The configuration parameter CDFT_LOCAL_SIZE specifies the necessary size of the local input and output arrays in data elements. Each local input and output array must occupy at least CDFT_LOCAL_SIZE*size_of_element bytes. Note that in the current implementation of the cluster FFT interface, data elements can be real or complex values, each complex value consisting of a real and an imaginary part. If you employ a user-defined workspace for in-place transforms (for more information, refer to Table "Settable Configuration Parameters"), it must have the same size as the local arrays. Example "1D In-place Cluster FFT Computations" illustrates how the cluster FFT functions distribute data among processes in the case of a one-dimensional FFT computation performed with a user-defined workspace.
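
For illustration, the following sketch (again assuming the MKL_CDFT module and default integer kinds; the transform lengths are arbitrary) queries CDFT_LOCAL_SIZE, allocates the local array and a user-defined workspace of that size, and performs an in-place forward transform:

  ! Minimal sketch: allocate the local array and a user-defined workspace of
  ! CDFT_LOCAL_SIZE elements each, then compute an in-place forward transform.
  ! Assumes the MKL_CDFT module and an MPI environment; error checking is omitted.
  program allocate_local_and_workspace
    use MKL_CDFT
    implicit none
    include 'mpif.h'

    type(DFTI_DESCRIPTOR_DM), pointer :: desc
    integer :: status, ierr
    integer :: lengths(2)
    integer :: local_size
    complex(8), allocatable :: local(:), work(:)

    call MPI_INIT(ierr)

    lengths = (/ 128, 256 /)
    status = DftiCreateDescriptorDM(MPI_COMM_WORLD, desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, lengths)

    ! CDFT_LOCAL_SIZE may exceed the number of elements in the local sub-array,
    ! so size both the local array and the workspace from it.
    status = DftiGetValueDM(desc, CDFT_LOCAL_SIZE, local_size)
    allocate(local(local_size), work(local_size))

    status = DftiSetValueDM(desc, CDFT_WORKSPACE, work)   ! user-defined workspace
    status = DftiCommitDescriptorDM(desc)

    ! ... fill local(:) with this process's rows of the global input array ...

    status = DftiComputeForwardDM(desc, local)            ! in-place forward FFT

    status = DftiFreeDescriptorDM(desc)
    deallocate(local, work)
    call MPI_FINALIZE(ierr)
  end program allocate_local_and_workspace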

Available Auxiliary Functions

If a global input array is located on one MPI process and you want to obtain its local parts, or you want to gather the global output array on one MPI process, you can use the functions MKL_CDFT_ScatterData and MKL_CDFT_GatherData to distribute or gather data among processes, respectively. These functions are defined in a source file delivered with Intel® oneAPI Math Kernel Library (oneMKL), located in the following subdirectory of the oneMKL installation directory: examples/cdftf/source/cdft_example_support.f90.
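
The exact interfaces of those helpers are given in the example file itself; as a hedged alternative, the same distribution can be sketched with standard MPI collectives (MPI_Gather and MPI_Scatterv) driven by the 1D parameters described above. The module name MKL_CDFT, the buffer sizes, and the omission of error checks are assumptions of this sketch, not part of oneMKL:

  ! Hedged alternative to MKL_CDFT_ScatterData: distribute a 1D global input array
  ! held on rank 0 using standard MPI collectives and the queried layout parameters.
  program scatter_global_input_1d
    use MKL_CDFT
    implicit none
    include 'mpif.h'

    type(DFTI_DESCRIPTOR_DM), pointer :: desc
    integer :: status, ierr, rank, nprocs
    integer :: length, nx, start_x, local_size
    integer, allocatable :: counts(:), displs(:)
    complex(8), allocatable :: global(:), local(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    length = 1024
    status = DftiCreateDescriptorDM(MPI_COMM_WORLD, desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, length)
    status = DftiGetValueDM(desc, CDFT_LOCAL_NX, nx)
    status = DftiGetValueDM(desc, CDFT_LOCAL_START_X, start_x)
    status = DftiGetValueDM(desc, CDFT_LOCAL_SIZE, local_size)

    allocate(counts(nprocs), displs(nprocs), local(local_size))
    if (rank == 0) then
      allocate(global(length))
      global = (0.0d0, 0.0d0)          ! ... fill the global input array on the root ...
    else
      allocate(global(1))              ! dummy send buffer on non-root processes
    end if

    ! Collect each process's element count and 0-based offset on the root,
    ! then distribute the corresponding segments of the global array.
    call MPI_GATHER(nx, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    call MPI_GATHER(start_x - 1, 1, MPI_INTEGER, displs, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    call MPI_SCATTERV(global, counts, displs, MPI_DOUBLE_COMPLEX, &
                      local, nx, MPI_DOUBLE_COMPLEX, 0, MPI_COMM_WORLD, ierr)

    ! local(1:nx) now holds this process's segment of the global input array.

    status = DftiFreeDescriptorDM(desc)
    call MPI_FINALIZE(ierr)
  end program scatter_global_input_1d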

Restriction on Lengths of Transforms

The algorithm that the Intel® oneAPI Math Kernel Library (oneMKL) cluster FFT functions use to distribute data among processes imposes a restriction on the lengths of transforms with respect to the number of MPI processes used for the FFT computation:

  • For a multi-dimensional transform, the lengths of the last two dimensions must not be less than the number of MPI processes.
  • The length of a one-dimensional transform must be the product of two integers, each of which is not less than the number of MPI processes.

Non-compliance with the restriction causes the error CDFT_SPREAD_ERROR (refer to Error Codes for details). To achieve compliance, you can change the transform lengths and/or the number of MPI processes, which is specified at the start of an MPI program. MPI-2 enables changing the number of processes during execution of an MPI program.
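
For example, with 4 MPI processes a one-dimensional transform of length 20 = 4*5 satisfies the restriction, whereas a length of 22 = 2*11 does not. The following hypothetical helper (not part of oneMKL) checks the one-dimensional condition:

  ! Hypothetical helper, not part of oneMKL: checks whether a one-dimensional
  ! transform length can be written as a product of two factors, each not less
  ! than the number of MPI processes.
  program check_1d_length
    implicit none
    integer :: length, nprocs

    length = 20
    nprocs = 4
    if (length_ok(length, nprocs)) then
      print *, length, 'is acceptable for', nprocs, 'processes'
    else
      print *, length, 'violates the restriction for', nprocs, 'processes'
    end if

  contains

    logical function length_ok(n, p)
      integer, intent(in) :: n, p
      integer :: f
      length_ok = .false.
      do f = p, n/p                    ! candidate factors f with f >= p and n/f >= p
        if (mod(n, f) == 0 .and. n/f >= p) then
          length_ok = .true.
          return
        end if
      end do
    end function length_ok
  end program check_1d_length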