Prefetching
User-guided data prefetching is a useful technique for hiding latency arising from lower-level cache misses and main memory accesses. OpenMP offload for Intel® GPUs now enables this feature using the prefetch pragma, with syntax as follows:
C OpenMP prefetch:
#pragma ompx prefetch data( [prefetch-hint-modifier:] arrsect [, arrsect] ) [ if (condition) ]
Fortran OpenMP prefetch:
!$omp prefetch data( [prefetch-hint-modifier:] arrsect [, arrsect] ) [if (condition)]
The prefetch pragma above is an Intel® extension, and works for Intel® Data Center GPU Max Series and later products. The main aspects of the pragma are:
Prefetch-hint: The destination for the prefetched data is specified using the optional prefetch-hint-modifier. Valid values are 0 (no-op), 2 (prefetch to L3 only), and 4 (prefetch to L1 and L3). If no value is specified, the default is 0.
Use of array section: A contiguous array section arrsect is specified using OpenMP array-section syntax: [lower-bound : length] in C/C++ and (lower-bound : upper-bound) in Fortran. For example, a[2:4] in C denotes four elements starting at index 2, while a(2:4) in Fortran denotes elements 2 through 4. If only one element needs to be prefetched, in C we can use either a[2:1] or a[2]. Fortran needs the bounds written out explicitly even for a single-element prefetch, as in a(2:2).
Default prefetch size: Even if only a single array element is requested, the hardware prefetches the entire cache line that contains that element. On Intel® Data Center GPU Max Series, a cache line is 64 bytes, so one line holds 16 single-precision (4-byte) floating-point values.
Faulting: Prefetch instructions in Intel® Data Center GPU Max Series are faulting, which means accesses to invalid addresses can cause a segmentation fault. The optional if condition in the pragma can be used for guarding against out-of-bounds accesses.
Non-blocking: The prefetch pragma does not block; it does not wait for the prefetch to complete. A minimal sketch combining the elements above is shown below.
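Putting these pieces together, the following is a minimal, hedged sketch of the C syntax (the function, array names, tile size, and loop structure are illustrative assumptions and are not taken from the examples later in this section). Each work-item processes its own contiguous tiles of x and prefetches the next tile into L1 and L3 (hint 4) while computing on the current one, with the if clause guarding against out-of-bounds addresses:

#define TILE 16
// Illustrative sketch only: each work-item walks through its own block of
// tiles and prefetches the next tile while computing on the current one.
void scale_tiles_gpu(float *x, float *y, int n, int tiles_per_item) {
#pragma omp target teams distribute parallel for
  for (int w = 0; w < n / (TILE * tiles_per_item); ++w) {
    int base = w * TILE * tiles_per_item;
    for (int t = 0; t < tiles_per_item; ++t) {
      int cur = base + t * TILE;
      int next = cur + TILE;
      // Prefetch the next TILE-element section of x into L1 and L3 (hint 4);
      // the if clause prevents prefetching past the end of the array.
#pragma ompx prefetch data(4 : x[next : TILE]) if (next + TILE <= n)
      for (int k = 0; k < TILE; ++k)
        y[cur + k] = 2.0f * x[cur + k];
    }
  }
}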
Prefetch in C OpenMP
The following example shows a simplified one-dimensional version of an N-body force kernel. The outer for-loop iterates over the particles for which the forces are calculated. The inner loops iterate over the interacting particles in batches of TILE_SIZE particles. We can prefetch the next tile of particles while computing on the current tile. As described above, a prefetch always brings in a full 64-byte cache line, so we need to issue a prefetch for only one out of every 16 single-precision floating-point values, which is achieved with the condition if ((next_tile % 16) == 0). This masking condition may not always help; see the additional notes after the code snippet below. The prefetch hint used is 4 (prefetch to L1 and L3 caches). Only the offloaded kernel is shown below.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64

void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE)
  for (int i = 0; i < n1; i++) {
    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
    const float eps = 0.01f;
    float dx = 0.0;
    float bb[TILE_SIZE];
    for (int j = 0; j < n2; j += TILE_SIZE) {
      // load tile from b
      for (int u = 0; u < TILE_SIZE; ++u) {
        bb[u] = b[j + u];
#ifdef PREFETCH
        int next_tile = j + TILE_SIZE + u;
        if ((next_tile % 16) == 0) {
#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
        }
#endif
      }
#pragma unroll(TILE_SIZE)
      for (int u = 0; u < TILE_SIZE; ++u) {
        float delta = bb[u] - a[i];
        float r2 = delta * delta;
        float s0 = r2 + eps;
        float s1 = 1.0f / sqrtf(s0);
        float f = (s1 * s1 * s1) -
                  (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
        dx += f * delta;
      }
    }
    c[i] = dx * 0.23f;
  }
}
The condition if ((next_tile % 16) == 0) can save prefetch overhead when the prefetched array index is not vectorized. In the example above, only the index i is vectorized, so when we prefetch b[], which is indexed by j, it helps to issue a prefetch only once every 16 elements. On the other hand, if we were to prefetch an array indexed by i, the prefetch itself would be vectorized, and the masking condition may not offer any benefit (a sketch of that case follows below). Users will need to determine experimentally which approach works best for their application.
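For illustration, here is a hedged sketch of that second case, prefetching an array over the vectorized index i (the kernel, array names, and look-ahead distance AHEAD are assumptions made for this sketch, not part of the nbody example). Because i is spread across SIMD lanes, each lane issues its own prefetch, so a modulo mask on i would not necessarily reduce the number of cache lines requested:

#define AHEAD 64
// Illustrative sketch only: the prefetch is indexed by the vectorized loop
// index i, so the prefetch instruction itself is vectorized across SIMD lanes.
void scale_gpu(float *x, float *y, int n) {
#pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++) {
    // Prefetch x[i + AHEAD] into L1 and L3 (hint 4), guarded against
    // reading past the end of the array.
#pragma ompx prefetch data(4 : x[i + AHEAD]) if (i + AHEAD < n)
    y[i] = 2.0f * x[i];
  }
}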
Compilation command:
Without prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium nbody_c.cpp -o test_c
With prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH nbody_c.cpp -o test_c
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_c
The default SIMD width is chosen automatically, as either 16 or 32, by the backend device compiler (Intel® Graphics Compiler, or IGC) on Intel® Data Center GPU Max Series, using heuristics that take into account factors such as register pressure in the kernel. The IGC environment variable IGC_ForceOCLSIMDWidth=16 can be used to request that IGC force a SIMD width of 16; SIMD16 gave better performance for the above kernel. In the run command, we have also enabled OpenMP’s built-in profiler using LIBOMPTARGET_PLUGIN_PROFILE=T,usec. The output from the run without prefetch is shown below.
Obtained output = 222700231.430
Expected output = 222700339.016
Total time = 205.4 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_46_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_46_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_46_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          :              Host Time (usec)                            Device Time (usec)
Name                      :      Total    Average        Min        Max      Total    Average        Min        Max      Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  598283.05  598283.05  598283.05  598283.05       0.00       0.00       0.00       0.00       1.00
DataAlloc                 :    9578.23     798.19       0.00    8728.03       0.00       0.00       0.00       0.00      12.00
DataRead (Device to Host) :      77.01      77.01      77.01      77.01       5.68       5.68       5.68       5.68       1.00
DataWrite (Host to Device):     713.11     356.55     179.05     534.06      15.76       7.88       5.04      10.72       2.00
Kernel 0                  :  205292.22    2052.92    2033.95    2089.98  203572.32    2035.72    1984.96    2073.12     100.00
Kernel 1                  :  109194.28    1091.94    1076.94    1681.09  107051.52    1070.52    1062.40    1107.04     100.00
Kernel 2                  :    1746.89    1746.89    1746.89    1746.89       3.84       3.84       3.84       3.84       1.00
Linking                   :       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       1.00
OffloadEntriesInit        :    2647.88    2647.88    2647.88    2647.88       0.00       0.00       0.00       0.00       1.00
======================================================================================================================
From the output above, the average device time for the GPU kernel execution (Kernel 0) is 2036 microseconds. Running the binary with prefetch enabled brings the average kernel device time down to 1841 microseconds, as shown below:
Obtained output = 222700231.430
Expected output = 222700339.016
Total time = 185.9 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_43_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_43_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_43_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          :              Host Time (usec)                            Device Time (usec)
Name                      :      Total    Average        Min        Max      Total    Average        Min        Max      Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  499351.98  499351.98  499351.98  499351.98       0.00       0.00       0.00       0.00       1.00
DataAlloc                 :    9609.94     800.83       0.00    8740.19       0.00       0.00       0.00       0.00      12.00
DataRead (Device to Host) :      77.01      77.01      77.01      77.01       4.96       4.96       4.96       4.96       1.00
DataWrite (Host to Device):     722.17     361.08     185.01     537.16      16.40       8.20       5.44      10.96       2.00
Kernel 0                  :  185793.88    1857.94    1839.88    1919.03  184075.20    1840.75    1824.00    1874.56     100.00
Kernel 1                  :  109442.95    1094.43    1076.94    1590.01  107334.56    1073.35    1062.40    1115.68     100.00
Kernel 2                  :    1821.99    1821.99    1821.99    1821.99       3.84       3.84       3.84       3.84       1.00
Linking                   :       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       1.00
OffloadEntriesInit        :    2493.14    2493.14    2493.14    2493.14       0.00       0.00       0.00       0.00       1.00
======================================================================================================================
Please note that the achieved performance depends on the hardware and the software stack used, so users may see different performance numbers on their systems.
Prefetch in Fortran OpenMP
The same nbody1d kernel is shown in Fortran below. The prefetch directive is inserted at the same location as before, with a prefetch hint value of 4, and again only one out of every 16 elements is prefetched.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 ! 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64

      subroutine nbody_1d_gpu(c, a, b, n1, n2)
      implicit none
      integer n1, n2
      real a(0:n1-1), b(0:n2-1), c(0:n1-1)
      real dx, bb(0:TILE_SIZE-1), delta, r2, s0, s1, f
      integer i,j,u,next
      real ma0, ma1, ma2, ma3, ma4, ma5, eps
      parameter (ma0=0.269327, ma1=-0.0750978, ma2=0.0114808)
      parameter (ma3=-0.00109313, ma4=0.0000605491, ma5=-0.00000147177)
      parameter (eps=0.01)

!$omp target teams distribute parallel do thread_limit(WORKGROUP_SIZE)
!$omp& private(i,dx,j,u,bb,next,delta,r2,s0,s1,f)
      do i = 0, n1-1
        dx = 0.0
        do j = 0, n2-1, TILE_SIZE
          ! load tile from b
          do u = 0, TILE_SIZE-1
            bb(u) = b(j+u)
#ifdef PREFETCH
            next = j + TILE_SIZE + u
            if (mod(next,16).eq.0) then
!$omp prefetch data(PREFETCH_HINT:b(next:next)) if(next<n2)
            endif
#endif
          enddo
          ! compute
!DIR$ unroll(TILE_SIZE)
          do u = 0, TILE_SIZE-1
            delta = bb(u) - a(i)
            r2 = delta*delta
            s0 = r2 + eps
            s1 = 1.0 / sqrt(s0)
            f = (s1*s1*s1)-(ma0+r2*(ma1+r2*(ma2+r2*(ma3+r2*(ma4+ma5)))))
            dx = dx + f*delta
          enddo
        enddo
        c(i) = dx*0.23
      enddo
      end subroutine
Compilation command:
Without prefetch:
ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium nbody_f.f -o test_f
With prefetch:
ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium -DPREFETCH nbody_f.f -o test_f
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_f
The output is not shown here since it is similar to that of the C example. The average kernel device times without and with prefetch were 2017 microseconds and 1823 microseconds, respectively. Again, please note that users may see different performance numbers depending on the actual hardware and software stack used.
Prefetch in C OpenMP SIMD
OpenMP offload also supports a SIMD programming model, wherein all computations are specified in terms of EU threads that comprise 16 or 32 SIMD lanes on Intel® Data Center GPU Max Series. Correspondingly, the thread_limit() clause in OpenMP takes on a modified meaning and now specifies the number of EU threads per work-group. The OpenMP SIMD version of the nbody1d kernel is listed below. We need to explicitly specify the SIMD width, which is VECLEN=16. At the time of this writing, the prefetch pragma is recommended to be placed outside the scope of the simd clause, which means only one SIMD lane issues the prefetch instruction. In this example, 1 out of 16 lanes carries out the prefetch, which is exactly what we need, so we no longer need the if ((next_tile % 16) == 0) condition used in the previous examples.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64
#define VECLEN 16 // explicit SIMD width

void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE / VECLEN)
  for (int i = 0; i < n1; i += VECLEN) {
    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
    const float eps = 0.01f;
    float dx[VECLEN];
    float aa[VECLEN], bb[TILE_SIZE];
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
    for (int v = 0; v < VECLEN; ++v) {
      dx[v] = 0.0f;
      aa[v] = a[i + v];
    }
    for (int j = 0; j < n2; j += TILE_SIZE) {
      // load tile from b
      for (int u = 0; u < TILE_SIZE; u += VECLEN) {
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
        for (int v = 0; v < VECLEN; ++v)
          bb[u + v] = b[j + u + v];
#ifdef PREFETCH
        int next_tile = j + TILE_SIZE + u;
#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
#endif
      }
      // compute current tile
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
      for (int v = 0; v < VECLEN; ++v) {
#pragma unroll(TILE_SIZE)
        for (int u = 0; u < TILE_SIZE; ++u) {
          float delta = bb[u] - aa[v];
          float r2 = delta * delta;
          float s0 = r2 + eps;
          float s1 = 1.0f / sqrtf(s0);
          float f = (s1 * s1 * s1) -
                    (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
          dx[v] += f * delta;
        }
      }
    }
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
    for (int v = 0; v < VECLEN; ++v) {
      c[i + v] = dx[v] * 0.23f;
    }
  }
}
Compilation command:
We need to use an additional compilation switch -fopenmp-target-simd to enable the SIMD programming model. The compilation command is therefore as follows:
Without prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd
With prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec ./test_c_simd
Notice that we no longer need the environment variable IGC_ForceOCLSIMDWidth=16, because the SIMD width has been explicitly specified in the OpenMP code.
The output looks like that of the previous examples, so it is not shown. The average kernel device times without and with prefetch were 2008 microseconds and 1810 microseconds, respectively. As noted earlier, users may see different performance numbers depending on the actual hardware and software stack used.