Prefetching
User-guided data prefetching is a useful technique for hiding latency arising from lower-level cache misses and main memory accesses. OpenMP offload for Intel® GPUs now enables this feature using the prefetch pragma, with syntax as follows:
C OpenMP prefetch:
#pragma ompx prefetch data( [prefetch-hint-modifier:] arrsect [, arrsect] ) [ if (condition) ]
Fortran OpenMP prefetch:
!$omp prefetch data( [prefetch-hint-modifier:] arrsect [, arrsect] ) [if (condition)]
The prefetch pragma above is an Intel® extension, and works for Intel® Data Center GPU Max Series and later products. The main aspects of the pragma are:
Prefetch-hint: The destination for the prefetched data is specified using the optional prefetch-hint-modifier. Valid values are 0 (no-op), 2 (prefetch to L3 only), and 4 (prefetch to L1 and L3). If no value is specified, the default is 0.
Use of array section: A contiguous array section arrsect is specified using OpenMP array-section syntax: [lower-bound : length] in C/C++ and (lower-bound : upper-bound) in Fortran. For example, a[2:4] in C denotes four elements starting at index 2, while a(2:4) in Fortran denotes elements 2 through 4. If only one element needs to be prefetched, in C we can use either a[2:1] or a[2]. Fortran needs the bounds written out explicitly even for a single-element prefetch, as in a(2:2).
Default prefetch size: Even if only a single array element is requested, the hardware prefetches the entire cache line that contains that element. On Intel® Data Center GPU Max Series, a cache line is 64 bytes, so one line holds 16 single-precision (4-byte) floating-point values.
Faulting: Prefetch instructions in Intel® Data Center GPU Max Series are faulting, which means accesses to invalid addresses can cause a segmentation fault. The optional if condition in the pragma can be used for guarding against out-of-bounds accesses.
Non-blocking: The prefetch pragma does not block; it does not wait for the prefetch to complete. A minimal sketch combining the elements above is shown below.
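Putting these pieces together, the following is a minimal, hedged sketch of the C syntax (the function, array names, tile size, and loop structure are illustrative assumptions and are not taken from the examples later in this section). Each work-item processes its own contiguous tiles of x and prefetches the next tile into L1 and L3 (hint 4) while computing on the current one, with the if clause guarding against out-of-bounds addresses:

#define TILE 16
// Illustrative sketch only: each work-item walks through its own block of
// tiles and prefetches the next tile while computing on the current one.
void scale_tiles_gpu(float *x, float *y, int n, int tiles_per_item) {
#pragma omp target teams distribute parallel for
  for (int w = 0; w < n / (TILE * tiles_per_item); ++w) {
    int base = w * TILE * tiles_per_item;
    for (int t = 0; t < tiles_per_item; ++t) {
      int cur = base + t * TILE;
      int next = cur + TILE;
      // Prefetch the next TILE-element section of x into L1 and L3 (hint 4);
      // the if clause prevents prefetching past the end of the array.
#pragma ompx prefetch data(4 : x[next : TILE]) if (next + TILE <= n)
      for (int k = 0; k < TILE; ++k)
        y[cur + k] = 2.0f * x[cur + k];
    }
  }
}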
Prefetch in C OpenMP
The following example shows a simplified one-dimensional version of an N-body force kernel. The outer for-loop iterates over the particles for which the forces are calculated. The inner loops iterate over the interacting particles in batches of TILE_SIZE particles. We can prefetch the next tile of particles while computing on the current tile. As described above, a prefetch always brings in a full 64-byte cache line, so we need to issue a prefetch for only one out of every 16 single-precision floating-point values, which is achieved with the condition if ((next_tile % 16) == 0). This masking condition may not always help; see the additional notes after the code snippet below. The prefetch hint used is 4 (prefetch to L1 and L3 caches). Only the offloaded kernel is shown below.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64

void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE)
  for (int i = 0; i < n1; i++) {
    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
    const float eps = 0.01f;
    float dx = 0.0;
    float bb[TILE_SIZE];
    for (int j = 0; j < n2; j += TILE_SIZE) {
      // load tile from b
      for (int u = 0; u < TILE_SIZE; ++u) {
        bb[u] = b[j + u];
#ifdef PREFETCH
        int next_tile = j + TILE_SIZE + u;
        if ((next_tile % 16) == 0) {
#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
        }
#endif
      }
#pragma unroll(TILE_SIZE)
      for (int u = 0; u < TILE_SIZE; ++u) {
        float delta = bb[u] - a[i];
        float r2 = delta * delta;
        float s0 = r2 + eps;
        float s1 = 1.0f / sqrtf(s0);
        float f = (s1 * s1 * s1) -
                  (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
        dx += f * delta;
      }
    }
    c[i] = dx * 0.23f;
  }
}
The condition if ((next_tile % 16) == 0) can save prefetch overhead when the prefetched array index is not vectorized. In the example above, only the index i is vectorized, so when we prefetch b[], which is indexed by j, it helps to issue a prefetch only once every 16 elements. On the other hand, if we were to prefetch an array indexed by i, the prefetch itself would be vectorized, and the masking condition may not offer any benefit (a sketch of that case follows below). Users will need to determine experimentally which approach works best for their application.
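For illustration, here is a hedged sketch of that second case, prefetching an array over the vectorized index i (the kernel, array names, and look-ahead distance AHEAD are assumptions made for this sketch, not part of the nbody example). Because i is spread across SIMD lanes, each lane issues its own prefetch, so a modulo mask on i would not necessarily reduce the number of cache lines requested:

#define AHEAD 64
// Illustrative sketch only: the prefetch is indexed by the vectorized loop
// index i, so the prefetch instruction itself is vectorized across SIMD lanes.
void scale_gpu(float *x, float *y, int n) {
#pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++) {
    // Prefetch x[i + AHEAD] into L1 and L3 (hint 4), guarded against
    // reading past the end of the array.
#pragma ompx prefetch data(4 : x[i + AHEAD]) if (i + AHEAD < n)
    y[i] = 2.0f * x[i];
  }
}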
Compilation command:
Without prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium nbody_c.cpp -o test_c
With prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH nbody_c.cpp -o test_c
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_c
The default SIMD width is chosen automatically, as either 16 or 32, by the backend device compiler (Intel® Graphics Compiler, or IGC) on Intel® Data Center GPU Max Series, using heuristics that take into account factors such as register pressure in the kernel. The IGC environment variable IGC_ForceOCLSIMDWidth=16 can be used to request that IGC force a SIMD width of 16; SIMD16 gave better performance for the above kernel. In the run command, we have also enabled OpenMP’s built-in profiler using LIBOMPTARGET_PLUGIN_PROFILE=T,usec. The output from the run without prefetch is shown below.
Obtained output = 222700231.430
Expected output = 222700339.016
Total time = 205.4 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_46_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_46_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_46_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          :              Host Time (usec)                            Device Time (usec)
Name                      :      Total    Average        Min        Max      Total    Average        Min        Max      Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  598283.05  598283.05  598283.05  598283.05       0.00       0.00       0.00       0.00       1.00
DataAlloc                 :    9578.23     798.19       0.00    8728.03       0.00       0.00       0.00       0.00      12.00
DataRead (Device to Host) :      77.01      77.01      77.01      77.01       5.68       5.68       5.68       5.68       1.00
DataWrite (Host to Device):     713.11     356.55     179.05     534.06      15.76       7.88       5.04      10.72       2.00
Kernel 0                  :  205292.22    2052.92    2033.95    2089.98  203572.32    2035.72    1984.96    2073.12     100.00
Kernel 1                  :  109194.28    1091.94    1076.94    1681.09  107051.52    1070.52    1062.40    1107.04     100.00
Kernel 2                  :    1746.89    1746.89    1746.89    1746.89       3.84       3.84       3.84       3.84       1.00
Linking                   :       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       1.00
OffloadEntriesInit        :    2647.88    2647.88    2647.88    2647.88       0.00       0.00       0.00       0.00       1.00
======================================================================================================================
From the output above, the average device time for the GPU kernel execution (Kernel 0) is 2036 microseconds. Running the binary with prefetch enabled brings the average kernel device time down to 1841 microseconds, as shown below:
Obtained output = 222700231.430
Expected output = 222700339.016
Total time = 185.9 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_43_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_43_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_43_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          :              Host Time (usec)                            Device Time (usec)
Name                      :      Total    Average        Min        Max      Total    Average        Min        Max      Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  499351.98  499351.98  499351.98  499351.98       0.00       0.00       0.00       0.00       1.00
DataAlloc                 :    9609.94     800.83       0.00    8740.19       0.00       0.00       0.00       0.00      12.00
DataRead (Device to Host) :      77.01      77.01      77.01      77.01       4.96       4.96       4.96       4.96       1.00
DataWrite (Host to Device):     722.17     361.08     185.01     537.16      16.40       8.20       5.44      10.96       2.00
Kernel 0                  :  185793.88    1857.94    1839.88    1919.03  184075.20    1840.75    1824.00    1874.56     100.00
Kernel 1                  :  109442.95    1094.43    1076.94    1590.01  107334.56    1073.35    1062.40    1115.68     100.00
Kernel 2                  :    1821.99    1821.99    1821.99    1821.99       3.84       3.84       3.84       3.84       1.00
Linking                   :       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       1.00
OffloadEntriesInit        :    2493.14    2493.14    2493.14    2493.14       0.00       0.00       0.00       0.00       1.00
======================================================================================================================
Please note that the achieved performance depends on the hardware and the software stack used, so users may see different performance numbers on their systems.
Prefetch in Fortran OpenMP
The same nbody1d kernel is shown in Fortran below. The prefetch directive is inserted at the same location as before, with a prefetch hint value of 4, and again only one out of every 16 elements is prefetched.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 ! 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64

      subroutine nbody_1d_gpu(c, a, b, n1, n2)
      implicit none
      integer n1, n2
      real a(0:n1-1), b(0:n2-1), c(0:n1-1)
      real dx, bb(0:TILE_SIZE-1), delta, r2, s0, s1, f
      integer i,j,u,next
      real ma0, ma1, ma2, ma3, ma4, ma5, eps
      parameter (ma0=0.269327, ma1=-0.0750978, ma2=0.0114808)
      parameter (ma3=-0.00109313, ma4=0.0000605491, ma5=-0.00000147177)
      parameter (eps=0.01)

!$omp target teams distribute parallel do thread_limit(WORKGROUP_SIZE)
!$omp& private(i,dx,j,u,bb,next,delta,r2,s0,s1,f)
      do i = 0, n1-1
        dx = 0.0
        do j = 0, n2-1, TILE_SIZE
          ! load tile from b
          do u = 0, TILE_SIZE-1
            bb(u) = b(j+u)
#ifdef PREFETCH
            next = j + TILE_SIZE + u
            if (mod(next,16).eq.0) then
!$omp prefetch data(PREFETCH_HINT:b(next:next)) if(next<n2)
            endif
#endif
          enddo
          ! compute
!DIR$ unroll(TILE_SIZE)
          do u = 0, TILE_SIZE-1
            delta = bb(u) - a(i)
            r2 = delta*delta
            s0 = r2 + eps
            s1 = 1.0 / sqrt(s0)
            f = (s1*s1*s1)-(ma0+r2*(ma1+r2*(ma2+r2*(ma3+r2*(ma4+ma5)))))
            dx = dx + f*delta
          enddo
        enddo
        c(i) = dx*0.23
      enddo
      end subroutine
Compilation command:
Without prefetch:
ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium nbody_f.f -o test_f
With prefetch:
ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium -DPREFETCH nbody_f.f -o test_f
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_f
The output is not shown here since it is similar to that of the C example. The average kernel device times without and with prefetch were 2017 microseconds and 1823 microseconds, respectively. Again, please note that users may see different performance numbers depending on the actual hardware and software stack used.
Prefetch in C OpenMP SIMD
OpenMP offload also supports a SIMD programming model, wherein all computations are specified in terms of EU threads that comprise 16 or 32 SIMD lanes on Intel® Data Center GPU Max Series. Correspondingly, the thread_limit() clause in OpenMP takes on a modified meaning and now specifies the number of EU threads per work-group. The OpenMP SIMD version of the nbody1d kernel is listed below. We need to explicitly specify the SIMD width, which is VECLEN=16. At the time of this writing, the prefetch pragma is recommended to be placed outside the scope of the simd clause, which means only one SIMD lane issues the prefetch instruction. In this example, 1 out of 16 lanes carries out the prefetch, which is exactly what we need, so we no longer need the if ((next_tile % 16) == 0) condition used in the previous examples.
#define WORKGROUP_SIZE 1024
#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3; 2 = prefetch to L3
#define TILE_SIZE 64
#define VECLEN 16 // explicit SIMD width

void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE / VECLEN)
  for (int i = 0; i < n1; i += VECLEN) {
    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
    const float eps = 0.01f;
    float dx[VECLEN];
    float aa[VECLEN], bb[TILE_SIZE];
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
    for (int v = 0; v < VECLEN; ++v) {
      dx[v] = 0.0f;
      aa[v] = a[i + v];
    }
    for (int j = 0; j < n2; j += TILE_SIZE) {
      // load tile from b
      for (int u = 0; u < TILE_SIZE; u += VECLEN) {
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
        for (int v = 0; v < VECLEN; ++v)
          bb[u + v] = b[j + u + v];
#ifdef PREFETCH
        int next_tile = j + TILE_SIZE + u;
#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
#endif
      }
      // compute current tile
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
      for (int v = 0; v < VECLEN; ++v) {
#pragma unroll(TILE_SIZE)
        for (int u = 0; u < TILE_SIZE; ++u) {
          float delta = bb[u] - aa[v];
          float r2 = delta * delta;
          float s0 = r2 + eps;
          float s1 = 1.0f / sqrtf(s0);
          float f = (s1 * s1 * s1) -
                    (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
          dx[v] += f * delta;
        }
      }
    }
#pragma omp simd simdlen(VECLEN)
#pragma unroll(0)
    for (int v = 0; v < VECLEN; ++v) {
      c[i + v] = dx[v] * 0.23f;
    }
  }
}
Compilation command:
We need to use an additional compilation switch -fopenmp-target-simd to enable the SIMD programming model. The compilation command is therefore as follows:
Without prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd
With prefetch:
icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd
Run command:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec ./test_c_simd
Notice that we no longer need the environment variable IGC_ForceOCLSIMDWidth=16, because the SIMD width has been explicitly specified in the OpenMP code.
The output looks like that of the previous examples, so it is not shown. The average kernel device times without and with prefetch were 2008 microseconds and 1810 microseconds, respectively. As noted earlier, users may see different performance numbers depending on the actual hardware and software stack used.