Prefetch
The user can give the compiler a hint to prefetch data into cache. This is a useful technique for hiding memory latency. Prefetch is supported on Intel® Data Center GPU Max Series and later products.
The following example shows a simple one-dimensional reduction over N tiles. Each work-item reduces four tiles; while computing on the current tile, we can prefetch the next one.
constexpr int N = 1024 * 1024;
constexpr int TILE_SIZE = 8;
float *data = sycl::malloc_shared<float>(N / TILE_SIZE, q);
float *data2 = sycl::malloc_shared<float>(N, q);
for (auto i = 0; i < N; ++i)
  data2[i] = static_cast<float>(i);
auto e = q.submit([&](auto &h) {
  h.parallel_for(
      sycl::nd_range(sycl::range{N / TILE_SIZE / 4}, sycl::range{1024}),
      [=](sycl::nd_item<1> it) {
        float tile[TILE_SIZE];
        int i = it.get_global_linear_id();
        using global_ptr =
            sycl::multi_ptr<float, sycl::access::address_space::global_space>;
        for (int t = 0; t < 4; ++t) {
          // load current tile
          for (int j = 0; j < TILE_SIZE; ++j) {
            tile[j] = data2[i * TILE_SIZE * 4 + t * TILE_SIZE + j];
          }
          // prefetch next tile while we compute on this one
          if (t < 3)
            global_ptr(&data2[i * TILE_SIZE * 4 + (t + 1) * TILE_SIZE])
                .prefetch(TILE_SIZE);
          // reduce the tile, then store the result once
          float dx = 0.0f;
          for (int j = 0; j < TILE_SIZE; ++j) {
            dx += 1.0f / sqrtf(tile[j]);
          }
          data[i * 4 + t] = dx;
        }
      });
});
Prefetch takes one argument: the number of contiguous elements to prefetch. The compiler works best when this argument is a constant known at compile time. The compiler prefetches at most 32 bytes per work-item for each prefetch call; if a larger prefetch is required, it must be split into multiple calls. Too many prefetches can have diminishing returns, so always measure kernel execution time when adding prefetches to verify that they help.
Prefetch should be issued far enough ahead of the expected use of the data that the subsequent load can hit the cache. Note that memory fences can flush the cache, in which case any earlier prefetch must be repeated after the fence.