Prefetch
The user can give the compiler a hint to prefetch data into cache. This is a useful technique for hiding memory latency. Prefetch is supported on Intel® Data Center GPU Max Series and later products.
The following example shows a simple one-dimensional reduction over N tiles. Each work-item reduces four tiles; while computing on the current tile, we can prefetch the next one.
constexpr int N = 1024 * 1024;
constexpr int TILE_SIZE = 8;
float *data = sycl::malloc_shared<float>(N / TILE_SIZE, q);
float *data2 = sycl::malloc_shared<float>(N, q);
for (auto i = 0; i < N; ++i)
  data2[i] = static_cast<float>(i);
auto e = q.submit([&](auto &h) {
  h.parallel_for(
      sycl::nd_range(sycl::range{N / TILE_SIZE / 4}, sycl::range{1024}),
      [=](sycl::nd_item<1> it) {
        float tile[TILE_SIZE];
        int i = it.get_global_linear_id();
        using global_ptr =
            sycl::multi_ptr<float, sycl::access::address_space::global_space>;
        for (int t = 0; t < 4; ++t) {
          // load current tile
          for (int j = 0; j < TILE_SIZE; ++j) {
            tile[j] = data2[i * TILE_SIZE * 4 + t * TILE_SIZE + j];
          }
          // prefetch next tile while we compute on this one
          if (t < 3)
            global_ptr(&data2[i * TILE_SIZE * 4 + (t + 1) * TILE_SIZE])
                .prefetch(TILE_SIZE);
          // reduce the tile, then store the result once
          float dx = 0.0f;
          for (int j = 0; j < TILE_SIZE; ++j) {
            dx += 1.0f / sqrtf(tile[j]);
          }
          data[i * 4 + t] = dx;
        }
      });
});
Prefetch takes one argument: the number of contiguous elements to prefetch. The compiler works best when this argument is a constant known at compile time. The compiler prefetches at most 32 bytes per work-item for each prefetch call; if a larger prefetch is required, it must be split into multiple calls. Too many prefetches can have diminishing returns, so always measure kernel execution time when adding prefetches to verify that they help.
Prefetch should be issued far enough ahead of the expected use of the data that the subsequent load can hit the cache. Note that memory fences can flush the cache, in which case any earlier prefetch must be repeated after the fence.