Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 12/16/2022
Public

Enable the Read-Only Cache for Read-Only Accessors (-Xsread-only-cache-size=<N>)

If your kernel accesses a read-only accessor that is guaranteed not to alias with any other accessor or USM pointer, consider enabling the read-only cache by adding the -Xsread-only-cache-size=<N> flag to your icpx command. Use the read-only cache for high-bandwidth lookups into tables whose contents remain constant throughout the kernel execution. The read-only cache is optimized for high cache-hit performance.

Example

icpx -fsycl -fintelfpga -Xshardware -Xsread-only-cache-size=<N> <source_file>.cpp

The compiler implements the read-only cache using on-chip memory blocks and privatizes it per kernel. Each kernel receives its own cache, which serves all reads in that kernel from read-only no-alias accessors. The compiler replicates each private cache as many times as necessary to expose extra read ports. Each replica is <N> bytes in size, as specified by the -Xsread-only-cache-size=<N> flag.

NOTE:
  • Unlike global memory accesses, which have extra hardware for tolerating long memory latencies, the read-only cache suffers significant performance penalties on cache misses. If the buffer accessed in your kernel code cannot fit in the cache, you might achieve better performance without enabling the cache. The cached data is discarded (invalidated) from the read-only cache every time the kernel is launched.
  • Currently, you cannot exclude only a subset of your design's read-only accessors from the read-only cache. If your design has multiple read-only no-alias accessors, you can either enable caching for all of them using the global -Xsread-only-cache-size=<N> flag or disable caching for all of them by removing the flag.
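The effect of a table that does not fit in the cache can be illustrated with a small host-side model. The following is only a sketch: this guide does not describe the internal organization of the read-only cache, so the direct-mapped layout, the 64-byte line size, and the names ToyCacheStats and SimulateLookups are all assumptions made for illustration. The model starts with an empty cache, mirroring the invalidation at kernel launch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only. The real cache organization is not documented here;
// a direct-mapped cache with 64-byte lines is an assumption.
struct ToyCacheStats {
  std::size_t hits = 0;
  std::size_t misses = 0;
};

// Counts hits and misses for pseudo-random byte lookups into a table of
// table_bytes, served by a direct-mapped cache of cache_bytes.
// cache_bytes must be at least one line (64 bytes) in this model.
ToyCacheStats SimulateLookups(std::size_t table_bytes, std::size_t cache_bytes,
                              std::size_t num_lookups) {
  constexpr std::size_t kLineBytes = 64;  // assumed line size
  const std::size_t num_lines = cache_bytes / kLineBytes;
  std::vector<std::int64_t> tags(num_lines, -1);  // -1 marks an empty line
  ToyCacheStats stats;
  std::uint32_t state = 12345u;  // fixed seed keeps the run repeatable
  for (std::size_t i = 0; i < num_lookups; ++i) {
    state = state * 1664525u + 1013904223u;  // linear congruential step
    const std::size_t addr = (state >> 8) % table_bytes;
    const std::size_t line = addr / kLineBytes;
    const std::size_t slot = line % num_lines;
    if (tags[slot] == static_cast<std::int64_t>(line)) {
      ++stats.hits;
    } else {
      ++stats.misses;  // the line is fetched and replaces the old occupant
      tags[slot] = static_cast<std::int64_t>(line);
    }
  }
  return stats;
}
```

In this model, SimulateLookups(2048, 2048, 10000) can miss at most once per line (32 lines), because every line of the table has its own slot. SimulateLookups(8192, 2048, 10000) keeps missing for the whole run, because four table lines compete for each slot.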

Consider the following example code snippet:

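q.submit([&](handler &h) {
  // Read-only and no_alias: reads from sqrt_lut are served by the cache.
  accessor sqrt_lut(sqrt_lut_buf, h, read_only,
                    ext::oneapi::accessor_property_list{no_alias});
  // Not read-only, so reads from indices bypass the cache. no_init is
  // omitted here because the kernel reads the existing buffer contents.
  accessor indices(indices_buf, h, read_write,
                   ext::oneapi::accessor_property_list{no_alias});
  // Write-only output; no_init skips copying the old contents to the device.
  accessor output(output_buf, h, write_only,
                  ext::oneapi::accessor_property_list{no_alias, no_init});

  h.single_task<class Test>([=]() {
    for (int i = 0; i < kNumInputs; ++i) {
      // Data-dependent (random) lookup into the constant table.
      output[i] = sqrt_lut[indices[i]];
    }
  });
});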

Compile the above code using the following command:

icpx -fsycl -fintelfpga -Xshardware -Xsread-only-cache-size=2048 <source_file>.cpp

The compiler creates a read-only cache of size 2048 bytes that serves the single read from sqrt_lut. If the cache is sized correctly to match the size of sqrt_lut_buf, then the cache improves the design throughput, especially because the read accesses are random.
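One way to choose <N> is to compute the table's footprint in bytes from its element type and entry count. The following is a minimal sketch: the helper names CacheBytesFor and ReplicatedCacheBytes, the 512-entry float table, and the replica count of 2 are hypothetical values chosen for illustration, not values taken from the compiler or from the example above.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to hold num_entries elements of type T.
template <typename T>
constexpr std::size_t CacheBytesFor(std::size_t num_entries) {
  return num_entries * sizeof(T);
}

// Hypothetical helper: total on-chip bytes if the cache is replicated
// num_replicas times, with each replica cache_bytes in size.
constexpr std::size_t ReplicatedCacheBytes(std::size_t cache_bytes,
                                           std::size_t num_replicas) {
  return cache_bytes * num_replicas;
}

// Assumption for illustration: a lookup table of 512 four-byte floats.
static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
constexpr std::size_t kLutEntries = 512;
constexpr std::size_t kCacheBytes = CacheBytesFor<float>(kLutEntries);
```

Under these assumptions, kCacheBytes evaluates to 2048, which corresponds to -Xsread-only-cache-size=2048. The compiler decides the actual number of replicas, so ReplicatedCacheBytes(kCacheBytes, 2) == 4096 only shows how the on-chip memory cost grows with replication.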