Perform Kernel Computations Using Local or Private Memory
To optimize memory access efficiency, minimize the number of global memory accesses by performing kernel computations in local or private memory.
In practice, this often means preloading the data for a group of computations from global memory into local or private memory, performing the kernel computations on the preloaded data, and then writing the results back to global memory.
Preload Data into Local Memory or Private Memory
Local memory is considerably smaller than global memory, but it has significantly higher bandwidth and much lower latency. Unlike global memory, local memory can be accessed by the kernel in any order without a performance penalty. When you structure your kernel code, try to access global memory sequentially, and buffer that data in on-chip local memory before the kernel uses it for computation.
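The preload, compute, and write-back pattern can be sketched in plain C++ as a host-side model. This is only an illustration, not compiler-specific code: the function name `reverse_within_tiles` and the tile size `kTile` are hypothetical, and the `std::array` stands in for on-chip local memory (in an ND-range SYCL kernel, that role could be played by something like a `sycl::local_accessor`). Global memory is read and written in sequential order, while the out-of-order accesses are confined to the tile.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size; a real kernel would size it to fit on-chip memory.
constexpr std::size_t kTile = 8;

// Host-side model of a kernel that reverses the elements inside each tile.
// global_in / global_out stand in for buffers in global memory.
std::vector<int> reverse_within_tiles(const std::vector<int>& global_in) {
    // Assumption for this sketch: the input is a whole number of tiles.
    assert(global_in.size() % kTile == 0);

    std::vector<int> global_out(global_in.size());

    for (std::size_t base = 0; base < global_in.size(); base += kTile) {
        // Stand-in for on-chip local memory.
        std::array<int, kTile> tile;

        // 1. Preload: read global memory sequentially into the tile.
        for (std::size_t i = 0; i < kTile; ++i) {
            tile[i] = global_in[base + i];
        }

        // 2. Compute and write back: the tile is read in reverse order,
        //    while global memory is still written sequentially.
        for (std::size_t i = 0; i < kTile; ++i) {
            global_out[base + i] = tile[kTile - 1 - i];
        }
    }
    return global_out;
}
```

Without the tile, the reversal would force either the reads or the writes to global memory to run backward; with it, both stay sequential.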
Store Variables and Arrays in Private Memory
The Intel® oneAPI DPC++/C++ Compiler implements private memory using FPGA registers in the kernel datapath, block RAMs, or MLABs. The compiler analyzes private memory accesses and promotes them to register accesses where possible. Scalar variables, such as float, int, and char, are typically promoted. Aggregate data types are promoted if their array-access indices are compile-time constants. Private memory is typically useful for storing single variables or small arrays. Registers are plentiful hardware resources in FPGAs, so it is usually better to use private memory instead of other memory types whenever possible. The kernel can access private memories in parallel, which allows them to provide more bandwidth than global or local memory.
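As a rough illustration of a small private array whose indices can all be resolved at compile time, the following host-side C++ sketch models a 4-tap filter. The names `fir_filter` and `kTaps` are hypothetical. Both inner loops have a fixed trip count, so if they are fully unrolled (for example with `#pragma unroll` in DPC++), every index into `window` and `coeffs` becomes a constant, which is the condition described above for promotion to registers. Whether a given compiler version actually promotes the array is not something this sketch can guarantee.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical number of filter taps; small enough to live in registers.
constexpr std::size_t kTaps = 4;

// Host-side model of a kernel that keeps its working set in a small
// private array instead of re-reading global memory for every output.
std::vector<int> fir_filter(const std::vector<int>& global_in,
                            const std::array<int, kTaps>& coeffs) {
    std::vector<int> global_out(global_in.size());

    // Small private array holding the most recent kTaps inputs,
    // zero-initialized.
    std::array<int, kTaps> window{};

    for (std::size_t n = 0; n < global_in.size(); ++n) {
        // Shift the window by one element. The trip count is fixed, so
        // after full unrolling each index is a compile-time constant.
        for (std::size_t k = kTaps - 1; k > 0; --k) {
            window[k] = window[k - 1];
        }
        // One global memory read per output sample.
        window[0] = global_in[n];

        // Multiply-accumulate over the private array only.
        int acc = 0;
        for (std::size_t k = 0; k < kTaps; ++k) {
            acc += coeffs[k] * window[k];
        }
        global_out[n] = acc;
    }
    return global_out;
}
```

Each input element is read from global memory once, even though it contributes to `kTaps` different outputs; the reuse happens entirely inside the private array.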
For more information on the implementation of private memory using registers, refer to Inferring a Shift Register.