Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

Perform Kernel Computations Using Local or Private Memory

To optimize memory access efficiency, minimize the number of global memory accesses by performing kernel computations in local or private memory.

To minimize global memory accesses, it is often best to preload the data needed for a group of computations from global memory into local or private memory. Perform the kernel computations on the preloaded data and then write the results back to global memory.
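The following is a minimal sketch (not taken from this guide) of that pattern in SYCL: the kernel copies a block of data from global memory into a private array, computes entirely on the on-chip copy, and writes the results back once. The kernel name, buffer arguments, and tile size are illustrative assumptions.

#include <sycl/sycl.hpp>

constexpr size_t kTileSize = 64;

void SquareWithPrivatePreload(sycl::queue &q, sycl::buffer<float, 1> &in_buf,
                              sycl::buffer<float, 1> &out_buf) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor in{in_buf, h, sycl::read_only};
    sycl::accessor out{out_buf, h, sycl::write_only, sycl::no_init};
    h.single_task<class PreloadKernel>([=]() {
      // Private array: the compiler implements this on-chip.
      float tile[kTileSize];

      // Preload from global memory.
      for (size_t i = 0; i < kTileSize; i++)
        tile[i] = in[i];

      // Compute on the preloaded data without touching global memory.
      for (size_t i = 0; i < kTileSize; i++)
        tile[i] = tile[i] * tile[i];

      // Write the results back to global memory once.
      for (size_t i = 0; i < kTileSize; i++)
        out[i] = tile[i];
    });
  });
}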

Preload Data into Local Memory or Private Memory

When you structure your kernel code, if your global memory accesses are not sequential, consider refactoring the code to access global memory sequentially while buffering that data in local or private memory before using it for computation. This can benefit performance because the Intel® oneAPI DPC++/C++ Compiler implements local and private memory on-chip, whereas global memory is off-chip on most platforms. On-chip memory is smaller than off-chip memory, but it has significantly higher bandwidth and much lower latency. Additionally, on-chip memory handles random access patterns more effectively than off-chip memory. For more information, refer to Memory Accesses and Memory Attributes.
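The following hedged sketch (not taken from this guide) illustrates one way to apply this refactoring: the work-group streams a block of global memory into on-chip local memory with sequential reads, and the non-sequential accesses are then served from the local copy. The kernel, buffers, block size, and mirrored-index access pattern are illustrative assumptions.

#include <sycl/sycl.hpp>

constexpr size_t kBlock = 256;

void ReverseBlock(sycl::queue &q, sycl::buffer<int, 1> &in_buf,
                  sycl::buffer<int, 1> &out_buf) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor in{in_buf, h, sycl::read_only};
    sycl::accessor out{out_buf, h, sycl::write_only, sycl::no_init};
    // On-chip local memory shared by the work-group.
    sycl::local_accessor<int, 1> scratch{sycl::range<1>{kBlock}, h};

    h.parallel_for<class LocalBufferKernel>(
        sycl::nd_range<1>{sycl::range<1>{kBlock}, sycl::range<1>{kBlock}},
        [=](sycl::nd_item<1> it) {
          size_t lid = it.get_local_id(0);

          // Sequential read from global memory into local memory.
          scratch[lid] = in[lid];
          sycl::group_barrier(it.get_group());

          // Non-sequential access served from on-chip memory instead of
          // global memory: each work-item reads the mirrored element.
          out[lid] = scratch[kBlock - 1 - lid];
        });
  });
}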

Store Variables and Arrays in Private Memory

The Intel® oneAPI DPC++/C++ Compiler implements private memory using FPGA registers in the kernel datapath, block RAMs, or MLABs.

Aggregate data types are also implemented in registers if the array-access indices are compile-time constants. Typically, private memory is useful for storing single variables or small arrays. Otherwise, the compiler uses block RAMs or MLABs.

Registers are plentiful hardware resources in FPGAs, so use them for private memory instead of other memory types whenever possible. If a variable is implemented in registers, it can be accessed in parallel across the datapath, because each stage of the pipeline has its own copy of the data.
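The following is a minimal sketch (not taken from this guide) of a small private array whose accesses all use compile-time constant indices after the inner loops are fully unrolled, allowing the compiler to implement the array in registers rather than block RAM. The kernel name, window size, and moving-sum computation are illustrative assumptions.

#include <sycl/sycl.hpp>

constexpr int kTaps = 4;

void MovingSum(sycl::queue &q, sycl::buffer<float, 1> &in_buf,
               sycl::buffer<float, 1> &out_buf, size_t n) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor in{in_buf, h, sycl::read_only};
    sycl::accessor out{out_buf, h, sycl::write_only, sycl::no_init};
    h.single_task<class RegisterArrayKernel>([=]() {
      // Small private array, zero-initialized. Every access below has a
      // constant index once the unrolled loops are expanded, so the array
      // can live entirely in registers.
      float window[kTaps] = {};

      for (size_t i = 0; i < n; i++) {
        // Shift the window by one element (fully unrolled).
        #pragma unroll
        for (int t = kTaps - 1; t > 0; t--)
          window[t] = window[t - 1];
        window[0] = in[i];

        // Sum the window (fully unrolled).
        float sum = 0.0f;
        #pragma unroll
        for (int t = 0; t < kTaps; t++)
          sum += window[t];
        out[i] = sum;
      }
    });
  });
}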

For more information on the implementation of private memory using registers, refer to Inferring a Shift Register.