Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 12/16/2022
Public

Improve Loop Performance by Caching On-Chip Memory

In SYCL* task kernels for FPGA, the main objective is to achieve an initiation interval (II) of 1 on performance-critical loops. This means that a new loop iteration is launched on every clock cycle, thereby maximizing the loop's throughput. When the loop contains a loop-carried variable implemented in on-chip memory, the Intel® oneAPI DPC++/C++ Compiler often cannot achieve II=1 because the memory access takes more than one clock cycle. If the updated memory location is necessary on the next loop iteration, the next iteration must be delayed to allow time for the update, hence II > 1.
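As a sketch of the problem, consider the following hypothetical loop (plain C++ for illustration; the function and names are not from the sample). In hardware, the counts array would live in on-chip RAM, and its read-modify-write takes more than one clock cycle, so consecutive iterations that touch the same location cannot launch every cycle:

```cpp
#include <cstdint>

// Illustrative only: a loop with a loop-carried dependency through memory.
// In an FPGA kernel, 'counts' would be implemented in on-chip memory.
constexpr int kBins = 64;

void CountValues(const uint8_t* data, int n, uint32_t counts[kBins]) {
  for (int i = 0; i < n; i++) {
    int bucket = data[i] % kBins;
    // Load, increment, store: if iteration i+1 needs the same bucket,
    // it must wait for this store to complete, forcing II > 1.
    counts[bucket] = counts[bucket] + 1;
  }
}
```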

The on-chip memory cache technique breaks this dependency by storing recently-accessed values in a cache capable of a one-cycle read-modify-write operation. The cache is implemented in FPGA registers rather than on-chip memory. By pulling memory accesses preferentially from the register cache, the loop-carried dependency is broken.

When is the On-chip Memory Cache Technique Applicable?

You can apply the on-chip memory cache technique in the following situations:

  • Failure to achieve II=1 because of a loop-carried memory dependency in on-chip memory

    The on-chip memory cache technique is applicable if the compiler could not pipeline a loop with II=1 because of an on-chip memory dependency. If the compiler could not achieve II=1 because of a global memory dependency, this technique does not apply as the access latencies are too great.

    To check this for a given design, view the Loop Analysis report in the design's optimization report. The Loop Analysis report lists the II of all loops and explains why a lower II is not achievable. Check whether the reason given resembles "the compiler failed to schedule this loop with smaller II due to memory dependency". The report describes the most critical loop feedback path during scheduling. Check whether this path includes on-chip memory load/store operations.

  • An II=1 loop with a load operation of latency 1

    The compiler is capable of reducing the latency of on-chip memory accesses to achieve II=1. In doing so, the compiler makes a trade-off by sacrificing fMAX to improve the II.

    In a design whose critical loops achieve II=1 but whose fMAX is lower than desired, the on-chip memory cache technique might still be applicable. It can help recover fMAX by enabling the compiler to achieve II=1 with a higher-latency memory access. To check whether this is the case for a given design, view the Kernel Memory Viewer report in the design's optimization report. Select the desired on-chip memory from the Kernel Memory List and mouse over its load operation (LD) to check the latency. A load latency of 1 is a clear sign that the compiler has sacrificed fMAX to improve the loop II.

Implement the On-chip Memory Cache Technique

Consider the FPGA design example in onchip_memory_cache.cpp, which demonstrates the technique using a program that computes a histogram. The histogram operation accepts an input vector of values, separates the values into buckets, and counts the number of values per bucket. For each input value, an output bucket location is determined, and the count for the bucket is incremented. This count is stored in the on-chip memory, and the increment operation requires reading from memory, performing the increment, and storing the result. This read-modify-write operation is the critical path that can result in II > 1.

To reduce II, the idea is to store recently-accessed values in an FPGA register-implemented cache that is capable of a one-cycle read-modify-write operation. If the memory location required on a given iteration exists in the cache, it is pulled from there. The updated count is written back to both the cache and the on-chip memory. The ivdep attribute is added to inform the compiler that if a loop-carried variable (namely, the variable storing the histogram output) is required within CACHE_DEPTH iterations, it is guaranteed to be available right away.
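The pattern can be sketched in plain C++ as a functional model (the constants, names, and loop body below are illustrative, not the exact code of onchip_memory_cache.cpp; in the real sample this loop runs inside a SYCL task kernel with the ivdep attribute applied to the histogram memory):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Functional model of the on-chip memory cache (shift-register) pattern.
// CACHE_DEPTH must roughly cover the on-chip memory load latency.
constexpr int CACHE_DEPTH = 5;
constexpr int NUM_BINS = 256;

std::vector<uint32_t> Histogram(const std::vector<uint32_t>& input) {
  // 'hist' models the on-chip memory holding the loop-carried counts.
  std::vector<uint32_t> hist(NUM_BINS, 0);

  // Register-implemented cache of the CACHE_DEPTH most recently
  // written bucket indices and counts.
  std::array<uint32_t, CACHE_DEPTH> cache_val{};
  std::array<int, CACHE_DEPTH> cache_idx;
  cache_idx.fill(-1);  // -1 marks an empty cache slot

  // In the SYCL kernel, ivdep tells the compiler that any access to
  // 'hist' within CACHE_DEPTH iterations is served by the cache.
  for (uint32_t v : input) {
    int bucket = static_cast<int>(v % NUM_BINS);

    // Read from on-chip memory, then override with the newest cached
    // value, if any (slot 0 is newest, so it is checked last).
    uint32_t count = hist[bucket];
    for (int i = CACHE_DEPTH - 1; i >= 0; i--) {
      if (cache_idx[i] == bucket) count = cache_val[i];
    }
    count++;

    // Shift the cache, insert the updated entry at the front, and
    // write the result back to the on-chip memory as well.
    for (int i = CACHE_DEPTH - 1; i > 0; i--) {
      cache_val[i] = cache_val[i - 1];
      cache_idx[i] = cache_idx[i - 1];
    }
    cache_val[0] = count;
    cache_idx[0] = bucket;
    hist[bucket] = count;
  }
  return hist;
}
```

Because the cache shift and compare operations map to registers, the read-modify-write of a recently touched bucket completes in one cycle, which is what allows the compiler to pipeline the loop with II=1.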

Select the Cache Depth

While any value of CACHE_DEPTH results in functional hardware, the ideal value of CACHE_DEPTH requires some experimentation. The depth of the cache must roughly cover the latency of the on-chip memory access. To determine the correct value, Intel® recommends starting with a value of 2 and increasing it until the loop achieves both II=1 and a load latency greater than 1. In the onchip_memory_cache.cpp example, a CACHE_DEPTH of 5 is necessary. It is important to find the minimal value of CACHE_DEPTH that achieves this result: unnecessarily large values of CACHE_DEPTH consume FPGA resources and can reduce fMAX. Therefore, once a CACHE_DEPTH results in II=1 and a load latency greater than 1, do not increase CACHE_DEPTH any further.

NOTE:

For additional information, refer to the FPGA tutorial sample Onchip Memory Cache listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample in GitHub.