Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 9/26/2022
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

3.6.4. When to Use Each LSU

You can decide between different LSUs to use either based on what you know about the access patterns of your load/store site or on your silicon area requirements. The following are the LSU styles in an increasing order of their area requirements:
  1. Pipelined LSU (load/store): It is area efficient but it can be slower than other LSUs. You should use this LSU if you are constricted on area or if your access patterns are not necessarily consecutive.
  2. Prefetching LSU (only for loads): It is also area efficient but it is perfect for fully consecutive access patterns. There is a throughput penalty for using it for non-consecutive access patterns, so, use it only if you know that the addresses accessed are strictly consecutive.
  3. Burst-coalesced LSU (load/store): It is expensive in area but can process consecutive access patterns very efficiently. There is an area penalty for checking whether the access patterns are consecutive or not. The LSU dynamically attempts to combine several kernel requests into one big burst spanning multiple memory words, if possible.
  4. Burst-coalesced cached LSU (only for loads): It is the most expensive in area because it contains an extra cache that is local to the LSU. It can help the throughput in cases where you intend to read the same location in memory multiple times, especially across multiple ND-range threads.