Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 7/13/2023
Public

Load-Store Unit Styles

The Intel® oneAPI DPC++/C++ Compiler generates different styles of load-store units (LSUs) based on:

  • Inferred memory access pattern
  • Types of memory available on the target platform
  • Whether the memory accesses target local or global memory

The Intel® oneAPI DPC++/C++ Compiler can generate the following styles of LSUs:

Burst-Coalesced Load-Store Units

A burst-coalesced LSU is the default LSU style instantiated by the compiler for accessing global memory. It buffers contiguous memory requests until the largest possible burst can be made. For noncontiguous memory requests, a burst-coalesced LSU flushes the buffer between requests.

Although a burst-coalesced LSU provides efficient, variable-latency access to global memory, it requires a considerable amount of FPGA resources.
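To make the buffering behavior concrete, the following is a minimal host-side C++ sketch, not compiler or hardware code. It assumes a simplified model in which requests are word addresses, consecutive addresses are merged into one burst up to a hypothetical limit `max_burst_words`, and any noncontiguous request flushes the buffer. The name `CountBursts` and the burst limit are illustrative assumptions; actual burst sizes depend on the target platform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of burst coalescing (an illustration only).
// Returns how many bursts are issued for a sequence of word addresses.
std::size_t CountBursts(const std::vector<std::size_t>& addresses,
                        std::size_t max_burst_words) {
  assert(max_burst_words > 0);
  if (addresses.empty()) return 0;

  std::size_t bursts = 0;   // bursts already flushed
  std::size_t pending = 0;  // requests currently held in the buffer
  std::size_t prev = 0;     // address of the most recently buffered request

  for (std::size_t addr : addresses) {
    const bool contiguous = pending > 0 && addr == prev + 1;
    // Flush when the next request breaks the run or the burst is full.
    if (pending > 0 && (!contiguous || pending == max_burst_words)) {
      ++bursts;
      pending = 0;
    }
    ++pending;
    prev = addr;
  }
  return bursts + 1;  // final flush of the remaining buffered requests
}
```

Under this model, eight consecutive addresses with a limit of four words produce two bursts, while four scattered addresses produce four bursts, one per request.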

The following example code results in the compiler instantiating burst-coalesced LSUs:

cgh.single_task<class Kernel>([=] {
  int x = input_accessor[RandomIndex];  // burst-coalesced
  output_accessor[0] = x;
});

Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU in the following ways:

Prefetching Load-Store Units

A prefetching LSU instantiates a FIFO and burst-reads large blocks of memory to keep that FIFO full of valid data. It predicts the next address from the previous one and assumes that reads are contiguous. Noncontiguous reads are supported, but they incur a penalty because the FIFO must be flushed and refilled. A prefetching LSU is inferred only for nonvolatile global pointers.

The following example code results in the compiler instantiating prefetching LSUs to access global memory:

cgh.single_task<class Kernel>([=] {
  int x = 1;
  for (int i = 0; i < VectorSize; i++) {
    x = x + input_accessor[i];  // prefetching
  }
  output_accessor[0] = x;
});
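The cost of noncontiguous reads on a prefetching LSU can be illustrated with a small host-side C++ sketch. This is a simplified model, not the hardware implementation: it assumes the FIFO always predicts that the next read address is the previous address plus one, and that each misprediction costs a fixed, hypothetical `refill_penalty_cycles`. The function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: counts how often the FIFO is flushed and refilled.
// Assumption: the predicted next address is always previous + 1.
std::size_t CountPrefetchFlushes(const std::vector<std::size_t>& reads) {
  std::size_t flushes = 0;
  for (std::size_t i = 1; i < reads.size(); ++i) {
    if (reads[i] != reads[i - 1] + 1) ++flushes;  // prediction missed
  }
  return flushes;
}

// Rough cycle estimate: one cycle per read plus a fixed penalty per flush.
// The penalty value is an assumption supplied by the caller.
std::size_t EstimateReadCycles(const std::vector<std::size_t>& reads,
                               std::size_t refill_penalty_cycles) {
  return reads.size() + CountPrefetchFlushes(reads) * refill_penalty_cycles;
}
```

In this model, the sequential loop in the example above never flushes the FIFO, whereas a sequence such as 0, 1, 7, 8, 2 triggers two flushes.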

Pipelined Load-Store Units

A pipelined LSU is used for accessing local memory. It submits each memory request as soon as the request is received. Because memory accesses are pipelined, multiple requests can be in flight at the same time. If there is no arbitration between the LSU and the local memory, the compiler creates a pipelined never-stall LSU.

The following example code results in the compiler instantiating a pipelined LSU to access local memory:

cgh.single_task<class Kernel>([=] {
  const unsigned LMEM_SIZE = 128;
  int lmem[LMEM_SIZE];
  for (int i = 0; i < LMEM_SIZE; i++) {
    lmem[i] = i * 100;  // pipelined
  }
  output_accessor[0] = lmem[input_accessor[0]];
});
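The benefit of keeping multiple requests in flight can be shown with a short worked calculation in C++. This sketch assumes an idealized memory with a fixed latency, one request issued per cycle, and no stalls; the function names and the latency value are hypothetical and do not describe any specific device.

```cpp
#include <cassert>
#include <cstdint>

// Cycles to complete `requests` accesses when a new request is issued
// every cycle and each one takes `latency` cycles (requests overlap).
std::uint64_t PipelinedCycles(std::uint64_t requests, std::uint64_t latency) {
  assert(latency > 0);
  return requests == 0 ? 0 : latency + (requests - 1);
}

// Cycles when each access must finish before the next one starts.
std::uint64_t SerializedCycles(std::uint64_t requests, std::uint64_t latency) {
  assert(latency > 0);
  return requests * latency;
}
```

For the 128 stores in the example above and an assumed latency of 4 cycles, the pipelined estimate is 131 cycles, compared with 512 cycles if the accesses were serialized.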

The compiler might convert a local pipelined LSU into a never-stall LSU. For more details, refer to Never-stall.

The Intel® oneAPI DPC++/C++ Compiler may also infer a pipelined LSU for global memory accesses that it can prove to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU styles. Although a pipelined LSU might have lower throughput, the tradeoff is acceptable because the memory accesses are infrequent.