Load-Store Unit Styles
The Intel® oneAPI DPC++/C++ Compiler generates different styles of load-store units (LSUs) based on:
- Inferred memory access pattern
- Types of memory available on the target platform
- Whether the memory accesses are to the local or global memory
The Intel® oneAPI DPC++/C++ Compiler can generate the following styles of LSUs:
Burst-Coalesced Load-Store Units
A burst-coalesced LSU is the default LSU style instantiated by the compiler for accessing global memory. It buffers contiguous memory requests until the largest possible burst can be made. For noncontiguous memory requests, a burst-coalesced LSU flushes the buffer between requests.
While a burst-coalesced LSU provides efficient, variable-latency access to global memory, it requires a considerable amount of FPGA resources.
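The buffering behavior described above can be illustrated with a small host-side model. This is a conceptual sketch in plain C++, not the compiler's hardware implementation: the function name `CountBursts` and the `max_burst_words` limit are assumptions introduced for illustration. It counts the bursts issued for a sequence of word addresses, flushing whenever the next address is not contiguous or the assumed burst limit is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual software model only (hypothetical helper, not a compiler API):
// counts how many bursts a buffer that coalesces contiguous word addresses
// would issue. max_burst_words is an assumed limit, not a documented setting.
std::size_t CountBursts(const std::vector<std::size_t>& word_addresses,
                        std::size_t max_burst_words) {
  assert(max_burst_words > 0);
  std::size_t bursts = 0;
  std::size_t pending = 0;        // requests currently buffered
  std::size_t next_expected = 0;  // address that would extend the burst
  for (std::size_t addr : word_addresses) {
    const bool contiguous = pending > 0 && addr == next_expected;
    if (pending > 0 && (!contiguous || pending == max_burst_words)) {
      ++bursts;  // flush the buffered requests as one burst
      pending = 0;
    }
    ++pending;
    next_expected = addr + 1;
  }
  if (pending > 0) {
    ++bursts;  // flush whatever remains in the buffer
  }
  return bursts;
}
```

Under this model, sixteen contiguous addresses with an assumed limit of eight words produce two bursts, while four scattered addresses produce four bursts, one per request.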
The following example code results in the compiler instantiating burst-coalesced LSUs:
cgh.single_task<class Kernel>([=] {
  int x = input_accessor[RandomIndex];  // burst-coalesced
  output_accessor[0] = x;
});
Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU.
Prefetching Load-Store Units
A prefetching LSU instantiates a FIFO and burst-reads large blocks of memory to keep that FIFO full of valid data. It predicts each address from the previous one, assuming that reads are contiguous. Noncontiguous reads are supported, but they incur a penalty because the FIFO must be flushed and refilled. The compiler infers a prefetching LSU only for nonvolatile global pointers.
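The cost of noncontiguous reads can be sketched with a simple host-side model. This is an illustration in plain C++ under stated assumptions, not the hardware's actual logic: the helper name `CountFifoFlushes` is hypothetical, and the model assumes the prefetcher always expects the previous address plus one.

```cpp
#include <cstddef>
#include <vector>

// Conceptual model only (hypothetical helper): counts how often a prefetcher
// that assumes "next address = previous address + 1" would have to flush
// and refill its FIFO for the given sequence of read addresses.
std::size_t CountFifoFlushes(const std::vector<std::size_t>& read_addresses) {
  std::size_t flushes = 0;
  for (std::size_t i = 1; i < read_addresses.size(); ++i) {
    if (read_addresses[i] != read_addresses[i - 1] + 1) {
      ++flushes;  // prediction missed: discard prefetched data and refill
    }
  }
  return flushes;
}
```

Under this model, a sequential loop that reads addresses 0, 1, 2, 3 never flushes, while the sequence 0, 1, 10, 11, 3 flushes twice.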
The following example code results in the compiler instantiating prefetching LSUs to access global memory:
cgh.single_task<class Kernel>([=] {
  int x = 1;
  for (int i = 0; i < VectorSize; i++) {
    x = x + input_accessor[i];  // prefetching
  }
  output_accessor[0] = x;
});
Pipelined Load-Store Units
A pipelined LSU is used for accessing local memory. It submits each memory request as soon as the request is received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a pipelined never-stall LSU is created.
The following example code results in the compiler instantiating a pipelined LSU to access local memory:
cgh.single_task<class Kernel>([=] {
  const unsigned LMEM_SIZE = 128;
  int lmem[LMEM_SIZE];
  for (int i = 0; i < LMEM_SIZE; i++) {
    lmem[i] = i * 100;  // pipelined
  }
  output_accessor[0] = lmem[input_accessor[0]];
});
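The benefit of having multiple requests in flight can be estimated with simple arithmetic. The sketch below is an idealized model, not a description of the generated hardware: the function names are hypothetical, and it assumes a fixed memory latency, one request issued per cycle, and no stalls.

```cpp
#include <cstdint>

// Idealized model (hypothetical helpers): cycles to complete n requests
// when each request waits for the previous one to finish.
constexpr std::uint64_t SerializedCycles(std::uint64_t n,
                                         std::uint64_t latency) {
  return n * latency;
}

// Idealized model: cycles to complete n requests when one request is issued
// every cycle and many are in flight. The first result arrives after
// `latency` cycles and each later result arrives one cycle after the last.
constexpr std::uint64_t PipelinedCycles(std::uint64_t n,
                                        std::uint64_t latency) {
  return n == 0 ? 0 : latency + (n - 1);
}
```

With an assumed latency of 4 cycles, the 128 stores in the loop above take 131 cycles when pipelined and 512 cycles when serialized under this model.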
The compiler might convert a pipelined LSU that accesses local memory into a never-stall LSU. For more details, refer to Never-stall.
The Intel® oneAPI DPC++/C++ Compiler may also infer a pipelined LSU for global memory accesses that it can prove to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU styles. Although a pipelined LSU might have lower throughput, the tradeoff is acceptable because the accesses are infrequent.
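As a reading aid, the rules stated in this section can be condensed into a small lookup function. This is only a summary of the text above, not the compiler's actual heuristic, which considers more than these inputs: the `AccessTraits` structure, its fields, and `LikelyLsuStyle` are hypothetical names introduced for illustration.

```cpp
#include <string>

enum class MemorySpace { kLocal, kGlobal };

// Hypothetical description of one memory access site in a kernel.
struct AccessTraits {
  MemorySpace space;
  bool is_load;
  bool contiguous;
  bool is_volatile;
  bool provably_infrequent;
};

// Summarizes the styles described in this section. The real compiler may
// choose differently; treat the result as a first guess to check against
// the compiler's reports.
std::string LikelyLsuStyle(const AccessTraits& a) {
  if (a.space == MemorySpace::kLocal) return "pipelined";
  if (a.provably_infrequent) return "pipelined";
  if (a.is_load && a.contiguous && !a.is_volatile) return "prefetching";
  return "burst-coalesced";  // default style for global memory
}
```

For example, a random-index global load maps to "burst-coalesced" in this summary, while a sequential nonvolatile global load maps to "prefetching".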