Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide
A newer version of this document is available. Customers should click here to go to the newest version.
Visible to Intel only — GUID: ase1566243064294
Ixiasoft
Visible to Intel only — GUID: ase1566243064294
Ixiasoft
3.6.1. Load-Store Unit Types
- Burst-Coalesced Load-Store Units
- Prefetching Load-Store Units
- Pipelined Load-Store Units
- Constant-Pipelined Load-Store Units
- Atomic-Pipelined Load-Store Units
Burst-Coalesced Load-Store Units
A burst-coalesced LSU is the default LSU type instantiated by the compiler for accessing global memory. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.
kernel void burst_coalesced (global int * restrict in, global int * restrict out) { int i = get_global_id(0); int value = in[i/2]; // Burst-coalesced LSU out[i] = value; }
Prefetching Load-Store Units
A prefetching LSU instantiates a FIFO which burst reads large blocks from memory to keep the FIFO full of valid data based on the previous address and assuming contiguous reads. Non-contiguous reads are supported, but a penalty is incurred to flush and refill the FIFO. A prefetching LSU is inferred only for non-volatile global pointers.
kernel void prefetching (global int * restrict in, global int * restrict out, int N) { int res = 1; for (int i = 0; i < N; i++) { int v = in[i]; // Prefetching LSU res ^= v; } out[0] = res; }
Pipelined Load-Store Units
A pipelined LSU is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a pipelined never-stall LSU is created.
__attribute((reqd_work_group_size(1024,1,1))) kernel void local_pipelined (global int* restrict in, global int* restrict out) { local int lmem[1024]; int gi = get_global_id(0); int li = get_local_id(0); int res = in[gi]; for (int i = 0; i < 4; i++) { lmem[li - i] = res; // pipelined LSU res >>= 1; } barrier(CLK_GLOBAL_MEM_FENCE); res = 0; for (int i = 0; i < 4; i++) { res ^= lmem[li - i]; // pipelined LSU } out[gi] = res; }
The compiler may also infer a pipelined LSU for global memory accesses that can be proven to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because memory accesses are infrequent.
kernel void global_infrequent (global int * restrict in, global int * restrict out, int N) { int a = 0; if (get_global_id(0) == 0) a = in[0]; // Pipelined LSU for (int i = 0; i < N; i++) { out[i] = in[i] + a; } }
Constant-Pipelined Load-Store Units
A constant-pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant-pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.
kernel void constant_pipelined (constant int *src, global int *dst) { int i = get_global_id(0); dst[i] = src[i]; // Constant pipelined LSU }
For information about the constant cache, see Constant Cache Memory.
Atomic-Pipelined Load-Store Units
An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.
kernel void atomic_pipelined (global int* restrict out) { atomic_add(&out[0], 1); // Atomic LSU }