Visible to Intel only — GUID: dsq1517938475560
Ixiasoft
Visible to Intel only — GUID: dsq1517938475560
Ixiasoft
2.8.7. Load-Store Units
The generates a number of different types of load-store units (LSUs). For some types of LSU, the compiler might modify the LSU behavior and properties depending on the memory access pattern and other memory attributes.
While you cannot explicitly choose the load-store unit type or modifier, you can affect the type of LSU the compiler instantiates by changing the memory access pattern in your code, the types of memory available, and whether the memory accesses are to local or global memory.
Load-Store Unit Types
Burst-Coalesced Load-Store Units
A burst-coalesced LSU is the default LSU type instantiated by the compiler. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.
kernel void burst_coalesced (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int value = in[i/2]; // Burst-coalesced LSU
out[i] = value;
}
Prefetching Load-Store Units
A prefetching LSU instantiates a FIFO (sometimes called a named pipe) which burst reads large blocks from memory to keep the FIFO full of valid data based on the previous address and assuming contiguous reads. Non-contiguous reads are supported, but a penalty is incurred to flush and refill the FIFO.
kernel void prefetching (global int * restrict in,
global int * restrict out,
int N) {
int res = 1;
for (int i = 0; i < N; i++) {
int v = in[i]; // Prefetching LSU
res ^= v;
}
out[0] = res;
}
Streaming Load-Store Units
A streaming LSU instantiates a FIFO which burst reads large blocks from memory to keep the FIFO full of valid data. This block of data can be used only if memory accesses are in-order, and addresses can be calculated as a simple offset from the base address.
kernel void streaming (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int idx = out[i]; // Streaming LSU
int cached_value = in[idx];
out[i] = cached_value; // Streaming LSU
}
Semi-Streaming Load-Store Units
A semi-streaming LSU instantiates a read-only cache. The cache will have an area overhead, but will provide improved performance in cases where you make repeated accesses to the same data location in the global memory. You must ensure that your data is not overwritten by a store within the kernel, as that would break the coherency of the cache. The LSU cache is flushed each time the associated kernels are started.
#define N 16
kernel void semi_streaming (global int * restrict in,
global int * restrict out) {
#pragma unroll 1
for (int i = 0; i < N; i++) {
int value = in[i]; // Semi-streaming LSU
out[i] = value;
}
}
Local-Pipelined Load-Store Units
A local-pipelined LSU is a pipelined LSU that is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a local-pipelined never-stall LSU is created.
__attribute((reqd_work_group_size(1024,1,1)))
kernel void local_pipelined (global int* restrict in,
global int* restrict out) {
local int lmem[1024];
int gi = get_global_id(0);
int li = get_local_id(0);
int res = in[gi];
for (int i = 0; i < 4; i++) {
lmem[li - i] = res; // Local-pipelined LSU
res >>= 1;
}
barrier(CLK_GLOBAL_MEM_FENCE);
res = 0;
for (int i = 0; i < 4; i++) {
res ^= lmem[li - i]; // Local-pipelined LSU
}
out[gi] = res;
}
Global Infrequent Load-Store Units
A global infrequent LSU is a pipelined LSU that is used for global memory accesses that can be proven to be infrequent. The global infrequent LSU is instantiated only for memory operations that are not contained in a loop, and are active only for a single thread in an NDRange kernel.
The compiler implements a global infrequent LSU as pipelined LSU because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because the memory accesses are infrequent.
kernel void global_infrequent (global int * restrict in,
global int * restrict out,
int N) {
int a = 0;
if (get_global_id(0) == 0)
a = in[0]; // Global Infrequent LSU
for (int i = 0; i < N; i++) {
out[i] = in[i] + a;
}
}
Constant-Pipelined Load-Store Units
A constant pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.
kernel void constant_pipelined (constant int *src,
global int *dst) {
int i = get_global_id(0);
dst[i] = src[i]; // Constant pipelined LSU
}
For information about the constant cache, see Constant Cache Memory.
Atomic-Pipelined Load-Store Units
An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.
kernel void atomic_pipelined (global int* restrict out) {
atomic_add(&out[0], 1); // Atomic LSU
}
Load-Store Unit Modifiers
Depending on the memory access pattern in your kernel, the compiler modifies some LSUs.
Cached
Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads even if the loads want the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The cache can be disabled by simplifying the access pattern or marking the pointer as volatile.
kernel void cached (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int idx = out[i];
int cached_value = in[idx]; // Burst-coalesced cached LSU
out[i] = cached_value;
}
Write-Acknowledge (write-ack)
Burst-coalesced store LSUs sometimes require a write-acknowledgment signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.
kernel void write_ack (global int * restrict in,
global int * restrict out,
int N) {
for (int i = 0; i < N; i++) {
if (i < 2)
out[i] = 0; // Burst-coalesced write-ack LSU
out[i] = in[i];
}
}
Nonaligned
When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.
kernel void non_aligned (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
// three loads are statically coalesced into one, creating a Burst-coalesced non-aligned LSU
int a1 = in[3*i+0];
int a2 = in[3*i+1];
int a3 = in[3*i+2];
// three stores statically coalesced into one
out[3*i+0] = a3;
out[3*i+1] = a2;
out[3*i+2] = a1;
}
Never-stall
If a local-pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.
In the following example, some of the 96-bit wide memory access span two memory words, which requires two full lines of data to be read from memory.
__attribute((reqd_work_group_size(1024,1,1)))
kernel void never_stall (global int* restrict in,
global int* restrict out,
int N) {
local int lmem[1024];
int gi = get_global_id(0);
int li = get_local_id(0);
lmem[li] = in[gi]; // Local-pipelined never-stall LSU
barrier(CLK_GLOBAL_MEM_FENCE);
out[gi] = lmem[li] ^ lmem[li + 1];
}