Visible to Intel only — GUID: hex1566241909270
Ixiasoft
Visible to Intel only — GUID: hex1566241909270
Ixiasoft
3.6.2. Load-Store Unit Modifiers
Cached
Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads even if the loads want the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The cache is inferred only for non-volatile global pointers.
kernel void cached (global int * restrict in,
global int * restrict out,
int N) {
int gid = get_global_id(0);
for (int i = 0; i < N; i++) {
out[N*gid + i] = in[i];
}
}
Write-Acknowledge (write-ack)
Burst-coalesced store LSUs sometimes require a write-acknowledgment signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.
kernel void write_ack (global int * restrict in,
global int * restrict out,
int N) {
for (int i = 0; i < N; i++) {
if (i < 2)
out[i] = 0; // Burst-coalesced write-ack LSU
out[i] = in[i];
}
}
Nonaligned
When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.
kernel void non_aligned (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
// Three loads are statically coalesced into one,
// creating a burst-coalesced non-aligned LSU.
int a1 = in[3*i+0];
int a2 = in[3*i+1];
int a3 = in[3*i+2];
// Three stores statically coalesced into one,
// creating a burst-coalesced non-aligned LSU.
out[3*i+0] = a3;
out[3*i+1] = a2;
out[3*i+2] = a1;
}
Never-stall
If a pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.
__attribute((reqd_work_group_size(1024,1,1)))
kernel void never_stall (global int* restrict in,
global int* restrict out,
int N) {
local int lmem[1024];
int gi = get_global_id(0);
int li = get_local_id(0);
lmem[li] = in[gi]; // Pipelined never-stall LSU
barrier(CLK_GLOBAL_MEM_FENCE);
out[gi] = lmem[li] ^ lmem[li + 1];
}