Visible to Intel only — GUID: mwh1391807503031
Ixiasoft
Visible to Intel only — GUID: mwh1391807503031
Ixiasoft
8.7. Static Memory Coalescing
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Consider the following vectorized kernel:
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
__global const float * restrict b,
__global float * restrict answer)
{
size_t gid = get_global_id(0);
answer[gid] = a[gid] + b[gid];
}
The OpenCL™ kernel performs four load operations that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the offline compiler coalesces the four loads into a single wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the offline compiler performs static memory coalescing automatically when it vectorizes the kernel, you should use wide vector loads and stores in your OpenCL code whenever possible to ensure efficient memory accesses. To implement static memory coalescing manually, you must write your code in such a way that a sequential access pattern can be identified at compilation time. The original kernel code shown in the figure above can benefit from static memory coalescing because all the indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
__kernel void test (__global float * restrict a,
__global float * restrict b,
__global float * restrict answer;
__global int * restrict offsets)
{
size_t gid = get_global_id(0);
answer[gid*4 + 0] = a[gid*4 + 0 + offsets[gid]] + b[gid*4 + 0];
answer[gid*4 + 1] = a[gid*4 + 1 + offsets[gid]] + b[gid*4 + 1];
answer[gid*4 + 2] = a[gid*4 + 2 + offsets[gid]] + b[gid*4 + 2];
answer[gid*4 + 3] = a[gid*4 + 3 + offsets[gid]] + b[gid*4 + 3];
}
The value offsets[gid] is unknown at compilation time. As a result, the offline compiler cannot statically coalesce the read accesses to buffer a.