Visible to Intel only — GUID: GUID-ADC78BDF-325C-4CB7-BF23-B7898230BB8B
Visible to Intel only — GUID: GUID-ADC78BDF-325C-4CB7-BF23-B7898230BB8B
Static Memory Coalescing
Static memory coalescing is an Intel® oneAPI DPC++/C++ Compiler optimization step that merges contiguous accesses to global memory into a single wide access. A similar optimization is applied to on-chip memory.
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Consider the following vectorized kernel:
q.submit([&](handler &cgh) {
accessor a(a_buf, cgh, read_only);
accessor b(b_buf, cgh, write_only, no_init);
cgh.single_task<class coalesced>([=]() {
#pragma unroll
for (int i = 0; i < 4; i++)
b[i] = a[i];
});
});
The kernel performs four load operations from buffer a that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the compiler coalesces the four loads into a single, wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the compiler performs static memory coalescing automatically, you should use wide vector loads and stores in your SYCL* code whenever possible to ensure efficient memory accesses.
To allow static memory coalescing, you must write your code in such a way that the compiler can identify a sequential access pattern during compilation. The original kernel code shown in the figure above can benefit from static memory coalescing because all indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
q.submit([&](handler &cgh) {
accessor a(a_buf, cgh, read_only);
accessor b(b_buf, cgh, write_only);
accessor offsets(offsets_buf, cgh, read_only);
cgh.single_task<class not_coalesced>([=]() {
#pragma unroll
for (int i = 0; i < 4; i++)
b[i] = a[offsets[i]];
});
});
The value offsets[i] is unknown at compilation time. As a result, the Intel® oneAPI DPC++/C++ Compiler cannot statically coalesce the read accesses to buffer a.
For more information, refer to Local and Private Memory Accesses Optimization.