Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 7/13/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Static Memory Coalescing

Static memory coalescing is an Intel® oneAPI DPC++/C++ Compiler optimization step that merges contiguous accesses to global memory into a single wide access. A similar optimization is applied to on-chip memory.

The figure below shows a common case where kernel performance might benefit from static memory coalescing:

Static Memory Coalescing

Consider the following vectorized kernel:

q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, write_only, no_init);
  cgh.single_task<class coalesced>([=]() {
    #pragma unroll
    for (int i = 0; i < 4; i++)
      b[i] = a[i];
  });
});

The kernel performs four load operations from buffer a that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the compiler coalesces the four loads into a single, wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.

Although the compiler performs static memory coalescing automatically, you should use wide vector loads and stores in your SYCL* code whenever possible to ensure efficient memory accesses.

To allow static memory coalescing, you must write your code in such a way that the compiler can identify a sequential access pattern during compilation. The original kernel code shown in the figure above can benefit from static memory coalescing because all indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:

q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, write_only);
  accessor offsets(offsets_buf, cgh, read_only);
  cgh.single_task<class not_coalesced>([=]() {
    #pragma unroll
    for (int i = 0; i < 4; i++)
      b[i] = a[offsets[i]];
  });
});

The value offsets[i] is unknown at compilation time. As a result, the Intel® oneAPI DPC++/C++ Compiler cannot statically coalesce the read accesses to buffer a.

For more information, refer to Local and Private Memory Accesses Optimization.