Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 10/04/2021
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

3.6.3. Controlling the Load-Store Units

The Intel® FPGA SDK for OpenCL™ Offline Compiler allows you to control the type of LSU that is being generated for global memory accesses via a set of built-in calls that you can use for loading from and storing to global memory.

Load Built-ins

The variations of the load built-in are summarized in the following table:

Table 10.  Load Built-ins
Built-in LSU Type Implemented
__pipelined_load() Pipelined if possible
__prefetching_load() Prefetching if possible
__burst_coalesced_load() Burst-coalesced
__burst_coalesced_cached_load() Burst-coalesced cached if possible

All variations expect the following arguments:

Table 11.  Load Built-in Arguments
Built-in Type Description
Argument #1 Pointer Memory location to load from.
Argument #2 Integer
  • Available only for __burst_coalesced_cached_load() function.
  • Describes the LSU cache size in bytes.
  • Non-negative compile-time constant integer.
Return value Object
  • Data that the pointer argument points to.
  • Same type as the base type of the pointer argument.

Store Built-ins

The variations of the store built-in are summarized in the following table:

Table 12.  Store Built-ins
Built-in LSU Type Implemented
__pipelined_store() Pipelined if possible
__burst_coalesced_store() Burst-coalesced

All variations expect the following arguments:

Table 13.  Store Built-in Arguments
Built-in Type Description
Argument #1 Pointer Memory location to store to.
Argument #2 Same as the pointer's base type Value to be stored.
Note: All variations of the store built-in are non-value-returning.

Example

Following is an OpenCL example depicting different variations of the load and the store built-ins:

kernel void oclTest(global int * restrict in, 
                    global int * restrict out) {
    int i = get_global_id(0);
  
    int a1 = __pipelined_load(in + 3*i+0); // Uses a pipelined LSU
    // Uses a burst-coalesced LSU with a cache of size 1024 bytes
    int a2 = __burst_coalesced_cached_load(&in[3*i+1], 1024);  
    int a3 = __prefetching_load(&in[3*i+2]); // Uses a prefetching LSU

    __burst_coalesced_store(&out[3*i+0], a3); // Uses a burst-coalesced LSU
}
Note:
  • The compiler does not allow you to select an LSU that may cause functionally incorrect results in the context in which it is being requested. For example, if you request a prefetching LSU on a volatile pointer, the compiler errors out. The compiler also errors out if caching is requested in a situation where the cache (which is local to the LSU) may become incoherent due to other LSUs writing to memory.
  • The prefetching LSU is not available on the Intel® Stratix® 10 device.