Applying Shared Local Memory

OpenCL™ Developer Guide for Intel® Processor Graphics

ID 773088

Date 3/20/2019

Version 2019.4

Intel® Graphics devices support Shared Local Memory (SLM), which is declared with the __local qualifier in OpenCL™. This type of memory is well suited for scatter operations that would otherwise be directed to global memory. Copy small table buffers, or any buffer data that is frequently reused, to SLM. Refer to the “Local Memory Consideration” section for more information.

An obvious way to populate SLM is a plain for loop over the whole table. However, this approach is inefficient because every single work-item executes the entire loop:

__kernel void foo_SLM_BAD(global int * table,
                        local int * slmTable /*256 entries*/)
{
        //initialize shared local memory (performed for each work-item!)
        for( uint index = 0;  index < 256;  index ++ )
                slmTable[index] = table[index];
        barrier(CLK_LOCAL_MEM_FENCE);
        // ... remainder of the kernel, which reads slmTable ...
}

Every work-item in the work-group copies all 256 entries, so the same table is written to SLM over and over again.
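
To put a number on that redundancy, the following is a minimal host-side sketch in plain C (not OpenCL C) that emulates one work-group running the loop from foo_SLM_BAD and counts the copies. The function name count_copies_naive and the work-group size of 64 used in the note below are assumptions made for illustration; only the table size of 256 comes from the kernel.

```c
#include <assert.h>

/* Hypothetical host-side emulation, not device code: counts how many
 * table entries one work-group copies when every work-item runs the
 * full loop, as foo_SLM_BAD does. */
unsigned count_copies_naive(unsigned local_size, unsigned table_size)
{
    unsigned copies = 0;
    assert(local_size > 0);
    for (unsigned lid = 0; lid < local_size; ++lid)            /* each work-item */
        for (unsigned index = 0; index < table_size; ++index)  /* full table     */
            ++copies;               /* stands for slmTable[index] = table[index] */
    return copies;
}
```

With an assumed work-group of 64 work-items, count_copies_naive(64, 256) returns 16384, which is 64 times the 256 copies that are actually needed.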

A better approach keeps the for loop, but starts it at the local ID of the current work-item and uses the work-group size as the stride, so that the work-items of a work-group share the copy between them:

__kernel void foo_SLM_GOOD(global int * table,
                        local int * slmTable /*256 entries*/)
{
        //initialize shared local memory (each work-item copies a strided subset)
        uint  lidx = get_local_id(0);
        uint  size_x = get_local_size(0);
        for( uint index = lidx; index < 256; index += size_x )
                slmTable[index] = table[index];
        barrier(CLK_LOCAL_MEM_FENCE);
        // ... remainder of the kernel, which reads slmTable ...
}
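
The strided loop in foo_SLM_GOOD writes every table entry exactly once per work-group, whatever the work-group size. The sketch below checks this on the host in plain C by emulating all work-items in sequence. It is not device code, and the name strided_copy_covers_once is hypothetical; the 256-entry table size is taken from the kernel.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 256  /* matches the 256-entry table in foo_SLM_GOOD */

/* Hypothetical host-side emulation: runs the strided loop for every
 * work-item of one work-group and returns 1 if each table entry was
 * copied exactly once, 0 otherwise. */
int strided_copy_covers_once(unsigned local_size, const int *table, int *slm)
{
    unsigned writes[TABLE_SIZE];
    assert(local_size > 0);               /* a zero stride would never advance */
    memset(writes, 0, sizeof writes);
    for (unsigned lid = 0; lid < local_size; ++lid)
        for (unsigned index = lid; index < TABLE_SIZE; index += local_size) {
            slm[index] = table[index];
            ++writes[index];
        }
    for (unsigned i = 0; i < TABLE_SIZE; ++i)
        if (writes[i] != 1)
            return 0;
    return 1;
}
```

The check also holds when the work-group size does not divide 256 or exceeds it: in the second case the work-items with a local ID of 256 or more simply skip the loop.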

You can reduce the overhead of copying to SLM even further. When the number of SLM entries equals the number of work-items in the work-group, every work-item can copy exactly one table entry. Consider populating SLM this way:

__kernel void foo_SLM_BEST(global int * table,
                        local int * slmTable)
{
        //initialize shared local memory (one entry per work-item)
        int   lidx = get_local_id(0);
        int   lidy = get_local_id(1);
        int   index = lidx + lidy * get_local_size(0);
        slmTable[index] = table[index];
        barrier(CLK_LOCAL_MEM_FENCE);
        // ... remainder of the kernel, which reads slmTable ...
}
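
The index expression in foo_SLM_BEST flattens a two-dimensional local ID into a single table index. The host-side C sketch below mirrors that expression and emulates one work-group in which each work-item copies one entry. The function names and the 16 by 16 work-group shape mentioned in the note are assumptions for illustration, not part of the original kernel.

```c
#include <assert.h>

/* Mirrors `lidx + lidy * get_local_size(0)` from foo_SLM_BEST. */
unsigned linear_local_index(unsigned lidx, unsigned lidy, unsigned size_x)
{
    assert(lidx < size_x);
    return lidx + lidy * size_x;
}

/* Hypothetical host-side emulation of a size_x by size_y work-group:
 * every work-item copies exactly one entry. Both table and slm must
 * hold at least size_x * size_y entries. */
void copy_one_entry_each(const int *table, int *slm,
                         unsigned size_x, unsigned size_y)
{
    for (unsigned lidy = 0; lidy < size_y; ++lidy)
        for (unsigned lidx = 0; lidx < size_x; ++lidx) {
            unsigned index = linear_local_index(lidx, lidy, size_x);
            slm[index] = table[index];
        }
}
```

For an assumed 16 by 16 work-group, the indices run from 0 to 255 without gaps or repeats, so a 256-entry table is filled with one copy per work-item.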

If the table is smaller than the work-group size, you might use the “min” function, for example to clamp the index to the last table entry. If the table is bigger, you might use several lines of code that populate SLM at fixed offsets (in effect, unrolling the original for loop). If the table size is not known in advance, you can use a real for loop.
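
The three cases can be sketched as follows. This is plain C written from the point of view of a single work-item, with the local ID and work-group size passed in as parameters instead of coming from get_local_id and get_local_size. All function names are hypothetical. Clamping the index is one reading of the “min” suggestion, and the factor of four in the unrolled variant is an arbitrary assumption.

```c
#include <assert.h>

/* Table smaller than the work-group: clamp the index, which is what
 * min(lid, table_size - 1) would do. Surplus work-items redundantly
 * re-copy the last entry instead of reading past the table. */
void populate_clamped(const int *table, int *slm,
                      unsigned table_size, unsigned lid)
{
    assert(table_size > 0);
    unsigned index = lid < table_size - 1 ? lid : table_size - 1;
    slm[index] = table[index];
}

/* Table exactly four times the work-group size (assumed factor):
 * four copies at fixed offsets, i.e. the strided loop unrolled by hand. */
void populate_unrolled4(const int *table, int *slm,
                        unsigned local_size, unsigned lid)
{
    slm[lid]                  = table[lid];
    slm[lid + local_size]     = table[lid + local_size];
    slm[lid + 2 * local_size] = table[lid + 2 * local_size];
    slm[lid + 3 * local_size] = table[lid + 3 * local_size];
}

/* Table size known only at run time: the general strided loop. */
void populate_loop(const int *table, int *slm, unsigned table_size,
                   unsigned local_size, unsigned lid)
{
    assert(local_size > 0);
    for (unsigned index = lid; index < table_size; index += local_size)
        slm[index] = table[index];
}
```

In a kernel, each of these would still be followed by barrier(CLK_LOCAL_MEM_FENCE) before any work-item reads the table from SLM.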

Applying SLM can considerably improve data throughput on Intel Graphics, but it might slightly reduce performance on the CPU OpenCL device, so consider keeping a separate version of the kernel for the CPU.
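
One way to keep two versions is to put both kernels in the same program and choose between them by name on the host. The sketch below shows only that choice, in plain C. The enum, the function, and both kernel names are placeholders; real host code would obtain the device type by querying CL_DEVICE_TYPE with clGetDeviceInfo and then pass the chosen name to clCreateKernel.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical device classes standing in for the result of a
 * CL_DEVICE_TYPE query. */
typedef enum { DEVICE_CPU, DEVICE_GPU } device_class;

/* Returns the name of the kernel variant to create for the device.
 * Both names are placeholders: an SLM version for Intel Graphics and
 * a version that reads the table directly from global memory. */
const char *pick_kernel_name(device_class dev)
{
    return dev == DEVICE_GPU ? "foo_SLM" : "foo_noSLM";
}
```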
