Visible to Intel only — GUID: GUID-1CAC3729-7221-4570-99EB-959AE78FC1FC
Applying Shared Local Memory
Intel® Graphics devices support Shared Local Memory (SLM), which OpenCL™ exposes through the __local address space qualifier. SLM is well suited to scatter operations that would otherwise be directed to global memory. Copy small tables, or any buffer data that is reused frequently, to SLM. Refer to the “Local Memory Consideration” section for more information.
The obvious way to populate SLM is a for loop that copies the whole table. However, this approach is inefficient because every work-item executes the entire loop:
__kernel void foo_SLM_BAD(global int * table, local int * slmTable /*256 entries*/)
{
    //initialize shared local memory (performed for each work-item!)
    for( uint index = 0; index < 256; index ++ )
        slmTable[index] = table[index];
    //make the copied table visible to the whole work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... rest of the kernel, reading from slmTable ...
}
The code copies the table over and over again, for every single work-item.
A better approach keeps the for loop, but starts it at the local ID of the current work-item and advances it by the work-group size, so that the work-items of the work-group share the copy between them:
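__kernel void foo_SLM_GOOD(global int * table, local int * slmTable /*256 entries*/)
{
    //initialize shared local memory
    uint lidx = get_local_id(0);     //first entry copied by this work-item
    uint size_x = get_local_size(0); //stride: one work-group width
    for( uint index = lidx; index < 256; index += size_x )
        slmTable[index] = table[index];
    //make the copied table visible to the whole work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... rest of the kernel, reading from slmTable ...
}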
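To make the difference between the two loops concrete, the following is a minimal host-side sketch in plain C that emulates the work-items of one work-group sequentially and counts the copies each pattern performs. It is not OpenCL code: the function names `copies_naive` and `copies_strided` and the `hits` array are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the table entries copied by one work-group when every work-item
   runs the full loop (the foo_SLM_BAD pattern). */
size_t copies_naive(size_t table_size, size_t wg_size)
{
    size_t copies = 0;
    for (size_t lid = 0; lid < wg_size; ++lid)           /* one pass per work-item */
        for (size_t index = 0; index < table_size; ++index)
            ++copies;
    return copies;
}

/* Counts the copies for the strided loop (the foo_SLM_GOOD pattern).
   hits[] must hold table_size entries; on return, hits[i] is the number
   of times entry i was written. */
size_t copies_strided(size_t table_size, size_t wg_size, unsigned *hits)
{
    assert(wg_size > 0);
    size_t copies = 0;
    for (size_t i = 0; i < table_size; ++i)
        hits[i] = 0;
    for (size_t lid = 0; lid < wg_size; ++lid)           /* lid plays get_local_id(0) */
        for (size_t index = lid; index < table_size; index += wg_size) {
            ++hits[index];
            ++copies;
        }
    return copies;
}
```

For a 256-entry table and a work-group of 64 work-items, the first pattern performs 256 × 64 = 16384 copies, while the strided pattern performs 256 and writes each entry exactly once. The strided loop also covers every entry when the work-group size does not divide the table size evenly.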
You can reduce the copy overhead further. When the number of SLM entries equals the number of work-items in the work-group, every work-item can copy exactly one table entry. Consider populating SLM this way:
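__kernel void foo_SLM_BEST(global int * table, local int * slmTable)
{
    //initialize shared local memory: one table entry per work-item
    int lidx = get_local_id(0);
    int lidy = get_local_id(1);
    //flatten the 2D local ID into a linear table index
    int index = lidx + lidy * get_local_size(0);
    slmTable[index] = table[index];
    //make the copied table visible to the whole work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... rest of the kernel, reading from slmTable ...
}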
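The index arithmetic of the one-entry-per-work-item pattern can be checked with a small host-side sketch in plain C, which emulates a 2D work-group sequentially. The function name `populate_one_per_item` is a hypothetical helper, and the sketch assumes that the table and the SLM buffer both hold exactly `size_x * size_y` entries.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one 2D work-group of size_x * size_y work-items in which every
   work-item copies exactly one entry (the foo_SLM_BEST pattern).
   Assumption: table and slm both hold size_x * size_y entries. */
void populate_one_per_item(const int *table, int *slm,
                           size_t size_x, size_t size_y)
{
    assert(size_x > 0 && size_y > 0);
    for (size_t lidy = 0; lidy < size_y; ++lidy)         /* plays get_local_id(1) */
        for (size_t lidx = 0; lidx < size_x; ++lidx) {   /* plays get_local_id(0) */
            size_t index = lidx + lidy * size_x;         /* flattened local ID */
            slm[index] = table[index];
        }
}
```

Because the flattened index takes every value from 0 to `size_x * size_y - 1` exactly once, a 16 × 16 work-group fills a 256-entry table with a single copy per work-item and no loop in the kernel.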
If the table is smaller than the work-group size, you might use the “min” function, for example to clamp the index so that the surplus work-items stay within the table bounds. If the table is bigger, you might have several code lines that populate SLM at fixed offsets (which effectively unrolls the original for loop). If the table size is not known in advance, you can use a real for loop.
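The first two variants can be sketched on the host in plain C, again emulating the work-items of one work-group sequentially. This is one possible reading of the advice rather than a definitive implementation: the function names `populate_clamped` and `populate_two_offsets` are hypothetical, and the second function assumes a table exactly twice the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Table smaller than the work-group: clamp the index, as min() would in a
   kernel. Surplus work-items redundantly re-copy the last entry instead of
   reading past the end of the table. */
void populate_clamped(const int *table, int *slm,
                      size_t table_size, size_t wg_size)
{
    assert(table_size > 0);
    for (size_t lid = 0; lid < wg_size; ++lid) {
        size_t last = table_size - 1;
        size_t index = lid < last ? lid : last;          /* min(lid, last) */
        slm[index] = table[index];
    }
}

/* Table twice the work-group size (an assumption of this sketch): each
   work-item copies two entries at fixed offsets, which is the strided
   for loop unrolled by hand. */
void populate_two_offsets(const int *table, int *slm, size_t wg_size)
{
    for (size_t lid = 0; lid < wg_size; ++lid) {
        slm[lid] = table[lid];
        slm[lid + wg_size] = table[lid + wg_size];
    }
}
```

The clamped variant trades a few redundant copies of the last entry for branch-free code. When the table size is only known at run time, the strided for loop with the table size passed as a kernel argument remains the general solution.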
Applying SLM can considerably improve data throughput on Intel Graphics, but it might slightly reduce performance on the CPU OpenCL device, so consider keeping a separate version of the kernel for the CPU.