Visible to Intel only — GUID: GUID-CB77D598-349F-472A-B697-3F4DB51B4EF7
Visible to Intel only — GUID: GUID-CB77D598-349F-472A-B697-3F4DB51B4EF7
Avoid Handling Edge Conditions in Kernels
To understand how to avoid handling edge conditions in kernels, consider the following smoothing 2x2 filter:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); uint neighbors = 1; float sum = 0.0f; if ((myX + 1) < (image_width-1)) { sum += input[myY * image_width + (myX + 1)]; ++neighbors; } if (myX > 0) { sum += input[myY * image_width + (myX - 1)]; ++neighbors; } if ((myY + 1) < (image_height-1)) { sum += input[(myY + 1) * image_width + myX]; ++neighbors; } if (myY > 0) { sum += input[(myY – 1) * image_width + myX]; ++neighbors; } sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / (float)neighbors; }
Assume that you have a full HD image with size of 1920x1080 pixels. The four edge if conditions are tested for every pixel, that is, roughly two million times.
However, they are only relevant for the 6000 pixels on the image edges, which is 0.2% of the pixels. For the remaining 99.8% work-items, the edge condition check is a waste of time. See the following optimized code:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
This code requires padding (enlarging) the input buffer appropriately, while using the original global size. This way querying the neighbors for the border pixels does not result in buffer overrun. When you process several frames this way, do this padding exactly once (which means you initially allocate frames of the proper size), instead of copying each frame to larger buffer before processing, and then copying the result back to smaller (original size) buffer.
If padding through larger input is not possible, make sure you use the min or max built-ins, so checking a work-item does not access outside the actual image adds only four lines. Consider the following example:
__kernel void smooth(const __global float* input, __global float* output) { const int image_width = get_global_size(0); const int image_height = get_global_size(0); int myX = get_global_id(0); //since for myX== image_width–1 the (myX+1) is incorrect myX = min(myX, image_width -2); //since for myX==0 the (myX-1) is incorrect myX = max(myX, 1); int myY = get_global_id(1); //since for myY== image_height-1 the (myY+1) is incorrect myY = min(myY, image_height -2); //since for myY==0 the (myY-1) is incorrect myY = max(myY - 1, 0); float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
At a cost of duplicating calculations for border work-items, this code avoids testing for the edge conditions, which is otherwise needed to perform for all work-items.