Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public
Document Table of Contents

3.2. Unrolling Loops

You can control the way the translates OpenCL™ kernel descriptions to hardware resources. If your OpenCL kernel contains loop iterations, increase performance by unrolling the loop. Loop unrolling decreases the number of iterations that the offline compiler executes at the expense of increased hardware resource consumption.

Consider the OpenCL code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:

__kernel void example ( __global const int * restrict x,
                        __global int * restrict sum ) {
   int accum = 0;

   for (size_t i = 0; i < 4; i++) {
      accum += x[i + get_global_id(0) * 4];
   }

   sum[get_global_id(0)] = accum;
}

Notice the following three main operations that occur in this kernel:

  • Load operations from input x
  • Accumulation
  • Store operations to output sum

The offline compiler arranges these operations in a pipeline according to the data flow semantics of the OpenCL kernel code. For example, the offline compiler implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.

The OpenCL kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a #pragma unroll directive to the main loop, as shown in the code example below. Keep in mind loop unrolling significantly changes the structure of the compute unit that the offline compiler creates.

__kernel void example ( __global const int * restrict x,
                        __global int * restrict sum ) {
  int accum = 0;

  #pragma unroll
  for (size_t i = 0; i < 4; i++) {
    accum += x[i + get_global_id(0) * 4];
  }

  sum[get_global_id(0)] = accum;
}

In this example, the #pragma unroll directive causes the offline compiler to unroll the four iterations of the loop completely. To accomplish the unrolling, the offline compiler expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the sum elements every clock cycle after the completion of the initial load operations and additions. The offline compiler further optimizes this kernel by coalescing the four load operations so that the compute unit can load all the necessary input data to calculate a result in one load operation.

CAUTION:

Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the #pragma unroll directive whenever possible.

For example, if you compile a kernel that has a heavily-nested loop structure, wherein each loop includes a #pragma unroll directive, you might experience a long compilation time. The might fail to meet scheduling because it cannot unroll this nested loop structure easily, resulting in a high II. In this case, the offline compiler will issue the following error message along with the line number of the outermost loop:

Kernel <function> exceeded the Max II. The Kernel's resource usage is estimated to be much larger than FPGA capacity. It will perform poorly even if it fits. Reduce resource utilization of the kernel by reducing loop unroll factors within it (if any) or otherwise reduce amount of computation within the kernel.

Unrolling the loop and coalescing the load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle. In general, the methods you use to improve the performance of your OpenCL kernels should achieve the following results:

  • Increase the number of parallel operations
  • Increase the memory bandwidth of the implementation
  • Increase the number of operations per clock cycle that the kernels can perform in hardware

The offline compiler might not be able to unroll a loop completely under the following circumstances:

  • You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
  • You specify complete unrolling and the loop bounds are not constants.
  • The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).

For the last two cases listed above, the offline compiler issues the following warning:

Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.

To enable loop unrolling in these situations, specify the #pragma unroll <N> directive, where <N> is the unroll factor. The unroll factor limits the number of iterations that the offline compiler unrolls. For example, to prevent a loop in your kernel from unrolling, add the directive #pragma unroll 1 to that loop.

Refer to Good Design Practices for Single Work-Item Kernel for tips on constructing well-structured loops.