Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Unroll Loops

You can control the way the Intel® oneAPI DPC++/C++ Compiler translates SYCL kernel descriptions to hardware resources by unrolling loops. Loop unrolling decreases the number of iterations that the Intel® oneAPI DPC++/C++ Compiler executes at the expense of increased hardware resource consumption and increases performance. See also Unroll Loops

Consider the SYCL code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:

queue.submit([&](handler &cgh) {
  accessor x(x_buf, cgh, read_only);
  accessor sum(sum_buf, cgh, write_only);
  cgh.single_task<class unoptimzed>([=]() {
    int accum = 0;
    for (size_t i = 0; i < 4; i++) {
      accum += x[i + get_global_id(0) * 4];
    }
    sum[get_global_id(0)] = accum;
  });
});

Observe the following three main operations that occur in this kernel:

  • Load operations from input x
  • Accumulation
  • Store operations to output sum

The Intel® oneAPI DPC++/C++ Compiler arranges these operations in a pipeline according to the data flow semantics of the SYCL* kernel code. For example, the Intel® oneAPI DPC++/C++ Compiler implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.

The SYCL kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a #pragma unroll directive to the main loop, as shown in the following code example:

NOTE:

Loop unrolling significantly changes the structure of the compute unit that the Intel® oneAPI DPC++/C++ Compiler creates.

queue.submit([&](handler &cgh) {
  accessor x(x_buf, cgh, read_only);
  accessor sum(sum_buf, cgh, write_only);
  cgh.single_task<class unoptimzed>([=]() {
    int accum = 0;
   
    #pragma unroll
    for (size_t i = 0; i < 4; i++) {
      accum += x[i + get_global_id(0) * 4];
    }
    sum[get_global_id(0)] = accum;
  });
});

In this example, the #pragma unroll directive causes the Intel® oneAPI DPC++/C++ Compiler to unroll four iterations of the loop completely. To accomplish the unrolling, the Intel® oneAPI DPC++/C++ Compiler expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the sum elements in every clock cycle after the completion of the initial load operations and additions. The Intel® oneAPI DPC++/C++ Compiler further optimizes this kernel by coalescing the four load operations so that the compute unit can load all necessary input data to calculate a result in one load operation.

Factors to Consider for Loop Unrolling

  • Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the #pragma unroll directive whenever possible. For example, if you compile a kernel that has a heavily nested loop structure, wherein each loop includes a #pragma unroll directive, you might experience a long compilation time. The Intel® oneAPI DPC++/C++ Compiler might fail to meet scheduling because it cannot unroll this nested loop structure easily, resulting in a high II. In this case, the Intel® oneAPI DPC++/C++ Compiler issues the following error message along with the line number of the outermost loop:

    Kernel <function> exceeded the Max II. The Kernel's resource usage is estimated to be much larger than FPGA capacity. It will perform poorly 
    even if it fits. Reduce resource utilization of the kernel by reducing loop unroll factors within it (if any) or otherwise reduce amount of 
    computation within the kernel.

  • Unrolling the loop and coalescing load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle.
  • The Intel® oneAPI DPC++/C++ Compiler might not be able to unroll a loop completely under the following circumstances:
    • You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
    • You specify complete unrolling and the loop bounds are not constants.
    • The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).

    For the last two cases listed above, the Intel® oneAPI DPC++/C++ Compiler issues the following warning:

    Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.

    To enable loop unrolling in these situations, specify the #pragma unroll <N> directive, where <N> is the unroll factor. The unroll factor limits the number of iterations that the Intel® oneAPI DPC++/C++ Compiler unrolls. Refer to Single Work-item Kernel Design Guidelines for tips on constructing well-structured loops.

TIP:

For additional information, refer to the FPGA tutorial sample "Loop Unroll" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample on GitHub.