Developer Guide

Intel® oneAPI DPC++/C++ Compiler Handbook for FPGAs

ID 785441
Date 10/24/2024
Public

unroll Pragma

Loop unrolling involves replicating a loop body multiple times and reducing the trip count of a loop. Unroll loops to reduce or eliminate loop control overhead on the FPGA. In cases where there are no loop-carried dependencies and the Intel® oneAPI DPC++/C++ Compiler can perform loop iterations in parallel, unrolling loops can also reduce latency and overhead.

IMPORTANT:

Unrolling nested loops with large bounds might generate a huge number of instructions, which could lead to very long compile times.

The compiler might unroll simple loops even if no pragma annotates them. To direct the compiler to unroll a loop, or to explicitly not unroll a loop, insert an unroll pragma in the kernel code immediately before the loop. To specify an unroll factor N, use the optional unroll factor specifier: #pragma unroll <N>. For more information, see the "Determining the Correct Unroll Factor" section in the Unrolling Loops FPGA tutorial.

Syntax

#pragma unroll

#pragma unroll N

If you specify the unroll factor N, the factor must be a positive constant expression of integer type. If you omit the unroll factor, the compiler attempts to unroll the loop fully.

Examples

The following is an example of full loop unrolling:

// Before unrolling loop
#pragma unroll
for (int i = 0; i < 5; i++) {
  a[i] += 1;
}
// After fully unrolling the loop by a factor of 5,
// the loop is flattened. There is no loop after unrolling.
a[0] += 1;
a[1] += 1;
a[2] += 1;
a[3] += 1;
a[4] += 1;

You can observe that a full unroll is a special case where the unroll factor is equal to the number of loop iterations.
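The equivalence between the loop and its flattened form can be checked on the host with ordinary C++. The following is a minimal sketch, not FPGA kernel code: the function names increment_rolled, increment_flattened, and check_full_unroll are hypothetical, and the flattened version is written by hand to mirror what a full unroll of the five-iteration loop is described as producing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rolled form: the loop as it appears in the source, before unrolling.
std::array<int, 5> increment_rolled(std::array<int, 5> a) {
  for (std::size_t i = 0; i < a.size(); i++) {
    a[i] += 1;
  }
  return a;
}

// Flattened form: five straight-line statements and no loop control,
// written by hand to mirror a full unroll (factor 5).
std::array<int, 5> increment_flattened(std::array<int, 5> a) {
  a[0] += 1;
  a[1] += 1;
  a[2] += 1;
  a[3] += 1;
  a[4] += 1;
  return a;
}

// Hypothetical helper: asserts that both forms compute the same result.
void check_full_unroll() {
  const std::array<int, 5> input{10, 20, 30, 40, 50};
  assert(increment_rolled(input) == increment_flattened(input));
}
```

Calling check_full_unroll() from a test harness returns silently when the two forms agree.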

The following is an example of partial loop unrolling:

// Before unrolling loop
#pragma unroll 4
for (int i = 0; i < 20; i++) {
  a[i] += 1;
}
// After the loop is unrolled by a factor of 4,
// the loop has five (20 / 4) iterations.
for (int i = 0; i < 5; i++) {
  a[i * 4] += 1;
  a[i * 4 + 1] += 1;
  a[i * 4 + 2] += 1;
  a[i * 4 + 3] += 1;
}

In the partial unroll example, each iteration of the unrolled loop is equivalent to four iterations of the original loop. The Intel® oneAPI DPC++/C++ Compiler instantiates four adders instead of one. Because there is no data dependency between the iterations of this loop, the compiler can execute the four adds in parallel.
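The partial unroll transformation can be reproduced by hand in host-side C++ to confirm that it preserves the result. This is a sketch under stated assumptions rather than kernel code: the names increment_rolled and increment_unrolled_by_4 are hypothetical, and the unrolled version assumes the element count is a multiple of 4, as it is in the 20-element example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form: one add per loop iteration.
std::vector<int> increment_rolled(std::vector<int> a) {
  for (std::size_t i = 0; i < a.size(); i++) {
    a[i] += 1;
  }
  return a;
}

// Hand-unrolled by a factor of 4: each iteration performs four adds.
// The four statements touch distinct elements, so none depends on another.
std::vector<int> increment_unrolled_by_4(std::vector<int> a) {
  // Assumption of this sketch: the trip count divides evenly by 4.
  assert(a.size() % 4 == 0);
  const std::size_t trips = a.size() / 4;  // 20 / 4 = 5 in the example
  for (std::size_t i = 0; i < trips; i++) {
    a[i * 4] += 1;
    a[i * 4 + 1] += 1;
    a[i * 4 + 2] += 1;
    a[i * 4 + 3] += 1;
  }
  return a;
}
```

For a 20-element input, both functions return the same vector; the unrolled version reaches it in five trips around the loop instead of twenty.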

TIP:

For additional information, refer to the FPGA tutorial sample "Loop Unroll" on GitHub.

NOTE:
  • Provide an unroll factor whenever possible. To specify an unroll factor N, insert the #pragma unroll <N> directive before a loop in your kernel code. The Intel® oneAPI DPC++/C++ Compiler attempts to unroll the loop by a factor of at most <N>. Consider the following code fragment. Assigning a value of 2 as the unroll factor directs the compiler to unroll the loop by a factor of 2, so that each of the two remaining iterations performs two multiply-accumulate operations.
    #pragma unroll 2
    for(size_t k = 0; k < 4; k++)
    {
       mac += data_in[(gid * 4) + k] * coeff[k];
    }
    For more information, see Determining the Correct Unroll Factor in the FPGA tutorial sample "Loop Unroll" on GitHub.
  • To unroll a loop fully, omit the unroll factor and insert the #pragma unroll directive before a loop in your kernel code. The compiler attempts to unroll the loop fully if it can determine the trip count, and issues a warning if it cannot fulfill the unroll request.
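To make the effect of the factor-2 request in the multiply-accumulate fragment concrete, the following host-side C++ sketch writes the unrolled loop out by hand and compares it with the rolled form. Several details are assumptions, because the fragment does not show its declarations: the element type int, the container types, the function names mac_rolled and mac_unrolled_by_2, and gid passed as a plain index parameter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form of the fragment: four trips, one multiply-accumulate per trip.
int mac_rolled(const std::vector<int>& data_in,
               const std::array<int, 4>& coeff, std::size_t gid) {
  assert((gid * 4) + 4 <= data_in.size());  // stay within the input buffer
  int mac = 0;
  for (std::size_t k = 0; k < 4; k++) {
    mac += data_in[(gid * 4) + k] * coeff[k];
  }
  return mac;
}

// Hand-unrolled by a factor of 2: two trips, two multiply-accumulates per trip.
// The two products are independent of each other; both accumulate into mac.
int mac_unrolled_by_2(const std::vector<int>& data_in,
                      const std::array<int, 4>& coeff, std::size_t gid) {
  assert((gid * 4) + 4 <= data_in.size());  // stay within the input buffer
  int mac = 0;
  for (std::size_t k = 0; k < 4; k += 2) {
    mac += data_in[(gid * 4) + k] * coeff[k];
    mac += data_in[(gid * 4) + k + 1] * coeff[k + 1];
  }
  return mac;
}
```

With data_in = {1, 2, 3, 4, 5, 6, 7, 8} and coeff = {1, 2, 3, 4}, both functions return 30 for gid 0 and 70 for gid 1.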