Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/13/2021
Public

3.4.4. Loop Speculation

Loop speculation is an optimization technique that enables more efficient loop pipelining by allowing future iterations to be initiated before it is known whether the loop has already exited. Consider the following simple loop example:
while (m*m*m < N) {
    m += 1;
}

Logically, the exit condition (m*m*m < N) for an iteration must be evaluated before determining whether to initiate another iteration. This means that, in the absence of speculation, the loop II cannot be lower than the number of cycles it takes to compute this exit condition. Speculated iterations are iterations that launch before the exit condition computation has completed. However, all operations with side effects, such as stores to memory, are predicated on the exit condition, so these operations still wait for the exit condition to be computed. Loop speculation is beneficial when the exit condition is the bottleneck preventing the loop from achieving a lower II. In the loop shown above, the exit condition contains two multiplications that cannot complete within a single clock cycle. However, loop speculation allows this loop to achieve II=1.
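The behavior described above can be illustrated with a toy software model in plain C. This is only a sketch under stated assumptions, not how the compiler implements speculation: the function names, the parameter s (number of speculated iterations), and the "one launch per step" timing are all hypothetical. The model launches iteration t at step t and learns the exit condition of iteration t - s at that same step, so s extra iterations are in flight when the exit is detected. Their results are discarded, which is why the final value matches the sequential loop.

```c
/* Reference: the loop exactly as written in the text.
 * Assumes N is small enough that m*m*m does not overflow int. */
int cube_root_sequential(int N) {
    int m = 0;
    while (m * m * m < N) {
        m += 1;
    }
    return m;
}

/* Toy model (assumption, not the compiler's implementation):
 * iteration t is launched at step t with m = t, and the exit condition
 * of iteration e = t - s only becomes known at step t.
 * *discarded receives the number of iterations launched after the
 * exiting iteration; their work is thrown away. */
int cube_root_speculated(int N, int s, int *discarded) {
    for (int t = 0;; ++t) {
        int e = t - s; /* iteration whose exit condition resolves now */
        if (e >= 0 && !(e * e * e < N)) {
            *discarded = t - e; /* iterations e+1 .. t were speculative */
            return e;           /* only non-speculative state is kept */
        }
    }
}
```

In this model, cube_root_speculated(1000, 4, &d) returns the same value as cube_root_sequential(1000) and sets d to 4: speculation changes how many iterations are launched, not the result.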

For a given iteration i with exit condition Ei, the number of speculated iterations s is the number of iterations initiated after i has been initiated but before Ei has been evaluated. By default, the compiler determines this number on a per-loop basis. You can find it in the per-loop details of the Loop Analysis report.

The #pragma speculated_iterations pragma allows you to directly control the number of speculated iterations for a loop. If the exit condition calculation is the bottleneck to lowering II (as shown in the Loop Analysis report), increasing the number of speculated iterations may improve the II (this is not guaranteed as other bottlenecks may be uncovered). For details about #pragma speculated_iterations, refer to Loop Speculation in the Intel FPGA SDK for OpenCL Programming Guide.

Speculated iterations introduce some overhead in nested loops, since a new invocation of a loop may not begin until all speculated iterations of its previous invocation have completed. In cases where a loop body with low latency is expected to be invoked frequently (for example, an inner loop with a short trip count), use the #pragma speculated_iterations pragma to reduce the number of speculated iterations. You can estimate the amount of this overhead by multiplying the number of speculated iterations by the II of the loop (as shown in the Loop Analysis report). Using the #pragma speculated_iterations pragma can reduce this overhead, but be aware that choosing a pragma value that is too low may increase the II (because there is not enough time to evaluate the exit condition).
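The overhead estimate above is simple arithmetic, sketched here in plain C. The function names are hypothetical. The second helper rests on an additional simplifying assumption that is not from this guide: that the useful part of one loop invocation takes roughly trip count times II cycles, with pipeline fill latency ignored.

```c
/* Overhead per loop invocation, as described in the text:
 * number of speculated iterations multiplied by the loop II. */
long speculation_overhead_cycles(int speculated_iterations, int ii) {
    return (long)speculated_iterations * (long)ii;
}

/* Rough cycles per invocation of an inner loop.
 * Assumption (not from the guide): useful work ~ trip_count * ii,
 * and pipeline fill latency is ignored. */
long estimated_invocation_cycles(int trip_count,
                                 int speculated_iterations,
                                 int ii) {
    return (long)trip_count * (long)ii
         + speculation_overhead_cycles(speculated_iterations, ii);
}
```

For a hypothetical inner loop with a trip count of 3, this estimate gives 10 cycles with seven speculated iterations at II=1, and 14 cycles with four speculated iterations at II=2. The overhead term can outweigh the useful work when the trip count is short, so the values are worth comparing against the Loop Analysis report rather than assuming that a lower II always wins.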

Consider the following example:

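// Default: the compiler chooses the number of speculated iterations.
kernel void unopt_int_cube_root (global int *dst, int N) {
   int m = 0;
   while (m*m*m < N) {
       m += 1;
   }
   dst[0] = m;
}

// Seven speculated iterations cover the seven-cycle exit condition.
kernel void opt_int_cube_root (global int *dst, int N) {
   int m = 0;
   #pragma speculated_iterations 7
   while (m*m*m < N) {
       m += 1;
   }
   dst[0] = m;
}

// No speculation: each iteration waits for the exit condition.
kernel void unopt2_int_cube_root (global int *dst, int N) {
   int m = 0;
   #pragma speculated_iterations 0
   while (m*m*m < N) {
       m += 1;
   }
   dst[0] = m;
}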

In this example, the exit condition, which has two multiplies and a compare, is the bottleneck preventing II=1. The compiler's choice of four speculated iterations results in II=2: the exit condition takes seven cycles (each multiply takes three cycles and the compare takes one cycle), and four speculated iterations times a two-cycle II gives eight cycles to cover this evaluation. Increasing the number of speculated iterations to seven covers the seven-cycle exit condition calculation and allows the loop to achieve II=1. By setting the speculated_iterations pragma to 0, you can verify that the II increases to 7, which matches the exit condition bottleneck.
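The arithmetic in this example can be reproduced with a small model in plain C. This is a sketch inferred from the numbers quoted above (speculated iterations times II must cover the exit condition latency, and with no speculation the II equals that latency). It is not the compiler's actual scheduling algorithm, all function names are hypothetical, and real loops may be limited by other bottlenecks.

```c
/* Exit condition latency for this example: the multiplies are
 * chained, followed by one compare (2 * 3 + 1 = 7 cycles). */
int exit_condition_latency(int multiplies, int multiply_cycles,
                           int compare_cycles) {
    return multiplies * multiply_cycles + compare_cycles;
}

/* Modeled II when the exit condition is the only bottleneck.
 * Assumption: the smallest II with speculated_iterations * II >= latency;
 * with zero speculated iterations the II equals the latency. */
int modeled_ii(int exit_latency, int speculated_iterations) {
    if (speculated_iterations <= 0) {
        return exit_latency;
    }
    int ii = (exit_latency + speculated_iterations - 1)
           / speculated_iterations; /* ceiling division */
    return ii < 1 ? 1 : ii;
}

/* Smallest number of speculated iterations that reaches target_ii
 * under the same model. Assumes target_ii >= 1. */
int speculated_iterations_for_ii(int exit_latency, int target_ii) {
    if (target_ii >= exit_latency) {
        return 0; /* no speculation needed */
    }
    return (exit_latency + target_ii - 1) / target_ii; /* ceiling */
}
```

With a seven-cycle exit condition, the model gives II=7 for zero speculated iterations, II=2 for four, and II=1 for seven, matching the three kernels above. Treat the Loop Analysis report as the authority for any real design.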