Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 9/26/2022
Public

3.4.5. Loop Fusion

Loop fusion is a compiler transformation in which two adjacent loops are merged into a single loop over the same index range. This transformation is typically applied to reduce loop overhead and improve run-time performance.

The following example shows the effects of fusing loops in a simple case:
Unfused loops:

for (i = 0; i < 300; i++)
    a[i] = a[i] + 3;
for (i = 0; i < 300; i++)
    b[i] = b[i] + 4;

Fused loops:

for (i = 0; i < 300; i++) {
    a[i] = a[i] + 3;
    b[i] = b[i] + 4;
}
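
The equivalence of the two forms can be checked on the host with a small plain C sketch. The function names (update_unfused, update_fused, fusion_preserves_result) and the input values are illustrative assumptions, not part of the SDK. The sketch only confirms that both forms compute the same arrays; it says nothing about the area or throughput of the generated hardware.

```c
#include <string.h>

#define N 300

/* Unfused form: two loops, each with its own loop control. */
void update_unfused(int a[N], int b[N]) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 3;
    for (int i = 0; i < N; i++)
        b[i] = b[i] + 4;
}

/* Fused form: one loop control drives both updates. */
void update_fused(int a[N], int b[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + 3;
        b[i] = b[i] + 4;
    }
}

/* Returns 1 if both forms produce identical arrays from identical input. */
int fusion_preserves_result(void) {
    int a1[N], b1[N], a2[N], b2[N];
    for (int i = 0; i < N; i++) {
        a1[i] = a2[i] = i;       /* arbitrary test data */
        b1[i] = b2[i] = 2 * i;
    }
    update_unfused(a1, b1);
    update_fused(a2, b2);
    return memcmp(a1, a2, sizeof a1) == 0 &&
           memcmp(b1, b2, sizeof b1) == 0;
}
```

Because the loops touch disjoint arrays and use the same index range, neither loop depends on the other, which is why the fused form is safe here.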

Loop control structures carry significant overhead. Fusing two loops reduces the number of control structures from two to one, which reduces this overhead. The main goal is to save FPGA area for your design while maintaining, and ideally increasing, kernel throughput.

Fusing outer loops introduces concurrency where there was previously none. Combining bodies of two adjacent loops (Lj and Lk) forms a single loop (Lf) with a loop body that spans the bodies of Lj and Lk. This combined loop body creates an opportunity for operations that are serialized across a given iteration of Lj and Lk to execute concurrently. In effect, the two loops now execute as one, reducing latency.

If inner loops are fused, concurrency is already achieved by pipelined execution of the outer loop iterations. In these cases, the concurrency effect of loop fusion is diminished.

Fusion Criteria

The compiler considers the fusion of two loops (Lj and Lk) to be valid if the loops meet the following criteria:
  • The loops must be adjacent.

    That is, no statement Si with side effects can execute after Lj and before Lk.

  • Each loop must have a single entry point and a single exit point. For example, loops that contain break statements are not considered for fusion.
  • The loops must have no negative-distance dependencies.

    That is, for loops Lj and Lk where Lj is defined before Lk, iteration m of loop Lk does not depend on values calculated in iteration m+n (where n>0) of loop Lj.
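
The negative-distance criterion can be illustrated with a plain C sketch in which iteration m of Lk reads a value that Lj writes in iteration m+1. The function names, the array size N, and the test values are assumptions chosen for illustration. Fusing these loops naively changes the result, which is the situation this criterion rules out.

```c
#define N 8

/* Original order: Lj runs to completion before Lk starts. */
void run_unfused(int a[N + 1], int b[N]) {
    for (int i = 0; i < N; i++)      /* Lj: writes a[i] */
        a[i] = 10 * i + 5;
    for (int i = 0; i < N; i++)      /* Lk: reads a[i + 1] */
        b[i] = a[i + 1];
}

/* Naive fusion: iteration i reads a[i + 1] before Lj's
   iteration i + 1 has written it, so b[i] sees the old value. */
void run_fused_naively(int a[N + 1], int b[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = 10 * i + 5;
        b[i] = a[i + 1];
    }
}

/* Counts the elements of b that differ between the two versions
   when both start from zero-initialized arrays. */
int count_mismatches(void) {
    int a1[N + 1] = {0}, b1[N] = {0};
    int a2[N + 1] = {0}, b2[N] = {0};
    int mismatches = 0;

    run_unfused(a1, b1);
    run_fused_naively(a2, b2);
    for (int i = 0; i < N; i++)
        if (b1[i] != b2[i])
            mismatches++;
    return mismatches;
}
```

With zero-initialized inputs, every element of b except the last differs between the two versions, because only the last iteration of Lk reads an element (a[N]) that Lj never writes.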

Automatic Loop Fusion

The Intel® FPGA SDK for OpenCL™ Offline Compiler fuses loops with the same trip counts automatically if the compiler analysis of your kernel determines that fusing the loops is profitable.

Situations where fusing loops is a valid transformation (based on the earlier criteria) but is not considered profitable by the compiler include the following:
  • One of the two loops, but not both, is annotated with the ivdep pragma.
  • One of the two loops, but not both, contains stall-free logic.

The Loop Analysis Report in the High-Level Design Reports indicates when loops are fused.

In addition to automatic loop fusion, the Intel® FPGA SDK for OpenCL™ Offline Compiler provides two pragmas to help you control when loops are fused:
  • loop_fuse pragma

    Override the compiler profitability analysis and fuse adjacent loops if it is safe.

  • nofusion pragma

    Annotate loops with this pragma to request that the compiler not fuse the annotated loop.
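
A minimal sketch of how the two pragmas might be placed is shown below, written as plain C functions rather than OpenCL kernels so that it compiles on a host. The function names and the array size are illustrative assumptions. The placement shown (loop_fuse on a block that encloses the candidate loops, nofusion directly before a loop) is an assumption that follows the usual form of these pragmas; confirm the exact syntax and any optional clauses in the Programming Guide. A standard C compiler ignores both pragmas, possibly with a warning, so host execution only checks the arithmetic.

```c
#define N 300

/* Request fusion of the loops inside the annotated block,
   overriding the compiler's profitability analysis.
   (Block placement is assumed; see the Programming Guide.) */
void fuse_requested(int *a, int *b) {
    #pragma loop_fuse
    {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 3;
        for (int i = 0; i < N; i++)
            b[i] = b[i] + 4;
    }
}

/* Ask the compiler not to fuse the annotated loop with its
   neighbor, even if automatic fusion would otherwise apply. */
void fuse_blocked(int *a, int *b) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 3;
    #pragma nofusion
    for (int i = 0; i < N; i++)
        b[i] = b[i] + 4;
}
```

In either case, the Loop Analysis Report is the place to confirm whether the compiler actually fused the loops.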