Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 7/13/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Loop Fuse Functions and nofusion Attribute

This topic describes the loop fuse functions and nofusion attribute that the Intel® oneAPI DPC++/C++ Compiler supports.

Loop Fuse Functions

The loop fuse functions are declared in the sycl/ext/intel/fpga_loop_fuse.hpp header file, which is invoked by the sycl/ext/intel/fpga_extensions.hpp header file. Apply these loop functions to a block of code to indicate to the compiler that it must fuse adjacent loops in the code block overriding the compiler profitability or safety analysis of the fusion. Fusing adjacent loops reduces the amount of loop control overhead in your kernel that reduces the FPGA area used and increases the performance by executing both loops as one (fused) loop. For additional information, see Fuse Loops to Reduce Overhead and Improve Performance

The compiler supports the following loop fuse functions:

  • sycl::ext::intel::fpga_loop_fuse<v>(f): Directs the compiler to fuse loops within the function f and up to a depth of v >= 1 without affecting the functionality of either loop, overriding the compiler profitability analysis of fusing the loops. By default, v = 1, which is equivalent to indicating that the compiler should consider only the adjacent top-level loops for fusing. For example:
    [=]() { //Kernel
      sycl::ext::intel::fpga_loop_fuse<1>([&] {
        L1: for(...) {}
        L2: for(...) {
          L3: for(...) {}
          L4: for(...) {
            L5: for(...) {}
            L6: for(...) {}
          }
        }
      });
    }

    By default (v = 1), only loops L1 and L2 are initially considered for fusing. At a depth of v = 2, the compiler considers L1-L2 and L3-L4 loop pairs for fusing.

    The compiler automatically considers fusing adjacent loops with equal trip counts when the loops meet the Automatic Loop Fusion criteria or when fusion is deemed safe and profitable. You can use the sycl::ext::intel::fpga_loop_fuse<v>(f) function to inform the compiler to consider fusing adjacent loops with different trip counts, which is considered to be unprofitable by default.. With the loop fuse function applied to a block of code, the compiler always attempts to fuse adjacent loops (with equal or different trip counts) in the block whenever the compiler determines that it is safe to fuse the loops. Two loops are considered safe to merge if they meet the Fusion Criteria. The following example shows the effects of fusing loops with unequal trip counts:

    Unfused Loops Fused Loops
    [=]() { //Kernel
      sycl::ext::intel::fpga_loop_fuse([&] {
        for (int i = 0; i < N; i++) {
          // Loop Body 1
        }
        for (int j = 0; j < M; j++) {
          // Loop Body 2
        }
      });
    }
    for (int f = 0; f < max(M,N); f++) {
      if (f < N) {
        // Loop Body 1
      }
      if (f < M) {
        // Loop Body 2
      }
    }

    A fused loop can itself be considered for fusing with other loops. For example, in the following code, L1 and L2 are initially considered for fusing. That resulting fused loop can then be considered for fusing with L4.

    [=]() { //Kernel
      sycl::ext::intel::fpga_loop_fuse([&] {
        L1: for(...) {}
        L2: for(...) {
          L3: for(...) {}
        }
        L4: for(...) {
        }
      });
    }
  • sycl::ext::intel::fpga_loop_fuse_independent<v>(f): Directs the compiler to fuse loops within the function f up to a depth v >= 1 while overriding fusion-safety checks. Here, v = 1 by default. When you use this function, you are guaranteeing to the compiler that fusing pairs of loops affected by the loop fuse function is safe. That is, there are no negative distance dependencies between the pairs of loops. If it is not safe, you might get functional errors in your kernel. Besides this difference, the sycl::ext::intel::fpga_loop_fuse<v>(f) and sycl::ext::intel::fpga_loop_fuse_independent<v>(f) functions behave identically.

Function Calls in Loop Fuse Code Blocks

If a function call occurs in a code block annotated with a loop fuse function and inlining that function call contains a loop, the resulting loop can be a candidate for loop fusion.

Nested Loop Fusion Functions

When you nest loop fusion functions, you might create overlapping sets of candidate loops. Consider the following example:

[=]() { //Kernel
  sycl::ext::intel::fpga_loop_fuse_independent<2>([&] {
    L1: for(...) {}
    L2: for(...) {
      sycl::ext::intel::fpga_loop_fuse<2>([&] {
        L3: for(...) {}
        L4: for(...) {
          L5: for(...) {}
          L6: for(...) {}
        }
      });
    }
  });
}

In this example, the compiler considers the following loop pairs for fusion: L1-L2, L3-L4, and L5-L6. In addition, the compiler overrides the compiler negative distance dependency analysis of L1-L2 and L3-L4 loop pairs.

nofusion Attribute

You can exempt a loop from being fused with an adjacent loop by annotating the loop with the nofusion loop attribute. This attribute prevents the annotated loop from being automatically or fused when it is subject to the loop fusion functions.

Syntax

[[intel::nofusion]]

Example

For example, the following code samples have the same effect. If one loop in a pair is annotated with the nofusion attribute, the other loop has no other loop to fuse with.

[[intel::nofusion]]
L1: for (int j = 0; j < N; ++j){
  data[j] += Q;
}
L2: for (int i = 0; i < N; ++l) {
  output[i] = Q * data[i];
}
L1: for (int j = 0; j < N; ++j){
  data[j] += Q;
}
[[intel::nofusion]]
L2: for (int i = 0; i < N; ++l) {
  output[i] = Q * data[i];
}

In the following example, the compiler does not apply the loop fusion transformation to the loops:

for (int x = 0; x < N; x++) { // loop 1
  arr1_acc[x] = x; 
}

[[intel::nofusion]] 
for (int x = 0; x < N; x++) { // loop 2
  arr2_acc[x] = x; 
}

Because you have applied the nofusion attribute to loop 2, the compiler cannot fuse loop 1 with loop 2.

TIP:

For additional information, refer to the FPGA tutorial sample "Loop Fusion" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample in GitHub.