Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 10/04/2021
Public

10.2. Optimizing Loop Control

Intel® Stratix® 10 OpenCL designs leverage the FPGA's Hyperflex™ architecture to achieve high performance. Because the Hyperflex™ architecture allows OpenCL designs to run faster, it becomes more critical to optimize loop structures in Intel® Stratix® 10 OpenCL designs; otherwise, the loops can become a notable performance bottleneck.

To achieve high performance, Intel® recommends targeting a loop initiation interval (II) of 1. An II value of 1 indicates that the loop can start a new iteration in its data path every clock cycle, which helps your design use the available FPGA resources efficiently.
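
The effect of II on cycle count can be sketched with a simplified textbook pipeline model. This is an illustration, not the offline compiler's own estimate: the function name `pipelined_loop_cycles` and the `latency` parameter (the pipeline depth of one iteration) are assumptions introduced here.

```c
#include <stdint.h>

/*
 * Simplified model (assumption, not taken from the compiler reports):
 * the first iteration needs `latency` cycles to travel through the
 * pipeline, and every later iteration starts `ii` cycles after the
 * previous one.
 */
uint64_t pipelined_loop_cycles(uint64_t iterations, uint64_t ii,
                               uint64_t latency) {
    if (iterations == 0) {
        return 0;  /* nothing is launched */
    }
    return latency + (iterations - 1) * ii;
}
```

Under this model, with a hypothetical latency of 20 cycles, 1000 iterations take 1019 cycles at an II of 1 and 2018 cycles at an II of 2, so for long loops the II value dominates the total.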

Intel® has established a new loop control scheme specifically for the Intel® Stratix® 10 architecture.

Applying Loop Control Optimization in Intel® Stratix® 10 OpenCL Designs

Leveraging the Intel® Stratix® 10 Hyperflex™ architecture, you can now create deeply pipelined loops in your design to achieve higher fMAX. Because the exit condition of such a complex loop structure might not be computable in a single clock cycle, the offline compiler now defers the complete calculation of the exit condition. The compiler decouples the calculation from the loop body and splits the calculation across multiple clock cycles. Doing so allows loop iterations to launch each clock cycle before the exit condition has been fully calculated; however, it takes a few clock cycles to flush the loop after the loop exit condition is signaled.

Refer to the Loop analysis report in the High Level Design Report (report.html) to find out to which loops the offline compiler has applied the new loop optimization strategy.

Effects of the new loop control optimization strategy:

  1. Iterations of the current loop can launch much faster.
  2. Subsequent invocations of the loop start only after the current invocation flushes all the data.
Note: A loop iteration is one execution of the loop body. A loop invocation is one execution of an entire loop, from the initial value of the loop counter until the exit condition becomes TRUE.

The following code example illustrates the termination overhead associated with a nested loop:

kernel void loop_overhead(unsigned N) {
   for (unsigned int i = 0; i < N; i++) {
      for (unsigned int j = 0; j < N; j++) {
         //do work
         //total cycles to issue all iterations: N * (N + s)
      }
   }
}

The II value of this loop is 1; however, the number of clock cycles it takes to issue all the loop iterations in the nest is N × (N + s), where s is the number of cycles it takes to flush the inner loop before its next invocation can launch. The loop overhead s is small; it does not have a notable effect on the design unless the inner loop has very few iterations.
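
The N × (N + s) expression can be evaluated directly to see when the flush overhead matters. The following is a minimal sketch of that arithmetic; the function names and the value s = 4 used in the note below are hypothetical, and the actual flush latency depends on the compiled design.

```c
#include <stdint.h>

/* Cycles to issue every iteration of the nest: N * (N + s). */
uint64_t nest_issue_cycles(uint64_t n, uint64_t s) {
    return n * (n + s);
}

/*
 * Share of those cycles spent flushing the inner loop, in percent.
 * N * s out of N * (N + s) cycles are flush cycles, which reduces
 * to s / (N + s). Integer division rounds down.
 */
uint64_t flush_overhead_percent(uint64_t n, uint64_t s) {
    if (n == 0) {
        return 0;  /* the nest issues no iterations at all */
    }
    return (100 * s) / (n + s);
}
```

With the assumed s = 4, an inner loop of 1000 iterations issues in 1,004,000 cycles with under 1% overhead, whereas an inner loop of 4 iterations spends 50% of its 32 cycles flushing.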

Types of Loops that Benefit from Loop Control Optimization

Most loops, even those with loop control that is deeply pipelined and with complex exit conditions, are able to achieve an II value of 1. This optimization strategy is primarily beneficial for high throughput designs with loops that have many iterations. The fMAX increase in these designs adequately compensates for the comparatively small overhead on termination.
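
The trade-off between higher fMAX and termination overhead can be estimated with a rough comparison. The sketch below assumes a baseline that issues the nest in N × N cycles at a lower fMAX and an optimized design that issues it in N × (N + s) cycles at a higher fMAX. Both the baseline model and the function name are assumptions made for illustration; the real figures come from the compiler reports.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the optimized design finishes sooner:
 *   N * (N + s) / fmax_fast  <  N * N / fmax_base
 * Dividing both sides by N and cross-multiplying keeps the
 * comparison in integer arithmetic.
 */
bool deferred_exit_pays_off(uint64_t n, uint64_t s,
                            uint64_t fmax_base_mhz,
                            uint64_t fmax_fast_mhz) {
    if (n == 0) {
        return false;  /* no iterations, nothing to gain */
    }
    return (n + s) * fmax_base_mhz < n * fmax_fast_mhz;
}
```

For hypothetical values of s = 4, 300 MHz for the baseline, and 400 MHz for the optimized design, the optimization pays off once the inner loop runs more than 12 iterations per invocation.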

Note: The extent to which loop control is pipelined does not affect the loop's II value.

There are some loops to which the loop optimization strategy is not applicable:

  • Loops in an NDRange kernel

    Because the offline compiler must be able to pipeline the loop, the loop must be part of a single work-item kernel.

  • Loops with exit conditions that depend on instructions that can stall or have side effects outside the loop

The following are examples of loops that use the new loop control scheme versus those that do not:

Example 1: Loop can achieve optimal performance on Intel® Stratix® 10

kernel void good_loop(global int * restrict A, 
                      global int * restrict result,
                      unsigned N) {
  
   unsigned int sum = 0;
  
   for (unsigned int i = 0; i < N; i++) {
      sum += A[i];
   }
   *result = sum; 
}

Example 2: Loop can achieve optimal performance on Intel® Stratix® 10

In this example, the channel write has side effects outside the loop; however, the exit condition does not depend on the channel write.

channel unsigned int c0; 

kernel void producer() {
 
   for (unsigned int i = 0; i < 10; i++) {
      write_channel_intel(c0, i); 
   }   
}

Example 3: Loop cannot fully benefit from the Intel® Stratix® 10 loop control scheme

In this example, the exit condition depends on the channel read (read_channel_intel), which might have side effects outside the loop. As a result, the computation for each iteration cannot proceed until the exit condition is determined; otherwise, the kernel would consume additional data from the channel.

kernel void consumer (global int * restrict A, 
                      global int * restrict result, 
                      unsigned N) {
   unsigned int sum = 0;
   for (unsigned int i = 0; 
        i < N && read_channel_intel(c0) != 5; i++) {
      sum += A[i];
   }
   *result = sum;
}
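
Why launching iterations early would be unsafe in Example 3 can be shown with a small software model. This is only a host-side sketch: the array stands in for the channel contents, and `in_flight` is a hypothetical count of iterations launched before the exit condition resolves. It does not describe the hardware the offline compiler generates.

```c
#include <stddef.h>

/*
 * Counts how many channel values the consumer loop removes.
 * `chan` models the pending channel data, `n` is the loop bound N,
 * and `in_flight` is the assumed number of extra iterations that
 * were already launched when the exit condition became true.
 */
size_t values_consumed(const unsigned *chan, size_t len,
                       unsigned n, size_t in_flight) {
    size_t next = 0;  /* index of the next value to read */
    unsigned i = 0;

    /* Same exit test as the kernel: stop at N or when 5 is read. */
    while (i < n && next < len && chan[next++] != 5) {
        i++;
    }

    /* Each iteration already in flight performs its own read. */
    for (size_t k = 0; k < in_flight && next < len; k++) {
        next++;
    }
    return next;
}
```

For channel contents {1, 2, 5, 7, 8, 9} and N = 10, the loop as written consumes three values. With two iterations in flight, the model consumes five, removing 7 and 8 from the channel even though the loop has logically ended.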

If the offline compiler does not apply the new loop control optimization to a loop, it also disables other fMAX optimizations.

Warning: Disabling these optimizations might reduce the amount of logic usage at the expense of fMAX. Check the offline compiler's HTML reports to verify the outcome of the compiler optimizations.