Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 7/13/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Remove Loop Bottlenecks

Bottlenecks in a loop means that one or more loop carried dependencies caused the compiler to either reduce the number of data items to be processed concurrently (in the same clock cycle) or reduce fMAX. Bottlenecks occur only on single work-item kernels and are always created for loops.

Before analyzing the throughput of a simple loop, it is important to understand the concept of dynamic initiation interval. The initiation interval (II) is the statically determined number of cycles between successive iteration launches of a given loop invocation. However, the statically scheduled II may differ from the actual realized dynamic II when considering interleaving.

NOTE:

Interleaving allows the iterations of more than one invocation of a loop to execute in parallel, provided that the static II of that loop is greater than 1. By default, the maximum amount of interleaving for a loop is equal to the static II of that loop.

In the presence of interleaving, the dynamic II of a loop can be approximated by the static II of the loop divided by the degree of interleaving, that is, by the number of concurrent invocations of the loop that are in flight.

Simple Loop Example

In a simple loop, the maximum number of data items that can be processed concurrently (also known as maximum concurrency) can be expressed as:

ConcurrencyMAX = (Block latency × Maximum interleaving iterations) / Initiation Interval

Consider a simple loop with memory dependency from the Triangular Loops example:

for (int y = x + 1; y < n; y++) {            
  local_buf[y] = local_buf[y] + SomethingComplicated(local_buf[x]);          
}

The Loop Analysis report displays the following for the simple loop:


The for loop on line 77 has a latency of 36, maximum interleaving iterations of 1, and initiation interval of 30. So, the maximum concurrency is ~1 ((Latency of 36 x Maximum interleaving iteration of 1) / II of 30). The bottleneck causing this low maximum concurrency results from a loop carried dependency caused by a memory dependency. This is reported in the Bottlenecks Viewer as shown in the following image:


Another type of loop carried dependency is a data dependency, as shown in the following example:

int res = N;
for (int i = 0; i < N; i++) {
  res += 1;
  res ^= i;
}

Nested Loop Example

In a nested loop, the maximum concurrency is more difficult to compute. For example, the loop carried dependency in a nested loop does not necessarily affect the initiation interval of the outer loop. Additionally, a nested loop often requires the knowledge of the inner loop's trip count. Consider the Triangular Loops example:

void TriangularLoop(std::unique_ptr<queue>& q, buffer<uint32_t>& input_buf,
                    buffer<uint32_t>& output_buf, uint32_t n, event& e,
                    bool optimize) {

  e = q->submit([&](handler& h) {
   
    accessor input(input_buf, h, read_only);
    accessor output(output_buf, h, write_only, no_init);

    h.single_task<Task>([=]() [[intel::kernel_args_restrict]] {
     
      const int real_iterations = (n * (n + 1) / 2 - 1);
      const int extra_dummy_iterations = (kM - 2) * (kM - 1) / 2;
      const int loop_bound = real_iterations + extra_dummy_iterations;

      uint32_t local_buf[kSize];

      for (uint32_t i = 0; i < kSize; i++) {
        local_buf[i] = input[i];
      }

      if (!optimize) {  // Unoptimized loop.

        for (int x = 0; x < n; x++) {
          for (int y = x + 1; y < n; y++) {
            local_buf[y] = local_buf[y] + SomethingComplicated(local_buf[x]);
          }
        }

In this example, the bottleneck is resulted from a loop carried dependency caused by a memory dependency on the variable local_buf. The local_buf variable must finish loading in the loop on line 25 before the next outer loop (line 24) iteration is launched. Therefore, the maximum concurrency of the outer loop is 1. This information is reported in the details sections of the Loop Analysis and Schedule Viewer reports.

Addressing Bottlenecks

To address bottlenecks, primarily consider restructuring your design code.

After restructuring, consider applying the following loop attributes on arrays:

Consider the previous Simple Loop Example where the concurrency is 1 and the initiation interval is 30. Applying the [[intel::ivdep(M)]] attribute, as shown in the following code snippet, comes at an expense of lowered predicted fMAX from 240 MHz to 194.4 MHz. The original nested loop is manually coalesced or merged into a single loop. Since the loop is restructured to ensure that a minimum of M iterations is executed, the [[intel::ivdep(M)]] is used to inform the compiler that at least M iterations always separate any pair of dependent iterations. This new modified loop now has a maximum concurrency of 35 ((Latency of 35 x Maximum interleaving iteration of 1) / II of 1) and an initiation interval of 1.

// Indices to track the execution in the merged loop
int x = 0, y = 1;

// Total iterations of the merged loop
const int loop_bound = TotalIterations(M, n);

[[intel::ivdep(M)]] 
for (int i = 0; i < loop_bound; i++) {

  // Determine if this is a real or dummy iteration
  bool compute = y > x;
  if (compute) {
    local_buf[y] = local_buf[y] + SomethingComplicated(local_buf[x]);
  }
  
  y++;
  if (y == n) {
    x++;
    y = Min(n - M, x + 1);
  }
}