Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Relax Loop-Carried Dependency

Example 1: Multiply Chain

Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.

Consider the following code example:

constexpr int N = 128;
 queue.submit([&](handler &cgh) {
   accessor A(A_buf, cgh, read_only);
   accessor result(result_buf, cgh, write_only);
   cgh.single_task<class unoptimized>([=]() {
     float mul = 1.0f;
     for (int i = 0; i < N; i++)
       mul *= A[i];
 
     result[0] = mul;
   });
 });

The optimization report shows that the Intel® oneAPI DPC++/C++ Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.

To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable, and use one copy every M iterations:

  1. Declare multiple copies of the variable mul (for example, in an array called mul_copies).
  2. Initialize all copies of mul_copies.
  3. Use the last copy in the array in the multiplication operation.
  4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
  5. Sum up all copies of mul_copies to mul and write the final value to the result.

The following code illustrates the restructured kernel:

constexpr int N = 128;
 constexpr int M = 8;
 queue.submit([&](handler &cgh) {
   accessor A(A_buf, cgh, read_only);
   accessor result(result_buf, cgh, write_only);
   cgh.single_task<class optimized>([=]() {
     float mul = 1.0f;
 
     // Step 1: Declare multiple copies of variable mul
     float mul_copies[M];
 
     // Step 2: Initialize all copies
     for (int i = 0; i < M; i++)
       mul_copies[i] = 1.0f;
 
     for (int i = 0; i < N; i++) {
       // Step 3: Perform multiplication on the last copy
       float cur = mul_copies[M-1] * A[i];
 
       // Step 4a: Shift copies
       #pragma unroll 
       for (int j = M-1; j > 0; j--)
         mul_copies[j] = mul_copies[j-1];
 
       // Step 4b: Insert updated copy at the beginning
       mul_copies[0] = cur;
     }
 
     // Step 5: Perform reduction on copies
     #pragma unroll 
     for (int i = 0; i < M; i++)
       mul *= mul_copies[i];
 
     result[0] = mul;
   });
 });

Example 2: Accumulator

Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.

Consider the following example:

queue.submit([&](handler &cgh) { 
  accessor arr(arr_buf, cgh, read_only); 
  accessor result(result_buf, cgh, write_only); 
  accessor N(N_buf, cgh, read_only); 
  cgh.single_task<class unoptimized>([=]() { 
    double temp_sum = 0; 
    for (int i = 0; i < *N; ++i) 
      temp_sum += arr[i]; 
    result[0] = temp_sum; 
  }); 
});

The kernel unoptimized is an accumulator that sums the elements of a double precision floating-point array arr[i]. For each loop iteration, the Intel® oneAPI DPC++/C++ Compiler takes 10 cycles to compute the result of the addition and then stores it in the variable temp_sum. Each loop iteration requires the value of temp_sum from the previous loop iteration, which creates a data dependency on temp_sum. To relax the data dependency, infer the array arr[i] as a shift register.

The following is the restructured kernel optimized:

//Shift register size must be statically determinable 
constexpr int II_CYCLES = 12; 
queue.submit([&](handler &cgh) { 
  accessor arr(arr_buf, cgh, read_only); 
  accessor result(result_buf, cgh, write_only); 
  accessor N(N_buf, cgh, read_only); 
  cgh.single_task<class optimized>([=]() { 
    //Create shift register with II_CYCLE+1 elements 
    double shift_reg[II_CYCLES+1]; 
 
    //Initialize all elements of the register to 0 
    for (int i = 0; i < II_CYCLES + 1; i++) { 
      shift_reg[i] = 0; 
    } 
     
    //Iterate through every element of input array 
    for(int i = 0; i < N; ++i){ 
      //Load ith element into end of shift register 
      //if N > II_CYCLE, add to shift_reg[0] to preserve values 
      shift_reg[II_CYCLES] = shift_reg[0] + arr[i]; 
   
      #pragma unroll 
      //Shift every element of shift register 
      for(int j = 0; j < II_CYCLES; ++j) 
      { 
        shift_reg[j] = shift_reg[j + 1]; 
      }  
    } 
  
    //Sum every element of shift register 
    double temp_sum = 0; 
   
    #pragma unroll  
    for(int i = 0; i < II_CYCLES; ++i) 
    { 
      temp_sum += shift_reg[i]; 
    } 
  
    result[0] = temp_sum; 
  }); 
});