Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Relax Loop-Carried Dependency

Example 1: Multiply Chain

Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.

Consider the following code example:

constexpr int N = 128; queue.submit([&](handler &cgh) { accessor A(A_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); cgh.single_task<class unoptimized>([=]() { float mul = 1.0f; for (int i = 0; i < N; i++) mul *= A[i]; result[0] = mul; }); });

The optimization report shows that the Intel® oneAPI DPC++/C++ Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.

To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable, and use one copy every M iterations:

  1. Declare multiple copies of the variable mul (for example, in an array called mul_copies).
  2. Initialize all copies of mul_copies.
  3. Use the last copy in the array in the multiplication operation.
  4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
  5. Sum up all copies of mul_copies to mul and write the final value to the result.

The following code illustrates the restructured kernel:

constexpr int N = 128; constexpr int M = 8; queue.submit([&](handler &cgh) { accessor A(A_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); cgh.single_task<class optimized>([=]() { float mul = 1.0f; // Step 1: Declare multiple copies of variable mul float mul_copies[M]; // Step 2: Initialize all copies for (int i = 0; i < M; i++) mul_copies[i] = 1.0f; for (int i = 0; i < N; i++) { // Step 3: Perform multiplication on the last copy float cur = mul_copies[M-1] * A[i]; // Step 4a: Shift copies #pragma unroll for (int j = M-1; j > 0; j--) mul_copies[j] = mul_copies[j-1]; // Step 4b: Insert updated copy at the beginning mul_copies[0] = cur; } // Step 5: Perform reduction on copies #pragma unroll for (int i = 0; i < M; i++) mul *= mul_copies[i]; result[0] = mul; }); });

Example 2: Accumulator

Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.

Consider the following example:

queue.submit([&](handler &cgh) { accessor arr(arr_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); accessor N(N_buf, cgh, read_only); cgh.single_task<class unoptimized>([=]() { double temp_sum = 0; for (int i = 0; i < *N; ++i) temp_sum += arr[i]; result[0] = temp_sum; }); });

The kernel unoptimized is an accumulator that sums the elements of a double precision floating-point array arr[i]. For each loop iteration, the Intel® oneAPI DPC++/C++ Compiler takes 10 cycles to compute the result of the addition and then stores it in the variable temp_sum. Each loop iteration requires the value of temp_sum from the previous loop iteration, which creates a data dependency on temp_sum. To relax the data dependency, infer the array arr[i] as a shift register.

The following is the restructured kernel optimized:

//Shift register size must be statically determinable constexpr int II_CYCLES = 12; queue.submit([&](handler &cgh) { accessor arr(arr_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); accessor N(N_buf, cgh, read_only); cgh.single_task<class optimized>([=]() { //Create shift register with II_CYCLE+1 elements double shift_reg[II_CYCLES+1]; //Initialize all elements of the register to 0 for (int i = 0; i < II_CYCLES + 1; i++) { shift_reg[i] = 0; } //Iterate through every element of input array for(int i = 0; i < N; ++i){ //Load ith element into end of shift register //if N > II_CYCLE, add to shift_reg[0] to preserve values shift_reg[II_CYCLES] = shift_reg[0] + arr[i]; #pragma unroll //Shift every element of shift register for(int j = 0; j < II_CYCLES; ++j) { shift_reg[j] = shift_reg[j + 1]; } } //Sum every element of shift register double temp_sum = 0; #pragma unroll for(int i = 0; i < II_CYCLES; ++i) { temp_sum += shift_reg[i]; } result[0] = temp_sum; }); });