Visible to Intel only — GUID: GUID-2C2F1829-A9E2-4452-9A1F-020B88C7AA20
Visible to Intel only — GUID: GUID-2C2F1829-A9E2-4452-9A1F-020B88C7AA20
Relax Loop-Carried Dependency
Example 1: Multiply Chain
Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.
Consider the following code example:
constexpr int N = 128;
queue.submit([&](handler &cgh) {
accessor A(A_buf, cgh, read_only);
accessor result(result_buf, cgh, write_only);
cgh.single_task<class unoptimized>([=]() {
float mul = 1.0f;
for (int i = 0; i < N; i++)
mul *= A[i];
result[0] = mul;
});
});
The optimization report shows that the Intel® oneAPI DPC++/C++ Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable, and use one copy every M iterations:
- Declare multiple copies of the variable mul (for example, in an array called mul_copies).
- Initialize all copies of mul_copies.
- Use the last copy in the array in the multiplication operation.
- Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
- Sum up all copies of mul_copies to mul and write the final value to the result.
The following code illustrates the restructured kernel:
constexpr int N = 128;
constexpr int M = 8;
queue.submit([&](handler &cgh) {
accessor A(A_buf, cgh, read_only);
accessor result(result_buf, cgh, write_only);
cgh.single_task<class optimized>([=]() {
float mul = 1.0f;
// Step 1: Declare multiple copies of variable mul
float mul_copies[M];
// Step 2: Initialize all copies
for (int i = 0; i < M; i++)
mul_copies[i] = 1.0f;
for (int i = 0; i < N; i++) {
// Step 3: Perform multiplication on the last copy
float cur = mul_copies[M-1] * A[i];
// Step 4a: Shift copies
#pragma unroll
for (int j = M-1; j > 0; j--)
mul_copies[j] = mul_copies[j-1];
// Step 4b: Insert updated copy at the beginning
mul_copies[0] = cur;
}
// Step 5: Perform reduction on copies
#pragma unroll
for (int i = 0; i < M; i++)
mul *= mul_copies[i];
result[0] = mul;
});
});
Example 2: Accumulator
Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.
Consider the following example:
queue.submit([&](handler &cgh) {
accessor arr(arr_buf, cgh, read_only);
accessor result(result_buf, cgh, write_only);
accessor N(N_buf, cgh, read_only);
cgh.single_task<class unoptimized>([=]() {
double temp_sum = 0;
for (int i = 0; i < *N; ++i)
temp_sum += arr[i];
result[0] = temp_sum;
});
});
The kernel unoptimized is an accumulator that sums the elements of a double precision floating-point array arr[i]. For each loop iteration, the Intel® oneAPI DPC++/C++ Compiler takes 10 cycles to compute the result of the addition and then stores it in the variable temp_sum. Each loop iteration requires the value of temp_sum from the previous loop iteration, which creates a data dependency on temp_sum. To relax the data dependency, infer the array arr[i] as a shift register.
The following is the restructured kernel optimized:
//Shift register size must be statically determinable
constexpr int II_CYCLES = 12;
queue.submit([&](handler &cgh) {
accessor arr(arr_buf, cgh, read_only);
accessor result(result_buf, cgh, write_only);
accessor N(N_buf, cgh, read_only);
cgh.single_task<class optimized>([=]() {
//Create shift register with II_CYCLE+1 elements
double shift_reg[II_CYCLES+1];
//Initialize all elements of the register to 0
for (int i = 0; i < II_CYCLES + 1; i++) {
shift_reg[i] = 0;
}
//Iterate through every element of input array
for(int i = 0; i < N; ++i){
//Load ith element into end of shift register
//if N > II_CYCLE, add to shift_reg[0] to preserve values
shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
#pragma unroll
//Shift every element of shift register
for(int j = 0; j < II_CYCLES; ++j)
{
shift_reg[j] = shift_reg[j + 1];
}
}
//Sum every element of shift register
double temp_sum = 0;
#pragma unroll
for(int i = 0; i < II_CYCLES; ++i)
{
temp_sum += shift_reg[i];
}
result[0] = temp_sum;
});
});