Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide
6.1.3. Transferring Loop-Carried Dependency to Local Memory
Consider the following kernel example:
#define N 128

/* Note: A[N-i] touches index N when i == 0, so A must hold at least N + 1 elements. */
__kernel void unoptimized( __global int* restrict A )
{
  for (unsigned i = 0; i < N; i++)
    A[N-i] = A[i];
}

Global memory accesses have long latencies. In this example, the loop-carried dependency on array A means that each iteration must wait for the previous iteration's global memory accesses to complete before it can start. The optimization report reflects this latency as an II of 227. To reduce the II, transfer the loop-carried dependency from global memory to local memory by performing the following tasks:
- Copy the array with the loop-carried dependency to local memory. In this example, the contents of array A are copied into array B in local memory.
- Execute the loop with the loop-carried dependency on array B.
- Copy array B back to global memory.
The restructured, optimized kernel is shown below:
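#define N 128

__kernel void optimized( __global int* restrict A )
{
  /* N + 1 elements: B[N-i] writes index N when i == 0. */
  int B[N + 1];

  /* Step 1: copy A from global memory into B. */
  for (unsigned i = 0; i < N; i++)
    B[i] = A[i];

  /* Step 2: run the loop with the loop-carried dependency on B. */
  for (unsigned i = 0; i < N; i++)
    B[N-i] = B[i];

  /* Step 3: copy B back to global memory, including element N,
     which the unoptimized kernel also writes. */
  for (unsigned i = 0; i <= N; i++)
    A[i] = B[i];
}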
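Before relying on a restructuring like this, it can help to confirm on the host that the rewritten loop nest computes the same result as the original. The following is a minimal sketch in plain C, not part of the SDK. The function names model_unoptimized, model_optimized, and models_agree are hypothetical. It also assumes that A holds N + 1 elements, because both kernels index element N when i == 0, so the scratch array B is sized N + 1 and the copy-back loop covers element N.

```c
#include <string.h>

#define N 128

/* Plain C model of the unoptimized kernel: every read and write goes to A. */
void model_unoptimized(int *A)
{
    for (unsigned i = 0; i < N; i++)
        A[N - i] = A[i];
}

/* Plain C model of the restructured kernel: copy in, compute on B, copy out.
   Assumption: B has N + 1 elements so that the write to B[N] stays in bounds. */
void model_optimized(int *A)
{
    int B[N + 1];

    for (unsigned i = 0; i < N; i++)   /* copy A into the scratch array */
        B[i] = A[i];

    for (unsigned i = 0; i < N; i++)   /* dependent loop, now on B only */
        B[N - i] = B[i];

    for (unsigned i = 0; i <= N; i++)  /* copy back, including element N */
        A[i] = B[i];
}

/* Returns 1 when both models leave identical contents in A, 0 otherwise. */
int models_agree(void)
{
    int a1[N + 1], a2[N + 1];

    /* Fill both arrays with the same arbitrary test pattern. */
    for (unsigned i = 0; i <= N; i++)
        a1[i] = a2[i] = (int)(i * 7 + 3);

    model_unoptimized(a1);
    model_optimized(a2);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Calling models_agree() from a host-side test checks functional equivalence only. It says nothing about II, which still has to be read from the optimization report.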
An optimization report similar to the one below indicates the successful reduction of II from 227 to 2: