2.5.1. Reviewing Loop Information

Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

Download PDF

ID 683521

Date 12/13/2021

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

2.5.1. Reviewing Loop Information

The Loops Analysis report contains information about all the loops (coalesced, unrolled, and fused loops) in your design and their unroll statuses. This report helps you examine whether the Intel® FPGA SDK for OpenCL™ Offline Compiler can maximize the throughput of your kernel.

Note: The f_MAX II report is now deprecated and its information is merged with the Loop Analysis report.

To view detailed information about the throughput bottlenecks, use the Bottlenecks viewer.

To access the report, click Throughput Analysis > Loop Analysis. The left-hand Loops List pane displays the following types of loops:

Fused loops
Fused subloops
Coalesced loops
Fully unrolled loops
Partial unrolled loops
Regular loops

The Loops Analysis report captures the following key performance metrics on all blocks:

Source Location: Indicates the loop location in the source code.
Pipelined: Indicates whether the body of a loop is pipelined. Pipelining allows for many data items to be processed concurrently (in the same clock cycle) while making efficient use of the hardware in the datapath by keeping it occupied.
II: Shows the sustainable initiation interval (II) of the loop. Processing data in loops is an additional source of pipeline parallelism. When you pipeline a loop, the next iteration of the loop begins before previous iterations complete.
You can determine the number of clock cycles between iterations by the number of clock cycles you require to resolve any dependencies between iterations. You can refer to this number as the initiation interval (II) of the loop.

The Intel® FPGA SDK for OpenCL™ Offline Compiler automatically identifies these dependencies and builds hardware to resolve these dependencies while minimizing the II. For additional information, refer to Specifying a loop initiation interval (II).

Scheduled f _MAX : Shows the scheduled maximum clock frequency at which the loop operates. The f_MAX is the maximum rate at which the outputs of registers are updated.

The physical propagation delay of the signal between two consecutive registers limits the clock speed. This propagation delay is a function of the complexity of the Boolean logic in the path. The path with the most logic (and the highest delay) limits the speed of the entire circuit, and you can refer to this path as the critical path.

The f_MAX is calculated as the inverse of the critical path delay. High f_MAX is desirable because it correlates directly with high performance in the absence of other bottlenecks. The offline compiler attempts to optimize the kernel for different objectives for the scheduled f_MAX depending on whether the f_MAX target is set and whether the #pragma II is set for each of the loops. The f_MAX target is a strong suggestion and the compiler does not error out if it is not able to achieve this f_MAX, whereas the #pragma II triggers an error if the compiler cannot achieve the requested II. The f_MAX achieved for each block of code is shown in the Loops report.

The following table outlines the behavior of the scheduler in the Intel® FPGA SDK for OpenCL™ Offline Compiler:

Explicitly Specify f_MAX?	Explicitly Specify II?	Compiler Behavior
No	No	Use heuristic to achieve best f_MAX/II trade-off.
No	Yes	Best effort to achieve the II for the corresponding loop (may not achieve the best possible f_MAX).
Yes	No	Best effort to achieve f_MAX specified (may not achieve the best possible II).
Yes	Yes	Best effort to achieve the f_MAX specified at the given II. The compiler errors out if it cannot achieve the requested II.

Note: If you are using an f_MAX target in the command line or for a kernel, use #pragma II = <N> for performance-critical loops in your design.

Latency: Shows the number of clock cycles a loop takes to complete one or more instructions. Typically, you want to have low latency. However, lowering latency often results in decreased f_MAX.
Speculated Iterations: Shows the loop speculation. Loop speculation is an optimization technique that enables more efficient loop pipelining by allowing future iterations to be initiated before determining whether the loop was exited already. For more information, refer to Loop Speculation.
Max Interleaving Iterations: Indicates the number of interleaved invocations of an inner loop that can be executed simultaneously. For more information, refer to Loop Interleaving Control.

You can use the Loops Analysis report to determine where to deploy one or more pragmas on your loops. Refer to the following pragma documentation in the Intel FPGA SDK for OpenCL Programming Guide:

Table 3. Loop Pragmas
Pragma	Reference
`#pragma unroll`	Unrolling a Loop
`#pragma loop_coalesce`	Coalescing Nested Loops
`#pragma ii`	Specifying a loop initiation interval (II)
`#pragma speculated_iterations`	Loop Speculation
`#pragma max_concurrency`	Loop Concurrency
`#pragma max_interleaving`	Loop Interleaving Control
`#pragma disable_loop_pipelining`	Disabling Pipelining of a Loop
`#pragma loop_fuse`	Fusing Adjacent Loops
`#pragma nofusion`	Marking Loops to Prevent Automatic Fusion

OpenCL Kernel Example

The following is an OpenCL kernel example that includes four loops:


 1  // ND-Range kernel with unrolled loops
 2  __attribute((reqd_work_group_size(1024,1,1)))
 3  kernel void t (global int * out, int N) {
 4    int i = get_global_id(0);
 5    int j = 1;
 6    for (int k = 0; k < 4; k++) {
 7      #pragma unroll
 8      for (int n = 0; n < 4; n++) {
 9        j += out[k+n];
10      }
11    }
12    out[i] = j;
13
14    int m = 0;
15    #pragma unroll 1
16    for (int k = 0; k < N; k++) {
17      m += out[k/3];
18    }
19    #pragma unroll
20    for (int k = 0; k < 6; k++) {
21      m += out[k];
22    }
23    #pragma unroll 2
24    for (int k = 0; k < 6; k++) {
25      m += out[k];
26    }
27    out[2] = m;
28  }

The loop analysis report of this design example highlights the unrolling strategy for the different kinds of loops defined in the code.

The Intel® FPGA SDK for OpenCL™ Offline Compiler executes the following loop unrolling strategies based on the source code:

Fully unrolls the inner loop (line 8) within the first loop because of the #pragma unroll specification
Does not unroll the second outer loop, Block4 (line 16), because of the #pragma unroll 1 specification
Fully unrolls the third outer loop (line 20) because of the #pragma unroll specification
Unrolls the fourth outer loop, Block5 (line 24), twice because of the #pragma unroll 2 specification

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

2.5.1. Reviewing Loop Information

OpenCL Kernel Example