Loop Analysis

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 7/13/2023

The report.html file contains information about all loops in your design and their unroll statuses. The Loop Analysis report helps you examine whether the Intel® oneAPI DPC++/C++ Compiler can maximize your kernel's throughput.

To view the Loop Analysis report, click Throughput Analysis > Loop Analysis. The purpose of this view is to show estimates of performance indicators (such as II) and potential performance bottlenecks. For each loop, you can identify the following using the report:

Whether the loop is pipelined
Whether the loop uses a hyper-optimized loop structure
Any pragma or attribute applied to the loop
II of the loop

NOTE:

The Loop Analysis report does not report anything about loops in NDRange kernels.
FPGA optimization reports support user-defined loop labels, which replace the system-generated loop labels.

The left-hand Loops List pane of the Loop Analysis report displays the following types of loops the compiler detects in your design and indicates any transformations the compiler may have applied. These transformations may be applied automatically or due to a source-code annotation, such as a pragma or an attribute.

Fused loops (see Fuse Loops to Reduce Overhead and Improve Performance)
Fused subloops
Coalesced loops
Fully unrolled loops
Partially unrolled loops

Loop Pragma and Attributes

You can use the Loop Analysis report to help determine where to deploy one or more of the following pragmas or attributes on your loops:

Key Performance Metrics

The Loop Analysis report captures the following key performance metrics on all blocks:

Source Location: Indicates the loop location in the source code.
Pipelined: Indicates whether a loop is pipelined. Pipelining allows many data items to be processed concurrently (in the same clock cycle) while using the hardware in the datapath efficiently by keeping it occupied.
II: Shows the sustainable initiation interval (II) of the loop. Processing data in loops is an additional source of pipeline parallelism. When you pipeline a loop, the next iteration of the loop begins before previous iterations complete. The number of clock cycles between the starts of consecutive iterations is determined by the number of clock cycles required to resolve any dependencies between iterations. This number is the initiation interval (II) of the loop. The Intel® oneAPI DPC++/C++ Compiler automatically identifies these dependencies and builds hardware to resolve them while minimizing the II.

Estimated f_MAX: Shows the estimated maximum clock frequency at which the loop operates. It is also referred to as the "Scheduled f_MAX." The f_MAX is the maximum rate at which the outputs of registers are updated. If the estimated f_MAX is below the target frequency, the estimated f_MAX appears in red, along with a question mark whose tooltip displays the target frequency.

The physical propagation delay of the signal between two consecutive registers limits the clock speed. This propagation delay is a function of the complexity of the Boolean logic in the path. The path with the most logic (and the highest delay) limits the speed of the entire circuit, and you can refer to this path as the critical path.

The f_MAX is calculated as the inverse of the critical path delay. A high f_MAX is desirable because it correlates directly with high performance in the absence of other bottlenecks. The compiler optimizes for different objectives depending on whether an f_MAX target is set and whether the #pragma II is set for each of the loops. The f_MAX target is a strong suggestion, and the compiler does not error out if it cannot achieve this f_MAX, whereas the #pragma II triggers an error if the compiler is not able to achieve the requested II. The f_MAX achieved for each block of code is shown in the Loop Analysis report. This behavior is outlined in the following table:

Explicitly specify f_MAX?	Explicitly specify II?	Compiler's Behavior
No	No	Use heuristic to achieve best f_MAX/II trade-off.
No	Yes	Best effort to achieve the II for the corresponding loop (may not achieve the best possible f_MAX).
Yes	No	Best effort to achieve f_MAX specified (may not achieve the best possible II).
Yes	Yes	Best effort to achieve the f_MAX specified at the given II. The compiler errors out if it cannot achieve the requested II.

NOTE:

If you set an f_MAX target on the command line or for a kernel, Intel® recommends that you use #pragma II = <N> for the performance-critical loops in your design.
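As a worked illustration of the relationship described above (f_MAX as the inverse of the critical path delay), the following is a minimal host-side C++ sketch. The function name fmax_mhz and the delay values are hypothetical and chosen only for illustration; the compiler derives its own estimate from the scheduled hardware, not from a calculation like this one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of any Intel API): returns the f_MAX in MHz
// implied by a set of register-to-register path delays given in nanoseconds.
double fmax_mhz(const std::vector<double>& path_delays_ns) {
  assert(!path_delays_ns.empty());
  // The critical path is the path with the highest delay.
  const double critical_ns =
      *std::max_element(path_delays_ns.begin(), path_delays_ns.end());
  // 1 / (delay in ns) is a frequency in GHz; multiply by 1000 for MHz.
  return 1000.0 / critical_ns;
}
```

For example, with assumed path delays of 2.5 ns, 4.0 ns, and 3.1 ns, the 4.0 ns path is the critical path and the implied f_MAX is 250 MHz. Shortening any other path does not raise the f_MAX.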

Latency: Shows the number of clock cycles a loop takes to complete one or more instructions. Typically, you want to have low latency. However, lowering latency often results in decreased f_MAX.
Speculated Iterations: Shows the loop speculation. Loop speculation is an optimization technique that enables more efficient loop pipelining by allowing future iterations to be initiated before determining whether the loop was exited already.
Max Interleaving Iterations: Indicates the number of interleaved invocations of an inner loop that can be executed simultaneously. For more information, refer to max_interleaving Attribute.
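To see how II, latency, and f_MAX interact, the following host-side C++ sketch uses a common simplified pipeline model: a new iteration starts every II cycles, and the last iteration needs one full pass through the pipeline to finish. This model is an assumption made for illustration. It ignores stalls, speculated iterations, and interleaving, and it is not the estimator the compiler uses. All function names are hypothetical, and the latency parameter here means the number of cycles one iteration takes to pass through the pipeline.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (an assumption): cycles for a pipelined loop in which a
// new iteration is launched every `ii` cycles and each iteration takes
// `latency` cycles to pass through the pipeline.
std::uint64_t pipelined_cycles(std::uint64_t iterations, std::uint64_t ii,
                               std::uint64_t latency) {
  assert(ii >= 1);
  if (iterations == 0) return 0;
  return latency + ii * (iterations - 1);
}

// Same loop without pipelining: each iteration waits for the previous one.
std::uint64_t serial_cycles(std::uint64_t iterations, std::uint64_t latency) {
  return latency * iterations;
}

// Converts a cycle count to microseconds at a clock frequency given in MHz.
double cycles_to_microseconds(std::uint64_t cycles, double fmax_mhz) {
  return static_cast<double>(cycles) / fmax_mhz;
}
```

Under this model, 1000 iterations with a latency of 10 cycles take 1009 cycles at II = 1 and 2008 cycles at II = 2, compared with 10000 cycles without pipelining. For long loops the II dominates the cycle count, and the f_MAX then converts cycles to time, which is why the report presents the two metrics side by side.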

Example

The following is a SYCL* kernel example that includes three loops:

 cgh.single_task<class example>([=]() {
   #pragma unroll
   for (int i = 0; i < 10; i++) {
     acc_data[i] += i;
   }
   #pragma unroll 1
   for (int k = 0; k < N; k++) {
     #pragma unroll 5
     for (int j = 0; j < N; j++) {
       acc_data[j] = j + k;
     }
   }
 });

The Loop Analysis report of this design example highlights the unrolling strategy for the different kinds of loops defined in the code.

The Intel® oneAPI DPC++/C++ Compiler implements the following loop unrolling strategies based on the source code:

Fully unrolls the first loop (line 3) because of the #pragma unroll specification.
Does not unroll the second loop (line 7), which is an outer loop, because of the #pragma unroll 1 specification.
Unrolls the third loop (line 9, an inner loop of the second loop) five times because of the #pragma unroll 5 specification.
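To make the effect of a partial unroll concrete, the following plain C++ sketch writes out by hand what an unroll factor of 5 means for the inner loop of the example, for a single value of k. It runs on a host CPU and only demonstrates that the transformation preserves the results. The function names are hypothetical, and the trailing remainder loop is an assumption added so that the sketch is correct when n is not a multiple of 5; the source does not describe how the compiler handles that case.

```cpp
#include <cassert>
#include <vector>

// The inner loop as written in the kernel, for one value of k.
std::vector<int> inner_rolled(int n, int k) {
  std::vector<int> data(n);
  for (int j = 0; j < n; j++) {
    data[j] = j + k;
  }
  return data;
}

// The same loop unrolled by hand with a factor of 5: each trip through the
// main loop performs five copies of the original body.
std::vector<int> inner_unrolled_by_5(int n, int k) {
  std::vector<int> data(n);
  int j = 0;
  for (; j + 5 <= n; j += 5) {
    data[j] = j + k;
    data[j + 1] = j + 1 + k;
    data[j + 2] = j + 2 + k;
    data[j + 3] = j + 3 + k;
    data[j + 4] = j + 4 + k;
  }
  // Remainder iterations when n is not a multiple of 5 (assumption).
  for (; j < n; j++) {
    data[j] = j + k;
  }
  return data;
}
```

On a CPU the five statements still run one after another. In FPGA hardware, the unrolled copies of the body are what allow the compiler to build replicated logic, which is the trade of area for throughput that the unroll entries in the report reflect.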

For more examples, refer to the Loops section.
