Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide
ID: 683521 | Date: 3/28/2022
10.1.2. Using a Single Kernel to Describe Systolic Arrays
For an Intel® Stratix® 10 OpenCL design, Intel® recommends that you describe a systolic array as a single kernel that uses a function for the processing element (PE) and a fully unrolled loop or loop nest to represent the array.
Unoptimized multi-kernel systolic array pseudocode:
// data distribution network over an array of channels
channel int c[ROWS][COLS];
channel int d[ROWS][COLS];

// ROW and COL are placeholders for this compute unit's position in the array
__attribute__((num_compute_units(ROWS, COLS)))
kernel void PE() {
    while (1) {
        // get data values from my neighbors
        int x = read_channel_intel(c[ROW - 1][COL]);
        int y = read_channel_intel(d[ROW][COL - 1]);

        // some code that uses x and y
        ...

        // send the same data values to the next neighbors
        write_channel_intel(c[ROW][COL], x);
        write_channel_intel(d[ROW][COL], y);
    }
}
Optimized single-kernel pseudocode:
kernel void allPEs() {
    // the data distribution network is now a pair of arrays private to the kernel
    int c[ROWS], d[COLS];
    while (1) {
        #pragma unroll
        for (int i = 0; i < ROWS; i++)
            #pragma unroll
            for (int j = 0; j < COLS; j++) {
                // compute and store outputs
                PE(c[i], d[j]);
            }
    }
}
Note: Instead of being a separate kernel, the PE body is now an ordinary function that the unrolled loops call as PE(). Unrolling the loops results in an array of PEs, each of which occupies its own portion of the FPGA and is arranged logically as a 2D array.
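For illustration only, the following minimal sketch shows one way the PE function called from allPEs() might look; the two-argument signature simply mirrors the PE(c[i], d[j]) calls above, and the body is a placeholder rather than part of the original example. Any per-PE state, such as an accumulator, would typically be declared inside allPEs() and passed in as an additional argument.
// Hypothetical sketch of the PE function for the single-kernel version.
// The channel reads and writes of the multi-kernel version disappear:
// the row value x and the column value y arrive as ordinary arguments,
// and the data continues to flow through the c[] and d[] arrays in allPEs().
void PE(int x, int y) {
    // some code that uses x and y goes here; per-PE state such as an
    // accumulator would be declared in allPEs() and passed in by the caller
}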
Depending on the size of the array, it can be challenging for the Intel® FPGA SDK for OpenCL™ Offline Compiler to generate hardware that distributes the same values of c and d to all PEs in a row or column of the array within a single clock cycle, and doing so might degrade fMAX. To remedy this problem, consider using the __fpga_reg() function to instruct the offline compiler to insert registers on c and d with every new PE. Intel® recommends that you use the __fpga_reg() function only when you know that the PEs are spatially separated from one another on the FPGA.
Note: The __fpga_reg() built-in function is an advanced feature. The offline compiler does not provide guidance on where you should insert the __fpga_reg() function calls. To help determine whether it is appropriate to insert the __fpga_reg() function call, you can experimentally quantify the impact additional registers might have on fMAX, and inspect the Intel® Quartus® Prime compilation reports.
Optimized pseudocode with the __fpga_reg() function:
kernel void allPEs() {
    int c[ROWS], d[COLS];
    while (1) {
        #pragma unroll
        for (int i = 0; i < ROWS; i++)
            #pragma unroll
            for (int j = 0; j < COLS; j++) {
                // compute and store outputs
                PE(c[i], d[j]);
                // register the values before they reach the next PE
                c[i] = __fpga_reg(c[i]);
                d[j] = __fpga_reg(d[j]);
            }
    }
}
After the offline compiler unrolls the loop, there is one additional register before every PE on both c and d, allowing the Intel® Quartus® Prime Pro Edition software to place the PEs farther apart. You can add more than one register by nesting __fpga_reg() calls in your code. For example, the call __fpga_reg(__fpga_reg(x)) adds two registers on the data path. However, excessive __fpga_reg() calls in your kernel increase the design area, and the resulting congestion might degrade fMAX.
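As a minimal sketch, the following variant of allPEs() applies this nested-call pattern to insert two register stages per PE on both c and d. It assumes the same ROWS, COLS, and PE() definitions as in the examples above; the choice of two registers is illustrative, not a recommendation.
kernel void allPEs() {
    int c[ROWS], d[COLS];
    while (1) {
        #pragma unroll
        for (int i = 0; i < ROWS; i++)
            #pragma unroll
            for (int j = 0; j < COLS; j++) {
                // compute and store outputs
                PE(c[i], d[j]);
                // two nested __fpga_reg() calls insert two register stages
                // on each data path between neighboring PEs
                c[i] = __fpga_reg(__fpga_reg(c[i]));
                d[j] = __fpga_reg(__fpga_reg(d[j]));
            }
    }
}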