Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521

Date 3/28/2022

A. Document Revision History for the Intel® FPGA SDK for OpenCL™ Pro Edition Best Practices Guide

Document Version	Intel® Quartus® Prime Version	Changes
2022.03.28	22.1	Maintenance release.
2021.12.13	21.4	Maintenance release.
2021.10.04	21.3	Maintenance release.
2021.06.21	21.2	Removed some outdated images and improved some of the descriptions in the Reviewing Your Kernel's report.html File chapter.
2021.03.29	21.1	Updated the messages in Area Report Messages for Private Variable Storage. Added a new section in Optimize Global Memory Accesses about how to calculate the global memory bandwidth use. Added a new topic Reviewing Global Memory Information to describe the global memory view of the System Viewer in the report.html file. Updated the topic Features of the Schedule Viewer to include details about the dependency lines. Changed the Graph Viewer report name to System Viewer.
2020.12.14	20.4	Maintenance release.
2020.09.28	20.3	Minor update to Loop Fusion topic about trip count condition relaxation. Added a topic Viewing Throughput Bottlenecks in the Design. Added a new topic Loop Bottlenecks. Updated Accessing HLD FPGA Reports in JSON Format and High-level Design Report Layout topics to include Bottlenecks viewer. Made minor update in Profiling Your Kernel to Identify Performance Bottlenecks and Best Practices for Profiling Your Kernel. In Instrumenting the Kernel Pipeline with Performance Counters (-profile), removed a bullet point about running the host application from the local disk. Updated the topic title and description of Invoking the Profiler Runtime Wrapper. Updated the Profiling Autorun Kernels topic completely. Renamed the topic Intel® VTune™ Profiler to Viewing Profiling Data Using Intel® VTune™ Profiler and made minor update to the topic description. In the Performance Data Types topic, updated the description, added two new information types in the Types of Performance Data table, removed the Types of Information of table, and added a note. Made minor update in the Interpreting the Profiling Information topic description. Made minor update in the Stall, Occupancy, Bandwidth topic description. Removed information about the Intel FPGA dynamic profiler for OpenCL and the screenshot in High Stall Percentage topic. Minor update to the topic titles of No Stalls, Low Occupancy Percentage, and Low Bandwidth and No Stalls, High Occupancy Percentage, and Low Bandwidth, and updated their images. In Intel FPGA Dynamic Profiler for OpenCL Limitations, removed a limitation and added a new limitation. 
Removed the following topics: Intel FPGA Dynamic Profiler for OpenCL GUI. Launching the Intel FPGA Dynamic Profiler for OpenCL GUI (report). Source Code Tab. Tool Tip Options. Kernel Execution Tab. Autorun Captures Tab. Activity. Cache Hit. Low Bandwidth Efficiency. Autorun Profiler Data. Added the following new topics: Reducing Area Resource Usage While Profiling. Obtaining Profiling Data During Runtime. Splitting Execution and Data Post Processing. Temporal Performance Collection. Channel Depths.
2020.06.22	20.2	Updated a guideline about the use of the `local_mem_size` attribute in Preloading Data to Local Memory. Added the scheduler's behavior in different scenarios to Reviewing Loop Information. Removed the Out of Order Loop Iterations section in the Nested Loops topic. Made a minor update regarding the support for double pumping in Intel® Stratix® 10 devices in Simplifying Memory Access to Local Memories.
2020.04.13	20.1	Updated the topic title and entire topic of Optimizing for Two or More Banks of Global Memory. Updated the entire Reviewing Your Kernel's report.html File chapter. Removed the Reviewing f_MAX II Information topic because the f_MAX II report is deprecated; see the Loop Analysis report instead. Added f_max related information to the Loop Analysis report. Added a new topic Performance Data Types. Added a new topic Intel VTune Profiler. Added a new topic Invoking the Profiler Runtime Wrapper to Obtain Profiling. Made minor updates to and reorganized the existing topics of the Profiling Your Kernel to Identify Performance Bottlenecks chapter. Added a new topic Loop Fusion. Updated the Loops in a Single Work-Item Kernel topic completely. Updated the Loop-Carried Dependencies that Affect the Initiation Interval of a Loop topic completely. Updated the Trade-Off Between Initiation Interval and Maximum Frequency topic completely.
2019.09.30	19.3	Updated the topic Load-Store Units completely. Removed Streaming Load-Store Units, Semi-Streaming Load-Store Units, and Global Infrequent Load-Store Units sections. Renamed Local-Pipelined Load-Store Units to Pipelined Load-Store Units and added more information within this section. Updated the code snippet in the Cached section. Added new topics Controlling the Load-Store Units and When to Use Each LSU. Updated the Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor topic completely and replaced the code snippets. Updated the Channels topic to include more information about the `depth` attribute. Added a new topic about Schedule Viewer. Minor updates in Reviewing Block Information and Reviewing Cluster Information topics. Added a new topic Reviewing System Information and moved some of the existing instructions to this page. Removed system view related information and images from the Features of the Graph Viewer topic and moved them to the Reviewing System Information topic. Updated images in High Level Design Report Layout and Reviewing the Report Summary topics. Made minor updates in Accessing HLD FPGA Reports in JSON Format topic.
2019.07.01	19.2	Added the following topics from the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide to Profiling Your Kernel to Identify Performance Bottlenecks: Launching the GUI (report). Instrumenting the Kernel Pipeline with Performance Counters (-profile). Profiling Autorun Kernels. Removed the topic HTML Report: Area Report Messages and moved its subtopics under Reviewing Area Information. In Reviewing Area Information, included a note about analyze-area from the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide. Merged the System Viewer, Block Viewer, and Cluster Viewer topics into the Graph Viewer report and updated the relevant topics and images accordingly. In Single Work-Item Kernel versus NDRange Kernel, updated line 6 of the `accum_swg` kernel code.
2019.05.08	19.1	Updated Kernel Execution Tab since “Memory Copy (from device)” and “Memory Copy (to device)” are no longer supported. Added document archives chapter.
2019.04.01	19.1	In the Nested Loops topic, updated the code snippet and images in the Out-of-Order Loop Iterations section. In the Loops in a Single Work-Item Kernel topic: Updated the Trade-Off Between Critical Path and Maximum Frequency section to discuss kernel `lowered_fmax`. Added the Loop Speculation section. Updated most of the topics under the Reviewing Your Kernel's report.html File chapter to map the content and images to GUI changes in the HTML report. Removed the topic Area Analysis by Source since this view has been removed from the HTML report. Removed the topic Area Analysis. Added the following new topics: Reviewing f_MAX II Information. Analyzing Throughput. Reviewing Block Information. Reviewing Cluster Information. Added a new chapter Strategies for Improving Performance in Your Host Application and added the following new topics under it: Utilizing Hardware Kernel Invocation Queue. Double Buffered Host Application Utilizing Kernel Invocation Queue. Moved the Multi-Threaded Host Application topic under the Strategies for Improving Performance in Your Host Application chapter and made minor improvements in the description. Updated the code snippets, text, and images in Optimizing an OpenCL Design Example Based on Information in the HTML Report. Removed step 3 along with the flowchart in Reviewing Loop Information. Removed the Loop Analysis Report of an OpenCL Design Example topics and merged their content with Reviewing Loop Information. Moved Changing the Memory Access Pattern Example and Reducing the Area Consumed by Nested Loops Using loop_coalesce to the HTML Report: Kernel Design Concepts section. Changed the topic title Verifying Information on Memory Replication and Stalls to Using Views. Added a new topic Optimizing for Two or More Banks of Global Memory to describe how to optimize global memory. Removed the topic Simplifying Loop-Carried Dependency. Updated the Kernels topic with more information about blocks and clusters. 
Updated Local Memory topic completely and added more images to explain the concept. Rewrote the Features of the Kernel Memory Viewer topic completely.
2018.09.24	18.1	In Intel® FPGA SDK for OpenCL™ Pro Edition, the Intel® FPGA SDK for OpenCL™ Offline Compiler has a new front end. For a summary of changes introduced by this new front end, see Improved Intel® FPGA SDK for OpenCL™ Compiler Front End in the Intel® FPGA SDK for OpenCL™ Pro Edition Release Notes. Moved Static Memory Coalescing from Strategies for Improving NDRange Kernel Data Processing Efficiency to Strategies for Improving Memory Access Efficiency. Added information about the `ivdep` pragma `safelen(N)` clause to Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays. Removed from Multi-Threaded Host Application the image that compared parallel threads and loop pipelining, along with its explanation, because neither applied to host applications.
2018.05.04	18.0	Removed Intel® FPGA SDK for OpenCL™ Standard Edition information. Added a new Strategies for Optimizing Intel Stratix 10 OpenCL Designs chapter. In Preloading Data to Local Memory, added information on automatic padding of local memory elements. Removed the topic Resource-Driven Optimization because it described an obsolete optimization behavior.

Table 21. Intel® FPGA SDK for OpenCL™ Best Practices Guide Document Revision History
Date	Version	Changes
December 2017	2017.12.08	Added the following new topics: Autorun Captures Tab. Autorun Profiler Data.
November 2017	2017.11.06	Moved all topics into individual chapters. Changed some of the topic titles to task-based titles. Changed all occurrences of Fmax to f_max. Rebranded Dynamic Profiler to Intel FPGA Dynamic Profiler for OpenCL. Added a new short description to Stall, Occupancy, Bandwidth. Added to Multi-Threaded Host Application a new image that compares parallel threads and loop pipelining, along with an explanation. Added an FPGA architecture along with some explanation in FPGA Overview. Added OpenCL Design Components image to OpenCL Design Components. Added an important note to Aligning a Struct with or without Padding about 4-byte alignment and removed information related to a struct that is aligned and not padded. Added two bullet points to the last Attention section in Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor. Added Minimizing the Memory Dependencies for Loop Pipelining. Added area report hierarchy details to Reviewing Area Information. Added Best Practices for Channels and Pipes. Updated Allocating Aligned Memory. Added Reducing the Area Consumed by Nested Loops Using loop_coalesce. Added Changing the Memory Access Pattern Example. Updated the image Optimization Work Flow of a Single Work-Item Kernel. In the following topics, implemented single dash and `-option=<value>` conventions for the aoc command: Optimization Work Flow of a Single Work-Item Kernel. Optimizing Floating-Point Operations. Manual Partitioning of Global Memory. Constant Cache Memory. Compilation Considerations. High Stall and High Occupancy Percentages. In Source Code Tab and Tool Tip Options, updated the images to reflect Intel. In High Stall Percentage, added a screenshot for high stall percentage identification along with relevant explanation. In Local Memory, added a sentence about the overall state of the local memory as observed in the HTML report. 
In Load-Store Units, updated the description of semi-streaming LSU to describe how data travels throughout the block. Added new example code and relevant explanation to Nested Loops. Updated the code fragment in the Intel FPGA SDK for OpenCL Pipeline Approach section by removing the `index` keyword, and updated Figure 4. In Single Work-Item Kernel versus NDRange Kernel: Removed the criteria for creating single work-item kernels for your design. Added new example code and relevant explanation. Removed the subtopic on Single Work-Item Execution and merged its content with this topic.
May 2017	2017.05.08	Rebranded some functions in code examples as follows: Rebranded read_channel_altera to read_channel_intel. Rebranded write_channel_altera to write_channel_intel. Rebranded read_channel_nb_altera to read_channel_nb_intel. Rebranded write_channel_nb_altera to write_channel_nb_intel. Added Load-Store Units. Added Reviewing the Summary Report. Added Features of the Kernel Memory Viewer. Revised the Local Memory Banks section of Local Memory to include information about the `bank_bits` attribute. Revised Optimization Work Flow of a Single Work-Item Kernel in Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to reflect changes to the profiling commands.
December 2016	2016.12.02	Minor editorial modification.
October 2016	2016.10.31	Rebranded the Altera SDK for OpenCL to Intel® FPGA SDK for OpenCL™. Rebranded the Altera Offline Compiler to Intel® FPGA SDK for OpenCL™ Offline Compiler. In Align a Struct with or without Padding, modified code snippets to correct the placement of attributes with respect to the struct declaration. Added the topic Review Your Kernel's report.html File, with subtopics describing the HTML GUI, the various reports the GUI provides, and a walkthrough on how to leverage the information in the HTML report to optimize an OpenCL design example. Renamed Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage to HTML Report: Area Report Messages, and removed the following subsections: Area Report Messages for Global Memory and Global Memory Interconnect. Area Report Messages for Local Memory. Area Report Messages for Channels. Added the topic HTML Report: Kernel Design Concepts, which includes subtopics on kernels, global memory interconnect, local memory, nested loops, loops in single work-item kernels, and channels. In Interpreting the Profiling Information, reorganized the content and added the following: Additional explanations on stall, occupancy, bandwidth, activity, and cache hit. Suggestions on addressing unsatisfactory Profiler metrics. In Addressing Single Work-Item Kernel Dependencies Based On Optimization Report Feedback, modified the figure Optimization Work Flow of a Single Work-Item Kernel to replace area report with HTML report. Removed the Optimization Report section along with the associated subsections because the information is now part of the HTML report. Renamed Review Kernel Properties and Loop Unroll Status in the Optimization Report to Review Kernel Properties and Loop Unroll Status in the HTML Report because the optimization report is now part of the report.html file.
May 2016	2016.05.02	Added the topic Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays to introduce the `ivdep` pragma. Under Strategies for Improving Memory Access Efficiency, added the following topics to explain how to use the `numbanks` and `bankwidth` kernel attributes to configure the geometry of the local memory system: Improve Kernel Performance by Banking the Local Memory. Optimize the Geometric Configuration of Local Memory Banks Based on Array Index. Under Strategies for Improving Memory Access Efficiency, added the topic Optimize Accesses to Local Memory by Controlling the Memory Replication Factor to explain the usage of the `singlepump` and `doublepump` kernel attributes. Added information on the area report messages. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information. Removed the Kernel-Specific Area Report section because it is replaced by the enhanced area report. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information. Updated the subsections under Optimization Report to include the enhanced optimization report messages. Added the Optimization Report Message for Speed-Limiting Constructs. Updated the subsections under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to include the enhanced optimization report messages. Updated the figure Optimization Work Flow for a Single Work-Item Kernel to include steps on accessing the enhanced area report to review resource usage. Under Strategies for Improving NDRange Kernel Data Processing Efficiency, added the Review Kernel Properties and Loop Unroll Status in the Optimization Report section.
November 2015	2015.11.02	Added the topic Multi-Threaded Host Application. Added Caution note regarding memory barrier in Specify a Maximum Work-Group Size or a Required Work-Group Size.
May 2015	15.0.0	In Memory Access Considerations, added Caution note regarding performance degradation that might occur when declaring __constant pointer arguments in kernels targeting Cyclone® V devices. In Good Design Practices for Single Work-Item Kernel, removed the Initialize Data Prior to Usage in a Loop section and added a Declare Variables in the Deepest Scope Possible section. Added Removing Loop-Carried Dependency by Inferring Shift Registers. The topic discusses how, in single work-item kernels, inferring double precision floating-point array as a shift register can remove loop-carried dependencies. Added Kernel-Specific Area Reports to show examples of kernel-specific .area files that the Altera Offline Compiler generates during compilation. Renamed Transfer Data Via offline compiler Channels to Transfer Data Via offline compiler Channels or OpenCL Pipes and added the following: More information on how channels can help improve kernel performance. Information on OpenCL pipes. Renamed Data Type Considerations to Data Type Selection Considerations.
December 2014	14.1.0	Reorganized the information flow in the Optimization Report Messages section to update report messages and the layout of the optimization report. Included new optimization report messages detailing the reasons for unsuccessful and suboptimal pipelined executions. Added the Optimization Report Messages for Simplified Analysis of a Complex Design subsection under Optimization Report Messages to describe new report message for simplified kernel analysis. Renamed Using Feedback from the Optimization Report to Address Single Work-Item Kernels Dependencies to Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback. Added the Transferring Loop-Carried Dependency to Local Memory subsection under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to describe new strategy for resolving loop-carried dependency. Updated the Resource-Driven Optimization and Compilation Considerations sections to reflect the deprecation of the -O3 and `--util <N>` Altera® Offline Compiler (offline compiler) command options. Consolidated and simplified the Heterogeneous Memory Buffers and Host Application Modifications for Heterogeneous Memory Accesses sections. Added the section Align a Struct and Remove Padding between Struct Fields. Removed the section Ensure 4-Byte Alignment to All Data Structures. Modified the figure Single Work-Item Optimization Work Flow to include emulation and profiling.
June 2014	14.0.0	Renamed document as the Intel® FPGA SDK for OpenCL™ Best Practices Guide. Reorganized information flow. Renamed Good Design Practices to Good OpenCL Kernel Design Practices. Added channels information in Transfer data via offline compiler Channels. Added profiler information in Profile Your Kernel to Identify Performance Bottlenecks. Added the section Single Work-Item Kernel Versus NDRange Kernel. Updated Single Work-Item Execution section. Removed Performance Warning Messages section. Renamed Single Work-Item Kernel Programming Considerations to Good Design Practices for Single Work-Item Kernel. Added the section Strategies for Improving Single Work-Item Kernel Performance. Renamed Optimization of Data Processing Efficiency to Strategies for Improving NDRange Kernel Data Processing Efficiency. Removed Resource Sharing section. Renamed Floating-Point Operations to Optimize Floating-Point Operations. Renamed Optimization of Memory Access Efficiency to Strategies for Improving Memory Access Efficiency. Updated Manual Partitioning of Global Memory section. Added the section Strategies for Optimizing FPGA Area Usage.
December 2013	13.1.1	Updated the section Specify a Maximum Work-Group Size or a Required Work-Group Size. Added the section Heterogeneous Memory Buffers. Updated the section Single Work-Item Execution. Added the section Performance Warning Messages. Updated the section Single Work-Item Kernel Programming Considerations.
November 2013	13.1.0	Reorganized information flow. Updated the section Intel® FPGA SDK for OpenCL™ Compilation Flow. Updated the section Pipelines; inserted the figure Example Multistage Pipeline Diagram. Removed the following figures: Instruction Flow through a Five-Stage Pipeline Processor. Vector Addition Kernel Compiled to an FPGA. Effect of Kernel Vectorization on Array Summation. Data Flow Implementation of a Four-Element Accumulation Kernel. Data Flow Implementation of a Four-Element Accumulation Kernel with Loop Unrolled. Complete Loop Unrolling. Unrolling Two Loop Iterations. Memory Master Interconnect. Local Memory Read and Write Ports. Local Memory Configuration. Updated the section Good Design Practices. Removed the following sections: Predicated Execution. Throughput Analysis. Case Studies. Updated and renamed Optimizing Data Processing Efficiency to Optimization of Data Processing Efficiency. Renamed Replicating Compute Units versus Kernel SIMD Vectorization to Compute Unit Replication versus Kernel SIMD Vectorization. Renamed Using num_compute_units and num_simd_work_items Together to Combination of Compute Unit Replication and Kernel SIMD Vectorization. Updated and renamed Memory Streaming to Contiguous Memory Accesses. Updated and renamed Optimizing Memory Access to General Guidelines on Optimizing Memory Accesses. Updated and renamed Optimizing Memory Efficiency to Optimization of Memory Access Efficiency. Inserted the subsection Single Work-Item Execution under Optimization of Memory Access Efficiency.
June 2013	13.0 SP1.0	Updated support status of OpenCL kernel source code containing complex exit paths. Updated the figure Effect of Kernel Vectorization on Array Summation to correct the data flow between Store and Global Memory. Updated content for the `unroll` pragma directive in the section Loop Unrolling. Updated content of the Local Memory section. Updated the figure Local Memories Transferring Data Blocks within Matrices A and B to correct the data transfer pattern in Matrix B. Removed the figure Loop Unrolling with Vectorization. Removed the section Optimizing Local Memory Bandwidth.
May 2013	13.0.1	Updated terminology. For example, pipeline is replaced with compute unit; vector lane is replaced with SIMD vector lane. Added the following sections under Good Design Practices: Preprocessor Macros. Floating-Point versus Fixed-Point Representations. Recommended Optimization Methodology. Sequence of Optimization Techniques. Updated code fragments. Updated the figure Data Flow with Multiple Compute Units. Updated the figure Compute Unit Replication versus Kernel SIMD Vectorization. Updated the figure Optimizing Throughput Using Compute Unit Replication and SIMD Vectorization. Updated the figure Memory Streaming. Inserted the figure Local Memories Transferring Data Blocks within Matrices A and B. Reorganized the flow of information. Number of figures, tables, and examples have been updated. Included information on new kernel attributes: `max_share_resources` and `num_share_resources`.
May 2013	13.0.0	Updated pipeline discussion. Updated case study code examples and results tables. Updated figures.
November 2012	12.1.0	Initial release.