Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 3/28/2022
Public

A. Document Revision History for the Intel® FPGA SDK for OpenCL™ Pro Edition Best Practices Guide

Document Version Intel® Quartus® Prime Version Changes
2022.03.28 22.1
  • Maintenance release.
2021.12.13 21.4
  • Maintenance release.
2021.10.04 21.3
  • Maintenance release.
2021.06.21 21.2
  • Removed some outdated images and improved some of the descriptions in the Reviewing Your Kernel's report.html File chapter.
2021.03.29 21.1
  • Updated the messages in Area Report Messages for Private Variable Storage.
  • Added a new section in Optimize Global Memory Accesses about how to calculate the global memory bandwidth use.
  • Added a new topic Reviewing Global Memory Information to describe the global memory view of the System Viewer in the report.html file.
  • Updated the topic Features of the Schedule Viewer to include details about the dependency lines.
  • Changed the Graph Viewer report name to System Viewer.
2020.12.14 20.4
  • Maintenance release.
2020.09.28 20.3
  • Made a minor update to the Loop Fusion topic about trip count condition relaxation.
  • Added a topic Viewing Throughput Bottlenecks in the Design.
  • Added a new topic Loop Bottlenecks.
  • Updated the Accessing HLD FPGA Reports in JSON Format and High-level Design Report Layout topics to include the Bottlenecks viewer.
  • Made minor updates in Profiling Your Kernel to Identify Performance Bottlenecks and Best Practices for Profiling Your Kernel.
  • In Instrumenting the Kernel Pipeline with Performance Counters (-profile), removed a bullet point about running the host application from the local disk.
  • Updated the topic title and description of Invoking the Profiler Runtime Wrapper.
  • Updated the Profiling Autorun Kernels topic completely.
  • Renamed the topic Intel® VTune™ Profiler to Viewing Profiling Data Using Intel® VTune™ Profiler and made a minor update to the topic description.
  • In the Performance Data Types topic, updated the description, added two new information types in the Types of Performance Data table, removed the Types of Information table, and added a note.
  • Made a minor update in the Interpreting the Profiling Information topic description.
  • Made a minor update in the Stall, Occupancy, Bandwidth topic description.
  • Removed information about the Intel FPGA dynamic profiler for OpenCL and the screenshot in the High Stall Percentage topic.
  • Made minor updates to the topic titles of No Stalls, Low Occupancy Percentage, and Low Bandwidth and No Stalls, High Occupancy Percentage, and Low Bandwidth, and updated their images.
  • In Intel FPGA Dynamic Profiler for OpenCL Limitations, removed a limitation and added a new limitation.
  • Removed the following topics:
    • Intel FPGA Dynamic Profiler for OpenCL GUI
    • Launching the Intel FPGA Dynamic Profiler for OpenCL GUI (report)
    • Source Code Tab
    • Tool Tip Options
    • Kernel Execution Tab
    • Autorun Captures Tab
    • Activity
    • Cache Hit
    • Low Bandwidth Efficiency
    • Autorun Profiler Data
  • Added the following new topics:
    • Reducing Area Resource Usage While Profiling
    • Obtaining Profiling Data During Runtime
    • Splitting Execution and Data Post Processing
    • Temporal Performance Collection
    • Channel Depths
2020.06.22 20.2
  • Updated a guideline about the use of local_mem_size attribute in Preloading Data to Local Memory.
  • Added a description of the scheduler's behavior in different scenarios to Reviewing Loop Information.
  • Removed the Out of Order Loop Iterations section from the Nested Loops topic.
  • Made a minor update regarding the support for double pumping in Intel® Stratix® 10 devices in Simplifying Memory Access to Local Memories.
2020.04.13 20.1
  • Updated the topic title and entire topic of Optimizing for Two or More Banks of Global Memory.
  • Updated the entire Reviewing Your Kernel's report.html File chapter.
  • Removed the Reviewing fMAX II Information topic because the Fmax II report is deprecated. See the Loop Analysis report.
  • Added fmax-related information to the Loop Analysis report.
  • Added a new topic Performance Data Types.
  • Added a new topic Intel VTune Profiler.
  • Added a new topic Invoking the Profiler Runtime Wrapper to Obtain Profiling.
  • Made minor updates and reorganized the existing topics of Profiling Your Kernel to Identify Performance Bottlenecks chapter.
  • Added a new topic Loop Fusion.
  • Updated the Loops in a Single Work-Item Kernel topic completely.
  • Updated the Loop-Carried Dependencies that Affect the Initiation Interval of a Loop topic completely.
  • Updated the Trade-Off Between Initiation Interval and Maximum Frequency topic completely.
2019.09.30 19.3
  • Updated the topic Load-Store Units completely.
    • Removed Streaming Load-Store Units, Semi-Streaming Load-Store Units, and Global Infrequent Load-Store Units sections.
    • Renamed Local-Pipelined Load-Store Units to Pipelined Load-Store Units and added more information within this section.
    • Updated the code snippet in the Cached section.
    • Added new topics Controlling the Load-Store Units and When to Use Each LSU.
  • Updated the Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor topic completely and replaced the code snippets.
  • Updated the Channels topic to include more information about the depth attribute.
  • Added a new topic about Schedule Viewer.
  • Minor updates in Reviewing Block Information and Reviewing Cluster Information topics.
  • Added a new topic Reviewing System Information and moved some of the existing instructions to this page.
  • Removed system view related information and images from the Features of the Graph Viewer topic and moved it to the Reviewing System Information topic.
  • Updated images in High Level Design Report Layout and Reviewing the Report Summary topics.
  • Made minor updates in Accessing HLD FPGA Reports in JSON Format topic.
2019.07.01 19.2
  • Added the following topics from the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide to the Profiling Your Kernel to Identify Performance Bottlenecks chapter:
    • Launching the GUI (report)
    • Instrumenting the Kernel Pipeline with Performance Counters (-profile)
    • Profiling Autorun Kernels
  • Removed the topic HTML Report: Area Report Messages and moved its subtopics under Reviewing Area Information.
  • In Reviewing Area Information, included a note about analyze-area from the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide.
  • Merged the System Viewer, Block Viewer, and Cluster Viewer topics into the Graph Viewer report, and updated the relevant topics and images accordingly.
  • In Single Work-Item Kernel versus NDRange Kernel, updated line 6 of the accum_swg kernel code.
2019.05.08 19.1
  • Updated Kernel Execution Tab since “Memory Copy (from device)” and “Memory Copy (to device)” are no longer supported.
  • Added document archives chapter.
2019.04.01 19.1
2018.09.24 18.1
2018.05.04 18.0
Table 21. Intel® FPGA SDK for OpenCL™ Best Practices Guide Document Revision History
Date Version Changes
December 2017 2017.12.08
  • Added the following new topics:
    • Autorun Captures Tab
    • Autorun Profiler Data
November 2017 2017.11.06
May 2017 2017.05.08
December 2016 2016.12.02
  • Minor editorial modification.
October 2016 2016.10.31
  • Rebranded the Altera SDK for OpenCL to Intel® FPGA SDK for OpenCL™.
  • Rebranded the Altera Offline Compiler to Intel® FPGA SDK for OpenCL™ Offline Compiler.
  • In Align a Struct with or without Padding, modified code snippets to correct the placement of attributes with respect to the struct declaration.
  • Added the topic Review Your Kernel's report.html File, with subtopics describing the HTML GUI, the various reports the GUI provides, and a walkthrough on how to leverage the information in the HTML report to optimize an OpenCL design example.
  • Changed Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage to HTML Report: Area Report Messages, and removed the following subsections:
    • Area Report Messages for Global Memory and Global Memory Interconnect
    • Area Report Messages for Local Memory
    • Area Report Messages for Channels
  • Added the topic HTML Report: Kernel Design Concepts, which includes subtopics on kernels, global memory interconnect, local memory, nested loops, loops in single work-item kernels, and channels.
  • In Interpreting the Profiling Information, reorganized the content and added the following:
    • Additional explanations on stall, occupancy, bandwidth, activity, and cache hit.
    • Suggestions on addressing unsatisfactory Profiler metrics.
  • In Addressing Single Work-Item Kernel Dependencies Based On Optimization Report Feedback, modified the figure Optimization Work Flow of a Single Work-Item Kernel to replace area report with HTML report.
  • Removed the Optimization Report section along with the associated subsections because the information is now part of the HTML report.
  • Changed Review Kernel Properties and Loop Unroll Status in the Optimization Report to Review Kernel Properties and Loop Unroll Status in the HTML Report because the optimization report is now part of the report.html file.
May 2016 2016.05.02
  • Added the topic Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays to introduce the ivdep pragma.
  • Under Strategies for Improving Memory Access Efficiency, added the following topics to explain how to use the numbanks and bankwidth kernel attributes to configure the geometry of the local memory system:
    • Improve Kernel Performance by Banking the Local Memory
    • Optimize the Geometric Configuration of Local Memory Banks Based on Array Index
  • Under Strategies for Improving Memory Access Efficiency, added the topic Optimize Accesses to Local Memory by Controlling the Memory Replication Factor to explain the usage of the singlepump and doublepump kernel attributes.
  • Added information on the area report messages. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information.
  • Removed the Kernel-Specific Area Report section because it is replaced by the enhanced area report. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information.
  • Updated the subsections under Optimization Report to include the enhanced optimization report messages.
    • Added the Optimization Report Message for Speed-Limiting Constructs
  • Updated the subsections under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to include the enhanced optimization report messages.
  • Updated the figure Optimization Work Flow for a Single Work-Item Kernel to include steps on accessing the enhanced area report to review resource usage.
  • Under Strategies for Improving NDRange Kernel Data Processing Efficiency, added the Review Kernel Properties and Loop Unroll Status in the Optimization Report section.
November 2015 2015.11.02
  • Added the topic Multi-Threaded Host Application.
  • Added a Caution note regarding memory barriers in Specify a Maximum Work-Group Size or a Required Work-Group Size.
May 2015 15.0.0
  • In Memory Access Considerations, added a Caution note regarding performance degradation that might occur when declaring __constant pointer arguments in kernels targeting Cyclone® V devices.
  • In Good Design Practices for Single Work-Item Kernel, removed the Initialize Data Prior to Usage in a Loop section and added a Declare Variables in the Deepest Scope Possible section.
  • Added Removing Loop-Carried Dependency by Inferring Shift Registers. The topic discusses how, in single work-item kernels, inferring a double-precision floating-point array as a shift register can remove loop-carried dependencies.
  • Added Kernel-Specific Area Reports to show examples of kernel-specific .area files that the Altera Offline Compiler generates during compilation.
  • Renamed Transfer Data Via offline compiler Channels to Transfer Data Via offline compiler Channels or OpenCL Pipes and added the following:
    • More information on how channels can help improve kernel performance.
    • Information on OpenCL pipes.
  • Renamed Data Type Considerations to Data Type Selection Considerations.
December 2014 14.1.0
  • Reorganized the information flow in the Optimization Report Messages section to update report messages and the layout of the optimization report.
  • Included new optimization report messages detailing the reasons for unsuccessful and suboptimal pipelined executions.
  • Added the Optimization Report Messages for Simplified Analysis of a Complex Design subsection under Optimization Report Messages to describe the new report message for simplified kernel analysis.
  • Renamed Using Feedback from the Optimization Report to Address Single Work-Item Kernels Dependencies to Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback.
  • Added the Transferring Loop-Carried Dependency to Local Memory subsection under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to describe a new strategy for resolving loop-carried dependencies.
  • Updated the Resource-Driven Optimization and Compilation Considerations sections to reflect the deprecation of the -O3 and --util <N> Altera® Offline Compiler (offline compiler) command options.
  • Consolidated and simplified the Heterogeneous Memory Buffers and Host Application Modifications for Heterogeneous Memory Accesses sections.
  • Added the section Align a Struct and Remove Padding between Struct Fields.
  • Removed the section Ensure 4-Byte Alignment to All Data Structures.
  • Modified the figure Single Work-Item Optimization Work Flow to include emulation and profiling.
June 2014 14.0.0
  • Renamed the document to the Intel® FPGA SDK for OpenCL™ Best Practices Guide.
  • Reorganized information flow.
  • Renamed Good Design Practices to Good OpenCL Kernel Design Practices.
  • Added channels information in Transfer data via offline compiler Channels.
  • Added profiler information in Profile Your Kernel to Identify Performance Bottlenecks.
  • Added the section Single Work-Item Kernel Versus NDRange Kernel.
  • Updated Single Work-Item Execution section.
  • Removed Performance Warning Messages section.
  • Renamed Single Work-Item Kernel Programming Considerations to Good Design Practices for Single Work-Item Kernel.
  • Added the section Strategies for Improving Single Work-Item Kernel Performance.
  • Renamed Optimization of Data Processing Efficiency to Strategies for Improving NDRange Kernel Data Processing Efficiency.
  • Removed Resource Sharing section.
  • Renamed Floating-Point Operations to Optimize Floating-Point Operations.
  • Renamed Optimization of Memory Access Efficiency to Strategies for Improving Memory Access Efficiency.
  • Updated Manual Partitioning of Global Memory section.
  • Added the section Strategies for Optimizing FPGA Area Usage.
December 2013 13.1.1
  • Updated the section Specify a Maximum Work-Group Size or a Required Work-Group Size.
  • Added the section Heterogeneous Memory Buffers.
  • Updated the section Single Work-Item Execution.
  • Added the section Performance Warning Messages.
  • Updated the section Single Work-Item Kernel Programming Considerations.
November 2013 13.1.0
  • Reorganized information flow.
  • Updated the section Intel® FPGA SDK for OpenCL™ Compilation Flow.
  • Updated the section Pipelines; inserted the figure Example Multistage Pipeline Diagram.
  • Removed the following figures:
    • Instruction Flow through a Five-Stage Pipeline Processor.
    • Vector Addition Kernel Compiled to an FPGA.
    • Effect of Kernel Vectorization on Array Summation.
    • Data Flow Implementation of a Four-Element Accumulation Kernel.
    • Data Flow Implementation of a Four-Element Accumulation Kernel with Loop Unrolled.
    • Complete Loop Unrolling.
    • Unrolling Two Loop Iterations.
    • Memory Master Interconnect.
    • Local Memory Read and Write Ports.
    • Local Memory Configuration.
  • Updated the section Good Design Practices.
  • Removed the following sections:
    • Predicated Execution.
    • Throughput Analysis.
    • Case Studies.
  • Updated and renamed Optimizing Data Processing Efficiency to Optimization of Data Processing Efficiency.
  • Renamed Replicating Compute Units versus Kernel SIMD Vectorization to Compute Unit Replication versus Kernel SIMD Vectorization.
  • Renamed Using num_compute_units and num_simd_work_items Together to Combination of Compute Unit Replication and Kernel SIMD Vectorization.
  • Updated and renamed Memory Streaming to Contiguous Memory Accesses.
  • Updated and renamed Optimizing Memory Access to General Guidelines on Optimizing Memory Accesses.
  • Updated and renamed Optimizing Memory Efficiency to Optimization of Memory Access Efficiency.
  • Inserted the subsection Single Work-Item Execution under Optimization of Memory Access Efficiency.
June 2013 13.0 SP1.0
  • Updated support status of OpenCL kernel source code containing complex exit paths.
  • Updated the figure Effect of Kernel Vectorization on Array Summation to correct the data flow between Store and Global Memory.
  • Updated content for the unroll pragma directive in the section Loop Unrolling.
  • Updated content of the Local Memory section.
  • Updated the figure Local Memories Transferring Data Blocks within Matrices A and B to correct the data transfer pattern in Matrix B.
  • Removed the figure Loop Unrolling with Vectorization.
  • Removed the section Optimizing Local Memory Bandwidth.
May 2013 13.0.1
  • Updated terminology. For example, pipeline is replaced with compute unit; vector lane is replaced with SIMD vector lane.
  • Added the following sections under Good Design Practices:
    • Preprocessor Macros.
    • Floating-Point versus Fixed-Point Representations.
    • Recommended Optimization Methodology.
    • Sequence of Optimization Techniques.
  • Updated code fragments.
  • Updated the figure Data Flow with Multiple Compute Units.
  • Updated the figure Compute Unit Replication versus Kernel SIMD Vectorization.
  • Updated the figure Optimizing Throughput Using Compute Unit Replication and SIMD Vectorization.
  • Updated the figure Memory Streaming.
  • Inserted the figure Local Memories Transferring Data Blocks within Matrices A and B.
  • Reorganized the flow of information. The number of figures, tables, and examples has been updated.
  • Included information on new kernel attributes: max_share_resources and num_share_resources.
May 2013 13.0.0
  • Updated pipeline discussion.
  • Updated case study code examples and results tables.
  • Updated figures.
November 2012 12.1.0
  • Initial release.