Intel® FPGA SDK for OpenCL™ - Support Center
Product Discontinuance Notice
Intel is discontinuing the Intel® FPGA SDK for OpenCL™. For more information, see the product discontinuance notification (PDN2219).
The Intel® FPGA SDK for OpenCL support page provides information on how to emulate, compile, and profile your kernel. There are also guidelines on how to optimize your kernel, as well as information on how to debug your system while running the host application. This page is organized into two major categories based on the development platform: kernel developer for FPGA and host code developer for CPUs.
Software Requirements
You must have administrator privileges on the development system to install the necessary packages and drivers required for the host software development.
The host system must be running one of the supported Windows* or Linux* operating systems listed on the Operating System Support page.
Develop your host application for the Intel® FPGA SDK for OpenCL™ using one of the following development environments:
Windows OS systems
- Intel FPGA SDK for OpenCL
- Board support package (BSP)
- Microsoft* Visual Studio Professional version 2010 or later.
Linux OS systems
- Intel FPGA SDK for OpenCL
- BSP
- RPM (RPM Package Manager; originally Red Hat Package Manager)
- C compiler included with GCC
- Perl command version 5 or later
1. Kernel Developer
SDK User Interface
Intel® FPGA SDK for OpenCL™ provides two development modes. With the code builder, all the tools are integrated into a GUI in which you can design, compile, and debug the kernel. The command-line options are intended for users who prefer a conventional command-line workflow.
- GUI/code builder: Not available at the moment
- Command-line option:
Here are some useful commands for kernel developers:
aoc kernel.cl -o bin/kernel.aocx -board=<board_name>
- Compiles the kernel.cl source file into an FPGA programming file (kernel.aocx) for the board specified by <board_name>; -o specifies the output file name and location
aoc kernel.cl -o bin/kernel.aocx -board=<board_name> -march=emulator
- Builds an aocx file for emulation which can be used to test the functionality of the kernel
aoc -list-boards
- Prints a list of available boards and exits
aoc -help
- Prints complete list of aoc command options and help information for each of these options
aocl version
- Shows version information for the installed version of Intel FPGA SDK for OpenCL
aocl install
- Installs drivers for your board into the current host system
aocl diagnose
- Runs board vendor's test program for the board
aocl program
- Configures a new FPGA image onto the board
aocl flash
- Initializes the FPGA with a specified startup configuration
aocl help
- Prints complete list of aocl command options and help information for each of these options
OpenCL Specification
Khronos Compatibility
Intel® FPGA SDK for OpenCL™ is based on a published Khronos Specification and is supported by many vendors who are part of the Khronos Group. Intel FPGA SDK for OpenCL has passed the Khronos Conformance Testing Process. It conforms to the OpenCL 1.0 standard and provides both the OpenCL 1.0 and OpenCL 2.0 headers from the Khronos Group.
Attention: The SDK currently does not support all OpenCL 2.0 application programming interfaces (APIs). If you use the OpenCL 2.0 headers and make a call to an unsupported API, the call will return an error code to indicate that the API is not fully supported.
The Intel FPGA SDK for OpenCL host runtime conforms with the OpenCL platform layer and API with some clarifications and exceptions, which can be found at the Support Statuses of OpenCL Features section of the Intel FPGA SDK for OpenCL Programming Guide.
Other Related Links:
- For more information on OpenCL, visit the Khronos Group OpenCL Overview page.
- Current conformance status can be found at the Khronos Group Adopter Program page.
- For more information on the OpenCL 1.0 standard, refer to The OpenCL Specification by Khronos.
OpenCL Extensions
Channels (I/Os or Kernel)
The Intel® FPGA SDK for OpenCL™ channel extension provides a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. Use the following links for more information on how to implement, use, and emulate channels:
- Implementing the Intel FPGA SDK for OpenCL Channels Extension
- Using Channels with Kernel Copies
- HTML Report: Kernel Design Concepts - Channels
- Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes
- Requirement for Multiple Command Queues in Channels or Pipes Implementation
Note: If you want to leverage the capabilities of channels but also be able to run your kernel program using other SDKs, implement OpenCL pipes. For more information, see the Pipes section that follows.
Pipes
Intel FPGA SDK for OpenCL provides preliminary support for OpenCL pipe functions, which are part of the OpenCL Specification version 2.0. They provide a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency.
The Intel FPGA SDK for OpenCL implementation of pipes is not fully conformant to the OpenCL Specification version 2.0. The goal of the SDK's pipe implementation is to provide a solution that works seamlessly on a different OpenCL 2.0-conformant device. To enable pipes for Intel FPGA products, your design must meet certain requirements.
See the following links for more information on how to implement OpenCL pipes:
- Implementing OpenCL Pipes
- Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes
- Requirement for Multiple Command Queues in Channels or Pipes Implementation
Emulator
In a multistep design flow, you can assess the functionality of your OpenCL™ kernel by executing it on one or multiple emulation devices on an x86-64 Windows* or Linux* host. Compiling the design for emulation generates an .aocx file in seconds, so you can iterate on your design without waiting the hours that a full hardware compilation requires.
For Linux systems, the emulator offers symbolic debug support. Symbolic debug allows you to locate the origins of functional errors in your kernel code.
The link below has an overview of the design flow for OpenCL kernels and illustrates the different stages for which you can emulate your kernel.
Multistep Intel® FPGA SDK for OpenCL Design Flow
The Emulating and Debugging Your OpenCL Kernel section from the Programming Guide contains more details on the differences between kernel operation on hardware and emulation.
Other Related Links:
- Emulating and Debugging Your OpenCL Kernel
- Emulating I/O channels
- Verifying Host Runtime Functionality via Emulation (Windows)
- Verifying Host Runtime Functionality via Emulation (Linux)
Optimization
With the Intel® FPGA SDK for OpenCL™ Offline Compiler technology, you do not need to change your kernel to fit it optimally into a fixed hardware architecture. Instead, the offline compiler customizes the hardware architecture automatically to accommodate your kernel requirements.
In general, you should optimize a kernel that targets a single compute unit first. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single compute unit), the more hardware compilations you can perform in a given amount of time.
OpenCL Optimization for Intel FPGAs
To optimize the implementation of your design and get the maximum performance, understand your theoretical maximum performance and understand what your limitations are. Follow these steps:
- Start with a simple known good functional implementation.
- Use an emulator to validate the functionality.
- Remove or minimize the pipeline stalls that are reported with the optimization report.
- Plan memory access for optimal memory bandwidth.
- Use a profiler to debug performance issues.
The Profiler gives more insight into system performance, which helps you decide how to further optimize the algorithm's memory usage.
Remember that for FPGAs, the more resources you can allocate, the more unrolling and parallelization you can apply, and the higher the performance you can attain.
Helpful Reports and Resources for Optimization
A number of system-generated reports are available. These reports give insight into the code and resource usage, and offer hints on where to focus to further improve performance:
- Loop Analysis Report of an OpenCL Design Example
- Verifying Information on Memory Replication and Stalls
- Reviewing Area Information
- HTML Report: Area Report Messages
Memory Optimization
Understanding memory systems is crucial to efficiently implement an application using OpenCL.
Global Memory Interconnect
Unlike a GPU, an FPGA can build any custom load-store unit (LSU) that is most optimal for your application. As a result, your ability to write OpenCL code that selects the ideal LSU types for your application might help improve the performance of your design significantly.
For more information, refer to the Global Memory Interconnect section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Local Memory
Local memory is a complex system. Unlike typical GPU architecture where there are different levels of caches, an FPGA implements local memory in dedicated memory blocks inside the FPGA. For more information, refer to the Local Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
You can optimize memory usage in a number of ways to improve overall performance. For more information on some of the key techniques, refer to the Allocating Aligned Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
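As a rough illustration of aligned allocation on the host side, the following C sketch allocates a buffer at a fixed alignment. The 64-byte value and the helper names alloc_host_buffer and is_host_aligned are assumptions made for this example; confirm the required alignment for your board in the Best Practices Guide.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment, in bytes, for host buffers used in host-FPGA transfers. */
#define HOST_BUF_ALIGNMENT 64

/* Returns 1 if p is a multiple of the assumed alignment, 0 otherwise. */
int is_host_aligned(const void *p)
{
    return ((uintptr_t)p % HOST_BUF_ALIGNMENT) == 0;
}

/* Hypothetical helper: returns a buffer of at least `bytes` bytes aligned to
 * HOST_BUF_ALIGNMENT, or NULL on failure. Release it with free(). */
void *alloc_host_buffer(size_t bytes)
{
    /* C11 aligned_alloc expects a size that is a multiple of the alignment,
     * so round the request up (a zero-byte request becomes one block). */
    size_t blocks = (bytes + HOST_BUF_ALIGNMENT - 1) / HOST_BUF_ALIGNMENT;
    if (blocks == 0)
        blocks = 1;

    void *p = aligned_alloc(HOST_BUF_ALIGNMENT, blocks * HOST_BUF_ALIGNMENT);
    assert(p == NULL || is_host_aligned(p));
    return p;
}
```

A buffer obtained this way can then be passed to the host API calls that transfer data to and from the device.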
For more information on the strategies to improve memory access efficiency, refer to the Strategies for Improving Memory Access Efficiency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Pipelines
Understanding pipelines is crucial for leveraging the best performance of your implementation. Efficient use of pipelines directly improves the performance throughput. For more details, refer to the Pipelines section of the Intel FPGA SDK for OpenCL Best Practices Guide.
For more information on data transfer, refer to the Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Stall, Occupancy, Bandwidth
Profile your kernel to identify performance bottlenecks. For more information on how profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance, refer to the Profiling Your Kernel to Identify Performance Bottlenecks section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Loop Optimization
Some techniques for optimizing the loops are:
For some tips on removing loop-carried dependencies in various scenarios for a single work item kernel, refer to the Removing Loop-Carried Dependency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
For more information on optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section of the Intel FPGA SDK for OpenCL Best Practices Guide.
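One commonly described way to relax a loop-carried dependency on a floating-point accumulator is to spread the running sum over a small shift register of partial sums. The plain C model below sketches that idea; it is not taken from the guide, and the constant ADDER_LATENCY and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical adder latency: each partial sum is reused only this many
 * iterations after it was written. */
#define ADDER_LATENCY 8

float sum_with_shift_register(const float *data, size_t n)
{
    float shift_reg[ADDER_LATENCY + 1];
    float total = 0.0f;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i <= ADDER_LATENCY; i++)
        shift_reg[i] = 0.0f;

    for (size_t i = 0; i < n; i++) {
        /* The new value depends on the oldest partial sum, not on the value
         * produced by the previous iteration. */
        shift_reg[ADDER_LATENCY] = shift_reg[0] + data[i];

        /* Shift every partial sum one position toward index 0. */
        for (size_t j = 0; j < ADDER_LATENCY; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Combine the independent partial sums once, after the loop. */
    for (size_t j = 0; j < ADDER_LATENCY; j++)
        total += shift_reg[j];
    return total;
}
```

Because the additions are reordered, the result can differ from a simple running sum in the last bits for general floating-point input.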
Area Optimization
Area usage is an important design consideration if your OpenCL kernels are executable on FPGAs of different sizes. When you design your OpenCL application, Intel recommends that you follow certain design strategies for optimizing hardware area usage.
Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in decreased performance. During kernel optimization, Intel recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off.
For more information on strategies for optimizing FPGA area usage, refer to the Strategies for Optimizing FPGA Area Usage section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Reference Design Examples
Some design examples that illustrate the optimization techniques are as follows:
Matrix Multiplication Design Example
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation.
This example illustrates:
- Single-precision floating-point optimizations
- Local memory buffering
- Compile optimizations (loop unrolling, num_simd_work_items attribute)
- Floating-point optimizations
- Multiple device execution
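The loop tiling and local buffering idea behind this example can be sketched in plain C. This is a CPU model under stated assumptions, not the design example's kernel: the tile size TILE, the function name matmul_tiled, and the restriction to square row-major matrices are all hypothetical choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* hypothetical tile edge; n must be a multiple of it */

/* Computes c = a * b for n x n row-major matrices, one tile at a time. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(n % TILE == 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ti = 0; ti < n; ti += TILE)
        for (size_t tj = 0; tj < n; tj += TILE)
            for (size_t tk = 0; tk < n; tk += TILE) {
                /* Small buffers stand in for on-chip local memory. */
                float a_tile[TILE][TILE], b_tile[TILE][TILE];

                for (size_t i = 0; i < TILE; i++)
                    for (size_t k = 0; k < TILE; k++) {
                        a_tile[i][k] = a[(ti + i) * n + (tk + k)];
                        b_tile[i][k] = b[(tk + i) * n + (tj + k)];
                    }

                /* Each buffered element is reused TILE times here. */
                for (size_t i = 0; i < TILE; i++)
                    for (size_t j = 0; j < TILE; j++) {
                        float acc = 0.0f;
                        for (size_t k = 0; k < TILE; k++)
                            acc += a_tile[i][k] * b_tile[k][j];
                        c[(ti + i) * n + (tj + j)] += acc;
                    }
            }
}
```

Each element loaded into a tile buffer is read TILE times before it is replaced, which is the data reuse that tiling is meant to exploit.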
Time-Domain FIR Filter Design Example
This design example implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite.
This design is a great example of how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters.
This example illustrates:
- Single-precision floating-point optimizations
- Efficient 1D sliding window buffer implementation
- Single work-item kernel optimization methods
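A 1D sliding window buffer for a time-domain FIR filter can be modeled in plain C as shown below. This is a minimal sketch rather than the design example's code; NUM_TAPS and the function name fir_sliding_window are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 4 /* hypothetical filter length */

/* Computes out[i] = sum over t of coeff[t] * in[i - t], treating samples
 * before the start of the input as zero. */
void fir_sliding_window(const float *in, float *out, size_t n,
                        const float coeff[NUM_TAPS])
{
    float window[NUM_TAPS] = { 0.0f };

    assert(in != NULL && out != NULL && coeff != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Slide the window: drop the oldest sample, insert the newest. */
        for (size_t t = NUM_TAPS - 1; t > 0; t--)
            window[t] = window[t - 1];
        window[0] = in[i];

        /* Every tap reads from the window, not from the input array. */
        float acc = 0.0f;
        for (size_t t = 0; t < NUM_TAPS; t++)
            acc += coeff[t] * window[t];
        out[i] = acc;
    }
}
```

Each input sample is read from the source array once and then served to all taps from the window.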
Video Downscaling Design Example
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory.
This example illustrates:
- Kernel channels
- Multiple simultaneous kernels
- Kernel-to-kernel channels
- Sliding window design pattern
- Memory access pattern optimizations
Optical Flow Design Example
This design example is an OpenCL implementation of the Lucas-Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone® V SoC Development Kit.
This example illustrates:
- Single work-item kernel
- Sliding window design pattern
- Resource usage reduction techniques
- Visual output
Training
Online training specific to OpenCL optimization with design examples is available at:
- OpenCL Optimization Techniques: Image Processing Algorithm Example
- OpenCL Optimization Techniques: Secure Hash Algorithm Example
Profiling
In a multistep design flow, if the estimated kernel performance from emulation is acceptable, you can choose to collect information about how your design performs while executing on the FPGA.
You can instruct the Intel® FPGA SDK for OpenCL™ Offline Compiler to instrument performance counters in the Verilog code in the .aocx file with the -profile option. During execution, the Intel FPGA SDK for OpenCL Profiler measures and reports performance data that are collected during the OpenCL kernel execution on the FPGA. You can then review the performance data in the Profiler GUI.
The Profiling Your OpenCL Kernel section of the Intel FPGA SDK for OpenCL Programming Guide contains more information on how to profile your kernel.
How to Analyze Profiling Data
Profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance. The Profile Your Kernel to Identify Performance Bottlenecks section of the Intel FPGA SDK for OpenCL Best Practices Guide contains more in-depth information on the Dynamic Profiler GUI and how to interpret profiling data such as stall, bandwidth, cache hits, and so on. It also contains Profiler analysis of several OpenCL design example scenarios.
2. Host Code Developer
Runtime Host Libraries
Intel® FPGA SDK for OpenCL™ provides a compiler and tools for you to build and run OpenCL applications that target Intel FPGA products.
If you only require the Intel FPGA SDK for OpenCL's kernel deployment functionality, download and install the Intel FPGA Runtime Environment (RTE) for OpenCL.
The RTE is a subset of the Intel FPGA SDK for OpenCL. Unlike the SDK, which provides an environment that enables the development and deployment of OpenCL kernel programs, the RTE provides tools and runtime components that enable you to build and execute a host program, and execute precompiled OpenCL kernel programs on target accelerator boards.
Do not install the SDK and the RTE on the same host system. The SDK already contains the RTE.
Utilities and Host Runtime Libraries
The RTE for OpenCL provides utilities, host runtime libraries, drivers, and RTE-specific libraries and files.
- The RTE utility includes commands you can invoke to perform high-level tasks. The RTE utilities are a subset of the Intel FPGA SDK for OpenCL utilities.
- The host runtime provides the OpenCL host platform API and runtime API for your OpenCL host application
The host runtime consists of the following libraries:
- Statically-linked libraries provide OpenCL host APIs, hardware abstractions, and helper libraries
- Dynamic link libraries (DLLs) provide hardware abstractions and helper libraries
For more information on utilities and host runtime libraries, refer to the Contents of the Intel FPGA RTE for OpenCL section of the Intel FPGA RTE for OpenCL Getting Started Guide.
Data Streaming (Host Channel)
You can significantly reduce system latency by using host channels, which stream data from the host directly into the FPGA kernel through the PCIe* interface while bypassing the memory controller. The FPGA kernel can begin processing the data immediately and does not have to wait for the data transfer to complete. Host channels are supported in the OpenCL runtime application programming interfaces (APIs) and include emulation support.
For more details on host channels and emulation support, refer to the Emulating I/O Channels section of the Intel® FPGA SDK for OpenCL™ Programming Guide.
Profiling
Profiling shows you where your program spends its time and which functions it calls. This information reveals which parts of your program run slower than you expected and might need a rewrite for faster execution. It can also tell you which functions are called more or less often than you expected.
gprof
gprof is an open-source tool available on Linux* operating systems for profiling source code. It uses time-based sampling: at regular intervals, it interrogates the program counter to determine which point in the code execution has reached.
To use gprof, recompile the source code with the compiler profiling flag -pg.
Run the executable to generate the file containing profiling information.
The run produces a file named gmon.out that contains all the information the gprof tool requires to produce human-readable profiling data. Then run the gprof tool as follows:
$ gprof <executable> gmon.out > profile_data.txt
profile_data.txt contains the human-readable profiling data in two parts: the flat profile and the call graph.
The flat profile shows how much time your program spent in each function, and how many times that function was called.
The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function.
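To see both parts of the report on something small, a program such as the following C sketch can be profiled. The functions is_prime and count_primes are hypothetical examples chosen so that one function is called many times by the other.

```c
#include <assert.h>
#include <limits.h>

/* Trial division; deliberately simple so that it dominates the run time. */
int is_prime(unsigned n)
{
    if (n < 2)
        return 0;
    for (unsigned d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Counts the primes in [2, limit] by calling is_prime once per candidate. */
unsigned count_primes(unsigned limit)
{
    unsigned count = 0;

    assert(limit < UINT_MAX); /* keeps the loop counter from wrapping */
    for (unsigned n = 2; n <= limit; n++)
        count += (unsigned)is_prime(n);
    return count;
}
```

If you call count_primes from main() with a large limit, build with the -pg flag, and run the program, the flat profile should attribute most of the time to is_prime, and the call graph should show count_primes as its caller. An optimizing compiler may inline is_prime, in which case it does not appear as a separate entry.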
More information on the usage of gprof for profiling is available on the GNU website.
Intel® VTune™ Amplifier
The Intel® VTune™ Amplifier helps you speed up and optimize the execution of your code on Linux embedded platforms, Android*, or Windows* systems. It provides the following types of analysis:
- Performance analysis: Find serial and parallel code bottlenecks, analyze algorithm choices, and GPU engine usage, and understand where and how your application can benefit from available hardware resources
- Intel Energy Profiler analysis: Analyze power events and identify those that waste energy
For more information on the Intel VTune Amplifier, visit the Getting Started with Intel VTune Amplifier 2018 for Linux OS website.
Multithreading
OpenCL™ host pipelined multithreading provides a framework to achieve high throughput for algorithms that must process a large amount of input data and must process each data item in sequential order. One of the best applications of this framework is on heterogeneous platforms, where high-throughput hardware accelerates the most time-consuming part of the application. The remaining parts of the algorithm must run in sequential order on other platforms such as CPUs, either to prepare the input data for the accelerated task or to use the output of that task to prepare the final output. In this scenario, although the algorithm is partially accelerated, the total system throughput is much lower because of the sequential nature of the original algorithm.
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Application Note proposes a new pipelined framework for high-throughput design. This framework is optimal for processing large input data through algorithms where data dependency forces sequential execution of all stages or tasks of the algorithm.
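The general idea of overlapping sequential stages across threads can be sketched with POSIX threads. The code below is not the framework from the application note; it is a minimal two-stage model in which stage 2 stands in for the accelerated task, and every name in it (queue_t, stage1_prepare, run_pipeline, QUEUE_CAP) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 4 /* hypothetical depth of the buffer between stages */

/* Bounded FIFO that connects the two pipeline stages. */
typedef struct {
    int items[QUEUE_CAP];
    size_t head, count;
    int closed; /* set once the producer has no more data */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void queue_push(queue_t *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = v;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static void queue_close(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and stores an item, or 0 once the queue is closed and drained. */
static int queue_pop(queue_t *q, int *v)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        *v = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        ok = 1;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

typedef struct { const int *in; size_t n; queue_t *q; } stage1_args_t;

/* Stage 1 (CPU thread): prepares each input item for the next stage. */
static void *stage1_prepare(void *p)
{
    stage1_args_t *a = p;
    for (size_t i = 0; i < a->n; i++)
        queue_push(a->q, a->in[i] + 1);
    queue_close(a->q);
    return NULL;
}

/* Runs both stages concurrently and keeps the items in input order.
 * Returns the number of items written to out, or -1 on failure. */
int run_pipeline(const int *in, size_t n, int *out)
{
    queue_t q = { .head = 0, .count = 0, .closed = 0 };
    stage1_args_t args = { in, n, &q };
    pthread_t producer;
    size_t done = 0;
    int v, rc;

    assert(in != NULL && out != NULL);
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);

    rc = pthread_create(&producer, NULL, stage1_prepare, &args);
    if (rc == 0) {
        while (queue_pop(&q, &v))
            out[done++] = v * 2; /* stage 2 overlaps with stage 1 */
        pthread_join(producer, NULL);
    }
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return rc == 0 ? (int)done : -1;
}
```

With one producer and one consumer on a FIFO, items leave the pipeline in the order they entered, so the sequential ordering of the algorithm is preserved while the stages run in parallel.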
FPGA Initialization from Host
FPGAs are widely used in the acceleration space. OpenCL defines a specific way for the CPU to offload tasks to the FPGA. The file attached below contains the common initialization steps that the host code needs to launch the FPGA kernel. Download the file containing initialization steps here.
The init() function can be called from the main() function to initialize the FPGA. The code first finds the device upon which the kernel will run, and then programs it with the .aocx file supplied in the same directory as the host executable. After the initialization steps in the code, you must set the kernel arguments according to your design's needs.
There is also a cleanup() function which frees the resources after executing the kernel.
3. Debug
Emulation
You can use the Intel® FPGA SDK for OpenCL™ Emulator to check the functionality of the kernel. On Linux* systems, the debugging feature provided with the emulator also lets you debug OpenCL kernel functionality as part of the host application.
For more information, refer to these sections in the Intel FPGA SDK for OpenCL Programming Guide:
Profiling
For more information on profiling, refer to these sections in the Intel® FPGA SDK for OpenCL™ Programming Guide:
Runtime Debug Variables
There are certain environment variables that can be set to get more debug information while running the host application. These are Intel® FPGA SDK for OpenCL™ specific environment variables, which can help diagnose problems with custom platform designs. The following table lists all of these environment variables as well as describes them in detail.
Environment Variable | Description |
---|---|
ACL_HAL_DEBUG | Set this variable to a value of 1 to 5 to increase debug output from the hardware abstraction layer (HAL), which interfaces directly with the MMD layer. |
ACL_PCIE_DEBUG | Set this variable to a value of 1 to 10,000 to increase debug output from the MMD. This variable setting is useful for confirming that the version ID register was read correctly and the UniPHY IP cores are calibrated. |
ACL_PCIE_JTAG_CABLE | Set this variable to override the default quartus_pgm argument that specifies the cable number. The default is cable 1. If there are multiple Intel® FPGA Download Cables, you can specify a particular cable by setting this variable. |
ACL_PCIE_JTAG_DEVICE_INDEX | Set this variable to override the default quartus_pgm argument that specifies the FPGA device index. By default, this variable has a value of 1. If the FPGA is not the first device in the JTAG chain, you can customize the value. |
ACL_PCIE_USE_JTAG_PROGRAMMING | Set this variable to force the MMD to reprogram the FPGA using the JTAG cable instead of partial reconfiguration. |
ACL_PCIE_DMA_USE_MSI | Set this variable if you want to use MSI for direct memory access (DMA) transfers on Windows* OS. |
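These variables are normally set in the shell before launching the host application. As an alternative sketch, a host program on a POSIX system could set the two debug-level variables itself. This assumes that the runtime reads them only after the call, which is not confirmed here, and the helper name set_runtime_debug_levels is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sets ACL_HAL_DEBUG (1 to 5) and ACL_PCIE_DEBUG
 * (1 to 10,000) for the current process. Returns 0 on success and -1 if a
 * level is out of range or the environment cannot be updated. */
int set_runtime_debug_levels(int hal_level, int pcie_level)
{
    char value[16];

    if (hal_level < 1 || hal_level > 5)
        return -1;
    if (pcie_level < 1 || pcie_level > 10000)
        return -1;

    snprintf(value, sizeof value, "%d", hal_level);
    if (setenv("ACL_HAL_DEBUG", value, 1) != 0)
        return -1;

    snprintf(value, sizeof value, "%d", pcie_level);
    if (setenv("ACL_PCIE_DEBUG", value, 1) != 0)
        return -1;

    /* Sanity check: the last value written is visible to this process. */
    assert(strcmp(getenv("ACL_PCIE_DEBUG"), value) == 0);
    return 0;
}
```

Calling the helper at the very start of main(), before any OpenCL API call, gives the best chance that the assumption about when the runtime reads the variables holds.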
Diagnostic Tool for Intel® FPGA SDK for OpenCL™
The diagnostic tool for Intel FPGA SDK for OpenCL helps diagnose and resolve installation and setup issues, as well as hardware and software issues, that come up while working with Intel FPGA SDK for OpenCL. The tool performs installation tests, device tests, and link tests. For more information about the tool, refer to this presentation. To use the tool, download it from here.
Other Debugging Techniques
A loop in the host program can cause the OpenCL™ system to slow down gradually while the program runs. For more details about this scenario, refer to the Debugging Your OpenCL System That is Gradually Slowing Down section of the Intel® FPGA SDK for OpenCL Programming Guide.
The Intel Code Builder for OpenCL is a software development tool available as part of the Intel FPGA SDK for OpenCL. It provides a set of Microsoft* Visual Studio and Eclipse plug-ins that enable capabilities for creating, building, debugging, and analyzing Windows* and Linux* applications accelerated with OpenCL. For more information, refer to the Developing/Debugging OpenCL Applications Using Intel Code Builder for OpenCL section of the Intel FPGA SDK for OpenCL Programming Guide.
Knowledge Database Solution
Intel® Arria® 10 Devices
Intel® Stratix® 10 Devices
Additional Resources
Here are some additional links from the Intel FPGA Community for specific issues related to design and run stages:
4. Available Training
Training Courses
View the following OpenCL™ training courses:
- Introduction to Parallel Computing with OpenCL™ on Intel® FPGAs
- Writing OpenCL on Intel FPGAs
- Running OpenCL on Intel FPGAs
- Other OpenCL Training Courses
- Building an RTL Module for the Intel® FPGA SDK for OpenCL™
- Building Custom Platforms for Intel® FPGA SDK for OpenCL™: BSP Basics
- Building Custom Platforms for Intel® FPGA SDK for OpenCL™: Modifying a Reference Platform
OpenCL™ Quick Videos
Video Title | Video Description |
---|---|
How to Run Hello World and (Other Programs) with OpenCL™ on Cyclone® V SoC Using Windows* Part 1 | This video describes the out-of-box procedure for running two applications, OpenCL™ HelloWorld and OpenCL fast Fourier transform (FFT) on the Cyclone® V SoC using a Windows* machine. |
How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 2 | This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine. |
How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 3 | This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine. |
How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 4 | This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine. |
How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 5 | This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine. |
How to Package Custom Verilog Modules/Designs as OpenCL Libraries | The video discusses why customers could potentially use this feature to have their custom processing blocks (RTL) in OpenCL kernel code. The video explains the design example, such as the makefiles and config files, and explains the compilation flow. The video also shows a demo of the design example. |
OpenCL on Altera® SoC FPGA (Linux* Host) – Part 1 – Tools Download and Setup | This video shows you how to download, install, and configure the tools required to develop OpenCL kernels and host code targeting Altera® SoC FPGAs. |
OpenCL on Altera SoC FPGA (Linux Host) – Part 2 – Running the Vector Add Example with the Emulator | This video shows you how to download and compile an example OpenCL application targeting the emulator that is built into the OpenCL. |
OpenCL on Altera SoC FPGA (Linux Host) – Part 3 – Kernel and Host Code Compilation for SoC FPGA | This video shows you how to compile the OpenCL kernel and host code targeting the FPGA and processor of the Cyclone V SoC FPGA. |
OpenCL on Altera SoC FPGA (Linux Host) – Part 4 – Setup of the Runtime Environment | This video shows you how to set up the Cyclone V SoC board to run the OpenCL example and execute the host code and kernel on the board. |