Developer Guide

Intel® oneAPI DPC++/C++ Compiler Handbook for FPGAs

ID 785441
Date 6/24/2024
Public



Optimizing Your Kernel

Debugging and testing your kernel, as well as reviewing the FPGA Optimization Report, can reveal areas where you can improve the performance of your kernel. In general, the methods you use to improve the performance of your kernels should achieve the following results:

  • Increase the number of parallel operations.
  • Increase the memory bandwidth of the implementation.
  • Increase the number of operations per clock cycle that kernels can perform in hardware.

Areas of optimization are covered in separate chapters as follows:

  • RTL IP Core Kernel Interfaces (RTL IP components only)

    Your RTL IP core can have a variety of interfaces: from basic wires to streaming and memory-mapped host interfaces.

    Each interface type has different benefits. However, the system that surrounds your component might limit your choices. Keep your requirements in mind when determining the optimal interface for your component.

  • Loops

    Review the techniques in this chapter to optimize your loops to boost the performance of your component. Try to eliminate any dependencies in your loops that prevent the compiler from optimizing them.

    You can also provide explicit guidance to the compiler for optimizing loops by using the available loop pragmas and loop attributes.
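    As a hedged illustration of eliminating a loop-carried dependency (the function names and the four-lane split are illustrative and not from this guide), the following standard C++ sketch replaces a single running sum, where each iteration must wait for the previous result, with independent partial accumulators:

    ```cpp
    #include <cassert>
    #include <cstddef>

    // A single accumulator creates a loop-carried dependency:
    // each iteration needs the sum produced by the previous one.
    float sum_simple(const float* data, std::size_t n) {
      float sum = 0.0f;
      for (std::size_t i = 0; i < n; ++i)
        sum += data[i];  // next iteration depends on this result
      return sum;
    }

    // Splitting the work across independent partial accumulators relaxes
    // the dependency, which can help the compiler pipeline the loop with
    // a lower initiation interval.
    float sum_partial(const float* data, std::size_t n) {
      constexpr std::size_t kLanes = 4;
      float partial[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
      for (std::size_t i = 0; i < n; ++i)
        partial[i % kLanes] += data[i];  // four independent chains
      float sum = 0.0f;
      for (std::size_t l = 0; l < kLanes; ++l)
        sum += partial[l];
      return sum;
    }
    ```

    Both functions compute the same total; the second form simply exposes more independent work per cycle to the compiler.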

  • Pipes

    Using global memory to communicate data between your kernels can constrain the performance of your design. Pipes provide a mechanism for passing data between kernels and synchronizing kernels with high efficiency and low latency. Pipes allow kernels to use on-device FIFO buffers to communicate directly with each other.

    Host pipes enable a similar communication method between your kernel and host.
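    The FIFO behavior that pipes provide can be sketched on the host with a plain std::queue. This is only an analogy for how device pipes decouple a producer kernel from a consumer; the IntFifo class is hypothetical, and the pipe API mentioned in the comment is the compile-time-identified FIFO shape of the SYCL pipe extension, hedged accordingly:

    ```cpp
    #include <cassert>
    #include <queue>

    // Host-side analogy only. On the device, a pipe is roughly a
    // compile-time identified FIFO, e.g.:
    //   using MyPipe = sycl::ext::intel::pipe<class MyPipeId, int, 8>;
    //   MyPipe::write(v);       // in the producer kernel
    //   int v = MyPipe::read(); // in the consumer kernel
    // The queue below only demonstrates the FIFO ordering and the
    // producer/consumer decoupling that pipes give you on-device.
    class IntFifo {
     public:
      void write(int v) { q_.push(v); }
      int read() {
        int v = q_.front();
        q_.pop();
        return v;
      }
      bool empty() const { return q_.empty(); }
     private:
      std::queue<int> q_;
    };

    int run_producer_consumer() {
      IntFifo fifo;
      for (int i = 0; i < 4; ++i) fifo.write(i * i);  // "producer kernel"
      int sum = 0;
      while (!fifo.empty()) sum += fifo.read();       // "consumer kernel"
      return sum;  // 0 + 1 + 4 + 9 = 14
    }
    ```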

  • Data Types and Arithmetic Operations

    The data types in your kernel and possible conversions or casting that they might undergo can significantly affect the performance and FPGA area usage of your kernel.

    After you optimize the algorithm bottlenecks of your design, you can fine-tune some data types in your component by using arbitrary precision data types to shrink data widths, which reduces FPGA area utilization.

    Because C++ automatically promotes smaller data types such as short or char to 32 bits for operations such as addition or bit-shifting, you must use the arbitrary precision data types if you want to create narrow data paths in your kernel.
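    The promotion rule can be verified directly in standard C++; the ac_int remark in the comment is an assumption about the arbitrary-precision library, not a complete usage example:

    ```cpp
    #include <type_traits>

    // C++ integer promotion: operands narrower than int are widened to int
    // before arithmetic, so a short + short expression is 32 bits wide.
    static_assert(std::is_same_v<decltype(short{} + short{}), int>,
                  "short operands are promoted to int");
    static_assert(std::is_same_v<decltype(char{} << 1), int>,
                  "shift operands are promoted to int");

    // To keep a genuinely narrow datapath in hardware, an arbitrary
    // precision type is needed instead, e.g. (assuming the ac_int header
    // shipped with the compiler):
    //   ac_int<16, true> a, b;  // operations stay near the declared width
    inline int add_shorts(short a, short b) {
      return a + b;  // computed at 32-bit width due to promotion
    }
    ```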

  • Parallelism

    The Intel® oneAPI DPC++/C++ Compiler provides the following forms of parallelism for your kernel:

    • NDRange kernels operate as multiple parallel instances over a work-item index space. This mode of operation is useful if your kernel describes multiple concurrent threads operating in a data-parallel manner.
    • The task_sequence class provides a way to define operations that you want to run asynchronously from the main flow of your kernel. This class is helpful when you want to express coarse-grained thread-level parallelism in your kernel.
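    As a host-side analogy for launching work asynchronously from the main flow, the sketch below uses std::async in place of the task_sequence class (the helper names are illustrative; the launch-then-collect shape is the point, not the exact device API):

    ```cpp
    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    int sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
      return std::accumulate(v.begin() + lo, v.begin() + hi, 0);
    }

    // Launch half of the work asynchronously, keep computing on the main
    // flow, then collect the asynchronous result -- loosely mirroring a
    // launch/collect pattern of coarse-grained thread-level parallelism.
    int parallel_sum(const std::vector<int>& v) {
      std::size_t mid = v.size() / 2;
      auto first = std::async(std::launch::async, sum_range,
                              std::cref(v), std::size_t{0}, mid);
      int second = sum_range(v, mid, v.size());  // main flow continues
      return first.get() + second;               // collect async result
    }
    ```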

  • Memories and Memory Operations

    The Intel® oneAPI DPC++/C++ Compiler infers efficient memory architectures (such as memory width and the number of banks and ports) in a kernel by adapting the architecture to the memory access patterns of your kernel. Review this section to learn how you can get the best memory architecture for your component from the compiler. In most cases, you can optimize the memory architecture by modifying the access pattern. However, the compiler also gives you some explicit control over the memory architecture.
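    A conceptual sketch of restructuring data to match the access pattern (the function names and the even/odd split are illustrative, not from this guide): when a loop makes two accesses per iteration to one array, splitting the array into two banks lets each iteration hit each bank exactly once, which can simplify the inferred memory architecture.

    ```cpp
    #include <cassert>
    #include <cstddef>

    constexpr std::size_t kN = 8;

    // Two accesses per iteration to a single memory: the compiler may need
    // extra ports or replication to keep the loop pipelined.
    int sum_pairs_single(const int (&buf)[2 * kN]) {
      int total = 0;
      for (std::size_t i = 0; i < kN; ++i)
        total += buf[2 * i] + buf[2 * i + 1];  // two accesses, one memory
      return total;
    }

    // The even/odd elements split into two banks: one access per bank per
    // iteration, matching a simple single-ported memory for each bank.
    int sum_pairs_banked(const int (&even)[kN], const int (&odd)[kN]) {
      int total = 0;
      for (std::size_t i = 0; i < kN; ++i)
        total += even[i] + odd[i];  // one access per bank
      return total;
    }
    ```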

  • Libraries

    With libraries, you can reuse functions without knowing the underlying hardware design or implementation details. Libraries let you take advantage of optimized designs developed by others. You can also develop your own libraries for reuse or sharing with others.