Visible to Intel only — GUID: GUID-617887EB-37BA-4BEC-A173-4B1D3AB21165
Minimize the Use of Expensive Functions
Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.
The following functions are expensive:
- Integer division and modulo (remainder) operators
- Most floating-point operators except addition, multiplication, absolute value, and comparison. For more information about optimizing floating-point operations, refer to the Optimize Floating-point Operation section.
- Atomic operations. For more information, refer to the Memory Model and Atomics chapter in the Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL* book and the Atomic Operations topic in the SYCL* 2020 specification.
In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.
The following functions are inexpensive:
- Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
- Logical operations with one constant argument
- Shift by constant
- Integer multiplication and division by a constant that is a power of two
If an expensive function produces a new result for every work item in a work group, it is beneficial to implement it in the kernel.
In contrast, the following code example shows an expensive floating-point operation (division) that every work item in the NDRange executes even though its result never changes:
// This function is called from kernel code.
void myKernel(accessor<int,
                       access::mode::read,
                       access::target::global_buffer> a,
              accessor<int,
                       access::mode::write,
                       access::target::global_buffer> b,
              sycl::id<1> wiID,
              const float c,
              const float d)
{
  // Inefficient: each work item must calculate c divided by d.
  b[wiID] = a[wiID] * (c / d);
}
The result of this calculation is always the same. To avoid this redundant, hardware-intensive operation, perform the calculation once in the host application and then pass the result to the kernel as an argument for all work items in the NDRange to use. The modified code is shown in the following:
void myKernel(accessor<int,
                       access::mode::read,
                       access::target::global_buffer> a,
              accessor<int,
                       access::mode::write,
                       access::target::global_buffer> b,
              sycl::id<1> wiID,
              const float c_divided_by_d)
{
  /* The host calculates c divided by d once and passes it into
     the kernel to avoid redundant, expensive calculations. */
  b[wiID] = a[wiID] * c_divided_by_d;
}
The Intel® oneAPI DPC++/C++ Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and shares the result across all work items. In the first code example, because the division of c by d remains constant across all work items, the compiler creates a single divider block shared by all of them. This optimization helps minimize the amount of redundant hardware.
However, implementing even a single integer division consumes a significant amount of hardware resources. In this case, it is beneficial to offload the division operation to the host processor and then pass the result to the kernel as an argument to conserve hardware resources.