3.3. Optimizing Floating-Point Operations
Tree Balancing
Order-of-operations rules apply in the OpenCL™ language. In the following example, the offline compiler performs multiplications and additions in a strict order, beginning with the operations within the innermost parentheses:
result = (((A * B) + C) + (D * E)) + (F * G);
By default, the offline compiler creates an implementation that resembles a long vine for such computations, with each floating-point adder feeding the next one in a single chain.
Long, unbalanced operations lead to more expensive hardware. A more efficient hardware implementation is a balanced tree.
In a balanced tree implementation, the offline compiler converts the long vine of floating-point adders into a tree pipeline structure. The offline compiler does not perform tree balancing of floating-point operations automatically because regrouping the operations changes where rounding occurs, so the results might differ from those of the original expression. For this reason, the optimization is inconsistent with IEEE Standard 754-2008.
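The effect of regrouping can be reproduced on a host CPU. The following is a minimal sketch in plain C, not output of the offline compiler: the function names are hypothetical, the balanced grouping ((A * B) + C) + ((D * E) + (F * G)) is only one assumed way to balance the example expression, and the volatile temporaries are an assumption used to force a single-precision rounding after every operation.

```c
/* Each helper stores its result in a volatile float so that the value is
   rounded to single precision after every operation. This is an assumption
   made to mimic a chain of discrete single-precision operators. */
static float mul_f32(float x, float y) { volatile float r = x * y; return r; }
static float add_f32(float x, float y) { volatile float r = x + y; return r; }

/* Source order: (((A * B) + C) + (D * E)) + (F * G), a "vine" of adders. */
float eval_vine(float a, float b, float c, float d, float e, float f, float g)
{
    float t = add_f32(mul_f32(a, b), c);   /* (A * B) + C          */
    t = add_f32(t, mul_f32(d, e));         /* ... + (D * E)        */
    return add_f32(t, mul_f32(f, g));      /* ... + (F * G)        */
}

/* Assumed balanced grouping: ((A * B) + C) + ((D * E) + (F * G)). */
float eval_tree(float a, float b, float c, float d, float e, float f, float g)
{
    float left  = add_f32(mul_f32(a, b), c);              /* left subtree  */
    float right = add_f32(mul_f32(d, e), mul_f32(f, g));  /* right subtree */
    return add_f32(left, right);                          /* root adder    */
}
```

With the deliberately extreme inputs A = 1e4, B = 1e4, C = 1, D = -1e4, E = 1e4, F = 1, G = 1, eval_vine returns 1.0 and eval_tree returns 0.0, while the exact mathematical result is 2. Typical kernel data produces far smaller discrepancies, but the example shows why the two groupings cannot be treated as interchangeable under IEEE 754 rules.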
If you want the offline compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the -fp-relaxed option in the aoc command, as shown below:
aoc -fp-relaxed <your_kernel_filename>.cl
Rounding Operations
The balanced tree implementation of a floating-point operation includes multiple rounding operations, which can require a significant amount of hardware resources in some applications. The offline compiler does not reduce the number of rounding operations automatically because doing so produces results that differ from those required by IEEE Standard 754-2008.
You can reduce the amount of hardware necessary to implement floating-point operations with the -fpc option of the aoc command. If your program can tolerate small differences in floating-point results, invoke the following command:
aoc -fpc <your_kernel_filename>.cl
The -fpc option directs the offline compiler to perform the following tasks:
- Remove floating-point rounding operations and conversions whenever possible.
If possible, the -fpc option directs the offline compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.
- Carry additional mantissa bits to maintain precision.
The offline compiler carries additional precision bits through the floating-point calculations, and removes these precision bits at the end of the tree of floating-point operations.
This type of optimization results in hardware that performs a fused floating-point operation, a feature of many modern hardware processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in newer processor architectures. The offline compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.