Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public


Conversion Rules for ap_float

You can convert between different sizes of ap_float data types through assignment or by using the convert_to() function. For example:

using namespace ihc;
ap_float<8, 32> myFloat = ...;

// Assignment: uses the rounding rules defined by the target ap_float type
ap_float<3, 18> myFloat2 = myFloat;

// convert_to(): uses the rounding mode passed in the function call
ap_float<3, 18> myFloat3 = myFloat.convert_to<3, 18, ihc::fp_config::FP_Round::RZERO>();

To convert between native types (for example, float or double) and ap_float data types, assign from one type to the other. Type conversion in an assignment follows the rules in Table 1.
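Table 1 describes the float conversion as a bit cast to ap_float<8, 23>. The following is a minimal sketch of the bit layout that this implies, written with native C++ only. The Fields struct and the split_float() helper are hypothetical names introduced for illustration and are not part of the ap_float API. The sketch assumes that float is an IEEE 754 32-bit type.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three bit fields of an ap_float<8, 23>-style value.
struct Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 explicit fraction bits
};

// Bit-casts a native float to a 32-bit integer and splits it into fields.
// std::memcpy reinterprets the bits without invoking undefined behavior.
Fields split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

Because the bit cast changes no bits, any rounding in a float-to-ap_float assignment comes only from the subsequent ap_float-to-ap_float conversion.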

For two ap_float variables in a binary operation, the variable with the larger exponent bit width is considered to have the larger type. If both variables have the same exponent bit width, the variable with the larger mantissa bit width is considered to have the larger type. Both operands are then unified to the larger type before the binary operation occurs.

Native floating-point data types and ap_float data types are converted to ap_float data types according to the rules in Table 1.
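The unification rule can be sketched at compile time as follows. This is a minimal model and not the ap_float implementation: Fmt and Unified are hypothetical names, and representing native float and double as Fmt<8, 23> and Fmt<11, 52> is an assumption drawn from Table 1.

```cpp
#include <type_traits>

// Hypothetical stand-in for an ap_float<E, M> format; not the real ihc::ap_float.
template <int E, int M>
struct Fmt {
    static constexpr int exponent_bits = E;
    static constexpr int mantissa_bits = M;
};

// Selects the "larger" format: the wider exponent wins, and on an exponent
// tie the wider mantissa wins.
template <class A, class B>
using Unified = typename std::conditional<
    (A::exponent_bits > B::exponent_bits) ||
        (A::exponent_bits == B::exponent_bits &&
         A::mantissa_bits >= B::mantissa_bits),
    A, B>::type;

// Native types expressed in the formats that Table 1 associates with them
// (an assumption of this sketch).
using FloatFmt = Fmt<8, 23>;
using DoubleFmt = Fmt<11, 52>;

// The wider exponent decides, even when the other mantissa is wider.
static_assert(std::is_same<Unified<Fmt<4, 40>, FloatFmt>, FloatFmt>::value,
              "exponent width takes priority over mantissa width");
```

In this model, unifying Fmt<4, 40> with Fmt<8, 23> selects the 23-bit mantissa, so the operand with the wider mantissa can lose precision before the operation runs.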

The Intel® oneAPI DPC++/C++ Compiler also provides some operations that leave the precision of input types untouched and provide control over the output precision. For more details, refer to Operations with Explicit Precision Controls.

Table 1. Default Conversion Rules for ap_float Variables

| Data Type | From ap_float to Data Type | From Data Type to ap_float |
|---|---|---|
| ap_float with higher representable range | Keep exponent equivalent. The mantissa is rounded according to the rounding mode of the target ap_float (with the higher representable range). | ±Inf if the source of the conversion is out of the representable range. Otherwise, keep exponent equivalent. The mantissa is rounded according to the rounding mode of the target ap_float (with the smaller representable range). |
| float | Convert the original ap_float to ap_float<8, 23> with the ap_float rule above, and then bit cast to float. | Bit cast float to ap_float<8, 23>, and then convert to the target ap_float precision using the ap_float-to-ap_float rules above. |
| double | Convert the original ap_float to ap_float<11, 52> with the ap_float rule above, and then bit cast to double. | Bit cast double to ap_float<11, 52>, and then convert to the target ap_float precision using the ap_float-to-ap_float rules above. |
| long double (emulation only, Linux only) | Convert the original ap_float to ap_float<15, 63> with the ap_float rule above, and then insert an explicit 1 bit at the MSB of the fraction bits to get an approximate equivalent of the 80-bit representation of a long double. | Drop the explicit 1 fraction bit to convert the long double to a 79-bit ap_float<15, 63>. |
| C++ native integer types | Truncate towards zero. Converting an ap_float value that is outside the range of the integer type is undefined behavior. | Round to the nearest, ties to even. If the integer value is too large, the ap_float value saturates to plus infinity. |
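The ap_float-to-ap_float rule in the first row of the table (round the mantissa, return ±Inf when the value is out of range) can be modeled in software as shown below. This is a rough sketch under stated assumptions, not the ap_float implementation: quantize() and the Round enum are hypothetical names, the format is assumed to use an IEEE-style exponent bias, out-of-range results become ±Inf in both rounding modes (following the table wording), and underflow and subnormal values are not modeled.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

enum class Round { RNE, RZERO };  // hypothetical names for this sketch

// Models rounding a double to a format with exp_bits exponent bits and
// man_bits explicit mantissa bits. Assumes the default floating-point
// environment, in which std::nearbyint rounds to nearest, ties to even.
double quantize(double x, int exp_bits, int man_bits, Round mode) {
    assert(exp_bits >= 2 && exp_bits <= 11 && man_bits >= 1 && man_bits <= 52);
    if (x == 0.0 || !std::isfinite(x)) return x;

    int exp = 0;
    const double frac = std::frexp(x, &exp);  // x == frac * 2^exp, 0.5 <= |frac| < 1
    // Scale so that the bits to keep (implicit 1 plus man_bits) form the integer part.
    const double scaled = std::ldexp(frac, man_bits + 1);
    const double kept =
        (mode == Round::RZERO) ? std::trunc(scaled) : std::nearbyint(scaled);
    const double result = std::ldexp(kept, exp - (man_bits + 1));

    // Largest finite value of the modeled format: (2 - 2^-man_bits) * 2^bias.
    const int bias = (1 << (exp_bits - 1)) - 1;
    const double max_finite =
        std::ldexp(2.0 - std::ldexp(1.0, -man_bits), bias);
    if (std::fabs(result) > max_finite)
        return std::copysign(std::numeric_limits<double>::infinity(), x);
    return result;
}
```

For example, with one mantissa bit, quantize(1.75, 8, 1, Round::RZERO) yields 1.5 while quantize(1.75, 8, 1, Round::RNE) yields 2.0, and quantize(100.0, 3, 18, Round::RNE) yields plus infinity because the modeled 3-bit exponent cannot represent values of 16 or more.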

NOTE:

Avoid assigning the result of the convert_to() function to an ap_float variable whose exponent or mantissa width differs from the widths specified in the convert_to() call. If the left-hand side of the assignment has different widths than the right-hand side, a second conversion can occur on assignment, and that conversion follows the rounding rules of the left-hand-side type rather than the rounding mode passed to convert_to().