DSP Builder for Intel® FPGAs (Advanced Blockset): Handbook
11.2. DSP Builder Supported Floating-Point Data Types
Type Name | Sign Width s | Exponent Width e | Exponent Bias b | Mantissa Width m | Description |
---|---|---|---|---|---|
float16_m7 | 1 | 8 | 127 | 7 | Bfloat16 |
float16_m10 | 1 | 5 | 15 | 10 | Half-precision (IEEE 754-2008) |
float19_m10 | 1 | 8 | 127 | 10 | Also known as TF32 |
float26_m17 | 1 | 8 | 127 | 17 | |
float32_m23 | 1 | 8 | 127 | 23 | Single-precision IEEE 754 |
float35_m26 | 1 | 8 | 127 | 26 | |
float46_m35 | 1 | 10 | 511 | 35 | |
float55_m44 | 1 | 10 | 511 | 44 | |
float64_m52 | 1 | 11 | 1023 | 52 | Double-precision IEEE 754 |
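As a quick sanity check on the table, the following Python sketch copies the exponent width, bias, and mantissa width of each format into a dictionary and confirms two regularities: the number in each type name equals 1 + e + m, and every bias equals 2^(e-1) - 1. The names `FORMATS` and `total_width` are illustrative only and are not part of DSP Builder.

```python
# (exponent width, exponent bias, mantissa width), copied from the table.
# FORMATS and total_width are hypothetical helper names for this sketch.
FORMATS = {
    "float16_m7": (8, 127, 7),
    "float16_m10": (5, 15, 10),
    "float19_m10": (8, 127, 10),
    "float26_m17": (8, 127, 17),
    "float32_m23": (8, 127, 23),
    "float35_m26": (8, 127, 26),
    "float46_m35": (10, 511, 35),
    "float55_m44": (10, 511, 44),
    "float64_m52": (11, 1023, 52),
}


def total_width(name):
    """Total bit width: one sign bit plus the exponent and mantissa fields."""
    e_width, _bias, m_width = FORMATS[name]
    return 1 + e_width + m_width


for name, (e_width, bias, m_width) in FORMATS.items():
    # The digits between "float" and "_" in the type name give the total width.
    declared = int(name[len("float"):name.index("_")])
    assert total_width(name) == declared
    # Every bias in the table follows the usual IEEE 754 rule.
    assert bias == 2 ** (e_width - 1) - 1
```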
DSP Builder represents the special values positive zero, negative zero, subnormals, infinities, and not a number (NaN) in the standard IEEE 754 manner, namely:
- zero is m=0 and e=0 with s giving the sign.
- subnormal is m != 0 and e=0 with s giving the sign.
- infinity is m=0 and e=all ones with s giving the sign.
- not a number (NaN) is m != 0 and e=all ones.
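The four rules above can be sketched as a small Python classifier. This is an illustration under stated assumptions, not DSP Builder code: the function name `classify` and its parameters are hypothetical, and the encoding is assumed to be packed as sign, then exponent, then mantissa, from most to least significant bit.

```python
def classify(bits, e_width, m_width):
    """Classify an encoding as 'zero', 'subnormal', 'infinity', 'nan', or 'normal'.

    bits is the raw encoding as an unsigned integer; e_width and m_width are
    the exponent and mantissa field widths of the format. The sign bit does
    not affect the classification.
    """
    m = bits & ((1 << m_width) - 1)               # mantissa field
    e = (bits >> m_width) & ((1 << e_width) - 1)  # exponent field
    e_all_ones = (1 << e_width) - 1

    if e == 0:
        return "zero" if m == 0 else "subnormal"
    if e == e_all_ones:
        return "infinity" if m == 0 else "nan"
    return "normal"
```

For example, with the float32_m23 widths (8 and 23), `classify(0x7F800000, 8, 23)` returns `"infinity"` and `classify(0x80000000, 8, 23)` returns `"zero"` (negative zero).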
Except for the preceding special values, the numerical value of a float type is given in terms of its bit-wise representation by:
f = (-1)^s × 2^(e-b) × (1 + (m / (2^m_width)))
where:
- s, e, and m are the unsigned integer values of the respective bit fields
- the field widths for each of s, e, and m and the value of b are given for each format in the table; m_width is the mantissa width
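The formula can be sketched in Python as a generic decoder for any format in the table. The function name `decode_normal` and its parameter names are assumptions made for this illustration. The sketch assumes the fields are packed as sign, exponent, mantissa from the most significant bit down, and it rejects the special encodings because the formula does not cover them.

```python
def decode_normal(bits, e_width, bias, m_width):
    """Evaluate f = (-1)^s * 2^(e-b) * (1 + m / 2^m_width) for a normal value.

    bits is the raw encoding as an unsigned integer; e_width, bias, and
    m_width come from the format's row in the table.
    """
    m = bits & ((1 << m_width) - 1)               # mantissa field
    e = (bits >> m_width) & ((1 << e_width) - 1)  # exponent field
    s = (bits >> (m_width + e_width)) & 1         # sign bit

    # e == 0 (zero, subnormal) and e == all ones (infinity, NaN) are the
    # special values, which the formula does not describe.
    if e == 0 or e == (1 << e_width) - 1:
        raise ValueError("special value: the formula does not apply")

    return (-1) ** s * 2.0 ** (e - bias) * (1 + m / 2 ** m_width)
```

For instance, `decode_normal(0x4030, 8, 127, 7)` decodes the float16_m7 (Bfloat16) encoding 0x4030 and `decode_normal(0xC000, 5, 15, 10)` decodes the float16_m10 encoding 0xC000.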
For example, for a 32-bit single-precision floating-point number with a bit-wise representation of 0x40300000:
- s = 0b = 0
- e = 10000000b = 128
- m = 01100000000000000000000b = 3145728
then:
f = (-1)^0 × 2^(128-127) × (1+(3145728/(2^23))) = 1 × 2 × (1+0.375) = 2.75
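The worked example can be cross-checked in Python. The sketch below extracts the three fields of 0x40300000 with shifts and masks, evaluates the formula, and compares the result with the interpretation of the same four bytes by the standard `struct` module as an IEEE 754 single-precision value. The variable names are chosen for this illustration only.

```python
import struct

bits = 0x40300000

# Field extraction for float32_m23: 1 sign bit, 8 exponent bits, 23 mantissa bits.
s = bits >> 31
e = (bits >> 23) & 0xFF
m = bits & 0x7FFFFF

# The formula with b = 127 and m_width = 23.
f = (-1) ** s * 2 ** (e - 127) * (1 + m / 2 ** 23)

# Reference: reinterpret the same bytes as a big-endian IEEE 754 single.
reference = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(s, e, m, f, reference)  # prints: 0 128 3145728 2.75 2.75
```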