DSP Builder for Intel® FPGAs (Advanced Blockset): Handbook

ID 683337
Date 5/27/2022
Public


11.2. DSP Builder Supported Floating-Point Data Types

The supported floating-point types are either IEEE 754 formats (half, single, and double precision) or custom IEEE 754-like formats with user-specified exponent and fraction-field widths.
Type Name     Sign Width s   Exponent Width e   Exponent Bias b   Mantissa Width m   Description
float16_m7    1              8                  127               7                  Bfloat16
float16_m10   1              5                  15                10                 Half-precision IEEE 754-2008
float19_m10   1              8                  127               10                 Also known as TF32
float26_m17   1              8                  127               17
float32_m23   1              8                  127               23                 Single-precision IEEE 754
float35_m26   1              8                  127               26
float46_m35   1              10                 511               35
float55_m44   1              10                 511               44
float64_m52   1              11                 1023              52                 Double-precision IEEE 754
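The values in the table follow two regularities that a short script can check: the number in each type name equals 1 + e + m (one sign bit plus the exponent and mantissa widths), and each bias equals 2^(e-1) - 1. The following is a minimal sketch of that check. The names FORMATS, total_width, and exponent_bias are illustrative only and are not part of DSP Builder, and the bias relationship is an observation drawn from the table values rather than a rule stated in this handbook.

```python
# Exponent and mantissa widths copied from the table above.
# The sign width is assumed to be 1 bit for every format.
FORMATS = {
    "float16_m7": (8, 7),
    "float16_m10": (5, 10),
    "float19_m10": (8, 10),
    "float26_m17": (8, 17),
    "float32_m23": (8, 23),
    "float35_m26": (8, 26),
    "float46_m35": (10, 35),
    "float55_m44": (10, 44),
    "float64_m52": (11, 52),
}


def total_width(exp_width, man_width):
    """Total bit width: one sign bit plus the exponent and mantissa fields."""
    return 1 + exp_width + man_width


def exponent_bias(exp_width):
    """Bias that matches every row of the table: 2^(e-1) - 1."""
    return (1 << (exp_width - 1)) - 1


for name, (e_w, m_w) in FORMATS.items():
    print(name, total_width(e_w, m_w), exponent_bias(e_w))
```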

DSP Builder represents the special values (positive zero, negative zero, subnormals, infinities, and non-numbers) in the standard IEEE 754 manner, namely:

  • zero is m=0 and e=0 with s giving the sign.
  • subnormal is m != 0 and e=0 with s giving the sign.
  • infinity is m=0 and e=all ones with s giving the sign.
  • not a number (NaN) is m != 0 and e=all ones.
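The rules above can be sketched as a small classifier over a raw bit pattern. This is an illustration only, not DSP Builder code: the function name classify is hypothetical, and the sketch assumes the usual field order (sign in the most significant bit, then the exponent, then the mantissa in the least significant bits).

```python
def classify(bits, exp_width, man_width):
    """Return (sign, kind) for a raw bit pattern.

    Assumed layout, most significant bit first: sign, exponent, mantissa.
    """
    m = bits & ((1 << man_width) - 1)
    e = (bits >> man_width) & ((1 << exp_width) - 1)
    s = (bits >> (exp_width + man_width)) & 1
    all_ones = (1 << exp_width) - 1

    if e == 0:
        kind = "zero" if m == 0 else "subnormal"
    elif e == all_ones:
        kind = "infinity" if m == 0 else "nan"
    else:
        kind = "normal"
    return s, kind


# float32_m23 pattern with e = all ones and m = 0
print(classify(0x7F800000, 8, 23))  # (0, 'infinity')
```

The same function applies to any format in the table by passing that format's exponent and mantissa widths, for example classify(0x7C00, 5, 10) for float16_m10.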

Except for the preceding special values, the numerical value of a float type is given in terms of its bit-wise representation by:

f = (-1)^s × 2^(e-b) × (1 + (m / (2^m_width)))

where:

  • s, e, and m are the base-10 equivalents of the respective bit sequences
  • m_width is the mantissa width
  • the field widths for each of s, e, and m and the value of b are given for each format in the table

For example, for a 32-bit single-precision floating-point number with a bit-wise representation of 0x40300000:

s = 0b                            = 0
e = 10000000b                     = 128
m = 01100000000000000000000b      = 3145728

then:

f = (-1)^0 × 2^(128-127) × (1+(3145728/(2^23)))

  = 1 × 2 × (1+0.375)

  = 2.75
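The worked example can be reproduced with a few lines of Python. This is a minimal sketch that handles normal values only (not zero, subnormal, infinity, or NaN). The function name decode_normal is hypothetical, and the field order (sign, exponent, mantissa from the most significant bit down) is assumed to match the example above.

```python
import struct


def decode_normal(bits, exp_width, bias, man_width):
    """Evaluate f = (-1)^s * 2^(e-b) * (1 + m / 2^m_width) for a normal value."""
    m = bits & ((1 << man_width) - 1)
    e = (bits >> man_width) & ((1 << exp_width) - 1)
    s = (bits >> (exp_width + man_width)) & 1
    return (-1) ** s * 2.0 ** (e - bias) * (1 + m / (1 << man_width))


# float32_m23: exponent width 8, bias 127, mantissa width 23
value = decode_normal(0x40300000, 8, 127, 23)
print(value)  # 2.75

# Cross-check against the host's own single-precision decoding
reference = struct.unpack(">f", struct.pack(">I", 0x40300000))[0]
print(value == reference)  # True
```

Passing the widths and bias of another row in the table decodes that format instead. For example, decode_normal(0x4030, 8, 127, 7) evaluates the float16_m7 pattern formed by the upper 16 bits of the same word and also gives 2.75.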