Variable Precision DSP Blocks User Guide: Agilex™ 5 FPGAs and SoCs

ID 813968
Date 9/20/2024
Public
Document Table of Contents

2.3.4.1. Fixed-point to 32-bit Floating-point Conversion Examples

The following examples show the conversion algorithm of the fixed-point to floating-point converter.

Example 1: Convert 20-bit fixed-point dot product vector of 00010011111010101010 (81578 decimal value) to 32-bit floating-point. The exponent is adjusted by the shared_exponent[7:0] values.
  1. The most significant bit in the 20-bit fixed-point dot product vector result represents the sign of the number. In this case, the number is 0 and it represents a positive number.
  2. The floating-point format of the number = 00010011111010101010*20 or 0.0010011111010101010*219.
  3. Next, the value is shifted by 3 bits because it has three leading 0s. The exponent value is adjusted accordingly. The normalized result = 1.0011111010101010000*216.
  4. The result is then converted into 32-bit floating-point with the following formula:
    1. Exponent = 19-bit left shift value + shared_exponent[7:0] value - bias =143
    2. Mantissa = {0011111010101010000, 4'b0000}
    3. 32-bit floating-point operand = 0_10001111_00111110101010100000000
Example 2: Convert 20-bit fixed-point dot product vector of 11111010101010000000 (-21888 decimal value) to 32-bit floating-point. The exponent is adjusted by the shared_exponent[7:0] values. In this example, the shared_exponent[7:0] values is 0.
  1. The most significant bit in the 20-bit fixed-point dot product vector result represents the sign of the number. In this case, the number is 1 and it represents a negative number.
  2. The DSP block converts the number to a sign-magnitude format using the bitwise inversion method. The result is 00000101010110000000*20 or 0.0000101010110000000 * 219.
  3. Next, the value is shifted by 5 bits because it has five leading 0s. The exponent value is adjusted accordingly. The normalized result = 1.0101011000000000000*214.
  4. The result is then converted into 32-bit floating-point with the following formula:
    1. Exponent = 19-bit left shift value + shared_exponent[7:0] value - bias = 141
    2. Mantissa = {0101011000000000000, 4'b0000}
    3. 32-bit floating-point operand = 1_10001101_01010110000000000000000