Variable Precision DSP Blocks User Guide: Agilex™ 5 FPGAs and SoCs

ID 813968
Date 4/01/2024

3.3.1. Tensor Floating-point Mode

In tensor floating-point mode, two columns of 80-bit weights and 8-bit shared exponents can be preloaded into the ping-pong buffers using one of the following methods:
  • Data input feed
  • Side input feed

The ping-pong buffers load the data into the two DOT product vector engines, which calculate the signed 20-bit fixed-point DOT product vectors simultaneously. Each DOT product supports 10 signed 8x8 multiplications. Next, the fixed-point to 32-bit floating-point converter converts the output of each DOT product into a 32-bit floating-point operand that is adjusted by the shared_exponent[7:0] values. Then, the accumulator adds each of the two 32-bit floating-point values to either the data input from the cascade_data_in[63:0] bus or the previous cycle's accumulation value. The accumulator outputs the data in FP32 data format to the core fabric or to the next DSP block in the chain through the cascade_data_out[63:0] bus.
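For orientation, one column of this datapath can be modeled behaviorally. The following Python sketch is illustrative only: the function name tensor_fp_column is not from this guide, and it assumes the shared-exponent sum is applied as an unbiased power-of-two scale factor during the fixed-point to FP32 conversion.

import math

def tensor_fp_column(data, weights, shared_exponent_in, shared_exponent_data,
                     cascade_in=None, acc_prev=0.0):
    # data and weights are ten signed 8-bit integers each (-128..127).
    assert len(data) == 10 and len(weights) == 10
    # Signed 20-bit fixed-point DOT product of 10 signed 8x8 multiplications.
    dot = sum(d * w for d, w in zip(data, weights))
    # Fixed-point to 32-bit floating-point conversion, adjusted by the
    # shared exponents (the exponent bias handling is an assumption here).
    fp32 = math.ldexp(float(dot), shared_exponent_in + shared_exponent_data)
    # Add either the cascade input or the previous cycle's accumulation value.
    base = cascade_in if cascade_in is not None else acc_prev
    return base + fp32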

Table 25.  Tensor Floating-point Mode Equations for Data Input Feed Method

Input operands: 10-element signed 8x8

Cascade input enabled:

Column one = column one cascade_data_in[31:0] + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (shared_exponent_in[7:0] + shared_exponent_data[7:0]))

Column two = column two cascade_data_in[63:32] + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (shared_exponent_in[7:0] + shared_exponent_data[7:0]))

Accumulator enabled:

Column one = column one accumulator result + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (shared_exponent_in[7:0] + shared_exponent_data[7:0]))

Column two = column two accumulator result + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (shared_exponent_in[7:0] + shared_exponent_data[7:0]))

Note: b1, b2, b3, b4, b5, b6, b7, b8, b9, and b10 are fed into the loading buffer with the same bandwidth as data_in_1[7:0] through data_in_10[7:0]. shared_exponent_data is fed into the loading buffer through shared_exponent_in[7:0].
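As a worked illustration of the column-one equation with the cascade input enabled, the behavioral sketch above evaluates as follows (the input values and the exponent handling are illustrative assumptions, not values from this guide):

# Column one, cascade input enabled: all data_in_x = 1, all bx = 2,
# shared_exponent_in = 0, shared_exponent_data = 1, cascade_data_in[31:0] = 4.0.
col1 = tensor_fp_column(data=[1] * 10, weights=[2] * 10,
                        shared_exponent_in=0, shared_exponent_data=1,
                        cascade_in=4.0)
# dot = 10 * (1 * 2) = 20; FP32 value = 20 * 2**(0 + 1) = 40.0
# col1 = 4.0 + 40.0 = 44.0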

Table 26.  Tensor Floating-point Mode Equations for Side Input Feed Method

Input operands: 10-element signed 8x8

Cascade input enabled:

Column one = column one cascade_data_in[31:0] + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (side_in_2[7:0] + shared_exponent_data[7:0]))

Column two = column two cascade_data_in[63:32] + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (side_in_2[7:0] + shared_exponent_data[7:0]))

Accumulator enabled:

Column one = column one accumulator result + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (side_in_2[7:0] + shared_exponent_data[7:0]))

Column two = column two accumulator result + 32-bit floating-point conversion of (data_in_1[7:0]*b1 + data_in_2[7:0]*b2 + data_in_3[7:0]*b3 + data_in_4[7:0]*b4 + data_in_5[7:0]*b5 + data_in_6[7:0]*b6 + data_in_7[7:0]*b7 + data_in_8[7:0]*b8 + data_in_9[7:0]*b9 + data_in_10[7:0]*b10, (side_in_2[7:0] + shared_exponent_data[7:0]))

Note: b1, b2, b3, b4, and b5 are fed in by shifting side_in_1[7:0]; b6, b7, b8, b9, and b10 are fed in by shifting side_in_2[7:0].
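In the side input feed method, the weights reach the loading buffer serially rather than over the data bus. A minimal sketch of that shifting, assuming one 8-bit weight per side input per cycle and the b1 through b10 ordering given in the note above (the function name and ordering are assumptions, not from this guide):

def shift_in_weights(side_in_1_stream, side_in_2_stream):
    # side_in_1[7:0] shifts in b1..b5; side_in_2[7:0] shifts in b6..b10.
    b_low, b_high = [], []
    for w1, w2 in zip(side_in_1_stream, side_in_2_stream):
        b_low.append(w1)
        b_high.append(w2)
    return b_low + b_high  # b1..b10 for one column of the ping-pong buffer

weights = shift_in_weights([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
# weights == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; after five cycles the preloaded
# column can be swapped in while the other ping-pong buffer keeps computing.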

Figure 54. Tensor Floating-point Mode One Column Datapath