DSP Builder for Intel® FPGAs (Advanced Blockset): Handbook

ID 683337
Date 9/30/2024
Public

6.7.12. Single-Precision Complex Floating-Point Matrix Multiply

This design example uses a flow control style similar to that of the floating-point Mandelbrot set design example. It uses a limited number of multiply-adds, set by the vector size, to perform a single-precision complex matrix multiply.

A matrix multiplication must compute the dot product of a row and a column for each output element. For 8×8 matrices A and B:


Equation 1. Matrix Multiply Equation

$$(AB)_{ij} = \sum_{k=1}^{8} A_{ik} B_{kj}, \qquad i, j = 1, \dots, 8$$


If latency is not a concern, you can accumulate adjacent partial results, or build adder trees, directly. To implement the design with a smaller dot product, however, consider folding the resource usage: use a smaller number of multipliers rather than performing everything in parallel. To do so, split the loop over k into smaller chunks, then reorder the calculations so that accumulations into the same output element are no longer adjacent.
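The folding described above can be sketched in software. The following Python model is an illustration only, not the DSP Builder design: the function names `folded_matmul` and `reference_matmul` are hypothetical, Python's built-in `complex` type is double precision rather than single precision, and the sketch assumes that `vector_size` divides the matrix dimension. The loop over chunks of k is outermost, so each pass visits every output element once and two updates of the same accumulator are separated by a full pass rather than being adjacent.

```python
def folded_matmul(a, b, vector_size):
    """Multiply square matrices with a dot product only vector_size wide."""
    n = len(a)
    if n % vector_size != 0:
        raise ValueError("this sketch assumes vector_size divides N")
    acc = [[0j] * n for _ in range(n)]
    # Outer loop over chunks of k: consecutive steps write to different
    # output elements, so no accumulator is updated on adjacent steps.
    for k0 in range(0, n, vector_size):
        for i in range(n):
            for j in range(n):
                # One use of the shared, vector_size-wide dot product.
                partial = sum(a[i][k] * b[k][j]
                              for k in range(k0, k0 + vector_size))
                acc[i][j] += partial
    return acc


def reference_matmul(a, b):
    """Unfolded form: one full-length dot product per output element."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For an 8×8 multiply with `vector_size=4`, each output element receives two partial dot products, 64 steps apart, and the result matches the unfolded form.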

A traditional implementation of a matrix multiply design is structured around a delay line and an adder tree:

A11B11 + A12B21 + A13B31, and so on.

The traditional implementation has the following features:

  • The length and size grow with the folding size (typically 8 to 12)
  • Uses adder trees of 7 to 10 adders that are only used once every 10 cycles
  • Each matrix size needs a different length, so you must provide for the worst case

A better implementation uses FIFO buffers to provide self-timed control: new data is accumulated only when both FIFO buffers have data. This implementation has the following advantages:

  • Runs as fast as possible
  • Is not sensitive to the latency of the dot product for a given device or fMAX
  • Is not sensitive to matrix size (hardware just stalls for small N)
  • Can respond to back pressure, which stops the FIFO buffers from emptying and feeds their full status back to the control logic
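The self-timed rule can be illustrated with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the contents of the model file: the text does not say what each FIFO buffer holds, so the sketch assumes one stream of A operands and one stream of B operands, and the names `self_timed_accumulate`, `a_times`, `b_times`, and `stalled` are hypothetical. Arrival times are assumed to be non-decreasing.

```python
from collections import deque


def self_timed_accumulate(a_items, b_items, a_times, b_times,
                          stalled=frozenset()):
    """Accumulate a_items[n] * b_items[n], consuming only when both FIFOs hold data.

    a_times / b_times give the cycle on which each item arrives (assumed sorted).
    stalled is the set of cycles on which downstream back pressure is asserted.
    """
    pending_a = deque(zip(a_times, a_items))
    pending_b = deque(zip(b_times, b_items))
    fifo_a, fifo_b = deque(), deque()
    acc = 0j
    cycle = 0
    while pending_a or pending_b or (fifo_a and fifo_b):
        # Each producer writes its FIFO as soon as its own data is ready.
        while pending_a and pending_a[0][0] <= cycle:
            fifo_a.append(pending_a.popleft()[1])
        while pending_b and pending_b[0][0] <= cycle:
            fifo_b.append(pending_b.popleft()[1])
        # Self-timed rule: consume only when both FIFOs have data
        # and the downstream logic is not applying back pressure.
        if fifo_a and fifo_b and cycle not in stalled:
            acc += fifo_a.popleft() * fifo_b.popleft()
        cycle += 1
    return acc, cycle
```

Changing the arrival times or the stalled cycles changes only the returned cycle count, not the accumulated value, which is the sense in which the control is insensitive to latency.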

The model file is matmul_CS.mdl.