FPGA AI Suite: PCIe-based Design Example User Guide

ID 768977
Date 3/29/2024
Public


7.2. Hardware

This section describes the Example Design (Arria® 10) in detail. However, many of the components close to the IP are shared with the Example Design (Agilex™ 7).

A top-level view of the design example is shown in Figure 3 (FPGA AI Suite Example Design Top Level).

There are two instances of the FPGA AI Suite IP, shown on the right (dla_top.sv). All communication between the FPGA AI Suite IP instances and the outside occurs via the FPGA AI Suite DMA. The FPGA AI Suite DMA provides a CSR (which also provides interrupt functionality) and reader/writer modules that read from and write to DDR.

The host communicates with the board through PCIe* using the CCI-P protocol. The host can do the following (a minimal host-side sketch follows this list):

  1. Read and write the on-board DDR memory (these reads/writes do not go through FPGA AI Suite).
  2. Read/write to the FPGA AI Suite DMA CSR of both instances.
  3. Receive interrupt signals from the FPGA AI Suite DMA CSR of both instances.
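
As a concrete reference for item 2, the following is a minimal host-side sketch of opening the accelerator over PCIe and reading a DMA CSR of each instance over MMIO using the OPAE C API (the design relies on the OPAE driver, as noted in the AFU ID description later in this section). The AFU GUID and the per-instance CSR base offsets (DLA0_CSR_BASE, DLA1_CSR_BASE) are placeholders for illustration, not values taken from this design.

    /* Minimal sketch: open the accelerator over PCIe and read a DMA CSR of each
     * IP instance. The GUID and CSR base offsets are placeholders, not values
     * from this design. Error handling is reduced to asserts for brevity. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <uuid/uuid.h>
    #include <opae/fpga.h>

    #define EXAMPLE_AFU_GUID "00000000-0000-0000-0000-000000000000" /* placeholder */
    #define DLA0_CSR_BASE    0x20000 /* hypothetical CSR base, IP instance 0 */
    #define DLA1_CSR_BASE    0x30000 /* hypothetical CSR base, IP instance 1 */

    int main(void)
    {
        fpga_properties filter = NULL;
        fpga_token token;
        fpga_handle handle;
        fpga_guid guid;
        uint32_t num_matches = 0;

        assert(uuid_parse(EXAMPLE_AFU_GUID, guid) == 0);

        /* Find the accelerator that reports the expected AFU ID. */
        assert(fpgaGetProperties(NULL, &filter) == FPGA_OK);
        assert(fpgaPropertiesSetObjectType(filter, FPGA_ACCELERATOR) == FPGA_OK);
        assert(fpgaPropertiesSetGUID(filter, guid) == FPGA_OK);
        assert(fpgaEnumerate(&filter, 1, &token, 1, &num_matches) == FPGA_OK);
        assert(num_matches >= 1);

        assert(fpgaOpen(token, &handle, 0) == FPGA_OK);
        assert(fpgaMapMMIO(handle, 0, NULL) == FPGA_OK);

        /* Item 2: read the FPGA AI Suite DMA CSR of both instances.
         * The CSR data width is 32 bits, so 32-bit MMIO accesses are used. */
        uint32_t csr0 = 0, csr1 = 0;
        assert(fpgaReadMMIO32(handle, 0, DLA0_CSR_BASE, &csr0) == FPGA_OK);
        assert(fpgaReadMMIO32(handle, 0, DLA1_CSR_BASE, &csr1) == FPGA_OK);
        printf("instance 0 CSR[0]=0x%08x, instance 1 CSR[0]=0x%08x\n", csr0, csr1);

        fpgaUnmapMMIO(handle, 0);
        fpgaClose(handle);
        fpgaDestroyToken(&token);
        fpgaDestroyProperties(&filter);
        return 0;
    }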

Each FPGA AI Suite IP instance can do the following (an interrupt-handling sketch follows this list):

  1. Read/write to its DDR bank.
  2. Send interrupts to the host through the interrupt interface.
  3. Receive reads/writes to its DMA CSR.
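
Item 2 in this list pairs with item 3 of the previous list: the IP raises an interrupt through its DMA CSR, and the host receives it. A minimal sketch of the host side, again using the OPAE C API, is shown below; it assumes the fpga_handle opened in the earlier sketch and assumes the instance's interrupt is wired to vector 0, which is an illustrative assumption rather than a value from this design.

    /* Minimal sketch: block until one FPGA AI Suite IP instance raises an interrupt.
     * Assumes 'handle' was opened as in the previous sketch and that the instance's
     * interrupt uses vector 0 (an assumption for illustration). */
    #include <assert.h>
    #include <poll.h>
    #include <opae/fpga.h>

    static void wait_for_dla_interrupt(fpga_handle handle)
    {
        fpga_event_handle event;
        int fd = -1;

        assert(fpgaCreateEventHandle(&event) == FPGA_OK);
        /* The last argument selects the interrupt vector. */
        assert(fpgaRegisterEvent(handle, FPGA_EVENT_INTERRUPT, event, 0) == FPGA_OK);
        assert(fpgaGetOSObjectFromEventHandle(event, &fd) == FPGA_OK);

        /* Block until the DMA CSR raises its interrupt. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        assert(poll(&pfd, 1, -1) > 0);

        /* The interrupt status would then be read/cleared through the DMA CSR. */

        fpgaUnregisterEvent(handle, FPGA_EVENT_INTERRUPT, event);
        fpgaDestroyEventHandle(&event);
    }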

From the perspective of the FPGA AI Suite accelerator function (AF), the external connections are to the CCI-P interface running over PCIe* and to the on-board DDR4 memory. The DDR memory is connected directly to the board.qsys block, while the CCI-P interface is converted into Avalon® memory mapped (MM) interfaces in the bsp_logic.sv block for communication with the board.qsys block.

The board.qsys block arbitrates the connections to DDR memory between the reader/writer modules in the FPGA AI Suite IP and reads/writes from the host. Each FPGA AI Suite IP instance in this design has access to only one of the two DDR banks. This design decision implies that no more than two FPGA AI Suite IP instances can exist in the design. Adding another arbiter would relax this restriction and allow more FPGA AI Suite IP instances.

Much of board.qsys operates using the Avalon® MM interface protocol. The FPGA AI Suite DMA uses the AXI protocol, so board.qsys includes Avalon® MM-to-AXI adapters just before each interface is exported from board.qsys (so that, outside the Platform Designer system, it can be connected to the FPGA AI Suite IP). Clock crossings are also handled inside board.qsys. For example, the host interface must be brought to the DDR clock to communicate with the FPGA AI Suite IP CSR.

There are three clock domains: the host clock, the DDR clock, and the FPGA AI Suite IP clock. The PCIe* logic runs on the host clock at 200 MHz. The FPGA AI Suite DMA and the platform adapters run on the DDR clock. The rest of the FPGA AI Suite IP runs on the FPGA AI Suite IP clock.

FPGA AI Suite IP interface protocols (a short parameter calculation follows this list):

  • Readers and writers: AXI4 interface with 512-bit data (width configurable), 32-bit address, and a 16-word maximum burst (fixed).
  • CSR: 32-bit data, 11-bit address.
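
To make the default reader/writer parameters concrete, the short calculation below derives the byte quantities they imply. It assumes the listed 512-bit data width; a different configured width scales the result accordingly.

    /* Derived quantities for the default reader/writer interface parameters. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned data_width_bits = 512; /* configurable; default listed above */
        const unsigned max_burst_words = 16;  /* fixed maximum burst length */

        const unsigned bytes_per_word  = data_width_bits / 8;               /* 64 bytes   */
        const unsigned max_burst_bytes = max_burst_words * bytes_per_word;  /* 1024 bytes */

        printf("bytes per word: %u\n", bytes_per_word);
        printf("max bytes per AXI4 burst: %u\n", max_burst_bytes);
        return 0;
    }
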
Figure 3.  FPGA AI Suite Example Design Top Level
Note: Arrows show host/agent relationships. Clock domains indicated with dashed lines.

The board.qsys block interfaces between DDR memory, the readers/writers, and the host read/write channels. The internals of the board.qsys block are shown in Figure 4, which shows three Avalon® MM interfaces on the left and bottom: MMIO, host read, and host write.

  • Host read is used to read data from DDR memory and send it to the host.
  • Host write is used to transfer data from the host into DDR memory.
  • The MMIO interface performs several functions:
    • Initiating DDR read and write transactions from the host.
    • Reading from the AFU ID block. The AFU ID block identifies the AFU with a unique identifier and is required for the OPAE driver (a sketch that reads these registers follows this list).
    • Reading/writing the DLA DMA CSRs, where each instance has its own CSR base address.
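
The AFU ID block is what the OPAE runtime matches against the GUID supplied during enumeration. The sketch below reads it through the MMIO interface, assuming the handle opened in the earlier sketch. The offsets follow the usual OPAE AFU convention (device feature header at 0x0, the 128-bit AFU ID at 0x8 and 0x10); they are not quoted from this guide.

    /* Minimal sketch: read the AFU ID block over the MMIO interface.
     * Offsets follow the common OPAE AFU convention, not values quoted in this guide. */
    #include <assert.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <opae/fpga.h>

    static void print_afu_id(fpga_handle handle)
    {
        uint64_t dfh = 0, afu_id_lo = 0, afu_id_hi = 0;

        assert(fpgaReadMMIO64(handle, 0, 0x00, &dfh) == FPGA_OK);       /* device feature header */
        assert(fpgaReadMMIO64(handle, 0, 0x08, &afu_id_lo) == FPGA_OK); /* AFU ID, low 64 bits   */
        assert(fpgaReadMMIO64(handle, 0, 0x10, &afu_id_hi) == FPGA_OK); /* AFU ID, high 64 bits  */

        printf("DFH:    0x%016" PRIx64 "\n", dfh);
        printf("AFU ID: 0x%016" PRIx64 "%016" PRIx64 "\n", afu_id_hi, afu_id_lo);
    }
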
Figure 4. The board.qsys Block, Showing Two DDR Connections and Two IP Instances
Note: Arrows indicate host/agent relationships (from host to agent).

The above figure also shows the ddr_board.qsys block. The three central blocks (an address expander and two msgdma_bbb.qsys scatter-gather DMA instances) allow host direct memory access (DMA) to DDR. This DMA is distinct from the DMA module inside the FPGA AI Suite IP, shown in Figure 3. Host reads and writes begin with the host sending a request via the MMIO interface to initiate a read or write. When requesting a read, the DMA gathers the data from DDR and sends it to the host via the host-read interface. When requesting a write, the DMA reads the data over the host-write interface and subsequently writes it to DDR.
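
The sketch below shows the general shape of this flow for a host-to-DDR write from the host software's point of view: a host buffer is shared with the AF, and the transfer is then requested through the MMIO interface. The register offsets (HOST_DMA_SRC, HOST_DMA_DST, HOST_DMA_LEN, HOST_DMA_START) are hypothetical placeholders; the actual descriptor programming is defined by the msgdma_bbb blocks, and the offsets here are purely illustrative.

    /* Hedged sketch of a host-to-DDR transfer: share a buffer with the AF, then
     * request the transfer via MMIO. All register offsets are hypothetical. */
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>
    #include <opae/fpga.h>

    #define HOST_DMA_SRC   0x1000 /* hypothetical: host (IO) source address      */
    #define HOST_DMA_DST   0x1008 /* hypothetical: DDR destination address       */
    #define HOST_DMA_LEN   0x1010 /* hypothetical: transfer length in bytes      */
    #define HOST_DMA_START 0x1018 /* hypothetical: write 1 to start the transfer */

    static void write_ddr(fpga_handle handle, uint64_t ddr_addr,
                          const void *data, uint64_t len)
    {
        void *buf = NULL;
        uint64_t wsid = 0, io_addr = 0;

        /* Pin a shared buffer that the DMA can read from host memory over PCIe. */
        assert(fpgaPrepareBuffer(handle, len, &buf, &wsid, 0) == FPGA_OK);
        assert(fpgaGetIOAddress(handle, wsid, &io_addr) == FPGA_OK);
        memcpy(buf, data, len);

        /* Program the (hypothetical) descriptor registers and start the transfer. */
        assert(fpgaWriteMMIO64(handle, 0, HOST_DMA_SRC, io_addr) == FPGA_OK);
        assert(fpgaWriteMMIO64(handle, 0, HOST_DMA_DST, ddr_addr) == FPGA_OK);
        assert(fpgaWriteMMIO64(handle, 0, HOST_DMA_LEN, len) == FPGA_OK);
        assert(fpgaWriteMMIO64(handle, 0, HOST_DMA_START, 1) == FPGA_OK);

        /* In real code, wait for completion (status poll or interrupt) before
         * releasing the buffer; completion handling is omitted here. */
        fpgaReleaseBuffer(handle, wsid);
    }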

Note that in board.qsys, a block for the Avalon® MM to AXI4 conversion is not explicitly instantiated. Instead, an Avalon® MM pipeline bridge connects to an AXI4 bridge, and Platform Designer infers a protocol adapter between these two bridges.

Note: Avalon® MM/AXI4 adapters in Platform Designer might not close timing.

By default, Platform Designer optimizes for area instead of fMAX, so you might need to change the interconnect settings for the inferred Avalon® MM/AXI4 adapter. For example, we made the changes shown in the following figure.

Figure 5. Adjusting the Interconnect Settings for the Inferred Avalon® MM/AXI4 Adapter to Optimize for fMAX Instead of Area
Note: This enables timing closure on the DDR clock.

To access the view in the above figure:

  • Within the Platform Designer GUI, choose View -> Domains. This brings up the Domains tab in the top-right window.
  • From there, choose an interface (for example, ddr_0_axi).
  • For the selected interface, you can adjust the interconnect parameters, as shown on the bottom-right pane.
  • In particular, we needed to change Burst adapter implementation from Generic converter (slower, lower area) to Per-burst-type converter (faster, higher area) to close timing on the DDR clock.

This was the only change needed to close timing; however, it took several rounds of experimentation to determine that this was the important setting. Depending on your system, you might need to adjust other settings.