Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 9/26/2022
Public

10.1.2. Using a Single Kernel to Describe Systolic Arrays

For an Intel® Stratix® 10 OpenCL design, Intel® recommends that you describe a systolic array as a single kernel, using a function for the processing element (PE), and a fully-unrolled loop or nested loop to represent the array.

Unoptimized multi-kernel systolic array pseudocode:

// data distribution network over an array of channels

channel int c[ROWS][COLS]; 
channel int d[ROWS][COLS]; 

__attribute__((num_compute_units(ROWS,COLS)))
kernel void PE() {
   // get data values from my neighbors 
   while(1){
      int x = read_channel_intel(c[ROWS-1][COLS]);
      int y = read_channel_intel(d[ROWS][COLS-1]);

      // some code that uses x and y 
      ... 
      // send the same data values to the next neighbors 
      write_channel_intel(c[ROWS][COLS], x); 
      write_channel_intel(d[ROWS][COLS], y); 
   }
}

Optimized single-kernel pseudocode:

kernel void allPEs() {
   while(1){
      int c[ROWS], d[COLS]; 
   
      #pragma unroll 
      for (int i = 0; i < ROWS; i++) {
         #pragma unroll 
         for (int j = 0; j < COLS; j++) {
            PE(c[i], d[j]); 
         } 
      }
   } 
}

Note: The PE body is now a function, PE(), rather than a kernel. Fully unrolling both loops instantiates one PE for each (i, j) iteration, which results in a 2D array of PEs, each of which uses its own portion of the FPGA.
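
The following is a minimal host-side C sketch of what the unrolled loop nest computes, not kernel code for the offline compiler. It assumes a hypothetical PE body that multiply-accumulates into a per-PE accumulator (the original pseudocode does not specify the PE body), and the names all_pes_step and acc, as well as the small ROWS and COLS values, are illustrative only. It shows that each of the ROWS x COLS PE instances receives the row value c[i] and the column value d[j].

```c
#define ROWS 2
#define COLS 3

/* Hypothetical PE body (an assumption): multiply-accumulate into this PE's accumulator. */
static void PE(int c, int d, int *acc) {
    *acc += c * d;
}

/* One pass over the array. On the FPGA, this loop nest would be fully
   unrolled, so every (i, j) iteration becomes its own PE instance. */
static void all_pes_step(const int c[ROWS], const int d[COLS], int acc[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            /* PE (i, j) sees the value shared by row i and the value shared by column j. */
            PE(c[i], d[j], &acc[i][j]);
        }
    }
}
```

Calling all_pes_step() once with c = {1, 2} and d = {3, 4, 5} on a zeroed accumulator array leaves acc[i][j] equal to c[i] * d[j], which is the fan-out pattern that every PE in a row or column shares.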
Depending on the size of the array, it can be challenging for the Intel® FPGA SDK for OpenCL™ Offline Compiler to generate hardware that distributes the same values of c and d to every PE in a row or column of the array within a single clock cycle, and attempting to do so might degrade fMAX. To remedy this problem, consider using the __fpga_reg() function to instruct the offline compiler to insert a register on c and d at each new PE. Intel® recommends that you use the __fpga_reg() function only when you know that the PEs are spatially separate from one another on the FPGA.
Note: The __fpga_reg() built-in function is an advanced feature. The offline compiler does not provide guidance on where you should insert the __fpga_reg() function calls. To help determine whether it is appropriate to insert the __fpga_reg() function call, you can experimentally quantify the impact additional registers might have on fMAX, and inspect the Intel® Quartus® Prime compilation reports.

Optimized pseudocode with the __fpga_reg() function:

kernel void allPEs() {
   int c[ROWS], d[COLS]; 
   
   while(1){
      #pragma unroll 
      for (int i = 0; i < ROWS; i++) {
         #pragma unroll 
         for (int j = 0; j < COLS; j++) {
            // compute and store outputs 
            PE(c[i], d[j]); 
            c[i] = __fpga_reg(c[i]); 
            d[j] = __fpga_reg(d[j]); 
         } 
      }
   } 
}

After the offline compiler unrolls the loop, there is one more register before every PE on both c and d, allowing the Intel® Quartus® Prime Pro Edition software to place the PEs apart. You may add more than one register by inserting multiple __fpga_reg() calls in your code. For example, the call __fpga_reg(__fpga_reg(x)) adds two registers on the data path. However, excessive __fpga_reg() calls in your kernel increase the design area, and the resulting congestion might degrade fMAX.
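
To make the register count concrete, the following host-side C sketch counts how many register stages each value crosses before it reaches a given PE when the pseudocode is followed literally. It is a model only: wire_t, fpga_reg_model, count_stages, and regs_per_hop are hypothetical names introduced for illustration, and the stage counter is a bookkeeping device that the real __fpga_reg() built-in does not have. The model treats __fpga_reg() as functionally returning its argument unchanged, and it says nothing about the placement or timing that the offline compiler and the Intel® Quartus® Prime software actually produce.

```c
#define ROWS 2
#define COLS 3

/* A value plus the number of modeled register stages it has crossed. */
typedef struct {
    int value;
    int stages;
} wire_t;

/* Hypothetical stand-in for __fpga_reg(): the value is unchanged,
   and the model records one additional register stage. */
static wire_t fpga_reg_model(wire_t w) {
    w.stages += 1;
    return w;
}

/* For each PE (i, j), record how many stages c[i] and d[j] crossed before
   arriving. regs_per_hop models chaining that many __fpga_reg() calls. */
static void count_stages(int regs_per_hop,
                         int c_stages[ROWS][COLS],
                         int d_stages[ROWS][COLS]) {
    wire_t c[ROWS], d[COLS];
    for (int i = 0; i < ROWS; i++) c[i] = (wire_t){ i, 0 };
    for (int j = 0; j < COLS; j++) d[j] = (wire_t){ j, 0 };

    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            /* This is the point where PE(c[i], d[j]) would be called. */
            c_stages[i][j] = c[i].stages;
            d_stages[i][j] = d[j].stages;
            for (int r = 0; r < regs_per_hop; r++) {
                c[i] = fpga_reg_model(c[i]);
                d[j] = fpga_reg_model(d[j]);
            }
        }
    }
}
```

In this model, with regs_per_hop set to 1, c[i] reaches PE (i, j) after j stages and d[j] after i stages. Setting regs_per_hop to 2, which corresponds to __fpga_reg(__fpga_reg(x)), doubles both counts, illustrating why chained calls add registers, and therefore area, at every PE.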