Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 9/26/2022
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

7.4. Combination of Compute Unit Replication and Kernel SIMD Vectorization

If your replicated or vectorized OpenCL kernel does not fit in the FPGA, you can modify the kernel by both replicating the compute unit and vectorizing the kernel. Include the num_compute_units attribute to modify the number of compute units for the kernel, and include the num_simd_work_items attribute to take advantage of kernel vectorization.

Consider a case where a kernel with a num_simd_work_items attribute set to 16 does not fit in the FPGA. The kernel might fit if you modify it by duplicating a narrower SIMD kernel compute unit. Determining the optimal balance between the number of compute units and the SIMD width might require some experimentation. For example, duplicating a four lane-wide SIMD kernel compute unit three times might achieve better throughput than duplicating an eight lane-wide SIMD kernel compute unit twice.

The following example code shows how you can combine the num_compute_units and num_simd_work_items attributes in your OpenCL™ code:

__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(3)))
__attribute__((reqd_work_group_size(8,8,1)))
__kernel void matrixMult(__global float * restrict C,
                         __global float * restrict A,
. . .

The figure below illustrates the data flow of the kernel described above. The num_compute_units implements three replicated compute units. The num_simd_work_items implements four SIMD vector lanes.

Figure 78. Optimizing Throughput by Combining Compute Unit Replication and Kernel SIMD Vectorization


Attention: You can also enable the resource-driven optimizer to determine automatically the best combination of num_compute_units and num_simd_work_items.
Important: It is more time-consuming to compile a hardware design that fills the entire FPGA than smaller designs. When you adjust your kernel optimizations, remove the increased number of SIMD vector lanes and compute units prior to recompiling the kernel.