Intel® FPGA SDK for OpenCL™ Pro Edition: Programming Guide

ID 683846
Date 10/04/2021
Public

11.4.1. Customization of Replicated Kernels Using the get_compute_id() Function

To create compute units that differ slightly from one another but share most of their code, call the get_compute_id() intrinsic function in a kernel that also uses the num_compute_units(X,Y,Z) attribute.
Attention: You can only use the get_compute_id() function in a kernel that also uses the autorun and max_global_work_dim(0) kernel attributes.

Retrieving compute IDs is a convenient alternative to replicating your kernel in source code and then adding specialized code to each kernel copy. When a kernel uses the num_compute_units(X,Y,Z) attribute and calls the get_compute_id() function, the Intel® FPGA SDK for OpenCL™ Offline Compiler assigns a unique compute ID to each compute unit. The get_compute_id() function then retrieves these unique compute IDs. You can use the compute ID to specify how the associated compute unit should behave differently from the other compute units that are derived from the same kernel source code. For example, you can use the return value of get_compute_id() to index into an array of channels to specify which channel each compute unit should read from or write to.
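The channel-indexing idea can be sketched with a small software model. The following is plain C, not OpenCL kernel code: the compute_id parameter stands in for the value that get_compute_id(0) would return, and the chain_t type, the NUM_CU constant, and the function names are hypothetical names introduced only for this illustration. Each modeled compute unit reads from link[id] and writes to link[id + 1], and adds its own ID to the value to show per-unit specialization.

```c
#include <assert.h>

/* Plain-C software model; this is NOT OpenCL kernel code. */
#define NUM_CU 4 /* hypothetical: stands in for num_compute_units(4) */

/* Hypothetical stand-in for an array of channels: link i feeds unit i. */
typedef struct {
    int link[NUM_CU + 1];
} chain_t;

/* Models one compute unit. The compute_id parameter stands in for the
 * value that get_compute_id(0) would return inside the kernel. */
static void compute_unit(chain_t *chain, int compute_id)
{
    assert(compute_id >= 0 && compute_id < NUM_CU);
    int value = chain->link[compute_id];              /* read from link[id]  */
    chain->link[compute_id + 1] = value + compute_id; /* write to link[id+1] */
}

/* Pushes one value through the whole chain. Real compute units run
 * concurrently; this model simply steps them in order. */
static int run_chain(int input)
{
    chain_t chain = { { 0 } };
    chain.link[0] = input;
    for (int id = 0; id < NUM_CU; ++id) {
        compute_unit(&chain, id);
    }
    return chain.link[NUM_CU];
}
```

In this model, every unit runs the same source function, and only the ID selects which links it touches and what it adds. That mirrors the way a single kernel definition can serve all compute units.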

The num_compute_units attribute accepts up to three arguments (that is, num_compute_units(X,Y,Z)). In conjunction with the get_compute_id() function, this attribute allows you to create one-dimensional, two-dimensional, and three-dimensional logical arrays of compute units. An example use case of a 1D array of compute units is a linear pipeline of kernels (also called a daisy chain of kernels). An example use case of a 2D array of compute units is a systolic array of kernels.

Figure 37. Schematic Diagram of a 4x4 Array of Compute Units

Specifying num_compute_units(4,4) in a single work-item kernel, as in the following example code, results in a 4x4 array that consists of 4 x 4 = 16 compute units.
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(4,4)))
__kernel void PE() {

   // Coordinates of this compute unit in the 4x4 logical array
   int row = get_compute_id(0);
   int col = get_compute_id(1);

   …
}
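One hypothetical way a processing element might use its row and col values is to decide whether it sits on an edge of the array. The sketch below is a plain-C model, not OpenCL kernel code, and it does not reproduce the elided kernel body above. The pe_role_t type, its fields, and the feeder and forwarding roles are assumptions made for illustration. The row and col arguments stand in for get_compute_id(0) and get_compute_id(1).

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-C software model; this is NOT OpenCL kernel code. */
#define ROWS 4 /* matches the first argument of num_compute_units(4,4)  */
#define COLS 4 /* matches the second argument of num_compute_units(4,4) */

/* Hypothetical description of what one processing element (PE) does. */
typedef struct {
    bool reads_top_feeder;  /* row 0 takes input from outside the array    */
    bool reads_left_feeder; /* column 0 takes input from outside the array */
    bool forwards_down;     /* every row except the last passes data down  */
    bool forwards_right;    /* every column except the last passes right   */
} pe_role_t;

/* row and col stand in for get_compute_id(0) and get_compute_id(1). */
static pe_role_t pe_role(int row, int col)
{
    assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
    pe_role_t role;
    role.reads_top_feeder  = (row == 0);
    role.reads_left_feeder = (col == 0);
    role.forwards_down     = (row < ROWS - 1);
    role.forwards_right    = (col < COLS - 1);
    return role;
}

/* Counts the PEs that read from at least one external feeder. */
static int count_feeder_pes(void)
{
    int count = 0;
    for (int row = 0; row < ROWS; ++row) {
        for (int col = 0; col < COLS; ++col) {
            pe_role_t role = pe_role(row, col);
            if (role.reads_top_feeder || role.reads_left_feeder) {
                ++count;
            }
        }
    }
    return count;
}
```

Because each role depends only on the coordinates, the same comparisons written inside a kernel would be resolved separately for each compute unit.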

For a 3D array of compute units, you can retrieve the X, Y, and Z coordinates of a compute unit in the logical compute unit array using get_compute_id(0), get_compute_id(1), and get_compute_id(2), respectively. In this respect, the API closely resembles that of the work-item intrinsic functions (that is, get_global_id(), get_local_id(), and get_group_id()).
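To make the three-coordinate addressing concrete, the following plain-C model enumerates a logical 2 x 3 x 4 array and flattens each (x, y, z) triple into a single index. This is a sketch under stated assumptions, not OpenCL kernel code: the dimensions, the flat_index() helper, and the X-fastest flattening order are all hypothetical choices made for illustration, and they say nothing about how the offline compiler numbers or places compute units. The x, y, and z arguments stand in for get_compute_id(0), get_compute_id(1), and get_compute_id(2).

```c
#include <assert.h>

/* Plain-C software model; this is NOT OpenCL kernel code. */
#define DIM_X 2 /* hypothetical: stands in for num_compute_units(2,3,4) */
#define DIM_Y 3
#define DIM_Z 4

/* Flattens a coordinate triple into one index, with X varying fastest.
 * The ordering is an illustrative assumption, not documented behavior. */
static int flat_index(int x, int y, int z)
{
    assert(x >= 0 && x < DIM_X);
    assert(y >= 0 && y < DIM_Y);
    assert(z >= 0 && z < DIM_Z);
    return x + DIM_X * (y + DIM_Y * z);
}

/* Returns 1 if every coordinate triple maps to a distinct flat index. */
static int all_ids_unique(void)
{
    int seen[DIM_X * DIM_Y * DIM_Z] = { 0 };
    for (int z = 0; z < DIM_Z; ++z) {
        for (int y = 0; y < DIM_Y; ++y) {
            for (int x = 0; x < DIM_X; ++x) {
                int idx = flat_index(x, y, z);
                if (seen[idx]++) {
                    return 0; /* index already used by another triple */
                }
            }
        }
    }
    return 1;
}
```

A flattening like this can be useful when a multidimensional array of compute units needs to index a one-dimensional resource, such as a single array of channels.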

Global IDs, local IDs, and group IDs can vary at runtime based on how the host invokes the kernel. However, compute IDs are known at compilation time, allowing the offline compiler to generate optimized hardware for each compute unit.