Visible to Intel only — GUID: osy1517854537002
Ixiasoft
Visible to Intel only — GUID: osy1517854537002
Ixiasoft
6.3. Multiple Compute Units
To increase overall kernel throughput, the hardware scheduler in the FPGA dispatches work-groups to additional available compute units. A compute unit is available for work-group assignments as long as it has not reached its full capacity.
Assume each work-group takes the same amount of time to complete its execution. If the offline compiler implements two compute units, each compute unit executes half of the work-groups. Because the hardware scheduler dispatches the work-groups, you do not need to manage this process in your own code.
The offline compiler does not automatically determine the optimal number of compute units for a kernel. To increase the number of compute units for your kernel implementation, you must specify the number of compute units that the offline compiler should create using the num_compute_units attribute, as shown in the code sample below.
__attribute__((num_compute_units(2)))
__kernel void sum (__global const float * restrict a,
__global const float * restrict b,
__global float * restrict answer)
{
size_t gid = get_global_id(0);
answer[gid] = a[gid] + b[gid];
}
Increasing the number of compute units achieves higher throughput. However, as shown in the figure below, you do so at the expense of increasing global memory bandwidth among the compute units. You also increase hardware resource utilization.