6.4. Combination of Compute Unit Replication and Kernel SIMD Vectorization
Consider a kernel whose num_simd_work_items attribute is set to 16 and that does not fit in the FPGA. The kernel might fit if you instead replicate a compute unit that has a narrower SIMD width. Finding the best balance between the number of compute units and the SIMD width might require some experimentation. For example, three copies of a four-lane SIMD compute unit (12 lanes in total) might achieve better throughput than two copies of an eight-lane SIMD compute unit (16 lanes in total).
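One way to reason about candidate configurations before compiling them is to compare their nominal lane counts and clock rates. The following is a minimal sketch in plain C, not part of the SDK. The helper names (nominal_lanes, peak_mwi_per_s) are hypothetical, and the model assumes that each lane accepts one work-item per clock cycle with no stalls, which real kernels rarely achieve. The clock frequency is an input that you would take from your own compilation results.

```c
/* Hypothetical planning helpers; none of these names come from the SDK. */

/* Total SIMD lanes across all compute units:
 * num_simd_work_items * num_compute_units. */
int nominal_lanes(int simd_width, int compute_units)
{
    return simd_width * compute_units;
}

/* Idealized peak rate in millions of work-items per second.
 * ASSUMPTION: one work-item per lane per clock cycle and no memory
 * stalls. fmax_mhz is the clock frequency measured for that specific
 * configuration; it is supplied by the caller, not estimated here. */
double peak_mwi_per_s(int simd_width, int compute_units, double fmax_mhz)
{
    return (double)nominal_lanes(simd_width, compute_units) * fmax_mhz;
}
```

Under this model, a 4-lane, 3-unit configuration has 12 nominal lanes and an 8-lane, 2-unit configuration has 16, so the narrower design comes out ahead only when it reaches a sufficiently higher clock rate or stalls less. The model ignores memory bandwidth and resource usage, which is why experimentation on the actual hardware remains necessary.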
The following example code shows how you can combine the num_compute_units and num_simd_work_items attributes in your OpenCL™ code:
__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(3)))
__attribute__((reqd_work_group_size(8,8,1)))
__kernel void matrixMult(__global float * restrict C,
                         __global float * restrict A,
. . .
The figure below illustrates the data flow of the kernel described above. The num_compute_units attribute creates three replicated compute units, and the num_simd_work_items attribute gives each compute unit four SIMD vector lanes.
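The arithmetic behind this configuration can be sketched in plain C as follows. This is an illustration under stated assumptions, not a description of the hardware dispatcher: the helper names are hypothetical, the global size is assumed to be a whole multiple of the work-group size, and the round-robin split of work-groups across compute units is a simplification chosen only to make the counts concrete.

```c
/* Values taken from the kernel attributes in the example above. */
enum {
    SIMD_LANES    = 4,  /* num_simd_work_items(4) */
    COMPUTE_UNITS = 3,  /* num_compute_units(3)   */
    WG_X = 8, WG_Y = 8, WG_Z = 1  /* reqd_work_group_size(8,8,1) */
};

/* Work-items in one work-group: 8 * 8 * 1. */
int work_items_per_group(void)
{
    return WG_X * WG_Y * WG_Z;
}

/* Issue slots needed to feed one work-group into a compute unit when
 * SIMD_LANES work-items enter together. The division is exact here
 * because the work-group size is a multiple of the SIMD width. */
int simd_issues_per_group(void)
{
    return work_items_per_group() / SIMD_LANES;
}

/* Number of work-groups in a 2D NDRange.
 * ASSUMPTION: global_x and global_y are multiples of WG_X and WG_Y. */
long work_groups_total(long global_x, long global_y)
{
    return (global_x / WG_X) * (global_y / WG_Y);
}

/* Work-groups handled by one compute unit (0-based index).
 * ASSUMPTION: a simple round-robin split, used for illustration only. */
long groups_on_unit(long total_groups, int unit)
{
    long base  = total_groups / COMPUTE_UNITS;
    long extra = total_groups % COMPUTE_UNITS;
    return base + (unit < extra ? 1 : 0);
}
```

For a hypothetical 64 x 64 global size (the example above does not state the matrix dimensions), work_groups_total(64, 64) yields 64 work-groups of 64 work-items each. Each work-group then needs 16 four-wide issue slots, and the three compute units share the work-groups between them.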