Work-Group Level Parallelism
Since work-groups are independent, they can execute concurrently on different hardware threads. The number of work-groups should therefore be no less than the number of logical cores. A larger number of work-groups gives the scheduler more flexibility, at the cost of task-switching overhead.
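The guideline above can be sketched as a small host-side helper. This is only an illustration, not part of the OpenCL API: `pick_local_size` is a hypothetical name, and the sketch assumes the caller has already obtained the compute unit count (for example via `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`) and the maximum work-group size (`CL_DEVICE_MAX_WORK_GROUP_SIZE`). It picks the largest local size that divides the global size evenly and still leaves at least one work-group per compute unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API call).
 * Returns the largest local size that
 *   - does not exceed max_local,
 *   - divides global_size evenly, and
 *   - yields at least compute_units work-groups.
 * Falls back to 1 when no larger size satisfies all three conditions. */
size_t pick_local_size(size_t global_size, size_t max_local,
                       size_t compute_units)
{
    assert(global_size > 0 && max_local > 0 && compute_units > 0);

    for (size_t local = max_local; local > 1; --local) {
        if (global_size % local != 0)
            continue;                      /* must divide evenly */
        if (global_size / local >= compute_units)
            return local;                  /* enough work-groups */
    }
    return 1;                              /* one work-item per group */
}
```

For instance, with a global size of 1024, a maximum local size of 256, and 8 compute units, the helper returns 128, which gives 8 work-groups. Preferring the largest qualifying size keeps the work-group count, and with it the switching overhead, as low as the constraint allows.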
Note that multiple cores of a CPU, as well as multiple CPUs in a multi-socket machine, constitute a single OpenCL device. The individual cores are its compute units. The Device Fission extension enables you to control compute unit utilization within a compute device. For more information on Device Fission, see the Intel® Code Builder for OpenCL™ API - User Manual.
For the best performance and parallelism between work-groups, ensure that execution of a work-group takes at least 100,000 clocks. A smaller value increases the proportion of switching overhead compared to actual work.
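As a rough illustration of the 100,000-clock guideline, the sketch below estimates how many work-items a work-group needs. It rests on assumptions not stated in this guide: that you have a measured or estimated per-work-item cost in clocks, and that a work-group's cost is roughly the number of work-items times that cost (vectorization and other effects are ignored). The names `MIN_CLOCKS_PER_GROUP` and `min_items_per_group` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold taken from the guideline above. */
#define MIN_CLOCKS_PER_GROUP 100000u

/* Hypothetical helper: smallest number of work-items per work-group
 * such that items * clocks_per_item >= MIN_CLOCKS_PER_GROUP.
 * Assumes work-group cost scales linearly with the work-item count. */
size_t min_items_per_group(size_t clocks_per_item)
{
    assert(clocks_per_item > 0);
    /* Ceiling division without floating point. */
    return (MIN_CLOCKS_PER_GROUP + clocks_per_item - 1) / clocks_per_item;
}
```

Under these assumptions, a kernel that costs about 500 clocks per work-item needs at least 200 work-items per work-group. If the result exceeds the device's maximum work-group size, making each work-item do more work is the remaining option.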