Writing Kernels to Directly Target the Intel® Architecture Processors
Using the OpenCL™ vector data types is a straightforward way to directly utilize the Intel® Architecture vector instruction set. See the Using Vector Data Types section for more information. Consider the following code snippet:
float4 a, b;
float4 c = a + b;
After compilation, it resembles the following C snippet written with intrinsics:
__m128 a, b;
__m128 c = _mm_add_ps(a, b);
Or in assembly:
movaps xmm0, [a]
addps  xmm0, [b]
movaps [c], xmm0
However, in contrast to the intrinsics code, an OpenCL kernel that uses the float4 data type transparently benefits from Intel® Advanced Vector Extensions (Intel® AVX) if the compiler promotes float4 to float8. The vectorization module can also pack work items automatically, though this might sometimes be less efficient than manual packing.
If the native vector size that your kernel requires is less than 128 bits and you want to benefit from explicit vectorization, consider packing work items together manually.
For example, suppose your kernel uses the float2 vector type: it receives (x, y) float coordinates and shifts them by (dx, dy):
__kernel void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    coords[tid] += deltas[tid];
}
To increase the kernel performance, you can manually pack pairs of work items:
//Assuming the target is an Intel® AVX enabled CPU
__kernel __attribute__((vec_type_hint(float8)))
void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    // Each work item processes four consecutive float2 elements, so the
    // loads use the same 4 * tid base that the vstore8 below writes to.
    float8 my_coords = (float8)(coords[4 * tid],     coords[4 * tid + 1],
                                coords[4 * tid + 2], coords[4 * tid + 3]);
    float8 my_deltas = (float8)(deltas[4 * tid],     deltas[4 * tid + 1],
                                deltas[4 * tid + 2], deltas[4 * tid + 3]);
    my_coords += my_deltas;
    vstore8(my_coords, tid, (__global float*)coords);
}
Every work item in this kernel does four times as much work as a work item in the previous kernel. Consequently, only one quarter as many work-item invocations are required, which reduces run-time overhead. However, when you use manual packing, you must also change the host code accordingly, as shown in the sketch below.
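The following host-side sketch illustrates that change under the assumption that the number of float2 elements is a multiple of four. The function and variable names (enqueue_shift_by, num_coords, and so on) are hypothetical; the point is only that the global work size shrinks from num_coords to num_coords / 4:

// Hypothetical host-side sketch: enqueue one quarter of the work items,
// because each packed work item now processes four float2 elements.
#include <CL/cl.h>

cl_int enqueue_shift_by(cl_command_queue queue, cl_kernel kernel,
                        cl_mem coords_buf, cl_mem deltas_buf,
                        size_t num_coords /* assumed to be a multiple of 4 */)
{
    cl_int err;
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &coords_buf);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &deltas_buf);
    if (err != CL_SUCCESS) return err;

    // Original kernel: global size == num_coords.
    // Packed kernel:   global size == num_coords / 4.
    size_t global_size = num_coords / 4;
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, NULL, 0, NULL, NULL);
}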
Use explicit vectorization with vectors of 32-bit data types, for example, int4, int8, float4, and float8, to improve performance. Other data types (for example, char3) may cause a behind-the-scenes upcast of the input data, which has a negative impact on performance.
For best performance for a given data type, the vector width should match the underlying SIMD width. This value differs between architectures. For example, consider querying the recommended vector width using clGetDeviceInfo with the CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT parameter. You get a vector width of four for the 2nd Generation Intel® Core™ processors, but a vector width of eight for later processors. Use int8 so that the vector width fits both architectures. Similarly, for floating-point data types, you can use float8 data to cover many potential architectures.
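As a minimal sketch of such a query, assuming a cl_device_id has already been obtained (the helper name print_preferred_int_width is hypothetical):

// Query the recommended vector width for int on the selected device.
#include <CL/cl.h>
#include <stdio.h>

void print_preferred_int_width(cl_device_id device)
{
    cl_uint width = 0;
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                                 sizeof(width), &width, NULL);
    if (err == CL_SUCCESS)
        printf("Preferred int vector width: %u\n", width);
}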
You can target a specific Intel Architecture processor using conditional code with an OpenCL™ C predefined macro of the form __INTEL_OPENCL_CPU_<CPUSIGN>__.
The macro lets you tune the kernel for a specific CPU device microarchitecture, where <CPUSIGN> is the CPU signature of the device.
You can specify one of the following values for this macro:
- __INTEL_OPENCL_CPU_SKL__ - Intel® microarchitecture code name Skylake
- __INTEL_OPENCL_CPU_SKX__ - Intel® microarchitecture code name Skylake on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_BDW__ - Intel® microarchitecture code name Broadwell
- __INTEL_OPENCL_CPU_BDW_XEON__ - Intel® microarchitecture code name Broadwell on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_HSW__ - Intel® microarchitecture code name Haswell
- __INTEL_OPENCL_CPU_HSW_XEON__ - Intel® microarchitecture code name Haswell on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_IVB__ - Intel® microarchitecture code name Ivy Bridge
- __INTEL_OPENCL_CPU_IVB_XEON__ - Intel® microarchitecture code name Ivy Bridge on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_SNB__ - Intel® microarchitecture code name Sandy Bridge
- __INTEL_OPENCL_CPU_SNB_XEON__ - Intel® microarchitecture code name Sandy Bridge on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_WST__ - Intel® microarchitecture code name Westmere
- __INTEL_OPENCL_CPU_WST_XEON__ - Intel® microarchitecture code name Westmere on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_UNKNOWN__ - Unknown microarchitecture
To tune performance for your target CPU, you can use this macro with the intel_vec_len_hint extension. For example:
// Kernel side.
// Force the vectorization width to 8 on BDW.
// The runtime defines a macro corresponding to the device CPU signature.
#ifdef __INTEL_OPENCL_CPU_BDW__
__attribute__((intel_vec_len_hint(8)))
#endif //BDW
__kernel void memcpy1(__global float* src, __global float* dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}
For more information about the intel_vec_len_hint attribute extension, refer to Vectorizer Knobs.
See Also
OpenCL 1.2 Specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf