Writing Kernels to Directly Target the Intel® Architecture Processors
Using the OpenCL™ vector data types is a straightforward way to directly utilize the Intel® Architecture vector instruction set. See the Using Vector Data Types section for more information. Consider the following code snippet:
float4 a, b;
float4 c = a + b;
After compilation, it resembles the following C snippet written with intrinsics:
__m128 a, b;
__m128 c = _mm_add_ps(a, b);
Or in assembly:
movaps xmm0, [a]
addps  xmm0, [b]
movaps [c], xmm0
However, in contrast to the intrinsics code, an OpenCL kernel that uses the float4 data type transparently benefits from Intel® Advanced Vector Extensions (Intel® AVX) if the compiler promotes float4 to float8. The vectorization module can also pack work items automatically, though this might sometimes be less efficient than manual packing.
If the native vector size that your kernel requires is less than 128 bits and you want to benefit from explicit vectorization, consider packing work items together manually.
For example, suppose your kernel uses the float2 vector type: it receives (x, y) float coordinates and shifts them by (dx, dy):
__kernel void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    coords[tid] += deltas[tid];
}
To increase the kernel performance, you can manually pack pairs of work items:
//Assuming the target is an Intel® AVX enabled CPU
__kernel __attribute__((vec_type_hint(float8)))
void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    // Each work item processes four consecutive float2 elements, so the
    // loads use the same 4 * tid base that the vstore8 below writes to.
    float8 my_coords = (float8)(coords[4 * tid],     coords[4 * tid + 1],
                                coords[4 * tid + 2], coords[4 * tid + 3]);
    float8 my_deltas = (float8)(deltas[4 * tid],     deltas[4 * tid + 1],
                                deltas[4 * tid + 2], deltas[4 * tid + 3]);
    my_coords += my_deltas;
    vstore8(my_coords, tid, (__global float*)coords);
}
Every work item in this kernel does four times as much work as a work item in the previous kernel. Consequently, only one quarter as many work-item invocations are required, which reduces run-time overhead. However, when you use manual packing, you must also change the host code accordingly, as shown in the sketch below.
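The following host-side sketch illustrates that change under the assumption that the number of float2 elements is a multiple of four. The function and variable names (enqueue_shift_by, num_coords, and so on) are hypothetical; the point is only that the global work size shrinks from num_coords to num_coords / 4:

// Hypothetical host-side sketch: enqueue one quarter of the work items,
// because each packed work item now processes four float2 elements.
#include <CL/cl.h>

cl_int enqueue_shift_by(cl_command_queue queue, cl_kernel kernel,
                        cl_mem coords_buf, cl_mem deltas_buf,
                        size_t num_coords /* assumed to be a multiple of 4 */)
{
    cl_int err;
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &coords_buf);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &deltas_buf);
    if (err != CL_SUCCESS) return err;

    // Original kernel: global size == num_coords.
    // Packed kernel:   global size == num_coords / 4.
    size_t global_size = num_coords / 4;
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, NULL, 0, NULL, NULL);
}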
Use explicit vectorization with vectors of 32-bit data types, for example, int4, int8, float4, and float8, to improve performance. Other data types (for example, char3) may cause a behind-the-scenes upcast of the input data, which has a negative impact on performance.
For best performance for a given data type, the vector width should match the underlying SIMD width. This value differs between architectures. For example, consider querying the recommended vector width using clGetDeviceInfo with the CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT parameter. You get a vector width of four for the 2nd Generation Intel® Core™ processors, but a vector width of eight for later processors. Use int8 so that the vector width fits both architectures. Similarly, for floating-point data types, you can use float8 data to cover many potential architectures.
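As a minimal sketch of such a query, assuming a cl_device_id has already been obtained (the helper name print_preferred_int_width is hypothetical):

// Query the recommended vector width for int on the selected device.
#include <CL/cl.h>
#include <stdio.h>

void print_preferred_int_width(cl_device_id device)
{
    cl_uint width = 0;
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                                 sizeof(width), &width, NULL);
    if (err == CL_SUCCESS)
        printf("Preferred int vector width: %u\n", width);
}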
You can target a specific Intel Architecture processor using conditional code with an OpenCL™ C predefined macro of the form __INTEL_OPENCL_CPU_<CPUSIGN>__.
The macro lets you tune the kernel for a specific CPU device microarchitecture, where <CPUSIGN> is the CPU signature of the device.
You can specify one of the following values for this macro:
- __INTEL_OPENCL_CPU_SKL__ - Intel® microarchitecture code name Skylake
- __INTEL_OPENCL_CPU_SKX__ - Intel® microarchitecture code name Skylake on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_BDW__ - Intel® microarchitecture code name Broadwell
- __INTEL_OPENCL_CPU_BDW_XEON__ - Intel® microarchitecture code name Broadwell on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_HSW__ - Intel® microarchitecture code name Haswell
- __INTEL_OPENCL_CPU_HSW_XEON__ - Intel® microarchitecture code name Haswell on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_IVB__ - Intel® microarchitecture code name Ivy Bridge
- __INTEL_OPENCL_CPU_IVB_XEON__ - Intel® microarchitecture code name Ivy Bridge on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_SNB__ - Intel® microarchitecture code name Sandy Bridge
- __INTEL_OPENCL_CPU_SNB_XEON__ - Intel® microarchitecture code name Sandy Bridge on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_WST__ - Intel® microarchitecture code name Westmere
- __INTEL_OPENCL_CPU_WST_XEON__ - Intel® microarchitecture code name Westmere on Intel Xeon® processor family
- __INTEL_OPENCL_CPU_UNKNOWN__ - Unknown microarchitecture
To tune performance for your target CPU, you can use this macro with the intel_vec_len_hint extension. For example:
// Kernel side.
// Force the vectorization width to 8 on BDW.
// The runtime defines a macro corresponding to the device CPU signature.
#ifdef __INTEL_OPENCL_CPU_BDW__
__attribute__((intel_vec_len_hint(8)))
#endif //BDW
__kernel void memcpy1(__global float* src, __global float* dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}
For more information about the intel_vec_len_hint attribute extension, refer to Vectorizer Knobs.
See Also
OpenCL 1.2 Specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf