Benefitting from Implicit Vectorization
Intel® SDK for OpenCL™ Applications includes an implicit vectorization module as part of the program build process. When doing so is beneficial for performance, this module packs several work-items together and executes them with SIMD instructions. This enables you to benefit from the vector units in the Intel® Architecture processors without writing explicit vector code.
The vectorization module transforms scalar data type operations performed by adjacent work-items into equivalent vector operations. When vector operations already exist in the kernel source code, the module scalarizes them (breaks them down into component operations) and revectorizes them. This improves performance by transforming the memory access pattern of the kernel into a structure of arrays (SOA), which is often more cache-friendly than an array of structures (AOS).
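As a rough illustration of the two layouts (the type names, array names, and element count below are hypothetical, not part of the SDK), compare:

    #define N 1024

    /* Array of structures (AOS): all fields of one element are adjacent in memory:
       x0 y0 z0 w0 x1 y1 z1 w1 ... */
    typedef struct { float x, y, z, w; } point_aos_t;
    point_aos_t points_aos[N];

    /* Structure of arrays (SOA): the same field of consecutive elements is adjacent:
       x0 x1 x2 ... y0 y1 y2 ... so work-items packed into SIMD lanes load contiguous memory. */
    typedef struct { float x[N], y[N], z[N], w[N]; } point_soa_t;
    point_soa_t points_soa;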
You can find more details in the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and OpenCL™ Autovectorization in Intel SDK for OpenCL Applications version 1.5.
The implicit vectorization module works best for kernels that operate on elements of four-byte width, such as the float or int data types. You can define the computational width of a kernel using the OpenCL vec_type_hint attribute.
Since the default computation width is four bytes, kernels are vectorized by default. If your kernel mainly operates on a specific vector type, you can specify __attribute__((vec_type_hint(<typen>))), where typen is any vector type (for example, float3 or char4). This attribute indicates to the vectorization module that it should apply only the transformations that are useful for this type.
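For example, a kernel that already computes on float4 elements can declare that width explicitly, as in the following sketch (the kernel and argument names are illustrative only):

    __kernel __attribute__((vec_type_hint(float4)))
    void scale4(__global float4* data, float factor)
    {
        size_t gid = get_global_id(0);
        /* The kernel already operates on float4, so the hint tells the
           vectorization module to apply only transformations useful for that type. */
        data[gid] = data[gid] * factor;
    }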
The performance benefit from the vectorization module might be lower for kernels that include complex control flow.
To benefit from vectorization, you do not need for loops within kernels. For best results, let the kernel deal with a single data element and let the vectorization module take care of the rest. The more straightforward your OpenCL™ code is, the more it gains from vectorization.
Writing the kernel in plain scalar code works best for efficient vectorization. This coding approach avoids many of the disadvantages potentially associated with explicit (manual) vectorization, described in the Using Vector Data Types section.
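As a sketch of this style, the following kernel processes one element per work-item in plain scalar code, with no for loop, leaving the packing of adjacent work-items into SIMD lanes to the vectorization module (the kernel and argument names are illustrative only):

    __kernel void add_scalar(__global const float* a,
                             __global const float* b,
                             __global float* result)
    {
        size_t gid = get_global_id(0);   /* one work-item handles one element */
        result[gid] = a[gid] + b[gid];   /* plain scalar float operation */
    }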