Vectorization: SIMD Processing Within a Work Group
Intel® SDK for OpenCL™ Applications includes an automatic vectorization module as part of the OpenCL program build process. Depending on the kernel code, this operation might have some limitations. When it is beneficial performance-wise, the module automatically packs adjacent work-items (from dimension zero of the ND-range) and executes them with SIMD instructions.
When using SIMD instructions, vector registers store a group of data elements of the same data type, such as float or int. The number of data elements that fit in one register depends on the data type width. For example, the Intel® Xeon® processor (formerly code-named Skylake) offers a vector register width of 512 bits. Each vector register (zmm) can store sixteen float values (or, alternatively, eight double values) or sixteen 32-bit integers, and these are the most natural data types to work with on the Intel Xeon processor. Smaller data types are also processed 16 elements at a time, with some conversions.
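The register-capacity arithmetic above can be sketched in plain C. This is only an illustration: the constant VECTOR_REGISTER_BITS and the helper lanes_per_register are hypothetical names, not part of the SDK or of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: one 512-bit (zmm) vector register. */
enum { VECTOR_REGISTER_BITS = 512 };

/* Hypothetical helper: how many elements of elem_bytes bytes fit in one
 * vector register. This is raw capacity only. */
size_t lanes_per_register(size_t elem_bytes)
{
    assert(elem_bytes > 0 && (VECTOR_REGISTER_BITS / 8) % elem_bytes == 0);
    return (VECTOR_REGISTER_BITS / 8) / elem_bytes;
}
```

With 4-byte float or int elements the helper returns 16, and with 8-byte double elements it returns 8. For a 2-byte type it would return 32, but that is raw capacity: as stated above, smaller data types are still processed 16 elements at a time.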
A work group is the finest granularity for thread-level parallelism: different threads pick up different work groups. Thus, the amount of calculation per work group, together with the right work-group size and the resulting number of work groups available for parallel execution, is critical to achieving good scalability on the Intel Xeon processor.
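To see how the work-group size affects the number of work groups available to threads, consider the following rough C sketch. Both helper names are hypothetical, and the even distribution of work groups across threads is a simplifying assumption made for illustration, not a description of how the runtime actually schedules work.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work groups along one dimension. Assumes the global size is
 * a multiple of the local (work-group) size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Most work groups any one thread handles if groups were spread evenly
 * (a simplified, hypothetical scheduling model). */
size_t max_groups_per_thread(size_t num_groups, size_t num_threads)
{
    assert(num_threads > 0);
    return (num_groups + num_threads - 1) / num_threads;
}
```

Under these assumptions, a global size of 1024 with a work-group size of 256 yields only 4 work groups, so on 16 threads most threads would have nothing to pick up. A work-group size of 16 yields 64 work groups, or 4 per thread, at the cost of less work per group.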
The vectorization module enables you to benefit from vector units without writing explicit vector code. Also, you do not need to write for loops within kernels to benefit from vectorization. For better results, process a single data element in the kernel and let the vectorization module take care of the rest. To get more performance gains from vectorization, make your OpenCL code as simple as possible.
The vectorization module works best for kernels that operate on elements of float (double) or int data types. The performance benefit might be lower for kernels that include complicated control flow.
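One plausible reason control flow reduces the benefit is sketched below in plain C. It assumes the packed code evaluates both sides of a branch for every lane and then selects a result per lane, which is a common vectorization technique. Whether the vectorization module does exactly this for a given kernel is an assumption here. The names SIMD_WIDTH, branchy_scalar, and branchy_packed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { SIMD_WIDTH = 16 };  /* assumed: sixteen 32-bit lanes */

/* Scalar form: each work-item executes only one side of the branch. */
float branchy_scalar(float x)
{
    if (x > 0.0f)
        return x * 2.0f;
    return x - 1.0f;
}

/* Packed form (emulated with loops): both sides are computed for all
 * lanes, then a per-lane select keeps the right one. */
void branchy_packed(const float *in, float *out)
{
    float then_val[SIMD_WIDTH];
    float else_val[SIMD_WIDTH];

    assert(in != NULL && out != NULL);
    for (size_t lane = 0; lane < SIMD_WIDTH; lane++) {
        then_val[lane] = in[lane] * 2.0f;
        else_val[lane] = in[lane] - 1.0f;
    }
    for (size_t lane = 0; lane < SIMD_WIDTH; lane++)
        out[lane] = (in[lane] > 0.0f) ? then_val[lane] : else_val[lane];
}
```

Both forms produce the same results, but the packed form does the work of both branch sides for every lane, so the more divergent paths a kernel has, the more of the SIMD speedup is spent on discarded work.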
The vectorization module packs work-items along dimension zero of the ND-range. Consider the following code example:
__kernel foo(…)
for (int i = 0; i < get_local_size(2); i++)
  for (int j = 0; j < get_local_size(1); j++)
    for (int k = 0; k < get_local_size(0); k++)
      Kernel_Body;
After vectorization, the work-group loop over work-items appears as follows:
__kernel foo(…)
for (int i = 0; i < get_local_size(2); i++)
  for (int j = 0; j < get_local_size(1); j++)
    for (int k = 0; k < get_local_size(0); k += SIMD_WIDTH)
      VECTORIZED_Kernel_Body;
Note that dimension zero is the innermost loop, and it is the one that is vectorized. For more information, refer to the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and Autovectorization in Intel® SDK for OpenCL™ Applications version 1.5.
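The loop transformation can be emulated in plain C to check that the scalar and packed forms visit the same work-items. This is a sketch under stated assumptions, not the code the module generates: the squaring kernel body, the function names, and SIMD_WIDTH = 16 are all hypothetical, and the size of dimension zero is assumed to be a multiple of SIMD_WIDTH.

```c
#include <assert.h>
#include <stddef.h>

enum { SIMD_WIDTH = 16 };  /* assumed: sixteen float lanes per register */

/* Hypothetical kernel body: one work-item squares one element. */
void kernel_body(const float *in, float *out, size_t idx)
{
    out[idx] = in[idx] * in[idx];
}

/* Packed body: covers SIMD_WIDTH adjacent work-items of dimension zero.
 * The lane loop stands in for a single SIMD instruction. */
void vectorized_kernel_body(const float *in, float *out, size_t idx)
{
    for (size_t lane = 0; lane < SIMD_WIDTH; lane++)
        out[idx + lane] = in[idx + lane] * in[idx + lane];
}

/* Work-group loop before vectorization: one work-item per iteration. */
void run_work_group_scalar(const float *in, float *out,
                           const size_t local_size[3])
{
    for (size_t i = 0; i < local_size[2]; i++)
        for (size_t j = 0; j < local_size[1]; j++)
            for (size_t k = 0; k < local_size[0]; k++)
                kernel_body(in, out,
                            (i * local_size[1] + j) * local_size[0] + k);
}

/* After vectorization: the innermost loop (dimension zero) advances
 * SIMD_WIDTH work-items per iteration. */
void run_work_group_vectorized(const float *in, float *out,
                               const size_t local_size[3])
{
    assert(local_size[0] % SIMD_WIDTH == 0);
    for (size_t i = 0; i < local_size[2]; i++)
        for (size_t j = 0; j < local_size[1]; j++)
            for (size_t k = 0; k < local_size[0]; k += SIMD_WIDTH)
                vectorized_kernel_body(in, out,
                            (i * local_size[1] + j) * local_size[0] + k);
}
```

For a 32 x 2 x 2 work group, the scalar loop calls the kernel body 128 times while the packed loop calls its body 8 times, and both fill the output identically. Because only dimension zero is packed, adjacent work-items in that dimension should touch adjacent data for the packed form to map well onto vector loads and stores.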