Tips for Auto-Vectorization
During kernel compilation, the vectorization module often transforms the kernel's memory access pattern from array of structures (AOS) to structure of arrays (SOA), which is SIMD-friendly.
This transformation comes at a cost: the transpose penalty. Organizing the input data as SOA instead of AOS avoids this penalty.
For example, the following code suffers from the transpose penalty:
__kernel void sum(__global float4* input, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}
The following code, in contrast, does not suffer from the transpose penalty:
__kernel void sum(__global float* inx, __global float* iny,
                  __global float* inz, __global float* inw,
                  __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
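If the data arrives in AOS layout, a one-time host-side transpose into SOA arrays is typically still cheaper than paying the transpose penalty inside the kernel on every execution. The following is a minimal host-side sketch in plain C; the float4_aos type and the aos_to_soa function name are hypothetical, not part of the OpenCL API:
#include <stddef.h>
/* Hypothetical AOS element type matching the float4 input above. */
typedef struct { float x, y, z, w; } float4_aos;
/* One-time transpose from AOS into the four SOA arrays. */
void aos_to_soa(const float4_aos *in, size_t n,
                float *inx, float *iny, float *inz, float *inw)
{
    for (size_t i = 0; i < n; ++i) {
        inx[i] = in[i].x;
        iny[i] = in[i].y;
        inz[i] = in[i].z;
        inw[i] = in[i].w;
    }
}
The four resulting arrays can then be bound to the inx, iny, inz, and inw arguments of the SOA kernel above.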
Take care when dealing with branches. In particular, avoid data loads and stores within branch statements:
if (…) {           // condition
    x = A[i1];     // reading from A
    …              // calculations
    B[i2] = y;     // storing into B
} else {
    q = A[i1];     // reading from A with the same index as in the first clause
    …              // different calculations
    B[i2] = w;     // storing into B with the same index as in the first clause
}
The following code avoids loading from and storing to memory within branches:
temp1 = A[i1];     // reading from A in advance
if (…) {           // condition
    x = temp1;
    …              // some calculations
    temp2 = y;     // storing into a temporary variable
} else {
    q = temp1;
    …              // some calculations
    temp2 = w;     // storing into a temporary variable
}
B[i2] = temp2;     // storing to B once
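For completeness, the following is a minimal, self-contained kernel applying this pattern; the condition and the arithmetic in each path are placeholders, not part of the original example:
__kernel void branch_example(__global const float* A, __global float* B)
{
    int tid = get_global_id(0);
    float temp1 = A[tid];          // single unconditional load
    float temp2;
    if (temp1 > 0.0f) {            // placeholder condition
        temp2 = temp1 * 2.0f;      // calculations for one path
    } else {
        temp2 = temp1 - 1.0f;      // different calculations for the other path
    }
    B[tid] = temp2;                // single unconditional store
}
With the memory operations hoisted out of the branch, the vectorizer can typically execute both paths under a SIMD mask and select between the temporary values, instead of generating conditional scattered loads and stores.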