6.2. Good Design Practices for Single Work-Item Kernel
Avoid Pointer Aliasing
Insert the restrict keyword in pointer arguments whenever possible. Including the restrict keyword in pointer arguments prevents the offline compiler from creating unnecessary memory dependencies between non-conflicting read and write operations. Consider a loop where each iteration reads data from one array and then writes data to another array in the same physical memory. Without the restrict keyword in these pointer arguments, the offline compiler might assume dependence between the two arrays and extract less pipeline parallelism as a result.
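A minimal sketch of the restrict guideline, written in plain C99 so that it compiles on a host compiler. The function name scale_copy and its parameters are hypothetical and are not part of the original guide. In an OpenCL kernel, the same qualifier would typically be applied to the __global pointer arguments.

```c
#include <stddef.h>

/* Hypothetical helper (not from the guide): writes src[i] * factor to dst[i].
 * The restrict qualifiers are a promise by the caller that src and dst
 * never overlap, so a compiler does not have to order each load from src
 * after the earlier stores to dst. */
void scale_copy(const int *restrict src, int *restrict dst,
                size_t n, int factor)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * factor;
    }
}
```

The promise is only valid if the caller really passes non-overlapping arrays. Passing aliased pointers to restrict-qualified arguments results in undefined behavior.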
Construct "Well-Formed" Loops
A "well-formed" loop has an exit condition that compares against an integer bound, and has a simple induction increment of one per iteration. Including "well-formed" loops in your kernel improves performance because the offline compiler can analyze these loops efficiently.
The following example is a "well-formed" loop:
for (i = 0; i < N; i++) {
    //statements
}
The following example is a "well-formed" nested loop structure:
for (i = 0; i < N; i++) {
    //statements
    for (j = 0; j < M; j++) {
        //statements
    }
}
Minimize Loop-Carried Dependencies
The loop structure below creates a loop-carried dependence because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the offline compiler can achieve, which reduces kernel performance.
for (int i = 1; i < N; i++) {
    A[i] = A[i - 1] + i;
}
The offline compiler performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. In some cases, the offline compiler might assume dependence between two array accesses and extract less pipeline parallelism as a result. The offline compiler assumes loop-carried dependence if it cannot resolve the dependencies at compilation time because of unknown variables, or if the array accesses involve complex addressing.
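One common way to shorten a loop-carried memory dependence such as A[i] = A[i - 1] + i is to carry the running value in a local variable, so that no iteration has to read back what the previous iteration stored. The sketch below is an illustration in plain C. The function name running_sum is hypothetical, and whether a particular compiler schedules this form more tightly is an assumption rather than something the guide states.

```c
/* Hypothetical rewrite (not from the guide) of A[i] = A[i - 1] + i.
 * The dependence between iterations is carried through the local
 * variable prev instead of through a store followed by a load of A. */
void running_sum(int *A, int n)
{
    if (n <= 0) {
        return;
    }
    int prev = A[0];            /* seed value, read from memory once */
    for (int i = 1; i < n; i++) {
        prev = prev + i;        /* dependence stays in a local variable */
        A[i] = prev;            /* memory is only written, never re-read */
    }
}
```

The loop still has a loop-carried dependence on prev, but it no longer involves a memory read that must wait for the previous iteration's write.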
To minimize loop-carried dependencies, follow the guidelines below whenever possible:
- Avoid pointer arithmetic.
Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:
for (int i = 0; i < N; i++) {
    int t = *(A++);
    *A = t;
}
- Introduce simple array indexes.
Avoid the following types of complex array indexes because the offline compiler cannot analyze them effectively, which might lead to suboptimal compiler output:
- Nonconstants in array indexes.
For example, A[K + i], where i is the loop index variable and K is an unknown variable.
- Multiple index variables in the same subscript location.
For example, A[i + 2 × j], where i and j are loop index variables for a double nested loop.
Note: The offline compiler can analyze the array index A[i][j] effectively because the index variables are in different subscripts.
- Nonlinear indexing.
For example, A[i & C], where i is a loop index variable and C is a constant or a nonconstant variable.
- Use loops with constant bounds in your kernel whenever possible.
Loops with constant bounds allow the offline compiler to perform range analysis effectively.
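The guidelines above can be sketched as follows, in plain C. The pointer-arithmetic loop shown earlier is compared with an equivalent loop that uses a simple array index and a constant bound. The names LEN, shift_with_pointer, and shift_with_index are hypothetical, and the array length is an assumption chosen for illustration.

```c
enum { LEN = 4 };   /* hypothetical constant array length */

/* Pointer-arithmetic form: each iteration copies the current element
 * into the next one by advancing the pointer A. */
void shift_with_pointer(int *A)
{
    for (int i = 0; i < LEN - 1; i++) {
        int t = *(A++);
        *A = t;
    }
}

/* Same behavior with a simple index: one loop variable, a constant
 * offset, and a constant loop bound. */
void shift_with_index(int *A)
{
    for (int i = 0; i < LEN - 1; i++) {
        A[i + 1] = A[i];
    }
}
```

Both functions leave every element equal to the original A[0]. The indexed form still carries a dependence from one iteration to the next, but the accesses A[i] and A[i + 1] are explicit, which is the kind of addressing the guidelines describe as easier to analyze.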
Avoid Complex Loop Exit Conditions
The offline compiler evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. There are times when the offline compiler requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, decreasing overall loop performance.
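As an illustration, the sketch below contrasts an exit condition that dereferences a pointer on every iteration with one that compares against a local value read once before the loop. It is plain C with hypothetical function names, and it assumes that the value behind limit does not change while the loop runs.

```c
/* Exit condition requires a memory access (*limit) on every iteration. */
int sum_with_memory_bound(const int *data, const int *limit)
{
    int sum = 0;
    for (int i = 0; i < *limit; i++) {
        sum += data[i];
    }
    return sum;
}

/* The bound is read once into a local variable, so the exit condition
 * is a plain integer comparison. Equivalent only under the assumption
 * that *limit is not modified during the loop. */
int sum_with_hoisted_bound(const int *data, const int *limit)
{
    const int bound = *limit;
    int sum = 0;
    for (int i = 0; i < bound; i++) {
        sum += data[i];
    }
    return sum;
}
```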
Convert Nested Loops into a Single Loop
To maximize performance, combine nested loops into a single form whenever possible. Restructuring nested loops into a single loop reduces hardware footprint and computational overhead between loop iterations.
The following code examples illustrate the conversion of a nested loop into a single loop:
Nested Loop | Converted Single Loop |
---|---|
|
|
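A minimal sketch of such a conversion in plain C, using hypothetical names (ROWS, COLS, weighted_sum_nested, weighted_sum_single) that are not taken from the guide. The single-loop version iterates ROWS × COLS times and maintains the row and column counters by hand, so the array subscripts stay simple.

```c
enum { ROWS = 3, COLS = 4 };   /* hypothetical constant bounds */

/* Nested form: the outer loop walks rows, the inner loop walks columns. */
int weighted_sum_nested(int m[ROWS][COLS])
{
    int sum = 0;
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            sum += m[i][j] * (i + 1);
        }
    }
    return sum;
}

/* Single-loop form: one induction variable k with a constant bound.
 * The counters i and j are updated manually at the end of each iteration. */
int weighted_sum_single(int m[ROWS][COLS])
{
    int sum = 0;
    int i = 0;
    int j = 0;
    for (int k = 0; k < ROWS * COLS; k++) {
        sum += m[i][j] * (i + 1);
        j++;
        if (j == COLS) {   /* end of a row: wrap the column, advance the row */
            j = 0;
            i++;
        }
    }
    return sum;
}
```

Both functions visit the elements in the same order and return the same result.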
Avoid Conditional Loops
To maximize performance, avoid declaring conditional loops. Conditional loops are tuples of loops that are declared within conditional statements such that one and only one of the loops is expected to be reached. These loops cannot be efficiently parallelized and result in a serialized implementation.
Conditional Loops | Converted Loop |
---|---|
|
|
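A minimal sketch of this kind of conversion in plain C. The function names, parameters, and loop bodies are hypothetical. The converted version selects the loop bound before the loop and moves the condition inside a single loop body.

```c
/* Conditional loops: exactly one of the two loops runs per call. */
int sum_conditional(const int *a, int m, const int *b, int m2, int use_a)
{
    int sum = 0;
    if (use_a) {
        for (int i = 0; i < m; i++) {
            sum += a[i];
        }
    } else {
        for (int i = 0; i < m2; i++) {
            sum += 2 * b[i];
        }
    }
    return sum;
}

/* Converted form: a single loop whose bound is chosen up front,
 * with the condition evaluated inside the loop body. */
int sum_single_loop(const int *a, int m, const int *b, int m2, int use_a)
{
    const int bound = use_a ? m : m2;
    int sum = 0;
    for (int i = 0; i < bound; i++) {
        if (use_a) {
            sum += a[i];
        } else {
            sum += 2 * b[i];
        }
    }
    return sum;
}
```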
Declare Variables in the Deepest Scope Possible
To reduce the hardware resources necessary for implementing a variable, declare the variable in the innermost loop or block that uses it, just before its first use. Declaring variables in the deepest scope possible minimizes data dependencies and hardware usage because the offline compiler does not need to preserve the variable data across loops that do not use the variables.
Consider the following example:
int a[N];
for (int i = 0; i < m; ++i) {
    int b[N];
    for (int j = 0; j < n; ++j) {
        // statements
    }
}
The array a requires more resources to implement than the array b because its data must be preserved across all iterations of the outer loop. To reduce hardware usage, declare array a inside the outer loop, as array b is declared, unless it is necessary to maintain the data through iterations of the outer loop.
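A small sketch of the deepest-scope guideline in plain C. The names SCRATCH_LEN and scoped_sum, and the loop bodies, are hypothetical. The scratch array is declared inside the outer loop because no value in it needs to survive from one outer iteration to the next.

```c
enum { SCRATCH_LEN = 4 };   /* hypothetical constant array length */

int scoped_sum(int m)
{
    int total = 0;               /* must persist across outer iterations */
    for (int i = 0; i < m; ++i) {
        int b[SCRATCH_LEN];      /* only needed within one outer iteration */
        for (int j = 0; j < SCRATCH_LEN; ++j) {
            b[j] = i * j;
        }
        for (int j = 0; j < SCRATCH_LEN; ++j) {
            total += b[j];
        }
    }
    return total;
}
```

Only total is declared outside the loops, because it is the only value that accumulates across outer-loop iterations.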