7.4. Improving Kernel Performance by Banking the Local Memory
The following code example depicts an 8 x 4 local memory system that is implemented in a single bank. As a result, no two elements in the system can be accessed in parallel.
local int lmem[8][4];
#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[i][x] = …;
}
To improve performance, you can add the numbanks(N) and bankwidth(M) attributes to the local memory declaration to define the number of memory banks and the width of each bank in bytes. The following code implements eight memory banks, each 16 bytes wide. Because a row of four int elements occupies 16 bytes, this memory bank configuration enables parallel memory accesses down the 8 x 4 array, that is, to different rows.
local int __attribute__((numbanks(8),
                         bankwidth(16)))
          lmem[8][4];
#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[i][x & 0x3] = …;
}
To enable parallel access, you must mask the dynamic access on the lower array index. The mask (x & 0x3) informs the compiler that x will not exceed the bounds of the lower index.
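The mapping from array elements to banks can be sketched on the host in plain C. This is a simplified model, not the compiler's actual implementation: it assumes that the bank is selected by the address bits immediately above the bank width, and that an int occupies 4 bytes as in OpenCL. The names bank_of and lmem_offset are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { LMEM_COLS = 4, ELEM_BYTES = 4 }; /* lmem[8][4] of 32-bit int */

/* Hypothetical helper: byte offset of lmem[row][col] in row-major order. */
static size_t lmem_offset(unsigned row, unsigned col)
{
    return ((size_t)row * LMEM_COLS + col) * ELEM_BYTES;
}

/* Hypothetical helper: bank that a byte offset falls in, ASSUMING banks
   are interleaved every bankwidth bytes across numbanks banks. */
static unsigned bank_of(size_t byte_offset, unsigned numbanks, unsigned bankwidth)
{
    assert(numbanks > 0 && bankwidth > 0);
    return (unsigned)((byte_offset / bankwidth) % numbanks);
}
```

Under these assumptions, bank_of(lmem_offset(i, x & 0x3), 8, 16) evaluates to i for every row i from 0 to 7, so the unrolled stores to rows 0 and 2 fall in different banks regardless of the masked column index.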
By specifying different values for the numbanks(N) and bankwidth(M) kernel attributes, you can change the parallel access pattern. The following code implements four memory banks, each 4 bytes wide. This memory bank configuration enables parallel memory accesses across the 8 x 4 array, that is, to different columns.
local int __attribute__((numbanks(4),
                         bankwidth(4)))
          lmem[8][4];
#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[x][i] = …;
}
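To compare the two configurations, the following host-side C sketch checks whether the two unrolled stores (i = 0 and i = 2) land in different banks. It is a simplified model under stated assumptions: banks are interleaved every bankwidth bytes, an int occupies 4 bytes, and "parallel" means only that the two accesses map to different banks. Port counts, arbitration, and any replication the compiler might apply are not modeled. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ROWS = 8, COLS = 4, INT_BYTES = 4 }; /* lmem[8][4] of 32-bit int */

/* Bank holding lmem[row][col], ASSUMING byte-interleaved banks. */
static unsigned bank_index(unsigned row, unsigned col,
                           unsigned numbanks, unsigned bankwidth)
{
    assert(row < ROWS && col < COLS && numbanks > 0 && bankwidth > 0);
    size_t offset = ((size_t)row * COLS + col) * INT_BYTES;
    return (unsigned)((offset / bankwidth) % numbanks);
}

/* Stores lmem[0][x] and lmem[2][x]: accesses down the array. */
static bool row_stores_parallel(unsigned x, unsigned numbanks, unsigned bankwidth)
{
    return bank_index(0, x, numbanks, bankwidth)
        != bank_index(2, x, numbanks, bankwidth);
}

/* Stores lmem[x][0] and lmem[x][2]: accesses across the array. */
static bool column_stores_parallel(unsigned x, unsigned numbanks, unsigned bankwidth)
{
    return bank_index(x, 0, numbanks, bankwidth)
        != bank_index(x, 2, numbanks, bankwidth);
}
```

In this model, column_stores_parallel(x, 4, 4) holds for every row x, while the same pair of stores under numbanks(8) and bankwidth(16) maps to a single bank. The reverse is true for row_stores_parallel, which is why each access pattern needs its own bank configuration.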