Using the Blocking Technique
An important class of algorithmic changes involves changing the access pattern so that the working set fits in cache. By reorganizing memory accesses, you can load the cache with a small subset of a much larger data set and then work on that block of data while it is in cache. Using and reusing this cached data reduces the number of trips to memory and, with it, the memory bandwidth pressure.
Blocking is an optimization technique that helps avoid memory bandwidth bottlenecks. It exploits the data reuse inherent in the application by ensuring that the data remains in cache across multiple uses. You can apply the blocking technique to one-, two-, or three-dimensional spatial data structures. Some iterative applications can also benefit from blocking across multiple iterations, known as temporal blocking, which further mitigates bandwidth bottlenecks but requires merging some kernels. In terms of code changes, blocking typically involves loop splitting. In most application codes, blocking is best implemented by parameterizing the block factors and tuning them.
For instance, the code snippet below shows an example of blocking with two loops (i1 and i2), each iterating over the entire data set. The original code streams through the entire set in the inner loop and must load the data[i2] value from memory on every iteration. Assuming NUM is large and compute is not arithmetically intensive, there is no reuse in cache, so the application is memory-bandwidth bound.
The original code without applying the blocking technique appears as follows:
for (i1 = 0; i1 < NUM; i1++) {
    for (i2 = 0; i2 < NUM; i2++) {
        OUT[i1] += compute(data[i1], data[i2]);
    }
}
The blocked code below is obtained by splitting the i2 loop into an outer loop that iterates over the data in steps of BLOCK and an inner i22 loop that iterates over the elements within each block, and then interchanging the i1 and i2 loops. This code reuses a set of BLOCK data[i2] values across multiple iterations of the i1 loop. If you choose BLOCK so that this set of values fits in the L2 cache, memory traffic is reduced by a factor of BLOCK.
The modified code with use of one-dimensional blocking appears as follows:
for (i2 = 0; i2 < NUM; i2 += BLOCK) {
    for (i1 = 0; i1 < NUM; i1++) {
        for (i22 = 0; i22 < BLOCK; i22++) {
            OUT[i1] += compute(data[i1], data[i2 + i22]);
        }
    }
}
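The blocked code above assumes that NUM is a multiple of BLOCK. The following self-contained sketch (the block factor, data initialization, and compute function are illustrative assumptions, not part of the original sample) clamps the inner loop bound so that any value of NUM works, which also makes BLOCK easy to expose as a tuning parameter:

#include <stdio.h>

#define NUM   10000
#define BLOCK 256   /* illustrative block factor; tune it so one block of data fits in the L2 cache */

static float data[NUM];
static float OUT[NUM];

/* Placeholder for the real per-pair computation (assumed for illustration). */
static float compute(float a, float b)
{
    return a * b;
}

int main(void)
{
    for (int i = 0; i < NUM; i++) {
        data[i] = (float)i;
        OUT[i]  = 0.0f;
    }

    /* Blocked loops; the inner bound is clamped so NUM need not be a multiple of BLOCK. */
    for (int i2 = 0; i2 < NUM; i2 += BLOCK) {
        int i2_end = (i2 + BLOCK < NUM) ? i2 + BLOCK : NUM;
        for (int i1 = 0; i1 < NUM; i1++) {
            for (int i22 = i2; i22 < i2_end; i22++) {
                OUT[i1] += compute(data[i1], data[i22]);
            }
        }
    }

    printf("OUT[0] = %f\n", OUT[0]);
    return 0;
}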
For a more detailed example of tiling, refer to the General Matrix Multiply Sample.
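The referenced sample applies the same idea in two dimensions. As a rough illustration only (the matrix size, tile size, and loop order here are assumptions and are not taken from that sample), a tiled matrix multiply keeps small sub-blocks of the matrices resident in cache while they are reused:

#include <stdio.h>

#define N    512
#define TILE 64   /* illustrative tile size; tune it so the active sub-blocks of A, B, and C fit in cache */

static float A[N][N], B[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0f;
            B[i][j] = 2.0f;
            C[i][j] = 0.0f;
        }
    }

    /* Tile the j and k dimensions; each (jj, kk) sub-block of B is reused for every row i. */
    for (int jj = 0; jj < N; jj += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            for (int i = 0; i < N; i++) {
                for (int j = jj; j < jj + TILE; j++) {
                    float sum = C[i][j];
                    for (int k = kk; k < kk + TILE; k++) {
                        sum += A[i][k] * B[k][j];
                    }
                    C[i][j] = sum;
                }
            }
        }
    }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}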