5.2.3. Example: Loop Pipelining and Unrolling
#define ROWS 4
#define COLS 4

component void dut(...) {
  float a_matrix[COLS][ROWS]; // store in column-major format
  float r_matrix[ROWS][COLS]; // store in row-major format

  // setup...

  for (int i = 0; i < COLS; i++) {
    for (int j = i + 1; j < COLS; j++) {

      float dotProduct = 0;
      for (int mRow = 0; mRow < ROWS; mRow++) {
        dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
      }
      r_matrix[i][j] = dotProduct;
    }
  }

  // continue...

}
You can improve the performance of this component by unrolling the loops that iterate across each entry of a particular column. If the loop operations are independent, then the compiler can execute them in parallel. For the dot product computed here:
- The multiplication operations can occur in parallel.
- The addition operations can be composed into an adder tree instead of an adder chain.
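The difference between an adder chain and an adder tree can be sketched in plain C++. This is an illustration only, not the compiler's output: the function names `dot_chain`, `dot_tree`, and `check_same_result` are hypothetical, and whether the compiler actually rebalances floating-point additions depends on its settings. Floating-point addition is not associative, so in general the two orderings can differ in rounding. The small integer values used here are exactly representable, so both orderings give the same result.

```cpp
#include <cassert>

constexpr int ROWS = 4;

// Adder chain: every addition depends on the previous sum,
// so the additions form one long dependent sequence.
float dot_chain(const float a[ROWS], const float b[ROWS]) {
    float sum = 0.0f;
    for (int k = 0; k < ROWS; k++) {
        sum += a[k] * b[k];
    }
    return sum;
}

// Adder tree: the four products are independent of each other,
// and the additions are paired so that only two levels of
// additions depend on one another.
float dot_tree(const float a[ROWS], const float b[ROWS]) {
    const float p0 = a[0] * b[0];
    const float p1 = a[1] * b[1];
    const float p2 = a[2] * b[2];
    const float p3 = a[3] * b[3];
    return (p0 + p1) + (p2 + p3);
}

// Hypothetical sample data; values chosen to be exactly representable.
void check_same_result() {
    const float a[ROWS] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[ROWS] = {5.0f, 6.0f, 7.0f, 8.0f};
    assert(dot_chain(a, b) == dot_tree(a, b));
}
```

In the chain, the last addition cannot start until the three before it finish. In the tree, `p0 + p1` and `p2 + p3` have no dependency on each other, which is what lets hardware compute them at the same time.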
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:   float a_matrix[COLS][ROWS]; // store in column-major format
06:   float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:   // setup...
09:
10:   for (int i = 0; i < COLS; i++) {
11:
12:     #pragma unroll
13:     for (int j = 0; j < COLS; j++) {
14:       float dotProduct = 0;
15:
16:       #pragma unroll
17:       for (int mRow = 0; mRow < ROWS; mRow++) {
18:         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
19:       }
20:
21:       r_matrix[i][j] = (j > i) ? dotProduct : 0; // predication
22:     }
23:   }
24:
25:   // continue...
26:
27: }
Now the j-loop is fully unrolled. Because its iterations do not depend on one another, all four iterations run at the same time.
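The predication that gives the j-loop a constant trip count can be checked in plain C++. The sketch below is a software model only: it omits the `component` keyword and the `#pragma unroll` directives, and the names `column_dot`, `triangular`, `predicated`, and `check_forms_agree` are hypothetical. It assumes the result matrix starts zeroed, because the triangular form never writes the entries where `j <= i`.

```cpp
#include <cassert>

constexpr int ROWS = 4;
constexpr int COLS = 4;

// Dot product of columns i and j of a column-major matrix.
float column_dot(const float a[COLS][ROWS], int i, int j) {
    float dot = 0.0f;
    for (int m = 0; m < ROWS; m++) {
        dot += a[i][m] * a[j][m];
    }
    return dot;
}

// Triangular form: the j-loop trip count depends on i.
void triangular(const float a[COLS][ROWS], float r[ROWS][COLS]) {
    for (int i = 0; i < COLS; i++) {
        for (int j = i + 1; j < COLS; j++) {
            r[i][j] = column_dot(a, i, j);
        }
    }
}

// Predicated form: the j-loop always runs COLS times, and the
// results for j <= i are replaced with 0 instead of being skipped.
void predicated(const float a[COLS][ROWS], float r[ROWS][COLS]) {
    for (int i = 0; i < COLS; i++) {
        for (int j = 0; j < COLS; j++) {
            const float dot = column_dot(a, i, j);
            r[i][j] = (j > i) ? dot : 0.0f;
        }
    }
}

// Hypothetical sample data; both result matrices start zeroed.
void check_forms_agree() {
    const float a[COLS][ROWS] = {{1.0f, 2.0f, 3.0f, 4.0f},
                                 {2.0f, 3.0f, 4.0f, 5.0f},
                                 {3.0f, 4.0f, 5.0f, 6.0f},
                                 {4.0f, 5.0f, 6.0f, 7.0f}};
    float r_tri[ROWS][COLS] = {};
    float r_pred[ROWS][COLS] = {};
    triangular(a, r_tri);
    predicated(a, r_pred);
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            assert(r_tri[i][j] == r_pred[i][j]);
        }
    }
}
```

The predicated form computes dot products that are then discarded, which is the price of a trip count that does not depend on `i`.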
Refer to the resource_sharing_filter tutorial located at <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter for more details.
You could continue and also unroll the i-loop at line 10, but unrolling this loop would increase the area again. By allowing the compiler to pipeline this loop instead of unrolling it, you avoid the extra area and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loops Analysis page in the high-level design report (report.html) gives you tips on how to improve it. Typical factors that affect loop II include the following:
- loop-carried dependencies (see the tutorial at <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency)
- long critical loop path
- inner loops with a loop II > 1
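The cost of pipelining the i-loop rather than unrolling it can be estimated with a simplified model, sketched below. The model is an assumption, not something the compiler reports: it treats the first iteration as taking a fixed latency and launches each later iteration II cycles after the previous one. The function names and the latency value of 20 cycles are hypothetical. The actual figures for a design come from the high-level design report.

```cpp
// Simplified pipelined-loop model (an assumption for illustration):
// total cycles = latency of one iteration + (trip_count - 1) * II.
constexpr long pipelined_cycles(long trip_count, long latency, long ii) {
    return trip_count > 0 ? latency + (trip_count - 1) * ii : 0;
}

// Extra cycles paid compared with running a single iteration.
constexpr long extra_cycles(long trip_count, long ii) {
    return trip_count > 0 ? (trip_count - 1) * ii : 0;
}

// Four i-loop iterations with a hypothetical 20-cycle iteration latency.
static_assert(pipelined_cycles(4, 20, 1) == 23, "II = 1: three extra cycles");
static_assert(pipelined_cycles(4, 20, 2) == 26, "II = 2: six extra cycles");
```

Under this model, four iterations at an II of 1 add only a handful of cycles, which is in line with the rough figure quoted above. Doubling the II doubles that overhead, which is why the factors in the list are worth checking.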