4.2.3. Example: Loop Pipelining and Unrolling
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:   float a_matrix[COLS][ROWS]; // store in column-major format
06:   float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:   // setup...
09:
10:   for (int i = 0; i < COLS; i++) {
11:     for (int j = i + 1; j < COLS; j++) {
12:
13:       float dotProduct = 0;
14:       for (int mRow = 0; mRow < ROWS; mRow++) {
15:         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
16:       }
17:       r_matrix[i][j] = dotProduct;
18:     }
19:   }
20:
21:   // continue...
22:
23: }
You can improve the performance of this component by unrolling the loops that iterate across each entry of a particular column. If the loop operations are independent, then the compiler executes them in parallel.
Floating-point operations typically must be carried out in the same order that they are expressed in your source code to preserve numerical precision. However, you can use the --fp-relaxed compiler flag to relax the ordering of floating-point operations. With the order of floating-point operations relaxed, all of the multiplications in this loop can occur in parallel. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops.
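To see why the order of floating-point operations matters, the following plain C++ sketch (not HLS-specific) computes the same four-element dot product in two orders. The function names are hypothetical, and the pairwise grouping is only one reordering that a relaxed mode might permit; it is not a statement of what --fp-relaxed actually generates.

```cpp
// Plain C++ sketch; function names are hypothetical and not part of any HLS API.

// Accumulates the products in source order: ((p0 + p1) + p2) + p3.
// Each addition depends on the previous one, so the additions form a chain.
float dot_in_source_order(const float (&x)[4], const float (&y)[4]) {
    float sum = 0.0f;
    for (int k = 0; k < 4; ++k) {
        sum += x[k] * y[k];
    }
    return sum;
}

// One possible reordering (an assumption): (p0 + p1) + (p2 + p3).
// The four multiplications are independent of one another, and so are
// the two inner additions, which is what exposes parallelism.
float dot_pairwise(const float (&x)[4], const float (&y)[4]) {
    const float p0 = x[0] * y[0];
    const float p1 = x[1] * y[1];
    const float p2 = x[2] * y[2];
    const float p3 = x[3] * y[3];
    const float lo = p0 + p1;
    const float hi = p2 + p3;
    return lo + hi;
}
```

With x = {1e8f, 1.0f, -1e8f, 1.0f} and y set to all ones, the source-order version returns 1.0f and the pairwise version returns 0.0f in IEEE 754 single precision, because adding 1.0f to a value of magnitude 1e8 is lost to rounding. Both orders are mathematically equivalent, which is why reordering is safe only when you accept such differences.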
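The compiler tries to unroll loops on its own when it determines that unrolling improves performance. For example, the mRow-loop at line 14 of the preceding listing is automatically unrolled because it has a constant number of iterations (ROWS is a constant defined at compile time) and does not consume much hardware.
The j-loop at line 11 is different: it starts at j = i + 1, so its trip count changes with i, whereas fully unrolling a loop requires a trip count that is known at compile time. The following version gives the j-loop constant bounds by starting it at j = 0, marks it with #pragma unroll, and adds a predication statement so that entries with j <= i are assigned 0 instead of a dot product.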
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:   float a_matrix[COLS][ROWS]; // store in column-major format
06:   float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:   // setup...
09:
10:   for (int i = 0; i < COLS; i++) {
11:
12:     #pragma unroll
13:     for (int j = 0; j < COLS; j++) {
14:       float dotProduct = 0;
15:
16:       #pragma unroll
17:       for (int mRow = 0; mRow < ROWS; mRow++) {
18:         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
19:       }
20:
21:       r_matrix[i][j] = (j > i) ? dotProduct : 0; // predication
22:     }
23:   }
24:
25:   // continue...
26:
27: }
Now the j-loop is fully unrolled. Because its iterations do not depend on one another, all four of them run at the same time.
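The effect of the predication can be checked in ordinary software. The following plain C++ sketch, which assumes a zero-initialized result matrix and uses hypothetical names (AMatrix, RMatrix, gram_triangular, gram_predicated), compares the triangular j-loop with the fixed-bound, predicated j-loop. It omits the component keyword and the unroll pragmas so that it compiles with any C++ compiler.

```cpp
#include <array>

constexpr int ROWS = 4;
constexpr int COLS = 4;

// Hypothetical type aliases for this sketch only.
using AMatrix = std::array<std::array<float, ROWS>, COLS>; // column-major
using RMatrix = std::array<std::array<float, COLS>, ROWS>; // row-major

// Triangular version: the j-loop trip count depends on i.
void gram_triangular(const AMatrix& a, RMatrix& r) {
    for (int i = 0; i < COLS; i++) {
        for (int j = i + 1; j < COLS; j++) {
            float dotProduct = 0;
            for (int mRow = 0; mRow < ROWS; mRow++) {
                dotProduct += a[i][mRow] * a[j][mRow];
            }
            r[i][j] = dotProduct;
        }
    }
}

// Fixed-bound version: the j-loop always runs COLS times, and the
// predicate (j > i) decides whether the dot product is kept.
void gram_predicated(const AMatrix& a, RMatrix& r) {
    for (int i = 0; i < COLS; i++) {
        for (int j = 0; j < COLS; j++) {
            float dotProduct = 0;
            for (int mRow = 0; mRow < ROWS; mRow++) {
                dotProduct += a[i][mRow] * a[j][mRow];
            }
            r[i][j] = (j > i) ? dotProduct : 0; // predication
        }
    }
}
```

The two versions differ in one respect: the triangular version never writes entries with j <= i, while the predicated version overwrites them with 0. They agree on every entry only when the result matrix starts out as all zeros, which is the assumption made here.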
Refer to the resource_sharing_filter tutorial located at <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter for more details.
You could continue and also unroll the i-loop at line 10, but unrolling it would increase the area again. By allowing the compiler to pipeline this loop instead of unrolling it, you avoid the area increase and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loops Analysis page in the high-level design report (report.html) gives you tips on how to improve it. The following factors typically affect loop II:
- loop-carried dependencies (see the tutorial at <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency)
- long critical loop path
- inner loops with a loop II > 1
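The estimate of only a few extra clock cycles for pipelining the i-loop can be sketched with a common first-order model of a pipelined loop. This model is an assumption for illustration and does not come from the compiler: it supposes that every iteration has the same latency, that a new iteration launches every II cycles, and that nothing stalls. The function name is hypothetical.

```cpp
// First-order estimate of the cycles a pipelined loop needs (an assumed
// model, not compiler output): the first iteration takes
// iteration_latency cycles, and each later iteration finishes ii cycles
// after the one before it.
constexpr long pipelined_loop_cycles(long iteration_latency, long ii,
                                     long trip_count) {
    return trip_count <= 0 ? 0
                           : iteration_latency + ii * (trip_count - 1);
}
```

Using an arbitrary placeholder latency of 20 cycles (not a measured value), four iterations at an II of 1 give 20 + 1 * 3 = 23 cycles, which is only a handful more than a single iteration. At an II of 2 the same loop needs 20 + 2 * 3 = 26 cycles, so the extra cost of pipelining grows in proportion to the II.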