Intel® High Level Synthesis Compiler Pro Edition: Reference Manual

ID 683349
Date 12/04/2023
Public

A.3. Cholesky Decomposition Library

The Cholesky Decomposition library provided with the Intel HLS Compiler is an FPGA-optimized templated library that factors a Hermitian (or, for real-valued matrices, symmetric) positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose.
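As a point of reference, the factorization for the real-valued case can be sketched in plain C++ as follows. This is a software model for checking results, not the library's FPGA implementation. The function name cholesky_reference, the constant kMatrixSize, and the row-major layout with a row stride of kMatrixSize are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical storage dimension, standing in for MATRIX_SIZE.
constexpr std::size_t kMatrixSize = 4;

// Software reference model (assumption: row-major storage, row stride kMatrixSize).
// Factors the leading n x n block of a symmetric positive-definite matrix A
// into a lower triangular matrix L such that A = L * L^T.
// Returns false if a non-positive pivot is found (A is not positive definite).
template <typename FP_T>
bool cholesky_reference(const FP_T *A, FP_T *L, std::size_t n) {
  assert(n <= kMatrixSize);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      // Subtract the dot product of the already-computed parts of rows i and j.
      FP_T sum = A[i * kMatrixSize + j];
      for (std::size_t k = 0; k < j; ++k) {
        sum -= L[i * kMatrixSize + k] * L[j * kMatrixSize + k];
      }
      if (i == j) {
        if (sum <= FP_T(0)) return false;
        L[i * kMatrixSize + i] = std::sqrt(sum);
      } else {
        L[i * kMatrixSize + j] = sum / L[j * kMatrixSize + j];
      }
    }
    // Clear the strictly upper triangular part of row i.
    for (std::size_t j = i + 1; j < n; ++j) {
      L[i * kMatrixSize + j] = FP_T(0);
    }
  }
  return true;
}
```

A model like this can serve as a golden reference in a testbench when comparing against the output of the component.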

Header File

To include the Cholesky decomposition library in your component, add the following line to your component:
#include "HLS/cholesky_decompose.h"
The header file is self-documented. You can review the header file to learn how to use the Cholesky decomposition library. For high performance, use the -ffp-reassociate command flag when compiling your component.

Variants

You can call four different variants. All variants are in the ihc::cholesky namespace.

Cholesky decomposition iterates on the matrix heavily. For the best component performance, the component needs low-latency access to the matrix that it iterates on, and that matrix must be vectorized. However, this requirement is not always easy to meet, so two of the variants allow you to use separate pieces of memory for the iterative decomposition and for recording the final output.

The variants provided are as follows:

  • cholesky_decompose_real(A_input, L_iter, n);

    Used for real-valued matrices.

  • cholesky_decompose_real(A_input, L_output, L_iter, n);

    Used for real-valued matrices with separate memory for iterative decomposition and final output.

  • cholesky_decompose_complex(A_input, L_iter, n);

    Used for complex-valued matrices.

  • cholesky_decompose_complex(A_input, L_output, L_iter, n);

    Used for complex-valued matrices with separate memory for iterative decomposition and final output.
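The complex variants differ from the real ones in that the factorization uses the conjugate transpose. A minimal software sketch of the complex case follows. It is not the library code: the function name cholesky_reference_complex, the use of std::complex, and the row-major layout with an explicit stride parameter are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>

// Software reference model for a Hermitian positive-definite matrix A.
// Assumptions: row-major storage with row stride `stride`, std::complex elements.
// Computes a lower triangular L with A = L * L^H, reading only the lower
// triangle of A. Returns false if a non-positive pivot is found.
template <typename FP_T>
bool cholesky_reference_complex(const std::complex<FP_T> *A,
                                std::complex<FP_T> *L,
                                std::size_t n, std::size_t stride) {
  assert(n <= stride);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      std::complex<FP_T> sum = A[i * stride + j];
      for (std::size_t k = 0; k < j; ++k) {
        // Conjugate the second factor: this is the only change from the real case.
        sum -= L[i * stride + k] * std::conj(L[j * stride + k]);
      }
      if (i == j) {
        // The diagonal of L is real and positive.
        if (sum.real() <= FP_T(0)) return false;
        L[i * stride + i] = std::complex<FP_T>(std::sqrt(sum.real()), FP_T(0));
      } else {
        L[i * stride + j] = sum / L[j * stride + j].real();
      }
    }
    // Clear the strictly upper triangular part of row i.
    for (std::size_t j = i + 1; j < n; ++j) {
      L[i * stride + j] = std::complex<FP_T>(FP_T(0), FP_T(0));
    }
  }
  return true;
}
```

For example, the 2 x 2 Hermitian matrix with rows (4, 2+2i) and (2-2i, 11) factors into L with rows (2, 0) and (1-i, 3).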

Arguments

A_input

A pointer to an array of size MATRIX_SIZE * MATRIX_SIZE that provides the input matrix A.

Its data width and alignment should match that of a single element in the matrix.

L_iter

A pointer to an array of size MATRIX_SIZE * MATRIX_SIZE, for holding the L matrix that the program can iterate on.

In the 3-argument variants, the result is also recorded in this array. Its data width and alignment should match that of a vector of matrix elements at the specified vectorization width.

If a pointer to component memory is given, the compiler should be able to optimize that automatically.

L_output

A pointer to an array of size MATRIX_SIZE * MATRIX_SIZE, for recording the final matrix result in the 4-argument variants.

Its data width and alignment should match that of a single element in the matrix.

n

The actual matrix size.

It should not be greater than MATRIX_SIZE.

Template Arguments

FP_T
The floating-point data type. Can be float or double.
VEC_SIZE_PWR
The logarithm (base 2) of the vectorization width. Vectorization with a width of 2^VEC_SIZE_PWR is used for accesses to the L_iter matrix and for the dot product computation.
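To make the relationship between VEC_SIZE_PWR and the vector width concrete, the following plain C++ sketch computes the width and uses it in a lane-wise dot product. It only illustrates the idea and is not taken from the header; the names vec_width and dot_vectorized and the tail handling are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Vector width implied by VEC_SIZE_PWR: 2^VEC_SIZE_PWR elements per access.
template <int VEC_SIZE_PWR>
constexpr std::size_t vec_width() {
  static_assert(VEC_SIZE_PWR >= 0, "VEC_SIZE_PWR must be non-negative");
  return std::size_t{1} << VEC_SIZE_PWR;
}

// Hypothetical lane-wise dot product: each step consumes one vector of
// vec_width<VEC_SIZE_PWR>() elements and keeps one partial sum per lane.
template <int VEC_SIZE_PWR, typename FP_T>
FP_T dot_vectorized(const FP_T *a, const FP_T *b, std::size_t len) {
  assert(len == 0 || (a != nullptr && b != nullptr));
  constexpr std::size_t kWidth = vec_width<VEC_SIZE_PWR>();
  FP_T lanes[kWidth] = {};
  std::size_t i = 0;
  // Full vectors: all lanes are updated together.
  for (; i + kWidth <= len; i += kWidth) {
    for (std::size_t v = 0; v < kWidth; ++v) {
      lanes[v] += a[i + v] * b[i + v];
    }
  }
  // Leftover elements that do not fill a whole vector.
  for (; i < len; ++i) {
    lanes[0] += a[i] * b[i];
  }
  // Final reduction across lanes.
  FP_T sum = FP_T(0);
  for (std::size_t v = 0; v < kWidth; ++v) {
    sum += lanes[v];
  }
  return sum;
}
```

With VEC_SIZE_PWR set to 2, for instance, each step covers four elements. Note that summing by lanes changes the order of the floating-point additions compared with a sequential loop.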
This header uses the Intel® HLS Compiler ivdep pragma with its safelen() clause. Dummy iterations matching the safelen() values are inserted to manage memory access dependencies more efficiently.
The ivdep pragma with its safelen() clause is used with the following template arguments in the Cholesky decomposition algorithm:
  • The INNER_SAFELEN value applies to the partial_dot array.
  • The OUTER_SAFELEN_OVERWRITE value applies to the L_iter matrix.
Although reasonable estimates of the safelen() values are provided, the values might require some tuning for different devices, clock targets, precisions, and memory arrangements.
Use an iterative approach to find the optimal values. First, try large, conservative estimates until the compilation result shows no II issues and gives you a satisfactory fMAX. Then, use the Function Viewer (part of the High-Level Design Reports) to examine the schedule of the load and store nodes for the memory that the safelen() value applies to. The difference between the start cycles of the load and the store should approximately match the optimal value to use for that safelen() value.
INNER_SAFELEN
The INNER_SAFELEN value applies to the partial_dot array. Its value roughly matches the latency of the index calculation and a floating-point addition in the algorithm. This safelen() value does not grow with matrix size. A default value of 16 is given, which should be ideal for single precision. For double precision, increase it accordingly.
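One pattern that yields a fixed dependence distance of the kind a safelen() value describes is a rotating set of partial sums, sketched below. This is an illustration only, not the header's actual code: the function name accumulate_rotating, the constant kInnerSafelen, and the rotating-buffer scheme are assumptions, and the __INTELFPGA_COMPILER__ guard is assumed to identify the Intel HLS Compiler so that other compilers skip the pragma.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constant mirroring the documented INNER_SAFELEN default of 16.
constexpr int kInnerSafelen = 16;

// Accumulates x[0..len) using kInnerSafelen rotating partial sums.
// Each element of `partial` is revisited only every kInnerSafelen iterations,
// so the loop-carried dependence distance on `partial` is kInnerSafelen.
inline float accumulate_rotating(const float *x, std::size_t len) {
  assert(len == 0 || x != nullptr);
  float partial[kInnerSafelen] = {};
#if defined(__INTELFPGA_COMPILER__)  // assumption: defined only by the HLS compiler
#pragma ivdep safelen(kInnerSafelen)
#endif
  for (std::size_t i = 0; i < len; ++i) {
    partial[i % kInnerSafelen] += x[i];
  }
  // Final reduction of the partial sums.
  float sum = 0.0f;
  for (int s = 0; s < kInnerSafelen; ++s) {
    sum += partial[s];
  }
  return sum;
}
```

In this sketch, the constant plays the role that the load-to-store latency read from the Function Viewer plays when tuning: if the value is smaller than the real latency of the read-modify-write, the assertion made by the pragma no longer holds.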
OUTER_SAFELEN_OVERWRITE
The OUTER_SAFELEN_OVERWRITE value applies to the L_iter matrix and is related to intercolumn dependencies. An estimate of this safelen() value is provided, along with an estimate of how the value grows with matrix size. However, the optimal value for this template argument depends on the clock target, the target device family, and the precision used.

Use this parameter to override the preset safelen() value.

If OUTER_SAFELEN_OVERWRITE=-1, the safelen() value is tuned for single-precision float data types on an Intel® Arria® 10 device with a default clock target. The tuning also assumes that the L_iter matrix is in component memory with proper vectorization provided.